Data Deluge: Researchers Turn to Cloud Computing as Genomic Sequencing Data Threatens to Overwhelm Traditional IT Systems

February 8, 2017
As the volume of sequencing data continues to skyrocket, large research organizations and individual researchers alike will continue to turn to cloud computing for secure handling of their genomic data. [Maxiphoto / Getty Images]

Chris Anderson, Editor in Chief

In early 2007, the cost of sequencing a single human genome was around $10 million, according to data compiled by the National Human Genome Research Institute, and the decreasing cost was roughly following the path predicted by Moore’s Law. This trend implies that the lowering cost of sequencing was directly related to increases in computing power.

But all that changed by late 2007, as a variety of new, high-throughput sequencing methods—referenced today under the catchall phrase “next-generation sequencing” (NGS)—began to replace the Sanger method. Less than four years later, the cost had plummeted to $10,000 amid wide-eyed chatter of the promise of the $1,000 genome. Now, we are nearly there. And not surprisingly, there is a new target—the $100 genome—which is a mere formality according to sequencing heavyweight Illumina, which launched NovaSeq in January, the tool it says will get us there.

Today, NGS has democratized genomic research. It is routinely used by pharmaceutical and academic researchers alike, with thousands of researchers worldwide plumbing the depths of the coding regions of DNA in ways barely imaginable 10 years ago. The availability of broad, detailed datasets from this work has also infiltrated the clinic, where doctors can now access genomic information to provide more precise care to their patients.

But as the costs of sequencing have plummeted, the volume of generated sequencing data has concomitantly exploded, presenting challenges in how to store and effectively analyze the growing mountain of genomic data.

According to Bryan Spielman, EVP of strategy and corporate development with biomedical data analysis company Seven Bridges, the pace of change has made even significant infrastructure investments in on-premises computing capacity inadequate. To provide a sense of the scale of data being generated, Spielman notes that the genomic data of 11,000 people currently housed in The Cancer Genome Atlas (NIH-funded) weighs in at more than 1.5 petabytes.

“I was speaking with someone at a top-five pharma company, and 1.5 petabytes is 50% of the storage capacity of their own, on-premises, high-performance computing cluster,” he says. 

In an era when a major undertaking in the U.K. promises to sequence 100,000 genomes and there are both public and private projects that aim to sequence 1 million genomes, it becomes clear that new thinking and strategies for how to manage and leverage the data are needed.
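To put those project sizes next to the figures Spielman cites, a rough back-of-envelope calculation helps. The sketch below is illustrative only: it assumes storage scales linearly with the number of people at the TCGA average (TCGA holds several data types per person, so this is not a per-genome figure), uses decimal units, and infers a 3-petabyte cluster from the "50%" remark. The function and constant names are hypothetical.

```python
# Figures quoted in the article: more than 1.5 PB for about 11,000 people in TCGA.
TCGA_PETABYTES = 1.5
TCGA_PEOPLE = 11_000
GB_PER_PB = 1_000_000  # decimal units (assumption)

# Inferred from the quote: 1.5 PB is 50% of the pharma company's cluster.
CLUSTER_PETABYTES = TCGA_PETABYTES / 0.5


def projected_petabytes(people: int) -> float:
    """Scale the TCGA per-person average linearly (a simplifying assumption)."""
    return TCGA_PETABYTES * people / TCGA_PEOPLE


per_person_gb = TCGA_PETABYTES * GB_PER_PB / TCGA_PEOPLE
print(f"Average per person in TCGA: about {per_person_gb:.0f} GB")

for cohort in (100_000, 1_000_000):
    needed = projected_petabytes(cohort)
    clusters = needed / CLUSTER_PETABYTES
    print(f"{cohort:,} people: about {needed:.1f} PB, "
          f"or {clusters:.1f} clusters of {CLUSTER_PETABYTES:.0f} PB")
```

Under these assumptions, a 100,000-person project would need several times the storage of the on-premises cluster described above, and a 1-million-person project would need tens of times as much. Real projects will differ depending on sequencing depth, file formats, and compression.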

Into the Cloud

As the cost of sequencing dropped and adoption grew, the move to cloud computing became almost a necessity for the most active sequencing operations. In testimony to the U.S. Congress in the summer of 2014, human genome pioneer J. Craig Venter cited two major developments that had allowed him to start his precision medicine company Human Longevity: the cost of sequencing passing an affordability threshold, and the ability to move the sequencing data the company generated to the cloud.

“We are going to rely very heavily on cloud computing, not only to house this massive database, but to be able to use it internationally,” Venter testified regarding the then-fledgling company. He went on to describe how even with a dedicated, fiberoptic network the data moved so slowly between his company in La Jolla, CA, and his non-profit genomic research entity the J. Craig Venter Institute in Rockville, MD, that they would routinely ship data on hard disks via FedEx between locations. “The use of the cloud is the entire future of this field,” he concluded.

Another significant factor speeding adoption of cloud computing comes when an organization’s on-premises capability can’t keep up with the speed and data demands of NGS, says David Shaywitz, M.D., Ph.D., CMO of cloud-based genome informatics and data management company DNAnexus. “People would say to me ‘we have an overwhelming amount of work to do and it shuts down our cluster when we try to do it.’ When they move to the cloud: what would be months of work for them before, they can do in the cloud in hours, so that’s obviously better,” Dr. Shaywitz says.

Further, because the hurdles to entry for NGS are now much lower, and don’t require a significant IT backbone, the lower sequencing costs combined with cloud computing have democratized genomic research. “You are putting the power of sequencing into single-researcher hands with things like [Illumina’s desktop sequencer] MiSeq,” says John Shon, VP, bioinformatics and data science at Illumina. “So even though some of the work has to happen on premises, you can have push-button analysis in the cloud.”

That’s a far cry from just a few years ago, notes Shon, whose background includes stints with Janssen (a division of Johnson & Johnson) and Roche. “There were a lot of homegrown tools back then, almost exclusively local storage, and not very much was standardized at all,” he says. “In the research setting: the data would be collected in one place, you’d have the molecular biology lab that did sample processing, you’d have a sequencing center, and the data would be sent to the bioinformatics groups. So it was not uncommon to have five or six different departments involved in that process.”

But the benefits of the cloud extend beyond more computing power and massive data storage, to providing an environment that fosters scientific collaboration on national and global scales. One example is PrecisionFDA, the FDA’s cloud-based collaborative portal, which provides tools for researchers, including reference genomes, and allows participating organizations to upload their own data and share tools and analytic methods for querying genomic data.

Launched in December 2015 as part of President Obama’s Precision Medicine Initiative, PrecisionFDA quickly grew to more than 1,500 researchers representing roughly 600 different companies and organizations. According to Taha Kass-Hout, M.D., FDA chief health information officer, roughly one-third of the participants in PrecisionFDA hail from outside the U.S. “It’s amazing to see how the global community is coming together, and they are contributing data, as well as software [to PrecisionFDA],” Dr. Kass-Hout notes in a 2016 online interview outlining the program.

“The community is working toward advancing the regulatory science behind assuring the accuracy of the next-gen software for the human genome. To do that, we want to provide an environment to share some of the innovations happening in this field, as well as any reference materials they might have,” Dr. Kass-Hout explains. “We also realized there are several members in the community that need the computation platform to help them do the heavy [data-]crunching. We consider it a social experiment behind advancing regulatory science behind NGS.”

“If you are looking for the opportunity to facilitate [collaboration] between distant facilities—because science is global and there is a need for global representation—there is hardly a better way to do it than the cloud,” Dr. Shaywitz concludes.

 
