After 25 years of breakthroughs and $14-billion in federal support, the revolution in genomics—and all its promises of miracle cures—is now firmly in the hands of the computer geeks.
On one level, the effort to sequence the human genome has already been a smashing success. The first genome was decoded in 2003, two years ahead of schedule. Subsequent technological gains now allow patients to have their genomes read for just a few thousand dollars apiece.
Actual medical breakthroughs, though, are still elusive. The problem, increasingly evident in the last couple of years, boils down to computer processing, storage, and transmission: With each person’s genome consisting of three billion separate data points, multiplied by billions of people and thousands of diseases and medical conditions, no workable system exists for analyzing all that information.
“There’s a huge challenge in how we grapple with the huge amount of DNA and genomic data that’s being produced,” said Mark B. Gerstein, a professor of biomedical informatics at Yale University.
And so, an elite group of some of the world’s top research universities and corporations, pulled together by the National Science Foundation, is now being put on the case.
The coalition, called CompGen, includes IBM, Intel, and Microsoft, as well as several academic leaders in computing and genomics, like the University of Illinois, Baylor College of Medicine, and Washington University in St. Louis.
Until now, most attempts to handle the crush of genomics data have centered on adaptations of off-the-shelf computer hardware and cloud storage models. New companies such as DNAnexus, Bina Technologies, and Illumina help researchers analyze genomic data and store it across shared computer systems. Apache’s Hadoop and related software are popular choices for cloud storage systems.
But such solutions just aren’t enough for the staggering amounts of data coming from genomics studies, and the CompGen participants will be studying whether an entirely new type of computer system—both hardware and software—should be designed and built just for the task.
It’s a bit of a gamble either way. Mr. Gerstein, who is not associated with CompGen, is among those who contend that, as a general rule, scientists fare better when computer technology advances broadly rather than being pushed in targeted areas. He cites high-energy physics and climate science as just two of the many fields beyond genomics that need enormous computing capacity—for storing, analyzing, and rapidly sharing vast amounts of data—and that would therefore benefit from grant agencies bolstering computer science generally.
Given that, said Mr. Gerstein, who serves as co-director of Yale’s program in computational biology and bioinformatics, it’s not immediately clear why genomics alone requires some kind of new computer or processing chip.
“The history of computing,” he said, “has been all about how generic solutions trump special-interest hardware.”
CompGen’s principal investigator, Steven S. Lumetta, an associate professor of electrical and computer engineering at the University of Illinois at Urbana-Champaign, said no definite approach had been settled on.
But examples like IBM’s Blue Gene project may be instructive, Mr. Lumetta said. Blue Gene was initially designed to explore the complexities of protein folding—the mysterious process by which proteins innately acquire their unique three-dimensional shapes. Blue Gene supercomputers have helped with that topic and much more, including nuclear-weapons design, long-term weather predictions, and even human-genome mapping.
“If we look historically at where big advances in computers have come from, it’s often the case they were driven by domain-specific applications,” Mr. Lumetta said.
Technological Challenges
The NSF grant totals just $1.8-million over four years. Contributions from the partner institutions bring the total project value to about $2.6-million, Mr. Lumetta said.
Intel’s liaison to CompGen, Nicholas P. Carter, came to the company in 2007 after working as an assistant professor of electrical and computer engineering at the University of Illinois. He envisions a world in which every major hospital has a gene-sequencing machine that’s connected to a supercomputer facility that can quickly receive a patient’s genomic information and return diagnoses. To get there, the supercomputers will probably need some kind of specialized hardware, Mr. Carter said.
One leading technological tool for CompGen is likely to be “die stacking,” the process of mounting multiple chips on top of one another, he said. That method of construction allows the individual transistors and other components inside chips to sit closer to each other, letting them operate faster, he said.
For many computing purposes, designers have been struggling to figure out how to build such three-dimensional chip stacks without the inner layers overheating. But the special needs of genomic research, with its heavy demand for the memory needed to assess three billion nucleotides per patient, may be especially conducive to die stacking, Mr. Carter said. That’s because memory chips generate less heat than processing chips, potentially allowing for a specialized die-stacked component in which layers of memory separate layers of processors, he said.
“That’s very important for genomics,” Mr. Carter said, “because of the amounts of data involved, and because the way the data is accessed is likely to be different than the normal computer applications.”
The software side also needs development customized to the needs of genomics. As a start, software engineers have written computer code that uses shorthand notations to store long strings of the four letters—A, C, T, and G—that represent the four nucleic-acid bases that make up DNA. A specially designed computer might use only two binary digits, known as bits, for those letters rather than the eight bits that computers typically use for letters and symbols. There’s also shorthand notation such as “A12,” where just three characters represent 12 consecutive instances of “A” in a genomic string.
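Neither Mr. Lumetta nor his colleagues have published a specific encoding, so the short Python sketch below is purely illustrative of the two ideas described above: packing each base into two bits, and collapsing repeats into “A12”-style run-length shorthand. The function names and the padding scheme are assumptions made for the example.

```python
# Illustrative sketch only; CompGen has not settled on or published an encoding.
# Two ideas from the article: (1) pack each base into 2 bits instead of the
# 8 bits of a standard text character, and (2) run-length shorthand like "A12".

BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_two_bit(sequence: str) -> bytes:
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    packed = bytearray()
    current, count = 0, 0
    for base in sequence:
        current = (current << 2) | BASE_TO_BITS[base]
        count += 1
        if count == 4:                      # a byte is full
            packed.append(current)
            current, count = 0, 0
    if count:                               # left-align and pad the final partial byte
        packed.append(current << (2 * (4 - count)))
    return bytes(packed)

def run_length(sequence: str) -> str:
    """Collapse repeats: 'AAAAAAAAAAAA' -> 'A12', as in the article's example."""
    out, i = [], 0
    while i < len(sequence):
        j = i
        while j < len(sequence) and sequence[j] == sequence[i]:
            j += 1
        run = j - i
        out.append(sequence[i] + (str(run) if run > 1 else ""))
        i = j
    return "".join(out)

# Rough arithmetic: three billion bases at 2 bits each come to about 750 megabytes,
# versus roughly 3 gigabytes at one 8-bit text character per base.
print(pack_two_bit("ACGTACGT").hex())   # '1b1b' -> two packed bytes
print(run_length("AAAAAAAAAAAAC"))      # 'A12C'
```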
Some types of approximations may save computer memory and processing time at the cost of reduced accuracy when reconverting the data, but that might be acceptable for certain uses, Mr. Lumetta said. For instance, a researcher may want just one example of a situation in which a particular gene is associated with a medical condition, so false negatives would pose little problem, he said. Other times, the researcher may want to know every instance in which an association occurs, in which case the relatively few false positives can simply be identified and discarded by hand later.
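One generic way to picture that trade-off, not anything CompGen has proposed, is a compact bit-array filter over short DNA fragments (k-mers): it uses far less memory than storing the fragments themselves and never misses a fragment that was recorded, but it occasionally reports a spurious hit, which an exact check can discard afterward. The class name, hash scheme, and sizes below are assumptions made for this sketch.

```python
# Hedged illustration, not CompGen's method: trade a few false positives for a
# much smaller memory footprint, then verify candidate hits exactly, mirroring
# the "discard false positives later" workflow Mr. Lumetta describes.

import hashlib

class KmerFilter:
    def __init__(self, num_bits: int = 1 << 20):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8)    # compact bit array

    def _positions(self, kmer: str):
        # Derive two bit positions from one digest (assumed hashing scheme).
        digest = hashlib.sha256(kmer.encode()).digest()
        yield int.from_bytes(digest[:8], "big") % self.num_bits
        yield int.from_bytes(digest[8:16], "big") % self.num_bits

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, kmer: str) -> bool:
        # True may be a false positive; False is always correct (no false negatives).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))

reference = "ACGTGGTCAACGTTAGCCA"
k = 5
filt = KmerFilter()
for i in range(len(reference) - k + 1):
    filt.add(reference[i:i + k])            # record every k-mer of the reference

query = "GGTCA"
if filt.might_contain(query):
    # Exact verification step weeds out any false positive.
    print("confirmed" if query in reference else "false positive, discarded")
```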
All those situations could be handled much more efficiently in a specially designed system of hardware and software, which could prove critical to researchers struggling to cull meaning out of billions upon billions of combinations of genes, people, diseases, eating and lifestyle habits, and more, Mr. Lumetta said.
Legal Issues
Computing prowess isn’t the only obstacle to a future in which doctors routinely treat patients with the help of genetic profiles. The director of the National Institutes of Health, Francis S. Collins, who led the government’s Human Genome Project during the 1990s, has said that laws surrounding patient rights need more clarity to permit populationwide comparisons of individual genetic maps.
The new genomic technologies “cannot fully realize their potential if the relevant policy, legal, and regulatory issues are not adequately addressed,” Dr. Collins and Margaret A. Hamburg, commissioner of the Food and Drug Administration, wrote this month in The New England Journal of Medicine.
Such considerations loom as an important element of computer-system designs, Mr. Gerstein said. “There’s this very complicated issue of how do you do a calculation where you can exploit the information in a big common database, but yet not really look in too much detail at an individual person in it,” he said.
Although not associated with CompGen, David Haussler, a professor of biomolecular engineering at the University of California at Santa Cruz, said he’s excited by the attempt. A custom-designed computer system may turn out to be necessary, though it carries risks, said Mr. Haussler, who is also director of his university’s Center for Biomolecular Science and Engineering. “One thing about custom computing hardware for a smaller market is that it is always hard for it to keep up with the relentless improvements in commodity hardware,” he said.
Dirk Pleiter, a professor of theoretical physics at the University of Regensburg, in Germany, who works with Blue Gene systems, said he agreed that any use-specific computer design carries the risk that its systems will quickly be outpaced by more general advances in supercomputing. But the caliber of the CompGen partnership, of which he is not a member, would seem to reduce that risk, Mr. Pleiter said.
Either way, it’s going to take time. Genomics researchers should not expect a brand-new special-purpose supercomputer from the NSF grant, Mr. Carter said. A more realistic four-year expectation for CompGen, he said, might be only the creation of large-scale prototypes for the chips that would go into such a supercomputer.