Harvard Stores 70 Billion Books Using DNA
Harvard researchers have been able to use sequencing technology to store 70 billion copies of a yet-unpublished book in DNA binary code.
The results of the project by researchers at Harvard University's Wyss Institute for Biologically Inspired Engineering were published last week in the peer-reviewed journal Science.
"The total world's information, which is 1.8 zettabytes, [could be stored] in about four grams of DNA," said Sriram Kosuri, a senior scientist at the Wyss Institute and senior author of the paper, in a video presentation.
To preserve the text of the book in DNA, the researchers created a binary code from DNA markers. The book, Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves, was written by research team member George Church.
"We ... wanted something that represents modern digital, so we used an HTML version of a book," Church said in a video presentation.
"The HTML form -- let's say the web form -- includes digital images [and a] JavaScript programming language that performs something interactively with a person. So we encoded that into zeros and ones into DNA," Church added.
Church, a professor of genetics at the Harvard Medical School, helped develop the first direct genomic sequencing method in 1984. That year, while working as a scientist at Biogen Inc., he was also a member of the team that initiated the Human Genome Project.
The Harvard researchers stored 5.5 petabits, or about 5.5 million gigabits, per cubic millimeter in the DNA storage medium. Because writing the data is a slow process, the researchers consider the DNA storage medium currently suitable only for archival purposes.
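As a rough cross-check of the density figures quoted in this article, the unit conversions can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal (SI) prefixes, and the variable names are illustrative rather than anything taken from the Harvard paper.

```python
# Back-of-the-envelope check of density figures quoted in the article.
# Assumption: decimal (SI) prefixes; all names here are illustrative.
GIGA, PETA, EXA = 10**9, 10**15, 10**18

# 1 mm = 10^6 nm, so 1 cubic millimeter = 10^18 cubic nanometers.
NM3_PER_MM3 = (10**6) ** 3

# Reported density: 5.5 petabits per cubic millimeter.
reported_bits_per_mm3 = 55 * PETA // 10

# Idealized bound from Church's description: one bit per base,
# one base per cubic nanometer.
ideal_bits_per_mm3 = 1 * NM3_PER_MM3

# Kosuri's estimate: 1.8 zettabytes in about four grams of DNA.
zettabytes_per_gram = 1.8 / 4

print("reported density in gigabits per mm^3:", reported_bits_per_mm3 // GIGA)
print("idealized density in exabits per mm^3:", ideal_bits_per_mm3 // EXA)
print("idealized / reported:", round(ideal_bits_per_mm3 / reported_bits_per_mm3))
print("zettabytes per gram:", zettabytes_per_gram)
```

Under these assumptions, the reported density works out to 5.5 million gigabits per cubic millimeter, roughly two orders of magnitude below the idealized bound of one bit per cubic nanometer.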
"The information density and scale compare favorably with other experimental storage methods from biology and physics," Kosuri said.
The team also included Yuan Gao, a former Wyss postdoctoral scholar and now an associate professor of biomedical engineering at Johns Hopkins University.
Scientists have long seen DNA as a potential storage medium because of its atomic size, stability and lifespan of thousands of years. The Harvard researchers were able to store about 1,000 times more data than previous attempts.
"Most non-DNA methods store on a plane, while DNA can be stored in the volume (beaker). The density is remarkably high - as little as one bit per base, one base per cubic nanometer. So we can store on the order of almost a zettabyte in a gram of DNA - a millimeter volume," Church added.
Last year, Keio University Institute for Advanced Biosciences and the Keio University Shonan Fujisawa Campus announced that researchers there used artificial DNA to carry more than 100 bits of data within the genome sequence.
The Japanese universities said they successfully encoded "E=mc2 1905!" -- Einstein's mass-energy equivalence equation and the year he published it -- in the common soil bacterium Bacillus subtilis.
The Harvard researchers used the four DNA nucleobases - adenine (A), cytosine (C), guanine (G) and thymine (T) - as binary markers. The A and C stand for the digit 0 and the T and G represent the digit 1, according to Kosuri.
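The mapping Kosuri describes can be illustrated with a minimal Python sketch. Only the bit-to-base assignment (A or C for 0, G or T for 1) comes from the article. The function names, the UTF-8 text conversion and the random choice between the two bases allowed for each bit are assumptions made for illustration, not the team's actual software.

```python
import random

# Mapping reported in the article: A or C encodes 0, G or T encodes 1.
BASES_FOR_BIT = {"0": "AC", "1": "GT"}
BIT_FOR_BASE = {"A": "0", "C": "0", "G": "1", "T": "1"}


def text_to_bits(text: str) -> str:
    """Return the UTF-8 bytes of text as a string of '0' and '1'."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_text(bits: str) -> str:
    """Inverse of text_to_bits."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


def encode_bits(bits: str, rng: random.Random) -> str:
    """Write one base per bit.

    Picking randomly between the two bases allowed for each bit is an
    assumption of this sketch, not a detail given in the article.
    """
    return "".join(rng.choice(BASES_FOR_BIT[bit]) for bit in bits)


def decode_bases(dna: str) -> str:
    """Read one bit per base using the article's mapping."""
    return "".join(BIT_FOR_BASE[base] for base in dna)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the demo is repeatable
    dna = encode_bits(text_to_bits("<html>"), rng)
    print(dna)
    print(bits_to_text(decode_bases(dna)))
```

Because two bases stand for each digit, many different DNA sequences decode to the same bits, which leaves the encoder free to choose between them.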
And where some experimental media -- like quantum holography -- require temperatures approaching absolute zero (-273 degrees Celsius) and tremendous energy, DNA is stable at room temperature, the researchers noted. "You can drop it wherever you want, in the desert or your backyard, and it will be there 400,000 years later," Church said.
Unlike earlier researchers, Church said his team was able to use commercial DNA microchips to create standalone DNA.
"We purposefully avoided living cells," Church said. "In an organism, your message is a tiny fraction of the whole cell, so there's a lot of wasted space. But more importantly, almost as soon as a DNA goes into a cell, if that DNA doesn't earn its keep, if it isn't evolutionarily advantageous, the cell will start mutating it, and eventually the cell will completely delete it."
In another departure from earlier research, the team rejected so-called "shotgun sequencing," which reassembles long DNA sequences by identifying overlaps in short strands.
Instead, the Harvard team took its cue from information technology and encoded the book in 96-bit data blocks, each with a 19-bit address to guide reassembly. Including JPEG images and HTML formatting, the code for the book required 54,898 of these data blocks, each a unique DNA sequence.
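The addressing scheme can be sketched in Python as follows. The 96-bit payload and 19-bit address sizes come from the article. Everything else is an assumption for illustration: the function names are hypothetical, the last block is padded with zeros, and the sketch does not handle missing or corrupted blocks.

```python
ADDRESS_BITS = 19  # address size given in the article
PAYLOAD_BITS = 96  # data block size given in the article


def to_blocks(bits: str) -> list[str]:
    """Split a bit string into blocks of 19 address bits + 96 data bits.

    Zero-padding the final block is an assumption of this sketch.
    """
    n_blocks = -(-len(bits) // PAYLOAD_BITS)  # ceiling division
    if n_blocks > 2 ** ADDRESS_BITS:
        raise ValueError("too much data for a 19-bit address space")
    padded = bits.ljust(n_blocks * PAYLOAD_BITS, "0")
    return [
        f"{i:0{ADDRESS_BITS}b}" + padded[i * PAYLOAD_BITS:(i + 1) * PAYLOAD_BITS]
        for i in range(n_blocks)
    ]


def from_blocks(blocks: list[str], n_bits: int) -> str:
    """Reassemble blocks read in any order by sorting on the address field."""
    ordered = sorted(blocks, key=lambda block: int(block[:ADDRESS_BITS], 2))
    payload = "".join(block[ADDRESS_BITS:] for block in ordered)
    return payload[:n_bits]  # drop the zero padding


if __name__ == "__main__":
    message = "".join(f"{i:010b}" for i in range(100))  # 1,000 sample bits
    blocks = to_blocks(message)
    scrambled = list(reversed(blocks))  # simulate out-of-order reads
    print(from_blocks(scrambled, len(message)) == message)
    # Size of the book's payload implied by the article's block count.
    print(54_898 * PAYLOAD_BITS, "bits")
```

A 19-bit address can distinguish 524,288 blocks, comfortably more than the 54,898 the book required, and because each block carries its own position, the blocks can be read back in any order.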
"We wanted to illustrate how the modern world is really full of zeroes and ones, not As through Zs alone," Kosuri said.
Lucas Mearian covers storage, disaster recovery and business continuity, financial services infrastructure and health care IT for Computerworld.