Description
Abstract
Towards creating a platform for information storage in artificial DNA, our team built Bio101, a user-friendly web-based tool for encoding and decoding information in DNA sequences. Through a five-step process of compression, encryption, bit-to-nt conversion, indexing and validation (Fig.1), Bio101 encodes conventional computer files into nucleotide sequences that are ready for chemical synthesis and, in reverse, decodes the sequencing results of a DNA sample back into files. In addition, we propose a solution for DNA-based file editing.
Fig.1. Five-step design of Bio101
This coding tool creates a convenient DNA information workflow, so researchers can choose any files they want and focus on synthesizing DNA. This surprisingly simple idea has the potential to reshape data storage worldwide in the not-too-distant future, and our work on the practical side can accelerate progress towards that goal.
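As a rough illustration of how the steps listed in the abstract could fit together, the Python sketch below compresses a file, splits it into indexed chunks, appends a checksum to each chunk for validation, and converts the bytes to nucleotides at two bits per base. It is not Bio101's actual implementation: the mapping 00→A, 01→C, 10→G, 11→T, the 2-byte index, the CRC-32 checksum, the chunk size and all function names are assumptions made for this example, and the encryption step is left out because the cipher is not specified here.

```python
import zlib
from typing import List

# Assumed 2-bit mapping; the mapping Bio101 really uses is not given in the text.
BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}


def bytes_to_nt(data: bytes) -> str:
    """Bit-to-nt conversion: every byte becomes four nucleotides."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))


def nt_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_nt."""
    bits = "".join(NT_TO_BITS[nt] for nt in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def encode(data: bytes, chunk_size: int = 20) -> List[str]:
    """Hypothetical encoder: compress, index, checksum, convert."""
    payload = zlib.compress(data)  # compression
    # Encryption would go here; it is omitted in this sketch.
    oligos = []
    for index, start in enumerate(range(0, len(payload), chunk_size)):
        chunk = payload[start:start + chunk_size]
        header = index.to_bytes(2, "big")  # indexing (at most 65536 chunks)
        crc = zlib.crc32(header + chunk).to_bytes(4, "big")  # validation data
        oligos.append(bytes_to_nt(header + chunk + crc))  # bit-to-nt conversion
    return oligos


def decode(oligos: List[str]) -> bytes:
    """Hypothetical decoder: validate each sequence, reorder, decompress."""
    chunks = {}
    for seq in oligos:
        raw = nt_to_bytes(seq)
        body, crc = raw[:-4], raw[-4:]
        if zlib.crc32(body).to_bytes(4, "big") != crc:
            raise ValueError("checksum mismatch in a sequence")
        chunks[int.from_bytes(body[:2], "big")] = body[2:]
    payload = b"".join(chunks[i] for i in sorted(chunks))
    return zlib.decompress(payload)


message = b"Hello, DNA!"
sequences = encode(message)
assert decode(sequences) == message
print(sequences)
```

Because each sequence carries its own index, `decode` can reassemble the file even when the sequences come back from sequencing in a different order, and a sequence whose checksum does not match is rejected. The sketch makes no attempt to avoid sequences that are hard to synthesize, such as those with high GC content.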
Background
We live in an era of information explosion. Digital production, transmission and storage have not only revolutionized the way information is accessed and used, but also made information archiving an increasingly complex task [1]. Have you ever been perplexed by the vast quantity of information? And have you ever imagined a practical, high-capacity, low-maintenance and even self-copying storage medium that would still be readable after thousands of years? Thanks to DNA storage technology, this is no longer just a dream.
DNA, the stable genetic material and one of the most remarkable masterpieces of Nature, holds great promise for high-density, long-term and massive information storage. For example, the human genome, at just 3 billion base pairs, encodes all of the complex biological information of a human being, including appearance, metabolism, growth, development, reproduction and many other delicate functions (Fig.2).
Fig.2. The genome controls the growth and development of a human being.
Why not use DNA to store information? Research indicates that it is extremely dense, with a raw storage density limit of 1 exabyte/mm³ (10⁹ GB/mm³) [2]. In other words, every gram of DNA is equivalent to 14 thousand 50 GB Blu-ray discs, or to 233 hard disks of 3 TB each, which together weigh more than 151 kg. Meanwhile, compared with the storage media in routine use today, DNA can last for centuries (Fig.3) [2]. Grass et al. developed a method for preserving digital information on DNA in silica, which presumably can last 1 million years [3]. In short, these virtues make DNA an ideal archival material for information storage.
Fig.3. DNA storage as the bottom level of the storage hierarchy [2].
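The equivalences quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures from the text and assumes decimal units (1 TB = 1000 GB, 1 EB = 10^18 bytes); the variable names are chosen for this example.

```python
GB_PER_TB = 1000          # assumption: decimal units
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18

# Capacity of the two piles of conventional media said to match one gram of DNA.
blu_ray_total_tb = 14_000 * 50 / GB_PER_TB   # 14 thousand discs of 50 GB each
hard_disk_total_tb = 233 * 3                 # 233 disks of 3 TB each

# Average mass per hard disk implied by "233 disks weigh more than 151 kg".
kg_per_disk = 151 / 233

# One exabyte expressed in gigabytes, as in the density limit of 1 EB per mm^3.
gb_per_exabyte = BYTES_PER_EB // BYTES_PER_GB

print(f"Blu-ray discs: {blu_ray_total_tb:.0f} TB")
print(f"Hard disks:    {hard_disk_total_tb} TB")
print(f"Mass per disk: {kg_per_disk:.2f} kg")
print(f"1 EB = {gb_per_exabyte:.0e} GB")
```

Both comparisons come to roughly 700 TB per gram of DNA, so the two figures in the text agree with each other.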
Biotechnology Availability
You may wonder how a piece of arbitrary information is stored in DNA molecules. On the physical reading side, it depends on DNA sequencing technology (Fig.4). In 1977, Frederick Sanger adopted a primer-extension strategy to develop "DNA sequencing with chain-terminating inhibitors" [4], a method that directly enabled the Human Genome Project (HGP). But the Sanger method is expensive and slow: the HGP cost 3 billion dollars over fifteen years [5]. Only with the arrival of next-generation sequencing (NGS), also called high-throughput sequencing, did large-scale sequencing become affordable. 454 pyrosequencing [6], Illumina (Solexa) sequencing [7] and SOLiD sequencing [8] are the three most widely applied methods. Meanwhile, third-generation nanopore DNA sequencing has appeared [9]. Owing to these developments, the sequencing cost per genome has decreased exponentially, from 1 million dollars in 2008 to just 1 thousand now [10]. Encouragingly, the cost is expected to keep dropping, and sequencing may become as simple as reading information from a hard disk or an optical disc.
Fig.4. The history of progress in DNA sequencing technology.
On the other hand, physically writing information into DNA molecules depends on the chemical synthesis of artificial DNA sequences. Currently, oligonucleotide synthesis is used for preparing primers, gene probes, etc. The process is implemented as solid-phase synthesis using the phosphoramidite method [11], with phosphoramidite building blocks derived from protected 2'-deoxynucleosides (dA, dC, dG, and T), ribonucleosides (A, C, G, and U), or chemically modified nucleosides (Fig.5). Unfortunately, we still cannot synthesize arbitrary DNA molecules: long DNA segments, or sequences with particular secondary structures or high GC content, remain hard or impossible to synthesize chemically. Although oligonucleotide synthesis was first reported as early as 1955 [12], the technique is still very expensive: synthesis costs 0.04 US cent per base (add reference here.), compared with just 1 cent for reading 1 million bases. It is therefore not surprising that, in a recent DNA storage experiment, writing one megabyte of information cost $12,400 while reading it cost only $220 [14].
Fig.5. Synthetic cycle for the preparation of oligonucleotides by the phosphoramidite method [13].
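To make the gap between writing and reading costs concrete, the short sketch below computes the two ratios implied by the figures quoted above. It uses only numbers from the text, and the variable names are chosen for this example.

```python
# Per-base prices quoted in the text, in US cents.
synthesis_cents_per_base = 0.04
reading_cents_per_base = 1 / 1_000_000   # 1 cent per million bases

# Per-megabyte costs reported for the DNA storage experiment [14], in US dollars.
write_usd_per_mb = 12_400
read_usd_per_mb = 220

per_base_ratio = synthesis_cents_per_base / reading_cents_per_base
per_mb_ratio = write_usd_per_mb / read_usd_per_mb

print(f"Writing vs reading, per base:     about {per_base_ratio:,.0f}x")
print(f"Writing vs reading, per megabyte: about {per_mb_ratio:.0f}x")
```

By the per-base prices, synthesis is about 40,000 times more expensive than sequencing; in the experiment, writing a megabyte cost about 56 times as much as reading it. Both comparisons point the same way: writing is the expensive step.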
However, we believe that with the rapid development of biotechnology, DNA synthesis will become cheap enough for information storage. What remains to be solved is how to encode an arbitrary computer file into DNA sequences and how to decode the sequencing results. A bridge is missing between today’s electronics-based information world and the future biotechnology-based one, and building it is what we set out to do.
References
- [1] Goldman N, Bertone P, Chen S, et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA[J]. Nature, 2013, 494(7435): 77-80.
- [2] Bornholt J, Lopez R, Carmean D M, et al. A DNA-based archival storage system[C]//Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2016: 637-649.
- [3] Grass R N, Heckel R, Puddu M, et al. Robust Chemical Preservation of Digital Information on DNA in Silica with Error‐Correcting Codes[J]. Angewandte Chemie International Edition, 2015, 54(8): 2552-2555.
- [4] Sanger F, Nicklen S, Coulson A R. DNA sequencing with chain-terminating inhibitors[J]. Proceedings of the National Academy of Sciences, 1977, 74(12): 5463-5467.
- [5] Lander E S, Linton L M, Birren B, et al. Initial sequencing and analysis of the human genome[J]. Nature, 2001, 409(6822): 860-921.
- [6] Margulies M, Egholm M, Altman W E, et al. Genome sequencing in microfabricated high-density picolitre reactors[J]. Nature, 2005, 437(7057): 376-380.
- [7] Bentley D R, Balasubramanian S, Swerdlow H P, et al. Accurate whole human genome sequencing using reversible terminator chemistry[J]. Nature, 2008, 456(7218): 53-59.
- [8] Valouev A, Ichikawa J, Tonthat T, et al. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning[J]. Genome Research, 2008, 18(7): 1051-1063.
- [9] Clarke J, Wu H C, Jayasinghe L, et al. Continuous base identification for single-molecule nanopore DNA sequencing[J]. Nature Nanotechnology, 2009, 4(4): 265-270.
- [10] Kedes L, Liu E T. The Archon Genomics X PRIZE for whole human genome sequencing[J]. Nature Genetics, 2010, 42(11): 917-918.
- [11] Reese C B. Oligo- and poly-nucleotides: 50 years of chemical synthesis[J]. Organic & Biomolecular Chemistry, 2005, 3(21): 3851-3868.
- [12] Michelson A M, Todd A R. Nucleotides part XXXII. Synthesis of a dithymidine dinucleotide containing a 3′: 5′-internucleotidic linkage[J]. Journal of the Chemical Society (Resumed), 1955: 2632-2638.
- [13] https://commons.wikimedia.org/wiki/File%3AOligocycle1.png
- [14] Goldman N, Bertone P, Chen S, et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA[J]. Nature, 2013, 494(7435): 77-80.