Description
Each of us creates, stores and transmits large amounts of data every day; text messages, emails and cloud storage are only a few examples. Since 2002, the amount of digitally stored data has exceeded the amount of data stored on analog media. By now, less than 6% of the world’s data is still analog [1].
It is not surprising that data breaches orchestrated by hackers are on the rise as well. Financial and legal records and military and government documents are examples of important information that must be preserved for a long time but could cause great damage in the wrong hands. Moreover, the Dutch National Cyber Security Center revealed that it is ordinary citizens who are most likely to be targeted by cybercrime. We have become a civilization dependent on information, and this information must be stored somewhere. As a result, we are faced with two problems: where do we store all of our data, and how do we keep it safe?
Besides offering little security, digital and analog data storage require large amounts of resources such as storage space and electricity. Data is stored in huge data centers, as can be seen in Figure 1. In 2015, 416.2 TWh were used for the storage of digital data [3], costing 41 billion USD. This is higher than the annual power consumption of the entire UK [2], and is responsible for approximately 2% of global greenhouse emissions, rivaling the airline industry [4]. In 2015, about 2,500,000 TB of new data were produced per day [1]. While the densest storage medium in use has a capacity of 10 GB/mm³, DNA can store data at a density of up to 10⁹ GB/mm³ [5]. It is furthermore expected that the demand for silicon, which is required for flash memory, will exceed silicon supply by 2040 [6].
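To put these densities into perspective, the following rough calculation estimates the physical volume needed to hold one day of new data on each medium. It is only a sketch: it assumes decimal (SI) terabytes and reads the DNA density as 10⁹ GB/mm³, and the constant and function names are illustrative.

```python
# Figures quoted in the text; the unit conversions are the only additions.
DAILY_DATA_TB = 2_500_000          # new data produced per day in 2015 [1]
CONVENTIONAL_GB_PER_MM3 = 10       # densest storage medium in use [5]
DNA_GB_PER_MM3 = 1e9               # DNA density, read here as 10^9 GB/mm^3 [5]

GB_PER_TB = 1_000                  # assumption: decimal (SI) terabytes
MM3_PER_M3 = 1_000 ** 3            # 1 m^3 = 10^9 mm^3


def volume_mm3(data_tb: float, density_gb_per_mm3: float) -> float:
    """Volume in mm^3 needed to hold `data_tb` terabytes at the given density."""
    return data_tb * GB_PER_TB / density_gb_per_mm3


conventional_mm3 = volume_mm3(DAILY_DATA_TB, CONVENTIONAL_GB_PER_MM3)
dna_mm3 = volume_mm3(DAILY_DATA_TB, DNA_GB_PER_MM3)

print(f"Conventional medium: {conventional_mm3 / MM3_PER_M3:.2f} m^3 per day")
print(f"DNA:                 {dna_mm3:.1f} mm^3 per day")
```

Under these assumptions, one day of new data fills about a quarter of a cubic metre of the densest conventional medium, but only a few cubic millimetres of DNA.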
The iGEM team Groningen 2016 thinks it is about time to develop a safe and novel data transmission system. As scientists in particular, we urgently need one. Because our team consists of biologists as well as computer scientists, we developed a multidisciplinary approach: secure encryption of digital data, conversion into a DNA sequence, and integration into a bacterial genome. This approach provides a system with multiple digital and biological safety layers. DNA is a practically inexhaustible resource, and in a spore it is safe from environmental influences.
Storage of data in DNA was proposed as early as the 1960s, but has only recently become a hot topic [7]. This is in part due to the ever-growing demand for data storage, as well as advancements in DNA synthesis and sequencing technologies. Our goal is to create a system for long-term data storage and data transfer which cannot be hacked by digital means. Digital methods of encrypting information and converting it into binary code are well established, and data storage in DNA has already been demonstrated. Our project combines these two approaches by first converting information into binary code, encrypting it, and then storing it safely in DNA. Additional measures based on molecular biology will prevent unauthorized access, ensuring the safety of the stored information.
Our system will be useful for the kind of information that should be stored and transferred in a very secure manner, but does not have to be accessed within seconds. It will be possible to obtain the message in about 24-48 hours; however, this timeframe is likely to be reduced as new sequencing technologies are developed.
DNA is a far more stable data storage medium than magnetic and optical media, remaining intact for at least 700,000 years at -4 °C [8]. Even in harsh environments, DNA has a half-life of over 500 years [9]. In contrast, current storage technology is rated to last only up to 30 years [10]. Given the stability and compactness of DNA, our system could also be adapted to serve as a time capsule for human knowledge. Additionally, DNA storage will become a cheaper alternative for data storage as DNA synthesis and sequencing costs drop. It is estimated to become a cost-effective method for long-term data storage within approximately ten years [11].
DNA data storage is an apocalypse-proof technology because DNA will be relevant to future civilizations. As long as intelligent DNA-based life exists, there will be compelling reasons to study and manipulate DNA.
Our system is especially safe because it cannot be hacked by digital means, and retrieving the stored data requires specific knowledge held by the recipient.
Because our project is built from BioBricks, it is easy to adapt to individual needs. Message and key are easy to implement and exchange, and biological safety layers can be customized. In keeping with the iGEM values, we worked with BioBricks that we designed ourselves as well as with BioBricks of previous teams. Read more about how we worked on the improvement of the nuclease BBa_K729004 and how we characterized the B. subtilis integration vector BBa_K823023.
History of data storage
The concept of storing data in DNA molecules was first published in the 1960s by the Soviet physicist Mikhail Samoilovich Neiman [12]. He came up with the idea that digital data can be stored in the base sequence of DNA. Because DNA consists of four different nucleotides, each position can carry two bits rather than one, so the information density can be up to two times higher than in our familiar binary storage systems. Since the transition from analog to digital storage devices, the storage half-life of our information has dropped considerably. Moreover, current efforts to guarantee the longevity of digital data storage are scarce [13]. Optical and magnetic storage devices are not reliable for long-term data storage. When DNA is encapsulated within silica spheres and stored at -18 °C, it is possible to recover data more than 1 million years later. Even without a freezer, data stored in central Europe will be safe for up to 2,000 years [14]. Storing information in DNA can therefore be done at higher density, and the information can be kept much longer.
The first message was actually stored in DNA only in 1988, when Davis managed to do so [15]. In 2010, scientists were able to encode 7920 bits in synthetic DNA [16]. Up to this milestone, it was difficult to write and read long, perfect DNA sequences. In 2012, Church et al. developed a new strategy to encode arbitrary digital information with an encoding scheme that uses better DNA synthesis and sequencing technologies. In this research they were able to encode and decode a message containing a little over 50,000 words and a few images [17]. To learn more about data storage, we visited the archives of the city of Groningen (figure 3).
Encryption
The original message is encrypted into a new message (the ciphertext) by using the Rijndael algorithm [18], which was developed in 1998 by two Belgian cryptographers, Joan Daemen and Vincent Rijmen. In November 2001 this algorithm was selected by the U.S. National Institute of Standards and Technology (NIST) as the new Advanced Encryption Standard (AES) [19]. Since then, it has been adopted by the U.S. government to secure highly classified data and has been used worldwide. The process of encryption is represented in figure 4.
After encryption, the message is converted into a binary message using the American Standard Code for Information Interchange (ASCII). ASCII encodes characters as integers, which can be represented as sets of binary digits.
The encrypted binary message is translated into a sequence of the nucleotides A, C, T and G by using the following translation scheme: the binary pair 00 is represented as A, 10 as T, 01 as C and 11 as G (see table 1). Subsequently, the obtained string of nucleotides is integrated into the DNA of Bacillus subtilis, which serves as the carrier organism for our secret message. For example, table 2 shows the translation of the plaintext “Hello world” into a sequence of nucleotides (encryption is not applied in this example).
The same strategy is applied to the encryption key, which is integrated into the DNA of a separate Bacillus subtilis strain in the same manner. In order to retrieve the original message, the message needs to be decrypted by using the same key that was used for encryption.
Binary pair | Nucleotide |
---|---|
00 | A |
10 | T |
01 | C |
11 | G |

Letter | H | e | l | l | o | (space) | w | o | r | l | d |
---|---|---|---|---|---|---|---|---|---|---|---|
ASCII | 072 | 101 | 108 | 108 | 111 | 032 | 119 | 111 | 114 | 108 | 100 |
Binary | 0100 1000 | 0110 0101 | 0110 1100 | 0110 1100 | 0110 1111 | 0010 0000 | 0111 0111 | 0110 1111 | 0111 0010 | 0110 1100 | 0110 0100 |
DNA | ATAC | CCTC | AGTC | AGTC | GGTC | AATA | GCGC | GGTC | TAGC | AGTC | ACTC |
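The translation step can be sketched in a few lines of Python. This is a minimal illustration rather than the team's actual software: the function names are hypothetical, and the order of the bit pairs within each byte is an assumption. The "Hello world" example appears to list the four pairs of each byte from right to left (0100 1000 becomes ATAC), so the sketch reproduces that ordering.

```python
# Translation scheme from table 1.
PAIR_TO_BASE = {"00": "A", "10": "T", "01": "C", "11": "G"}
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}


def text_to_dna(text: str) -> str:
    """Encode ASCII text as nucleotides, four bases per character."""
    bases = []
    for byte in text.encode("ascii"):
        bits = format(byte, "08b")                       # e.g. 72 -> "01001000"
        pairs = [bits[i:i + 2] for i in range(0, 8, 2)]
        # Assumption: rightmost pair first, the order the "Hello world" example shows.
        bases.extend(PAIR_TO_BASE[pair] for pair in reversed(pairs))
    return "".join(bases)


def dna_to_text(dna: str) -> str:
    """Invert text_to_dna: turn each group of four bases back into one character."""
    if len(dna) % 4 != 0:
        raise ValueError("expected four nucleotides per character")
    chars = []
    for i in range(0, len(dna), 4):
        pairs = [BASE_TO_PAIR[base] for base in dna[i:i + 4]]
        bits = "".join(reversed(pairs))                  # undo the pair ordering
        chars.append(chr(int(bits, 2)))
    return "".join(chars)


encoded = text_to_dna("Hello world")
print(encoded)
print(dna_to_text(encoded))
```

The sketch handles plain ASCII text only, as in the unencrypted example. Encrypted output generally consists of arbitrary bytes, so a full pipeline would probably apply the same pair mapping to raw bytes instead of characters.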
DNA Synthesis
DNA synthesis is the natural or artificial creation of deoxyribonucleic acid (DNA) molecules. In the cell, each of the two strands of the DNA molecule acts as a template for the synthesis of a complementary strand. The polymerase chain reaction (PCR) uses a similar principle to synthesize DNA in vitro. Furthermore, with advances in science, it is now possible to artificially synthesize novel DNA sequences [20].
DNA replication – the natural DNA biosynthesis (in vivo DNA amplification)
DNA and DNA replication mechanisms appeared late in the early history of life; DNA traces its origin to RNA/protein [21]. For the scope of our project, it is sufficient to note that the natural process of DNA replication proceeds in three enzymatically catalyzed and coordinated steps: initiation, elongation and termination.
The first step in DNA replication involves the unzipping of the double helix structure of the DNA. This is carried out by an enzyme called helicase, which breaks the hydrogen bonds between the complementary base pairs of DNA (A with T, C with G). This leads to the separation of two single strands of DNA, creating a ‘Y’ shaped replication fork. The two separated strands then serve as templates for making new strands of DNA.
One of the strands is oriented in the 3’ to 5’ direction (towards the replication fork), whereas the other strand is oriented in the 5’ to 3’ direction (away from the replication fork). Due to this difference in their orientation, the two strands replicate differently.
- A short piece of RNA sequence called primer (produced by an enzyme called primase) binds to the end of the leading strand (3’ to 5’). The primer acts as the starting point for DNA synthesis. Thereafter DNA polymerase binds to the leading strand and starts adding new complementary nucleotides to the template DNA in the 5’ to 3’ direction.
- The replication process for the lagging strand (5’ to 3’) involves multiple RNA primers binding at various points along the template DNA. This leads to the formation of short chunks of DNA in the 5’ to 3’ direction, called Okazaki fragments.
Once the base pairs are formed (A with T, C with G), another enzyme called exonuclease removes the primers from the DNA strand. The gaps are filled by complementary nucleotides and the new strand of DNA is proofread by DNA polymerase. Finally, an enzyme called DNA ligase seals the fragments, yielding two continuous double strands, after which the new DNA automatically winds up into a double helix.
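The base-pairing rule that drives replication (A with T, C with G) is easy to express in code. The following sketch, with hypothetical function names, derives the new strand from a template and also shows the reverse complement, the usual way of writing the opposite strand in the 5’ to 3’ direction.

```python
# Watson-Crick pairing as stated in the text: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(template_3_to_5: str) -> str:
    """New strand, written 5' to 3', for a template written 3' to 5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def reverse_complement(strand_5_to_3: str) -> str:
    """Opposite strand of a 5'-to-3' sequence, itself written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand_5_to_3))


template = "ATAC"
print(complementary_strand(template))
print(reverse_complement(template))
```

Applying `reverse_complement` twice returns the original sequence, which reflects the fact that either strand of the double helix is enough to reconstruct the other.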
Polymerase chain reaction - enzymatic DNA synthesis (in vitro DNA amplification)
Kary Mullis invented the PCR technique (Figure 5) in 1985 [22]. His work revolutionized the process of making millions of copies of a scarce sample of DNA, and earned him the Nobel Prize in Chemistry in 1993. The procedure follows the basic principle of DNA replication in vivo. A small amount of the DNA containing the gene of interest is aliquoted into a tube containing nucleotides, primers (a pair of synthesized short DNA segments that match segments on each side of the desired gene), DNA polymerase and a buffer that allows optimal activity of the polymerase. The tube containing the mix is then subjected to cycles of repeated heating and cooling, which leads to amplification of the gene of interest.
- The first step in a regular cycling event involves heating the mix to 94–98 °C for 20–30 seconds. It disrupts the hydrogen bonds between the complementary base pairs, yielding single strands of DNA.
- This is followed by lowering of temperature to 50–65 °C for 20–40 seconds, allowing the primers to form hydrogen bonds and bind to the single-stranded DNA template. The polymerase binds to the primer-template hybrid and begins DNA formation.
- Following the binding of primers, the temperature is increased to about 72 °C so that the DNA polymerase synthesizes a new DNA strand complementary to the template strand, adding matching dNTPs in the 5' to 3' direction.
The processes of denaturation, annealing and elongation together constitute one cycle. Multiple cycles are required to amplify DNA.
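The reason a few dozen cycles are enough to obtain millions of copies is that, in the ideal case, every cycle doubles the number of template molecules. The sketch below illustrates this growth. The `efficiency` parameter is an assumption added for illustration (1.0 means perfect doubling) and is not a value taken from our protocol.

```python
def copies_after(cycles: int, start_copies: int = 1, efficiency: float = 1.0) -> float:
    """Copies present after `cycles` PCR cycles.

    efficiency = 1.0 models ideal doubling in every cycle; lower values model
    cycles in which only a fraction of the templates is copied.
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return start_copies * (1.0 + efficiency) ** cycles


for n in (10, 20, 30):
    print(f"{n} cycles: {copies_after(n):,.0f} copies from a single template")
```

With ideal doubling, 20 cycles already turn a single template into about a million copies. Real reactions fall short of this because reagents are consumed and the polymerase loses activity over time.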
Gene synthesis - physically creating artificial gene sequences (de novo DNA synthesis)
For over 60 years, the synthetic production of new DNA sequences has helped researchers understand and engineer biology. Gene synthesis is also accelerating research in well-established research fields by providing critical advantages over more laborious traditional molecular cloning techniques. De novo DNA synthesis involves the chemical synthesis of relatively short but specific fragments of nucleic acids. Chemical oligonucleotide synthesis does not have the limitation of unidirectional nucleotide addition (5’ to 3’), as compared to the naturally occurring DNA synthesis and PCR. To obtain the desired oligonucleotide, loose nucleotides are sequentially added to the growing oligonucleotide chain in the required sequence. Typically, synthetic oligonucleotides are single-stranded DNA or RNA molecules.
The synthesis starts with a non-nucleosidic linker being attached to a solid support material. The oligonucleotide remains covalently bound to the support material via its 3'-terminal hydroxy group over the entire course of the chain assembly. The chain assembly is continued until completion, after which the oligonucleotides are released by hydrolytic cleavage of the P-O bond that attaches the 3’-O of the 3’-terminal nucleotide residue to the universal linker.
Synthetic genes offer several advantages over cloned native DNA. The companies that synthesize the DNA subject the sequences to stringent quality checks, verifying that they match the ordered sequence 100%. Moreover, artificial DNA synthesis gives researchers the flexibility to change enzyme specificities and activities to suit the needs of their experiments. The synthesis of specific sequences also allows the insertion of localization signals to target specific proteins or nucleic acids in vivo.
Decoding – costs and fidelity
The cost of sequencing DNA has plummeted in the last two decades. For instance, the first sequencing of the entire human genome, completed in the early 2000s, took 13 years and 3 billion US dollars. With current technologies, we have approached the $1,000 mark. This development can be seen in figure 6. In fact, since 2007 you can have your whole DNA sequenced for less than that! However, there are still issues with the fidelity and accuracy of the readings obtained that would prevent them from being used for our system [27].
Existing laboratory-level DNA sequencing technologies typically allow for a reading error of ~1%. Thanks to optimization and fine-tuning, traditional Sanger sequencing now offers accuracies of up to 99.999% over a read length of 1,000 bp. With modern second-generation (cyclic) sequencing, higher reading lengths have been achieved, but with a decrease in accuracy [26].
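To see what these accuracy figures mean for a stored message, the following sketch estimates the expected number of miscalled bases in a read and the chance that a read is entirely error-free. It assumes that errors occur independently at every position, which is a simplification, and the function names are illustrative.

```python
def expected_errors(read_length_bp: int, accuracy: float) -> float:
    """Expected number of miscalled bases in a read of the given length."""
    return read_length_bp * (1.0 - accuracy)


def error_free_probability(read_length_bp: int, accuracy: float) -> float:
    """Chance that every base is called correctly (independent errors assumed)."""
    return accuracy ** read_length_bp


for label, accuracy in (("~1% reading error", 0.99), ("99.999% accuracy", 0.99999)):
    errors = expected_errors(1_000, accuracy)
    clean = error_free_probability(1_000, accuracy)
    print(f"{label}: {errors:.2f} expected errors, {clean:.2%} error-free reads")
```

At a 1% reading error, a 1,000 bp read contains about ten wrong bases on average and is almost never error-free, whereas at 99.999% accuracy roughly 99% of such reads are perfect. This is why a single raw read is not enough for reliable decoding, and several reads of the same sequence have to be combined.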
Among the bioencryption layers, CryptoGERM prevents unauthorized parties from reading the message by having a high ratio of decoy spores that contain a useless sequence. In addition, if the right growing conditions are not supplied, our system prevents germination and replication of the spores that do contain the message. So what ratio of decoy spores to message spores is needed to prevent a third party from retrieving the message by brute-force sequencing? (see figure 7)
Fox et al. (2014) report that there is a 50% chance of accurately distinguishing a true subclonal variant from a sequencing artifact in an excess of 100 wild-type DNA sequences using standard Q30-filtered reads (error rate: 10⁻²) [23]. In agreement with that result, in an experiment carried out to detect genomic variations in marine pests, Pochon (2013) was able to detect one variant out of 150 wild-type sequences [24].
Reading DNA has become increasingly accurate. Schmitt et al. (2012) developed a method called Duplex Sequencing that uses both strands of DNA to obtain a more precise consensus sequence, yielding a theoretical error rate of 3×10⁻¹⁰ [25]. At two bits per nucleotide, that corresponds to roughly one expected base error per gigabyte of message, so messages of hundreds of megabytes could be transmitted without expecting any loss! On the other hand, such methods also allow more sensitive detection in spore-decoy mixtures, so a 1:150 spore:decoy ratio might be insufficient in the future. In fact, using Duplex Sequencing they were able to identify one mutant sequence per 10,000 wild-type molecules.
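The detection limits quoted above can be turned into a simple rule of thumb for choosing the decoy ratio: the message stays hidden from a given sequencing method only if there are more decoys per message spore than the background that method can resolve. The sketch below is a deliberately crude threshold model with hypothetical names. It ignores sequencing depth and the germination layer, so it should be read as an illustration rather than a safety guarantee.

```python
# Background sizes at which a single variant was still detected, as quoted in the text.
DETECTABLE_BACKGROUND = {
    "standard NGS [24]": 150,
    "Duplex Sequencing [25]": 10_000,
}


def message_fraction(decoys_per_message_spore: int) -> float:
    """Fraction of all spores that carry the real message."""
    return 1.0 / (1.0 + decoys_per_message_spore)


def is_hidden(decoys_per_message_spore: int, detectable_background: int) -> bool:
    """True if the decoy excess is larger than the method has been shown to resolve."""
    return decoys_per_message_spore > detectable_background


for decoys in (150, 1_000, 10_000, 100_000):
    verdicts = ", ".join(
        f"{method}: {'hidden' if is_hidden(decoys, limit) else 'detectable'}"
        for method, limit in DETECTABLE_BACKGROUND.items()
    )
    print(f"1:{decoys} (message fraction {message_fraction(decoys):.1e}) -> {verdicts}")
```

In this simple model a 1:150 ratio is already at the limit of standard sequencing, and a ratio well above 1:10,000 is needed before Duplex Sequencing no longer picks out the message.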
Integration in B. subtilis
The chassis B. subtilis 168
The key and message sequences are stored in Bacillus subtilis (B. subtilis), a gram-positive, catalase-positive bacterium usually found in soil and gastrointestinal tracts. This rod-shaped bacterium is about 4-10 µm in length and has a diameter of about 0.25–1.0 µm [28]. The cell is heavily flagellated, allowing the microbe to move quickly in liquid medium. It is one of the best-studied gram-positive microbes, and a widely adopted model organism for studying bacterial cell differentiation and sporulation. These cells are amenable to a wide array of genetic manipulations, and the ease of transformation has allowed B. subtilis to be used for selective protein expression to suit our requirements. Moreover, the integration of DNA sequences into its genome is well established and offers an alternative to using plasmids. The process of integration is demonstrated in figure 8.
Storing and sending
Spores
B. subtilis forms endospores when environmental factors do not favour survival or reproduction [29]. These endospores are highly resistant and durable structures. They have a central cytoplasmic core where the DNA and ribosomes are protected by an impermeable and rigid coat. Spores, when released into the environment, can survive extreme heat and freezing, lack of water, high pressure, exposure to many toxic chemicals and certain types of radiation. Compared to vegetative cells, endospores are enclosed in a thicker cell wall along with additional layers that allow them to last for long periods in their dehydrated, metabolically inactive state.
Sporulation
The process of endospore formation within a B. subtilis cell can take several hours to complete and is called sporulation or sporogenesis. Sporulation can be induced manually in the lab by limiting the availability of a key nutrient, such as the carbon or nitrogen source.
Sporulation begins when a small portion of cytoplasm, along with a newly replicated bacterial chromosome, is isolated by an ingrowth of the plasma membrane called the spore septum. The spore gets a double-layered membrane that surrounds the chromosome and cytoplasm. This structure is entirely enclosed within the original cell, and is called the forespore.
Peptidoglycan layers are laid down between the two membrane layers. A thick spore coat is formed around the outer membrane, which is responsible for the resistance of endospores. The last stage of sporulation is the degradation of the original cell and the release of the endospore [30].
Germination
An endospore returns to its vegetative state by a process called germination. This process is triggered by physical or chemical damage to the endospore coat. The enzymes of the endospore then break down the extra layers surrounding it. Water enters and metabolism resumes [30].
Natural Competence
The natural competence of B. subtilis is one of its less used advantages. Competent B. subtilis cells can actively pull DNA fragments from their environment. The DNA taken up can change the genotype through homologous recombination, a process known as natural transformation. To take up DNA from the medium, the cells synthesize a specific DNA-binding and uptake system, as seen in figure 9. The figure shows some of the proteins that form the translocation complex (A, NucA; C, ComC; E, ComE; F, ComF; G, ComG; CW, cell wall; CM, cell membrane; CYT, cytoplasm). This system has no specificity for particular DNA, so B. subtilis can take up and integrate plasmid DNA, phage DNA or chromosomal DNA [31].
Storing of spores
Bacterial spores are tough, non-reproductive structures produced by bacteria. They are highly resistant to aging, radiation, heat and chemical damage. Bacillus endospores have been reported to remain viable after millions of years [32]. These properties make them the ideal storage medium for data in DNA.
During freeze-drying, water is removed from a substance to increase its storage life and to make shipping easier. The most effective method for long-term storage, and thus for shipping, of Bacillus subtilis spores appears to be freeze-drying [33]. Freeze-drying is commonly used for long-term storage of bacteria [34] and of B. subtilis spores. Besides being highly resistant to a range of harsh conditions, spores of B. subtilis were shown by Fairhead et al. to be very resistant to several cycles of freeze-drying [35]. The sending process is demonstrated in Figure 10.
References:
- [1] “Shift from analog to digital is nearly complete - Technology & science - Innovation | NBC News.” [Online].
- [2] “Global warming: Data centres to consume three times as much energy in next decade, experts warn | The Independent.” [Online].
- [4] GeSI SMARTer 2020: The Role of ICT in Driving a Sustainable Future. http://gesi.org/assets/js/lib/tinymce/jscripts/tiny_mce/plugins/ajaxfilemanager/uploaded/SMARTer2020-report.pdf
- [5] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, G. Seelig, and K. Strauss, “A DNA-Based Archival Storage System.”
- [6] V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church, and W. L. Hughes, “Nucleic acid memory,” Nat. Mater., vol. 15, no. 4, pp. 366–70, Apr. 2016.
- [7] Some fundamental issues of microminiaturization. Radiotekhnika, 1964, No. 1, pp. 3-12.
- [8] Ancient DNA: Towards a million-year-old genome. Nature 499, 34–35 (04 July 2013). DOI: 10.1038/nature12263
- [9] The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proc Biol Sci. 2012 Dec 7;279(1748):4724-33. DOI: 10.1098/rspb.2012.1745
- [10] A DNA-Based Archival Storage System. ASPLOS 2016. DOI: http://dx.doi.org/10.1145/2872362.2
- [11] Synthetic double-helix faithfully stores Shakespeare's sonnets. Nature. DOI: 10.1038/nature.2013.12279
- [12] “English - М.С. Нейман.” [Online. Accessed: 22-Sep-2016].
- [13] M. Hilbert and P. López, “The world’s technological capacity to store, communicate, and compute information,” Science, vol. 332, no. 6025, pp. 60–65, 2011.
- [14] R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, and W. J. Stark, “Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes,” Angew. Chem. Int. Ed., vol. 54, no. 8, pp. 2552–2555, Feb. 2015.
- [15] J. Davis, “Microvenus,” Art J., vol. 55, no. 1, pp. 70–74, 1996.
- [16] D. G. Gibson, J. I. Glass, C. Lartigue, V. N. Noskov, R.-Y. Chuang, M. A. Algire, G. A. Benders, M. G. Montague, L. Ma, M. M. Moodie, C. Merryman, S. Vashee, R. Krishnakumar, N. Assad-Garcia, C. Andrews-Pfannkoch, E. A. Denisova, L. Young, Z.-Q. Qi, T. H. Segall-Shapiro, C. H. Calvey, P. P. Parmar, C. A. Hutchison, H. O. Smith, and J. C. Venter, “Creation of a Bacterial Cell Controlled by a Chemically Synthesized Genome,” Science, vol. 329, no. 5987, pp. 52–56, Jul. 2010.
- [17] G. M. Church, Y. Gao, and S. Kosuri, “Next-Generation Digital Information Storage in DNA,” Science, vol. 337, no. 6102, pp. 1628–1628, Sep. 2012.
- [18] J. Daemen, V. Rijmen. "AES Proposal: Rijndael". National Institute of Standards and Technology. p. 1. March 9, 2003.
- [19] "Announcing the ADVANCED ENCRYPTION STANDARD (AES)". Federal Information Processing Standards Publication 197. United States National Institute of Standards and Technology (NIST). November 26, 2001.
- [20] Kosuri, S. and Church, G. 2014. Large-scale de novo DNA synthesis: technologies and applications. Nature Methods. 11, 5 (2014), 499–507.
- [21] Andras, P. and Andras, C. 2005. The origins of life – the “protein interaction world” hypothesis: protein interactions were the first form of self-reproducing life and nucleic acids evolved later as memory molecules. Medical Hypotheses. 64, 4 (2005), 678–688.
- [22] Mullis KB et al. "Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction." Cold Spring Harbor Symp. Quant. Biol. vol. 51 pp. 263–73 (1986)
- [23] Fox EJ, Reid-Bayliss KS, Emond MJ, Loeb LA (2014) Accuracy of Next Generation Sequencing Platforms. Next Generat Sequenc & Applic 1: 106. doi:10.4172/jngsa.1000106
- [24] Pochon (2013). Evaluating detection limits of next-generation sequencing for the surveillance and monitoring of international marine pests. PLoS One 8(9):e73935
- [25] Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, et al. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A 109: 14508-14513.
- [26] Wang, X., Blades, N., Ding, J., Sultana, R. and Parmigiani, G. (2012). Estimation of sequencing error rates in short reads. BMC Bioinformatics, 13:185
- [27] Zhu, X., Wang, J., Peng, B. and Shete, S. (2016). Empirical estimation of sequencing error rates using smoothing splines. BMC Bioinformatics, 17:177
- [28] Yu, Allen Chi-Shing; Loo, Jacky Foo Chuen; Yu, Samuel; Kong, Siu Kai; Chan, Ting-Fung (2013). "Monitoring bacterial growth using tunable resistive pulse sensing with a pore-based technique". Applied Microbiology and Biotechnology. 98 (2): 855–862.
- [29] Nicholson WL, Munakata N, Horneck G, Melosh HJ, Setlow P (2000). "Resistance of Bacillus endospores to extreme terrestrial and extraterrestrial environments". Microbiology and Molecular Biology Reviews. 64 (3): 548–72.
- [30] Gerard J. Tortora. Microbiology: An Introduction, Global Edition, 12th Revised Edition, July 2015, 9781292099149, Pearson Education Limited
- [31] Leendert W. Hamoen, Gerard Venema and Oscar P. Kuipers (2003). Controlling competence in Bacillus subtilis: shared use of regulators. Microbiology (2003), 149, 9–17
- [32] Vreeland, R.H., Rosenzweig, W.D. & Powers, D.W., 2000. Isolation of a 250 million-year-old halotolerant bacterium from a primary salt crystal. Nature, 407(6806), pp.897–900. [Online. Accessed September 26, 2016].
- [33] Lacey, L.A., 2012. Manual of Techniques in Invertebrate Pathology., Elsevier Science.
- [34] Miyamoto-Shinohara, Y. et al., 2008. Survival of freeze-dried bacteria. The Journal of general and applied microbiology, 54(1), pp.9–24. Available at: http://www.ncbi.nlm.nih.gov/pubmed/18323678 [Accessed September 26, 2016].
- [35] Fairhead, H. et al., 1994. Small, acid-soluble proteins bound to DNA protect Bacillus subtilis spores from being killed by freeze-drying. Applied and environmental microbiology, 60(7), pp.2647–9. [Online. Accessed September 26, 2016].