Feature
Why do we choose DNA?
DNA, as an epochal information storage medium, has many remarkable features: high density, massive capacity, high stability, easy access and freedom from maintenance.
High-density and massive: DNA information storage will be a landmark among future-oriented storage technologies. DNA is an extremely dense, high-capacity storage medium: at the theoretical maximum, it can encode two bits per nucleotide (nt), or 455 exabytes per gram of ssDNA [1]. At present, Bio101 can convert files of up to 200 MB at a time, a limit set by the current length of the indexes.
Fig.1. The history of the data storage.
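The 455 exabytes per gram figure can be checked with a short calculation. The sketch below assumes an average mass of about 330.95 g/mol per ssDNA nucleotide; that value is an assumption made here for illustration and is not stated in the text above.

```python
# Back-of-the-envelope check of the theoretical density quoted above.
AVOGADRO = 6.02214076e23      # nucleotides per mole
NT_MASS_G_PER_MOL = 330.95    # ASSUMED average mass of one ssDNA nucleotide
BITS_PER_NT = 2               # theoretical maximum from the text


def max_bytes_per_gram(nt_mass=NT_MASS_G_PER_MOL, bits_per_nt=BITS_PER_NT):
    """Return the theoretical number of bytes stored in one gram of ssDNA."""
    nucleotides_per_gram = AVOGADRO / nt_mass
    return nucleotides_per_gram * bits_per_nt / 8


exabytes_per_gram = max_bytes_per_gram() / 1e18
print(f"{exabytes_per_gram:.0f} EB per gram of ssDNA")
```

Under that assumption the result comes out at roughly 455 EB per gram, in line with the value cited from [1].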
DNA is a highly stable molecule with a remarkably long life span even in suboptimal environments, making it an ideal storage material. Indeed, more than 80% of the woolly mammoth (Mammuthus primigenius) genome, comprising 3.3 billion nt, remains readable even though the species disappeared from the planet at the end of the Pleistocene (about 10,000 years ago).
Fig.2. Extracting and reading DNA from a mammoth fossil.
Molecular biology now provides us with the tools to cut (restriction endonucleases), paste (DNA ligase) and copy (PCR) DNA, much as we might edit the text of a Word document. DNA also does not require frequent maintenance, and reading it back will not run into compatibility problems.
What does Bio101 develop or improve as a DNA information storage system?
While working on Bio101, we found that CUHK [2] developed a similar project in 2010. We compared our project with CUHK’s, and the results are shown in Table 1:
Tab.1. The comparison of two projects.
More details about the features of Bio101 follow:
1. Higher compression: We use the bzip2 algorithm to compress the file before encoding, which speeds up encoding enough to meet the demands of the web app. Table 2 [3] shows that bzip2 achieves a higher compression ratio than the other compression algorithms compared, which means less storage space and fewer bases, and therefore a lower DNA synthesis cost.
Tab.2. Comparison of several kinds of compression software.
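As a minimal sketch of the compression step, the example below uses Python's standard `bz2` module. The function names and the sample data are hypothetical and are not taken from the Bio101 code; the estimate of saved bases assumes four nucleotides per byte, as in the bit-to-nt conversion.

```python
import bz2


def compress_for_encoding(data: bytes, level: int = 9) -> bytes:
    """Compress raw file bytes with bzip2 before converting them to bases."""
    return bz2.compress(data, compresslevel=level)


def restore_after_decoding(blob: bytes) -> bytes:
    """Undo the bzip2 step after the bases have been converted back to bytes."""
    return bz2.decompress(blob)


# Hypothetical, highly repetitive sample, so it compresses well.
sample = b"DNA storage " * 1000
packed = compress_for_encoding(sample)

# At two bits per nucleotide, every byte saved means four bases fewer to synthesize.
bases_saved = (len(sample) - len(packed)) * 4
print(len(sample), len(packed), bases_saved)
```

The actual ratio depends heavily on the input: already compressed formats (JPEG, ZIP, and so on) will shrink far less than plain text.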
2. Random encryption: We use ISAAC [4], an encryption algorithm that is also a fast cryptographic random number generator, to ensure that the bases in the resulting DNA sequence are close to random, which reduces homopolymers.
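The effect of this randomization on homopolymers can be illustrated with a small sketch. It does not implement ISAAC: as a stand-in, it derives a keystream from SHA-256 in counter mode and XORs it with the data. The seed, the function names and the most-significant-bits-first order are assumptions made for this example; only the two-bit mapping (A = 00, C = 01, G = 10, T = 11) comes from the conversion rules of item 3.

```python
import hashlib

BASES = "ACGT"  # index 0..3 corresponds to the bit pairs 00, 01, 10, 11


def keystream(seed: bytes, n: int) -> bytes:
    """Stand-in keystream (SHA-256 in counter mode), NOT the ISAAC generator."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def whiten(data: bytes, seed: bytes) -> bytes:
    """XOR the data with the keystream; applying it twice restores the input."""
    return bytes(d ^ k for d, k in zip(data, keystream(seed, len(data))))


def to_nt(data: bytes) -> str:
    """Map each byte to four bases, most significant bit pair first (assumed)."""
    return "".join(BASES[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    best = run = 0
    prev = ""
    for base in seq:
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best


raw = bytes(1000)                 # worst case: all zero bits, i.e. all 'A'
seed = b"example-seed"            # hypothetical seed
print(longest_homopolymer(to_nt(raw)))                # one run of 4000 'A'
print(longest_homopolymer(to_nt(whiten(raw, seed))))  # short runs only
```

Because XOR with the same keystream is its own inverse, the decoder only needs the seed to recover the original bytes.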
3. New conversion for bit-to-nt: We convert each byte (eight bits) into four nucleotides, mapping each pair of bits to A (00), T (11), C (01) or G (10), which greatly improves the coding efficiency of our system. The conversion rules are shown in Table 3.
Tab.3. Encoding rules.
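The mapping above can be sketched in a few lines. The function names are hypothetical, and reading each byte from its most significant bit pair to its least significant one is an assumption; the text only fixes the pair-to-base mapping.

```python
BIT_PAIR_TO_NT = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
NT_TO_BIT_PAIR = {nt: bits for bits, nt in BIT_PAIR_TO_NT.items()}


def bytes_to_nt(data: bytes) -> str:
    """Convert each byte into four bases, most significant bit pair first."""
    return "".join(
        BIT_PAIR_TO_NT[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def nt_to_bytes(seq: str) -> bytes:
    """Convert a base string back into bytes (length must be a multiple of 4)."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for nt in seq[i:i + 4]:
            byte = (byte << 2) | NT_TO_BIT_PAIR[nt]  # KeyError on non-ACGT input
        out.append(byte)
    return bytes(out)


# 0x1B is 00 01 10 11 in binary, so it maps to one of each base.
print(bytes_to_nt(b"\x1b"))   # ACGT
print(nt_to_bytes("ACGT"))    # b'\x1b'
```

Each base carries two bits, so this conversion reaches the theoretical maximum of two bits per nucleotide before any index or redundancy is added.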
4. Fault tolerance: Our system uses 200 bp fragments, each shifted by 50 bp from the previous one, to ensure four-fold [5] coverage of the sequence, so that the correct information can be recovered from the redundant copies. We also add an index to each fragment, containing an address code and a check code. The address code gives the location of the fragment in the file, and the check code shows whether the fragment was corrupted during synthesis, storage or sequencing.
Fig.3. Fourfold redundancy and index to improve fault tolerance.
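The overlapping-fragment scheme can be sketched as follows. The index format here is hypothetical: the start offset stands in for the address code and a CRC32 stands in for the check code, since the text does not specify either. The sketch also assumes that the sequence is at least 200 nt long and that its length is a multiple of 50.

```python
import zlib
from collections import Counter

FRAGMENT_LEN = 200   # fragment length from the text
STEP = 50            # shift between consecutive fragments from the text


def make_fragments(seq, length=FRAGMENT_LEN, step=STEP):
    """Return (address, fragment, check) tuples for overlapping fragments."""
    frags = []
    for start in range(0, len(seq) - length + 1, step):
        frag = seq[start:start + length]
        frags.append((start, frag, zlib.crc32(frag.encode("ascii"))))
    return frags


def coverage(frags, total_len):
    """Count how many fragments cover each position of the sequence."""
    depth = [0] * total_len
    for start, frag, _ in frags:
        for i in range(len(frag)):
            depth[start + i] += 1
    return depth


def reassemble(frags, total_len):
    """Drop fragments whose check fails, then take a majority vote per position."""
    votes = [Counter() for _ in range(total_len)]
    for start, frag, check in frags:
        if zlib.crc32(frag.encode("ascii")) != check:
            continue  # corrupted fragment: ignore it
        for i, base in enumerate(frag):
            votes[start + i][base] += 1
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)


seq = "ACGTTGCA" * 125                    # hypothetical 1000 nt sequence
frags = make_fragments(seq)
frags[4] = (frags[4][0], "A" * FRAGMENT_LEN, frags[4][2])  # simulate a bad read
print(reassemble(frags, len(seq)) == seq)
```

In this simple linear layout, interior positions are covered four times, while the first and last 150 nt are covered by fewer fragments; a real system would need to handle the ends separately.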
Interface: We designed a web page that lets users try our software. Users can upload a file of any format to encode, or a file of DNA sequences to decode, and then download the generated DNA sequence file or the recovered original file.
Fig.4. User-friendly interface of Bio101.
Compatibility: Bio101 runs stably on a number of multitasking operating systems without frequent crashes. Users can choose any file they want and then focus on synthesizing DNA with Bio101. The software is accessible from any device and platform.
Extendable: A program should also be judged by its portability. Our code is open source, and we provide four APIs so that developers can reuse the functions of our software: the ISAAC64 random encryption algorithm, bit-to-nt conversion, nt-to-bit conversion and BLAST.
Fig.5. Converting any file with Bio101 on different devices and platforms.
References
- [1] George M. Church, Yuan Gao, Sriram Kosuri. Next-Generation Digital Information Storage in DNA. Science, published online August 16, 2012.
- [2] https://2010.igem.org/Team:Hong_Kong-CUHK.
- [3] http://www.cnblogs.com/langzou/p/5823285.html.
- [4] http://burtleburtle.net/bob/rand/isaacafa.html.
- [5] Goldman N, Bertone P, Chen S, et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 2013, 494(7435):77-80.