Project Description
Introduction
This summer our team, BabblED, designed and implemented a modular system for storing data in DNA. Through our modular design, we were able to incorporate error-correcting and safety mechanisms into our BabbleBricks (information-containing DNA fragments). Read below to find out more about the motivations behind our project and how we have optimised DNA as a storage medium.
The Age of Big Data
Every day, we generate more than 2.5 billion gigabytes of new data [1]. At this rate, by 2020 the amount of digital data we have produced will exceed 40 zettabytes, an amount estimated to be fifty-seven times the number of grains of sand on all beaches on Earth [2]. This data is generated and collected from everywhere: tweets, Facebook messages, Google Maps searches, transaction records and pictures.
Much of this digitally generated data is stored instantly on the flash memory devices found in our phones or laptops, but a large portion of it, as required by law, must be archived for a longer period of time.
Data Law
Parliamentary law in the United Kingdom requires companies to retain generated data for anywhere from 12 months to 50 years [3]. For example, the Data Retention and Investigatory Powers Act 2014 states that any commercial company that provides communication services, such as a telephone or internet company, must retain all customer data for 12 months, so that the data can be accessed by law enforcement should it be needed to investigate a crime [4]. Other laws, such as the Ionising Radiations Regulations 1999 or the Control of Substances Hazardous to Health Regulations 2002, require that data relating to medical or hazardous substances be retained for 40 to 50 years.
In most cases, this data will be generated on a computer or another digital system. However, its long-term storage is unlikely to be on a computer or a flash memory medium.
Archival Data Crisis
In addition to the data that is legally required to be retained, fundamental social institutions such as universities and libraries will archive data, whether it be ancient manuscripts or academic journals, for hundreds of years.
Archival data that is accessed infrequently and must be kept in a stable form for long periods of time is often stored on magnetic tape.
Through our conversations with data librarians, specialists and computing experts, we came to realise that there is an issue with storing archival data on tape. Our conversations with the National Library of Scotland shed some light on this. Magnetic tape has a life span of about six years; though this is longer than the two-year life span of hard drives, it imposes a massive cost on archivists, because every six years they have to read and rewrite this data onto new tapes. This year it will take them 10 days to transfer 100TB of data, and it will cost them £230 in power costs, £310 to pay staff and £916 for the new tapes. In 2022, they will need to transfer 937TB of data; this will take them 71 days and cost £5,344 in total. Clearly, this process is unfeasible in the long term.
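The migration figures above can be tallied in a few lines. This is an illustrative sketch only: the six-year refresh cycle and the three cost categories come from the figures quoted, but the idea that per-terabyte cost can be derived by simple division is our own simplifying assumption, not the library's costing model.

```python
def migration_cost(power_gbp, staff_gbp, tapes_gbp):
    """Total cost of one tape-to-tape migration cycle."""
    return power_gbp + staff_gbp + tapes_gbp

# 2017 migration: 100 TB over 10 days (figures quoted above)
cost_2017 = migration_cost(power_gbp=230, staff_gbp=310, tapes_gbp=916)
print(cost_2017)              # 1456 (GBP)

# 2022 migration: 937 TB, quoted total of £5,344
cost_2022 = 5344

# Implied cost per terabyte for each cycle (simplifying assumption)
print(round(cost_2017 / 100, 2))   # 14.56 GBP/TB in 2017
print(round(cost_2022 / 937, 2))   # 5.7 GBP/TB in 2022
```

Even though the per-terabyte cost falls as volumes grow, the absolute cost and the staff time per cycle keep climbing, which is the unsustainability the librarians described.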
The National Library of Scotland stores data that dates back to the 9th century. Imagine if these precious pieces of human history were lost. If we lost our capacity to archive ancient texts, not to mention data from new discoveries, how would we retain human knowledge?
DNA Data Storage
DNA is nature’s form of information storage. Though not the first to think of it, our team believes that DNA is the ideal long-term storage medium, both in nature and for human-generated data. DNA is denser, longer-lasting and more sustainable than any other storage medium.
This summer, we decided to address the setbacks faced by previous DNA storage projects. Firstly, we designed our system to be modular; information stored by our system is encoded in 50bp DNA fragments termed BabbleBricks. Our Bricks are assembled into larger constructs to densely store any type of data. Our BabbleBricks have made our system more accessible and more secure. For example, our assembly method is up to three times cheaper than de novo DNA synthesis. The BabbleBricks also allowed us to incorporate various error-correcting mechanisms that ensure your data can be accessed and read after long-term storage.
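The core idea of chunking data into fixed-length DNA fragments can be sketched in a few lines. This is a deliberately simplified illustration: the two-bits-per-base mapping and the plain 50-base chunking below are our assumptions for the sketch, and the real BabbleBrick format additionally carries addressing and error-correction fields that are not shown here.

```python
BASES = "ACGT"  # simple 2-bits-per-base mapping: 00->A, 01->C, 10->G, 11->T

def bytes_to_dna(data: bytes) -> str:
    """Map each byte to four bases, two bits at a time (high bits first)."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

def to_fragments(dna: str, length: int = 50) -> list:
    """Split a DNA string into fragments of at most `length` bases."""
    return [dna[i:i + length] for i in range(0, len(dna), length)]

payload = b"Hello, BabblED!"        # 15 bytes -> 60 bases
fragments = to_fragments(bytes_to_dna(payload))
print([len(f) for f in fragments])  # [50, 10]
```

Fragment-level storage like this is what makes the system modular: each fragment can be synthesised, indexed and error-protected independently, then the full construct is reassembled on readout.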
References
1. https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
2. http://www.computerworld.com/article/2493701/data-center/by-2020--there-will-be-5-200-gb-of-data-for-every-person-on-earth.html
3. https://www.eradar.eu/data-retention-periods/
4. https://www.gov.uk/government/collections/data-retention-and-investigatory-powers-act-2014
5. http://science.sciencemag.org/content/337/6102/1628.full
6. http://www.nature.com/news/how-dna-could-store-all-the-world-s-data-1.20496
7. https://homes.cs.washington.edu/~luisceze/publications/dnastorage-asplos16.pdf