Team:Groningen/DecodingFidelity

Decoding – costs and fidelity

The cost of sequencing DNA has plummeted over the last two decades. For instance, the Human Genome Project, completed in the early 2000s, took 13 years and about US$3 billion to sequence the entire human genome. With current technologies, the cost of a whole genome has approached the $1,000 mark. In fact, prices have fallen especially fast since 2007, when second-generation techniques were introduced. However, there are still issues with the fidelity and accuracy of the reads obtained that would prevent them from being used for our system[5].

Price of sequencing one million base pairs. Prices have dropped sharply since 2007, when second-generation techniques were introduced to the market. Source: National Human Genome Research Institute

Existing laboratory-level DNA sequencing technologies typically have a read error rate of ~1%. Thanks to optimization and fine-tuning, traditional Sanger sequencing now offers per-base accuracies of up to 99.999% at read lengths of up to 1,000 bp. Modern second-generation (cyclic) sequencing has achieved much higher throughput, but at the cost of accuracy[4].
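To make these accuracy figures concrete, the following minimal sketch converts a per-base accuracy into the expected number of miscalled bases in a single read. It assumes that errors are independent and uniformly distributed along the read, which real platforms do not guarantee, and the function name is purely illustrative.

```python
def expected_miscalls(read_length_bp: int, per_base_accuracy: float) -> float:
    """Expected number of miscalled bases in one read.

    Assumption: every base is miscalled independently with the same
    probability (1 - per_base_accuracy).
    """
    return read_length_bp * (1.0 - per_base_accuracy)


# Figures quoted in the text: ~1% error for typical reads,
# up to 99.999% accuracy for optimized Sanger reads of 1,000 bp.
typical = expected_miscalls(1000, 0.99)      # about 10 miscalled bases
sanger = expected_miscalls(1000, 0.99999)    # about 0.01 miscalled bases

print(f"typical read: {typical:.2f} expected miscalls per 1,000 bp")
print(f"Sanger read:  {sanger:.2f} expected miscalls per 1,000 bp")
```

Under this simplification, a 1,000 bp read at ~1% error carries on the order of ten wrong bases, while an optimized Sanger read of the same length is usually error-free.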

One of CryptoGERM's bioencryption layers prevents unauthorized parties from reading the message by mixing the message spores with a high proportion of decoy spores that carry a useless sequence. In addition, if the right growing conditions are not supplied, our system prevents germination and replication of the spores that do contain the message. So, what ratio of decoy to message spores do we need to prevent a third party from retrieving the message by brute-force sequencing?

A) One of the bioencryption layers is a high proportion of decoy spores relative to those that contain the intended, useful sequence. If the right conditions (e.g. adding a specific antibiotic X to the system) are not met, the spores that contain our message will die and be outgrown by the decoys. B) Current sequencing techniques will not be able to distinguish the hidden message from noise.

Fox et al. (2014) report that there is a 50% chance of accurately distinguishing a true subclonal variant from a sequencing artifact when it is present among an excess of 100 wild-type DNA sequences, using standard Q30-filtered reads (error rate: 10⁻²)[1]. In agreement with that result, in an experiment carried out to detect genomic variation in marine pests, Pochon et al. (2013) were able to detect one variant among 150 wild-type sequences[2].
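One rough way to see why a 1:100 mixture sits near the 50% mark is a toy model, offered here as an illustration rather than as the analysis performed in [1]. Suppose one message spore is mixed with a given number of decoys, and look at a single position where the two sequences differ. A read showing the message base is either a genuine message read or a decoy read with a sequencing error. The sketch below assumes the worst case for the eavesdropper (every error on a decoy read mimics the message base) and that message reads themselves are error-free; the parameter name `decoys_per_message` is hypothetical.

```python
def message_fraction(decoys_per_message: float) -> float:
    """Fraction of molecules that carry the message in a 1:N mixture."""
    return 1.0 / (1.0 + decoys_per_message)


def prob_variant_is_real(decoys_per_message: float, error_rate: float) -> float:
    """Probability that a read showing the message base is a true message read.

    Toy model assumptions (not taken from the cited papers):
    - every sequencing error on a decoy read produces the message base,
    - message reads are read without error.
    """
    f = message_fraction(decoys_per_message)
    artifacts = (1.0 - f) * error_rate  # decoy reads that look like the message
    return f / (f + artifacts)


print(prob_variant_is_real(100, 1e-2))       # about 0.5
print(prob_variant_is_real(150, 1e-2))       # about 0.4
print(prob_variant_is_real(10_000, 1e-2))    # about 0.01
print(prob_variant_is_real(10_000, 3e-10))   # very close to 1
```

In this simplified picture, the message becomes indistinguishable from noise once the decoy excess exceeds roughly the inverse of the error rate, which is why a lower error rate calls for a larger proportion of decoys.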

Reading DNA has become increasingly accurate. Schmitt et al. (2012) developed a method called Duplex Sequencing that uses both strands of DNA to obtain a more precise consensus sequence, yielding a theoretical error rate of 3×10⁻¹⁰[3]. That means that we could, in principle, transmit a message on the order of a billion bases long while expecting, on average, less than one read error! On the other hand, it also means that such methods allow more precise measurements in mixtures of message and decoy spores, so a 1:150 message:decoy ratio might be insufficient in the future. In fact, using Duplex Sequencing the authors were able to identify one mutant sequence per 10,000 wild-type molecules.
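As a back-of-the-envelope check on what such an error rate means for message length, the sketch below estimates how many bases can be read before one error is expected, and the chance that a message of a given length is read with no error at all. It assumes independent, uniform per-base errors and uses the theoretical rate quoted above; the function names are illustrative, and no particular encoding of bytes into bases is assumed.

```python
import math


def max_bases_within_budget(error_rate: float, max_expected_errors: float = 1.0) -> float:
    """Longest message (in bases) whose expected error count stays within budget."""
    return max_expected_errors / error_rate


def prob_error_free(n_bases: int, error_rate: float) -> float:
    """Probability that all n_bases are read correctly.

    Computed as (1 - error_rate) ** n_bases, using log1p to stay
    numerically accurate for very small error rates.
    """
    return math.exp(n_bases * math.log1p(-error_rate))


duplex_rate = 3e-10     # theoretical Duplex Sequencing error rate quoted in the text
standard_rate = 1e-2    # standard filtered reads, as quoted in the text

print(max_bases_within_budget(duplex_rate))         # roughly 3.3e9 bases
print(max_bases_within_budget(standard_rate))       # about 100 bases
print(prob_error_free(1_000_000_000, duplex_rate))  # about 0.74
```

Under these assumptions, a message of a billion bases would be read without any error roughly three times out of four at the theoretical Duplex Sequencing rate, whereas at a 10⁻² error rate an error is expected about every hundred bases.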

References
  • [1] Fox EJ, Reid-Bayliss KS, Emond MJ, Loeb LA (2014) Accuracy of next generation sequencing platforms. Next Generat Sequenc & Applic 1: 106. doi:10.4172/jngsa.1000106
  • [2] Pochon X, et al. (2013) Evaluating detection limits of next-generation sequencing for the surveillance and monitoring of international marine pests. PLoS ONE 8(9): e73935.
  • [3] Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, et al. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A 109: 14508-14513.
  • [4] Wang X, Blades N, Ding J, Sultana R, Parmigiani G (2012) Estimation of sequencing error rates in short reads. BMC Bioinformatics 13: 185.
  • [5] Zhu X, Wang J, Peng B, Shete S (2016) Empirical estimation of sequencing error rates using smoothing splines. BMC Bioinformatics 17: 177.