Difference between revisions of "Team:Edinburgh UG/Error Correction"

Line 98: Line 98:
  
 
   <header>
 
   <header>
       <img class="img-responsive img-center" src="https://static.igem.org/mediawiki/2016/8/8e/EdiGEM16ug_errorCorrectionCrop.jpeg" alt="">
+
       <img class="img-responsive img-center" src="https://static.igem.org/mediawiki/2016/1/17/EdiGEM16ug_errorCorrectionWhiteBack.jpeg" alt="">
 
   </header>
 
   </header>
  

Revision as of 14:12, 2 September 2016

In any information storage system, the primary issue is being able to read back data with as high a fidelity as possible. Storage of archival data would be futile if the medium used could not guarantee this. BabblED utilises a powerful system of error detection and correction using checksums, optimal rectangular codes and Levenshtein distance to ensure fidelity of data even when a DNA sequence is damaged or there are mistakes in sequencing. Using these and the incredible properties of DNA we can be confident in accurately storing archival data for 10s or even 100s of years. You can read more about all these mechanisms below or jump straight into the code at our Github

A checksum is a commonly used technique in network engineering that we’ve adapted for use in DNA storage. In this format the checksum allows the detection of errors in a BabbleBlock (DNA sentence). The checksum holds the sum of the word coding regions of BabbleBricks (DNA words) in a BabbleBlock in a 4 BabbleBrick region that can be found towards the end of a BabbleBlock.

When a BabbleBlock is being translated the checksum is looked at first to see if it matches the sums of the word coding regions in the DNA we get back from sequencing. If this is the case, no errors have appeared and the translation program can proceed as normal however if there is no match an error has occurred somewhere in the BabbleBlock and other error correcting mechanisms will have to be used to make sure the stored information can still be read back.

The Optimal Rectangular Code (ORC) is a type of error correcting code that is present in all of our BabbleBricks. When the checksum detects an error the next action is to checks the ORCs in each word and use them to correct any errors that have taken place as follows:

Unfortunately, the ORC can only locate and fix single base pair errors however the properties of the ORC facilitate another added level of error correction.

If the Optimal Rectangular Code (ORC) detects more than the single base pair error it can fix using its normal actions the SuperFixer function is called. The SuperFixer uses the ORC and works out all the possible words in the lexicon that fit this pattern. For any given ORC this is a maximum of 4 words and indeed there are many words that have unique ORC’s. Once the SuperFixer has narrowed down the possible words the list of possibilities is compared to what appears in the actual DNA and the sequence with the lowest Levenshtein distance is used in the translation. Levenshtein distance is a metric which shows the number of deletions, insertions or substitutions it will take to turn one string (or sequence) into another.

By taking the minimum Levenshtein distance we always work under the assumption that the smallest number of errors is most probable. This assumption gives the best performance across errors of different magnitudes due to the robustness of DNA as a molecule making small errors considerably more likely than larger ones.

Finally, this graph represents the performance of our entire error correction system on errors of varying magnitudes: