Team:Edinburgh UG/Lexicon Encoding



To create BabblED we needed to rapidly design and process the information contained in a large lexicon of BabbleBricks. This would have been practically impossible by hand, so we created a software tool that encodes and decodes BabbleBricks and BabbleBlocks. You can read more about these mechanisms below or jump straight into the code on our GitHub.



To encode the BabbleBricks that make up a lexicon, we begin with a list of information units, for example words. We number the entries of this list, first in decimal and then in base 4, so that each number is written using only the digits 0-3 rather than the usual 0-9. We convert these base 4 numbers into their DNA equivalents using the schema A is 0, T is 1, G is 2, C is 3, and pad each one to 5 base pairs. Once we have these variable sequences we must ensure that no illegal restriction sites can occur, so we add gap sequences. Finally, we append a stop codon region, a gapped error correcting region that prevents restriction sites, and hangs in each BabbleBrick form.
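
The numbering and base 4 conversion described above can be sketched in Python as follows. This is a minimal illustration rather than the code from our tool: the function names are hypothetical, and it assumes that numbering starts at 0 and that padding is done on the left with A (the digit 0). It covers only the 5 base word coding region, not the gap sequences, stop codon region, error correcting region or hangs.

```python
# Digit-to-base schema from the text: A is 0, T is 1, G is 2, C is 3.
DIGIT_TO_BASE = "ATGC"
WORD_LENGTH = 5  # each number is padded to 5 bases


def index_to_dna(index, length=WORD_LENGTH):
    """Write a lexicon index in base 4 and spell it with DNA bases.

    Assumption: padding is added on the left with 'A' (digit 0).
    """
    if not 0 <= index < 4 ** length:
        raise ValueError(f"index {index} does not fit in {length} bases")
    bases = []
    for _ in range(length):
        # Peel off the least significant base 4 digit each time round.
        index, digit = divmod(index, 4)
        bases.append(DIGIT_TO_BASE[digit])
    # Digits were collected least significant first, so reverse them.
    return "".join(reversed(bases))


def encode_lexicon(words):
    """Map each word to the DNA form of its position in the list.

    Assumption: numbering starts at 0.
    """
    return {word: index_to_dna(i) for i, word in enumerate(words)}
```

With 5 bases this scheme can number 4^5 = 1024 distinct entries; for instance, the decimal number 6 is 12 in base 4, which becomes TG and is padded to AAATG.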

When we assemble our BabbleBricks into BabbleBlocks, it's vital that we know what the resulting sequence will be: for verification purposes, so that we can tell the user exactly which BabbleBricks to use, and so that we can work out our checksum and address values. We start by appending our word coding BabbleBricks together:

5' GGAGACCAAAATAGCTAATCACTTATGAAAGGAATTAAGGAATTAA + GGAGACCAAATTAGCTAATCACTTATGAAAGGATTTAAGGATTTAA

5' GGAGACCAAAATAGCTAATCACTTATGAAAGGAATTAAGGAATTAAGGAGACCAAATTAGCTAATCACTTATGAAAGGATTTAAGGATTTAA
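
As a small sketch of this step, appending BabbleBricks amounts to joining their sequences end to end in order. The function name below is hypothetical; the two sequences are the ones shown above.

```python
def append_bricks(bricks):
    """Join BabbleBrick sequences (written 5' to 3') in the order given."""
    return "".join(bricks)


# The two word coding BabbleBricks from the example above.
brick_1 = "GGAGACCAAAATAGCTAATCACTTATGAAAGGAATTAAGGAATTAA"
brick_2 = "GGAGACCAAATTAGCTAATCACTTATGAAAGGATTTAAGGATTTAA"

assembled = append_bricks([brick_1, brick_2])
```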

We then look at the word coding regions, which are spaced at regular intervals, and use them to calculate a checksum as described in the error correction section. Finally, we append an address BabbleBrick, which acts like a line number telling the decoding program where this BabbleBlock lies in the overall archive.

To decode a BabbleBlock, we first look at its checksum and use it to verify whether error correction is needed. If any mistakes are flagged, we use our error correcting apparatus and return its result (what the sequence was before the change) to the decoding program. The decoding program then looks at each word coding region, converts it back to its numerical value and uses that value as an index to look up the information value of that BabbleBrick in the lexicon. Once all the BabbleBlocks in a batch have been decoded, they are sorted into order using the address found at the end of each sequence before the decoded information is returned to the user.
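
The lookup and sorting steps might look roughly like the Python sketch below. It rests on several assumptions that are not stated above: the function names are hypothetical, each BabbleBlock is assumed to have already been error corrected and split into an address region and a list of word coding regions, and the address is assumed to use the same base 4 scheme as the words. The checksum and error correction steps are left out.

```python
# Inverse of the encoding schema: A is 0, T is 1, G is 2, C is 3.
BASE_TO_DIGIT = {"A": 0, "T": 1, "G": 2, "C": 3}


def dna_to_index(region):
    """Read a DNA region as a base 4 number, most significant base first."""
    index = 0
    for base in region:
        index = index * 4 + BASE_TO_DIGIT[base]
    return index


def decode_blocks(blocks, lexicon):
    """Decode a batch of blocks into an ordered list of lexicon entries.

    Assumption: `blocks` is a list of (address_region, word_regions)
    pairs that have already passed error correction.
    """
    # Sort blocks by address so the output follows the archive order.
    ordered = sorted(blocks, key=lambda block: dna_to_index(block[0]))
    decoded = []
    for _address, word_regions in ordered:
        for region in word_regions:
            # Each region is an index into the lexicon.
            decoded.append(lexicon[dna_to_index(region)])
    return decoded
```

For example, with the lexicon `["the", "quick", "brown", "fox"]`, two blocks supplied out of order as `("AAAAT", ["AAAAG", "AAAAC"])` and `("AAAAA", ["AAAAA", "AAAAT"])` decode back to the four words in their original order.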



