Team:Groningen/Software

CryptoGE®M
Team
Project
Biology
Computing
Human Practice
Acknowledgements

Software

This article will explain in detail how the software part of CryptoGErM works. To encrypt the message and translate it to DNA we wrote several Javascripts. They are used in the demonstration of our software on the Coding and Decoding pages. The following sections cover the same steps the software takes to turn a text into DNA. First encryption, then translation to DNA and finally packing it into a complete message sequence. The examples will use Hello world as the message text, and secret for the encryption key.

The software is divided into several modules, each in their own file. Two of the modules were not written by us, but used under an open source license. The AES module was written by Chris Veness at Movable-Type.co.uk. The CRC implementation was written by Github user chitchcock and published as Github Gist #5112270. All other code was written by us and is licensed under the MIT license.

Encryption & Decryption

We use the AES module for encryption and decryption, specifically: AES counter mode. This mode allows us to encrypt (and decrypt) messages of any length. It does this by splitting the message up into blocks and adding a counter to every block. The first block contains an initialization vector, which is randomized every time and also the number of blocks that follow it. As a result of this every block has a different cipher text, even if there are repetitions in the original message. In fact, the cipher text for the entire message is different, even if the same message is encrypted again.

Encrypting our example message Hello world with the key secret is now as easy as calling Aes.Ctr.encrypt(message, key) with (possibly) this result: (�5z�õøWqfÚ7­˜÷�''. The next step is to turn this cipher text into DNA.

The AES module is in aes.js and the counter mode addition is in aes-ctr.js.

DNA

Before the text can be integrated into a bacterium, it needs to be translated to DNA. Computers store letters, digits and other characters as numbers. The numbers are translated to the symbols you see via an encoding table. The most common encodings are ASCII and Unicode. The cipher text from the example is encoded as follows: 040 002 053 122 008 245 248 087 113 102 218 055 173 152 247 011 039 039 144. We use Unicode Transform Format 8, which means that every character is encoded by 1 up to 6 bytes of 8 bits each. In binary, the bytes of the cipher text look like this: 00101000 00000010 00110101 01111010 00001000 11110101 11111000 01010111 01110001 01100110 11011010 00110111 10101101 10011000 11110111 00001011 00100111 00100111 10010000. Binary has only two digits, 0 and 1, but DNA has four 'digits': ACTG. This means those bytes that need eight bits, need only 4 DNA base-pairs. The translation to DNA is done like this:

Lets take the letter 'm' as an example.

  1. The ASCII/Unicode code is 109.
  2. In binary it is 01101101.
  3. Cut off the right-most, the least significant, 2 bits: 01.
  4. Look up the base: 00 = A, 01 = C, 10 = T, 11 = G: C.
  5. Append the base to the output: C.
  6. The input number is now: 011011.
  7. Repeat steps 3 - 6 until there are no more bits left.

The result is: CGTC.

When we apply the above algorithm to all the characters in the cipher text, we get: ATTA TAAA CCGA TTGC ATAA GAAG CCGT GAAG ATGT GCCC CAGC TCTC GAAG TTCT GCGA TAAG CGTT TAAG ATCT GAAG GCGT GTAA GCTA GCTA TAAG AACT. This, however, is not enough DNA to be able to decode it after extracting it from a bacterium. The next section explains more about the meta-data that goes with the cipher text/DNA.

Translation to DNA is implemented by the EightBit and DNA modules.

Message format

Translating just the cipher text to DNA is not enough, we need some more information in the sequence to be able to decode it after extraction. We'll put this information in a header: a block of DNA that precedes the cipher text DNA. The decoding machine will first read the information in the header and use that to decode the rest of the message.

Start & end

The first problem is that our sequence is inserted into other DNA. When it is sequenced again there is also some DNA in front and after the end of the message. We need a way to find out exactly where our message starts.

To solve this, every message block always starts with the same sequence: GACCAAGCCTGC. The decoding machine looks for this sequence and then decodes the DNA that follows it. Of course, it is possible that this sequence appears as part of non-message related DNA, or it may be part of the message. The next pieces of message information help with solving this. What the decoder does is look for the start sequence and try to decode the rest, if that fails it will look for the next occurrence of the start sequence and decode that. If it reaches the end of the complete sequence without successfully decoding a message it will come to the conclusion that there is no intact message in the sequence.

Length

At the end of our sequence, after the message, is a footer: GCACCCACCGAC. This helps verify the sequence, but it suffers from the same

GACCAAGCCTGCAAAACCAAATTCAAAACCCTCCAGATTATAAACCG ATTGCATAAGAAGCCGTGAAGATGTGCCCCAGCTCTCGAAGTTCTGCGATAAGCGTTTAAGATAA AAACTGAAGGCGTGTAAGCTAGCTATAAGAACTGCACCCACCGAC.

Oop top