Team:Groningen/Software

CryptoGE®M
Team
Project
Biology
Computing
Human Practice
Acknowledgements

Software

This article will explain in detail how the software part of CryptoGErM works. To encrypt the message and translate it to DNA we wrote several Javascripts. They are used in the demonstration of our software on the Coding and Decoding pages. The following sections cover the same steps the software takes to turn a text into DNA. First encryption, then translation to DNA and finally packing it into a complete message sequence. The examples will use Hello world as the message text, and secret for the encryption key.

The software is divided into several modules, each in their own file. Two of the modules were not written by us, but used under an open source license. The encryption code, provided by the AES module was written by Chris Veness at Movable-Type.co.uk. The CRC implementation for computing checksums was written by Github user chitchcock and published as Github Gist #5112270. All other code was written by us and is licensed under the MIT license.

Encryption & Decryption

We use the AES module for encryption and decryption, specifically: AES counter mode. This mode allows us to encrypt (and decrypt) messages of any length. It does this by splitting the message up into blocks and adding a counter to every block. The first block contains an initialization vector, which is randomized every time and also the number of blocks that follow it. As a result of this every block has a different cipher text, even if there are repetitions in the original message. In fact, the cipher text for the entire message is different, even if the same message is encrypted again.

Encrypting our example message Hello world with the key secret is now as easy as calling Aes.Ctr.encrypt(message, key) with (possibly) this result: ÑTyûWáû~€zHP�Kúã. The next step is to turn this cipher text into DNA.

The AES module is in aes.js and the counter mode addition is in aes-ctr.js.

DNA

Before the text can be integrated into a bacterium, it needs to be translated to DNA. Computers store letters, digits and other characters as numbers. The numbers are translated to the symbols you see via an encoding table. The most common encodings are ASCII and Unicode. The cipher text from the example is encoded as follows: 138 000 014 209 084 121 251 087 225 251 126 128 122 072 080 007 075 250 227. We use Unicode Transform Format 8, which means that every character is encoded by 1 up to 6 bytes of 8 bits each. In binary, the bytes of the cipher text look like this: 10001010 00000000 00001110 11010001 01010100 01111001 11111011 01010111 11100001 11111011 01111110 10000000 01111010 01001000 01010000 00000111 01001011 11111010 11100011. Binary has only two digits, 0 and 1, but DNA has four 'digits': ACTG. This means those bytes that need eight bits, need only 4 DNA base-pairs. The translation to DNA is done like this:

Lets take the letter 'm' as an example.

  1. The ASCII/Unicode code is 109.
  2. In binary it is 01101101.
  3. Cut off the right-most, the least significant, 2 bits: 01.
  4. Look up the base: 00 = A, 01 = C, 10 = T, 11 = G: C.
  5. Append the base to the output: C.
  6. The input number is now: 011011.
  7. Repeat steps 3 - 6 until there are no more bits left.

The result is: CGTC.

When we apply the above algorithm to all the characters in the cipher text, we get: TAAG TTAT AAAA TGAA GAAG CACT ACCC CTGC GAAG GTGT GCCC GAAG CATT GAAG GTGT TGGC TAAG AAAT TTGC ATAC AACC GCAA GTAC GAAG TTGT GAAG GATT. This, however, is not enough DNA to be able to decode it after extracting it from a bacterium. The next section explains more about the meta-data that goes with the cipher text/DNA.

Translation to DNA is implemented by the EightBit and DNA modules.

Message format

Translating just the cipher text to DNA is not enough, we need some more information in the sequence to be able to decode it after extraction. We'll put this information in a header: a block of DNA that precedes the cipher text DNA. The decoding machine will first read the information in the header and use that to decode the rest of the message.

Start & end

The first problem is that our sequence is inserted into other DNA. When it is sequenced again there is also some DNA in front and after the end of the message. We need a way to find out exactly where our message starts.

To solve this, every message block always starts with the same sequence: GACCAAGCCTGC. The decoding machine looks for this sequence and then decodes the DNA that follows it. Of course, it is possible that this sequence appears as part of non-message related DNA, or it may be part of the message. The next pieces of message information help with solving this. What the decoder does is look for the start sequence and try to decode the rest, if that fails it will look for the next occurrence of the start sequence and decode that. If it reaches the end of the complete sequence without successfully decoding a message it will come to the conclusion that there is no intact message in the sequence.

Length

The next issue is that the decoder needs to know the length of the message sequence. At the end of our sequence, after the message, is a footer: GCACCCACCGAC. This helps verify the sequence, but it suffers from the same problem as the start of the header: it is possible that it occurs as part of the message sequence. So it is possible that the end is detected too soon. Therefore we also put the length of the message sequence in the header. After the start sequence we reserve 8 base-pairs that encode the length of the message: 8 bases encode 16 bits, which means that the maximum length is \(2^{16} - 1 = 65535\) base-pairs.

The decoder will look for the start sequence, then decode the length and look foor the footer sequence after length number of base-pairs. In our example, the length is 108 base-pairs, which translates to AGTCAAAA in the length field of the header.

Checksum

Using the length and the footer is enough to verify that no base-pairs have been inserted or deleted, but it cannot be used to detect whether base-pairs have changed for whatever reason. To do this we also add a checksum to the header.

A checksum is a kind of digital summary of data. The checksum function, in our case CRC16-CCITT, takes data of any length and computes a checksum of fixed length. In our case the checksum is 16 bits, so 8 base-pairs.

The encoding machine computes the checksum on the original message sequence and places it in the header. The decoder reads the sequenced message, computes the checksum and verifies that it matches the one in the header. If they are, the message is valid.

Because the message can be of any length (up to 65535 base-pairs), but the checksum is always 8 base-pairs, multiple messages will have the same checksum. However, due to the way the checksum is computed, two messages that are very similar but differ in only a few bits will have completely different checksums. So this method is very reliable for detecting small changes in the sequence. Note that the checksum can only be used to detect changes between the extracted message and the original, it cannot be used to correct them.

The checksum of our example message is 51063, so the encoder puts GCGCGCAG as the checksum in the header. The checksum function can be found in crc.js.

Restrictions

All the previous issues regarding length and integrity are fairly common in information exchange. Headers and checksums are also used, for example in computer network protocols. The last problem we face, however is specific to DNA. The full sequence may not contain any restriction enzymes, or it will be altered by the organism itself.

To deal with this the encoder looks for occurrences of the restriction enzyme sequences and splices a certain number of A-bases in the middle of each. The number of A-bases is determined by the longest sequence of A-bases that occurs in the full sequence already. This number is also put in the header, using 4 base-pairs, meaning it can have a maximum value of \(2^8 - 1 = 255\). The decoder reads this number and then removes all sequences of that many A-bases long before decoding and verifying the message. The encoder also puts 4 A-bases between the start sequence and the substitution length, in case their combination would form a restriction enzyme.

In our example the longest sequence of A-bases in the message sequence is 4, so the substitution sequence needs to be 1 base longer, which is encoded as CCAA. The restriction-substitution code is implemented in restrict.js and the rest of the message-formatting code is in message.js.

Conclusion

The encoder follows these steps to create a complete message sequence:

  1. Encrypt the message with the key.
  2. Convert the message to DNA.
  3. Compute the length and the checksum of the message sequence.
  4. Convert the length and checksum to DNA.
  5. Put the length, checksum, message and footer together as the tail sequence.
  6. Find the longest sequence of A-bases.
  7. Find sequences of restriction enzymes in the tail sequence.
  8. Splice sequences of A-bases, 1 longer than the longest found into the restriction sequences.
  9. Convert the substitution length to DNA.
  10. Put the start sequence, padding and substitution length together in front of the tail.

Putting it all together, the full message format looks as follows:

  1. The start sequence: GACCAAGCCTGC.
  2. Padding of 4 base-pairs.
  3. Substitition length: 4 base-pairs.
  4. Length of the message sequence: 8 base-pairs.
  5. Message checksum: 8 base-pairs.
  6. The message sequence itself: length times 4 base-pairs.
  7. The footer: GCACCCACCGAC.

Finally, this is our full example message sequence converted to DNA:GACCAAGCCTGCAAAACCAAAGTCAAAAG CGCGCAGTAAGTTATAAAATGAAGAAGCACTACCCCTGCGAAGGTGTGCCCGAAGCATTGAAGGTGT TGGCTAAGAAATTTGCATACAACCGCAAGTACGAAGTTGTGAAGGATTGCACCCACCGAC.

On the other end of the process the decoder takes the following steps to get the message back from extracted DNA:

  1. Look for the start sequence.
  2. Skip the next 4 base-pairs (the padding).
  3. Read the 4 base-pairs that follow it as the length of the substitition sequence.
  4. Cut out all sequences of that number A-bases.
  5. Read the next 8 base-pairs as the length of the message sequence.
  6. Then the following 8 base-pairs as the checksum.
  7. Read the message sequence of length base-pairs.
  8. Verify that the next 12 base-pairs are equal to the footer sequence.
  9. Compute the checksum of the message sequence and verify it with the checksum read earlier.
  10. Decode the message sequence.
  11. Decrypt the decoded message sequence with the key that the user provides.

While this article described the process for translating the encrypted message to a DNA sequence, it is much the same for the key sequence. The only difference is that the encryption and decryption steps are not taken, but the message format is still the same.

A full implementation of the encoding machine is available at Coding and the decoder can be found at Decoding.

Oop top