Software
This article will explain in detail how the software part of
CryptoGErM works. To encrypt the message and translate it to DNA we
wrote several Javascripts. They are used in the demonstration of our
software on the Coding and
Decoding pages.
The following sections cover the same steps the software takes to
turn a text into DNA. First encryption, then translation to DNA
and finally packing it into a complete message sequence. The
examples will use Hello world
as the message text, and
secret
for the encryption key.
The software is divided into several modules, each in their own file. Two of the modules were not written by us, but used under an open source license. The AES module was written by Chris Veness at Movable-Type.co.uk. The CRC implementation was written by Github user chitchcock and published as Github Gist #5112270. All other code was written by us and is licensed under the MIT license.
Encryption & Decryption
We use the AES module for encryption and decryption, specifically: AES counter mode. This mode allows us to encrypt (and decrypt) messages of any length. It does this by splitting the message up into blocks and adding a counter to every block. The first block contains an initialization vector, which is randomized every time and also the number of blocks that follow it. As a result of this every block has a different cipher text, even if there are repetitions in the original message. In fact, the cipher text for the entire message is different, even if the same message is encrypted again.
Encrypting our example message Hello world
with the
key secret
is now as easy as calling
Aes.Ctr.encrypt(message, key)
with (possibly) this result:
(�5z�õøWqfÚ7÷�''
. The
next step is to turn this cipher text into DNA.
The AES module is in aes.js and the counter mode addition is in aes-ctr.js.
DNA
Before the text can be integrated into a bacterium, it needs to
be translated to DNA. Computers store letters, digits and other
characters as numbers. The numbers are translated to the symbols
you see via an encoding table. The most common encodings are ASCII
and Unicode. The cipher text from the example is encoded as
follows: 040 002 053 122 008 245 248 087 113 102 218
055 173 152 247 011 039 039 144
. We use Unicode Transform
Format 8, which means that every character is encoded by 1 up to 6
bytes of 8 bits each. In binary, the bytes of the cipher text look
like this: 00101000 00000010 00110101 01111010
00001000 11110101 11111000 01010111 01110001 01100110 11011010
00110111 10101101 10011000 11110111 00001011 00100111 00100111
10010000
. Binary has only two digits, 0 and 1, but DNA has
four 'digits': ACTG. This means those bytes that need eight bits,
need only 4 DNA base-pairs. The translation to DNA is done like this:
Lets take the letter 'm' as an example.
- The ASCII/Unicode code is
109
. - In binary it is
01101101
. - Cut off the right-most, the least significant, 2 bits:
01
. - Look up the base: 00 = A, 01 = C, 10 = T, 11 = G:
C
. - Append the base to the output:
C
. - The input number is now:
011011
. - Repeat steps 3 - 6 until there are no more bits left.
The result is: CGTC
.
When we apply the above algorithm to all the characters in the
cipher text, we get: ATTA TAAA CCGA TTGC ATAA GAAG
CCGT GAAG ATGT GCCC CAGC TCTC GAAG TTCT GCGA TAAG CGTT TAAG ATCT
GAAG GCGT GTAA GCTA GCTA TAAG AACT
. This, however, is not
enough DNA to be able to decode it after extracting it from a
bacterium. The next section explains more about the meta-data that
goes with the cipher text/DNA.
Translation to DNA is implemented by the
EightBit
and DNA
modules.
Message format
Translating just the cipher text to DNA is not enough, we need some more information in the sequence to be able to decode it after extraction. We'll put this information in a header: a block of DNA that precedes the cipher text DNA. The decoding machine will first read the information in the header and use that to decode the rest of the message.
Start & end
The first problem is that our sequence is inserted into other DNA. When it is sequenced again there is also some DNA in front and after the end of the message. We need a way to find out exactly where our message starts.
To solve this, every message block always starts with the same
sequence: GACCAAGCCTGC
. The decoding machine looks for
this sequence and then decodes the DNA that follows it. Of course,
it is possible that this sequence appears as part of non-message
related DNA, or it may be part of the message. The next pieces of
message information help with solving this. What the decoder does is
look for the start sequence and try to decode the rest, if that
fails it will look for the next occurrence of the start sequence and
decode that. If it reaches the end of the complete sequence without
successfully decoding a message it will come to the conclusion that
there is no intact message in the sequence.
Length
At the end of our sequence, after the message, is a footer:
GCACCCACCGAC
. This helps verify the sequence, but it
suffers from the same
GACCAAGCCTGCAAAACCAAATTCAAAACCCTCCAGATTATAAACCG
ATTGCATAAGAAGCCGTGAAGATGTGCCCCAGCTCTCGAAGTTCTGCGATAAGCGTTTAAGATAA
AAACTGAAGGCGTGTAAGCTAGCTATAAGAACTGCACCCACCGAC
.