Team:Groningen/Model

CryptoGE®M
Team
Project
Biology
Computing
Human Practice
Acknowledgements

Modelling

We modelled several aspects of the CryptoGErM system. These models are described and explained in detail in the following articles:

  • The AI model explores the behaviour of the people who use, or try to abuse, the CryptoGErM system. The people are modelled in a multi-agent simulation using genetic algorithms.
  • Mutation Rates contains research of the literature on known mutation rates of Bacillus under the special conditions and treatments that will be applied.
  • In the Decoding Fidelity section we explore how accurate the current sequencing techniques are and how strong our decoy encryption security layer is.
  • Random Mutation uses computational modelling to calculate the probability of random mutations causing our message in the DNA to be corrupted.
  • The last model searches for an optimal strain for our photoswitchable antibiotic experiments.

Modelling Rational Agents with a Genetic Algorithm

In the following sections we propose a computational simulation that aims to model the behaviour of some of the agents that could make use of the CryptoGErM system. The model combines knowledge of different research areas such as Epistemic Logic, Machine Learning and Multi-Agent Systems. Two agents have been modelled, a possible Sender and a relative Criminal. The goal of the first one is to encrypt and send a top-secret message as safely as possible, while the criminal’s goal is to avoid this possible scenario. The agents have been modelled according to the BDI Architecture, a formal model that defines how rational agents act and behave [1], which has been modified and expanded specifically for this simulation. The individual parameters of this architecture have been estimated through the use of a genetic algorithm in order to understand which parts of the architecture turn out to be more significant in a possible CryptoGErM scenario.

Introduction

Once the CryptoGErM project started, before moving our attention to a broader perspective involving big data storage, we considered our idea as a possible way of sending sensitive information in a very secure way. Thanks to the combination of the Advanced Encryption Standard algorithm and the biological security layers that make it hard to get to the point in which it is possible to sequence the DNA of a bacteria, CryptoGErM originally would have been the newest method to use for sending highly reserved information. Most of the research that has been done in the lab in order to make our system as safe as possible, has been inspired by imagining possible scenarios involving the potential users of CryptoGErM. "What might happen if the agents wouldn’t have enough knowledge to successfully make CryptoGErM work?" or "What could happen if one agent would be in possession of enough sensitive information in order to make the whole procedure fail?" are only two out of the multiple questions that made it clear that in order to create an efficient product, there was need to understand and imagine future scenarios involving the possible users of the system. The two main agents that turned out to be the most significant ones were a potential Sender agent, that has as main goal hiding and sending as safely as possible a top-secret message, and a potential Criminal, who wants to avoid this scenario. Imagining all the possible interactions between these two actors is of course impossible to achieve, however it is possible to simulate part of their behaviours assuming they are rational.

In order to do so we have been inspired by work [1] where an Artificial Intelligence architecture based on Epistemic Logic is proposed for modelling rational agents. This gave us the opportunity to simulate different type of Agents and through the use of a genetic algorithm explore the different features that characterize their rationality. By exploring and computing the optimal features we have been able to identify the core points that have to be considered when building CryptoGErM in the most efficient way as possible. The following sections explain how the computational model has been built and trained in order to identify the most important aspects of the A.I. architecture, and how this knowledge has guided the whole lab work.

Architecture of the Model

The Agents have been modelled according to the BDI Architecture, an epistemic logic based system proposed in [1] in 1991. The paper states how important Beliefs, Desires and Intentions are in determining the behaviour of rational agents, more in detail the authors assert how every rational action taken from an agent is basically based on these three components. The combination of these features have an impact on the agent’s future, in fact they have the ability to produce different actions and as a consequence give rise to different future scenarios. These possible scenarios, called Worlds, are modelled according to a temporal structure and are defined as Time Trees, every moment in the time tree is the result of an action based on the three previously mentioned BDI features and is defined as a Situation. One Agent can move between different time situations until it achieves its goal. The three BDI components have a different impact on the agent’s behaviour and are defined as follows:

  • Beliefs: are defined as those worlds that the agent believes to be possible, technically every Belief is considered as a complete and independent Time Tree since there are no restrictions that can avoid an Agent to believe in something. This feature is formalized as follows: ∀ w ∈ W. Where w corresponds to a single time tree and W as the entire set of beliefs.
  • Desires: are those worlds that correspond to the optimal future scenario an agent wants to achieve, an agent can be happy with multiple outcomes but usually the main Desire is a single one and can be expressed by the following formalism: ∃! w ∈ W.
  • Intentions: are considered as sets of intention-accessible worlds. These worlds are ones that the agent has committed to attempt to realize. Differently from the Beliefs, where all possible worlds are considered, the Intention feature is characterized by a strong component of realism since the worlds that are part of this feature are only the ones that the Agent thinks can be a possible Desire.

In addition to the three BDI features some extra parameters have been added in order to model the agents. The reason this has been done is related to the genetic algorithm that has been used in order to simulate their behaviour and is explained into more detail in the next section. The parameters that have been added to the architecture are different according to the type of agent that has been modelled, however the general idea is the same one for both of them. In order to represent a potential level of Beliefs, Desires and Intentions in a CryptoGErM scenario some extra features related to this architecture have been added that might have an impact on all three BDI components and as a consequence on the Agents’ actions. Regarding a potential Sender the three additional features that have been added are the following:

  • γ: which corresponds to a particular level of riskiness the agent will take into account in his actions. The impact of this feature is different on the three BDI components, in fact a high level of riskiness makes the agent have a smaller set of total Beliefs since there is no reason to think about all those worlds that involve highly safe scenarios. On the other side it is possible that a high level of γ leads to the optimal desired world and as a consequence to a world the agent thinks requires commitment.
  • ψ: is a variable that is related to the level of efficiency that the agent wants to pursue in its actions. In this case the higher this value is, the higher the value of the three BDI components will be.
  • α: corresponds to the time resources the agent is willing to invest when creating its BDI features, the impact of this variable is similar to the previous one. The more time that is invested by an agent, the higher the set of Beliefs will be and the overall commitment to the possible worlds towards the optimal scenario.

Regarding the Criminal Agent the meaning of ψ and α has been changed. The first value corresponds to the amount of knowledge the agent has about the different Time Trees while the second one is related to the amount of motivation and effort he is willing to put while interfering with the Sender’s plan.

The idea of the model was to identify which of the three BDI components is considered as more important by a potential Sender agent who wants to hide and send an encrypted message as safely as possible and by a Criminal who wants to avoid this scenario. In order to find this out a genetic algorithm has been used on different randomly generated agents and the level of the three BDI features has been measured once the algorithm converged to the optimal result.

Genetic algorithms are a very common machine learning technique that is used in different Artificial Intelligence optimization problems [2] [3] [4], they are largely inspired by natural evolution and work as follows: a random set of possible solutions is created, each of these solutions goes through a fitness function which measures how close these solutions perform to the optimal result that wants to be achieved. It is important to notice that the individual values that make it possible to converge to the optimal result are not known beforehand, the only information that is known is the final outcome, which can be achieved in multiple ways.

Once the agents are evaluated by the fitness function a truncation process occurs, similarly as what happens in nature where only the fittest animals survive, also in the simulation the only agents that are allowed to be part and give rise to a new generation are the ones with the higher fitness scores. How new generations are created more into detail is presented in the next subsection.

The Sender

An Agent is created as a string of binary bits and is represented as a chromosome, each part of it represents the three BDI features together with the three additional parameters mentioned in the previous section. An example of a randomly generated agent is the following:

\[ Chromosome_1 = \underbrace{010010011010}_{Beliefs}\underbrace{011101110101}_{Intentions}\underbrace{010100101010}_{Desires} \]
\[ Beliefs = \underbrace{0100}_{\gamma} \underbrace{1001}_{\psi} \underbrace{1010}_{\alpha} \]

Every binary pair corresponds to a decimal number that if converted properly and summed together establishes the total fitness value of the chromosome. For the sender Agent an optimal value of 60 has been set, this turned out in the following fitness function, where n is the total length of the chromosome and the main aim is to find the optimal set of bits that satisfy it:

\[ f(x) = \sum\limits_{i=0}^{n} x[i] = 60 \]

For our Sender simulation the following parameters turned out to be the most suitable ones while considering optimal convergence performance and training time:

  • 20 Simulations: a total number of twenty different simulations have been ran in order to check if the results where coherent between each other.
  • 20 Pools of 100 Chromosomes: the simulation starts with 20 randomly generated chromosome pools with 100 agents each. This is the first set of agents that is evaluated by the fitness function and according to the fitness scores the breeding process starts.
  • Truncation Method: is the approach that has been used in order to keep the fittest agents allowed to reproduce themselves, in order to avoid overfitting issues the top 70% of the agents has been kept while the rest 30% has been discarded.
  • One point single crossover: is the breeding method that has been used. Once two chromosomes are chosen as parents they give birth to two new children by exchanging 3 of their bits. This technique is shown hereafter:
    \[ Parent_1 = \underbrace{\textbf{010}}_{Crossover Bits}{010011010011101110101010100101010} \]
    \[ Parent_2 = \underbrace{\textbf{011}}_{Crossover Bits}{010000010011101111001010111101000} \]
    \[ Child_1 = \underbrace{010}_{Parent 1 Bits} \underbrace{010000010011101111001010111101000}_{Parent 2 Bits} \]
    \[ Child_2 = \underbrace{011}_{Parent 2 Bits}\underbrace{010011010011101110101010100101010}_{Parent 2 Bits} \]
  • 1% mutation rate: some randomness has to be added to the newest generation in order to make the pool more varied and avoid local minima, in our case a 1% mutation rate has been set which means that one bit of one chromosome is randomly changed from 0 to 1 or vice-versa.

Figures 1 and 2 show the performance of the genetic algorithm and the analysis of the three different BDI features. These results are explained more into detail in the Results section.

SenderModel
Figure 1: The graph shows how after 20 generations of breeding the pool of chromosomes representing the Sender agents converge to the optimal result. Generation 4 and 11 show some local minima that thanks to the random mutation present in the genetic algorithm don’t impede the achievement of the optimal convergence.
SenderBDI
Figure 2: Histogram with the analysis of the BDI features. On the x-axis the decimal value represting that feature is presented while on the y-axis its amount of occurances during the simulation. The different colors in the historgram represent the different BDI components: Beliefs are shown in yellow, Desires in blue and Intentions in red.

The Criminal

As already introduced previously some of the additional BDI features have been changed while modelling the Criminal, however this doesn’t have any influence on how the Agent is represented which still corresponds to a string of binary bits. In order to make the computational simulation more varied some changes have been made to the genetic algorithm starting from the fitness function which in this case has as optimal result a value of 65 as shown by the following equation:

\[ f(x) = \sum\limits_{i=0}^{n} x[i] = 65 \]

Also in this case while training the Agents 20 different simulations have been ran that started with 20 pools of 100 chromosomes each, however the truncation method kept the top 80% of the Agents and not 70% as in the previous case and the breeding technique that has been used was the n double crossover method. It consists in exchanging not only one single set of bits but two of them of 3 bits each as explained hereafter, the mutation rate has been kept to 1%.

\[ Parent_1 = \underbrace{\textbf{010}}_{Crossover Bits_1}{010011010011101110101010100101} \underbrace{\textbf{010}}_{Crossover Bits_2} \]
\[ Parent_2 = \underbrace{\textbf{011}}_{Crossover Bits}{010000010011101111001010111101} \underbrace{\textbf{000}}_{Crossover Bits_2} \]
\[ Child_1 = \underbrace{010}_{Parent 1 Bits} \underbrace{010000010011101111001010111101}_{Parent 2 Bits} \underbrace{010}_{Parent 1 Bits} \]
\[ Child_2 = \underbrace{011}_{Parent 2 Bits}\underbrace{010011010011101110101010100101}_{Parent 1 Bits} \underbrace{000}_{Parent 2 Bits} \]
CriminalGenAl
Figure 3: The performance of the genetic algorithm. Also in this case an optimal convergence has been achieved after 20 generations of training.
CriminalBDI
Figure 4: The analysis of the BDI features, the meaning of the graph corresponds to the one presented in Figure 2 however some significant differences in the histogram lead to the interesting conclusions presented in the next section.

Discussion and Conclusion

The aim of this section is twofold, on the one side it proves how it is possible to model artificial agents through the use of a genetic algorithm while on the other side it explains the relevance of this multi-agent simulation in the CryptoGErM project and how the relative results have been inspiring for the lab work.

Considering the first goal, as explained by Figures 1 and 3 it is indeed possible to modify the structure of the Agents in order to satisfy their relative fitness functions through the use of a bio-computing technique. The structure of the string of bits representing the Sender and the Criminal makes it possible to modify the Agents in a relatively short training time by converging to the optimal result already in 20 generations. These results show how this particular machine learning technique can be used for modelling Multi-Agent Systems and are in line with work [3]. However the most remarkable and useful results have been provided by the final analysis of the different BDI features representing the agents. The histograms show in fact a significant difference that has lead the development of the whole CryptoGErM project.

Considering Figure 1, the results show how in the case of the Sender model, the Belief feature is the one that the generations of artificial agents discard in order to get to the optimal score. In fact for 15 times as shown by the yellow histogram it is the BDI component with the lowest value, 6. This is remarkable since it shows how not too many resources should be invested in imagining all possible scenarios when sending a secret message, but it is better to have a strong set of Desires that nicely match with the Intention feature and pursue a single one of those. This can nicely be seen in the graph where an optimal value of 13 has been chosen by the agents for the Desire and Intention features for more then 20 times. The same is not true for the Criminal model where in order to be successful the generations of agents have to consider as equally important all the three BDI components. This is supported by the histograms that show how Beliefs, Desires and Intentions occur almost equally with a value of 15.

Considering this simulation and the main goal of CryptoGErM that sees a potential Sender communicating in the safest way as possible a message to a Receiver and a relative Criminal with an opposite goal, we deduced that in order to make life as hard as possible to the Criminal we had to build the system in such a way that a potential criminal has to consider as many Worlds as possible in order to be successful. This lead us to the idea of adding besides the A.E.S. method, when hiding the message into the DNA, a list of biological inspired security layers that have to be considered if the DNA has to be sequenced properly. As shown by the computational simulation, the knowledge of these Worlds has to be known by the Criminal if he wants to be successful since considering the A.I. architecture it corresponds to the Belief feature.

Finally, it is important to mention that the computational model doesn’t allow the agents to take real actions and maybe compete between each other, however it still provides important insights by considering the features that make one agent as effective as possible by assuming it is rational as defined by the BDI architecture.

References

The Model has been built in Python and consists of the following files: main.py, create_sender.py, create_criminal.py, genetic_alg.py and plots.py, The model requires Python 2.7 and the random, matplotlib, numpy and seaborn libraries. Also make sure to download and place a /res directory in your working space. More information about how to run the code can be found on github: A Genetic Algorithm for Modeling Artificial Agents

  • [1] Rao, A. S., & Georgeff, M. P. (1991). Modeling rational agents within a BDI-architecture. KR, 91, 473-484.
  • [2] Herrera, F., & Verdegay, J. L. (1996). Genetic algorithms and soft computing. Physica-Verlag.
  • [3] Zhong, W., Liu, J., Xue, M., & Jiao, L. (2004). A multiagent genetic algorithm for global numerical optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 34(2), 1128-1141.
  • [4] Goldberg, D. E., & Holland, J. H. (1988). Genetic algorithms and machine learning. Machine learning, 3(2), 95-99.
  • Mutation rates during the CryptoGErM process

    We present an estimate of the number of bases expected to mutate during the whole process of insertion, sporulation, sending, treatment, germination and reading, considering the mutation rates known to Bacillus subtillis.

    Figure 1. A mutation can occur in the sequence during one of the steps that our system requires. In essence: message insertion, sporulation, physical sending, treatment of spores and reading.

    Reported mutation rates of Bacillus subtilis

    A mutation in the genome of B. subtilis can occur in a coding or noncoding DNA sequence. If it occurs in a coding part, it can be an evolutionary advantage or a disadvantage. When it is a disadvantage, the bacteria will have less chance of passing its mutated genes to the next generation. Thus, some reports that measure the mutation frequency under evolutionary pressure conditions may underestimate its real value. As our message is a noncoding sequence, it will not be an evolutionary disadvantage (or an advantage) if it mutates or gets lost.

    Maughan et al. (2006) measured the mutation rates of B. subtilis in an antibiotic-resistance gene under conditions that did not require such a gene to survive [3]. They also found that the mutation rate itself may increase as generations pass by until it reaches a plateau. Nonetheless, the necessary number of generations to see a difference in mutation rates is higher than those our methodology requires. For that reason, the value of \(0.5 \cdot 10^{-8} \frac{mutations}{bp \cdot generation}\) will serve as a good estimate.

    Number of generations

    The probability that our message is changed by a random mutation also depends on the number of generations that have passed since the insertion until the sequencing. The doubling time of B. subtilis is estimated to be 120.1 min[1] but depending on the growth phase this value may vary. For instance, Maughan (2006) reports that the number of generations that passed in one week was 500 (doubling time = 201 min)[3].

    StepTime (h)Generations required
    Message insertion – sporulation7236
    Sporulation – sending00
    Sending – germination01
    Germination – message reading24 – 4812 – 24
    TOTAL:~50
    Figure 2. Using our system we take approximately 50 generations of cell growth counting from the sequence insertion to sequencing.

    DNA damage by UV radiation

    We planned to use a photoswitchable antibiotic as one of our bio-layers of security. Due to some experimental issues explained in the respective section, we are not using that approach anymore. However, we had considered that it might increase the mutation frequency to a considerable extent.

    First, our calculations (shown in the respective section) indicated that the spores would survive the required UV treatment required for antibiotic activation (325 nm, 30 min, 28.8 kJ/m2) and experimentally we did find them to survive it.

    In general, spores are 10-50 times more resistant to UV damage than vegetative cells and spores during the germination phase undergo a transcient period in which resistance to UV damage is even higher than that of normal spores. Spores also have a very interesting way of avoiding UV damage that is advantageous for our system, having an almost free-error repair system[2].

    Tanooka (1978) reports that upon irradiation of B. subtilis with UV (254 nm, 1-100 J/m2) the mutation frequency increases in a linearly on a log-log scale, being the spores approximately 10 times more resistant than vegetative cells[5].

    Calculation

    As the UV treatment is not used anymore, it is not taken into account in this calculation.

    \[ 50 generations \cdot 0.5 \cdot 10^{-8} \frac{mutations}{bp \cdot generation} \cdot 4,215,000 \frac{bp}{genome} = 1.05 \frac{mutations}{genome} \]

    Thus, each line of B. subtilis will have around one mutation by the time we want to read the message. This mutation does not have to be inside our message, and does not have to be the same bp mutated for each cell line as explained in the next diagram.

    Figure 3. Around one mutation will occur somewhere in the bacterial genome. Such mutation does not have to be inside our message and it does not need to be the exact same mutation for every cell line.
    References
    • [1] Burdett (1986). Growth kinetics of individual Bacillus subtilis cells and correlation with Nucleoid extension. Journal of Bacteriology, 167(1):219-230
    • [2] Lenhart (2012). DNA repair and genome maintenance in bacillus subtilis. Microbiology and Molecular Biology Reviews, 76(3):530-564
    • [3] Maughan (2006). The population genetics of phenotypic deterioration in experimental populations of Bacillus subtilis. Evolution 60(4): 686-695
    • [4] Moeller (2014). Resistance of Bacillus subtilis spore DNA to lethal Ionizing radiation relies primarily on spore core components and DNA repair, with minor effects of Oxygen Radical Detoxification. Applied and Environmental Microbiology, 80(1):104-109
    • [5] Tanooka (1978). Mutation induction with UV and X-Radiations in spores and vegetative cells of Bacillus subtilis. Mutation Research, 49:179-186

    Decoding – costs and fidelity

    The cost of sequencing DNA has plummeted in the last two decades. For instance, in the early 2000’s it took 13 years and $3 billion US dollars to sequence the entire human genome. With current technologies, we have approached the $1000 dollars mark. In fact, since 2007 you can have your whole DNA sequenced for less than that! However, there are still issues with the fidelity and accuracy of the readings obtained that would prevent them from being used for our system[5].

    Price of sequencing one million of base pairs. Prices dropped from 2007, when second generation techniques were introduced in the marked. Source: National Human Genome Research Institute

    Existing laboratory-level DNA sequencing technologies typically allow for a reading error of ~1%. Thanks to optimization and finne-tunning, traditional readings using Sanger biochemistry offer now accuracies of up to 99.999% in a read length of 1,000 bp. With modern, second generation or cyclic sequencing, higher reading lengths have been achieved but with a decrement in accuracy[4].

    Among the bioencryption layers, CryptoGERM prevents unauthorized parties from reading the message by having a high ratio of decoy spores that contain a useless sequence. In addition, if the right growing conditions are not supplied our system prevents germination and replication of spores that do contain the message. So, what ratio of decoy:spores do I need to prevent brute sequencing and message retrieval from a third party?

    A) One of the bioencryption bilayers is having a high proportion of decoy spores vs those that do contain the intended useful sequence. If the right conditions are not met (i.e. adding an X antibiotic to the system) those spores that contain our message will die and be outgrown by the decoy. B) Current sequencing techniques will not be able to distinguish the hidden message from noise.

    Fox et. al. (2014) reports that there is a 50% chance of accurately distinguishing a true subclonal variant from a sequencing artifact in an excess of 100 wild-type DNA sequences using standard Q30 filter reads (error rate: 10-2)[1]. In agreement with that result, in an experiment carried out to detect genomic variations in marine pests, Pochon (2013) was able to detect one variant out of 150 wild-type sequences[2].

    Reading DNA has become increasingly more accurate. Schmidt (2012) developed a method called Duplex Sequencing that uses both strands of DNA to obtain a more precise consensus sequence yielding an theoretical error of 3x10-10[3]. That means that we could transmit a message with a length in the order of Gigabytes without expecting any lost! On the other hand, that also means that they allow a more precise measurement in decoy-spores mixtures, and a 1:150 spore:decoy ratio might be insufficient in the future. In fact, using Duplex Sequencing they were able to identify one mutant sequence per 10,000 wild-type molecules.

    References
    • [1] Fox EJ, Reid-Bayliss KS, Emond MJ, Loeb LA (2014) Accuracy of Next Generation Sequencing Platforms. Next Generat Sequenc & Applic 1: 106. doi:10.4172/jngsa.1000106
    • [2] Pochon (2013). Evaluating detection limits of next-generation sequencing for the surveillance and monitoring of international marine pests. PLos One 8(9):e73935
    • [3] Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, et al. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A 109: 14508-14513.
    • [4] Wang, X., Blades, N., Ding, J., Sultana, R. and Parmigiani, G. (2012). Estimation of sequencing error rates in short reads. BMC Bioinformatics, 13:185
    • [5] Zhu, X., Wang, J., Peng, B. and Shete, S. (2016). Empirical estimation of sequencing error rates using smoothing splines. BMC Bioinformatics, 17:177

    Random Mutation

    Special attention has been given during the modelling to the issue of random mutations. This particular phenomenon is related to the random modification of some of the ACTG components of the genome. Even though the reasons why this happens are unknown, the impact that these switches have on the DNA of Bacillus subtilis are very relevant in the CryptoGErM project. This is mainly related to the computational reasons that govern the encryption process of the data that is saved into the bacteria, in fact if even one single DNA component mutates it will be impossible to decrypt the message and get the original message back. The same is also true for the key, if one bit doesn’t correspond to the original ones, the key is not going to work on the encrypted message. In order to be sure that the message that we have stored into the genome wouldn’t have been affected by a random mutation we have ran some computational simulations that reproduce this phenomenon.

    The original length of Bacillus subtilis DNA corresponds to 4.215.619 base pairs, on the other side the length of the Sherlock Holmes quote that we have stored in the bacteria has a total length of 556 base pairs. The computational model simulates the random mutation in the DNA by randomly changing 1% of the ACTG components and checks if one of these mutations affects the inserted message. We have ran 1000 different simulations where this process has been computed iteratively and in none of these cases the random mutations affected the message. From this we concluded that for our specific case the impact of the random mutations influencing our project was extremely low and that we could have safely saved Arthur Conan Doyle’s quote.

    We have also asked ourselves how many of these quotes we could have theoretically been able to store safely into the bacteria, in order to do so we have modified the computational model where we recursively increased the size of the message at every simulation according to the following equation:

    \[ f(i) = 556 \prod_{i=1}^N i \]

    556 corresponds to the length of the original message which is recursively increased according to the value of N, the simulations show that the first 2 random mutations affecting the message occur when N reaches a value of 6. This corresponds to a 400.320 base pair long message. This result provides interesting insights to future research that aims to store always bigger amounts of data into the DNA of bacteria, more in detail it corresponds to the maximum amount of base pairs that can be modified in Bacillus subtilis.

    A Python implementation of the model is available at dna_analysis.py. The script requires Python 2.7 and the random, copy and numpy libraries. Also make sure to download and place BacillusSubtilis.txt in the same directory.

    Choosing optimal strain for photoswitchable antibiotic

    As can be read in the photoswitchable antibiotics section, the idea with the photoswitchable antibiotic was that it would kill the decoy B. subtilis but would not cause any harm to the engineered strain containing the DNA sequence.

    Figure 1. Spirofloxacin photo-activation.
    Figure 2. Spirofloxacin activity test. 1) LB agar with milliQ water as sterility control, 2-5) E. coli MC1061, DH5-alpha, BL4 and CS1562 strains.

    When not treated properly with the right wavelength of UV radiation (see figure 1), spirofloxacin remains in its inactive state. Because the sample sent to the receiver contains hundreds to thousands of times more decoy spores, it cannot be sequenced directly by the recipient (see also Decoding Fidelity). Instead, the receiver must activate the antibiotic. Once done, the spirofloxacin-resistant message-containing bacteria would outnumber the decoys.

    We explored the activity of spirofloxacin on different strains of E. coli, since it was a species of bacteria whose susceptibility to spirofloxacin had been previously well measured [7] (see figure 2).

    At this point of the lab work, experiments trying to measure the MIC of spirofloxacin “on”/”off” against B. subtilis had been inconsistent. We thought of Molecular Dynamics (MD) studies as a good alternative to measure how well the antibiotic performs on Bacillus, thus allowing us to continue faster with the engineering of the resistant strain.

    It is important for our system that the ratio of resistance to susceptibility in engineered and wild type is optimal. That means that in its inactive state, spirofloxacin must not have a high bactericidal activity and when activated, it must be potent enough to kill the wild-type cells while the engineered strain will survive.

    First, we needed to find a suitable crystallographic structure of a type-II topoisomerase (a protein-DNA complex) bound to a fluoroquinolone. Topoisomerase IV is the main fluoroquinolone’s target in Gram-positive bacteria, while gyrase (the other type of topoisomerase II) is the main target in Gram-negatives [1]. The crystallographic structure recently reported by Veselkov et. al. (2016) offered a good alternative as it has the two fluoroquinolone molecules bound to the protein-DNA complex.

    Secondly, we needed to create and adapt an appropriate force-field that would reproduce the behavior of the binding process at a reasonable computational cost. A force-field is a set of parameters that tells the software (we used GROMACS v5.0.4) how each atom behaves and interacts with others. As reported elsewhere [1][2] the binding process involves cation- and hydrogen-bonding interactions that can only be reproduced in the atomistic level. However, the computational cost of simulating a 150 kDa protein (embedded in a solvated box) is high enough to consider using lower-resolution scales. In addition, the crystallographic structure already is a bound topoisomerase – antibiotic complex, so there is no need for that high level of resolution. Coarse-grained models are refined enough to offer insights into the affinity of protein-ligand interactions (see i.e. [4]). Some of the molecular parameters for the spiropyran part of the molecule (the part that gives it is photoswitchable behavior) had been studied elsewhere [9]; nevertheless, its parameterization into atomistic and further coarse-grained requires some extra work.

    Figure 3: DNA in complex with topoisomerase IV and the molecule levofloxacin (a member of the cipro- and spirofloxacin family). The GROMOS 53a6 atomistic force-field is used, while for the coarse-grained the MARTINI force-field (v2.2) developed by the University of Groningen was used (not all beads shown). Origin PDB code: 3RAE.

    While the MARTINI force-field has been adapted and optimized for both proteins and DNA [3][6] it cannot reproduce cation- and hydrogen-mediated binding of ligands. So our best method is to use umbrella sampling in which basically the protein and the ligand are artificially placed in its “correct” orientation and dragged away from each other, measuring the Potential of Mean Force (PMF) [5].

    Before actually starting the umbrella simulations and due to the lack of consistent experimental results, we decided to change the photoswitchable antibiotic approach in our project. However, for the sake of clarity in the next figure we show what we expected to obtain from the simulations.

    Figure 4. In the umbrella sampling method the antibiotic is pulled away from the protein-DNA complex as marked by the arrow. As a result, we expected to obtain a profile as shown in the right side in which the spirofloxacin shows two binding affinity’s profiles. (Molecule shown, PDB code: 3RAE)
    References
    • [1] Aldred, K., Kerns, R. and Osheroff, N. (2014). Mechanism of Quinolone Action and Resitance. Biochemistry, 53, 1565-1574
    • [2] Lupala, C., Gomez-Gutierrez, P. and Perez, J. (2013). Molecular Determinants of the Bacterial Resistance to Fluoroquinolones: A Computational Study. Current Computer-Aided Drug Design, 9: 281-288
    • [3] Monticelli, L., Kandasamy, S., Periole, X., Larson, R., Tieleman, P., and Marrink, S. (2008). The MARTINI Coarse-Grained Force Field: Extension to Proteins. J. Chem. Theory and Comput., 4(5): 819-834
    • [4] Naughton, F., Kalli, A. and Sansom, M. (2016). Association of Peripheral Membrane Proteins with Membranes: Free Energy of Bingind of GRP1 PH Domain with Phosphatidylinositol Phosphate-Containing Model Membranes. J. Phys. Chem. Lett., 7(7): 1219-1224
    • [5] Torrie, G. and Valleau, J. (1977). Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. J. Comput. Phys., 23(2): 187-199
    • [6] Uusitalo, J., Ingólfsson, H., Akhshi, P., Tieleman, P., Marrink, S. (2015). Martini Coarse-Grained Force Field: Extension to DNA. J. Chem. Theory Comput., 11(8): 3932-3945
    • [7] Velema, W., Hansen, M., Lerch, M., Driessen, A., Szymanski, W. and Feringa, B. (2015). Ciprofloxacin-Photoswitch Conjugates: A Facile Strategy for Photopharmacology. Bioconjugate Chemistry, 26: 2592-2597
    • [8] Veselkov, D., Lapanogov, I., Pan, X., Selvarajah, J., Skamrova, G., Branstrom, A., Prasad, L., Fisher, M. and Sanderson, M. (2016). Structure of a quinolone-stabilized cleavage complex of topoisomerase IV from Klebsiella pneumoniae and comparison with a related Streptococcus pneumoniae complex. Acta Cryst. D72: 488-496
    • [9] Zhai, GH., Yang, P., Wu, SM., Lei, YB. and Dou, YS. (2014). A semiclassical molecular dynamics of the photochromic ring-opening reaction of spiropyran. Chinese Chemical Letters, 25: 727-731

    Oop top