# Introduction

One of the goals of our project is to visualize if the system we designed to bring DNA strands closer works. We decided to use a tripartite GFP for that purpose. As we explained on our Strategy page, this system is composed of several parts [fig. 1]:

• The two dCas9 proteins
• The three parts of the GFP
Figure 1 : Tripartite split-GFP visualization tool

To visualize the fluorescence, the tripartite GFP needs to assemble. The 10th and 11th β-sheets of the GFP will be linked to the dCas9 and the GFP’s β-sheets from 1 to 9 will be free in the bacteria. We wanted to design our system to be sure that the GFP tri-partite assembles. To do that, we needed to find the optimal distance between the two target sequences that results in fluorescence.

# Constraints and Limits

The first constraint for the distance is the following: if the distance between the two dCas9 is too far, the tri-partite GFP may never assemble, as illustrated below [Fig. 2].

Figure 2: Prediction of the system if the distance is too long

On the other hand, when the end-to-end distance of the linker is too short, steric hindrance impedes upon the assembling of the GFP-tripartite. The two dCas9 proteins are not in the same plan and the DNA must curve. This affects the distance between the target sequences because curved DNA is longer than straight DNA[Fig.3].

Figure 3: Prediction of the system caused by the steric hindrance

Before using this system, we needed to answer a very essential question: what is the optimal distance between the two dCas9 proteins required for GFP to fluoresce?

In our model, we can liken the distance between the target sequences to the distance of the two dCas9 proteins, since each sequence is lodged in the core of each dCas9 protein. This allows us to calculate the distance between the two dCas9 proteins based on the distance between the target sequences [Fig. 4].

Figure 4: Structure of the dCas9 protein

To design the different target sequences, we converted the distance d between the two dCas9 proteins (in Angstrom) into the distance bp on the DNA (in base pair). To obtain the distance in base pairs, we need the length of a helix turn, which is 34Å, and the number of nucleic acids in a helix turn, which is 10.5 nucleic acids. This gives us the number of base pairs:

bp = (d * 10.5)/34

Our conversion relies on several hypotheses: we considered the DNA as a linear B-DNA, and we neglected the structural modification of the DNA due to the fixation of the two dCas9.

It appeared that the calculating power of a computer would be more than welcome to estimate the optimal distance through simulation.

# Implemented models

## 3D Model

We first thought of building a simple 3D model using the proteins’ known structures. We found these structures on the RCSB Protein Data Bank and used Pymol to assemble the system.
For this model we used:

• Two identical dCas9 proteins from Streptococcus Pyogenese (instead of Streptococcus Thermophiles and Neisseria Meningitidus in our biological system)
• A wild type GFP where we removed the 10th and 11th β sheets to mimic the tripartite GFP.

The big limitation of this approach was the lack of information regarding the linker. We could not find any information concerning its 3D configuration apart from its sequence. Hence, we decided to use the PEP-FOLD software to build a 3D simulation. However, the predictions we obtained seemed very unlikely to appear naturally. Indeed, the linker is supposed to be 100% unfolded because its sequence includes a majority of glycine, whereas some of the PEP-FOLD results showed β-sheets in the linker’s structure [Fig. 5].

Figure 5: A representative PEP-FOLD prediction of the linker

PEP-FOLD, a software designed for the prediction of the structural configuration of folded proteins, gives predictions biased towards folded structures due to its innate optimization algorithm. As such, the configurations we obtained for our linker, conceived to be unfolded, are most likely invalid.

So we decided to build two extreme models to try to flank the exact value: one with a small end-to-end distance and another one with a long distance.

For the small end-to-end distance, we used one of the PEP-FOLD conformations because even if the configuration seemed unlikely, the protein is maximally folded. As such, the smallest end-to-end distance we found with the PEP-FOLD predictions was 16.4 Å.

For the long end-to-end distance, we used the linker without any spatial folding, hence in a linear conformation. Similarly, this spatial configuration is very improbable too, but provides us with the maximal theoretical bound on the end-to-end distance. We obtained 107.3 Å.

With the preceding linker configurations, we then used the Pymol software to design the 3D model of the entire system and find the distance between the target sequences, lodged in the dCas9 attached to the GFP through the linker, in each case. According to the equation of conversion above, we acquire a first range of values for the optimal distance between the two target sequences:

• For the smallest linker, we found a distance of 230Å which corresponds to 71 base pairs
• For the longest linker, we found a distance of 356Å which corresponds to 110 base pairs

The limit of this approach is the way we modeled the linker, the distance between the target sequences depending on it. Hence, we decided to approximate the linker’s end-to-end distance through a mathematical equation which describes polymer physics and will provide an estimation of the end-to-end distance of the linker.

## Ideal Chain an Worm-Like Chain Models

We used two polymer physics models to approximate the linker end-to-end distance: the Ideal Chain model and the Worm like Chain model.

The Ideal Chain model (or freely jointed chain) is a simple model to describe polymers. It models a polymer with the help of a random walk, a mathematical object that assumes the construction of the object results from a succession of random steps. In this model, each amino acid in our linker will be considered as a rigid rod with a fixed length and modeled as a segment. We also consider that each amino acid is independent concerning its orientation and its position with its neighbor. Thus, we have 36 segments, and the length of each segment is 3.7Å. For N segments with a length of l, the contour length L is the total unfolded length:

R is the end-to-end vector, and its mean value writes:

According to the central limit theorem, if N>30, the probability density PN (r ⃗) for the end-to-end distance in the Ideal Chain takes the following Gaussian form:

Now that we have the Ideal Chain model equation governing the corresponding end-to-end distance, we will elaborate upon the Worm-Like Chain model which will serve as comparison to the results obtained with the Ideal Chain model.

In contrast with the Ideal Chain model, which considers the flexibility of segments in a discrete way, the Worm-Like Chain model considers the flexibility continuously, which is suited for describing semi-flexible polymers. Since we did not have any information concerning the flexibility of our linker, we decided to study both models to cover both cases and maximize the amount of information at our disposal concerning possible end-to-end distances according to flexibility.

The Worm-Like Chain model uses the persistence length, which quantifies the stiffness of a polymer, in the calculation of the end-to-end distance.

For our linker, we adopted the persistence length value used in the paper Huan-Xiang Zhou (2004): Polymer Models of Protein Stability, Folding, and Interactions, where the probability function is defined as follow:

With the equations determining the linker’s end-to-end distance from each model, we wrote a Python program that shows the density of probability distributions of the end-to-end distance in each case. The resulting graph is displayed on [Fig. 6] below:

Figure 6: End-to-end distance density of probability according to the model used

We see on this graph the distances predicted by the two models, with a mean of 18Å for the Ideal Chain model (in blue) and a mean of 27Å for the Worm-Like Chain model. The difference in the end-to-end distance between the two models is coherent because the Worm-Like Chain model is suited for stiffer polymers that are longer and less folded up than the polymers described with the Ideal Chain model.

These results are interesting because we can see that the end-to-end distance results we obtained with the PEP-FOLD prediction and with the mathematical models are quite similar.

These models help us adjust the distance we need for the linker. Indeed, we can assume that the end-to-end distance is closer to 20Å than 100Å. However, we cannot conclude anything yet and we cannot develop our model further because we do not have enough information about the stiffness of our linker.

The biggest limitation of these models is the fact that we lose information regarding the spatial arrangement of the repeat units and we do not have any information concerning the steric hindrance the repeat units would eventually apply upon each other.

If we consider our real polymer chain, the rotation of bonds around the backbone is restricted due to hindered internal rotations and excluded-volume effects. With this consideration, we know that our results are biased.

The ideal and worm-like chain models gave us the order of magnitude of the expected results, but we need to adapt our model to account for steric hindrance, so we decided to develop our own model to describe the behavior of the linker.

## Our mathematical model

### Definitions

We wanted our model to provide the end-to-end distance, as well as simulate the linker in 3D.
We use this scheme: all bonds have the same length, the angle θ represents the angle between two atoms, and Φ, Ψ, ω, are the dihedral angles [Fig. 7].
Unlike the mathematical models, in this model each segment is a bond rather than an amino acid. We thus have 36*3=108 segments.

Figure 7 : Scheme used for the elaboration of our model

To simulate our linker in 3D, we decided to write a program, where we represented each segment by a vector and then added these vectors in the same base. The sum of the segment vectors 0 to i is called Ui. In the end, we obtain one vector U107 that represents the end-to-end vector. We also keep the coordinates along the way so as to end up with a 3D representation of our linker.

This program was written in Python because this language is suited for biological modeling and the 3D library was suited for the construction of our linker.

Thanks to the Pymol software and the literature , we know the following constants:

• The length l of each segment : 1.5 Å
• The angle θ which represents the angle between each atom in the amino acid : 2π/3

We also use the Ramachandran plot to determine the appropriate dihedral angles Φ and Ψ. The figure below describes how we constructed our model in 3D.

For each segment of our model, we can be in any one of the three following situations [Fig. 8]:

• The segment is the N-Cα bond with the Φ dihedral angle
• The segment is the Cα-C bond with the Ψ dihedral angle
• The segment is the C-N peptide bond with a dihedral angle of 0 (a covalent link is planar)

The basis of reference is defined as one of the orthonormal bases where U0 is written:

Figure 8: 3D Scheme we used for the elaboration

All others segments will be expressed in that vector-base.

Then, to construct the linker, we need to define a change of basis matrix and consider the first two bonds, since for the peptide bond, ω = 0.

### Calculation

The change of basis matrix is composed of a translation matrix Tz, a rotation matrix Rz on the Oz axis and a rotation matrix Rx on the Ox axis.

• Tz represents the translation of 1.5 Å corresponding to the segment length
• Rz represents the rotation of a dihedral angle ϕ (which can be Φ, Ψ or 0)
• Rx represents the rotation of θ

We therefore obtain the change of basis matrix: P = Tz * Rz * Rx =

And, step by step, we can write the segment i in the basis of the segment 0 and obtain the vector Ui in the reference basis:

At every step, we keep in memory the coordinates of each vector for 3D visualization of the linker.

To finish, we implemented several improvements to complete our model. We consider that our linker does not only include glycine but also glutamates, leucines, serines, aspartates, valines, etc. and we adjust the dihedral angles accordingly. Indeed, the Ramachandran plot defines the biological probability of the dihedral angles, and amino acids do not share the same Ramachandran plots.

We also defined an exclusion zone for each vector. Indeed, in the first version of our model, some non-biological angles appeared, and as a consequence two segments overlapped. With the exclusion zone, we forbade the overlap of segments, thus excluding non-biological covering.

### Results

Our program is designed to give the end-to-end distance on N simulations, and show one simulation of the linker in 3D. For 5000 simulations, we obtain the following histogram [Fig. 9]:

Figure 9 : Results of our simulating programm

As we can see, the mean end-to-end distance is 19.66Å, and the standard deviation is 7.82Å.

The following graph [Figure 10] evaluates if the linker has a preferred orientation.

On the graph, each point represents the last segment of the linker compared to the origin. The color represents the end-to-end distance: the blue color indicates a small or minimal distance, the red color a large or maximal distance.
We can visually check that the cloud of linker last segment points follows an angularly homogeneous spherical distribution, where each angle starting from the origin is represented by an equal amount of points.
As such, we can conclude that the linker has no preferred orientation.

When simulating the linker in 3D, we obtain the following visualization [Fig. 11]:

Figure 11 : Result of the linker simulation in 3D, with all distance in Å

Compared to the polymer physics models, our model provides additional information concerning the linker, such as the spatial configurations it can take.

To sum up, we obtained a mean end-to-end distance of 18Å with the Ideal Chain model, 27 Å with the Worm-like Chain model, and 19.66Å with our model. When comparing our results to the ones obtained with these previous models, we observe our model is coherent with the 18Å result from the Ideal Chain model. One explanation for the divergence with the Worm-like Chain model’s result could be that the latter model is less suited for modeling our unfolded flexible linker protein.

We will now compare our linker’s spatial configuration results to those produced by the Gromacs software, given the same linker protein sequence.

# Gromacs software

Gromacs is a molecular dynamics package designed for simulating the structure of proteins and other biological molecules. Gromacs uses a PDB file to simulate a molecule’s dynamics. In our case, we used one of the PEP-FOLD predictions to run the molecular dynamics. To do so, we chose the PEP-FOLD prediction which had the closer end-to-end distance compared to our model’s mean end-to-end distance.

To run the molecular dynamics of the Gromacs software, we followed some steps:

1/ we defined a box full of water which simulates the aqueous system.
2/ we replaced few water molecules with ions to equilibrate the electrical charges of the solution and obtain electro-neutrality.
3/ we relaxed our system to minimal energy to ensure that it does not present any steric clashes or inappropriate geometry.
4/ we equilibrated the solvent (water molecules) and ions around the protein
5/ we finally ran the molecular dynamics of our linker for 10ns. The fact that the RMSD graph below reaches a plateau justifies limiting the time-course of the dynamics to this value.

We chose to plot the linker’s end-to-end distance so as to see the molecular dynamics obtained with Gromacs [Fig. 12] . The results are:

• Protein average end-to-end distance: 22.23 Å
• Average radius of gyration: 9.34Å

These results are very close to our model’s unfolded protein length predictions, which reinforces the validity of our model.

Figure 12 : Results from the Gromacs software

We also requested the RMSD & RMSF graphs from the Gromacs simulation [Fig. 13] , so as to extract more information regarding the dynamics of our linker.

Figure 13: Results from the Gromacs software for the RMSD & RMSF graphs

Thanks to Gromacs, we also had a movie [Fig. 14] which describe the dynamics of the linker during the 10ns:

Figure 14: Linker dynamics video obtained with the Gromacs software

In sum, the average end-to-end distance obtained in the Gromacs simulation is akin to that obtained in our model, which supports the assertion that our model does build the linker’s spatial configuration by taking into account steric hindrance effects just as the Gromacs software does, and more generally justifies the reasoning that led to the development of our model.

It is to be noted that, while the Gromacs simulation runs in more than 8 hours, our program runs in less than 5 minutes.

# Construction of the 3D model

Now that we retrieved one of the PDB formatted files from the Gromacs simulation, we can build a realistic 3D spatial configuration for our linker inside the Pymol software [Fig. 15] :

Figure 15 : Final model of our linker 3D spatial configuration

Seeing as the Gromacs software is much more complete and reliable than our model, we chose to use the conformation of the linker which presents an end-to-end distance equal to 22Å, in accordance with the Gromacs simulation.

The Pymol software allows us to calculate the distance between the two dCas9 proteins: 244.7 Å. Following the conversion equation explained above, we obtain the number of base pairs bp between the target sequences: bp = (244.7*10.5)/34 = 76

We can therefore conclude that the distance between the target sequences is approximately 76 base pairs. Since one helix turn correspond to 10.5 nucleic acids, we can very that there are 76/10.5 = 7.2 helix turn between the two dCas9. According to the hypotheses we made, we can conclude that the two dCas9 will be approximately in the same orientation.

# Conclusion

In this part, we wanted to find the optimal distance between the two dCas9 to visualize the fluorescence. All of the results we obtained allowed us to design the target sequences and has given us an overview of the dynamics of the system. To conclude, we find a distance of 244Å between the two dCas9s which correspond to 76 base pairs.

A possible improvement for this model would be the visualization of the molecular dynamics. Indeed, our program does not calculate or show the linker’s dynamics, giving instead a static image of the linker’s spatial configuration at time t. It is to be noted that our program is designed to predict and plot the spatial configuration of an unfolded protein, thus not taking into consideration pre-established α-helix or β-sheet types of folding, since the degree of unfolding of the linker of interest here is near 100%.

We have also shown that our model provides a reliable estimate of the end-to-end distance, which we have cross-checked with other models’ results, most notably the Gromacs software’s prediction.