Find slow dynamic modes via analyzing molecular dynamics simulation trajectories

Chuanbiao Zhang; Xin Zhou

doi:10.1088/1674-1056/abad24

1. Introduction

Molecular dynamics (MD) simulation has been widely used in recent decades in a variety of fields, for example to probe the relationship between the chemical structures and functions of biological macromolecules. [1, 2] The raw data of MD simulations are high-dimensional time series, with a total length usually on the order of microseconds or longer, femtosecond time resolution, and the coordinates and velocities of all atoms. Such data are too complicated to give an intuitive understanding of the simulated system; we therefore usually need to find a much smaller number of (collective) variables that characterize the motion of the system on longer timescales and capture its main features. [3] For simpler systems, suitable collective variables can sometimes be selected directly from experience or intuition. For example, for small proteins or peptides, the backbone dihedral angles describe the large (and usually slow) motions of the molecule well, and can thus be used as the key collective variables. For more complex systems, however, more systematic data analysis methods are usually needed to extract the main features from large MD data sets.

Many methods have been proposed to simplify large data sets and extract their main features. For example, many clustering algorithms [4 - 6] and dimensionality reduction techniques [7 - 11] focus on finding the low-dimensional structure of a set of data points, and have been applied to the analysis of MD simulation data of biomolecules. [12 - 23] These methods are based on the geometric distances between data points (conformations of molecules) and obtain the static structure of the whole data set without using any dynamical information. A more recent technique, the diffusion map, constructs an artificial diffusion dynamics among the static data points based on their distances and then finds the slow modes of this diffusion dynamics; it recovers the static structure robustly with the help of the artificial dynamical information. [11] On the other hand, some methods, such as the Markov state model (MSM), [24 - 31] its improvement, tICA, [32, 33] and the trajectory map (TM), [34 - 40] use the dynamical information in MD trajectories to find the slow-dynamic structure of the data. The TM gives a simple and efficient way to take into account the time ordering along the MD trajectory and to construct the structure of slow dynamics robustly. It is a promising general method for analyzing complicated time series. The implementation and applications of the TM were presented in previous works. [34, 37 - 40] Here we revisit the mathematical principle behind the TM and the relationship between the TM and other related techniques, so that the advantages and disadvantages of these methods can be discussed more clearly.

2. Theory and method

2.1. Slow dynamics modes

Consider a general stochastic process, such as a molecular dynamics (MD) simulation, that follows, e.g., the (overdamped) Langevin equation

$$ \begin{eqnarray}\gamma {\rm{d}}{\boldsymbol{q}}=-\displaystyle \frac{\partial U}{\partial {\boldsymbol{q}}}{\rm{d}}t+\sqrt{2{k}_{{\rm{B}}}T\gamma {\rm{d}}t}{\rm{d}}{\boldsymbol{G}},\end{eqnarray}$$ (1)

where γ is the friction coefficient, q is shorthand for the (high-dimensional) conformational coordinate, U(q) is the potential energy surface, k_B is the Boltzmann constant, T is the temperature, and dG is standard Gaussian white noise. Equivalently, we can describe the process by the Fokker–Planck equation for the probability density function,

$$ \begin{eqnarray}\displaystyle \frac{\partial }{\partial t}P(q,t)={{\boldsymbol{L}}}_{FP}P(q,t),\end{eqnarray}$$ (2)

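with the operator

$$ \begin{eqnarray}{{\boldsymbol{L}}}_{FP}=\displaystyle \frac{1}{\gamma }\displaystyle \frac{\partial }{\partial q}\left(\displaystyle \frac{\partial U}{\partial q}+{k}_{{\rm{B}}}T\displaystyle \frac{\partial }{\partial q}\right).\end{eqnarray}$$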

For simplicity, we set γ to be a constant and take it as the unit (equivalently, we rescale time as dτ = dt/γ). The equilibrium solution is the Boltzmann distribution, P_eq(q) = (1/Z) e^{−β U(q)}. Here Z = ∫ dq e^{−β U(q)}, the partition function, is the central quantity in statistical physics, and β = 1/k_BT. In the rest of this paper, we take k_BT as the unit of energy without further mention.
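As a numerical illustration of Eq. (1) and of the Boltzmann equilibrium, the following minimal sketch integrates the overdamped Langevin equation with a simple Euler–Maruyama scheme. The harmonic potential, the time step, the trajectory length, and the function name `simulate_overdamped` are illustrative assumptions, not settings from this work.

```python
import numpy as np

def simulate_overdamped(grad_u, q0, n_steps, dt, gamma=1.0, kT=1.0, seed=0):
    """Euler-Maruyama integration of gamma*dq = -U'(q) dt + sqrt(2 kT gamma dt) dG."""
    rng = np.random.default_rng(seed)
    q = np.empty(n_steps)
    q[0] = q0
    # Dividing Eq. (1) by gamma gives a random displacement of variance 2 kT dt / gamma
    noise = rng.standard_normal(n_steps - 1) * np.sqrt(2.0 * kT * dt / gamma)
    for i in range(n_steps - 1):
        q[i + 1] = q[i] - grad_u(q[i]) * dt / gamma + noise[i]
    return q

# Assumed test potential U(q) = k q^2 / 2, so U'(q) = k q
k = 2.0
traj = simulate_overdamped(lambda q: k * q, q0=0.0, n_steps=200_000, dt=1e-2)

# The Boltzmann distribution exp(-U/kT) is Gaussian with variance kT/k
var_sampled = traj[10_000:].var()   # discard an initial relaxation period
var_boltzmann = 1.0 / k
print(var_sampled, var_boltzmann)
```

The sampled variance should approach k_BT/k up to statistical noise and a small time-step bias of the integrator.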

Defining P(q,t) = [P_eq(q)]^{1/2} Ψ(q,t), we have

$$ \begin{eqnarray}-\displaystyle \frac{\partial }{\partial t}\varPsi (q,t)={\boldsymbol{H}}\varPsi (q,t),\end{eqnarray}$$ (3)

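the imaginary-time Schrödinger equation, where H = −∂²/∂q² + V(q) is a real symmetric operator of the same form as the Hamiltonian operator in quantum mechanics. The effective potential is

$$ \begin{eqnarray}V(q)=\displaystyle \frac{1}{4}{\left(\displaystyle \frac{\partial U}{\partial q}\right)}^{2}-\displaystyle \frac{1}{2}\displaystyle \frac{{\partial }^{2}U}{\partial {q}^{2}}.\end{eqnarray}$$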

Therefore, we have the spectrum expansion

$$ \begin{eqnarray}P(q,t)={P}_{{\rm{eq}}}(q)\displaystyle \sum _{n=0}^{\infty }{a}_{n}(0){{\rm{e}}}^{-{\lambda }_{n}t}{\hat{\varPhi }}_{n}(q),\end{eqnarray}$$ (4)

where $\hat{\varPhi}_{n}(q) = \varPhi_{n}(q)/\varPhi_{0}(q)$, and Φ_n(q) is the eigenfunction of the operator H with the eigenvalue λ_n, i.e., HΦ_n(q) = λ_n Φ_n(q). We have 0 = λ₀ < λ₁ < … < λ_n < …, and Φ₀(q) = [P_eq(q)]^{1/2}. The eigenfunctions satisfy the orthogonality condition $\langle \hat{\varPhi}_{n}(q)\hat{\varPhi}_{m}(q)\rangle_{\rm eq} = \int \varPhi_{n}(q)\varPhi_{m}(q)\,{\rm d}q = \delta_{nm}$ and the completeness relation $P_{\rm eq}(q)\sum_{n=0}^{\infty}\hat{\varPhi}_{n}(q)\hat{\varPhi}_{n}(q') = \delta(q-q')$. Here 〈 ⋅ 〉_eq denotes the expectation under the distribution P_eq(q). The expansion coefficient $a_{n}(0) = \langle \hat{\varPhi}_{n}(q)\rangle_{0} = \int {\rm d}q\,\hat{\varPhi}_{n}(q)P(q,0)$ is the expectation under the initial distribution P(q, t = 0).

U (q) = \frac{1}{2} k q^{2}
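For the harmonic potential U(q) = ½kq², the spectrum of H can be checked numerically. The sketch below assumes γ = k_BT = 1 and the standard effective potential V = (∂U/∂q)²/4 − (∂²U/∂q²)/2 that results from the substitution P = [P_eq]^{1/2}Ψ; for this potential the exact eigenvalues are λ_n = nk. The grid size, the box width, and the value of k are illustrative choices.

```python
import numpy as np

k = 1.5                      # spring constant of U(q) = k q^2 / 2 (illustrative value)
n, half_width = 800, 12.0    # grid resolution and half-width (assumed)
q = np.linspace(-half_width, half_width, n)
dq = q[1] - q[0]

# Effective potential V = (U')^2 / 4 - U'' / 2 with k_B T = gamma = 1
v_eff = 0.25 * (k * q) ** 2 - 0.5 * k

# Second-order finite-difference H = -d^2/dq^2 + V(q), zero boundary values
h = (np.diag(2.0 / dq**2 + v_eff)
     - np.diag(np.ones(n - 1) / dq**2, 1)
     - np.diag(np.ones(n - 1) / dq**2, -1))

lam = np.linalg.eigvalsh(h)[:4]      # four smallest eigenvalues
print(lam, k * np.arange(4))         # numerical values vs. n * k
```

The lowest eigenvalue is zero (the equilibrium mode), and the gap to λ₁ = k sets the slowest relaxation rate in this example.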

2.2. Linear space of slow modes and metastable states

Any probability function P(q,t) can be assumed to have already evolved for a sufficiently long time that the fast modes have decayed, thus

$$ \begin{eqnarray}\displaystyle \frac{P(q,t)}{{P}_{{\rm{eq}}}(q)}\approx 1+\displaystyle \sum _{n=1}^{N}{a}_{n}(t){\hat{\varPhi }}_{n}(q),\end{eqnarray}$$ (5)

i.e., it approximately belongs to the linear space spanned by the slow modes. For example, consider an MD trajectory q(t), t ∈ [0,τ]. Without loss of generality, the corresponding probability function $P(q) = \frac{1}{\tau}\int_{0}^{\tau}{\rm d}t\,\delta(q - q(t))$ is a linear combination of the slow modes (if not, reset a later time as t = 0 and discard the earlier conformations). Therefore, if we generate many MD trajectories from different initial conformations, the corresponding probability functions (or more exactly, their finite-size samples), denoted as {P_i(q)}, i = 1, …, m, can be linearly combined to span the slow dynamic modes of the system

$$ \begin{eqnarray}{\hat{\varPhi }}_{n}(q)=\displaystyle \frac{\sum _{i}{b}_{n,i}{P}_{i}(q)}{{P}_{{\rm{eq}}}(q)},\end{eqnarray}$$ (6)

where a linearly independent subset of {P_i(q)} is sufficient to expand any slow mode.

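The slow-mode functions $\{\hat{\varPhi}_{n}(q)\}$ can in turn be expanded in the characteristic functions of the metastable states,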

$$ \begin{eqnarray}{\hat{\varPhi }}_{n}(q)=\displaystyle \sum _{\alpha }{c}_{n\alpha }{\varTheta }_{\alpha }(q).\end{eqnarray}$$ (7)

Here Θ_α(q) is the characteristic function of the α-th metastable state, whose value is unity inside the state and zero otherwise. It means that conformations inside the same state are equivalent from the viewpoint of slow dynamics. As more (faster) dynamic modes are included, the metastable states are split further and a more refined description of the conformational space is obtained.

MD trajectories similarly give the splitting of metastable states, since

$$ \begin{eqnarray}\displaystyle \frac{{P}_{i}(q)}{{P}_{{\rm{eq}}}(q)}=\displaystyle \sum _{\alpha }{c}_{i\alpha }^{^{\prime} }{\varTheta }_{\alpha }(q),\end{eqnarray}$$ (8)

i.e., the local equilibrium distribution is proportional to the global one inside each metastable state. It means that the slow-mode functions, the characteristic functions of the metastable states, and the conformational probability functions from MD trajectories share the same linear structure. The central idea of the trajectory map (TM) [34, 38] is to obtain the slow modes or metastable states from MD trajectories.

2.3. Trajectory map

We apply a set of known conformational functions, denoted as {A^μ(q)}, μ = 1, 2, …, N₀, as basis functions to coarsely represent the linear space spanned by the slow-mode functions,

$$ \begin{eqnarray}\displaystyle \frac{{P}_{i}(q)}{{P}_{{\rm{eq}}}(q)}\approx \displaystyle \sum _{\mu }{p}_{i,\mu }{A}^{\mu }(q).\end{eqnarray}$$ (9)

Here, the approximation, which comes from the incompleteness of the set of basis functions, has little effect on the construction of the slow-mode space. The coefficient p_{i,μ} = 〈 A_μ(q) 〉_i is estimated from the finite-size conformational sample corresponding to P_i(q) (a segment of MD trajectory of length τ), where the conjugate basis function A_μ(q), satisfying

$$ \begin{eqnarray}{\langle {A}_{\mu }(q){A}^{\nu }(q)\rangle }_{{\rm{eq}}}={\delta }_{\mu \nu },\end{eqnarray}$$ (10)

is estimated from the conformational sample corresponding to P_eq(q). Usually, an equilibrium sample is not available; we can then use a reference distribution P_ref(q), for example, the mean of all MD trajectories, in place of P_eq(q). This has little effect on the construction of the linear space of slow modes, since P_ref(q) is also proportional to P_eq(q) in each metastable state, and thus P_i(q)/P_ref(q) still belongs to the slow-mode linear space.

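When the basis functions are orthonormalized under the reference distribution, they coincide with their conjugates, $\hat{A}^{\mu}(q) = \hat{A}_{\mu}(q)$.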

Thus, each MD trajectory or segment P_i(q) is mapped to a vector

$$ \begin{eqnarray}{{\boldsymbol{p}}}_{i}={p}_{i,\mu }{\hat{A}}^{\mu }(q),\end{eqnarray}$$ (11)

and all these vectors {p_i} span the slow-mode linear space (more exactly, its projection onto the applied basis functions $\{\hat{A}^{\mu}(q)\}$). A larger set of basis functions helps to provide more complete information about the slow modes, but even when the basis functions are insufficient, so that some of the slow modes cannot be completely distinguished, the linear projection does not introduce any additional bias. In practice, the size of the sample corresponding to P_ref(q) also limits the number of linearly independent basis functions $\{\hat{A}^{\mu}(q)\}$. More discussion of the basis functions can be found in the previous literature on the TM. [34, 37 - 40]
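The mapping of trajectory segments to vectors can be sketched as follows. This is a minimal illustration, not the published TM implementation: it assumes the reference sample is the set of all frames, removes the mean of each basis function, and orthonormalizes the basis by whitening. The synthetic two-state data and the names `whiten_basis` and `map_segments` are hypothetical.

```python
import numpy as np

def whiten_basis(features):
    """Return basis functions with zero mean and identity covariance over the reference sample."""
    centered = features - features.mean(axis=0)
    cov = centered.T @ centered / len(centered)
    evals, evecs = np.linalg.eigh(cov)
    keep = evals > 1e-10 * evals.max()          # drop linearly dependent directions
    return centered @ evecs[:, keep] / np.sqrt(evals[keep])

def map_segments(a_hat, seg_len):
    """p[i, mu] = average of the mu-th whitened basis function over the i-th segment."""
    n_seg = len(a_hat) // seg_len
    return a_hat[: n_seg * seg_len].reshape(n_seg, seg_len, -1).mean(axis=1)

# Synthetic data (assumed): three angles whose means depend on a slowly switching state
rng = np.random.default_rng(0)
state = np.repeat([0, 1, 0, 1], 5000)
angles = rng.normal(0.0, 0.3, (len(state), 3)) + np.outer(state, [1.5, 0.0, -1.0])
features = np.hstack([np.cos(angles), np.sin(angles)])   # six basis functions

a_hat = whiten_basis(features)
p = map_segments(a_hat, seg_len=1000)    # 20 mapped vectors, one per segment
print(p.shape)
```

Segments that stay in the same state map to nearly the same point, while segments from different states are well separated, which is what the subsequent eigen-analysis exploits.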

Principal component analysis of the mapped vectors then gives a set of slow variables $\{\hat{B}_{\alpha}(q)\}$, α = 1, …, m,

$$ \begin{eqnarray}{\hat{B}}_{\alpha }(q)={b}_{\alpha \mu }{\hat{A}}^{\mu }(q),\end{eqnarray}$$ (12)

where b_αμ is the α-th eigenvector of the variance–covariance matrix of these mapped points, ∑_i p_iμ p_iν. We have $\langle \hat{B}_{\alpha}(q)\hat{B}_{\beta}(q)\rangle_{\rm ref} = \delta_{\alpha\beta}$. A hint for choosing the number of principal components is provided by the plot of the eigenvalues sorted in decreasing order. The first few eigenvalues that are significantly greater than zero usually correspond to slow processes.
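The eigen-analysis behind Eq. (12) can be illustrated with a short sketch. The mapped vectors below are synthetic (three states at the corners of a triangle), and the relative threshold `rel_tol` used to decide which eigenvalues are "significantly greater than zero" is a hypothetical heuristic; in practice the eigenvalue plot is inspected directly.

```python
import numpy as np

def slow_directions(p, rel_tol=0.05):
    """Diagonalize the covariance of mapped vectors; keep eigenvalues above rel_tol * largest."""
    p = p - p.mean(axis=0)
    sigma = p.T @ p / len(p)
    evals, evecs = np.linalg.eigh(sigma)
    order = np.argsort(evals)[::-1]              # sort in decreasing order
    evals, evecs = evals[order], evecs[:, order]
    n_slow = int(np.sum(evals > rel_tol * evals[0]))
    return evals, evecs[:, :n_slow].T            # rows are the coefficient vectors b_alpha

# Synthetic mapped vectors (assumed): 60 segments, 6 basis functions, 3 metastable states
rng = np.random.default_rng(1)
centers = np.zeros((3, 6))
centers[:, :2] = [[1.0, 0.0], [-0.5, np.sqrt(3) / 2], [-0.5, -np.sqrt(3) / 2]]
labels = np.arange(60) % 3
p = centers[labels] + 0.02 * rng.normal(size=(60, 6))

evals, b = slow_directions(p)
print(evals.round(4), b.shape)
```

Three states span a two-dimensional space after the mean is removed, so two eigenvalues stand out and the remaining ones are at the noise level.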

The MD trajectory is then projected onto the slow variables $\{\hat{B}_{\alpha}(q)\}$,

$$ \begin{eqnarray}{\hat{B}}_{\alpha }(t)=\displaystyle \frac{1}{\Delta t}\displaystyle {\int }_{t}^{t+\Delta t}{\hat{B}}_{\alpha }(q({t}^{^{\prime} })){\rm{d}}{t}^{^{\prime} }.\end{eqnarray}$$ (13)

Here we apply time-window smoothing (with window length Δt) to further filter out the fast dynamic modes of the MD trajectory. Usually, Δt is set about two to three orders of magnitude smaller than the length of the trajectory segment (τ), and the results of the TM are insensitive to the specific value of Δt. We can then identify pronounced changes along the trajectory B(t) as transition events between metastable states, or calculate the two-point similarity matrix in the B space, C(t₂,t₁) = B(t₂) ⋅ B(t₁), to detect the transition events. [38]
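A discretized version of the window average in Eq. (13) and of the similarity matrix C(t₂,t₁) might look like the sketch below. The two-state synthetic series, the noise level, and the window width are illustrative assumptions; the function names are hypothetical.

```python
import numpy as np

def window_average(b, width):
    """Average each column of b over `width` consecutive frames (discrete form of Eq. (13))."""
    csum = np.cumsum(np.vstack([np.zeros((1, b.shape[1])), b]), axis=0)
    return (csum[width:] - csum[:-width]) / width

def similarity_matrix(b_smooth):
    """C(t2, t1) = B(t2) . B(t1) for all pairs of smoothed frames."""
    return b_smooth @ b_smooth.T

# Synthetic slow variables (assumed): state 0 near (1, 0), state 1 near (0, 1), plus noise
rng = np.random.default_rng(2)
state = np.repeat([0, 1, 0], 300)
b = np.where(state[:, None] == 0, [1.0, 0.0], [0.0, 1.0]) + 0.3 * rng.normal(size=(900, 2))

b_s = window_average(b, width=50)
c = similarity_matrix(b_s)
print(c.shape, c[0, 100].round(2), c[0, 400].round(2))
```

Pairs of frames in the same state give entries close to one, pairs in different states give entries close to zero, so the matrix shows a block pattern whose boundaries mark the transitions.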

3. Application

In this section, we use the Trp-cage protein to illustrate the basic application of the TM. The Trp-cage contains 20 residues and includes three secondary-structure elements in its native structure: an α helix, a 3-10 helix, and a polyproline II segment. [41 - 43] This protein can fold in microseconds, and the stability of the native structure originates from the hydrophobic core around the Trp residue. [43, 44] Owing to its small size and multiple metastable states, the Trp-cage is an ideal protein for testing both sampling algorithms and simulation force fields, and it has therefore been extensively studied by MD simulations and experiments. [45 - 54] However, its folding kinetics and folding pathways are still not fully understood. [andryushchenko2016hydrodynamic] Recently, Lindorff-Larsen et al. performed a 208-μs equilibrium MD simulation of the Trp-cage at 290 K. [43] They applied the CHARMM22* force field [56] and the modified TIP3P water model compatible with the CHARMM force field. [57, 58] The generated MD trajectory contains about 2.08 × 10⁵ snapshots, with a time interval of 200 ps. We downloaded the trajectory file from D. E. Shaw Research.

We choose the backbone dihedral angles of the protein, ϕ_i and ψ_i with i = 1, …, 18, as collective variables to describe the large-scale motions. Because the dihedral angles are periodic, we transform them into their cosine and sine functions, which gives 72 basis functions in total, [59] {A^μ(q)}, μ = 1, …, 72, in the TM. The time evolution of the root-mean-square deviation (RMSD) clearly distinguishes the folded state (native structure) from the unfolded structures, but further details within the unfolded structures are not so clear (Fig. 1(a)).
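The construction of the 72 basis functions from 18 ϕ and 18 ψ angles can be sketched as follows. The random angles stand in for real trajectory data, and the function name `dihedral_features` is hypothetical.

```python
import numpy as np

def dihedral_features(phi, psi):
    """Stack cos and sin of backbone dihedrals (radians); inputs have shape (n_frames, n_angles)."""
    angles = np.hstack([phi, psi])
    return np.hstack([np.cos(angles), np.sin(angles)])

# Placeholder angles (assumed): 5 frames, 18 phi and 18 psi angles each
rng = np.random.default_rng(3)
phi = rng.uniform(-np.pi, np.pi, size=(5, 18))
psi = rng.uniform(-np.pi, np.pi, size=(5, 18))

feats = dihedral_features(phi, psi)
print(feats.shape)   # 2 * (18 + 18) = 72 basis functions per frame
```

Using the cosine and sine pair makes the features continuous across the ±π boundary, which a raw angle is not.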

Figure 1. (a) Time evolution of the RMSD of the Trp-cage from its native structure and of the slow variables B₁ and B₂ obtained by the TM. The red line is the time-window-smoothed curve (Δt = 200 ns). (b) Eigenvalues of the variance–covariance matrix of the trajectory-mapped points. The inset shows the contribution of each basis function to the slow variables B₁ and B₂. (c) The free-energy landscape (in units of k_BT) in the slow-variable space (B₁, B₂).


We apply the TM with the free parameter τ = 2 μs to detect the slow modes on the τ timescale that are hidden in the 208-μs-long all-atom MD trajectory. As shown in Fig. 1(b), two eigenvalues of the variance–covariance matrix are significantly greater than zero, which indicates that the system involves two slow dynamic processes. The two corresponding slow variables (denoted as B₁ and B₂) are shown in Fig. 1(a). Figure 1(c) shows the free energy surface in the (B₁, B₂) space, ΔG = −k_BT ln P, where P is the probability distribution of the molecular system along the slow variables. In this two-dimensional space, there are three minima, which correspond to three metastable states: the folded state (S_f), the unfolded state (S_u), and the extended state (S_e). B₁ primarily distinguishes state S_f from the other two states. In other words, B₁, which is a combination of the 72 angle-based basis functions, has a physical meaning similar to that of the RMSD, which is commonly used to identify the folded state of a protein. B₂ distinguishes state S_e from the other states.
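A free-energy surface of the form ΔG = −k_BT ln P can be estimated from a two-dimensional histogram, as in the sketch below. The bin count and the synthetic two-state samples are illustrative assumptions, and the surface is shifted so that its global minimum is zero.

```python
import numpy as np

def free_energy_2d(b1, b2, bins=40):
    """Return Delta G / k_B T = -ln P on a 2D histogram; empty bins are +inf."""
    hist, xedges, yedges = np.histogram2d(b1, b2, bins=bins, density=True)
    with np.errstate(divide="ignore"):
        g = -np.log(hist)
    return g - g.min(), xedges, yedges

# Synthetic samples (assumed): 80% of frames near B1 = 0, 20% near B1 = 3
rng = np.random.default_rng(5)
b1 = np.concatenate([rng.normal(0.0, 0.3, 8000), rng.normal(3.0, 0.3, 2000)])
b2 = rng.normal(0.0, 0.3, 10000)

g, xedges, yedges = free_energy_2d(b1, b2)
ix, iy = np.unravel_index(np.argmin(g), g.shape)
x_min = 0.5 * (xedges[ix] + xedges[ix + 1])   # B1 coordinate of the global minimum
print(g.shape, x_min)
```

The deepest minimum falls in the more populated state, and the depth difference between the two minima reflects the logarithm of the population ratio.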

Figure 2(a) shows the time-ordered similarity matrix of the MD trajectory after projection onto the slow-dynamics space, C(t₂, t₁) = B(t₂) ⋅ B(t₁). The matrix divides into blocks. Inside each block (each time segment), the elements of the similarity matrix are almost equal to unity, which indicates that all the conformations in the time segment are almost identical in the B space, i.e., they belong to the same metastable state and no transition occurs within the segment. At the time points that separate different blocks, the MD trajectory makes a transition from one metastable state to another, and the similarity between conformations in the B space changes markedly. This matrix thus helps us to identify transition events.

Figure 2. (a) Time-ordered similarity matrix of the MD trajectory. The similarity between two samples is C(t₂, t₁) = B(t₂) ⋅ B(t₁). (b) The time-rearranged similarity matrix, suggesting three metastable states. (c) Kinetic transition network. Numbers near the arrows are the corresponding transition rates. The population of each state in the 208-μs MD trajectory is listed in brackets (it approaches the equilibrium value, consistent with the fact that folding and unfolding transitions occur more than ten times during the MD simulation). Residues TRP6 and PRO17 are shown in blue, GLY11 in red.


ψ_{TRP}^{6}

4. Discussion

The propagator G(q,t|q₀,0), also called the Green function, is the conditional probability of finding the system at q at time t given that it was at q₀ at time 0; it describes the dynamics of the system. For large t, we have

$$ \begin{eqnarray}G(q,t|{q}_{0},0)={P}_{{\rm{eq}}}(q)\left[1+\displaystyle \sum _{n=1}^{N}{{\rm{e}}}^{-{\lambda }_{n}t}{\hat{\varPhi }}_{n}(q){\hat{\varPhi }}_{n}({q}_{0})\right],\end{eqnarray}$$ (14)

and P(q,t) = ∫ G(q,t|q₀,0) P(q₀,0) d q₀.

It is easy to see that the slow-mode functions are the first few eigenfunctions of G(q,t|q₀,0). In principle, we can represent the propagator as a time correlation matrix by using the basis functions {A^μ(q)},

$$ \begin{eqnarray}\begin{array}{lll}{G}_{\,\nu }^{\mu }(t,0) & = & \displaystyle \int {\rm{d}}q{\rm{d}}{q}^{^{\prime} }{A}^{\mu }(q)G(q,t|{q}^{^{\prime} },0){A}_{\nu }({q}^{^{\prime} })P({q}^{^{\prime} },0)\\ & \equiv & {\langle {A}^{\mu }(q(t)){A}_{\nu }(q(0))\rangle }_{0},\end{array}\end{eqnarray}$$ (15)

then diagonalize the matrix to get the slow-mode functions.

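A sufficiently complete basis is needed to represent the slow-mode functions $\hat{\varPhi}(q)$ accurately. For instance, taking as basis the characteristic functions θ_μ(q) of small cells that partition the conformational space, the matrix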

$$ \begin{eqnarray}\begin{array}{lll}{G}_{\,\nu }^{\mu }(t) & = & \displaystyle \frac{1}{{\hat{Z}}_{\mu }}{\langle {\theta }_{\mu }(q(t)){\theta }_{\nu }(q(0))\rangle }_{{\rm{eq}}}\\ & = & \displaystyle \frac{{n}_{\mu \nu }(t,0)}{{n}_{\nu }(0)},\end{array}\end{eqnarray}$$ (16)

provides a very detailed representation of the propagator. Here n_μν(t,0) is the probability that q is located in the μ-th cell at time t having started from the ν-th cell at t = 0, and n_ν(0) is the probability of being in the ν-th cell at t = 0. It is worth mentioning that, owing to the nature of the cell functions, we can actually use any initial probability P(q,0) to calculate the matrix without altering the results. It is easy to see that $G_{\nu}^{\mu}(t) \geq 0$ and $\sum_{\mu} G_{\nu}^{\mu}(t) \equiv 1$. Thus, the eigenvalues of the matrix lie between 0 and 1 for any t, and the few largest eigenvalues and their eigenvectors give the slow modes of motion.

This is the main idea of the MSM, which has been widely applied to analyze the metastable states of slow motions in biomolecular systems from ensemble dynamics simulations. [24 - 31] In the MSM, the cells are defined on the set of sampled conformations. All sampled conformations are clustered into thousands of groups based on the pairwise distances between conformations, and each group of conformations corresponds to a cell function. Along the MD trajectories, the transition probability from one cell to another after a time interval t is estimated as the corresponding matrix element. Owing to the completeness of the basis functions, and because the initial distribution is not required to be the equilibrium one, the MSM in principle provides a complete construction of the slow-dynamics modes, and its eigenfunctions correspond one-to-one to the slow dynamics modes. In practice, however, a very large simulation data set is often required to estimate each matrix element with sufficient statistical accuracy, which requires a sufficient number of transition events between each pair of cells.
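The estimation of the cell-to-cell transition matrix of Eq. (16) from a discretized trajectory can be sketched as below. This is a bare counting estimator under simplifying assumptions (a single trajectory, every cell visited at least once, no reversibility constraint), not the procedure of any particular MSM package; the three-cell Markov chain used as input is synthetic.

```python
import numpy as np

def transition_matrix(dtraj, n_cells, lag):
    """Column-stochastic G[mu, nu] = n_{mu nu}(lag, 0) / n_nu(0) from a discrete trajectory."""
    counts = np.zeros((n_cells, n_cells))
    np.add.at(counts, (dtraj[lag:], dtraj[:-lag]), 1.0)   # row: cell at t + lag, column: cell at t
    # Assumes every cell occurs as a starting cell, otherwise a column sum is zero
    return counts / counts.sum(axis=0, keepdims=True)

# Synthetic discrete trajectory (assumed): a three-cell chain with rare jumps
rng = np.random.default_rng(4)
t_true = np.array([[0.98, 0.02, 0.00],
                   [0.01, 0.98, 0.01],
                   [0.00, 0.02, 0.98]])    # row i: probabilities of the next cell
dtraj = np.zeros(20000, dtype=int)
for i in range(1, len(dtraj)):
    dtraj[i] = rng.choice(3, p=t_true[dtraj[i - 1]])

g = transition_matrix(dtraj, n_cells=3, lag=1)
evals = np.sort(np.linalg.eigvals(g).real)[::-1]
print(g.sum(axis=0), evals)
```

Each column sums to one, the largest eigenvalue equals one, and the next eigenvalues approach one as the corresponding processes become slower. The statistical error of each element depends on how many transitions between the two cells were observed, which is the practical limitation noted above.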

In terms of the basis functions $\hat{A}_{\mu}(q)$, the time correlation matrix reads $G_{\mu\nu}(t,0) = \langle \hat{A}_{\mu}(q(t))\hat{A}_{\nu}(q(0))\rangle$.

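In comparison, the TM calculates and diagonalizes the variance–covariance matrix of the trajectory-mapped points,

$$ \begin{eqnarray}{\varSigma }^{\mu \nu }=\displaystyle \frac{1}{m}\displaystyle \sum _{i=1}^{m}{p}_{i,\mu }{p}_{i,\nu }.\end{eqnarray}$$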

It is easy to see that Σ^μν is symmetric and is a kind of average of G_μν(t₂,t₁),

$$ \begin{eqnarray}{\varSigma }^{\mu \nu }=\displaystyle \frac{1}{\tau }\displaystyle {\int }_{0}^{\tau }{\rm{d}}t\left(1-\displaystyle \frac{t}{\tau }\right)[{C}_{\mu \nu }(t)+{C}_{\nu \mu }(t)],\end{eqnarray}$$ (17)

where

$$ \begin{eqnarray}{C}_{\mu \nu }(t)=\displaystyle \frac{1}{\tau -t}\displaystyle {\int }_{0}^{\tau -t}{\rm{d}}{t}_{1}{G}_{\mu \nu }(t+{t}_{1},{t}_{1}).\end{eqnarray}$$ (18)

The TM focuses more on the differences among the MD trajectories {P_i(q)} and less on the differences among conformations within the same trajectory, since the latter are mainly related to short-time correlations. Therefore, the average of the time correlation matrix in the TM provides a suitable way to extract the slow dynamics modes more efficiently by filtering out fast motions.
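The averaging in Eqs. (17) and (18) can be made explicit. As a sketch, assume that p_{i,μ} is the time average of $\hat{A}_{\mu}(q(t))$ over a trajectory segment of length τ, that 〈 ⋅ 〉 denotes the average over segments, and that $G_{\mu\nu}(t_2,t_1) = \langle \hat{A}_{\mu}(q(t_2))\hat{A}_{\nu}(q(t_1))\rangle$. Splitting the double time integral into the regions t₂ > t₁ and t₂ < t₁, and using $G_{\mu\nu}(t_1,t_2) = G_{\nu\mu}(t_2,t_1)$, gives

$$ \begin{eqnarray}\begin{array}{lll}{\varSigma }^{\mu \nu } & = & \displaystyle \frac{1}{{\tau }^{2}}\displaystyle {\int }_{0}^{\tau }{\rm{d}}{t}_{2}\displaystyle {\int }_{0}^{\tau }{\rm{d}}{t}_{1}{G}_{\mu \nu }({t}_{2},{t}_{1})\\ & = & \displaystyle \frac{1}{{\tau }^{2}}\displaystyle {\int }_{0}^{\tau }{\rm{d}}t\displaystyle {\int }_{0}^{\tau -t}{\rm{d}}{t}_{1}[{G}_{\mu \nu }(t+{t}_{1},{t}_{1})+{G}_{\nu \mu }(t+{t}_{1},{t}_{1})].\end{array}\end{eqnarray}$$

With the definition of C_μν(t) in Eq. (18), the inner integral equals (τ − t) C_μν(t), which reproduces the weight (1 − t/τ)/τ in Eq. (17). Short lag times t therefore enter with the largest weight, but each is averaged over all starting times t₁ within the segment.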

5. Summary

The TM can extract slow dynamic processes from time-series data. The key to the TM is to make use of the time continuity between conformations, rather than relying only on the geometric similarity of individual conformations, to build the slow variables of the system. Compared with other methods, the TM directly uses linear combinations of the probability functions of trajectories to construct the slow-dynamics functions. This makes the TM robust and straightforward, and less affected by the incompleteness of the basis functions used to represent these slow-dynamics functions. In addition, the TM yields collective variables related to the slow dynamics, which not only provide a simple understanding of the slow dynamics of a system from its MD trajectories but can also be applied to extend the timescale of further MD simulations [39] in combination with enhanced sampling techniques, such as metadynamics, [60] umbrella sampling, [61] and forward flux methods. [62]