MDplot: Visualise Molecular Dynamics

The MDplot package provides plotting functions to allow for automated visualisation of molecular dynamics simulation output. It is especially useful in cases where the plot generation is rather tedious due to complex ﬁle formats or when a large number of plots are generated. The graphs that are supported range from those which are standard, such as RMSD/RMSF (root-mean-square deviation and root-mean-square ﬂuctuation, respectively) to less standard, such as thermodynamic integration analysis and hydrogen bond monitoring over time. All told, they address many commonly used analyses. In this article, we set out the MDplot package’s functions, give examples of the function calls, and show the associated plots. Plotting and data parsing is separated in all cases, i.e. the respective functions can be used independently. Thus, data manipulation and the integration of additional ﬁle formats is fairly easy. Currently, the loading functions support GROMOS, GROMACS, and AMBER ﬁle formats. Moreover, we also provide a Bash interface that allows simple embedding of MDplot into Bash scripts as the ﬁnal analysis step. Availability: The package can be obtained in the latest major version from CRAN ( https://cran.r-project.org/package=MDplot ) or in the most recent version from the project’s GitHub page at https://github.com/MDplot/MDplot , where feedback is also most welcome. MDplot is published under the GPL-3 license.


Introduction
The amount of data produced by molecular dynamics (MD) engines (such as GROMOS (Schmid et al., 2012;Eichenberger et al., 2011), GROMACS (Pronk et al., 2013), NAMD (Phillips et al., 2005), AMBER (Cornell et al., 1995), and CHARMM (Brooks et al., 2009)) has been constantly increasing over recent years.This is mainly due to more powerful and cheaper hardware.As a result of this, both the lengths and sheer number of MD simulations (i.e.trajectories) have increased enormously.Even large sets of simulations (e.g., in the context of drug design) are attainable nowadays; thus suggesting that the processing of the resulting information is undertaken automatically.
In this respect, automated yet flexible visualisation of molecular dynamics data would be highly advantageous: both in order to avoid repetitive tasks for the user and to yield the ultimately desired result instantly (see Figure 1).Moreover, generating some of the graphs can be cumbersome.An example would be the plotting of a time series of a clustering program or hydrogen bonds.Therefore, these cases are predestined to be handled by a plotting library.There have been attempts made in that direction, for example the package bio3d (Grant et al., 2006;Skjaerven et al., 2014) (which allows the trajectories to be processed in terms of principle component analysis (PCA), RMSD and RMSF calculations), MDtraj (McGibbon et al., 2015), or Rknots (Comoglio and Rinaldi, 2012).However, to the best of our knowledge, there is currently no R package available that offers the wide range of plotting functions and engine-support that is provided by MDplot.R is the natural choice for this undertaking because of both its power in data handling and its vast plotting abilities.
Figure 1: Shows the overall workflow typically applied in molecular dynamics simulations beginning with a single PDB (Berman et al., 2000) structure as the input for the simulation and ending with the graphical representation of the data obtained.For large amounts of data, generating figures might become a tedious, highly repetitive task.
In the following sections we outline all of the plotting functions that are currently supported.For each function, examples of the function calls based on the test data included in the package, the resulting plots, the return values, and a table of arguments are detailed.The respective code samples use the loading functions (reported below) to parse the input files located in folder 'extdata', which allows immediate testing and provides format information to users.Currently, the package supports The R Journal Vol.9/1, June 2017 ISSN 2073-4859 GROMOS, GROMACS, and AMBER file formats as input.1However, extensions in both format support and plotting functionalities are planned.

Plotting functions
The package currently offers 14 distinct plotting functions (Table ??), which cover many of the graphs that are commonly required.Although the focus of the package relies on the visualisation of data, in addition to this values are calculated to characterise the underlying data when appropriate.For example, TIcurve() calculates the thermodynamic integration free-energy values including error estimates and the hysteresis between the integration curves.In many cases, the plotting functions return useful information on the data used, e.g., range, mean and standard deviation of curves.
To provide simple access to these functions, they may be called from within a Bash script.Examples are provided at the end of the manuscript.

Plot function Description
clusters() Summary of clustering over trajectories (RMSD based).

hbond_ts()
Time series of hydrogen bonds.
Table 1: Lists all of the currently available plotting functions that have been implemented in MDplot.
Most functions accept a boolean parameter (barePlot), that indicates printing of the plotting area only, i.e. stripped from any additional features such as axis labels.

The clusters() function
Molecular dynamics simulation trajectories can be considered to be a set of atom configurations along the time axis.Clustering is a method, that can be applied in order to extract common structural features from these.The configurations are classified and grouped together based on the root-mean-square deviation (RMSD).These subsets of configurations around the cluster's central member structure and their relative occurrences allow for comparisons between different and within individual simulations.clusters() allows to plot a summary of all of the (selected) clusters over a set of trajectories (Figure 2).

The clusters_ts() function
In structural clustering, it is often instructive to have a look at the development over time rather than the overall summary.This functionality is provided by clusters_ts().In the top sub-plot the overall distribution is given, while the time series is shown at the bottom.The clusters are sorted beginning with the most populated one, in descending order.Selections can be made and clusters that are not selected do also not appear in the time series plot (white areas).The time axis may be shown in nanoseconds (see Figure 3 for an example).

The dssp() function
In terms of proteins the secondary structure can be annotated by the widely used program DSSP (Definition of Secondary Structure of Proteins) (Kabsch and Sander, 1983).This algorithm uses the backbone hydrogen bond pattern in order to assign secondary structure elements such as α-helices, βstrands, and turns to protein sequences.The plotting function dssp() has three different visualisation methods and plots the overall result over the trajectory and over the residues.The user can specify selections of residues and which elements should be taken into consideration (Figure 4).

The dssp_ts() function
The secondary structure information as described for the function dssp() can also be visualised along the time axis using function dssp_ts() (Figure 5).The time can be annotated in snapshots or time units (e.g., nanoseconds).

The hbond() function
In the context of biomolecules, hydrogen bonds are of particular importance.These bonds take place between a donor, a hydrogen, and an acceptor atom.This function plots the summary output of The R Journal Vol.9/1, June 2017 ISSN 2073-4859 Note, that for this example plot a sparse data set was used to reduce the size of the data file (hence the large white areas in the middle).
hydrogen bond calculations and allows selection of donor and acceptor residues.Occurrence over the whole trajectory is indicated by a colour scale.Note, that in case multiple hydrogen bond interactions between two particular residues take place (conveyed by different sets of atoms), the interaction with prevalence will be used for colour-coding (and by default, this interaction is marked with a black circle, see below).An example is given in Figure 6. hbond(load_hbond("inst/extdata/hbond_example.txt.gz"),main="MDplot::hbond()",donorRange=c(0,65)) Return value: Returns a table containing the information used for plotting in columns as follows: • resDonor Residue number (donor).
• percentage Percentage, that has been used for colour-coding.
• numberInteractions Number of hydrogen bond interactions taking place between the specified donor and acceptor residues.

Argument name Default value Description hbonds none
Table containing the hydrogen bond information in columns "hbondID", "resDonor", "resDonorName", "resAcceptor", "resAcceptorName", "atom-Donor", "atomDonorName", "atomH", "atomAcceptor", "atomAcceptor-Name", "percentage" (automatically generated by function load_hbond()).plotMethod "residue-wise"  The R Journal Vol.9/1, June 2017 ISSN 2073-4859 The hbond_ts() function The time series of hydrogen bond occurrences can be visualised using the function hbond_ts(), which plots them either according to their identifiers or in a human readable form in three-or one-letter code (the participating atoms can be shown as well) on the y-axis and the time on the x-axis.If the GROMOS input format is used, this function requires two different files: the summary of the hbond program and the time series file.The occurrence of a hydrogen bond is represented by a black bar and the occurrence summary can be added on the right hand side as a sub-plot (Figure 7).In addition to the time series file, depending on the MD engine format used, an additional summary file might also be necessary (see the documentation of the function load_hbond_ts() for further information).

The noe() function
The nuclear-Overhauser-effect is one of the most important measures of structure validity in the context of molecular dynamics simulations.These interactions are transmitted through space and arise from spin-spin coupling, which can be measured by nuclear magnetic resonance (NMR) spectroscopy.These measurements provide pivotal distance restrains which should be matched on average during molecular dynamics simulations of the same system and can hence be used for parameter validation.The plotting function noe() allows to visualise the number of distance restrain violations and their respective spatial deviation.As shown in Figure 8, multiple replicates or different protein systems are supported simultaneously.Note that negative violations are not considered. noe(load_noe(files=c("inst/extdata/noe_example_1.txt.gz","inst/extdata/noe_example_2.txt.gz")),main="MDplot::noe()") Return value: Returns a matrix, in which the first column holds the bin boundaries used and the following columns represent either the percentage or absolute numbers of the violations per bin, depending on the specification.

The ramachandran() function
This graph type (Ramachandran et al., 1963) is often used to show the sampling of the φ/ψ protein backbone dihedral angles in order to assign propensities of secondary structure elements to the protein of interest (so-called Ramachandran plots).These plots can provide crucial insight into energy barriers arising as required, for example, in the context of parameter validation (Margreitter and Oostenbrink, 2016).The function ramachandran() offers a 2D (Figure 9) and 3D (Figure 10) variant with the former offering the possibility to print user-defined secondary structure regions as well.The number of bins for the two axes and the colours used for the legend can be specified by the user.

The rmsd() function
The atom-positional root-mean-square deviation (RMSD) is one of the most commonly used plot types in the field of biophysical simulations.In the context of atom configurations, it is a measure for the positional divergence of one or multiple atoms.The input requires a list of alternating vectors of time indices and RMSD values.Multiple data sets can be plotted, given in separate input files.Figure 11 shows an example for two trajectories. rmsd(load_rmsd(c("inst/extdata/rmsd_example_1.txt.gz","inst/extdata/rmsd_example_2.txt.gz")),printLegend=TRUE,names=c("WT","mut"),main="MDplot::rmsd()") Return value: Returns a list of lists, where each sub-list represents a RMSD curve and contains the components: • minValue The minimum value over the whole time range.
• maxValue The maximum value over the whole time range.
• meanValue The mean value calculated over the whole time range.
• sd The standard deviation calculated over the whole time range.

Argument name Default value Description
rmsdData none List of (alternating) indices and RMSD value vectors, as produced by load_rmsd().printLegend TRUE A Boolean which triggers the plotting of the legend.factor 1000 A number specifying how many snapshots are within one timeUnit.timeUnit "ns" Specifies the time unit.rmsdUnit "nm" Specifies the RMSD unit.colours NA A vector of colours used for plotting.names NA A vector holding the names of the trajectories.legendPosition "bottomright" Indicates the position of the legend: either "bottomright", "bottomleft", "topleft", or "topright".barePlot FALSE A Boolean indicating whether the plot is to be made without any additional information.... none Additional arguments.

The rmsd_average() function
Nowadays, for many molecular systems multiple replicates of simulations are performed in order to enhance the sampling of the phase space.However, since the amount of analysis data grows accordingly, a joint representation of the results may be desirable.For the case of backbone-atom and other RMSD plots, the MDplot package supports average plotting.Instead of plotting every curve individually, the mean and the minimum and maximum values of all trajectories at a given time point is plotted.Thus, the spread of multiple simulations is represented as a 'corridor' over time.
• minimum The minimum RMSD value over all input sources at a given time.
• mean The mean RMSD value over all input sources at a given time.
• maximum The maximum RMSD value over all input sources at a given time.The R Journal Vol.9/1, June 2017 ISSN 2073-4859 The rmsf() function The atom-positional root-mean-square fluctuation (RMSF) represents the degree of positional variation of a given atom over time.The input requires one column with all residues or atoms and a second one holding RMSF values.Figure 13 shows, as an example, the RMSF of the first 75 atoms, calculated for two independent simulations.

The TIcurve() function
For calculations of the free energy difference occurring when transforming one chemical compound into another (alchemical changes) or for estimates of free energy changes upon binding, thermodynamic integration (Kirkwood, 1935) is one of the most trusted and applied approaches.The derivative of the Hamiltonian, as a function of a coupling parameter λ, is calculated over a series of λ state points (typically around 15).The integral of this curve is equivalent to the change in free energy (Figure 14).The function TIcurve() performs the integration and, if the data for both the forward and backward processes are provided, the hysteresis between them.
• integrationresults A matrix containing one row of "deltaG" and "error" columns from the integration for every data input series.
• hysteresis If two (i.e.forward and backward) data input series are provided, the resulting hysteresis is reported (and set to be NA otherwise).
• maxValue The maximum value over the whole set.
• meanValue The mean value over the whole set.
• sd The standard deviation over the whole set.The R Journal Vol.9/1, June 2017 ISSN 2073-4859 The xrmsd() function This function generates a plot which shows a heat-map of the atom positional root-mean-square differences between snapshots (figure 16).The structures are listed on the xand y-axes.The heat-map shows the difference between one structure and another using a coloured bin.The legend is adapted in accordance to the size of the values.The R Journal Vol.9/1, June 2017 ISSN 2073-4859

Figure 2 :
Figure 2:The clusters are plotted along the x-axis and the number of configurations for each trajectory for every cluster on the y-axis.The number of clusters is limited in this example to nine with the clustersNumber argument, which can be useful to omit scarcely populated clusters.
Figure4: Example of dssp() with plotType set to "dots" (default), "curves" or "bars".Note that the fractions do not necessarily sum up to a hundred percent, because some residues might not be in defined secondary structure elements all the time.In this figure, there is no legend plotted due to space limitations (see Figure5for a colour-code explanation).

Figure 5 :
Figure 5: Example showing all of the defined secondary structure elements per residue over time.Note, that for this example plot a sparse data set was used to reduce the size of the data file (hence the large white areas in the middle).

Figure 6 :
Figure 6: The acceptor residues are plotted on the x-axis whilst the donors are shown on the y-axis.The different colours indicate the occurrences throughout the whole trajectory.

Figure 7 :
Figure 7: Example figure generated by hbond_ts() for both an identifier and acceptor residues' selection.The labels for the hydrogen bonds may be printed as identifiers or with names composed of residue names (in single-or three-letter code) and those of the participating atoms.

Figure 9 :
Figure 9: Two-dimensional plot version "sparse" of the ramachandran() function with enabled contour plotting.The number of bins can be specified for both dimensions independently.

Figure 10 :
Figure 10: Three-dimensional example of the ramachandran() function.In addition to the colour, the height (z-axis) also represents the number of dihedrals per bin.

Figure 11 :
Figure 11: This plot shows the RMSD curves for two different trajectories.The time is given in nanoseconds, which requires a properly set factor parameter.

Figure 12 :
Figure 12:In black, the mean RMSD value at a given timepoint and in grey the respective minimum and maximum values are given.In this example, two rather similar curves have been used.
Return value: A list of vectors, alternately holding atom indices and their respective values.

Figure 14 :
Figure14: A forward and backward thermodynamic integration curve with the resulting hysteresis between them (precision as permitted by the error).

Figure 15 :
Figure 15: Shows time series with parameter snapshotsPerTimeInt set in a way such, that the proper time in nanoseconds is plotted.In addition, the legend has been moved to the bottom-right position.

Figure 16 :
Figure 16: An example xrmsd() plot showing only the upper half because of the mirroring of the values.

Table 2 :
Arguments of the clusters() function.
Example plot showing two different replicates of a protein simulation (they share the same molecule, but have different initial velocities).Note, that the maximum value (x-axis) over all replicates is used for the plot.The sum over all violations from left to right is shown by an additional curve on top.The number of violations may be given as fractions (in %), as shown above, or absolute numbers (flag printPercentages either TRUE or FALSE).

Table 8 :
Arguments of the noe() function.

Default value Description xrmsdValues
none Input matrix (three rows: x-values, y-values, RMSD-values).Can be generated by function load_xrmsd().printLegend TRUE If TRUE, a legend is printed on the right hand side.xaxisRange NA A vector of boundaries for the x-snapshots.yaxisRange NA A vector of boundaries for the y-snapshots.