[MDAKit / Hackathon] Trajectory clustering #25

IAlibay · 2023-09-25T00:13:41Z

Overview

It is common to use unsupervised learning algorithms to cluster a
trajectory by similarity to discover different states.

A typical approach is to first calculate the pairwise RMSD between
all frames of the trajectory, forming a $T \times T$ symmetric
RMSD matrix, where $T$ is the number of frames in the
trajectory. However, any other metric (or distance) can be used instead
of RMSD, provided it yields a suitable pairwise (dis)similarity matrix.
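
As a rough illustration of the matrix described above, the following NumPy sketch builds a $T \times T$ RMSD matrix from an array of coordinates. The function name `pairwise_rmsd` and the `(T, N, 3)` input layout are assumptions made for this example, and the frames are assumed to be already superimposed on a common reference (no fitting is performed here).

```python
import numpy as np


def pairwise_rmsd(coords):
    """Return the (T, T) RMSD matrix for coordinates of shape (T, N, 3).

    Assumption: all frames were aligned beforehand; this sketch does
    not perform any rotational or translational fitting.
    """
    coords = np.asarray(coords, dtype=float)
    n_frames = coords.shape[0]
    rmsd = np.zeros((n_frames, n_frames))
    for i in range(n_frames):
        # Displacement of every atom in every frame relative to frame i.
        diff = coords - coords[i]
        # Mean squared displacement over atoms, then the square root.
        rmsd[i] = np.sqrt((diff ** 2).sum(axis=2).mean(axis=1))
    return rmsd
```

The loop keeps memory use at one `(T, N, 3)` temporary per row, which is usually easier to handle for long trajectories than a fully broadcast `(T, T, N, 3)` array.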

One can then use any number of clustering algorithms to partition
the similarity matrix and thus assign different cluster numbers to the
different frames of the trajectory.
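
One possible way to do this with scikit-learn is sketched below, using `DBSCAN` with `metric="precomputed"` so that the pairwise matrix is consumed directly. The wrapper name `cluster_precomputed` and the default `min_samples` value are assumptions for illustration; the choice of DBSCAN is only an example, not a recommendation from the issue.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_precomputed(dist_matrix, eps, min_samples=2):
    """Assign one cluster label per frame from a (T, T) distance matrix.

    Frames that DBSCAN treats as noise receive the label -1.
    """
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    return model.fit_predict(np.asarray(dist_matrix, dtype=float))
```

The returned array has length $T$, so `labels[i]` is the cluster assigned to frame `i`.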

The mdaencore MDAKit for Ensemble Similarity Calculations
also implements some clustering methods (namely, Affinity Propagation and
DBSCAN/KMeans, via scikit-learn; as described
in the docs here).
However, a general-use cluster analysis tool featuring a larger
selection of clustering algorithms would likely be useful to many users.

Clustering methods could include:

scikit-learn clustering contains many clustering algorithms
that can be used either with a similarity matrix or directly with
trajectory data such as coordinates.

The GROMOS clustering algorithm is widely used in biomolecular
simulations [Daura 1999]. (See Issue #2876.)

The following excerpt from Daura et al. describes the algorithm:

"To find clusters of structures in a trajectory the RMSD of atom
positions between all pairs of structures was determined. For each
structure the number of other structures for which the RMSD was
0.1 nm or less (backbone, residues 2–6) for structure 1 or 0.08
nm or less (backbone, residues 2–5) for structure 2 (neighbor
conformations) was calculated. The structure with the highest
number of neighbors was taken as the center of a cluster, and
formed together with all its neighbors a (first) cluster. The
structures of this cluster were thereafter eliminated from the
pool of structures. The process was repeated until the pool of
structures was empty. In this way, a series of nonoverlapping
clusters of structures was obtained."
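
The procedure in the excerpt can be sketched in a few lines of NumPy operating on a precomputed RMSD matrix. This is a minimal illustration rather than a reference implementation: the function name `gromos_cluster` and its return values are assumptions, and ties between candidate centers are broken by taking the lowest frame index, which the excerpt does not specify.

```python
import numpy as np


def gromos_cluster(dist_matrix, cutoff):
    """Cluster frames from a (T, T) distance matrix, GROMOS-style.

    Returns (labels, centers): labels[i] is the cluster index of frame i,
    centers[k] is the frame index of the center of cluster k.
    """
    dist = np.asarray(dist_matrix, dtype=float)
    n = dist.shape[0]
    # neighbours[i, j] is True when frame j lies within the cutoff of
    # frame i (each frame counts as its own neighbour here).
    neighbours = dist <= cutoff
    labels = np.full(n, -1, dtype=int)
    centers = []
    remaining = np.ones(n, dtype=bool)  # the pool of unassigned frames

    while remaining.any():
        # Count neighbours that are still in the pool.
        counts = (neighbours & remaining).sum(axis=1)
        # Frames already removed from the pool can never become centers.
        counts[~remaining] = -1
        center = int(np.argmax(counts))  # ties: lowest index wins
        members = neighbours[center] & remaining
        labels[members] = len(centers)
        centers.append(center)
        # Eliminate the new cluster from the pool and repeat.
        remaining &= ~members

    return labels, centers
```

Because every frame is its own neighbour, each iteration removes at least the chosen center from the pool, so the loop always terminates and the resulting clusters do not overlap.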

Objectives

Create a ClusterAnalysis class that allows the user to run any of
the scikit-learn clustering algorithms that can work on raw
data (such as K-means). Use the AnalysisBase framework to write
the analysis class (see the tutorial on writing your own
trajectory analysis).

Create an MDAKit that makes ClusterAnalysis available.

Implement additional clustering methods such as GROMOS clustering
[Daura 1999] (described above in more detail).

References

X Daura, K Gademann, B Jaun, D Seebach, WF van Gunsteren, and AE Mark. Peptide folding: When simulation meets experiment. Angew. Chem. Int. Ed., 38(1-2):236–240, 1999. doi: 10.1002/(SICI)1521-3773(19990115)38:1/2<236::AID-ANIE236>3.0.CO;2-M.
