[MDAKit / Hackathon] Trajectory clustering #25

Open
IAlibay opened this issue Sep 25, 2023 · 0 comments

Overview

It is common to use unsupervised learning algorithms to cluster a
trajectory by similarity to discover different states.

A typical approach is to first calculate the pairwise RMSD between all
frames of the trajectory, forming a $T \times T$ symmetric RMSD matrix
where $T$ is the number of frames in the trajectory. However, any other
metric (or distance) can be used instead of RMSD, as long as it
produces a suitable similarity matrix.
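
As a minimal sketch of the matrix described above, the following NumPy snippet builds the $T \times T$ RMSD matrix from an array of coordinates. It assumes the frames have already been superimposed onto a common reference (no fitting is done here). The function name `pairwise_rmsd` and the `(T, N, 3)` array layout are illustrative choices, not part of any existing API.

```python
import numpy as np


def pairwise_rmsd(coords):
    """Return the (T, T) RMSD matrix for coordinates of shape (T, N, 3).

    Assumption: the frames are already aligned; no superposition is done.
    """
    coords = np.asarray(coords, dtype=float)
    n_atoms = coords.shape[1]
    # Difference between every pair of frames, shape (T, T, N, 3).
    # Memory grows as T^2 * N, so this is only suitable for small inputs.
    diff = coords[:, None, :, :] - coords[None, :, :, :]
    # Mean squared displacement over atoms, then square root.
    return np.sqrt((diff ** 2).sum(axis=(2, 3)) / n_atoms)
```

The result is symmetric with a zero diagonal, and it is expressed in the same length units as the input coordinates.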

One can then use any number of clustering algorithms to partition
the similarity matrix and thus assign different cluster numbers to the
different frames of the trajectory.

The mdaencore MDAKit for Ensemble Similarity Calculations
also implements some clustering methods (namely, Affinity Propagation and
DBSCAN/KMeans, via scikit-learn; as described
in its documentation).
However, a general-use cluster analysis tool featuring a larger
selection of clustering algorithms would likely be useful to many users.

Clustering methods could include:

  • scikit-learn clustering provides many algorithms that can be used
    either with a similarity matrix or directly with trajectory data
    such as coordinates.

  • The GROMOS clustering algorithm is widely used in biomolecular
    simulations [Daura 1999]. (See Issue
    #2876.)

    The following excerpt from Daura et al. describes the algorithm:

    "To find clusters of structures in a trajectory the RMSD of atom
    positions between all pairs of structures was determined. For each
    structure the number of other structures for which the RMSD was
    0.1 nm or less (backbone, residues 2–6) for structure 1 or 0.08
    nm or less (backbone, residues 2–5) for structure 2 (neighbor
    conformations) was calculated. The structure with the highest
    number of neighbors was taken as the center of a cluster, and
    formed together with all its neighbors a (first) cluster. The
    structures of this cluster were thereafter eliminated from the
    pool of structures. The process was repeated until the pool of
    structures was empty. In this way, a series of nonoverlapping
    clusters of structures was obtained."
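
The procedure in the excerpt can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not a reference implementation: the function name `gromos_cluster`, its return values, and the tie-breaking rule (lowest frame index wins) are choices made here, and `cutoff` must be given in the same units as the distance matrix.

```python
import numpy as np


def gromos_cluster(dist, cutoff):
    """Cluster frames from a (T, T) distance matrix, following Daura et al.

    Returns (labels, centers): labels[i] is the cluster index of frame i,
    centers[k] is the frame index chosen as the centre of cluster k.
    """
    if cutoff < 0:
        raise ValueError("cutoff must be non-negative")
    dist = np.asarray(dist, dtype=float)
    n_frames = dist.shape[0]
    neighbours = dist <= cutoff              # True where RMSD is within cutoff
    in_pool = np.ones(n_frames, dtype=bool)  # frames not yet assigned
    labels = np.full(n_frames, -1, dtype=int)
    centers = []
    while in_pool.any():
        # Count neighbours that are still in the pool.
        counts = (neighbours & in_pool).sum(axis=1)
        counts[~in_pool] = -1                # assigned frames cannot be centres
        center = int(np.argmax(counts))      # ties go to the lowest frame index
        members = neighbours[center] & in_pool
        members[center] = True               # the centre belongs to its cluster
        labels[members] = len(centers)
        centers.append(center)
        in_pool &= ~members                  # remove the cluster from the pool
    return labels, centers
```

Because each cluster is removed from the pool before the next centre is chosen, the resulting clusters do not overlap, and clusters are numbered in the order they are found.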

Objectives

  • Create a ClusterAnalysis class that allows the user to run any of
    the scikit-learn clustering algorithms that can work on raw
    data (such as K-means). Use the AnalysisBase framework to write
    the analysis class (see the tutorial on writing your own
    trajectory analysis).
  • Create an MDAKit that makes ClusterAnalysis available.
  • Implement additional clustering methods such as GROMOS clustering
    [Daura 1999] (described above in more detail).

References

  1. X Daura, K Gademann, B Jaun, D Seebach, WF van Gunsteren, and AE Mark. Peptide folding: When simulation meets experiment. Angew. Chem. Int. Ed., 38(1-2):236–240, 1999. doi: 10.1002/(SICI)1521-3773(19990115)38:1/2<236::AID-ANIE236>3.0.CO;2-M.