This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

MSMBuilder 3

kyleabeauchamp edited this page Feb 28, 2013 · 26 revisions

Abstract

This document is a collaborative wiki page for brainstorming the direction of new development efforts towards MSMBuilder 3.

Overall Design Targets

  • Enhanced usability
    • Robust
    • Informative error messages
    • Sanity checking of inputs in library functions
    • Assume inexperienced users
  • Broaden Applicability
    • Not just protein folding
    • Conformational change / protein ligand binding
  • Better error analysis
    • Bayesian methods?
    • Add script for Nina's method
  • Better integration
    • MSMAccelerator
    • MD Codes
    • Folding@home
  • New visualization tools
    • Plot eigenvector components by state index?
    • Plot eigenvector components by observables?
    • Generally: Now that I've built it, what does this MSM mean?
    • Can we interact directly with OpenMM or PyMol to provide visualization tools?
    • At the very least, output an MSM trajectory since many users know how to analyze that already
  • Clean up code
    • Rewrite the Trajectory code.
      • "Smart" slicing and keeping track of the time index of every frame.
      • Use new PDB reader
      • Potentially use pandas
    • DRY the scripts. To the extent that they are "thin" wrappers for library functions, can they be auto-generated? (issue 159)
    • Rewrite all of the non-RMSD C code (asa, rg, etc.) in Cython.
  • Representations instead of metrics
    • Many users want to be able to calculate things like dihedral angles or native contacts
    • Currently the easiest way to do this is non-intuitive (metric.prepare_trajectory)
  • Improve performance and scalability
    • New datasets are becoming expensive to analyze; we need to be able to scale up efficiently
    • This includes improvements in:
      • Parallelization
      • Memory management
      • New algorithms (e.g. streaming clustering)
  • Improve documentation
    • Bring the LaTeX tutorial onto readthedocs.org
    • Increase the amount of narrative content on readthedocs.org
  • New test systems and benchmarks
    • Muller potential (?)
    • BPTI (?)
  • Better understand how others are using MSMBuilder
    • Send out updates about releases/new features
    • Create a direct way to ask what users need/want in MSMBuilder
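
One way to picture the "smart" slicing item under Clean up code is a container that stores a time stamp next to each frame, so that slices keep their original times. This is only a sketch: `TimedTrajectory` and its attributes are hypothetical names, not part of MSMBuilder, and the frames are stand-in strings rather than coordinates. The constructor also shows the kind of input sanity check with an informative message listed under Enhanced usability.

```python
class TimedTrajectory:
    """Hypothetical container pairing each frame with its simulation time."""

    def __init__(self, frames, times):
        frames, times = list(frames), list(times)
        # Sanity-check inputs up front and say exactly what went wrong.
        if len(frames) != len(times):
            raise ValueError(
                "got %d frames but %d time stamps; they must match"
                % (len(frames), len(times))
            )
        self.frames = frames
        self.times = times

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, key):
        # Slicing returns a new trajectory whose times travel with the frames.
        if isinstance(key, slice):
            return TimedTrajectory(self.frames[key], self.times[key])
        # Integer indexing returns a single (frame, time) pair.
        return self.frames[key], self.times[key]


traj = TimedTrajectory(["f0", "f1", "f2", "f3", "f4"], [0.0, 0.5, 1.0, 1.5, 2.0])
strided = traj[::2]  # every second frame, with its original time stamp
```

After striding, `strided.times` still holds the times of the frames that were kept, so the time index is not lost when a trajectory is subsampled.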

Specific Steps

  • DONE: Issue #156 || Create an MSMBuilder-users email list

  • Issue #163 || Change how cluster centers are stored on disk. Instead of saving the coordinates, save the trajectory and frame indices of each center. As we move towards a more "feature"-centric approach with dimensionality reduction, this avoids an inverse problem: if the centers are saved to disk only in the space they were clustered in, their all-atom coordinates cannot be recovered.

  • Goals for the PDB and Trajectory readers (see various open and closed issues): PDB IO should preserve all residue numbering found in PDBs (issue #67), work with single-atom PDB files if necessary (issue #6), and work with multi-frame PDBs (issue #31). Timestep metadata should be retained (issue #58). See also issue #41.
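
The index-based storage proposed in issue #163 could be sketched as follows. This is an illustration under assumptions, not a decided file format: the JSON layout and the function names `save_center_indices`, `load_center_indices`, and `lookup_centers` are hypothetical, and the trajectories are toy lists of labels standing in for all-atom frames.

```python
import json
import os
import tempfile


def save_center_indices(path, centers):
    """Write cluster centers as [trajectory index, frame index] pairs."""
    pairs = [[int(t), int(f)] for t, f in centers]
    with open(path, "w") as fh:
        json.dump({"centers": pairs}, fh)


def load_center_indices(path):
    """Read the (trajectory index, frame index) pairs back from disk."""
    with open(path) as fh:
        return [tuple(pair) for pair in json.load(fh)["centers"]]


def lookup_centers(trajectories, centers):
    """Recover the full frame for each center from the original data."""
    return [trajectories[t][f] for t, f in centers]


# Toy data: two "trajectories" whose frames are just labels.
trajectories = [["t0f0", "t0f1", "t0f2"], ["t1f0", "t1f1"]]
centers = [(0, 2), (1, 0)]

path = os.path.join(tempfile.mkdtemp(), "centers.json")
save_center_indices(path, centers)
restored = load_center_indices(path)
frames = lookup_centers(trajectories, restored)
```

Because only indices are stored, the same file can be used to pull the centers out in any representation, including the original all-atom coordinates.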

Questions

  • Should all of the scripts be accessible under a single command? (issue 159)

    • $ msmbuilder cluster
    • $ msmbuilder setup
    • $ msmbuilder assign
    • RTM prefers yes.
    • KAB: I'm open to the idea, not 100% sure. It would be cute to have everything accessible from one place, but I want to be sure that navigating the text-based menu system is clean. For example, one thing I don't like about Tinker and Gromacs is that their menus are sometimes just a "helpstring" and other times a user prompt. Another cute idea might be to print out the Python code for every action that is done via the Script interface--a way to help ease users into a more advanced interface.
    • CRS: I like this idea; we could even shorten it to msmb. This provides an easy way to see the possible scripts. @KAB: that's an interesting idea, but it could get verbose very quickly. Perhaps we could print something like "Counting transitions (msmbuilder.MSMLib.get_counts_from_assignments(...))" so the user could slowly learn that interacting with the library isn't too difficult.
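
A single entry point with subcommands could be sketched with the standard library's `argparse`. The subcommand names come from the list above; the help strings, the `COMMANDS` table, the placeholder library-call strings, and the `announce` helper are assumptions made for illustration, and real handlers would do actual work instead of only printing.

```python
import argparse

# Subcommand -> (description, library call to advertise).
# The call strings are placeholders, not MSMBuilder's real API.
COMMANDS = {
    "setup": ("Setting up project", "<library call for setup>"),
    "cluster": ("Clustering conformations", "<library call for cluster>"),
    "assign": ("Assigning frames", "<library call for assign>"),
}


def announce(description, library_call):
    """Format a progress line that also names the library call behind it."""
    return "%s (%s)" % (description, library_call)


def build_parser():
    parser = argparse.ArgumentParser(prog="msmbuilder")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True  # running with no subcommand is an error
    for name, (description, _call) in sorted(COMMANDS.items()):
        subparsers.add_parser(name, help=description)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    description, call = COMMANDS[args.command]
    line = announce(description, call)
    print(line)
    return line
```

With this layout, `msmbuilder --help` lists every available script in one place, and each action can print a single line in the format CRS suggests rather than the full Python code.
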
  • Is it time to re-work our "best practices" to better include conformational change? tICA? The MSMB3 paper might be a nice place to do a quantitative comparison of metrics and clustering algorithms, with the goal of providing a "recipe" for various situations (folding, conformational change, etc.).

    • CRS: I think to do this well, we need to know what the recipe is, and so we need a developer to apply MSMB to new problems. I can try tICA with BPTI and GPCR from Shaw's group, perhaps.
    • CRS: Also, as for tICA, there are still some open questions that need to be resolved before we can really recommend it to others:
      • Can we calculate the tICs from Folding@home datasets (i.e. many short(ish) trajectories)?
      • Can we determine the best correlation lag time for a given dataset?
      • Are we throwing out too much by only picking 8 degrees of freedom?
        • For instance, when I build an MSM trajectory the frames seem very different
  • Should we reorganize the directory layout of the source code? Currently we have two main directories, src/python and src/ext. Projects like numpy or scipy are instead organized by logical grouping (package/subpackage), not by the language the code is written in. This is, in RTM's opinion, related to the idea of bringing the current C code into Cython and making it more "public" and robust.

    • CRS: I think this is a good idea. We already have the metrics directory, and we will have more as we add new clustering algorithms / error analysis / visualization tools.
  • How should we handle trajectory loading? Especially as we move towards a more "feature"ized representation, should we save those features to disk as opposed to reloading them every time? How can the metric object coordinate trajectory loading so that we only load the atoms/regions that we care about and save memory?

  • Currently, the main class that is used over and over again to actually do something is the distance metric. I (CRS) think the better approach is to make the main class a representation (the name could be different [feature_vector, etc.]). Then, instead of having a metric on dihedral angles, you would just use the Vectorized metric on the dihedral representation. This is more intuitive for users who want to interact with their data. For instance, to calculate the dihedral angles of a protein trajectory (traj), we currently need to do this:

      dihedral_metric = msmbuilder.metrics.Dihedral(angles='phi/psi')
      dih_traj = dihedral_metric.prepare_trajectory(traj)
    
  • It would be more intuitive to do this:

      dihedral_representation = msmbuilder.representations.Dihedral(angles='phi/psi')
      dih_traj = dihedral_representation.get_representation(traj)
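
The split between a representation and a generic vector metric could be sketched like this. Everything here is hypothetical: `Representation`, `PairDistances`, and `VectorizedMetric` are illustrative names, and a toy atom-pair distance representation stands in for dihedral angles so that the example stays self-contained.

```python
import math


class Representation:
    """Hypothetical base class: maps every frame to a feature vector."""

    def featurize(self, frame):
        raise NotImplementedError

    def get_representation(self, traj):
        return [self.featurize(frame) for frame in traj]


class PairDistances(Representation):
    """Toy representation: distances between selected atom pairs."""

    def __init__(self, pairs):
        self.pairs = pairs

    def featurize(self, frame):
        return [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(frame[i], frame[j])))
            for i, j in self.pairs
        ]


class VectorizedMetric:
    """Euclidean distance on whatever vectors a representation produces."""

    def __init__(self, representation):
        self.representation = representation

    def prepare_trajectory(self, traj):
        # The metric delegates featurization instead of owning it.
        return self.representation.get_representation(traj)

    @staticmethod
    def distance(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Two frames, three atoms each, as (x, y, z) tuples.
traj = [
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)],
    [(0.0, 0.0, 0.0), (6.0, 8.0, 0.0), (0.0, 0.0, 1.0)],
]
representation = PairDistances(pairs=[(0, 1), (0, 2)])
features = representation.get_representation(traj)

metric = VectorizedMetric(representation)
separation = metric.distance(features[0], features[1])
```

Users who only want the features call the representation directly, while clustering code wraps the same object in a metric, so no metric-specific method is needed just to compute observables.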
    