Merge branch 'master' into 45-SSI-dens-feat-improvements

Merging master to harmonise code
drorlab · Oct 1, 2023 · 9e94f26
2 parents 9a877b0 + 61e688b
commit 9e94f26

Showing 66 changed files with 24,616 additions and 2,323 deletions.
diff --git a/.github/workflows/python-package-conda.yml b/.github/workflows/python-package-conda.yml
@@ -11,31 +11,20 @@ jobs:
       run:
         shell: bash -l {0}
     steps:
-    - uses: actions/checkout@v2
-    - name: Set up Python 3.7
-      uses: conda-incubator/setup-miniconda@v2
+    - uses: actions/checkout@v3
+    - name: Set up Python 3.9
+      uses: actions/setup-python@v4
       with:
-        activate-environment: anaconda-client-env
-        environment-file: environment.yml
-        python-version: 3.7
-    - name: Add conda to system path
-      run: |
-        # $CONDA is an environment variable pointing to the root of the miniconda directory
-        echo $CONDA/bin >> $GITHUB_PATH
-    - name: Activate conda environment
-      run: |
-        conda activate anaconda-client-env
-        conda info
-        conda list
+        python-version: 3.9
     - name: Lint with flake8
       run: |
-        conda install flake8
+        pip install flake8
         # stop the build if there are Python syntax errors or undefined names
         flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
         # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
         flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
     - name: Test with pytest
       run: |
-        conda install pytest
+        pip install pytest
         pip install -e .
         pytest --ignore pensa/diffnets/tests/test_api.py --ignore pensa/diffnets/tests/test_cli.py --ignore pensa/diffnets/tests/test_diffnets.py
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,5 @@
-# Files generated by examples
+# Files generated by tests/examples
+sc-torsions/
 tutorial/traj/
 tutorial/results/
 tutorial/plots/
@@ -22,6 +23,7 @@ tests/test_data/MOR-*/.*.npz
 *.ipynb
 .DS_Store
 *.npy
+*.npz
 
 # Files from workload manager
 slurm*.out

diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# PENSA - Protein Ensemble Analysis
+# PENSA - Python Ensemble Analysis
 
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4362136.svg)](https://doi.org/10.5281/zenodo.4362136)
 ![Package](https://github.com/drorlab/pensa/workflows/package/badge.svg)
@@ -7,24 +7,22 @@ Status](https://readthedocs.org/projects/pensa/badge/?version=latest)](http://pe
 [![GitHub license](https://img.shields.io/github/license/Naereen/StrapDown.js.svg)](https://github.com/drorlab/pensa/blob/master/LICENSE)
 [![Powered by MDAnalysis](https://img.shields.io/badge/powered%20by-MDAnalysis-orange.svg?logoWidth=16&logo=data:image/x-icon;base64,AAABAAEAEBAAAAEAIAAoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJD+XwCY/fEAkf3uAJf97wGT/a+HfHaoiIWE7n9/f+6Hh4fvgICAjwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACT/yYAlP//AJ///wCg//8JjvOchXly1oaGhv+Ghob/j4+P/39/f3IAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJH8aQCY/8wAkv2kfY+elJ6al/yVlZX7iIiI8H9/f7h/f38UAAAAAAAAAAAAAAAAAAAAAAAAAAB/f38egYF/noqAebF8gYaagnx3oFpUUtZpaWr/WFhY8zo6OmT///8BAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgICAn46Ojv+Hh4b/jouJ/4iGhfcAAADnAAAA/wAAAP8AAADIAAAAAwCj/zIAnf2VAJD/PAAAAAAAAAAAAAAAAICAgNGHh4f/gICA/4SEhP+Xl5f/AwMD/wAAAP8AAAD/AAAA/wAAAB8Aov9/ALr//wCS/Z0AAAAAAAAAAAAAAACBgYGOjo6O/4mJif+Pj4//iYmJ/wAAAOAAAAD+AAAA/wAAAP8AAABhAP7+FgCi/38Axf4fAAAAAAAAAAAAAAAAiIiID4GBgYKCgoKogoB+fYSEgZhgYGDZXl5e/m9vb/9ISEjpEBAQxw8AAFQAAAAAAAAANQAAADcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAjo6Mb5iYmP+cnJz/jY2N95CQkO4pKSn/AAAA7gAAAP0AAAD7AAAAhgAAAAEAAAAAAAAAAACL/gsAkv2uAJX/QQAAAAB9fX3egoKC/4CAgP+NjY3/c3Nz+wAAAP8AAAD/AAAA/wAAAPUAAAAcAAAAAAAAAAAAnP4NAJL9rgCR/0YAAAAAfX19w4ODg/98fHz/i4uL/4qKivwAAAD/AAAA/wAAAP8AAAD1AAAAGwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAALGxsVyqqqr/mpqa/6mpqf9KSUn/AAAA5QAAAPkAAAD5AAAAhQAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADkUFBSuZ2dn/3V1df8uLi7bAAAATgBGfyQAAAA2AAAAMwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB0AAADoAAAA/wAAAP8AAAD/AAAAWgC3/2AAnv3eAJ/+dgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA9AAAA/wAAAP8AAAD/AAAA/wAKDzEAnP3WAKn//wCS/OgAf/8MAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIQAAANwAAADtAAAA7QAAAMAAABUMAJn9gwCe/e0Aj/2LAP//AQAAAAAAAAAA)](https://www.mdanalysis.org)
 
-A collection of Python methods for exploratory analysis and comparison of protein structural ensembles, e.g., from molecular dynamics simulations.
+A collection of Python methods for exploratory analysis and comparison of biomolecular conformational ensembles, e.g., from molecular dynamics simulations.
 All functionality is available as a Python package.  
 
-To get started, see the [__documentation__](https://pensa.readthedocs.io/en/latest/) which includes a tutorial for the PENSA library.
+To get started, see the [__documentation__](https://pensa.readthedocs.io/en/latest/) which includes a tutorial for the PENSA library, or read our [__preprint__](https://arxiv.org/abs/2212.02714).
 
 If you would like to contribute, check out our [__contribution guidelines__](https://github.com/drorlab/pensa/blob/master/CONTRIBUTING.md) and our [__to-do list__](https://github.com/drorlab/pensa/blob/master/TODO.md).
 
 ## Functionality
 
 With PENSA, you can (currently):
-- __compare structural ensembles__ of proteins via the relative entropy of their features, statistical tests, or state-specific information and visualize deviations on a reference structure.
+- __compare structural ensembles__ of biomolecules (proteins, DNA or RNA) via the relative entropy of their features or statistical tests and visualize deviations on a reference structure.
 - project several ensembles on a __joint reduced representation__ using principal component analysis (PCA) or time-lagged independent component analysis (tICA) and sort the structures along the obtained components.
 - __cluster structures across ensembles__ via k-means or regular-space clustering and write out the resulting clusters as trajectories.
 - trace allosteric information flow through a protein using __state-specific information__ analysis methods.
 
-Proteins are featurized via [PyEMMA](http://emma-project.org/latest/) using backbone torsions, sidechain torsions, or backbone C-alpha distances, making PENSA compatible to all functionality available in PyEMMA. In addition, we provide density-based methods to featurize water and ion pockets.
-
-Trajectories are processed and written using [MDAnalysis](https://www.mdanalysis.org/). Plots are generated using [Matplotlib](https://matplotlib.org/).
+Biomolecules can be featurized using backbone torsions, sidechain torsions, or arbitrary distances (e.g., between all backbone C-alpha atoms) and we provide density-based methods to featurize water and ion pockets. PENSA also includes trajectory processing tools based on [MDAnalysis](https://www.mdanalysis.org/) and plotting functions using [Matplotlib](https://matplotlib.org/).
 
 ## Documentation
 PENSA's documentation pages are [here](https://pensa.readthedocs.io/en/latest/), where you find installation instructions, API documentation, and a tutorial.
@@ -35,7 +33,7 @@ For the most common applications, example [Python scripts](https://github.com/dr
 #### Demo on Google Colab
 We demonstrate how to use the PENSA library in an interactive and animated example on Google Colab, where we use freely available simulations of a mu-Opioid Receptor from GPCRmd.
 
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1difJjlcwpN-0hSmGCGrPq9Cxq5wJ7ZDa)
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1msHB6uGeu2tBw_MnAFFTxcxeW4RnR0is)
 
 
 ## Citation
@@ -46,15 +44,22 @@ Martin Vögele, Neil Thomson, Sang Truong, Jasper McAvity. (2021). PENSA. Zenodo
 ```
 To get the citation and DOI for a particular version, see [Zenodo](https://zenodo.org/record/4362136).
 
+Please also consider citing our [preprint](https://arxiv.org/abs/2212.02714):
+```
+Systematic Analysis of Biomolecular Conformational Ensembles with PENSA
+M. Vögele, N. J. Thomson, S. T. Truong, J. McAvity, U. Zachariae, R. O. Dror
+arXiv:2212.02714 [q-bio.BM] 2022
+```
+
 
 ## Acknowledgments
 
 #### Contributors
 Martin Vögele, Neil Thomson, Sang Truong, Jasper McAvity
 
 #### Beta-Testers
-Alex Powers, Lukas Stelzl, Nicole Ong, Eleanore Ocana, Callum Ives
+Alexander Powers, Lukas Stelzl, Nicole Ong, Eleanore Ocana, Emma Andrick, Callum Ives, and Bu Tran
 
 #### Funding & Support 
-This project was started by Martin Vögele at Stanford University, supported by an EMBO long-term fellowship (ALTF 235-2019), as part of the INCITE computing project 'Enabling the Design of Drugs that Achieve Good Effects Without Bad Ones' (BIP152).
+This project was started by Martin Vögele at Stanford University, supported by an EMBO long-term fellowship (ALTF 235-2019), as part of the INCITE computing project 'Enabling the Design of Drugs that Achieve Good Effects Without Bad Ones' (BIP152). Neil Thomson was supported by a BBSRC EASTBIO PhD studentship and Jasper McAvity by the Stanford Computer Science department via the CURIS program. Stanford University, the Stanford Research Computing Facility, and the University of Dundee provided additional computational resources and support that contributed to these research results.
 
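Reviewer note on the README change above: the updated text says ensembles are compared "via the relative entropy of their features". The following is a minimal, standard-library sketch of one such measure, the Jensen-Shannon distance between two histogrammed feature distributions. It is not PENSA's implementation; the function names `normalized_histogram` and `jensen_shannon_distance` and the `bins` parameter are hypothetical and chosen only for illustration.

```python
import math


def normalized_histogram(samples, lo, hi, bins):
    """Histogram samples on [lo, hi] and normalize the counts to sum to 1."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in samples:
        # Clamp so that x == hi falls into the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(samples) for c in counts]


def jensen_shannon_distance(samples_a, samples_b, bins=60):
    """Jensen-Shannon distance (base 2, range 0..1) between two 1D samples.

    Illustrative helper, not part of the PENSA API.
    """
    lo = min(min(samples_a), min(samples_b))
    hi = max(max(samples_a), max(samples_b))
    if hi == lo:
        # All values identical in both ensembles: no difference to measure.
        return 0.0
    p = normalized_histogram(samples_a, lo, hi, bins)
    q = normalized_histogram(samples_b, lo, hi, bins)
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; bins with x == 0 contribute nothing.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))


# Example: the same feature (e.g. one torsion angle) sampled in two ensembles.
ensemble_a = [0.0, 0.5, 1.0]
ensemble_b = [2.0, 2.5, 3.0]
distance = jensen_shannon_distance(ensemble_a, ensemble_b, bins=6)
```

Using the mixture distribution `m` keeps every logarithm finite even when one ensemble never visits a bin, which is why this symmetric measure is often preferred over the plain Kullback-Leibler divergence for sparse histograms.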
diff --git a/TODO.md b/TODO.md
@@ -1,19 +1,9 @@
 ### In Progress
-
 - [ ] Tests
   - [x] Workflow test with example data
   - [ ] Trivial examples for each function
   - [ ] Unit tests for SSI 
   - [ ] Unit tests for density features
-- [ ] Integrate [DiffNets](https://doi.org/10.1101/2020.07.01.182725).
-  - [x] Lay out module structure in separate branch.
-  - [x] Copy core network from DiffNets repo.
-  - [ ] Try to use existing featurization.
-  - [ ] Include existing DiffNets featurization and compare.
-- [ ] exploratory analysis via correlation coefficients of the features
-  - [x] First tests --> not very promising.
-  - [ ] Try [different metric](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.correlation.html)
-  - [ ] Find useful application or leave it out.
 - [ ] Unified tutorial in documentation. Make one page for each subpackage
   - [x] preprocessing
     - [x] coordinates
@@ -28,15 +18,12 @@
   - [x] SSI
 
 ### Plans
-- [ ] Try using MDAnalysis instead of biotite for water featurization
-- [ ] Integrate more options for features from PyEMMA (think carefully about how to make it more flexible)
 - [ ] More example tcl scripts for VMD 
-- [ ] Facilitate calculation of JSD etc. on principal components
-- [ ] Facilitate calculation of SSI on results of joint clustering.
-- [ ] Weighted PCA/tICA? (to account for varying simulation lengths or uncertainty) 
 - [ ] Feature comparison of more than two ensembles
   - [ ] with respect to the joint ensemble (all metrics)
   - [ ] with respect to a reference ensemble (will not always work for KLD)
+- [ ] Use MDAnalysis instead of biotite for water featurization
+- [ ] Weighted PCA/tICA? (to account for varying simulation lengths or uncertainty) 
 - [ ] Implement T-distributed Stochastic Neighbor Embedding (t-SNE)
   - [ ] Read up on [t-SNE for molecular trajectories](https://www.frontiersin.org/articles/10.3389/fmolb.2020.00132/full)
   - [ ] See if we can import or adapt [existing code](https://github.com/spiwokv/tltsne).
@@ -49,17 +36,13 @@
   - [ ] First tests
   - [ ] write module
   - [ ] write unit tests
-- [ ] Put shared functionality of PCA and TICA into shared functions.
 - [ ] Make file format (png/pdf?) for matplotlib optional.
 - [ ] Implement [Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis).
 - [ ] Implement [Non-Negative Matrix Factorization](https://onlinelibrary.wiley.com/doi/10.1002/env.3170050203).
-- [ ] Implement nucleic acid torsions and pseudo-torsions, as reviewed [Keating et al.](https://www.cambridge.org/core/journals/quarterly-reviews-of-biophysics/article/new-way-to-see-rna/2A2D428A5FAB150D2488A5A1D87007BD) and as used in [x3DNA](https://x3dna.org/highlights/pseudo-torsions-to-simplify-the-representation-of-dna-rna-backbone-conformation) or [Barnaba](https://rnajournal.cshlp.org/content/25/2/219) ([Barnaba code on GitHub](https://github.com/srnas/barnaba))
 
 ### Ideas
 - [ ] Logo
-- [ ] Hydrogen bonds as features
 - [ ] Contacts as features 
-  - [ ] can PyEMMA do this?
   - [ ] Think about a [GetContacts](https://getcontacts.github.io/) reader
 - [ ] Position deviations as features (similar to components of RMSD)
 - [ ] Estimate thresholds for significance of feature differences
@@ -68,10 +51,6 @@
   - [ ] modify p-value of KS test using number of simulation runs per ensemble
 - [ ] Wasserstein distance to compare ensembles
 - [ ] Add option to whiten features
-- [ ] Featurizers for other molecule types
-  - [ ] ligands
-  - [ ] lipids
-  - [ ] nucleic acids
 - [ ] Account for [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction) in comparison.
 - [ ] Implement conformational entropy calculations
   - [ ] Read papers, e.g, [1](https://www.pnas.org/content/111/43/15396), [2](https://www.mdpi.com/2079-3197/6/1/21/htm), [3](https://pubs.acs.org/doi/10.1021/acs.jcim.0c01375)
@@ -105,11 +84,21 @@
 - [x] Slack channel for all developers and testers, and to provide support for the user community.
 - [x] Implement clustering in principal component space
 - [x] Option to write and load features as CSV file.
+- [x] Implement nucleic acid torsions and pseudo-torsions, as reviewed [Keating et al.](https://www.cambridge.org/core/journals/quarterly-reviews-of-biophysics/article/new-way-to-see-rna/2A2D428A5FAB150D2488A5A1D87007BD) and as used in [x3DNA](https://x3dna.org/highlights/pseudo-torsions-to-simplify-the-representation-of-dna-rna-backbone-conformation) or [Barnaba](https://rnajournal.cshlp.org/content/25/2/219) ([Barnaba code on GitHub](https://github.com/srnas/barnaba))
+- [x] Hydrogen bonds as features
+- [x] Use MDAnalysis instead of PyEMMA to read features (to avoid mmshare dependency).
+- [x] Use scikit-learn or [Deeptime](https://deeptime-ml.github.io/latest/index.html) instead of PyEMMA for clustering.
+- [x] Use scikit-learn or [Deeptime](https://deeptime-ml.github.io/latest/index.html) instead of PyEMMA for dimensionality reduction.
+- [x] exploratory analysis via correlation coefficients of the features
 
 ### Abandoned
-
 - [ ] Frame classification via CNN on features
   - [x] Prototype to classify simulation frames --> Diffnets probably more powerful.
   - [ ] Interpret weights as relevance of features
   - [ ] Write module
   - [ ] Write unit tests
+- [ ] Integrate [DiffNets](https://doi.org/10.1101/2020.07.01.182725).
+  - [x] Lay out module structure in separate branch.
+  - [x] Copy core network from DiffNets repo.
+  - [ ] Try to use existing featurization.
+  - [ ] Include existing DiffNets featurization and compare.
diff --git a/docs/conf.py b/docs/conf.py
@@ -21,7 +21,7 @@
 # -- Project information -----------------------------------------------------
 
 project = 'PENSA'
-copyright = '2020-2021, Martin Vögele, Neil Thomson, Sang Truong'
+copyright = '2020-2023, Martin Vögele, Neil Thomson, Sang Truong'
 author = 'Martin Vögele, Neil Thomson, Sang Truong'
 
 
@@ -64,12 +64,10 @@
 autodoc_mock_imports = [
     'numpy', 
     'scipy',
+    'pandas',
     'matplotlib',
-    'mdtraj',
-    'pyemma',
-    'mdshare',
+    'deeptime',
     'MDAnalysis',
-    'cython',
     'biotite'
 ]
 

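Reviewer note on the `docs/conf.py` hunk above: applying the added and removed lines to the context lines shown gives the list below. This is a reconstruction from the diff as displayed, assuming the hunk shows the complete list; it is not copied from the repository.

```python
# Reconstructed state of autodoc_mock_imports after the hunk is applied.
# mdtraj, pyemma, mdshare and cython are dropped; pandas and deeptime are added.
autodoc_mock_imports = [
    'numpy',
    'scipy',
    'pandas',
    'matplotlib',
    'deeptime',
    'MDAnalysis',
    'biotite'
]
```

The list mirrors the new dependency set in `environment.yml`, so the documentation build can import the package without installing these libraries.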
diff --git a/docs/contribute.rst b/docs/contribute.rst
@@ -7,12 +7,12 @@ We are always happy to help and to hear about your work and your success stories
 Report a bug or request a feature
 ***********************************
 
-PENSA is open-source and available on `Github <https://github.com/drorlab/pensa>`_. Please submit issues or requests using the `issue tracker <https://github.com/drorlab/pensa/issues>`_.
+PENSA is open-source and available on `GitHub <https://github.com/drorlab/pensa>`_. Please submit issues or requests using the `issue tracker <https://github.com/drorlab/pensa/issues>`_.
 
 Add new functionality 
 ***********************************
 
-We welcome any kind of contributions to improve or expand the PENSA code. In particular, we are interested in readers for new feature types and new ways to analyze and compare structural ensembles. PENSA is maintained on `Github <https://github.com/drorlab/pensa>`_ so you can fork it and create a pull request. For guidance, see our `contribution guidelines <https://github.com/drorlab/pensa/blob/master/CONTRIBUTING.md>`_. Please make sure to properly test your contribution before the request. For large or complicated contributions, please get in contact so we can coordinate them with you. 
+We welcome any kind of contributions to improve or expand the PENSA code. In particular, we are interested in readers for new feature types and new ways to analyze and compare structural ensembles. PENSA is maintained on `GitHub <https://github.com/drorlab/pensa>`_ so you can fork it and create a pull request. For guidance, see our `contribution guidelines <https://github.com/drorlab/pensa/blob/master/CONTRIBUTING.md>`_. Please make sure to properly test your contribution before the request. For large or complicated contributions, please get in contact so we can coordinate them with you. 
 
 We explain two of the most common cases below:
 

diff --git a/environment.yml b/environment.yml
@@ -3,8 +3,11 @@ channels:
   - conda-forge
   - defaults
 dependencies: 
-  - python==3.7
-  - mdtraj==1.9.3 
-  - mdshare 
-  - pyemma 
+  - python==3.9
+  - scipy>=1.2
+  - numpy
+  - pandas
+  - matplotlib
   - MDAnalysis 
+  - deeptime
+  - biotite
diff --git a/pensa/__init__.py b/pensa/__init__.py
@@ -2,6 +2,6 @@
 from .features import *
 from .statesinfo import *
 from .clusters import *
-from .comparison import *  
+from .comparison import *
 from .dimensionality import *