Commit

Merge branch 'master' into 45-SSI-dens-feat-improvements
Merging master to harmonise code
NeilJ-Thomson committed Oct 1, 2023
2 parents 9a877b0 + 61e688b commit 9e94f26
Showing 66 changed files with 24,616 additions and 2,323 deletions.
23 changes: 6 additions & 17 deletions .github/workflows/python-package-conda.yml
@@ -11,31 +11,20 @@ jobs:
run:
shell: bash -l {0}
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.7
uses: conda-incubator/setup-miniconda@v2
- uses: actions/checkout@v3
- name: Set up Python 3.9
uses: actions/setup-python@v4
with:
activate-environment: anaconda-client-env
environment-file: environment.yml
python-version: 3.7
- name: Add conda to system path
run: |
# $CONDA is an environment variable pointing to the root of the miniconda directory
echo $CONDA/bin >> $GITHUB_PATH
- name: Activate conda environment
run: |
conda activate anaconda-client-env
conda info
conda list
python-version: 3.9
- name: Lint with flake8
run: |
conda install flake8
pip install flake8
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
conda install pytest
pip install pytest
pip install -e .
pytest --ignore pensa/diffnets/tests/test_api.py --ignore pensa/diffnets/tests/test_cli.py --ignore pensa/diffnets/tests/test_diffnets.py
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,4 +1,5 @@
# Files generated by examples
# Files generated by tests/examples
sc-torsions/
tutorial/traj/
tutorial/results/
tutorial/plots/
@@ -22,6 +23,7 @@ tests/test_data/MOR-*/.*.npz
*.ipynb
.DS_Store
*.npy
*.npz

# Files from workload manager
slurm*.out
25 changes: 15 additions & 10 deletions README.md
@@ -1,4 +1,4 @@
# PENSA - Protein Ensemble Analysis
# PENSA - Python Ensemble Analysis

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4362136.svg)](https://doi.org/10.5281/zenodo.4362136)
![Package](https://github.com/drorlab/pensa/workflows/package/badge.svg)
@@ -7,24 +7,22 @@ Status](https://readthedocs.org/projects/pensa/badge/?version=latest)](http://pe
[![GitHub license](https://img.shields.io/github/license/Naereen/StrapDown.js.svg)](https://github.com/drorlab/pensa/blob/master/LICENSE)
[![Powered by MDAnalysis](https://img.shields.io/badge/powered%20by-MDAnalysis-orange.svg?logoWidth=16&logo=data:image/x-icon;base64,AAABAAEAEBAAAAEAIAAoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJD+XwCY/fEAkf3uAJf97wGT/a+HfHaoiIWE7n9/f+6Hh4fvgICAjwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACT/yYAlP//AJ///wCg//8JjvOchXly1oaGhv+Ghob/j4+P/39/f3IAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJH8aQCY/8wAkv2kfY+elJ6al/yVlZX7iIiI8H9/f7h/f38UAAAAAAAAAAAAAAAAAAAAAAAAAAB/f38egYF/noqAebF8gYaagnx3oFpUUtZpaWr/WFhY8zo6OmT///8BAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgICAn46Ojv+Hh4b/jouJ/4iGhfcAAADnAAAA/wAAAP8AAADIAAAAAwCj/zIAnf2VAJD/PAAAAAAAAAAAAAAAAICAgNGHh4f/gICA/4SEhP+Xl5f/AwMD/wAAAP8AAAD/AAAA/wAAAB8Aov9/ALr//wCS/Z0AAAAAAAAAAAAAAACBgYGOjo6O/4mJif+Pj4//iYmJ/wAAAOAAAAD+AAAA/wAAAP8AAABhAP7+FgCi/38Axf4fAAAAAAAAAAAAAAAAiIiID4GBgYKCgoKogoB+fYSEgZhgYGDZXl5e/m9vb/9ISEjpEBAQxw8AAFQAAAAAAAAANQAAADcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAjo6Mb5iYmP+cnJz/jY2N95CQkO4pKSn/AAAA7gAAAP0AAAD7AAAAhgAAAAEAAAAAAAAAAACL/gsAkv2uAJX/QQAAAAB9fX3egoKC/4CAgP+NjY3/c3Nz+wAAAP8AAAD/AAAA/wAAAPUAAAAcAAAAAAAAAAAAnP4NAJL9rgCR/0YAAAAAfX19w4ODg/98fHz/i4uL/4qKivwAAAD/AAAA/wAAAP8AAAD1AAAAGwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAALGxsVyqqqr/mpqa/6mpqf9KSUn/AAAA5QAAAPkAAAD5AAAAhQAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADkUFBSuZ2dn/3V1df8uLi7bAAAATgBGfyQAAAA2AAAAMwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB0AAADoAAAA/wAAAP8AAAD/AAAAWgC3/2AAnv3eAJ/+dgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA9AAAA/wAAAP8AAAD/AAAA/wAKDzEAnP3WAKn//wCS/OgAf/8MAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIQAAANwAAADtAAAA7QAAAMAAABUMAJn9gwCe/e0Aj/2LAP//AQAAAAAAAAAA)](https://www.mdanalysis.org)

A collection of Python methods for exploratory analysis and comparison of protein structural ensembles, e.g., from molecular dynamics simulations.
A collection of Python methods for exploratory analysis and comparison of biomolecular conformational ensembles, e.g., from molecular dynamics simulations.
All functionality is available as a Python package.

To get started, see the [__documentation__](https://pensa.readthedocs.io/en/latest/) which includes a tutorial for the PENSA library.
To get started, see the [__documentation__](https://pensa.readthedocs.io/en/latest/) which includes a tutorial for the PENSA library, or read our [__preprint__](https://arxiv.org/abs/2212.02714).

If you would like to contribute, check out our [__contribution guidelines__](https://github.com/drorlab/pensa/blob/master/CONTRIBUTING.md) and our [__to-do list__](https://github.com/drorlab/pensa/blob/master/TODO.md).

## Functionality

With PENSA, you can (currently):
- __compare structural ensembles__ of proteins via the relative entropy of their features, statistical tests, or state-specific information and visualize deviations on a reference structure.
- __compare structural ensembles__ of biomolecules (proteins, DNA or RNA) via the relative entropy of their features or statistical tests and visualize deviations on a reference structure.
- project several ensembles on a __joint reduced representation__ using principal component analysis (PCA) or time-lagged independent component analysis (tICA) and sort the structures along the obtained components.
- __cluster structures across ensembles__ via k-means or regular-space clustering and write out the resulting clusters as trajectories.
- trace allosteric information flow through a protein using __state-specific information__ analysis methods.

Proteins are featurized via [PyEMMA](http://emma-project.org/latest/) using backbone torsions, sidechain torsions, or backbone C-alpha distances, making PENSA compatible to all functionality available in PyEMMA. In addition, we provide density-based methods to featurize water and ion pockets.

Trajectories are processed and written using [MDAnalysis](https://www.mdanalysis.org/). Plots are generated using [Matplotlib](https://matplotlib.org/).
Biomolecules can be featurized using backbone torsions, sidechain torsions, or arbitrary distances (e.g., between all backbone C-alpha atoms) and we provide density-based methods to featurize water and ion pockets. PENSA also includes trajectory processing tools based on [MDAnalysis](https://www.mdanalysis.org/) and plotting functions using [Matplotlib](https://matplotlib.org/).
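
The feature comparison described above can be illustrated with a small standalone sketch. It does not use PENSA's API: the function names (`normalized_histogram`, `kl_divergence`, `js_distance`), the bin count, and the choice of the Jensen-Shannon distance as the relative-entropy-based measure are all assumptions made for illustration. Two samples of one feature (for example, a torsion angle from two simulation conditions) are binned on a shared grid and compared.

```python
import math


def normalized_histogram(values, lo, hi, bins):
    """Return bin probabilities for `values` on a shared grid [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(a, b, bins=30):
    """Jensen-Shannon distance (0 = identical, 1 = disjoint) of two samples.

    Hypothetical helper, not part of PENSA. Periodicity of angles is ignored.
    """
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    if hi == lo:
        return 0.0  # all values identical, nothing to distinguish
    p = normalized_histogram(a, lo, hi, bins)
    q = normalized_histogram(b, lo, hi, bins)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(0.0, jsd))


# Usage: two made-up samples of a single feature.
ensemble_a = [0.1 * i for i in range(100)]
ensemble_b = [0.1 * i + 3.0 for i in range(100)]
distance = js_distance(ensemble_a, ensemble_b)
```

Using the mixture distribution keeps every logarithm finite even when a bin is empty in one ensemble, which is why this symmetric measure is often preferred over the plain Kullback-Leibler divergence for sparse histograms.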

## Documentation
PENSA's documentation pages are [here](https://pensa.readthedocs.io/en/latest/), where you find installation instructions, API documentation, and a tutorial.
@@ -35,7 +33,7 @@ For the most common applications, example [Python scripts](https://github.com/dr
#### Demo on Google Colab
We demonstrate how to use the PENSA library in an interactive and animated example on Google Colab, where we use freely available simulations of a mu-Opioid Receptor from GPCRmd.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1difJjlcwpN-0hSmGCGrPq9Cxq5wJ7ZDa)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1msHB6uGeu2tBw_MnAFFTxcxeW4RnR0is)


## Citation
@@ -46,15 +44,22 @@ Martin Vögele, Neil Thomson, Sang Truong, Jasper McAvity. (2021). PENSA. Zenodo
```
To get the citation and DOI for a particular version, see [Zenodo](https://zenodo.org/record/4362136).

Please also consider citing our [preprint](https://arxiv.org/abs/2212.02714):
```
Systematic Analysis of Biomolecular Conformational Ensembles with PENSA
M. Vögele, N. J. Thomson, S. T. Truong, J. McAvity, U. Zachariae, R. O. Dror
arXiv:2212.02714 [q-bio.BM] 2022
```


## Acknowledgments

#### Contributors
Martin Vögele, Neil Thomson, Sang Truong, Jasper McAvity

#### Beta-Testers
Alex Powers, Lukas Stelzl, Nicole Ong, Eleanore Ocana, Callum Ives
Alexander Powers, Lukas Stelzl, Nicole Ong, Eleanore Ocana, Emma Andrick, Callum Ives, and Bu Tran

#### Funding & Support
This project was started by Martin Vögele at Stanford University, supported by an EMBO long-term fellowship (ALTF 235-2019), as part of the INCITE computing project 'Enabling the Design of Drugs that Achieve Good Effects Without Bad Ones' (BIP152).
This project was started by Martin Vögele at Stanford University, supported by an EMBO long-term fellowship (ALTF 235-2019), as part of the INCITE computing project 'Enabling the Design of Drugs that Achieve Good Effects Without Bad Ones' (BIP152). Neil Thomson was supported by a BBSRC EASTBIO PhD studentship and Jasper McAvity by the Stanford Computer Science department via the CURIS program. Stanford University, the Stanford Research Computing Facility, and the University of Dundee provided additional computational resources and support that contributed to these research results.

37 changes: 13 additions & 24 deletions TODO.md
@@ -1,19 +1,9 @@
### In Progress

- [ ] Tests
- [x] Workflow test with example data
- [ ] Trivial examples for each function
- [ ] Unit tests for SSI
- [ ] Unit tests for density features
- [ ] Integrate [DiffNets](https://doi.org/10.1101/2020.07.01.182725).
- [x] Lay out module structure in separate branch.
- [x] Copy core network from DiffNets repo.
- [ ] Try to use existing featurization.
- [ ] Include existing DiffNets featurization and compare.
- [ ] exploratory analysis via correlation coefficients of the features
- [x] First tests --> not very promising.
- [ ] Try [different metric](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.correlation.html)
- [ ] Find useful application or leave it out.
- [ ] Unified tutorial in documentation. Make one page for each subpackage
- [x] preprocessing
- [x] coordinates
@@ -28,15 +18,12 @@
- [x] SSI

### Plans
- [ ] Try using MDAnalysis instead of biotite for water featurization
- [ ] Integrate more options for features from PyEMMA (think carefully about how to make it more flexible)
- [ ] More example tcl scripts for VMD
- [ ] Facilitate calculation of JSD etc. on principal components
- [ ] Facilitate calculation of SSI on results of joint clustering.
- [ ] Weighted PCA/tICA? (to account for varying simulation lengths or uncertainty)
- [ ] Feature comparison of more than two ensembles
- [ ] with respect to the joint ensemble (all metrics)
- [ ] with respect to a reference ensemble (will not always work for KLD)
- [ ] Use MDAnalysis instead of biotite for water featurization
- [ ] Weighted PCA/tICA? (to account for varying simulation lengths or uncertainty)
- [ ] Implement T-distributed Stochastic Neighbor Embedding (t-SNE)
- [ ] Read up on [t-SNE for molecular trajectories](https://www.frontiersin.org/articles/10.3389/fmolb.2020.00132/full)
- [ ] See if we can import or adapt [existing code](https://github.com/spiwokv/tltsne).
@@ -49,17 +36,13 @@
- [ ] First tests
- [ ] write module
- [ ] write unit tests
- [ ] Put shared functionality of PCA and TICA into shared functions.
- [ ] Make file format (png/pdf?) for matplotlib optional.
- [ ] Implement [Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis).
- [ ] Implement [Non-Negative Matrix Factorization](https://onlinelibrary.wiley.com/doi/10.1002/env.3170050203).
- [ ] Implement nucleic acid torsions and pseudo-torsions, as reviewed [Keating et al.](https://www.cambridge.org/core/journals/quarterly-reviews-of-biophysics/article/new-way-to-see-rna/2A2D428A5FAB150D2488A5A1D87007BD) and as used in [x3DNA](https://x3dna.org/highlights/pseudo-torsions-to-simplify-the-representation-of-dna-rna-backbone-conformation) or [Barnaba](https://rnajournal.cshlp.org/content/25/2/219) ([Barnaba code on GitHub](https://github.com/srnas/barnaba))

### Ideas
- [ ] Logo
- [ ] Hydrogen bonds as features
- [ ] Contacts as features
- [ ] can PyEMMA do this?
- [ ] Think about a [GetContacts](https://getcontacts.github.io/) reader
- [ ] Position deviations as features (similar to components of RMSD)
- [ ] Estimate thresholds for significance of feature differences
@@ -68,10 +51,6 @@
- [ ] modify p-value of KS test using number of simulation runs per ensemble
- [ ] Wasserstein distance to compare ensembles
- [ ] Add option to whiten features
- [ ] Featurizers for other molecule types
- [ ] ligands
- [ ] lipids
- [ ] nucleic acids
- [ ] Account for [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction) in comparison.
- [ ] Implement conformational entropy calculations
- [ ] Read papers, e.g, [1](https://www.pnas.org/content/111/43/15396), [2](https://www.mdpi.com/2079-3197/6/1/21/htm), [3](https://pubs.acs.org/doi/10.1021/acs.jcim.0c01375)
@@ -105,11 +84,21 @@
- [x] Slack channel for all developers and testers, and to provide support for the user community.
- [x] Implement clustering in principal component space
- [x] Option to write and load features as CSV file.
- [x] Implement nucleic acid torsions and pseudo-torsions, as reviewed [Keating et al.](https://www.cambridge.org/core/journals/quarterly-reviews-of-biophysics/article/new-way-to-see-rna/2A2D428A5FAB150D2488A5A1D87007BD) and as used in [x3DNA](https://x3dna.org/highlights/pseudo-torsions-to-simplify-the-representation-of-dna-rna-backbone-conformation) or [Barnaba](https://rnajournal.cshlp.org/content/25/2/219) ([Barnaba code on GitHub](https://github.com/srnas/barnaba))
- [x] Hydrogen bonds as features
- [x] Use MDAnalysis instead of PyEMMA to read features (to avoid mmshare dependency).
- [x] Use scikit-learn or [Deeptime](https://deeptime-ml.github.io/latest/index.html) instead of PyEMMA for clustering.
- [x] Use scikit-learn or [Deeptime](https://deeptime-ml.github.io/latest/index.html) instead of PyEMMA for dimensionality reduction.
- [x] exploratory analysis via correlation coefficients of the features

### Abandoned

- [ ] Frame classification via CNN on features
- [x] Prototype to classify simulation frames --> Diffnets probably more powerful.
- [ ] Interpret weights as relevance of features
- [ ] Write module
- [ ] Write unit tests
- [ ] Integrate [DiffNets](https://doi.org/10.1101/2020.07.01.182725).
- [x] Lay out module structure in separate branch.
- [x] Copy core network from DiffNets repo.
- [ ] Try to use existing featurization.
- [ ] Include existing DiffNets featurization and compare.
8 changes: 3 additions & 5 deletions docs/conf.py
@@ -21,7 +21,7 @@
# -- Project information -----------------------------------------------------

project = 'PENSA'
copyright = '2020-2021, Martin Vögele, Neil Thomson, Sang Truong'
copyright = '2020-2023, Martin Vögele, Neil Thomson, Sang Truong'
author = 'Martin Vögele, Neil Thomson, Sang Truong'


@@ -64,12 +64,10 @@
autodoc_mock_imports = [
'numpy',
'scipy',
'pandas',
'matplotlib',
'mdtraj',
'pyemma',
'mdshare',
'deeptime',
'MDAnalysis',
'cython',
'biotite'
]

4 changes: 2 additions & 2 deletions docs/contribute.rst
@@ -7,12 +7,12 @@ We are always happy to help and to hear about your work and your success stories
Report a bug or request a feature
***********************************

PENSA is open-source and available on `Github <https://github.com/drorlab/pensa>`_. Please submit issues or requests using the `issue tracker <https://github.com/drorlab/pensa/issues>`_.
PENSA is open-source and available on `GitHub <https://github.com/drorlab/pensa>`_. Please submit issues or requests using the `issue tracker <https://github.com/drorlab/pensa/issues>`_.

Add new functionality
***********************************

We welcome any kind of contributions to improve or expand the PENSA code. In particular, we are interested in readers for new feature types and new ways to analyze and compare structural ensembles. PENSA is maintained on `Github <https://github.com/drorlab/pensa>`_ so you can fork it and create a pull request. For guidance, see our `contribution guidelines <https://github.com/drorlab/pensa/blob/master/CONTRIBUTING.md>`_. Please make sure to properly test your contribution before the request. For large or complicated contributions, please get in contact so we can coordinate them with you.
We welcome any kind of contributions to improve or expand the PENSA code. In particular, we are interested in readers for new feature types and new ways to analyze and compare structural ensembles. PENSA is maintained on `GitHub <https://github.com/drorlab/pensa>`_ so you can fork it and create a pull request. For guidance, see our `contribution guidelines <https://github.com/drorlab/pensa/blob/master/CONTRIBUTING.md>`_. Please make sure to properly test your contribution before the request. For large or complicated contributions, please get in contact so we can coordinate them with you.

We explain two of the most common cases below:

11 changes: 7 additions & 4 deletions environment.yml
@@ -3,8 +3,11 @@ channels:
- conda-forge
- defaults
dependencies:
- python==3.7
- mdtraj==1.9.3
- mdshare
- pyemma
- python==3.9
- scipy>=1.2
- numpy
- pandas
- matplotlib
- MDAnalysis
- deeptime
- biotite
2 changes: 1 addition & 1 deletion pensa/__init__.py
@@ -2,6 +2,6 @@
from .features import *
from .statesinfo import *
from .clusters import *
from .comparison import *
from .comparison import *
from .dimensionality import *
