MDSubSampler is a Python library and toolkit for a posteriori subsampling of multiple trajectory data for further analysis. This toolkit implements uniform, random, stratified sampling, bootstrapping and targeted sampling to preserve the original distribution of relevant geometrical properties.
When using MDSubSampler, please cite the following paper:
N. Oues, S. C. Dantu, R. J. Patel, A. Pandini. MDSubSampler: a posteriori sampling of important protein conformations from biomolecular simulations. Bioinformatics (2023), btad427, https://doi.org/10.1093/bioinformatics/btad427
This project requires Python (version 3.9.1 or later). To make sure you have the right version available on your machine, try running the following command.
$ python --version
Python 3.9.1
These instructions will get you a copy of the project up and running on your local machine for analysis and development purposes.
BEFORE YOU INSTALL: please read the prerequisites
To install and set up the library, run:
pip install MDSubSampler
Input:
- Molecular Dynamics trajectory
- Geometric property
- Atom selection [optional - default is "name CA"]
- Reference structure [optional]
- Sample size or range of sizes
- Dissimilarity measure [optional - default is "Bhattacharyya"]
Output:
- .dat file with calculated property for full trajectory (user input)
- .dat file(s) with calculated property for one or all sample sizes input
- .xtc file(s) with sample trajectory for one or all sample sizes
- .npy file(s) with sample trajectory for one or all sample sizes
- .npy training set for ML purposes for sample trajectory (optional)
- .npy testing set for ML purposes for sample trajectory (optional)
- .npy file(s) with sample trajectory for one or for all sample sizes
- .png file with overlapped property distribution of reference and sample
- .json file report with important statistics from the analysis
- .txt log file with essential analysis steps and information
To run scenarios 1,2 or 3 you can download your protein trajectory and topology file (.xtc and .gro files) to the data folder and then run the following:
python mdss/scenarios/scenario_1.py data/<YourTrajectoryFile>.xtc data/<YourTopologyfile>.gro <YourPrefix>
Scenarios 1,2 and 3 are also available in Jupyter Notebooks format, can be used as templates and can be modified interactively according to the user's needs. You can also find more advanced scenarios in the cookbook directory. If you clone the library locally to your machine, then run the following command before you run the cells.
%cd <pathToMDSubSamplerDirectory>
If you are a terminal lover you can use the terminal to run the code and make a choice for the parser arguments. To see all options and choices run:
python mdss/run.py --help
Once you have made a selection of arguments, your command can look like the following example:
python mdss/run.py \
--traj "data/<YourTrajectoryFile>.xtc" \
--top "data/<YourTopologyFile>.gro" \
--prefix "<YourPrefix>" \
--output-folder "data/<YourResultsFolder>" \
--property='DistanceBetweenAtoms' \
--atom-selection='G55,P127' \
--sampler='BootstrappingSampler' \
--n-iterations=50 \
--size=<SampleSize> \
--dissimilarity='Bhattacharyya'
Start by either downloading the tarball file from https://github.com/alepandini/MDSubSampler to your local machine or cloning this repo on your local machine:
git clone [email protected]:alepandini/MDSubSampler.git
cd MDSubSampler
Following that, download and install poetry from https://python-poetry.org/docs/#installation
Finally, run the following:
poetry install
poetry build
poetry shell
You can now start developing the library.
Start by installing Docker using this link https://docs.docker.com/get-docker/.
Initially a Docker image will need to be built. To do this run the following command:
docker build -t <image name> .
Then run the following command to get access to a shell with all dependencies installed:
docker run -it -v $(pwd):/app -e PYTHONPATH=/app <image name> /bin/bash
This will also mirror the local filesystem in the Docker image, so that any local change will be reflected in the running container, and vice-versa, using a Docker volume.
The repo also includes two handy scripts to run all of the above faster (an image called
subsampler
will be created):
./build-docker
./run-docker
After dropping in the Docker shell, all dependencies will be installed, and the package scripts
will also be in scope (the mdss
command and all scenarios declared in pyproject.toml
).
- Namir Oues - namiroues
- Alessandro Pandini alepandini
The library is licensed by GPL-3.0