PR for version 1.0 #22

Merged
merged 82 commits into from
Apr 4, 2024
25e93f1
Implementations on Biogen dataset
Dec 12, 2023
0af0d12
Statistical analysis
Dec 12, 2023
3a83853
Proof of concept for stratified splits
Dec 12, 2023
16ea7fb
Proof of concept for stratified splits, pt. 2
Dec 12, 2023
f9b1ea4
Solver analysis based on number of clusters and refactoring of biogen…
Dec 20, 2023
832062c
Integration and testing of more solvers (GLPK and CBC)
Dec 20, 2023
97c77a4
More experiments with newly added solvers
Dec 20, 2023
128c7e1
Minor changes
Dec 21, 2023
fa84669
Stratification in all techniques + test, to be debugged
Old-Shatterhand Dec 21, 2023
8feff77
Testing and debugging of stratification
Dec 21, 2023
6571002
Final preparations for christmas analysis
Dec 21, 2023
1e2d4b6
Updates for code execution and visualization
Jan 10, 2024
6688c67
Bug fixes in stratification and more visualization
Jan 10, 2024
55c2094
Merge branch 'dev_biogen' into dev_stratified
Jan 10, 2024
1808eae
Minor updates for analysis
Jan 18, 2024
509cf48
Finalizing experiments and visualization of work on ablations and str…
Jan 18, 2024
3847065
Unification branch for v0.3
Old-Shatterhand Jan 18, 2024
5dfd053
Merge branch 'dev_biogen' into dev_stratified
Old-Shatterhand Jan 19, 2024
9c8f154
Merge branch 'dev_stratified' into dev_0.3
Old-Shatterhand Jan 19, 2024
31f682b
Result visualization update
Jan 23, 2024
b559e85
More visualization and training code
Feb 1, 2024
a30f3a6
Code cleaning for experiments and documentation of experiments
Old-Shatterhand Feb 7, 2024
db22e98
Minor bug fixes and first version of a second meta.yaml for Windows a…
Feb 8, 2024
3fac66d
Adding DIAMOND to DataSAIL and writing a docu page about it. Open for…
Feb 9, 2024
c269538
More improvements on the experiments and documentation
Feb 15, 2024
5fe4768
Setup build for two versions of DataSAIL
Feb 15, 2024
935fdf6
Minor update
Mar 4, 2024
658ef2e
First part of fixing DIAMOND adapter
Old-Shatterhand Mar 4, 2024
6d25674
Bug fixes
Old-Shatterhand Mar 11, 2024
00c0a44
Embedding clustering implemented, not tested yet
Old-Shatterhand Mar 11, 2024
2d361db
Finished debugging diamond related issues
Old-Shatterhand Mar 11, 2024
160548a
Final diamond debugging
Old-Shatterhand Mar 12, 2024
037ae38
Merge branch 'dev_0.3' into dev_1.0
Old-Shatterhand Mar 12, 2024
e379ddc
Merge branch 'dev_diamond' into dev_1.0
Old-Shatterhand Mar 12, 2024
741b71b
Individual number of clusters per dataset
Mar 14, 2024
5cf8c4b
Merge remote-tracking branch 'origin/dev_0.3' into dev_0.3
Mar 14, 2024
59cff75
Merge branch 'dev_0.3' into dev_1.0
Old-Shatterhand Mar 14, 2024
1061491
Merging of dev_0.3 and dev_diamond with additional tests and bug fixes
Old-Shatterhand Mar 14, 2024
571786e
Update test.yaml, missing installment added
Old-Shatterhand Mar 14, 2024
0478d6c
Two more minor fixes
Old-Shatterhand Mar 14, 2024
4696362
Extending handling of embedding at input plus more tests
Mar 15, 2024
7e17032
Minor bug fixes
Mar 15, 2024
23577b2
Minor fixes
Old-Shatterhand Mar 15, 2024
62a3bd3
Bug fix in agglomerative clustering due to new version restrictions
Old-Shatterhand Mar 15, 2024
ebc24b0
Final fix for yesterday's changes
Old-Shatterhand Mar 15, 2024
17b7b19
Updated notebooks and some more tests
Mar 18, 2024
d48490b
Merge branch 'main' into dev_1.0
Old-Shatterhand Mar 19, 2024
cbb5041
Pull request templates
Old-Shatterhand Mar 19, 2024
71813d5
Relaxation of scikit-learn version restrictions
Old-Shatterhand Mar 19, 2024
349007b
Minor update
Old-Shatterhand Mar 19, 2024
d8b5cee
Minor update and extension of embedding clustering with scipy
Old-Shatterhand Mar 20, 2024
7edf382
Minor bug fix with original ECFP clustering
Old-Shatterhand Mar 20, 2024
2e8b7e4
More tests on different molecule formats
Old-Shatterhand Mar 21, 2024
7638294
Bug fix
Old-Shatterhand Mar 21, 2024
1a397bc
Improved documentation
Old-Shatterhand Mar 26, 2024
3c57294
Bug fix in experiments and routine for quantitative IL evaluation
Mar 26, 2024
206e367
Merge remote-tracking branch 'origin/dev_1.0' into dev_1.0
Mar 26, 2024
7a76846
Additional test for different rdkit versions
Old-Shatterhand Mar 26, 2024
2473d7f
Renaming
Old-Shatterhand Mar 26, 2024
e38a589
New triggers
Old-Shatterhand Mar 26, 2024
8b1c737
typo fixed
Old-Shatterhand Mar 26, 2024
7e70beb
Another try to fix
Old-Shatterhand Mar 26, 2024
9c3a0a0
Update of version naming for rdkit
Old-Shatterhand Mar 26, 2024
6ab96be
Conda cannot deal with ~
Old-Shatterhand Mar 26, 2024
9602a33
Bug fix with list in github actions
Old-Shatterhand Mar 26, 2024
6390166
Updated notebooks and example for Stratification
Mar 26, 2024
acb93d5
Weighting of clusters in loss function
Mar 27, 2024
ad728e5
Update test.yaml
Old-Shatterhand Mar 27, 2024
1a86bd7
Update test.yaml
Old-Shatterhand Mar 27, 2024
7f2a35f
Update test_rdkit.yaml
Old-Shatterhand Mar 27, 2024
4634a95
Fix in RDKit version testing and new files for .mol, .mrv, .pdb, and …
Old-Shatterhand Mar 28, 2024
7387964
Missed, updated file for documentation
Old-Shatterhand Mar 28, 2024
563e5d0
Bugs fixed
Old-Shatterhand Mar 28, 2024
35954ce
Bugs fixed with vector similarities
Old-Shatterhand Mar 29, 2024
00d893f
Minor bugs fixed
Old-Shatterhand Apr 3, 2024
ece9d4a
new notebook on non-biomolecular data
Old-Shatterhand Apr 3, 2024
7a19c45
Integration of notebook in documentation
Old-Shatterhand Apr 3, 2024
da7dad3
Merge branch 'dev_1.0_rdkit' into dev_1.0
Old-Shatterhand Apr 3, 2024
15e8010
Merge branch 'dev_1.0_weighting' into dev_1.0
Old-Shatterhand Apr 3, 2024
b3ac91e
Minor bug fix
Old-Shatterhand Apr 3, 2024
bbee3ae
Minor adaptation to new package setup
Old-Shatterhand Apr 3, 2024
9ebff0f
Documentation update
Old-Shatterhand Apr 3, 2024
39 changes: 39 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE/detailed_pr_template.md
@@ -0,0 +1,39 @@
# Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
List any dependencies that are required for this change.

## Fixes

Which issues/bugs/feature requests does this PR fix/tackle?

## Type of change

Please check all that apply.

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected).
- [ ] This change requires a documentation update

If you tick the third one, please justify breaking DataSAIL's current functionality.

# How Has This Been Tested?

Please list the added tests and what new/changed behavior they test. Please also state if you changed/added things that
are covered by already existing tests.

# Checklist:

- [ ] I have read the contribution guidelines for DataSAIL
- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my code
- [ ] I have commented my code including:
  - [ ] Doc-Strings for all new methods following the
    [Google Guidelines](https://github.com/google/styleguide/blob/gh-pages/pyguide.md#38-comments-and-docstrings)
  - [ ] Hard-to-understand areas, e.g., algorithmic steps
- [ ] I have made corresponding changes to the documentation
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing pytests pass locally with my changes
- [ ] New dependencies have been added to the recipes for conda-build (only if applicable); where needed, check for
  minimal and maximal version requirements
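
The docstring item in the checklist above points to the Google style guide. As a rough illustration of that layout, a Google-style docstring could look like the sketch below. The function `normalize_splits` and its parameter are hypothetical and are not part of DataSAIL.

```python
from typing import List


def normalize_splits(splits: List[float]) -> List[float]:
    """Scale split sizes so that they sum to one.

    Hypothetical helper, used only to illustrate the docstring layout.

    Args:
        splits: Relative sizes of the splits, e.g. ``[7, 2, 1]``.

    Returns:
        The split sizes divided by their total, in the same order.

    Raises:
        ValueError: If ``splits`` is empty or its total is not positive.
    """
    total = sum(splits)
    if not splits or total <= 0:
        raise ValueError("splits must be non-empty with a positive total")
    return [s / total for s in splits]
```

The `Args`, `Returns`, and `Raises` sections are the parts a reviewer would typically look for when ticking this box.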
16 changes: 16 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE/fast_pr_template.md
@@ -0,0 +1,16 @@
## Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
List any dependencies that are required for this change.

## Issue ticket number and link

Which issues/bugs/feature requests does this PR fix/tackle?

# Checklist

- [ ] I have read the contribution guidelines for DataSAIL
- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my code
- [ ] I have commented my code including Doc-Strings for all new methods following the
[Google Guidelines](https://github.com/google/styleguide/blob/gh-pages/pyguide.md#38-comments-and-docstrings)
14 changes: 6 additions & 8 deletions .github/workflows/publish_conda.yaml
@@ -10,19 +10,17 @@ jobs:
     runs-on: 'ubuntu-latest'
     steps:
       - uses: actions/checkout@v1
-      - name: publish-to-conda
-        uses: Old-Shatterhand/[email protected].11
+      - name: publish-full-to-conda
+        uses: Old-Shatterhand/[email protected].10
         with:
           AnacondaToken: ${{ secrets.DATASAIL_ANACONDA_TOKEN }}
           Versions: "3.8,3.9,3.10,3.11,3.12"
           Folder: "recipe"
           Platforms: "osx-64"
-          UploadOriginal: 1
-      - name: publish-to-conda-pt2
-        uses: Old-Shatterhand/[email protected]
+      - name: publish-lite-to-conda
+        uses: Old-Shatterhand/[email protected]
         with:
           AnacondaToken: ${{ secrets.DATASAIL_ANACONDA_TOKEN }}
           Versions: "3.8,3.9,3.10,3.11,3.12"
-          Folder: "recipe"
-          Platforms: "osx-arm64,win-64"
-          UploadOriginal: 0
+          Folder: "recipe_lite"
+          Platforms: "osx-64,osx-arm64,win-64"
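
To get a feel for the size of the lite publishing step, the sketch below pairs the values of its `Versions` and `Platforms` strings. It assumes the publishing action builds one package per (Python version, platform) pair, which the workflow file itself does not confirm. `build_matrix` is a hypothetical helper, not part of the action.

```python
from itertools import product
from typing import List, Tuple


def build_matrix(versions: str, platforms: str) -> List[Tuple[str, str]]:
    """Pair every Python version with every platform (assumed build scheme)."""
    vs = [v.strip() for v in versions.split(",") if v.strip()]
    ps = [p.strip() for p in platforms.split(",") if p.strip()]
    return list(product(vs, ps))


# Values copied from the publish-lite-to-conda step.
lite = build_matrix("3.8,3.9,3.10,3.11,3.12", "osx-64,osx-arm64,win-64")
print(len(lite))  # 5 versions x 3 platforms = 15 pairs
```

Under that assumption, adding a platform or a Python version grows the number of builds multiplicatively, which is worth keeping in mind for CI time.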
10 changes: 7 additions & 3 deletions .github/workflows/test.yaml
@@ -9,10 +9,14 @@ on:
     branches:
       - main
       - dev
+      - dev_1.0
+      - dev_1.0_weighting
   pull_request:
     branches:
       - main
       - dev
+      - dev_1.0
+      - dev_1.0_weighting
   workflow_dispatch:  # make is manually start-able

 # A workflow run is made up of one or more jobs that can run sequentially or in parallel
@@ -21,10 +25,10 @@ jobs:
   test:
     runs-on: 'ubuntu-latest'
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4

       - name: Setup Mamba
-        uses: conda-incubator/setup-miniconda@v2
+        uses: conda-incubator/setup-miniconda@v3
         with:
           python-version: '3.10'
           miniforge-variant: Mambaforge
@@ -35,7 +39,7 @@ jobs:
       - name: Install environment
         shell: bash -l {0}
         run: |
-          mamba install -c conda-forge -c bioconda -y numpy pandas networkx matplotlib pytest setuptools pyscipopt"<4.0.0" foldseek mmseqs2 cd-hit mash tmalign cvxpy pytest-cov rdkit">=2022.09.1" pytest-cases scikit-learn pyyaml
+          mamba install -c conda-forge -c bioconda -y numpy pandas networkx matplotlib pytest setuptools pyscipopt"<4.0.0" foldseek mmseqs2 cd-hit mash tmalign diamond cvxpy pytest-cov rdkit">=2023.09.1" pytest-cases scikit-learn pyyaml h5py
           pip install grakel

       - name: Run tests
59 changes: 59 additions & 0 deletions .github/workflows/test_rdkit.yaml
@@ -0,0 +1,59 @@
# This is a basic workflow to help you get started with Actions

name: Test RDKit

# Controls when the workflow will run
on:
  # Triggers the workflow on push or pull request events but only for the main branch
  push:
    branches:
      - main
      - dev
      - dev_1.0_rdkit
  pull_request:
    branches:
      - main
      - dev
      - dev_1.0_rdkit
  workflow_dispatch:  # make is manually start-able

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "test"
  test:
    strategy:
      matrix:
        rdkit-version:
          - rdkit">=2022.03.1,<2022.09.1"
          - rdkit">=2023.03.1,<2023.09.1"
    runs-on: 'ubuntu-latest'
    steps:
      - uses: actions/checkout@v4

      - name: Setup Mamba
        uses: conda-incubator/setup-miniconda@v3
        with:
          python-version: '3.10'
          miniforge-variant: Mambaforge
          miniforge-version: latest
          activate-environment: MPP
          use-mamba: true

      - name: Install environment
        shell: bash -l {0}
        run: |
          mamba install -c conda-forge -c bioconda -y numpy pandas networkx matplotlib pytest setuptools pyscipopt"<4.0.0" foldseek mmseqs2 cd-hit mash tmalign diamond cvxpy pytest-cov ${{ matrix.rdkit-version }} pytest-cases scikit-learn pyyaml h5py
          pip install grakel

      - name: Run tests
        shell: bash -l {0}
        run: |
          cd tests
          pytest test_pipeline.py::test_molecule_formats

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          fail_ci_if_error: false
          files: coverage.xml
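
The two matrix entries in this workflow pin RDKit to half-open release windows (lower bound included, upper bound excluded). The sketch below shows one way to check whether a given version string falls into such a window. `parse_version` and `in_window` are hypothetical helpers, and the plain numeric parsing is an assumption that ignores suffixes such as release candidates or build tags.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2022.03.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def in_window(version: str, lower: str, upper: str) -> bool:
    """Return True if lower <= version < upper, compared component-wise."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Windows copied from the rdkit-version matrix of the workflow.
WINDOWS = [("2022.03.1", "2022.09.1"), ("2023.03.1", "2023.09.1")]

for candidate in ("2022.03.5", "2022.09.1", "2023.03.3"):
    matches = [window for window in WINDOWS if in_window(candidate, *window)]
    print(candidate, matches)
```

Comparing integer tuples avoids the pitfalls of comparing the raw strings, where zero padding and component length would affect the ordering.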
64 changes: 56 additions & 8 deletions .gitignore
@@ -20,27 +20,75 @@ tmp/
log*.txt
*.png
*.pdf
*.log
*.pkl
*.ipynb
*.sh

MoleculeNet.py
run_moleculenet.sh
viz_molnet.py
/experiments/MPP/datasail/
/experiments/MPP/deepchem/
/experiments/PDBBind/lppdbbind/
/experiments/PDBBind/runs/
/experiments/DTI/runs/
/experiments/segfault/
/experiments/tsne/
/experiments/umap/
/experiments/david.py
/experiments/ablation/david.py
/experiments/MPP_timing.pkl
/experiments/stiming.png
/experiments/timing.png
/runs/
/experiments/PDBBind/datasail/
/experiments/DTI/datasail/
/experiments/MPP/lohi/
/experiments/PDBBind/deepchem/
/experiments/PDBBind/graphpart/
/experiments/PDBBind/lohi/
/experiments/DTI/deepchem/
/experiments/DTI/graphpart/
/experiments/DTI/lohi/
/experiments_local/
/mave_test/
/tests/data/rw_data/mave_splits/
/tests/data/rw_data/mave_splits/
/experiments/Biogen/data/
/experiments/Biogen/datasail/HLM/
/experiments/Biogen/deepchem/HLM/
/experiments/Biogen/lohi/HLM/
/experiments/MPP/ML4Mol/
/experiments/MPP/ML4Mol_bace/
/experiments/MPP/ML4Mol_qm8/
/experiments/MPP/ML4Mol_qm9/
/experiments/MPP/tsne/
/experiments/MPP/timing.pkl
/experiments/MPP/train_ML4Mol.py
/experiments/DTI/ML4Mol/
/experiments/DTI/drug_embeds.pkl
/experiments/DTI/ML4Mol.ipynb
/experiments/DTI/prot_embeds.pkl
/experiments/DTI/tsne_embeds.pkl
/experiments/analysis/cdhit_leak/
/experiments/analysis/backup.pkl
/experiments/analysis/cluster_silhouette_scores.pkl
/experiments/analysis/clusterings/
/experiments/analysis/backup_scip2.pkl
/experiments/analysis/cluster_tables.pkl
/experiments/analysis/density.pkl
/experiments/analysis/figures.ipynb
/experiments/analysis/jaccard.pkl
/experiments/analysis/performances.pkl
/experiments/analysis/qm9.tar.gz
/experiments/analysis/rna.fasta
/experiments/analysis/save.pkl
/experiments/analysis/save_clustering.py
/experiments/analysis/silhouette_scores.pkl
/experiments/analysis/Untitled.ipynb
/ML4Mol/ML4Mol_prepare.ipynb
/ML4Mol/ML4Mol_QM8.ipynb
/ML4Mol/ML4Mols_split.ipynb
/Untitled.ipynb
/ML4Mol/
/experiments/Biogen/datasail/
/experiments/Biogen/deepchem/
/experiments/Biogen/lohi/
/experiments/MPP/datasail_old/
/experiments/MPP/deepchem_old/
/experiments/MPP/lohi_old/
/experiments/time2/
/experiments/time/
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "experiments/DTI/lppdbbind"]
path = experiments/DTI/lppdbbind
url = git@github.com:kalininalab/LP-PDBBind.git
15 changes: 13 additions & 2 deletions CHANGELOG.md
@@ -1,6 +1,6 @@
# Change Log

## [Planned]
## [Planned - Long-term project ideas]

- [ ] Normalization of objectives for better comparability and better splits in solvers
- [ ] Time-limit and solution limit for all solvers
@@ -9,7 +9,18 @@
- [ ] Reports of results with plots and tables a PDF and or HTML
- [ ] Generalization to R-dimensional datasets (see [paper](https://doi.org/10.1101/2023.11.15.566305))
- [ ] Input from config files
- [ ] Replace GraKel with something "modern" and fully "conda-installable" to make DataSAIL fully conda-installable
- [ ] Stratified splits
- [ ] Include [MashMap3](https://github.com/marbl/MashMap)
- [ ] Include MASH for amino acid sequences

## v0.3.0 (2024-01-??)

- Stratified splits
- Extensive checks of available solvers
- Time and Space limits for all solvers
- Runtime experiments and experiments on a [stratified dataset](LINK)
- Bugs and Docu fixed

## v0.2.2 (2023-12-11)

@@ -28,7 +39,7 @@

## v0.2.1 (2023-10-26)

- Renaming of splitting techniques to align with paper to be I1/C1/I2/C2
- Renaming of splitting techniques to align with preprint to be I1/C1/I2/C2
- More tests to better cover the supposed functionality
- Now supports for Python 3.8 to Python 3.11
- Experiments on [MoleculeNet](https://doi.org/10.1039/C7SC02664A) and [LP-PDBBind](https://doi.org/10.48550/arXiv.2308.09639)
5 changes: 5 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,5 @@
Contributing to DataSAIL
========================

As with every other open-source project, you can contribute to DataSAIL. For more information, we refer you to the
[contribution guidelines in our documentation]().