Benchmark for MDA 2.4.3 (#5)
* switch to chembl33 and update env
* formatting with black
* standardization code modifies mol inplace
* benchmark results
* simplify conda env
* add scaffold analysis
* use rdkit.js to generate images on hover
* add link to scaffold network viz in readme
* cleanup
cbouy authored Aug 7, 2024
1 parent a773c61 commit 8b6e4f5
# MDAnalysis fork to use
GITHUB_USER ?= cbouy
# Branch of MDAnalysis to install
BRANCH ?= fix-converter
MDA_VERSION ?= 2.4.3
# Use conda or mamba
CONDA ?= conda
# Number of threads to use in parallel
@@ -15,9 +13,9 @@ MAX_ATOMS ?= 50

SHELL := /bin/bash
SET_CONDA_ENV := source $$(conda info --base)/etc/profile.d/ && conda activate && conda activate rdkitconverter

-fetch := data/chembl_30.sdf.gz
+fetch := data/chembl_33.sdf.gz
process := data/chembl_processed_unique.smi.gz
benchmark := data/chembl_failed.smi
report := results/failed_molecules.html
@@ -30,7 +28,7 @@ help:
@echo 'targets:'
@echo ' help Show this help'
@echo ' install Install dependencies'
-@echo ' fetch Fetch ChEMBL 30'
+@echo ' fetch Fetch ChEMBL 33'
@echo ' process Filter, standardize and remove duplicate molecules'
@echo ' benchmark Run the benchmark'
@echo ' report Generate the report'
@@ -42,7 +40,7 @@ help:
$(CONDA) env create -f environment.yaml
-@pip install git+$(GITHUB_USER)/mdanalysis.git@$(BRANCH)#subdirectory=package
+@$(CONDA) install 'mdanalysis==$(MDA_VERSION)'

15 changes: 9 additions & 6 deletions
@@ -12,16 +12,19 @@ To cite this repository, please use the following DOI:

| Description | Value |
| --- | --- |
-| **MDAnalysis version** | 2.2.0-dev0 |
-| **Accuracy** | 99.14% |
-| **Number of molecules fetched** | 2,136,187 |
-| **Number of molecules processed** | 1,942,004 |
-| **Number of molecules failed** | 16,615 |
+| **MDAnalysis version** | 2.4.3 |
+| **Accuracy** | 99.19% |
+| **Number of molecules fetched** | 2,372,174 |
+| **Number of molecules processed** | 2,166,327 |
+| **Number of molecules failed** | 17,577 |

Details on the benchmark can also be found [here](results/results.json).
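
As a sanity check on the table above, the reported accuracy appears to be the share of processed molecules that did not fail. That definition is an assumption (the README does not state the formula), but it reproduces both the previous and the updated figures. A minimal Python sketch, with `accuracy_percent` as a hypothetical helper name:

```python
def accuracy_percent(n_processed: int, n_failed: int) -> float:
    """Percentage of processed molecules that did not fail (assumed definition)."""
    if n_processed <= 0:
        raise ValueError("n_processed must be positive")
    return 100.0 * (n_processed - n_failed) / n_processed


# Counts copied from the table: previous run first, then the updated run.
previous = accuracy_percent(1_942_004, 16_615)
updated = accuracy_percent(2_166_327, 17_577)
print(f"{previous:.2f}% -> {updated:.2f}%")
```

Rounded to two decimals, the two values agree with the accuracy rows of the table, which suggests the number of failed molecules is measured against the processed set rather than the fetched set.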

The **interactive list of molecules** currently failing can be accessed [here]( (click on a molecule's image to zoom in).

+Failing **scaffolds** can be accessed [here]( The scaffold network used to
+create this file can be viewed [here](

## Instructions

Running the benchmark requires conda (or mamba) on a Linux machine.
@@ -53,7 +56,7 @@ The results are available in the `results/` directory:

## Methods

-The benchmark will fetch ChEMBL 30 as an SDF file and process the molecules the following way:
+The benchmark will fetch ChEMBL 33 as an SDF file and process the molecules the following way:
- Discard molecules that could not be read or sanitized by RDKit
- Keep only the largest fragment
- Keep only molecules with 2 to 50 heavy atoms
- bokeh=3.2.0
- mols2grid=1.1.1
- networkx=3.1
- pandas=2.0.2
- python=3.10.11
- rdkit=2022.09.1
- scaffoldgraph=1.1.2
- tqdm=4.65.0
- numpy=1.25.0
- pydot=1.4.2
- pygraphviz=1.11
- scipy=1.9.3
2 changes: 1 addition & 1 deletion results/badge.json
@@ -1 +1 @@
-{"schemaVersion": 1, "label": "accuracy", "message": "99.14%", "color": "success"}
+{"schemaVersion": 1, "label": "accuracy", "message": "99.19%", "color": "success"}
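
The badge file uses the four fields of a Shields.io endpoint badge (`schemaVersion`, `label`, `message`, `color`). How the repository actually writes this file is not shown in the diff; the sketch below is one hypothetical way to regenerate the same payload from an accuracy value in Python, with `make_badge` as an illustrative name.

```python
import json


def make_badge(accuracy: float) -> str:
    """Serialize an accuracy percentage as a badge payload (illustrative helper)."""
    payload = {
        "schemaVersion": 1,
        "label": "accuracy",
        "message": f"{accuracy:.2f}%",  # two decimals, as in the committed file
        "color": "success",
    }
    # Default json.dumps separators (", " and ": ") match the committed layout.
    return json.dumps(payload)


badge = make_badge(99.19)
print(badge)
```

Keeping the message formatting in one place means the badge and the README table cannot drift apart if both are produced from the same number.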
