This repository contains the implementation of the Multi-view Molecular Embedding with Late Fusion (MMELON) architecture presented in our preprint Multi-view biomedical foundation models for molecule-target and property prediction. MMELON is a flexible approach to aggregating multiple views (sequence, image, graph) of molecules in a foundation model setting. While models based on a single-view representation typically perform well on some downstream tasks and not others, the multi-view model performs robustly across a wide range of property prediction tasks encompassing ligand-protein binding, molecular solubility, metabolism, and toxicity. It has been applied to screen compounds against a large (> 100 targets) set of G protein-coupled receptors (GPCRs) to identify strong binders for 33 targets related to Alzheimer’s disease, which were validated through structure-based modeling and identification of key binding motifs.
Our model integrates:
- Image Representation: Captures the 2D visual depiction of molecular structures, highlighting features like symmetry, bond angles, and functional groups. Molecular images are generated using RDKit and undergo data augmentation during training to enhance robustness.
- Graph Representation: Encodes molecules as undirected graphs where nodes represent atoms and edges represent bonds. Atom-specific properties (e.g., atomic number, chirality) and bond-specific properties (e.g., bond type, stereochemistry) are embedded using categorical embedding techniques.
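As a rough illustration of the categorical embedding idea (this is a sketch, not the repository's implementation), each categorical atom property can index into its own embedding table, with the per-property vectors summed to form the atom embedding:

```python
import numpy as np

# Illustrative categorical embedding for atom features. Table sizes and the
# summation scheme are assumptions for this sketch.
rng = np.random.default_rng(1)
dim = 4
atomic_num_table = rng.standard_normal((119, dim))  # atomic numbers 0-118
chirality_table = rng.standard_normal((4, dim))     # e.g. 4 chirality tags

def embed_atom(atomic_num: int, chirality: int) -> np.ndarray:
    """Look up each categorical feature and sum the resulting vectors."""
    return atomic_num_table[atomic_num] + chirality_table[chirality]

v = embed_atom(6, 0)  # carbon, unspecified chirality
print(v.shape)
```

Bond features (bond type, stereochemistry) can be embedded the same way with their own tables.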
- Text Representation: Utilizes SMILES strings to represent chemical structures, tokenized with a custom tokenizer. The sequences are embedded using a transformer-based architecture to capture the sequential nature of the chemical information.
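The repository ships its own custom tokenizer; as a hedged illustration of what SMILES tokenization looks like, a minimal regex-based tokenizer (following the commonly used atom/bond token pattern) might be:

```python
import re

# Minimal SMILES tokenizer sketch (hypothetical; not the repository's tokenizer).
# Multi-character atoms (Br, Cl, ...) and bracketed atoms are matched before
# single-character tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|[BCNOPSFIbcnops]|=|#|-|\+|\\|/|\(|\)|\.|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

print(tokenize_smiles("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"))
```

The resulting token sequence is what a transformer encoder consumes after mapping tokens to vocabulary indices.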
The embeddings from these single-view pre-trained encoders are combined using an attention-based aggregator module. This module learns to weight each view appropriately, producing a unified multi-view embedding. This approach leverages the strengths of each representation to improve performance on downstream predictive tasks.
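The late-fusion idea can be sketched as follows (illustrative only; the dimensions, scoring vector, and softmax-weighted sum stand in for the learned aggregator, not the repository's actual code):

```python
import numpy as np

# Attention-based late fusion sketch: each view encoder yields a fixed-size
# embedding; a scoring vector (standing in for learned parameters) assigns each
# view a weight via softmax, and the fused embedding is the weighted sum.
rng = np.random.default_rng(0)
dim = 8
views = {
    "image": rng.standard_normal(dim),
    "graph": rng.standard_normal(dim),
    "text": rng.standard_normal(dim),
}

def attention_fuse(view_embs, score_vec):
    embs = np.stack(list(view_embs.values()))  # (n_views, dim)
    scores = embs @ score_vec                  # (n_views,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax attention weights
    return weights @ embs, weights             # fused (dim,), weights (n_views,)

score_vec = rng.standard_normal(dim)
fused, weights = attention_fuse(views, score_vec)
print(fused.shape, weights)
```

Because the weights are data-dependent, the model can lean on whichever view is most informative for a given molecule or task.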
Figure: Schematic of multi-view architecture. Embeddings from three single view pre-trained encoders (Image, Graph and Text) are combined by the aggregator module into a combined embedding. The network is finetunable for downstream predictive tasks.
Our pre-training dataset comprises 200 million molecules sourced from the PubChem and ZINC22 chemical databases, ensuring diversity and relevance to downstream tasks. Each view encoder is pre-trained with self-supervised tasks tailored to capture the unique features of its representation.
MMELON’s extensible architecture allows for seamless integration of additional views, making it a versatile tool for molecular representation learning. For further details, see here.
Follow these steps to set up the biomed-multi-view codebase on your system.
- Operating System: Linux or macOS
- Python Version: Python 3.11
- Conda: Anaconda or Miniconda installed
- Git: Version control to clone the repository
Choose a root directory where you want to install biomed-multi-view. For example:
export ROOT_DIR=~/biomed-multiview
mkdir -p $ROOT_DIR
conda create -y python=3.11 --prefix $ROOT_DIR/envs/biomed-multiview
Activate the environment:
conda activate $ROOT_DIR/envs/biomed-multiview
Navigate to the project directory and clone the repository:
mkdir -p $ROOT_DIR/code
cd $ROOT_DIR/code
# Clone the repository using HTTPS
git clone https://github.com/BiomedSciAI/biomed-multi-view.git
# Navigate into the cloned repository
cd biomed-multi-view
Note: If you prefer using SSH, ensure that your SSH keys are set up with GitHub and use the following command:
git clone [email protected]:BiomedSciAI/biomed-multi-view.git
If you are installing on a Mac, skip this step and proceed to the next step. Otherwise, install the package in editable mode along with development dependencies:
pip install -e .[dev]
Install additional requirements:
pip install -r requirements.txt
If you are using a Mac with Apple Silicon (M1/M2/M3) and the zsh shell, you may need to disable globbing for the installation command:
noglob pip install -e .[dev]
Install macOS-specific requirements optimized for Apple’s Metal Performance Shaders (MPS):
pip install -r requirements-mps.txt
Verify that the installation was successful by running the unit tests:
python -m unittest bmfm_sm.tests.all_tests
Here’s a consolidated list of commands for quick reference:
# Set up the root directory
export ROOT_DIR=~/biomed-multiview
mkdir -p $ROOT_DIR
# Create and activate the Conda environment
conda create -y python=3.11 --prefix $ROOT_DIR/envs/biomed-multiview
conda activate $ROOT_DIR/envs/biomed-multiview
# Clone the repository
mkdir -p $ROOT_DIR/code
cd $ROOT_DIR/code
git clone https://github.com/BiomedSciAI/biomed-multi-view.git
cd biomed-multi-view
# Install dependencies (non-macOS)
pip install -e .[dev]
pip install -r requirements.txt
# For macOS with Apple Silicon
noglob pip install -e .[dev]
pip install -r requirements-mps.txt
# Verify installation
python -m unittest bmfm_sm.tests.all_tests
To explore the pretrained and finetuned models, please refer to the demo notebook. This notebook provides detailed examples and explanations on how to use the models effectively. You can launch the notebook by running the following command from a new terminal window:
cd $ROOT_DIR/code/biomed-multi-view
jupyter lab
Copy the URL displayed in the console and paste it into a browser tab. The URL should look similar to the following:
http://localhost:8888/lab?token=67f8a92c257a82010b4f82b219f7c1ad675ee329e730321e
Navigate to the notebooks directory in the side panel of the Jupyter Lab GUI to locate the notebook named smmv_api_demo.ipynb.
You can generate embeddings for a given molecule using the pretrained model with the following code. You can execute cells 1 through 5 in the notebook to run the same.
from bmfm_sm.api.smmv_api import SmallMoleculeMultiViewModel
# Example SMILES string for a molecule
example_smiles = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
# Obtain embeddings from the pretrained model
example_emb = SmallMoleculeMultiViewModel.get_embeddings(
smiles=example_smiles,
model_path="ibm/biomed.sm.mv-te-84m",
huggingface=True,
)
print(example_emb)
This will output the embedding vector for the given molecule.
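One typical downstream use of such embeddings is comparing molecules by cosine similarity. The sketch below uses placeholder vectors standing in for the arrays returned by get_embeddings:

```python
import numpy as np

# Placeholder embeddings (hypothetical values, not real model output).
emb_a = np.array([0.2, -0.5, 0.1, 0.9])
emb_b = np.array([0.3, -0.4, 0.0, 0.8])

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(emb_a, emb_b))
```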
You can use the finetuned models to make predictions on new data. You can execute cells 6 through 8 in the notebook to run the same.
from bmfm_sm.api.smmv_api import SmallMoleculeMultiViewModel
from bmfm_sm.api.dataset_registry import DatasetRegistry
# Initialize the dataset registry
dataset_registry = DatasetRegistry()
# Example SMILES string
example_smiles = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
# Get dataset information for BACE (classification task)
bace_ds = dataset_registry.get_dataset_info('BACE')
# Load the finetuned model for BACE
finetuned_model_bace = SmallMoleculeMultiViewModel.from_finetuned(
bace_ds,
model_path="ibm/biomed.sm.mv-te-84m-MoleculeNet-ligand_scaffold-BACE-101",
inference_mode=True,
huggingface=True
)
# Get predictions
bace_prediction = SmallMoleculeMultiViewModel.get_predictions(
example_smiles, bace_ds, finetuned_model=finetuned_model_bace
)
print("BACE Prediction:", bace_prediction)
BACE Prediction: {'prediction': tensor(0, dtype=torch.int32)}
Note: The outputs are illustrative. Actual predictions may vary depending on the model and data. For more detailed examples and explanations, please refer to cell 8 of the demo notebook.
To evaluate and train the models, you need to set up the necessary data, splits, configuration files, and model checkpoints. This section will guide you through the process.
First, create a directory to serve as your root directory for all the data. This directory will house all datasets, splits, configuration files, and model checkpoints.
mkdir -p $ROOT_DIR/data_root
Set the $BMFM_HOME environment variable to point to your data root directory. This helps the scripts locate your data.
export BMFM_HOME=$ROOT_DIR/data_root
Optionally, add the export command to your shell configuration file (e.g., $HOME/.bashrc for bash). Note the double quotes, which expand $ROOT_DIR to its current value at write time:
echo "export BMFM_HOME=$ROOT_DIR/data_root" >> $HOME/.bashrc
We provide all the necessary data splits, configuration files, and model checkpoints in a single archive to simplify the setup process.
- Download data_root_os_v1.tar.gz from this location.
- Extract the Archive: Uncompress the tar file into your data root directory:
tar -xzvf data_root_os_v1.tar.gz -C $BMFM_HOME
This will populate $BMFM_HOME with the required files and directories.
We provide a script that downloads the MoleculeNet datasets automatically; run:
run-download-moleculenet
This script will download the MoleculeNet datasets into $BMFM_HOME/datasets/raw_data/MoleculeNet/. Note: The run-download-moleculenet command launches a Python script that can also be executed as python -m bmfm_sm.launch.download_molecule_net_data from the $ROOT_DIR/code/biomed-multi-view directory.
After completing the above steps, your $BMFM_HOME directory should have the following structure:
$BMFM_HOME/
├── bmfm_model_dir
│ ├── finetuned
│ │ └── MULTIVIEW_MODEL
│ │ └── MoleculeNet
│ │ └── ligand_scaffold
│ └── pretrained
│ └── MULTIVIEW_MODEL
│ ├── biomed-smmv-base.pth
│ └── biomed-smmv-with-coeff-agg.pth
├── configs_finetune
│ └── MULTIVIEW_MODEL
│ └── MoleculeNet
│ └── ligand_scaffold
│ ├── BACE
│ ├── BBBP
│ ├── CLINTOX
│ ├── ESOL
│ ├── FREESOLV
│ ├── HIV
│ ├── LIPOPHILICITY
│ ├── MUV
│ ├── QM7
│ ├── SIDER
│ ├── TOX21
│ └── TOXCAST
└── datasets
├── raw_data
│ └── MoleculeNet
│ ├── bace.csv
│ ├── bbbp.csv
│ ├── clintox.csv
│ ├── esol.csv
│ ├── freesolv.csv
│ ├── hiv.csv
│ ├── lipophilicity.csv
│ ├── muv.csv
│ ├── qm7.csv
│ ├── qm9.csv
│ ├── sider.csv
│ ├── tox21.csv
│ └── toxcast.csv
└── splits
└── MoleculeNet
└── ligand_scaffold
├── bace_split.json
├── bbbp_split.json
├── clintox_split.json
├── esol_split.json
├── freesolv_split.json
├── hiv_split.json
├── lipophilicity_split.json
├── muv_split.json
├── qm7_split.json
├── sider_split.json
├── tox21_split.json
└── toxcast_split.json
After successfully completing the installation and data preparation steps, you are now ready to:
- Usage: Learn how to obtain embeddings from the pretrained model.
- Evaluation: Assess the model’s performance on benchmark datasets using our pretrained checkpoints.
- Training: Understand how to finetune the pretrained model using the provided configuration files.
- Inference: Run the model on sample data in inference mode to make predictions.
We provide the run-finetune command for running the finetuning and evaluation processes. You can use this command to evaluate the performance of the finetuned models on various datasets. Note: The run-finetune command launches a Python script that can also be invoked as a module from the $ROOT_DIR/code/biomed-multi-view directory.
To see the usage options for the script, run:
run-finetune --help
This will display:
Usage: run-finetune [OPTIONS]
Options:
--model TEXT Model name
--dataset-group TEXT Dataset group name
--split-strategy TEXT Split strategy name
--dataset TEXT Dataset name
--fit Run training (fit)
--test Run testing
-o, --override TEXT Override parameters in key=value format (e.g.,
trainer.max_epochs=10)
--help Show this message and exit.
To evaluate a finetuned checkpoint on a specific dataset, use the --test option along with the --dataset parameter. For example, to evaluate on the BBBP dataset:
run-finetune --test --dataset BBBP
If you omit the --dataset option, the script will prompt you to choose a dataset:
run-finetune --test
Please choose a dataset (FREESOLV, BBBP, CLINTOX, MUV, TOXCAST, QM9, BACE, LIPOPHILICITY, ESOL, HIV, TOX21, SIDER, QM7): BBBP
This command will evaluate the finetuned checkpoint corresponding to the BBBP dataset using the test set of the ligand_scaffold split.
To finetune the pretrained model on a specific dataset, use the --fit option:
run-finetune --fit --dataset BBBP
Again, if you omit the --dataset option, the script will prompt you to select one:
run-finetune --fit
Please choose a dataset (FREESOLV, BBBP, CLINTOX, MUV, TOXCAST, QM9, BACE, LIPOPHILICITY, ESOL, HIV, TOX21, SIDER, QM7): BBBP
This command will start the finetuning process for the BBBP dataset using the configuration files provided in the configs_finetune directory.
Note: You can override default parameters using the -o or --override option. For example:
run-finetune --fit --dataset BBBP -o trainer.max_epochs=10
Note: If you run into out-of-memory errors, you can reduce the batch size using the following syntax:
run-finetune --fit --dataset BBBP -o data.init_args.batch_size=4
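The dotted key=value overrides above can be applied to a nested configuration roughly like this (a sketch of the general pattern, not the repository's actual override parser; values are kept as strings here):

```python
# Apply a dotted "key=value" override to a nested config dict (illustrative).
def apply_override(config: dict, override: str) -> dict:
    key, value = override.split("=", 1)
    parts = key.split(".")
    node = config
    for p in parts[:-1]:
        node = node.setdefault(p, {})  # walk/create intermediate dicts
    node[parts[-1]] = value            # set the leaf value (as a string)
    return config

cfg = {"trainer": {"max_epochs": 100}}
apply_override(cfg, "trainer.max_epochs=10")
print(cfg)
```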
@misc{suryanarayanan2024multiviewbiomedicalfoundationmodels,
title={Multi-view biomedical foundation models for molecule-target and property prediction},
author={Parthasarathy Suryanarayanan and Yunguang Qiu and Shreyans Sethi and Diwakar Mahajan and Hongyang Li and Yuxin Yang and Elif Eyigoz and Aldo Guzman Saenz and Daniel E. Platt and Timothy H. Rumbell and Kenney Ng and Sanjoy Dey and Myson Burch and Bum Chul Kwon and Pablo Meyer and Feixiong Cheng and Jianying Hu and Joseph A. Morrone},
year={2024},
eprint={2410.19704},
archivePrefix={arXiv},
primaryClass={q-bio.BM},
url={https://arxiv.org/abs/2410.19704},
}