EVOLVEpro interprets protein language model (PLM) embeddings through a top-layer regression model, learning the relationship between sequence and experimentally determined activity through iterative active learning. The lightweight random forest regression model can optimize multiple protein properties simultaneously over iterative rounds of testing, with as few as 10 experimental data points per round, enabling complex multi-objective evolution campaigns with minimal experimental setup.
We employed an optimized version of EVOLVEpro to evolve a number of proteins:
EVOLVEpro optimized C143 Antibody
The EVOLVEpro workflow consists of four main steps:
- Process: Generate and clean FASTA and CSV files
- PLM: Extract protein language model (PLM) embeddings for all variants
- Run EVOLVEpro: Apply the model to either DMS or experimental data
- Visualize: Prepare outputs and visualizations
Generate and clean FASTA and CSV files containing protein variant sequences and their corresponding activity data.
For detailed instructions, see the Process README.
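As a rough illustration of what a variant FASTA file can look like, the sketch below enumerates every single amino-acid substitution of a wild-type sequence and formats the result as FASTA text. It assumes the variant library is a set of single mutants; the function names, the `WT` header, and the `M1A`-style naming are illustrative choices, not the repository's API.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_mutants(wt):
    """Yield (name, sequence) for every single substitution of `wt`.

    Names follow the pattern <wild-type residue><1-based position><mutant residue>,
    e.g. 'M1A'. This naming scheme is an assumption made for the example.
    """
    for i, wt_aa in enumerate(wt):
        for aa in AMINO_ACIDS:
            if aa != wt_aa:
                yield f"{wt_aa}{i + 1}{aa}", wt[:i] + aa + wt[i + 1:]


def to_fasta(wt, wt_name="WT"):
    """Return FASTA text containing the wild type followed by all single mutants."""
    lines = [f">{wt_name}", wt]
    for name, seq in single_mutants(wt):
        lines += [f">{name}", seq]
    return "\n".join(lines) + "\n"


# Toy three-residue "protein", used only to keep the output small.
fasta_text = to_fasta("MKT")
print(fasta_text.splitlines()[:4])
```

A sequence of length L yields 19 × L mutants plus the wild type, so the toy input above produces 58 records.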
Extract protein language model embeddings for all variants using various PLM models.
For detailed instructions, see the PLM README.
Apply the EVOLVEpro model to optimize protein activity. There are two main workflows:
Use this workflow to optimize a few-shot model on a deep mutational scanning (DMS) dataset, where activity values are known for a large number of variants.
For detailed instructions, see the DMS README.
Use this workflow for iterative experimental optimization of protein activity.
For detailed instructions, see the Experimental README.
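The iterative loop behind both workflows can be sketched as follows: measure a small batch, fit a regressor on everything measured so far, rank the untested variants by predicted activity, and pick the top ones for the next round. This is a minimal sketch under stated assumptions, not the EVOLVEpro implementation. To stay dependency-free it swaps the random forest for a simple k-nearest-neighbour average, and the embeddings, the `simulated_assay` function, and all other names are hypothetical stand-ins for real PLM embeddings and real measurements.

```python
import random


def knn_predict(train, query, k=3):
    """Predict activity as the mean of the k nearest measured variants.

    Stand-in for the random forest regressor; `train` is a list of
    (embedding, activity) pairs.
    """
    nearest = sorted(
        train,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )[:k]
    return sum(activity for _, activity in nearest) / len(nearest)


def run_rounds(embeddings, assay, n_rounds=3, per_round=10, seed=0):
    """Run `n_rounds` of measure -> fit -> rank -> select.

    embeddings: dict mapping variant name -> embedding vector
    assay: callable returning the measured activity of a variant
    Returns the measurements so far and the batch proposed for the next round.
    """
    rng = random.Random(seed)
    measured = {}
    # No model exists before the first round, so start from a random batch.
    batch = rng.sample(sorted(embeddings), per_round)
    for _ in range(n_rounds):
        for variant in batch:
            measured[variant] = assay(variant)
        train = [(embeddings[v], y) for v, y in measured.items()]
        untested = [v for v in embeddings if v not in measured]
        ranked = sorted(
            untested,
            key=lambda v: knn_predict(train, embeddings[v]),
            reverse=True,
        )
        batch = ranked[:per_round]
    return measured, batch


# Synthetic demo: 200 hypothetical variants with 8-dimensional "embeddings";
# activity falls off with squared distance from a hidden optimum.
rng = random.Random(1)
embeddings = {f"var{i:03d}": [rng.gauss(0, 1) for _ in range(8)] for i in range(200)}
optimum = [0.5] * 8


def simulated_assay(variant):
    return -sum((a - b) ** 2 for a, b in zip(embeddings[variant], optimum))


measured, next_batch = run_rounds(embeddings, simulated_assay)
print(len(measured), "variants measured; next batch starts with", next_batch[:3])
```

In a DMS simulation the assay is a lookup into the known activity table, as in the demo; in an experimental campaign it would be replaced by the measurements from each round of testing.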
Prepare outputs and create visualizations to interpret the results of the EVOLVEpro process.
For detailed instructions, see the Plot README.
git clone https://github.com/mat10d/EvolvePro.git
cd EvolvePro
First, create and activate a conda environment with all necessary dependencies for EVOLVEpro:
conda env create -f environment.yml
conda activate evolvepro
The underlying protein language models are installed in a separate environment:
sh setup_plm.sh
conda activate plm
This environment includes:
- Deep learning frameworks (PyTorch)
- Protein language models that are installable via pip (ESM, ProtT5, UniRep, ankh)
- Protein language models that are installable only from GitHub (proteinbert, efficient-evolution)
These environments are kept separate to maintain clean dependencies and avoid conflicts between the core EVOLVEpro functionality and the various protein language models.
For a step-by-step guide to using EVOLVEpro to improve a protein's activity, simulated on a small dataset from the DMS work, see our Google Colab tutorial.
If you encounter any bugs, have feature requests, or need assistance, please open an issue on our GitHub Issues page. When opening an issue, please:
- Check if a similar issue already exists
- Include a clear description of the problem
- Add steps to reproduce the issue if applicable
- Specify your environment details (OS, Python version, etc.)
- Include any relevant error messages or screenshots
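One possible way to collect the environment details for an issue is a short script like the one below. It uses only the Python standard library; the function name and the choice of fields are suggestions, not a required format.

```python
import platform
import sys


def environment_report():
    """Return OS and Python details as plain text for pasting into an issue."""
    fields = {
        "OS": platform.platform(),
        "Python": platform.python_version(),
        "Implementation": platform.python_implementation(),
        "Executable": sys.executable,
    }
    return "\n".join(f"{key}: {value}" for key, value in fields.items())


report = environment_report()
print(report)
```

Run it inside the activated conda environment (`evolvepro` or `plm`) so that the reported interpreter is the one that produced the error.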
We welcome contributions and feedback from the community.
If you use this code in your research, please cite our paper:
@ARTICLE{jiang2024evolvepro,
  author={Jiang, Kaiyi and Yan, Zhaoqing and Di Bernardo, Matteo and Sgrizzi, Samantha R. and Villiger, Lukas and Kayabölen, Alişan and Kim, Byungji and Carscadden, Josephine K. and Hiraizumi, Masahiro and Nishimasu, Hiroshi and Gootenberg, Jonathan S. and Abudayyeh, Omar O.},
  title={Rapid in silico directed evolution by a protein language model with EVOLVEpro},
  year={2024},
  DOI={10.1126/science.adr6006}
}