💡 Notes
- This is a list accompanying our manuscript 'Title' (preprint, code). We focus on deep learning methods for protein design from the last four years. This table complements Table 1 presented in our manuscript.
- We curated this list manually, and as such, it might be incomplete. Please drop us an email or open an issue if you find your method missing.
- We order the methods by release date (preprint when available) and categorize them into four classes (for more details on these categories, see our preprint):
  - 1: 'fixed-backbone' protein design; p(sequence|structure)
  - 2: structure generation; p(structure)
  - 3: sequence generation; p(sequence) or p(sequence|sequence seed)
  - 4: concomitant sequence and structure design; p(sequence and structure), which can be constrained
- Others before us have done excellent work assembling other methods, sometimes overlapping with this list. We link these lists here:
  - Kevin Yang's list on ML methods for protein research
  - Christian Dallago & Sergey Ovchinnikov's lists on structure prediction methods and protein language models
  - Simon Dürr and Gina El Nesr's list on sequence protein design
- We sort our lists chronologically.
Methods in this class attempt to solve the classical protein design problem: find an optimal sequence that adopts a pre-determined 3D structure.
Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
---|---|---|---|---|---|---|---|---|
SPIN2 | FNN | ~105k | 3D structure | sequence | 1,532 X-ray structures | Paper | Code used to be here - no longer available | 2018/02 |
SPROF | CNN-LSTM | - | 3D structure | sequence | 1,532 X-ray structures | Paper | Code Web Server | 2018/02 |
ProteinSolver | GNN | - | 3D structure | sequence | 72,464,122 sequences/adjacency matrices pairs | Paper | Code | 2019/12 |
ProDCoNN | CNN | >28k | 3D structure | sequence | 26,179 sequences/PDB pairs? | Paper | - | 2019/12 |
Ingraham et al. | modified Transformer | >3k | 3D structure | sequence | CATH 4.2 40% sequences/structures | Paper | Code | 2019/12 |
Anand et al. | CNN | - | 3D structure | sequence | 53,414 CATH domain structures | Paper | Code | 2020/01 |
DenseCPD | CNN | 3M | 3D structure | sequence | 11,227 3D structures | Paper | Web server | 2020/01 |
GVP | GVP | - | 3D structure | sequence | CATH 4.2 40% sequences/structures | Paper | Code | 2020/07 |
Norn et al. | CNN | N/A | Distance map? | sequence | N/A | Paper | Code | 2020/07 |
Fold2Seq | modified Transformer | - | 3D structure | sequence | 45,995 3D structures from CATH 4.2 | Paper | Code | 2021/06 |
CNN_protein_landscape | CNN | >10M | 3D structure | sequence | 16,569 PDB chains | Paper | Code | 2021/08 |
Orellana et al. | GCN | - | 3D structures | sequence | CATH 4.2 40% sequences/structures | Paper | - | 2021/11 |
McPartlon et al. | modified Transformer | >10k | 3D structures | sequences | 37k 3D structures from BC40 | Paper | - | 2022/04 |
ESM-IF1 | GVP-Transformer | 142M | 3D structure | sequence | 16k 3D structures + 1.2 M AF2 predictions | Paper | Code | 2022/04 |
ABACUS-R | Transformer | 152M | 3D structures | sequence | CATH 4.2 | Paper | Code | 2022/02 |
ProteinMPNN | MPNN | >28k | 3D structure | sequence | CATH 4.2 40% sequences/structures | Paper | Code Web Interface | 2022/07 |
ProDESIGN-LE | ? | ? | ? | sequence | ? | Paper | - | 2022/07 |
MIF | SGNN | ? | ? | sequence | ? | Paper | Code | 2022/05 |
Methods in this class generate structures either unconditionally or conditioned on a set of secondary-structure constraints.
Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
---|---|---|---|---|---|---|---|---|
64GAN | GAN | - | - | contact map (3D structure via ADMM) | 427,659 contact maps | Paper | - | 2018/12 |
Anand et al. | GAN | - | - | distance map (3D structure via CNN) | 800,000 distance maps | Paper | ? | 2019/03 |
RamaNet | LSTM | - | - | sequence of φ and ψ angles | 607 helical structures | Paper | Code | 2019/06 |
Ig-VAE | VAE | - | - | protein backbone coordinates | 10,768 individual immunoglobulin domains | Paper | Code | 2022/02 |
SCUBA | NC-NN | ~20k | secondary structure motif | backbone | 12,465 structures | Paper | Code | 2022/02 |
GENESIS | VAE | - | secondary structure motif | contact map | 40,726 backbones with remodeled loops | Paper | - | 2022/02 |
DECO-VAE | VAE | - | ? | contact graph (translatable to contact map) | >650,000 contact graphs | Paper | - | 2020/04 |
Methods in this class generate sequences, usually with autoregressive language models, and generation can sometimes be conditioned.
Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
---|---|---|---|---|---|---|---|---|
ProteinGAN | GAN | 60M | ? | sequence | 16,706 MDH sequences | Paper | Code | 2019/03 |
ProGen | Transformer | 1.2B | Optional: sequence or function | sequence | 280M sequences | Paper | Code | 2020/03 |
ProtXLnet | Transformer | 409M | Optional: sequence | sequence | UniRef100 | Paper | Code | 2020/07 |
ProtXL | Transformer | 562M | Optional: sequence | sequence | BFD100 | Paper | Code | 2020/07 |
ProtElectra-Generator | Transformer | 420M | Optional: sequence | sequence | UniRef100 | Paper | Code | 2020/07 |
ProtT5 | Transformer | 3B | Optional: sequence | sequence | UniRef100 | Paper | Code | 2020/07 |
EVE | VAE | - | MSA | sequence | 3,219 MSAs | Paper | Code | 2020/12 |
arDCA | one layer autoregressive model + logistic regression | - | Optional: sequence | sequence | 1,019,208 sequences | Paper | Code | 2021/03 |
DARK3 | Transformer | 110M | Optional: sequence | sequence | 615,000 sequences | Paper | ? | 2022/01 |
ProtGPT2 | Transformer | 739M | Optional: sequence | sequence | UniRef50 | Paper | Code | 2022/03 |
RITA | Transformer | 1.2B | Optional: sequence | sequence | UniRef50 | Paper | Code | 2022/05 |
ProGen2 | Transformer | 6.4B | Optional: sequence | sequence | ? | Paper | Code | 2022/06 |
Methods in this class generate sequences and structures concomitantly, and include hallucination methods and constrained generation (inpainting).
Name | Architecture | Number of Parameters | User Input | Output | Training Dataset | Paper | Code | Release Month/Year |
---|---|---|---|---|---|---|---|---|
Hallucination | CNN (trRosetta) | N/A | random sequence | sequence/structure | N/A | Paper | Code | 2020/07 |
Constrained hallucination | CNN (trRosetta) | N/A | sequence/structure | sequence/structure | N/A | Paper | Code | 2020/11 |
Constrained hallucination2 | CNN (RoseTTAFold) | N/A | sequence/structure | sequence/structure | N/A | Paper | Code | 2021/11 |
RFjoint | CNN (RoseTTAFold, finetuned) | N/A | sequence/structure | sequence/structure | 25% PDB version 02/2020 + 75% AF2 structures | Paper | Code | 2021/11 |
Protein Diffusion | Diffusion model | - | Secondary structure | sequence/structure | 53,414 3D structures (95% CATH 4.2 S95) | Paper | Code | 2022/05 |
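
Since the lists are meant to be ordered by release date, a short script can help check or restore that order when new rows are added. The following is a minimal sketch, not part of this repository: the helper names (`parse_row`, `release_key`, `is_chronological`) are hypothetical, and it assumes every row has its release date as the last cell in `YYYY/MM` form. The sample rows are copied from the fixed-backbone table.

```python
def parse_row(line):
    """Split one pipe-delimited table row into a list of stripped cells."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def release_key(cells):
    """Turn the last cell (assumed to be 'YYYY/MM') into a (year, month) tuple."""
    year, month = cells[-1].split("/")
    return int(year), int(month)


def is_chronological(rows):
    """Return True if the rows are already in non-decreasing release order."""
    keys = [release_key(parse_row(row)) for row in rows]
    return keys == sorted(keys)


# Three rows taken from the fixed-backbone table, in their listed order.
rows = [
    "GVP | GVP | - | 3D structure | sequence | CATH 4.2 40% sequences/structures | Paper | Code | 2020/07 |",
    "ESM-IF1 | GVP-Transformer | 142M | 3D structure | sequence | 16k 3D structures + 1.2 M AF2 predictions | Paper | Code | 2022/04 |",
    "ABACUS-R | Transformer | 152M | 3D structures | sequence | CATH 4.2 | Paper | Code | 2022/02 |",
]

# sorted() is stable, so rows sharing a release month keep their relative order.
sorted_rows = sorted(rows, key=lambda row: release_key(parse_row(row)))
sorted_names = [parse_row(row)[0] for row in sorted_rows]

print(is_chronological(rows))
print(sorted_names)
```

Rows with a malformed date cell (for example a two-digit year or a stray bracket) would raise a `ValueError` or sort incorrectly in this sketch, which can itself be a useful signal that a row needs cleaning up.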