👩🔬 Add LoRA layers to fine-tune protein language models for embeddings calculation #85
Comments
Does it mean biotrainer cannot be used to fine-tune a PLM at this moment? Wishes,
Hello and thanks for your question. :) No, biotrainer is currently not designed for fine-tuning. We plan to implement fine-tuning via LoRA layers and hope to have the feature complete by mid-2025.
In the meantime, I would recommend using the notebooks provided here: https://github.com/RSchmirler/data-repo_plm-finetune-eval/tree/main/notebooks/finetune
Cool. Thank you. Congratulations on the work! JQ
After migrating the embeddings calculation from bio_embeddings into biotrainer itself, it is now theoretically possible to fine-tune existing protein language models (pLMs) such as ProtTrans for specific tasks. Such tasks might include the prediction of subcellular location, secondary structure, or protein-protein interactions.
While fine-tuning a full pLM on a specific task is very costly, LoRA (Low-Rank Adaptation of Large Language Models) is one way to fine-tune a transformer model by training only a fraction of the original model's parameters.
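As a rough illustration (not biotrainer code; the checkpoint name and all hyperparameters below are arbitrary example choices), LoRA adapters could be attached to a ProtTrans encoder with the Hugging Face `peft` library, leaving the original weights frozen:

```python
# Illustrative sketch: wrap a ProtTrans T5 encoder with LoRA adapters via peft.
# The checkpoint, rank and target modules are example values, not project decisions.
from transformers import T5EncoderModel
from peft import LoraConfig, get_peft_model

model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections to adapt (T5 module names)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well below 1% of the full model
```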
Adding LoRA layers to biotrainer would, therefore, be a meaningful enhancement and in line with the overall premise of biotrainer: making protein prediction tasks easily accessible in a standardized and reproducible way. On the other hand, it also requires a significant change in the sequence of operations that biotrainer performs. Currently, all embeddings are loaded or calculated once at the beginning of training. With fine-tuning, embeddings would have to be recalculated on the fly for every epoch. A possible implementation could replace the current embeddings object with a function that is called every epoch and simply returns constant embeddings if no fine-tuning is applied (see the sketch below). Still, major adaptations must be made to the dataloader module.
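A minimal sketch of that idea, assuming hypothetical names (`FrozenEmbeddings`, `FineTunedEmbeddings`, `train`) that are not part of the current biotrainer API:

```python
# Sketch: the training loop receives a callable producing embeddings per epoch
# instead of a precomputed embeddings dictionary. All names are hypothetical.
from typing import Callable, Dict
import torch

EmbeddingsFn = Callable[[], Dict[str, torch.Tensor]]


class FrozenEmbeddings:
    """No fine-tuning: embeddings are computed once and returned unchanged every epoch."""

    def __init__(self, embeddings: Dict[str, torch.Tensor]):
        self._embeddings = embeddings

    def __call__(self) -> Dict[str, torch.Tensor]:
        return self._embeddings


class FineTunedEmbeddings:
    """Fine-tuning: re-embed all sequences every epoch with the current LoRA weights."""

    def __init__(self, model, tokenizer, sequences: Dict[str, str]):
        self._model, self._tokenizer, self._sequences = model, tokenizer, sequences

    def __call__(self) -> Dict[str, torch.Tensor]:
        embeddings = {}
        for seq_id, sequence in self._sequences.items():
            # ProtTrans models expect space-separated amino acids
            tokens = self._tokenizer(" ".join(sequence), return_tensors="pt")
            embeddings[seq_id] = self._model(**tokens).last_hidden_state.squeeze(0)
        return embeddings


def train(get_embeddings: EmbeddingsFn, n_epochs: int) -> None:
    for _ in range(n_epochs):
        embeddings = get_embeddings()  # constant if no fine-tuning is applied
        # ... build the dataloader from `embeddings` and run the epoch as today ...
```

Whether the pLM gradients should flow through this call (true end-to-end fine-tuning) or the adapter is updated in a separate step is exactly the kind of design decision the dataloader rework would have to settle.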
List of required steps (non-exhaustive):
Additional material: