Input vectors in biotrainer are usually embeddings from protein language models, which typically have large dimensions (512 to 1024 or even more). It would therefore be interesting to see whether model performance improves when training on reduced embeddings, using PCA for example. The reduction could be applied after loading the embeddings and switched on and off in the configuration file.
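A minimal sketch of how such an optional reduction step might look, assuming the embeddings are held as one numpy vector per sequence id; the config keys used here are hypothetical placeholders, not actual biotrainer options:

```python
# Hypothetical sketch: reduce per-sequence embeddings with PCA after loading.
# The config keys (use_dimension_reduction, reduction_target_dim) are illustrative only.
import numpy as np
from sklearn.decomposition import PCA

def maybe_reduce_embeddings(embeddings: dict, config: dict) -> dict:
    """Optionally reduce per-sequence embeddings (one 1D vector per sequence id)."""
    if not config.get("use_dimension_reduction", False):
        return embeddings

    seq_ids = list(embeddings.keys())
    matrix = np.stack([embeddings[seq_id] for seq_id in seq_ids])  # (n_sequences, embedding_dim)

    pca = PCA(n_components=config.get("reduction_target_dim", 128))
    reduced = pca.fit_transform(matrix)  # (n_sequences, reduction_target_dim)

    return {seq_id: reduced[i] for i, seq_id in enumerate(seq_ids)}
```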
Hi, I would like to fix this, but after looking at the code I have some questions I would like to clarify:
When fitting and applying the dimension reduction, what is the individual sample unit? A sequence, a residue, or should it work for both depending on the protocol?
- If 1 sample = 1 sequence, the dimension reduction can only be applied when Protocol.using_per_sequence_embeddings() is True.
- If 1 sample = 1 residue, the dimension reduction can only be applied when Protocol.using_per_sequence_embeddings() is False.
- If it should work for both, I'm thinking that when Protocol.using_per_sequence_embeddings() is True I can just stack the sequence embeddings, and when it is False I can cat the residue embeddings along dim=0 before applying the dimension reduction (see the sketch below).
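To make the two cases concrete, here is a rough sketch with hypothetical torch tensors (not actual biotrainer code):

```python
# Rough sketch of the two cases described above, using placeholder tensors.
import torch

# Per-sequence case: each sequence has one embedding of shape (embedding_dim,)
per_sequence = [torch.randn(1024) for _ in range(3)]
samples = torch.stack(per_sequence, dim=0)       # (n_sequences, embedding_dim)

# Per-residue case: each sequence has an embedding of shape (seq_len, embedding_dim)
per_residue = [torch.randn(length, 1024) for length in (50, 70, 120)]
samples = torch.cat(per_residue, dim=0)          # (total_residues, embedding_dim)

# Either way, `samples` is a 2D matrix that a reducer could be fit on.
```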
Hello and thanks for your interest in contributing to biotrainer!! :)
I discussed this with my colleagues at our chair and we think that applying dimensionality reduction only makes sense for per_sequence embeddings at the moment. We also suggest focusing on non-linear dimensionality reduction methods (so no PCA); UMAP, t-SNE, or PaCMAP would probably be the most interesting. A completely different approach would be to use a mask (e.g. a list of dimension indices) to indicate which dimensions to use from the embedding, or even some method that uses information from Explainable AI (e.g. https://interprot.com/), but it is totally fine to leave that for future coding :)
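Just to illustrate the two directions on stacked per-sequence embeddings; everything below is a hypothetical sketch, the matrix and the mask are placeholders:

```python
# Hypothetical sketch: non-linear reduction vs. a simple dimension mask,
# applied to per-sequence embeddings stacked into a (n_sequences, embedding_dim) matrix.
import numpy as np
import umap  # umap-learn package

matrix = np.random.rand(200, 1024)  # placeholder for stacked per-sequence embeddings

# Non-linear reduction with UMAP
reducer = umap.UMAP(n_components=64)
reduced = reducer.fit_transform(matrix)  # (200, 64)

# Alternative: a mask selecting which embedding dimensions to keep
mask = [0, 5, 42, 128, 512]  # hypothetical list of dimension indices
masked = matrix[:, mask]     # (200, 5)
```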
Of course, if you have a (proposed) use case for PCA or another method that I did not mention, I would be excited to hear about it and discuss it. When working on the code, please make sure that your implementation is modular, such that it can be activated and deactivated easily and methods can be changed in the future. I am happy to review the PR even at an early stage and give feedback. Please let me know if you have any additional questions!
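One way the modularity could look, purely as a sketch; the class, function, and config key names below are made up for illustration and are not part of biotrainer:

```python
# Hypothetical sketch of a modular reducer that can be switched on/off via the config
# and extended with new methods later. All names here are illustrative.
from typing import Protocol
import numpy as np

class DimensionReductionMethod(Protocol):
    def fit_transform(self, embeddings: np.ndarray) -> np.ndarray: ...

def get_reduction_method(name: str, n_components: int) -> DimensionReductionMethod:
    if name == "umap":
        import umap
        return umap.UMAP(n_components=n_components)
    # Further branches (e.g. t-SNE, PaCMAP) could be added here without touching callers.
    raise ValueError(f"Unknown dimension reduction method: {name}")

# Example config snippet (keys are placeholders):
#   dimension_reduction_method: umap
#   n_reduced_components: 64
```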