Input vectors in biotrainer are usually embeddings from protein language models, which typically have large dimensions (512 to 1024 or even more). It would therefore be interesting to see whether model performance improves when training on reduced embeddings, using PCA for example. The reduction could be applied after loading the embeddings and switched on and off in the configuration file.
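A minimal sketch of how such an optional reduction step might look, assuming the embeddings are held as one numpy vector per sequence id; the config keys used here are hypothetical placeholders, not actual biotrainer options:

```python
# Hypothetical sketch: reduce per-sequence embeddings with PCA after loading.
# The config keys (use_dimension_reduction, reduction_target_dim) are illustrative only.
import numpy as np
from sklearn.decomposition import PCA

def maybe_reduce_embeddings(embeddings: dict, config: dict) -> dict:
    """Optionally reduce per-sequence embeddings (one 1D vector per sequence id)."""
    if not config.get("use_dimension_reduction", False):
        return embeddings

    seq_ids = list(embeddings.keys())
    matrix = np.stack([embeddings[seq_id] for seq_id in seq_ids])  # (n_sequences, embedding_dim)

    pca = PCA(n_components=config.get("reduction_target_dim", 128))
    reduced = pca.fit_transform(matrix)  # (n_sequences, reduction_target_dim)

    return {seq_id: reduced[i] for i, seq_id in enumerate(seq_ids)}
```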
Hi, I would like to fix this, but after looking at the code I have some questions I would like to clarify:
When fitting and applying the dimension reduction, what is the individual sample unit? A sequence, a residue, or should it work for both depending on the protocol?
- If 1 sample = 1 sequence, the dimension reduction can only be applied when Protocol.using_per_sequence_embeddings() is True.
- If 1 sample = 1 residue, the dimension reduction can only be applied when Protocol.using_per_sequence_embeddings() is False.
- If it should work for both, I'm thinking that when Protocol.using_per_sequence_embeddings() is True I can just stack the sequence embeddings, and when it is False I can cat the residue embeddings along dim=0 before applying the dimension reduction (see the sketch below).
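To make the two cases concrete, here is a rough sketch with hypothetical torch tensors (not actual biotrainer code):

```python
# Rough sketch of the two cases described above, using placeholder tensors.
import torch

# Per-sequence case: each sequence has one embedding of shape (embedding_dim,)
per_sequence = [torch.randn(1024) for _ in range(3)]
samples = torch.stack(per_sequence, dim=0)       # (n_sequences, embedding_dim)

# Per-residue case: each sequence has an embedding of shape (seq_len, embedding_dim)
per_residue = [torch.randn(length, 1024) for length in (50, 70, 120)]
samples = torch.cat(per_residue, dim=0)          # (total_residues, embedding_dim)

# Either way, `samples` is a 2D matrix that a reducer could be fit on.
```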
Hello and thanks for your interest in contributing to biotrainer!! :)
I discussed this with my colleagues at our chair and we think that applying dimensionality reduction only makes sense for per_sequence embeddings at the moment. We also suggest focusing on non-linear dimensionality reduction methods (so no PCA); UMAP, t-SNE, or PaCMAP would probably be the most interesting. A completely different approach would be to use a mask (e.g. a list of dimension indices) to indicate which dimensions to use from the embedding, or even some method that uses information from Explainable AI (e.g. https://interprot.com/), but it is totally fine to leave that for future coding :)
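Just to illustrate the two directions on stacked per-sequence embeddings; everything below is a hypothetical sketch, the matrix and the mask are placeholders:

```python
# Hypothetical sketch: non-linear reduction vs. a simple dimension mask,
# applied to per-sequence embeddings stacked into a (n_sequences, embedding_dim) matrix.
import numpy as np
import umap  # umap-learn package

matrix = np.random.rand(200, 1024)  # placeholder for stacked per-sequence embeddings

# Non-linear reduction with UMAP
reducer = umap.UMAP(n_components=64)
reduced = reducer.fit_transform(matrix)  # (200, 64)

# Alternative: a mask selecting which embedding dimensions to keep
mask = [0, 5, 42, 128, 512]  # hypothetical list of dimension indices
masked = matrix[:, mask]     # (200, 5)
```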
Of course, if you have a (proposed) use case for PCA or another method that I did not mention, I would be excited to hear about it and discuss it. When working on the code, please make sure that your implementation is modular, such that it can be activated and deactivated easily and methods can be changed in the future. I am happy to review the PR even at an early stage and give feedback. Please let me know if you have any additional questions!
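One way the modularity could look, purely as a sketch; the class, function, and config key names below are made up for illustration and are not part of biotrainer:

```python
# Hypothetical sketch of a modular reducer that can be switched on/off via the config
# and extended with new methods later. All names here are illustrative.
from typing import Protocol
import numpy as np

class DimensionReductionMethod(Protocol):
    def fit_transform(self, embeddings: np.ndarray) -> np.ndarray: ...

def get_reduction_method(name: str, n_components: int) -> DimensionReductionMethod:
    if name == "umap":
        import umap
        return umap.UMAP(n_components=n_components)
    # Further branches (e.g. t-SNE, PaCMAP) could be added here without touching callers.
    raise ValueError(f"Unknown dimension reduction method: {name}")

# Example config snippet (keys are placeholders):
#   dimension_reduction_method: umap
#   n_reduced_components: 64
```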