Add automatic dimensionality reduction feature for input vectors #113

Open
SebieF opened this issue Oct 14, 2024 · 2 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@SebieF
Collaborator

SebieF commented Oct 14, 2024

Input vectors in biotrainer are usually embeddings from protein language models, which typically have large dimensionality (512 to 1024 dimensions or even more). It is therefore interesting to see whether model performance increases when training on reduced embeddings, using PCA for example. This could be done after loading the embeddings and be switched on and off in the configuration file.
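A minimal sketch of what this could look like, assuming per-sequence embeddings and a hypothetical config switch (`dimension_reduction_method`, `reduced_dimension`, and `reduce_embeddings` are illustrative names, not existing biotrainer options):

```python
# Hypothetical sketch: apply PCA to loaded per-sequence embeddings,
# controlled by an (assumed) config option. Not actual biotrainer code.
import numpy as np
from sklearn.decomposition import PCA

def reduce_embeddings(embeddings: np.ndarray, n_components: int = 128) -> np.ndarray:
    """Fit PCA on the stacked embeddings (shape: [n_samples, embedding_dim])
    and project them down to n_components dimensions."""
    pca = PCA(n_components=n_components)
    return pca.fit_transform(embeddings)

# Toggled via a hypothetical configuration entry, e.g.:
#   dimension_reduction_method: pca
#   reduced_dimension: 128
embeddings = np.random.rand(500, 1024)  # 500 sequences, 1024-dim embeddings
reduced = reduce_embeddings(embeddings, n_components=128)
print(reduced.shape)  # (500, 128)
```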

@SebieF added the enhancement and good first issue labels on Oct 14, 2024
@nadyadevani3112

Hi, I would like to work on this, but after looking at the code I have some questions that I would like to clarify:

  1. When fitting and applying the dimensionality reduction, what is the individual sample unit? A sequence, a residue, or should it be applicable to both depending on the protocol?

If 1 sample = 1 sequence, that means the dimensionality reduction can only be applied when Protocol.using_per_sequence_embeddings() is True.

If 1 sample = 1 residue, that means the dimensionality reduction can only be applied when Protocol.using_per_sequence_embeddings() is False.

However, if it should work for both, I'm thinking that when Protocol.using_per_sequence_embeddings() is True, I can simply stack the sequence embeddings, and when it is False, I can concatenate the residue embeddings along dim=0 before applying the dimensionality reduction (see the sketch below).
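For illustration, a sketch of the two cases described above (the tensor shapes are assumptions about how biotrainer stores embeddings):

```python
# Assumed shapes: per-sequence embeddings are 1-D tensors [embedding_dim];
# per-residue embeddings are 2-D tensors [seq_len, embedding_dim] with varying seq_len.
import torch

per_sequence = [torch.rand(1024) for _ in range(3)]           # 3 sequences
per_residue = [torch.rand(n, 1024) for n in (50, 80, 120)]    # 3 sequences, varying length

# 1 sample = 1 sequence: stack into [n_sequences, embedding_dim]
stacked = torch.stack(per_sequence, dim=0)      # shape: [3, 1024]

# 1 sample = 1 residue: concatenate along dim=0 into [total_residues, embedding_dim]
concatenated = torch.cat(per_residue, dim=0)    # shape: [250, 1024]
```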

@SebieF
Collaborator Author

SebieF commented Nov 11, 2024

Hello and thanks for your interest in contributing to biotrainer!! :)

I discussed this with my colleagues at our chair, and we think that applying dimensionality reduction only makes sense for per_sequence embeddings at the moment. We also suggest focusing on non-linear dimensionality reduction methods (so no PCA); UMAP, t-SNE, or PaCMAP would probably be the most interesting. A completely different approach would be to use a mask (e.g. a list of numbers) to indicate which dimensions of the embedding to use, or even some method that draws on information from Explainable AI (e.g. https://interprot.com/), but it is totally fine to leave that for future coding :)
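For the mask idea, a tiny sketch (purely illustrative; the mask values are made up):

```python
# Sketch of the mask approach: keep only selected embedding dimensions.
import numpy as np

embeddings = np.random.rand(200, 1024)
mask = [0, 5, 42, 100, 512]      # hypothetical list of dimensions to keep
masked = embeddings[:, mask]     # shape: (200, 5)
```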

Of course, if you have a (proposed) use case for PCA or another method that I did not mention, I would be excited to hear about it and discuss it. When working on the code, please make sure that your implementation is modular, so that it can be activated and deactivated easily and methods can be swapped in the future. I am happy to review the PR even at an early stage and give feedback. Please let me know if you have any additional questions!
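For illustration, a rough sketch of such a modular design (the `get_dimension_reducer` interface and config key are assumptions, not biotrainer's actual API; requires scikit-learn and umap-learn):

```python
# Sketch of a switchable, swappable reduction step (hypothetical interface).
from typing import Optional
import numpy as np

def get_dimension_reducer(method: Optional[str], n_components: int):
    """Return a fit_transform-capable reducer, or None if the feature is off."""
    if method is None:
        return None  # feature deactivated in the config
    if method == "umap":
        import umap
        return umap.UMAP(n_components=n_components)
    if method == "tsne":
        from sklearn.manifold import TSNE
        return TSNE(n_components=n_components)
    raise ValueError(f"Unknown dimensionality reduction method: {method}")

embeddings = np.random.rand(200, 1024)
reducer = get_dimension_reducer("umap", n_components=2)  # e.g. from the config file
if reducer is not None:
    embeddings = reducer.fit_transform(embeddings)
```

One design consideration: t-SNE has no transform for unseen data, so if the reducer must be fit on the training split and then applied to validation/test embeddings, methods with a separate transform step (like UMAP) are easier to integrate.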
