Adding BERT model and protocol = transformer encoder model + masked language modeling (MLM) #70

Open

prihoda opened this issue Mar 2, 2023
This is a very worthwhile effort. Are you considering adding the BERT transformer encoder model and the associated masked language modeling task for pre-training?

The task is actually the same as ResidueClassificationSolver, but it would only accept one sequence file (the output) and generate the randomly masked input on the fly. This could be done with a special type of Dataset; that's how fairseq implements it: https://github.com/facebookresearch/fairseq/blob/main/fairseq/data/mask_tokens_dataset.py
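
For illustration, a minimal sketch of such a masking Dataset could look like the following (this is not the fairseq implementation; the mask token id, vocabulary size, and the standard BERT 80/10/10 masking scheme are assumptions here, and unmasked positions get label -100 so a cross-entropy loss with `ignore_index=-100` skips them):

```python
import torch
from torch.utils.data import Dataset


class MaskTokensDataset(Dataset):
    """Wraps a list of tokenized sequences and applies random masking on the fly."""

    def __init__(self, token_ids, mask_idx, vocab_size, mask_prob=0.15, seed=0):
        self.token_ids = token_ids    # list of 1-D LongTensors, one per sequence
        self.mask_idx = mask_idx      # id of the [MASK] token (assumed to exist in the vocab)
        self.vocab_size = vocab_size
        self.mask_prob = mask_prob
        self.seed = seed

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, index):
        tokens = self.token_ids[index].clone()
        labels = tokens.clone()

        # Per-item generator, so the masking is reproducible; fairseq reseeds per epoch instead.
        g = torch.Generator().manual_seed(self.seed + index)

        # Select ~15% of positions as prediction targets.
        mask = torch.rand(tokens.shape, generator=g) < self.mask_prob
        labels[~mask] = -100  # ignored by CrossEntropyLoss(ignore_index=-100)

        # Of the selected positions: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
        rand = torch.rand(tokens.shape, generator=g)
        tokens[mask & (rand < 0.8)] = self.mask_idx
        random_tokens = torch.randint(self.vocab_size, tokens.shape, generator=g)
        replace = mask & (rand >= 0.8) & (rand < 0.9)
        tokens[replace] = random_tokens[replace]

        return tokens, labels
```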

One issue I realized, though, is that the data might not fit into memory, so some of the logic would need to be rewritten. But at least for fine-tuning existing language models (which might be the main use case) it would work even in memory.
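
If memory does become a problem, one option would be an IterableDataset that streams sequences instead of loading everything up front. A rough sketch, assuming plain FASTA input and a hypothetical `tokenize` callable that maps a sequence string to a tensor of token ids:

```python
from torch.utils.data import IterableDataset


class StreamingSequenceDataset(IterableDataset):
    """Yields tokenized sequences from a FASTA file one at a time."""

    def __init__(self, fasta_path, tokenize):
        self.fasta_path = fasta_path
        self.tokenize = tokenize  # callable: str -> LongTensor of token ids (assumed)

    def __iter__(self):
        with open(self.fasta_path) as handle:
            sequence = []
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    # New record: emit the previous sequence, if any.
                    if sequence:
                        yield self.tokenize("".join(sequence))
                        sequence = []
                elif line:
                    sequence.append(line)
            if sequence:
                yield self.tokenize("".join(sequence))
```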
