This is a very worthwhile effort. Are you considering adding the BERT transformer encoder model and the associated masked language modeling task for pre-training?
The task is actually the same as `ResidueClassificationSolver`, but it would only accept one sequence file (the output) and generate the randomly masked input on the fly. This could be done by a special type of Dataset; that's how fairseq implements it: https://github.com/facebookresearch/fairseq/blob/main/fairseq/data/mask_tokens_dataset.py

One issue I realized, though, is that the data might not fit into memory, so some of the logic would need to be rewritten. But at least for fine-tuning existing language models (which might be the main use case), the in-memory approach would still work.