[QUESTION] Splitting large document and bucketing #1090
shafiqabedin asked this question in Q&A
I asked this question in the discussion section but did not receive any response, so I am asking here with a bit more detail. I am trying to figure out whether bucketization is done (or can be done) for model training. By "bucketization" I mean batching samples of similar sequence length together (https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator). The motivation is that I have documents with very long text, and I want to pick a splitting scheme that may produce many samples with a small number of tokens, then bucketize them. That brings me to the second question: is document splitting supported in Megatron?
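To make the question concrete, here is a minimal plain-Python sketch of the bucketing I have in mind (the `bucket_batches` helper and its parameters are hypothetical illustrations, not Megatron or torchtext API):

```python
import random
from collections import defaultdict

def bucket_batches(sequences, batch_size, bucket_width=32):
    """Group sequences of similar length into batches to reduce padding."""
    buckets = defaultdict(list)
    for seq in sequences:
        # Round the length up to the nearest bucket boundary so that
        # sequences within a bucket differ by at most bucket_width tokens.
        key = ((len(seq) + bucket_width - 1) // bucket_width) * bucket_width
        buckets[key].append(seq)
    batches = []
    for bucket in buckets.values():
        random.shuffle(bucket)  # shuffle within a bucket for randomness
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    random.shuffle(batches)  # shuffle batch order across buckets
    return batches

# Example: chunks of 10-500 tokens, as might come from splitting long
# documents; each batch then needs only minimal padding.
sequences = [[0] * random.randint(10, 500) for _ in range(1000)]
for batch in bucket_batches(sequences, batch_size=16)[:3]:
    print(len(batch), [len(s) for s in batch])
```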
Any answer would be much appreciated. Thanks.