[QUESTION] Splitting large document and bucketing #1090
shafiqabedin asked this question in Q&A
I asked this question in the discussion section but did not receive any response, so I am asking here with a bit more detail. I am trying to figure out whether bucketization is done (or can be done) for model training. By "bucketization" I mean batching samples of similar sequence length together (https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator). The motivation is that I have documents with very long text, and I want to pick a splitting scheme that may produce many samples with a small number of tokens, then bucketize them. That brings me to the second question: is document splitting supported in Megatron?
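To make the question concrete, here is a minimal plain-Python sketch of the bucketing I have in mind (the `bucket_batches` helper and its parameters are hypothetical illustrations, not Megatron or torchtext API):

```python
import random
from collections import defaultdict

def bucket_batches(sequences, batch_size, bucket_width=32):
    """Group sequences of similar length into batches to reduce padding."""
    buckets = defaultdict(list)
    for seq in sequences:
        # Round the length up to the nearest bucket boundary so that
        # sequences within a bucket differ by at most bucket_width tokens.
        key = ((len(seq) + bucket_width - 1) // bucket_width) * bucket_width
        buckets[key].append(seq)
    batches = []
    for bucket in buckets.values():
        random.shuffle(bucket)  # shuffle within a bucket for randomness
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    random.shuffle(batches)  # shuffle batch order across buckets
    return batches

# Example: chunks of 10-500 tokens, as might come from splitting long
# documents; each batch then needs only minimal padding.
sequences = [[0] * random.randint(10, 500) for _ in range(1000)]
for batch in bucket_batches(sequences, batch_size=16)[:3]:
    print(len(batch), [len(s) for s in batch])
```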
Any answer would be much appreciated. Thanks.