perf: collator with padding for tpu #29
base: master
Conversation
Two side notes: 1. Python 3.7.11 is for Colab; 2. Poetry is optional for managing venv and dependencies, but syncing with requirements(-dev).txt must be done manually for the time being.
Force-pushed from 62c1536 to 0e66d08.
@@ -124,15 +125,18 @@ def create_labels_column(examples):
    val_dataset = lm_datasets["validation"]

    # DataLoaders creation:
    data_collator = default_data_collator
Do we want to pad to longest for GPU?
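For reference, a minimal sketch of the two padding strategies being weighed here, using `transformers.DataCollatorWithPadding`. The GPT-2 tokenizer and the 1024 max length are illustrative assumptions; also note that our training features carry a `labels` column, which `DataCollatorWithPadding` does not pad, so a real collator for this repo would need to handle that separately.

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# Dynamic padding: each batch is padded to its longest member (efficient on GPU).
gpu_collator = DataCollatorWithPadding(tokenizer, padding="longest")

# Static padding: every batch has the same shape (avoids XLA recompilation on TPU).
tpu_collator = DataCollatorWithPadding(tokenizer, padding="max_length", max_length=1024)

features = [tokenizer("a short text"), tokenizer("a somewhat longer piece of text")]
print(gpu_collator(features)["input_ids"].shape)  # (2, length of the longest example)
print(tpu_collator(features)["input_ids"].shape)  # (2, 1024)
```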
@@ -104,12 +104,14 @@ def loss_fn(batch, outputs, metadata_mask=None):
    return loss


-@hydra.main(config_name="config")
+@hydra.main(config_path=None, config_name="config")
Not really related to this PR but I need it for testing.
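For context, a rough sketch of why `config_path=None` helps when running the script outside the usual layout (e.g. from a test or a notebook). This assumes the config is registered programmatically through Hydra's `ConfigStore`, which is my understanding of how the training script works; the `Config` dataclass below is just a stand-in.

```python
from dataclasses import dataclass

import hydra
from hydra.core.config_store import ConfigStore


@dataclass
class Config:
    max_seq_len: int = 512  # stand-in field for illustration


cs = ConfigStore.instance()
cs.store(name="config", node=Config)


# config_path=None tells Hydra not to look for a config directory on disk,
# so the script can be launched without a conf/ folder next to it.
@hydra.main(config_path=None, config_name="config")
def main(cfg: Config) -> None:
    print(cfg.max_seq_len)


if __name__ == "__main__":
    main()
```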
@@ -157,15 +158,18 @@ def group_texts(examples):
    val_dataset = lm_datasets["validation"]

    # DataLoaders creation:
    data_collator = default_data_collator
Same question as #29 (comment)
Indeed, on TPU the length of the samples is important, you are right. As a preliminary question, I was under the impression that the method […]
Force-pushed from 0e66d08 to d9b548b.
I was under the same impression, but then I realized that perhaps we are not truncating them at all, which is why my Colab notebook always warns me about the length. Admittedly it is probably not really about padding but truncation. If that's the case, then we may also have to modify the tokenizer. For the padding itself, I guess the only useful situation is the control group without metadata.
Please correct me if you disagree, but I think the warning comes from line 69 of […]
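For what it's worth, the length warning is easy to reproduce in isolation. The sketch below assumes a GPT-2 tokenizer, whose model_max_length is 1024:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # model_max_length == 1024
long_text = "lorem ipsum dolor sit amet " * 1000

# Without truncation, transformers logs "Token indices sequence length is longer
# than the specified maximum sequence length for this model (... > 1024)".
ids = tokenizer(long_text)["input_ids"]
print(len(ids))  # well above 1024

# With truncation enabled, the output is capped and the warning disappears.
ids = tokenizer(long_text, truncation=True)["input_ids"]
print(len(ids))  # 1024
```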
Hi, […]
Force-pushed from 877d94c to 029ed13.
@@ -94,7 +95,7 @@ def get_dataloaders(tokenizer, args):
    text_column_name = "text" if "text" in column_names else column_names[0]

    def tokenize_function(examples):
-        return tokenizer(examples[text_column_name])
+        return tokenizer(examples[text_column_name], truncation=True, max_length=args.max_seq_len)
Truncate to `max_seq_len` (512 by default) to preserve space for metadata, cf. #29 (comment).
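In other words, the length budget this change assumes looks roughly like this (the numbers are just the defaults mentioned in this thread, not something enforced anywhere):

```python
model_max_len = 1024   # e.g. GPT-2 context window
max_seq_len = 512      # args.max_seq_len default: cap for the raw text

# Whatever is not used by the text itself stays available for the global
# metadata prefix and the local metadata tags added later in the pipeline.
metadata_budget = model_max_len - max_seq_len
print(f"tokens reserved for metadata per example: {metadata_budget}")  # 512
```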
@@ -66,7 +66,7 @@ def add_metadata_and_chunk_examples(
    if global_metadata_prefix_encoded:
        text_with_local_metadata = " " + text_with_local_metadata
        char_level_metadata_mask = [False] + char_level_metadata_mask
-    text_with_local_metadata_encoded = tokenizer.encode_plus(text_with_local_metadata)
+    text_with_local_metadata_encoded = tokenizer.encode_plus(text_with_local_metadata, truncation=True)
Truncate to model's max seq len (1024) for the whole seq with metadata, cf. #29 (comment).
After truncation, the warnings are gone, and training and evaluation are twice as fast.
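A small point worth noting, sketched with an assumed GPT-2 tokenizer and a made-up string: when `truncation=True` is passed without an explicit `max_length`, the tokenizer falls back to `tokenizer.model_max_length`, which is where the 1024 comes from.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text_with_metadata = "Url: https://example.com. " + "some text " * 2000  # made-up example

# truncation=True with no max_length truncates to tokenizer.model_max_length (1024),
# so the combined metadata + text sequence can never exceed the context window.
enc = tokenizer.encode_plus(text_with_metadata, truncation=True)
assert len(enc["input_ids"]) <= tokenizer.model_max_length
```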
bsmetadata/metadata_utils.py (outdated)
@@ -51,7 +51,7 @@ def add_metadata_and_chunk_examples(
    if add_metadata:
        # Get the global metadata prefix that is prepended to each training example.
        global_metadata_prefix = create_global_metadata_prefix(example, cfg)
-        global_metadata_prefix_encoded = tokenizer.encode_plus(global_metadata_prefix).input_ids
+        global_metadata_prefix_encoded = tokenizer.encode_plus(global_metadata_prefix, truncation=True).input_ids
Same as #29 (comment).
I think I am missing something in these changes. I'll try to explain how I saw things, and if I'm wrong somewhere I'd be very happy to hear from you! In any case, this is research code, so I think we should not hesitate to experiment in small iterations that will allow us to identify problems and new topics, and you're raising some interesting issues here! From my point of view, this is what has been happening so far for an example that was handled by […]
Finally, with this operation an example is transformed into several examples, each of size […]. I would therefore have the impression that the change you propose will reduce the number of examples in our training dataset. If that's the case, is that really what we want?

Then, regarding the […]: I really agree that it would be much better to have a comparable metric! I totally share your point of view that the perplexities are not comparable between a dataset with metadata and a dataset without metadata (if the first one does not take into account, in the calculation of the loss, the tokens corresponding to the metadata). Unfortunately, with a […]

I look forward to reading your thoughts on this! 🤗
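To make sure we are talking about the same step, here is a sketch of the concatenate-then-chunk operation described above, in the spirit of the standard `run_clm`-style `group_texts`. The repo's own `group_texts` may differ in details, and `block_size` is an assumed value.

```python
block_size = 1024  # assumed; the repo may use a different value


def group_texts(examples):
    # Concatenate all tokenized texts in the batch into one long stream.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder: this is the step that discards the last, shorter piece,
    # so every resulting example has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
```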
I was also debating with myself whether to do it in this PR or make another one. So for now I changed this one to a draft. Hopefully we can have it done once and for all (not really all, but you know what I mean...). (Pardon me that I'm gonna skip quoting the whole thing this time.)

If I understand the code and your point correctly, I believe I share a pretty similar concern with you. The truncation this PR does so far is merely a WIP-style checkpoint (which I should have clarified beforehand, my apologies). Admittedly, when there is already a series of procedures that cuts pieces and groups blocks, truncating before it feels totally redundant and even disruptive.

The main reason why I want to try truncation is that, when splitting concatenated examples into blocks/chunks, I believe the blocks that are not the heads of the original examples may behave differently for an autoregressive LM, and then the perplexity may be less intuitive (to me). For example, assuming a super long original example says "wubba lubba dub dub!", the concat-and-split would produce two blocks, "wubba lubba dub" and "dub!", while truncating in advance would produce just one block, "wubba lubba dub", and my guess is that an autoregressive LM may prefer the second case. In a way it is exactly like what you also mentioned, "remove that last example which has a different size", but I may go even further and drop all blocks (regardless of size) but the first, because I don't know whether the second-through-last blocks make sense to this kind of LM. This whole speculation is probably what bothers/interests me the most. On the bright side, perhaps we can have both configurations as different control/experiment groups? (But I can't shake the feeling that somebody must have done that already...)

So, a truncation before concat-and-split may virtually render the latter almost a no-op, which is actually the trial-and-error I want to have. In this sense, you are absolutely right about the decreased number of examples, and yet I am not confident enough to recommend this trade-off. As for the comparability in general, I am thinking that maybe we can have another kind of control group that prepends noise of the same length as the corresponding metadata. 🍀
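To illustrate that last idea, here is a sketch of the noise-prefix control group. The helper `make_noise_prefix` and the token ids are purely hypothetical, not anything in the repo: the control example gets a random-token prefix of the same length as the real metadata prefix, so both groups see the same sequence lengths.

```python
import random


def make_noise_prefix(metadata_prefix_ids, vocab_size, seed=0):
    """Return random token ids with the same length as the real metadata prefix."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in metadata_prefix_ids]


# Example: a control prefix that is noise of the same length as the metadata prefix.
metadata_prefix_ids = [50256, 12, 345, 678]  # made-up token ids
noise_prefix = make_noise_prefix(metadata_prefix_ids, vocab_size=50257)
assert len(noise_prefix) == len(metadata_prefix_ids)
```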
Yes, that's exactly what happens with the current code.
I don't have a strong opinion here. Do any of you know how this is typically done? What would be the downside of having examples with different sizes (except being less memory-efficient, because we throw away all computations done on the padding tokens)?
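One common alternative, sketched below with a hypothetical `pad_block` helper (this is not what the current code does): keep the shorter last block, pad it up to the block size, and mask the padded positions out of the loss with -100 so they never contribute to the LM objective.

```python
import torch


def pad_block(input_ids, block_size, pad_token_id):
    """Pad a short block to block_size; padded positions are ignored by the loss."""
    n_pad = block_size - len(input_ids)
    return {
        "input_ids": torch.tensor(input_ids + [pad_token_id] * n_pad),
        "attention_mask": torch.tensor([1] * len(input_ids) + [0] * n_pad),
        # -100 is the ignore_index used by torch.nn.CrossEntropyLoss and by the
        # Hugging Face causal LM heads, so padding adds nothing to the loss.
        "labels": torch.tensor(input_ids + [-100] * n_pad),
    }
```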
Whether this is a reasonable thing to do strongly depends on what our examples look like (i.e., how many tokens a single example contains). In your toy example (the "wubba lubba dub dub!" one), I agree that it's probably best not to train on the second (very short) block at all. However, we will probably have much longer examples in practice - let's say an entire paragraph from Wikipedia. Throwing away everything but the first 512 (or 1,024) tokens of this paragraph would drastically reduce the number of examples in our training dataset; we'd be throwing away lots of training examples that the model could learn from. Or did I misunderstand something about your proposal?
I think I understand the concern here. My toy example may be too simple to convey my feeling about the second-through-last sub-examples that may or may not be useful to an autoregressive LM. When we have long paragraphs from mC4, for instance, I imagine some of those chunked sub-examples may be sequences that start in the middle of something. Honestly, I don't really know whether they are just harmless (if not useful) abstract n-grams to the LM or broken (left-side) context that could have some negative impact. (I don't have much experience with this situation because my tasks usually presume that we do sentence segmentation first.)
Force-pushed from 029ed13 to 63c926e.
* Bump nltk from 3.6.5 to 3.6.6

  Bumps [nltk](https://github.com/nltk/nltk) from 3.6.5 to 3.6.6.
  - [Release notes](https://github.com/nltk/nltk/releases)
  - [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
  - [Commits](nltk/nltk@3.6.5...3.6.6)

  ---
  updated-dependencies:
  - dependency-name: nltk
    dependency-type: direct:production
  ...

  Signed-off-by: dependabot[bot] <[email protected]>

* build: bump nltk to 3.6.7 for security and speed

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Tian-Jian Jiang <[email protected]>
Force-pushed from 63c926e to c3c45f4.
* master: (141 commits)
  build: bump nltk to 3.6.7 for security and performance (bigscience-workshop#130)
  build: bump nltk to 3.6.7 for security and performance (#5)
  Add fp16, multi-GPU training script (toy dataset) (bigscience-workshop#123)
  create dataset with html, timestamp, url, datasource, generation length and website description metadata and tittles, footers and headers from HTML (bigscience-workshop#119)
  remove `#SBATCH --gres=gpu:0 ` from `03_create_dataset.slurm` (bigscience-workshop#121)
  Add joint training slurm script (bigscience-workshop#111)
  Add features types for the metadata to extract and test multiprocessing (bigscience-workshop#118)
  feat: add a feature to choose where to extract metadata (bigscience-workshop#116)
  Use dateutil to parse date (bigscience-workshop#117)
  feat: change how the entity extraction process use ids (bigscience-workshop#115)
  add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet (bigscience-workshop#106)
  delete old submodule
  delete ds_store
  style check
  style & quality imports
  handle IndexError for `wikipedia_desc_utils` (bigscience-workshop#102)
  handle the comment specific type not recognized by pyarrow (bigscience-workshop#83)
  quality check
  Change torch version + make it optional (bigscience-workshop#82)
  ...

# Conflicts:
#	bsmetadata/metadata_utils.py
Imitating https://huggingface.co/docs/accelerate/quicktour.html#training-on-tpu
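Roughly what I have in mind, following that quicktour section: pad every batch to one fixed length so the tensor shapes stay static and XLA compiles the step only once. This is only a sketch: `DataCollatorWithFixedPadding` is a hypothetical name, the `max_seq_len` value and the handling of the labels column are assumptions, and the real features here may carry extra columns (e.g. the metadata mask) that would need the same treatment.

```python
from dataclasses import dataclass

import torch
from transformers import PreTrainedTokenizerBase


@dataclass
class DataCollatorWithFixedPadding:
    tokenizer: PreTrainedTokenizerBase
    max_seq_len: int = 1024

    def __call__(self, features):
        # Pull out labels first: tokenizer.pad only knows about model inputs.
        labels = [f.pop("labels") for f in features] if "labels" in features[0] else None
        batch = self.tokenizer.pad(
            features,
            padding="max_length",
            max_length=self.max_seq_len,
            return_tensors="pt",
        )
        if labels is not None:
            # Pad labels with -100 so the padded positions are ignored by the loss.
            batch["labels"] = torch.tensor(
                [ids + [-100] * (self.max_seq_len - len(ids)) for ids in labels]
            )
        return batch
```

On GPU, the same collator could switch padding back to "longest", which circles back to the question raised earlier in this thread.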
I also see that there's a padding procedure in https://github.com/bigscience-workshop/metadata/blob/master/bsmetadata/experiments/sample.py#L16
Not really sure: