
perf: collator with padding for tpu #29

Draft · wants to merge 14 commits into master from perf-collator_with_padding_for_tpu

Conversation

@tianjianjiang (Collaborator) commented Sep 6, 2021

Imitating https://huggingface.co/docs/accelerate/quicktour.html#training-on-tpu

I also see that there's a padding procedure in https://github.com/bigscience-workshop/metadata/blob/master/bsmetadata/experiments/sample.py#L16
I'm not really sure:

  • what we want here for both TPU and GPU (the latter could use longest-per-batch padding); see the sketch below;
  • whether truncation is also what we want.
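
A minimal sketch of the collator idea (the checkpoint name and the max length below are placeholders, not necessarily what the scripts use):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# TPU: pad every batch to a fixed length so XLA does not recompile for each new batch shape.
tpu_collator = DataCollatorWithPadding(tokenizer, padding="max_length", max_length=512)

# GPU: padding to the longest sequence in each batch is usually enough.
gpu_collator = DataCollatorWithPadding(tokenizer, padding="longest")
```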

Two side notes:
1. Python 3.7.11 is for Colab;
2. Poetry is optional for managing venv and dependencies, but syncing
   with requirements(-dev).txt must be done manually for the time being.
@tianjianjiang tianjianjiang self-assigned this Sep 6, 2021
@tianjianjiang tianjianjiang added the enhancement New feature or request label Sep 6, 2021
@tianjianjiang tianjianjiang added this to the Break-in Period milestone Sep 6, 2021
@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch from 62c1536 to 0e66d08 Compare September 6, 2021 14:21
@@ -124,15 +125,18 @@ def create_labels_column(examples):
val_dataset = lm_datasets["validation"]

# DataLoaders creation:
data_collator = default_data_collator
@tianjianjiang (Collaborator, Author) commented:

Do we want to pad to longest for GPU?

@@ -104,12 +104,14 @@ def loss_fn(batch, outputs, metadata_mask=None):
return loss


@hydra.main(config_name="config")
@hydra.main(config_path=None, config_name="config")
@tianjianjiang (Collaborator, Author) commented:

Not really related to this PR but I need it for testing.

@@ -157,15 +158,18 @@ def group_texts(examples):
val_dataset = lm_datasets["validation"]

# DataLoaders creation:
data_collator = default_data_collator
@tianjianjiang (Collaborator, Author) commented:

Same question as #29 (comment)

@SaulLu (Collaborator) commented Sep 6, 2021

Indeed, on TPU the length of the samples is important, you are right.

As a preliminary question, I was under the impression that the method add_metadata_and_chunk_examples in metadata_utils.py already returns samples of the same size. If this is the case, it is as if all examples had been padded/truncated to the same size. Am I missing something here? 🙂

@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch from 0e66d08 to d9b548b Compare September 6, 2021 14:50
@tianjianjiang (Collaborator, Author) commented Sep 6, 2021

As a preliminary question, I was under the impression that the method add_metadata_and_chunk_examples in metadata_utils.py already returns samples of the same size. If this is the case, it is as if all examples had been padded/truncated to the same size. Am I missing something here? 🙂

I was under the same impression, but then I realized that perhaps we are not truncating them at all; therefore my Colab notebook always warns me about the length. Admittedly, it is probably not really about padding but about truncation. If that's the case, then we may also have to modify the tokenizer.

For the padding itself, I guess the only useful situation is for the control group without metadata.

@SaulLu (Collaborator) commented Sep 6, 2021

therefore my Colab notebook always warns me about the length.

Please correct me if you disagree, but I think the warning comes from line 69 of metadata_utils. In that line it is perfectly intentional that the sequence is longer than the maximum length, since it is then divided into sequences of the right length in the loop at line 84.

@tianjianjiang (Collaborator, Author) commented:

therefore my Colab notebook always warns me about the length.

Please correct me if you disagree, but I think the warning comes from line 69 of metadata_utils. In that line it is perfectly intentional that the sequence is longer than the maximum length, since it is then divided into sequences of the right length in the loop at line 84.

Hi,
I'm gonna be busy for at least a couple of days so forgive me if I am not responsive or thorough.
As far as I can tell, the warnings happened before those lines, even without metadata. Admittedly, they are probably not that critical.
I will check carefully later and get back to you.

@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch 4 times, most recently from 877d94c to 029ed13 Compare September 8, 2021 22:23
@@ -94,7 +95,7 @@ def get_dataloaders(tokenizer, args):
text_column_name = "text" if "text" in column_names else column_names[0]

def tokenize_function(examples):
return tokenizer(examples[text_column_name])
return tokenizer(examples[text_column_name], truncation=True, max_length=args.max_seq_len)
@tianjianjiang (Collaborator, Author) commented Sep 8, 2021:

Truncate to max_seq_len (512 by default) to preserve space for metadata, cf. #29 (comment).

@@ -66,7 +66,7 @@ def add_metadata_and_chunk_examples(
if global_metadata_prefix_encoded:
text_with_local_metadata = " " + text_with_local_metadata
char_level_metadata_mask = [False] + char_level_metadata_mask
text_with_local_metadata_encoded = tokenizer.encode_plus(text_with_local_metadata)
text_with_local_metadata_encoded = tokenizer.encode_plus(text_with_local_metadata, truncation=True)
@tianjianjiang (Collaborator, Author) commented Sep 8, 2021:

Truncate to model's max seq len (1024) for the whole seq with metadata, cf. #29 (comment).

@tianjianjiang (Collaborator, Author) commented:

After truncation, the warnings are gone, and both training and evaluation are two times faster.
More importantly, it seems to make the control group fairer. Previously the control group without metadata could have longer texts than the experiment group with metadata, and the former's perplexities were much lower than the latter's. Now the trend is reversed, as expected.

@@ -51,7 +51,7 @@ def add_metadata_and_chunk_examples(
if add_metadata:
# Get the global metadata prefix that is prepended to each training example.
global_metadata_prefix = create_global_metadata_prefix(example, cfg)
global_metadata_prefix_encoded = tokenizer.encode_plus(global_metadata_prefix).input_ids
global_metadata_prefix_encoded = tokenizer.encode_plus(global_metadata_prefix, truncation=True).input_ids
@tianjianjiang (Collaborator, Author) commented:

Same as #29 (comment).

@SaulLu (Collaborator) commented Sep 9, 2021

I think I am missing something in these changes. I'll try to explain how I see things, and if I'm wrong somewhere I'd be very happy to hear from you! In any case, we're writing research code, so I think we should not hesitate to experiment in small iterations that let us identify problems and new topics: and you're raising some interesting issues here!

From my point of view, this is what has been happening so far, for an example that was handled by add_metadata_and_chunk_examples:

  1. We (possibly) add local metadata to the plain text (this increases the size of the example)
  2. We then tokenize this example with metadata. Since the examples in our dataset can be very, very long, this could result in a number of tokens for our sequences >> model.max_seq_len [the warning is raised here]
  3. We cut this example in pieces in order to have only examples of size < model.max_seq_len

Finally, with this operation an example is transformed into several examples, each of size ≤ model.max_seq_len (and all except the last one are exactly model.max_seq_len). Which brings me to a side suggestion: shouldn't we remove that last example which has a different size?
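
(To make steps 2 and 3 concrete, here is a rough sketch of the chunking; the function name and the drop_last_short_chunk flag are hypothetical, not what metadata_utils.py actually does.)

```python
def chunk_token_ids(input_ids, max_seq_len, drop_last_short_chunk=False):
    """Cut one long tokenized example into blocks of at most max_seq_len tokens."""
    chunks = [input_ids[i : i + max_seq_len] for i in range(0, len(input_ids), max_seq_len)]
    # Optionally drop the trailing block whose size differs from max_seq_len.
    if drop_last_short_chunk and chunks and len(chunks[-1]) < max_seq_len:
        chunks = chunks[:-1]
    return chunks
```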

I would therefore have the impression that the change you propose will reduce the number of examples in our training dataset. If that's the case, is that really what we want?

Then, regarding the max_seq_len for the "without_metadata" code: similarly to the script with metadata, what was done here was to tokenize the examples and then create several examples of the same size from them (equal in length to block_size this time).

I really agree that it would be much better to have a comparable metric! I totally share your view that the perplexities are not comparable between a dataset with metadata and one without (if the former does not take the metadata tokens into account in the loss calculation). Unfortunately, with a block_size chosen like this I'm afraid it's still not comparable, as the number of tokens associated with metadata changes for each example.
Also, as I mentioned to @timoschick yesterday, I think it would be great to also try to take local metadata into account in the loss, with a special token at the beginning of the sequence saying whether or not that metadata should be included. This framework would allow comparing perplexities calculated on sequences of the same length for local metadata. However, it would only cover that particular case, not the case where the metadata tokens are not taken into account in the loss.
Moreover, for global metadata I don't see a framework that could have a fixed length. So for now the only solution I can think of is to take baseline examples that correspond exactly to each example with metadata added, but without the metadata (so that we can compare perplexities calculated on sequences of the same length).
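
(As a rough sketch of what excluding metadata tokens from the loss could look like; the tensor shapes and the boolean metadata_mask convention are assumptions here, not the repo's actual loss_fn.)

```python
import torch.nn.functional as F

def masked_causal_lm_loss(logits, labels, metadata_mask):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len);
    # metadata_mask: (batch, seq_len) bool tensor, True where a token is metadata.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    keep = ~metadata_mask[:, 1:]  # only score predictions of non-metadata targets
    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    token_loss = token_loss * keep.reshape(-1).float()
    return token_loss.sum() / keep.sum().clamp(min=1)  # perplexity = exp(this loss)
```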

I look forward to reading your thoughts on this! 🤗

@tianjianjiang tianjianjiang marked this pull request as draft September 9, 2021 14:51
@tianjianjiang (Collaborator, Author) commented Sep 9, 2021

I think I am missing something in these changes. I'll try to explain how I see things, and if I'm wrong somewhere I'd be very happy to hear from you! In any case, we're writing research code, so I think we should not hesitate to experiment in small iterations that let us identify problems and new topics: and you're raising some interesting issues here!

I was also debating with myself whether to do it this time or make another PR. So for now I changed this one to a draft. Hopefully we can have it done once and for all (not really all but you know what I mean...).

(Pardon me that I'm gonna skip quoting the whole thing this time.)

If I understand the code and your point correctly, I believe I share a pretty similar concern. The truncation this PR does so far is merely a WIP-style checkpoint (which I should have clarified beforehand, my apologies). Admittedly, when there is already a series of procedures that cuts pieces and groups blocks, truncating before it feels redundant and even disruptive.

The main reason I want to try truncation is that, when splitting concatenated examples into blocks/chunks, I believe the blocks that were not the heads of the original examples may behave differently for an autoregressive LM, and then the perplexity may be less intuitive (to me).

For example, assuming a super long original example says "wubba lubba dub dub!", concat-and-split would produce two blocks, "wubba lubba dub" and "dub!", while truncating in advance would produce just one block, "wubba lubba dub"; my guess is that an autoregressive LM may prefer the second case.
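
(A tiny illustration with made-up pseudo-tokens and a made-up block size, just to show the difference.)

```python
tokens = ["wubba", "lubba", "dub", "dub", "!"]  # stand-ins for token ids
block_size = 3

# concat-and-split keeps every block:
blocks = [tokens[i : i + block_size] for i in range(0, len(tokens), block_size)]
# -> [["wubba", "lubba", "dub"], ["dub", "!"]]

# truncating in advance keeps only the head block:
head_only = tokens[:block_size]
# -> ["wubba", "lubba", "dub"]
```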

In a way it is exactly like what you mentioned ("remove that last example which has a different size"), but I may go even further and drop all but the first (regardless of the sizes), because I don't know whether the second through last blocks make sense to this kind of LM. This whole speculation is probably what bothers/interests me the most.

On the bright side, perhaps we can have both configurations as different control/experiment groups? (But I can't shake the feeling that somebody must have done this already...)

So, truncation before concat-and-split may render the latter a virtual no-op, which is actually the trial-and-error I want to have. In this sense, you are absolutely right about the decreased number of examples, and yet I am not confident enough to recommend this trade-off.

As for comparability in general, I am thinking that maybe we can have another kind of control group that prepends noise of the same length as the corresponding metadata.

🍀

@timoschick (Contributor) commented:

From my point of view, this is what has been happening so far, for an example that was handled by add_metadata_and_chunk_examples: [...]

Yes, that's exactly what happens with the current code.

Shouldn't we remove that last example which has a different size?

I don't have a strong opinion here. Do any of you know how this is typically done? What would be the downside of having an example with a different size (other than being less memory-efficient, because we throw away all computations done on the padding tokens)?

In a way it is exactly like what you mentioned ("remove that last example which has a different size"), but I may go even further and drop all but the first (regardless of the sizes).

Whether this is a reasonable thing to do strongly depends on what our examples look like (i.e., how many tokens a single example contains). In your toy example (the "wubba lubba dub dub!" one), I agree that it's probably best not to train on the second (very short) block at all. However, we will probably have much longer examples in practice - let's say an entire paragraph from Wikipedia. Throwing away everything but the first 512 (or 1,024) tokens from this paragraph would drastically reduce the number of examples in our training dataset; we'd be throwing away lots of training examples that the model could learn from. Or did I misunderstand something about your proposal?

@tianjianjiang (Collaborator, Author) commented Sep 9, 2021

However, we will probably have much longer examples in practice - let's say an entire paragraph from Wikipedia. Throwing away everything but the first 512 (or 1,024) tokens from this paragraph would drastically reduce the number of examples in our training dataset; we'd be throwing away lots of training examples that the model could learn from. Or did I misunderstand something about your proposal?

I think I understand the concern here. My toy example may be too simple to convey my feeling about the second through last sub-examples, which may or may not be useful to an autoregressive LM. When we have long paragraphs from mC4, for instance, I imagine some of those chunked sub-examples may be sequences that start in the middle of something. Honestly, I don't really know whether they are just harmless (if not useful) abstract n-grams to the LM or broken (left-side) context that could have some negative impact.

(I don't have much experience in this situation because my tasks usually presume that we do sentence segmentation first.)

@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch from 029ed13 to 63c926e Compare September 17, 2021 10:52
@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch from 63c926e to c3c45f4 Compare January 21, 2022 16:01
* master: (141 commits)
  build: bump nltk to 3.6.7 for security and performance (bigscience-workshop#130)
  build: bump nltk to 3.6.7 for security and performance (#5)
  Add fp16, multi-GPU training script (toy dataset) (bigscience-workshop#123)
  create dataset with html, timestamp, url, datasource, generation length and website description metadata and tittles, footers and headers from HTML (bigscience-workshop#119)
  remove `#SBATCH --gres=gpu:0 ` from `03_create_dataset.slurm` (bigscience-workshop#121)
  Add joint training slurm script (bigscience-workshop#111)
  Add features types for the metadata to extract and test multiprocessing (bigscience-workshop#118)
  feat: add a feature to choose where to extract metadata (bigscience-workshop#116)
  Use dateutil to parse date (bigscience-workshop#117)
  feat: change how the entity extraction process use ids (bigscience-workshop#115)
  add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet (bigscience-workshop#106)
  delete old submodule
  delete ds_store
  style check
  style & quality
  imports
  handle IndexError for `wikipedia_desc_utils` (bigscience-workshop#102)
  handle the comment specific type not recognized by pyarrow (bigscience-workshop#83)
  quality check
  Change torch version + make it optional (bigscience-workshop#82)
  ...

# Conflicts:
#	bsmetadata/metadata_utils.py