
perf: collator with padding for tpu #29

Draft · wants to merge 14 commits into master from perf-collator_with_padding_for_tpu

Conversation

@tianjianjiang (Collaborator) commented Sep 6, 2021

Imitating https://huggingface.co/docs/accelerate/quicktour.html#training-on-tpu

I also see that there's a padding procedure in https://github.com/bigscience-workshop/metadata/blob/master/bsmetadata/experiments/sample.py#L16
I'm not really sure:

  • what we want here for both TPU and GPU (the latter could use longest-per-batch padding); see the sketch below;
  • whether truncation is also what we want.
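
A minimal sketch of the collator idea (the checkpoint name and the max length below are placeholders, not necessarily what the scripts use):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# TPU: pad every batch to a fixed length so XLA does not recompile for each new batch shape.
tpu_collator = DataCollatorWithPadding(tokenizer, padding="max_length", max_length=512)

# GPU: padding to the longest sequence in each batch is usually enough.
gpu_collator = DataCollatorWithPadding(tokenizer, padding="longest")
```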

Two side notes:
1. Python 3.7.11 is for Colab;
2. Poetry is optional for managing venv and dependencies, but syncing
   with requirements(-dev).txt must be done manually for the time being.
@tianjianjiang tianjianjiang self-assigned this Sep 6, 2021
@tianjianjiang tianjianjiang added the enhancement New feature or request label Sep 6, 2021
@tianjianjiang tianjianjiang added this to the Break-in Period milestone Sep 6, 2021
@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch from 62c1536 to 0e66d08 Compare September 6, 2021 14:21
@@ -124,15 +125,18 @@ def create_labels_column(examples):
val_dataset = lm_datasets["validation"]

# DataLoaders creation:
data_collator = default_data_collator
@tianjianjiang (Collaborator, Author) commented:

Do we want to pad to longest for GPU?

@@ -104,12 +104,14 @@ def loss_fn(batch, outputs, metadata_mask=None):
return loss


@hydra.main(config_name="config")
@hydra.main(config_path=None, config_name="config")
@tianjianjiang (Collaborator, Author) commented:

Not really related to this PR but I need it for testing.

@@ -157,15 +158,18 @@ def group_texts(examples):
val_dataset = lm_datasets["validation"]

# DataLoaders creation:
data_collator = default_data_collator
@tianjianjiang (Collaborator, Author) commented:

Same question as #29 (comment)

@SaulLu (Collaborator) commented Sep 6, 2021

Indeed, on TPU the length of the samples is important, you are right.

As a preliminary question, I was under the impression that the method add_metadata_and_chunk_examples in metadata_utils.py already returns samples of the same size. If this is the case, it is as if all examples had been padded/truncated to the same size. Am I missing something here? 🙂

@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch from 0e66d08 to d9b548b Compare September 6, 2021 14:50
@tianjianjiang (Collaborator, Author) commented Sep 6, 2021

As a preliminary question, I was under the impression that the method add_metadata_and_chunk_examples in metadata_utils.py already returns samples of the same size. If this is the case, it is as if all examples had been padded/truncated to the same size. Am I missing something here? 🙂

I was under the same impression, but then I realized that perhaps we are not truncating them at all; therefore my Colab notebook always warns me about the length. Admittedly, it is probably not really about padding but about truncation. If that's the case, then we may also have to modify the tokenizer.

For the padding itself, I guess the only useful situation is for the control group without metadata.

@SaulLu (Collaborator) commented Sep 6, 2021

therefore my Colab notebook always warns me about the length.

Please correct me if you disagree, but I think the warning comes from line 69 of metadata_utils. In that line it is perfectly intentional that the sequence is longer than the maximum length, since it is then divided into sequences of the right length in the loop at line 84.

@tianjianjiang (Collaborator, Author) commented:

therefore my Colab notebook always warns me about the length.

Please correct me if you disagree, but I think the warning comes from line 69 of metadata_utils. In that line it is perfectly intentional that the sequence is longer than the maximum length, since it is then divided into sequences of the right length in the loop at line 84.

Hi,
I'm gonna be busy for at least a couple of days so forgive me if I am not responsive or thorough.
As far as I can tell, the warnings happened before those lines, even without metadata. Admittedly, they are probably not that critical.
I will check carefully later and get back to you.

@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch 4 times, most recently from 877d94c to 029ed13 Compare September 8, 2021 22:23
@@ -94,7 +95,7 @@ def get_dataloaders(tokenizer, args):
text_column_name = "text" if "text" in column_names else column_names[0]

def tokenize_function(examples):
return tokenizer(examples[text_column_name])
return tokenizer(examples[text_column_name], truncation=True, max_length=args.max_seq_len)
@tianjianjiang (Collaborator, Author) commented Sep 8, 2021:

Truncate to max_seq_len (512 by default) to preserve space for metadata, cf. #29 (comment).

@@ -66,7 +66,7 @@ def add_metadata_and_chunk_examples(
if global_metadata_prefix_encoded:
text_with_local_metadata = " " + text_with_local_metadata
char_level_metadata_mask = [False] + char_level_metadata_mask
text_with_local_metadata_encoded = tokenizer.encode_plus(text_with_local_metadata)
text_with_local_metadata_encoded = tokenizer.encode_plus(text_with_local_metadata, truncation=True)
@tianjianjiang (Collaborator, Author) commented Sep 8, 2021:

Truncate to model's max seq len (1024) for the whole seq with metadata, cf. #29 (comment).

@tianjianjiang (Collaborator, Author) commented:

After truncation, the warnings are gone, and both training and evaluation are two times faster.
More importantly, it seems to make the control group fairer. Previously the control group without metadata could have longer texts than the experiment group with metadata, and the former's perplexities were much lower than the latter's. Now the trend is reversed, as expected.

@@ -51,7 +51,7 @@ def add_metadata_and_chunk_examples(
if add_metadata:
# Get the global metadata prefix that is prepended to each training example.
global_metadata_prefix = create_global_metadata_prefix(example, cfg)
global_metadata_prefix_encoded = tokenizer.encode_plus(global_metadata_prefix).input_ids
global_metadata_prefix_encoded = tokenizer.encode_plus(global_metadata_prefix, truncation=True).input_ids
@tianjianjiang (Collaborator, Author) commented:

Same as #29 (comment).

@SaulLu (Collaborator) commented Sep 9, 2021

I think I am missing something in these changes. I'll try to explain how I see things, and if I'm wrong somewhere I'd be very happy to hear from you! In any case, we're writing research code, so I think we should not hesitate to experiment in small iterations that let us identify problems and new topics: and you're raising some interesting issues here!

From my point of view, this is what has been happening so far, for an example that was handled by add_metadata_and_chunk_examples:

  1. We (possibly) add local metadata to the plain text (this increases the size of the example)
  2. We then tokenize this example with metadata. Since the examples in our dataset can be very, very long, this could result in a number of tokens for our sequences >> model.max_seq_len [the warning is raised here]
  3. We cut this example in pieces in order to have only examples of size < model.max_seq_len

Finally, with this operation an example is transformed into several examples, each of size ≤ model.max_seq_len (and all except the last one are exactly model.max_seq_len). Which brings me to a side suggestion: shouldn't we remove that last example which has a different size?
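
(To make steps 2 and 3 concrete, here is a rough sketch of the chunking; the function name and the drop_last_short_chunk flag are hypothetical, not what metadata_utils.py actually does.)

```python
def chunk_token_ids(input_ids, max_seq_len, drop_last_short_chunk=False):
    """Cut one long tokenized example into blocks of at most max_seq_len tokens."""
    chunks = [input_ids[i : i + max_seq_len] for i in range(0, len(input_ids), max_seq_len)]
    # Optionally drop the trailing block whose size differs from max_seq_len.
    if drop_last_short_chunk and chunks and len(chunks[-1]) < max_seq_len:
        chunks = chunks[:-1]
    return chunks
```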

I would therefore have the impression that the change you propose will reduce the number of examples in our training dataset. If that's the case, is that really what we want?

Then, regarding the max_seq_len for the "without_metadata" code: similarly to the script with metadata, what was done here was to tokenize the examples and then create several examples of the same size from them (equal in length to block_size this time).

I really agree that it would be much better to have a comparable metric! I totally share your view that the perplexities are not comparable between a dataset with metadata and one without (if the former does not take the metadata tokens into account in the loss calculation). Unfortunately, with a block_size chosen like this I'm afraid it's still not comparable, as the number of tokens associated with metadata changes for each example.
Also, as I mentioned to @timoschick yesterday, I think it would be great to also try to take local metadata into account in the loss, with a special token at the beginning of the sequence saying whether or not that metadata should be included. This framework would allow comparing perplexities calculated on sequences of the same length for local metadata. However, it would only cover that particular case, not the case where the metadata tokens are not taken into account in the loss.
Moreover, for global metadata I don't see a framework that could have a fixed length. So for now the only solution I can think of is to take baseline examples that correspond exactly to each example with metadata added, but without the metadata (so that we can compare perplexities calculated on sequences of the same length).
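
(As a rough sketch of what excluding metadata tokens from the loss could look like; the tensor shapes and the boolean metadata_mask convention are assumptions here, not the repo's actual loss_fn.)

```python
import torch.nn.functional as F

def masked_causal_lm_loss(logits, labels, metadata_mask):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len);
    # metadata_mask: (batch, seq_len) bool tensor, True where a token is metadata.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    keep = ~metadata_mask[:, 1:]  # only score predictions of non-metadata targets
    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    token_loss = token_loss * keep.reshape(-1).float()
    return token_loss.sum() / keep.sum().clamp(min=1)  # perplexity = exp(this loss)
```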

I look forward to reading your thoughts on this! 🤗

@tianjianjiang tianjianjiang marked this pull request as draft September 9, 2021 14:51
@tianjianjiang (Collaborator, Author) commented Sep 9, 2021

I think I am missing something in these changes. I'll try to explain how I see things, and if I'm wrong somewhere I'd be very happy to hear from you! In any case, we're writing research code, so I think we should not hesitate to experiment in small iterations that let us identify problems and new topics: and you're raising some interesting issues here!

I was also debating with myself whether to do it this time or make another PR. So for now I changed this one to a draft. Hopefully we can have it done once and for all (not really all but you know what I mean...).

(Pardon me that I'm gonna skip quoting the whole thing this time.)

If I understand the code and your point correctly, I believe I share a pretty similar concern. The truncation this PR does so far is merely a WIP-style checkpoint (which I should have clarified beforehand, my apologies). Admittedly, when there is already a series of procedures that cuts pieces and groups blocks, truncating before it feels redundant and even disruptive.

The main reason I want to try truncation is that, when splitting concatenated examples into blocks/chunks, I believe the blocks that were not the heads of the original examples may behave differently for an autoregressive LM, and then the perplexity may be less intuitive (to me).

For example, assuming a super long original example says "wubba lubba dub dub!", concat-and-split would produce two blocks, "wubba lubba dub" and "dub!", while truncating in advance would produce just one block, "wubba lubba dub"; my guess is that an autoregressive LM may prefer the second case.
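
(A tiny illustration with made-up pseudo-tokens and a made-up block size, just to show the difference.)

```python
tokens = ["wubba", "lubba", "dub", "dub", "!"]  # stand-ins for token ids
block_size = 3

# concat-and-split keeps every block:
blocks = [tokens[i : i + block_size] for i in range(0, len(tokens), block_size)]
# -> [["wubba", "lubba", "dub"], ["dub", "!"]]

# truncating in advance keeps only the head block:
head_only = tokens[:block_size]
# -> ["wubba", "lubba", "dub"]
```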

In a way it is exactly like what you mentioned ("remove that last example which has a different size"), but I may go even further and drop all but the first (regardless of the sizes), because I don't know whether the second through last blocks make sense to this kind of LM. This whole speculation is probably what bothers/interests me the most.

On the bright side, perhaps we can have both configurations as different control/experiment groups? (But I can't shake the feeling that somebody must have done this already...)

So, truncation before concat-and-split may render the latter a virtual no-op, which is actually the trial-and-error I want to have. In this sense, you are absolutely right about the decreased number of examples, and yet I am not confident enough to recommend this trade-off.

As for comparability in general, I am thinking that maybe we can have another kind of control group that prepends noise of the same length as the corresponding metadata.

🍀

@timoschick (Contributor) commented:

From my point of view, this is what has been happening so far, for an example that was handled by add_metadata_and_chunk_examples: [...]

Yes, that's exactly what happens with the current code.

Shouldn't we remove that last example which has a different size?

I don't have a strong opinion here. Do any of you know how this is typically done? What would be the downside of having an example with a different size (other than being less memory-efficient, because we throw away all computations done on the padding tokens)?

In a way it is exactly like what you mentioned ("remove that last example which has a different size"), but I may go even further and drop all but the first (regardless of the sizes).

Whether this is a reasonable thing to do strongly depends on what our examples look like (i.e., how many tokens a single example contains). In your toy example (the "wubba lubba dub dub!" one), I agree that it's probably best not to train on the second (very short) block at all. However, we will probably have much longer examples in practice - let's say an entire paragraph from Wikipedia. Throwing away everything but the first 512 (or 1,024) tokens from this paragraph would drastically reduce the number of examples in our training dataset; we'd be throwing away lots of training examples that the model could learn from. Or did I misunderstand something about your proposal?

@tianjianjiang (Collaborator, Author) commented Sep 9, 2021

However, we will probably have much longer examples in practice - let's say an entire paragraph from Wikipedia. Throwing away everything but the first 512 (or 1,024) tokens from this paragraph would drastically reduce the number of examples in our training dataset; we'd be throwing away lots of training examples that the model could learn from. Or did I misunderstand something about your proposal?

I think I understand the concern here. My toy example may be too simple to convey my feeling about the second through last sub-examples, which may or may not be useful to an autoregressive LM. When we have long paragraphs from mC4, for instance, I imagine some of those chunked sub-examples may be sequences that start in the middle of something. Honestly, I don't really know whether they are just harmless (if not useful) abstract n-grams to the LM or broken (left-side) context that could have some negative impact.

(I don't have much experience in this situation because my tasks usually presume that we do sentence segmentation first.)

@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch from 029ed13 to 63c926e Compare September 17, 2021 10:52
@tianjianjiang tianjianjiang force-pushed the perf-collator_with_padding_for_tpu branch from 63c926e to c3c45f4 Compare January 21, 2022 16:01
* master: (141 commits)
  build: bump nltk to 3.6.7 for security and performance (bigscience-workshop#130)
  build: bump nltk to 3.6.7 for security and performance (#5)
  Add fp16, multi-GPU training script (toy dataset) (bigscience-workshop#123)
  create dataset with html, timestamp, url, datasource, generation length and website description metadata and tittles, footers and headers from HTML (bigscience-workshop#119)
  remove `#SBATCH --gres=gpu:0 ` from `03_create_dataset.slurm` (bigscience-workshop#121)
  Add joint training slurm script (bigscience-workshop#111)
  Add features types for the metadata to extract and test multiprocessing (bigscience-workshop#118)
  feat: add a feature to choose where to extract metadata (bigscience-workshop#116)
  Use dateutil to parse date (bigscience-workshop#117)
  feat: change how the entity extraction process use ids (bigscience-workshop#115)
  add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet (bigscience-workshop#106)
  delete old submodule
  delete ds_store
  style check
  style & quality
  imports
  handle IndexError for `wikipedia_desc_utils` (bigscience-workshop#102)
  handle the comment specific type not recognized by pyarrow (bigscience-workshop#83)
  quality check
  Change torch version + make it optional (bigscience-workshop#82)
  ...

# Conflicts:
#	bsmetadata/metadata_utils.py