README: tokeniz*
jettjaniak committed May 24, 2024
1 parent 6329aa9 commit c180ca6
Showing 3 changed files with 110 additions and 3 deletions.
106 changes: 106 additions & 0 deletions README.md
@@ -18,3 +18,109 @@ See `[project.optional-dependencies]` section in `pyproject.toml` for additional
export HF_TOKEN=...
export WANDB_API_KEY=...
```


# Training a tokenizer

If you want to train a small and efficient model on a narrow dataset, we recommend using a custom tokenizer with a small vocabulary. To train a reversible, GPT2-style BPE tokenizer, you can use `scripts/train_tokenizer.py`.

```
> scripts/train_tokenizer.py --help
usage: train_tokenizer.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT --vocab-size VOCAB_SIZE [--out-dir OUT_DIR] [--out-repo OUT_REPO]
Train a custom, reversible, BPE tokenizer (GPT2-like). You need to provide --out-repo or --out-dir.
options:
-h, --help show this help message and exit
--in-dataset IN_DATASET, -i IN_DATASET
Dataset you want to train the tokenizer on. Local path or HF repo id
--feature FEATURE, -f FEATURE
Name of the feature (column) containing text documents in the input dataset
--split SPLIT, -s SPLIT
Split of the dataset to be used for tokenizer training, supports slicing like 'train[:10%]'
--vocab-size VOCAB_SIZE, -v VOCAB_SIZE
Vocabulary size of the tokenizer
--out-dir OUT_DIR Local directory to save the resulting tokenizer
--out-repo OUT_REPO HF repo id to upload the resulting tokenizer
```

Here's how we trained the tokenizer for our `stories-*` suite of models. Note that you can use single-letter abbreviations for most arguments.

```
> scripts/train_tokenizer.py \
--in-dataset delphi-suite/stories \
--feature story \
--split train \
--vocab-size 4096 \
--out-repo delphi-suite/stories-tokenizer
```

We use the `story` feature (the only one) in the `train` split of [delphi-suite/stories](https://huggingface.co/datasets/delphi-suite/stories). We train a tokenizer with a vocabulary of 4096 tokens and upload it to the HF model repo [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer).
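
If you want to sanity-check the result, the uploaded tokenizer can be loaded back with the `transformers` library. This is a minimal sketch, assuming the upload above succeeded and `transformers` is installed; a local `--out-dir` path works the same way.

```
from transformers import AutoTokenizer

# load the trained tokenizer from the HF Hub (or pass a local --out-dir path)
tokenizer = AutoTokenizer.from_pretrained("delphi-suite/stories-tokenizer")

print(len(tokenizer))  # total vocabulary size, should correspond to --vocab-size 4096
print(tokenizer.tokenize("Once upon a time there was a tiny robot."))
```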


# Tokenizing a dataset

To turn a collection of text documents into the sequences of tokens required for model training, you can use `scripts/tokenize_dataset.py`. All documents are tokenized and concatenated, with the `<eos>` token as a separator, e.g.
```
doc1_tok1, doc1_tok2, ..., doc1_tokX, <eos>, doc2_tok1, doc2_tok2, ..., doc2_tokX, <eos>, doc3_tok1, ...
```
Then this is divided into chunks, and the `<bos>` token is inserted at the beginning of each chunk, e.g.
```
<bos> doc1_tok1, doc1_tok2, ..., doc1_tokX, <eos>, doc2_tok1
<bos> doc2_tok2, ..., doc2_tok511
<bos> doc2_tok512, doc2_tok513, ..., doc2_tokX <eos>, doc3_tok1, ...
...
```
This produces sequences of the specified length; the last chunk is discarded if it's too short. We don't use padding.
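
The sketch below illustrates this concatenate-and-chunk logic in plain Python. It is an illustrative approximation, not the script's actual implementation; `bos_id`, `eos_id`, and the exact handling of the chunk boundary are assumptions.

```
def chunk_tokenized_docs(docs_tokens, bos_id, eos_id, seq_len):
    # concatenate all documents into one token stream, with <eos> after each document
    stream = []
    for doc in docs_tokens:
        stream.extend(doc)
        stream.append(eos_id)

    # cut the stream into pieces of seq_len - 1 tokens and prepend <bos> to each,
    # so every resulting sequence holds exactly seq_len tokens
    chunks = []
    step = seq_len - 1
    for i in range(0, len(stream), step):
        body = stream[i : i + step]
        if len(body) < step:
            break  # drop the final short chunk instead of padding it
        chunks.append([bos_id] + body)
    return chunks
```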


```
> scripts/tokenize_dataset.py --help
usage: tokenize_dataset.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT --tokenizer TOKENIZER --seq-len SEQ_LEN [--batch-size BATCH_SIZE] [--chunk-size CHUNK_SIZE]
[--out-dir OUT_DIR] [--out-repo OUT_REPO]
Tokenize a text dataset using a specific tokenizer
options:
-h, --help show this help message and exit
--in-dataset IN_DATASET, -i IN_DATASET
Dataset you want to tokenize. Local path or HF repo id
--feature FEATURE, -f FEATURE
Name of the feature (column) containing text documents in the input dataset
--split SPLIT, -s SPLIT
Split of the dataset to be tokenized, supports slicing like 'train[:10%]'
--tokenizer TOKENIZER, -t TOKENIZER
HF repo id or local directory containing the tokenizer
--seq-len SEQ_LEN, -l SEQ_LEN
Length of the tokenized sequences
--batch-size BATCH_SIZE, -b BATCH_SIZE
How many text documents to tokenize at once (default: 50)
--chunk-size CHUNK_SIZE, -c CHUNK_SIZE
Maximum number of tokenized sequences in a single parquet file (default: 200_000)
--out-dir OUT_DIR Local directory to save the resulting dataset
--out-repo OUT_REPO HF repo id to upload the resulting dataset
```

Here's how we tokenized the dataset for our `stories-*` suite of models. Note that you can use single-letter abbreviations for most arguments.

For `train` split:
```
> scripts/tokenize_dataset.py \
--in-dataset delphi-suite/stories \
--feature story \
--split train \
--tokenizer delphi-suite/stories-tokenizer \
--seq-len 512 \
--out-repo delphi-suite/stories-tokenized
```
For the `validation` split (repeated arguments omitted):
```
> scripts/tokenize_dataset.py \
...
--split validation \
...
```

The input dataset is the same as in the tokenizer training example above. We tokenize it with our custom [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer) into sequences of length 512 and upload the result to the HF dataset repo [delphi-suite/stories-tokenized](https://huggingface.co/datasets/delphi-suite/stories-tokenized).
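
To consume the tokenized sequences in training code, the dataset can be loaded straight from the Hub with the `datasets` library. This is a minimal sketch; the name of the column holding the token ids is whatever the script produced, so inspect the features rather than assuming it.

```
from datasets import load_dataset

# download (or stream) the tokenized train split from the HF Hub
tokenized = load_dataset("delphi-suite/stories-tokenized", split="train")

print(tokenized.features)  # inspect the column holding the token ids
print(tokenized[0])        # each row should contain exactly --seq-len (512) token ids
```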

Note that you can use any HuggingFace tokenizer; you don't need to train a custom one.
2 changes: 1 addition & 1 deletion scripts/tokenize_dataset.py
@@ -13,7 +13,7 @@

 if __name__ == "__main__":
     parser = argparse.ArgumentParser(
-        description="Tokenize a text dataset using a specific tokenizer",
+        description="Tokenize a text dataset using a specified tokenizer",
         allow_abbrev=False,
     )
5 changes: 3 additions & 2 deletions scripts/train_tokenizer.py
@@ -30,7 +30,8 @@ def train_byte_level_bpe(

 if __name__ == "__main__":
     parser = argparse.ArgumentParser(
-        description="Train a BPE tokenizer on a given dataset", allow_abbrev=False
+        description="Train a custom, reversible, BPE tokenizer (GPT2-like). You need to provide --out-repo or --out-dir.",
+        allow_abbrev=False,
     )

     parser.add_argument(
@@ -74,7 +75,7 @@ def train_byte_level_bpe(
         help="HF repo id to upload the resulting tokenizer",
     )
     args = parser.parse_args()
-    assert args.out_repo or args.out_dir, "You need to provide out_repo or out_dir"
+    assert args.out_repo or args.out_dir, "You need to provide --out-repo or --out-dir"

     in_dataset_split = utils.load_dataset_split_string_feature(
         args.in_dataset, args.split, args.feature
