README update
jettjaniak committed May 25, 2024
1 parent 517e577 commit de4555e
Showing 3 changed files with 69 additions and 27 deletions.
84 changes: 63 additions & 21 deletions README.md
@@ -24,9 +24,13 @@ export WANDB_API_KEY=...

If you want to train a small and efficient model on a narrow dataset, then we recommend using a custom tokenizer with a small vocabulary. To train a reversible, GPT2-style, BPE tokenizer you can use `scripts/train_tokenizer.py`.

Script usage:

```
> scripts/train_tokenizer.py --help
usage: train_tokenizer.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT
--vocab-size VOCAB_SIZE
[--out-dir OUT_DIR] [--out-repo OUT_REPO]
Train a custom, reversible, BPE tokenizer (GPT2-like). You need to provide --out-repo or --out-dir.
@@ -44,18 +44,18 @@ options:
--out-repo OUT_REPO HF repo id to upload the resulting tokenizer
```

Here's how we trained the tokenizer for our `stories-*` suite of models. Please note that you can use single letter abbreviations for most arguments.

```
> scripts/train_tokenizer.py \
--in-dataset delphi-suite/stories \
--feature story \
--split train \
--vocab-size 4096 \
--out-repo delphi-suite/stories-tokenizer
```

We use the dataset's only feature, named `story`, in the `train` split of [delphi-suite/stories](https://huggingface.co/datasets/delphi-suite/stories). We train a tokenizer with a vocabulary of 4096 tokens and upload it to the HF model repo [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer).
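
Once uploaded, the tokenizer loads like any other HuggingFace tokenizer. As a sketch, the reversibility property can be checked with a round trip (this assumes, as with GPT2-style tokenizers, that `encode` adds no special tokens):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("delphi-suite/stories-tokenizer")

text = "Once upon a time, there was a little girl."
ids = tokenizer.encode(text)
# a reversible tokenizer reconstructs the input exactly, whitespace included
assert tokenizer.decode(ids) == text
```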


# Tokenizing a dataset
@@ -73,10 +73,13 @@ Then this is divided into chunks, and the `<bos>` token is inserted at the beginning of each chunk.
```
It will produce sequences of the specified size, discarding the last chunk if it's too short. We don't use padding.
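
A minimal sketch of this chunking scheme (illustrative only, not the script's actual code; it assumes each output sequence is `seq_len` tokens long including the leading `<bos>`):

```
def chunk_tokens(tokens: list[int], seq_len: int, bos: int) -> list[list[int]]:
    body = seq_len - 1  # room left in each chunk after the leading <bos>
    sequences = []
    for i in range(0, len(tokens), body):
        chunk = tokens[i : i + body]
        if len(chunk) < body:  # last chunk too short: discard it, don't pad
            break
        sequences.append([bos] + chunk)
    return sequences
```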

Script usage:

```
> scripts/tokenize_dataset.py --help
usage: tokenize_dataset.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT
--tokenizer TOKENIZER --seq-len SEQ_LEN
[--batch-size BATCH_SIZE] [--chunk-size CHUNK_SIZE]
[--out-dir OUT_DIR] [--out-repo OUT_REPO]
Tokenize a text dataset using a specific tokenizer
@@ -101,26 +108,61 @@ options:
--out-repo OUT_REPO HF repo id to upload the resulting dataset
```

Here's how we tokenized the dataset for our `stories-*` suite of models. Please note that you can use single letter abbreviations for most arguments.

For the `train` split:
```
> scripts/tokenize_dataset.py \
--in-dataset delphi-suite/stories \
--feature story \
--split train \
--tokenizer delphi-suite/stories-tokenizer \
--seq-len 512 \
    --out-repo delphi-suite/stories-tokenized
```
For the `validation` split, with repeated arguments omitted:
```
> scripts/tokenize_dataset.py \
...
--split validation \
...
```

The input dataset is the same as in the tokenizer training example above. We tokenize it with our custom [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer) into sequences of length 512. We upload it to the HF dataset repo [delphi-suite/stories-tokenized](https://huggingface.co/datasets/delphi-suite/stories-tokenized).

Please note that you can use any HuggingFace tokenizer; you don't need to train a custom one.
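
As a quick sanity check, the tokenized dataset can be loaded and a sequence decoded back to text (a sketch; the feature name `tokens` is an assumption, check the dataset repo for the actual one):

```
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("delphi-suite/stories-tokenized", split="train")
tokenizer = AutoTokenizer.from_pretrained("delphi-suite/stories-tokenizer")

seq = ds[0]["tokens"]  # feature name is an assumption, not from this README
print(len(seq))  # 512
print(tokenizer.decode(seq))
```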

# Training a model

To train a model, you'll need to create a config file. For examples see `configs/`, and for field descriptions see `delphi/train/config/training_config.py`. The training script is located in `scripts/train_model.py`.
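
For orientation, here is a hypothetical minimal config. Every field below appears elsewhere in this commit, but this is a sketch, not one of the actual files in `configs/`:

```
{
    "run_name": "stories-example",
    "out_repo": "",
    "wandb": "",
    "log_interval": 1,
    "eval_iters": 100,
    "gradient_accumulation_steps": 1
}
```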

Script usage:

```
> scripts/train_model.py --help
usage: train_model.py [-h] [--overrides [OVERRIDES ...]] [-v | -s] [config_files ...]

Train a delphi model

positional arguments:
  config_files          Path to json file(s) containing config values, e.g. 'primary_config.json secondary_config.json'.

options:
  -h, --help            show this help message and exit
  --overrides [OVERRIDES ...]
                        Override config values with space-separated declarations, e.g. `--overrides model_config.hidden_size=42 run_name=foo`
  -v, --verbose         Increase verbosity level, repeatable (e.g. -vvv). Mutually exclusive with --silent, --loglevel
  -s, --silent          Silence all logging. Mutually exclusive with --verbose, --loglevel
```

You can specify a primary and a secondary config, which is useful if you're training a suite of models that differ only in a few parameters. Additionally, you can override specific fields using the `--overrides` flag. If you don't want to push the model and its checkpoints to HF, you need to explicitly set `out_repo=""`. If you don't want to log to W&B, you need to set `wandb=""`.

Here is how we trained our `stories-mamba-100k` model:
```
> scripts/train_model.py \
configs/stories/mamba/base.json \
configs/stories/mamba/100k.json \
--overrides \
out_repo="delphi-suite/stories-mamba-100k" \
wandb="delphi-suite/delphi"
```
10 changes: 6 additions & 4 deletions delphi/train/config/training_config.py
@@ -30,13 +30,15 @@ class TrainingConfig:
# manually list iterations to save checkpoints on
extra_checkpoint_iters: list[int] = field(default_factory=list)

# log to the console every N iters; this doesn't control wandb logging which is done only on checkpoints
log_interval: int = 1

# FIXME: there is a bug in the current implementation, and eval loss is computed on the
# entire dataset. In this implementation, eval_iters controls the number of minibatches
# the dataset is split into for evaluation.
eval_iters: int = 100

# path to a checkpoint to resume from
resume_from_path: Optional[str] = None

# number of samples used to compute the gradient for a single optimizer step
@@ -51,7 +53,7 @@ class TrainingConfig:
# if > 1 reduces memory usage by computing gradient in microbatches
gradient_accumulation_steps: int = 1

# AdamW optimizer
adam: AdamConfig = field(default_factory=AdamConfig)

# seed used for pseudorandomly sampling data during training
2 changes: 0 additions & 2 deletions tests/train/config/test_config_utils.py
@@ -1,7 +1,5 @@
from typing import Optional

from delphi import TEST_CONFIGS_DIR
from delphi.train.config.utils import (
_unoptionalize,
