Version 0.2 #149

Merged · 25 commits · May 28, 2024
5 changes: 2 additions & 3 deletions .github/workflows/checks.yml
@@ -29,12 +29,11 @@ jobs:
with:
path: |
~/.cache/huggingface
key: ${{ runner.os }}-huggingface-cache-v1 # increment this key to invalidate the cache when new models/datasets are added
key: ${{ runner.os }}-hf-cache-v0.2 # increment this key to invalidate the cache when new models/datasets are added
- name: dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements-nocuda.txt
pip install -e .
pip install -e .[dev,notebooks]
- name: black
run: black --check .
- name: isort
237 changes: 167 additions & 70 deletions README.md
@@ -1,99 +1,196 @@
# Delphi
# delphi

Interpreting Small Language Models Across Time and Scale
delphi is a set of tools for standardized and (mostly) reproducible training of small language models. You can use delphi to train a custom tokenizer, tokenize your dataset, and train your model. We build on top of HuggingFace, supporting every `CausalLM` architecture. Datasets, tokenizers and models (including checkpoints!) can be downloaded from and uploaded to HuggingFace automatically, with no need to manage local files.

# Training Models
See [`scripts/run_training.py`](scripts/run_training.py):
```bash
./scripts/run_training.py --config_file /path/to/my/training/config.json
```

# Setup

1. Clone the repo
```shell
git clone https://github.com/delphi-suite/delphi.git
cd delphi
```
2. Make & activate python >= 3.10 virtual env
```shell
python3.10 -m venv .venv
source .venv/bin/activate
```
3. Install the project in editable state
`pip install -e .`
See `[project.optional-dependencies]` section in `pyproject.toml` for additional dependencies, e.g. you may want to `pip install -e ."[dev,mamba_cuda]"`
4. Get your HuggingFace and W&B tokens and set them as environment variables
```shell
export HF_TOKEN=...
export WANDB_API_KEY=...
```

See [`scripts/sample_config.json`](scripts/sample_config.json) for an example of a training run json.

# Training a tokenizer

## Features
### Uploading to HuggingFace
With `huggingface.push_checkpoints_to_hub` set to `True`, the model and all associated
training run data will be uploaded to the HuggingFace repo specified by `huggingface.repo_id`
at every checkpoint. Each upload goes into a new folder named after the current iteration (e.g. `iter_1`).
### Resuming model training
With `init_from` set to `'resume'`, training will resume from `output_dir`.
### Deterministic, Reproducible* Training
Delphi aims to be deterministic and as reproducible as possible. However, there is one major caveat: hardware. CUDA algorithms are not always numerically identical to their CPU counterparts. We record the hardware device type each training run uses,
to enable reproduction *given the same class of hardware*.
### Different Model Architectures
`model_config.model_type` selects the architecture; at the time of writing, the supported values are `'llama2'` and `'mamba'`. The config for the selected model type should
be in `model_config.<model_type>` (e.g. `model_config.llama2`) and correspond to the
arguments for that model type. See [`model_types.py`](src/delphi/train/config/models/model_types.py).
### Weights and Biases Integration
If you want to train a small and efficient model on a narrow dataset, then we recommend using a custom tokenizer with a small vocabulary. To train a reversible, GPT2-style, BPE tokenizer you can use `scripts/train_tokenizer.py`.

Script usage:

# Analyzing Models
TODO
```
> scripts/train_tokenizer.py --help
usage: train_tokenizer.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT
--vocab-size VOCAB_SIZE
[--out-dir OUT_DIR] [--out-repo OUT_REPO]

Train a custom, reversible, BPE tokenizer (GPT2-like). You need to provide --out-repo or --out-dir.

options:
-h, --help show this help message and exit
--in-dataset IN_DATASET, -i IN_DATASET
Dataset you want to train the tokenizer on. Local path or HF repo id
--feature FEATURE, -f FEATURE
Name of the feature (column) containing text documents in the input dataset
--split SPLIT, -s SPLIT
Split of the dataset to be used for tokenizer training, supports slicing like 'train[:10%]'
--vocab-size VOCAB_SIZE, -v VOCAB_SIZE
Vocabulary size of the tokenizer
--out-dir OUT_DIR Local directory to save the resulting tokenizer
--out-repo OUT_REPO HF repo id to upload the resulting tokenizer
```

# Development
Here's how we trained the tokenizer for our `stories-*` suite of models. Please note that you can use single-letter abbreviations for most arguments.

```
> scripts/train_tokenizer.py \
--in-dataset delphi-suite/stories \
--feature story \
--split train \
--vocab-size 4096 \
--out-repo delphi-suite/stories-tokenizer
```

We use the dataset's only feature, `story`, in the `train` split of [delphi-suite/stories](https://huggingface.co/datasets/delphi-suite/stories). We train a tokenizer with a vocabulary of 4096 tokens and upload it to the HF model repo [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer).
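
Once uploaded, the tokenizer can be loaded like any other HuggingFace tokenizer. A minimal sketch, assuming the `transformers` package is installed and the repo above is public:

```python
from transformers import AutoTokenizer

# Load the tokenizer trained above directly from the HF Hub.
tokenizer = AutoTokenizer.from_pretrained("delphi-suite/stories-tokenizer")

ids = tokenizer("Once upon a time there was a tiny model.")["input_ids"]
print(ids)                    # token ids under the 4096-token vocabulary
print(tokenizer.decode(ids))  # reversible: decodes back to the original text
```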

## Setup

1. Clone this repo and submodules: `git clone https://github.com/delphi-suite/delphi.git --recurse-submodules`
2. make python 3.10 virtual env in `.venv`
3. install dependencies `pip install -r requirements.txt`
4. install the project in editable state `pip install -e .`
5. run tests `pytest`
# Tokenizing a dataset

### Submodule Setup
If you cloned without `--recurse-submodules`, you can still install the submodules later with:
```bash
git submodule init
git submodule update
```
To turn a collection of text documents into sequences of tokens required for model training, you can use `scripts/tokenize_dataset.py`. All documents are tokenized and concatenated, with the `<eos>` token as a separator, e.g.
```
doc1_tok1, doc1_tok2, ..., doc1_tokX, <eos>, doc2_tok1, doc2_tok2, ..., doc2_tokX, <eos>, doc3_tok1, ...
```
Then the stream is divided into chunks, and the `<bos>` token is inserted at the beginning of each chunk, e.g.
```
<bos> doc1_tok1, doc1_tok2, ..., doc1_tokX, <eos>, doc2_tok1
<bos> doc2_tok2, ..., doc2_tok511
<bos> doc2_tok512, doc2_tok513, ..., doc2_tokX <eos>, doc3_tok1, ...
...
```
The script produces sequences of the specified length, discarding the last chunk if it's too short. We don't use padding.
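
For intuition, here's a minimal Python sketch of the concatenate-and-chunk scheme described above. The names (`chunk_documents`, `bos_id`, `eos_id`, `seq_len`) are illustrative, not the actual internals of `scripts/tokenize_dataset.py`:

```python
def chunk_documents(docs_tokens, bos_id, eos_id, seq_len):
    """Illustrative sketch: concatenate tokenized documents with <eos>
    separators, then cut the stream into sequences of length seq_len,
    each starting with <bos>. The trailing short chunk is discarded."""
    stream = []
    for doc in docs_tokens:
        stream.extend(doc)
        stream.append(eos_id)

    step = seq_len - 1  # each sequence is <bos> plus seq_len - 1 stream tokens
    sequences = []
    for start in range(0, len(stream), step):
        chunk = stream[start : start + step]
        if len(chunk) < step:  # too short -> discard, we don't pad
            break
        sequences.append([bos_id] + chunk)
    return sequences
```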

## Formatting
Script usage:

We're using black & isort to format the code. To make sure your changes adhere to the rules:
```
> scripts/tokenize_dataset.py --help
usage: tokenize_dataset.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT
--tokenizer TOKENIZER --seq-len SEQ_LEN
[--batch-size BATCH_SIZE] [--chunk-size CHUNK_SIZE]
[--out-dir OUT_DIR] [--out-repo OUT_REPO]

Tokenize a text dataset using a specific tokenizer

options:
-h, --help show this help message and exit
--in-dataset IN_DATASET, -i IN_DATASET
Dataset you want to tokenize. Local path or HF repo id
--feature FEATURE, -f FEATURE
Name of the feature (column) containing text documents in the input dataset
--split SPLIT, -s SPLIT
Split of the dataset to be tokenized, supports slicing like 'train[:10%]'
--tokenizer TOKENIZER, -t TOKENIZER
HF repo id or local directory containing the tokenizer
--seq-len SEQ_LEN, -l SEQ_LEN
Length of the tokenized sequences
--batch-size BATCH_SIZE, -b BATCH_SIZE
How many text documents to tokenize at once (default: 50)
--chunk-size CHUNK_SIZE, -c CHUNK_SIZE
Maximum number of tokenized sequences in a single parquet file (default: 200_000)
--out-dir OUT_DIR Local directory to save the resulting dataset
--out-repo OUT_REPO HF repo id to upload the resulting dataset
```

1. follow setup instructions above
2. install pre-commit `pre-commit install`
3. install recommended vscode extensions
Here's how we tokenized the dataset for our `stories-*` suite of models. Please note that you can use single-letter abbreviations for most arguments.

When you save a file, vscode should automatically format it. Otherwise, pre-commit will do that, but you will need to add the changes and commit again.
For the `train` split:
```
> scripts/tokenize_dataset.py \
--in-dataset delphi-suite/stories \
--feature story \
--split train \
--tokenizer delphi-suite/stories-tokenizer \
--seq-len 512 \
--out-repo delphi-suite/stories-tokenized
```
For the `validation` split (repeated arguments omitted):
```
> scripts/tokenize_dataset.py \
...
--split validation \
...
```

## Pull Requests

1. make a branch
- if it relates to an existing issue
- go to the issue page and click _Create a branch_ under _Development_
- if the default name is not very long, keep it; otherwise, make it shorter, but keep the issue number in the front
- otherwise pick a short but descriptive name, a few hyphen-separated-words
2. make your changes
- include unit tests
- update README if needed
- if new huggingface datasets/models are added to testing, increment the cache number in `.github/workflows/checks.yml`
3. make a pull request
- if it isn't ready for review yet, mark it as draft
- check if CI is passing
- if the change is big, try to keep the commit history clean using interactive rebase
- don't push more often than needed; we're running GitHub Actions on a free tier
- if there were any changes to the main branch, rebase on top of it
- explain the change
- provide short description; focus on things that were not mentioned in the relevant issue
- comment important sections of the code in _Files changed_ tab
- when it's ready, add the relevant stakeholders as reviewers
4. after the comments are resolved and PR is approved, merge it using _Squash and merge_

## Incrementing Versions
When making a new release, increment the version in `delphi/__init__.py`
The input dataset is the same as in the tokenizer training example above. We tokenize it with our custom [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer) into sequences of length 512. We upload it to the HF dataset repo [delphi-suite/stories-tokenized](https://huggingface.co/datasets/delphi-suite/stories-tokenized).
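
To sanity-check the result, the tokenized dataset can be loaded back with the `datasets` library. A minimal sketch, assuming the repo above is public:

```python
from datasets import load_dataset

# The dataset of fixed-length token sequences produced above.
ds = load_dataset("delphi-suite/stories-tokenized", split="train")
print(ds.column_names)  # the feature holding the token sequences
print(ds[0])            # one example: a sequence of 512 token ids
```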

Please note that you can use any HuggingFace tokenizer; you don't need to train a custom one.

# Training a model

To train a model, you'll need to create a config file. For examples see `configs/`, and for field descriptions see `delphi/train/config/training_config.py`. The training script is located in `scripts/train_model.py`.

Script usage:

```
> scripts/train_model.py --help
usage: train_model.py [-h] [--overrides [OVERRIDES ...]] [-v | -s] [config_files ...]

Train a delphi model

positional arguments:
config_files Path to json file(s) containing config values, e.g. 'primary_config.json secondary_config.json'.

options:
-h, --help show this help message and exit
--overrides [OVERRIDES ...]
Override config values with space-separated declarations. e.g. `--overrides model_config.hidden_size=42 run_name=foo`
-v, --verbose Increase verbosity level, repeatable (e.g. -vvv). Mutually exclusive with --silent, --loglevel
-s, --silent Silence all logging. Mutually exclusive with --verbose, --loglevel
```

You can specify a primary config and a secondary config, which is useful if you're training a suite of models that differ only in a few parameters. Additionally, you can override specific fields using the `--overrides` flag. If you don't want to push the model and its checkpoints to HF, you need to explicitly set `out_repo=""`. If you don't want to log to W&B, you need to set `wandb=""`. Please note that by default we save the optimizer state (2x model size) with every checkpoint.
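
For intuition, here's a minimal sketch of how a dotted override key can be applied to a nested config dict; the helper below is illustrative, not delphi's actual implementation:

```python
def apply_override(config: dict, dotted_key: str, value) -> None:
    """Set a nested config value from a dotted key,
    e.g. apply_override(cfg, "model_config.hidden_size", 42)."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

cfg = {"run_name": "base", "model_config": {"hidden_size": 768}}
apply_override(cfg, "model_config.hidden_size", 42)
apply_override(cfg, "run_name", "foo")
print(cfg)  # {'run_name': 'foo', 'model_config': {'hidden_size': 42}}
```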

Here is how we trained our `stories-mamba-100k` model:
```
> scripts/train_model.py \
configs/stories/mamba/base.json \
configs/stories/mamba/100k.json \
--overrides \
out_repo="delphi-suite/stories-mamba-100k" \
wandb="delphi-suite/delphi"
```

# Development

1. Install the `dev` and `notebooks` dependencies `pip install -e ."[dev,notebooks]"`.
2. Run the tests `pytest`.
3. Install pre-commit `pre-commit install`.
4. Install the recommended vscode extensions.

When you save a file, vscode should automatically format it. Otherwise, pre-commit will do that, but you will need to add the changes and commit again.

# Citation

If you use `delphi` in your research, please cite using the following
If you use delphi in your research, please cite it using the following:

```bibtex
@software{delphi,
title = {delphi: small language models training made easy},
author = {Jett Janiak, Jai Dhyani, Jannik Brinkmann, Gonçalo Paulo, Joshua Wendland, Víctor Abia Alonso, Siwei Li, Rai (Phan Anh Duong), Alice Rigg},
author = {Jett Janiak, Jai Dhyani, Jannik Brinkmann, Gonçalo Paulo, Joshua Wendland, Víctor Abia Alonso, Siwei Li, Phan Anh Duong, Alice Rigg},
year = 2024,
url = {https://github.com/delphi-suite/delphi},
license = {apache-2.0}
}
```
20 changes: 0 additions & 20 deletions configs/debug.json

This file was deleted.

52 changes: 0 additions & 52 deletions configs/sample_config.json

This file was deleted.

22 changes: 0 additions & 22 deletions configs/sample_mamba.json

This file was deleted.
