Version 0.2 (#149)
* final evaluation notebook (#98)

* model specify + nextlogprobs load + tok specify

* Draft version of plotting

* get prompt examples

* add meeting todos

* grouped imports and cells

* redid typing and fixed bugs

* unique token check

* Remove token category in calculation function

* Remove token category from vis function

* removed token_group in token diff + calculate all loss

* vis_pos_map highlight + optimization

* fixes: tokenization, mask, typing

* use interact_manual for resampling

* update quantile function tests

* small update

* beartype fix

* rm comment

* var rename

* eval notebook updates

---------

Co-authored-by: Siwei Li <[email protected]>
Co-authored-by: VICTOR ABIA <[email protected]>
Co-authored-by: Jett <[email protected]>
Co-authored-by: JaiDhyani <[email protected]>
Co-authored-by: Jai <[email protected]>

* remove stale notebooks, token map & labelling

* requirements revamp

* stale eval code purge

* remove constants

* use util in tokenize_dataset

* use load_dataset_split_* utils

* simpler structure: remove eval & dataset dirs

* src/delphi -> delphi

* remove stale test configs

* update HF cache version

* add notebooks deps to gh actions

* remove stale configs

* typo fix in train_tokenizer

* platformdirs dependency

* cosmetic changes in configs & args

* create HF repo once, outside try/except

* README: # Setup

* README: tokeniz*

* config help as comments

* dataset.name -> dataset.path

* README update

* README update

* validate_configs: overrides, init model

* README: out-repo typo

---------

Co-authored-by: Rai <[email protected]>
Co-authored-by: Siwei Li <[email protected]>
Co-authored-by: VICTOR ABIA <[email protected]>
Co-authored-by: JaiDhyani <[email protected]>
Co-authored-by: Jai <[email protected]>
6 people authored May 28, 2024
1 parent 8670bb8 commit af79e9e
Showing 80 changed files with 1,479 additions and 3,675 deletions.
5 changes: 2 additions & 3 deletions .github/workflows/checks.yml
@@ -29,12 +29,11 @@ jobs:
with:
path: |
~/.cache/huggingface
key: ${{ runner.os }}-huggingface-cache-v1 # increment this key to invalidate the cache when new models/datasets are added
key: ${{ runner.os }}-hf-cache-v0.2 # increment this key to invalidate the cache when new models/datasets are added
- name: dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements-nocuda.txt
pip install -e .
pip install -e .[dev,notebooks]
- name: black
run: black --check .
- name: isort
237 changes: 167 additions & 70 deletions README.md
@@ -1,99 +1,196 @@
# Delphi
# delphi

Interpreting Small Language Models Across Time and Scale
delphi is a set of tools for standardized and (mostly) reproducible training of small language models. You can use delphi to train a custom tokenizer, tokenize your dataset, and train your model. We build on top of HuggingFace, supporting every `CausalLM` architecture. Datasets, tokenizers and models (including checkpoints!) can be downloaded from and uploaded to HuggingFace automatically, with no need to manage local files.

# Training Models
See [`scripts/run_training.py`](scripts/run_training.py):
```bash
./scripts/run_training.py --config_file /path/to/my/training/config.json
```

# Setup

1. Clone the repo
```shell
git clone https://github.com/delphi-suite/delphi.git
cd delphi
```
2. Make & activate a Python >= 3.10 virtual env
```shell
python3.10 -m venv .venv
source .venv/bin/activate
```
3. Install the project in editable mode
`pip install -e .`
See `[project.optional-dependencies]` section in `pyproject.toml` for additional dependencies, e.g. you may want to `pip install -e ."[dev,mamba_cuda]"`
4. Get your HuggingFace and W&B tokens and export them as environment variables
```shell
export HF_TOKEN=...
export WANDB_API_KEY=...
```

See [`scripts/sample_config.json`](scripts/sample_config.json) for an example of a training run json.

# Training a tokenizer

## Features
### Uploading to HuggingFace
With `huggingface.push_checkpoints_to_hub` set to `True`, the model and all associated
training run data will be uploaded to the HuggingFace repo specified by `huggingface.repo_id`
at every checkpoint. Every upload goes into a new folder named after the current iteration (e.g. `iter_1`).
### Resuming model training
With `init_from` set to `'resume'`, training will resume from `output_dir`.
### Deterministic, Reproducible* Training
Delphi aims to be deterministic and as reproducible as possible. However, there is one major caveat: hardware. CUDA algorithms are not always 100% isomorphic to CPU algorithms. We do record the hardware device type each training run uses,
to enable reproduction *given the same class of hardware*.
### Different Model Architectures
`model_config.model_type` can specify currently supported architectures. At time of writing, these are `'llama2'` and `'mamba'`. Config for the selected model type should
be in `model_config.<model_type>` (e.g. `model_config.llama2`) and correspond to the
arguments for that model type. See [`model_types.py`](src/delphi/train/config/models/model_types.py)
### Weights and Biases Integration
If you want to train a small and efficient model on a narrow dataset, then we recommend using a custom tokenizer with a small vocabulary. To train a reversible, GPT2-style, BPE tokenizer you can use `scripts/train_tokenizer.py`.

Script usage:

# Analyzing Models
TODO
```
> scripts/train_tokenizer.py --help
usage: train_tokenizer.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT
--vocab-size VOCAB_SIZE
[--out-dir OUT_DIR] [--out-repo OUT_REPO]
Train a custom, reversible, BPE tokenizer (GPT2-like). You need to provide --out-repo or --out-dir.
options:
-h, --help show this help message and exit
--in-dataset IN_DATASET, -i IN_DATASET
Dataset you want to train the tokenizer on. Local path or HF repo id
--feature FEATURE, -f FEATURE
Name of the feature (column) containing text documents in the input dataset
--split SPLIT, -s SPLIT
Split of the dataset to be used for tokenizer training, supports slicing like 'train[:10%]'
--vocab-size VOCAB_SIZE, -v VOCAB_SIZE
Vocabulary size of the tokenizer
--out-dir OUT_DIR Local directory to save the resulting tokenizer
--out-repo OUT_REPO HF repo id to upload the resulting tokenizer
```

# Development
Here's how we trained the tokenizer for our `stories-*` suite of models. Please note that you can use single-letter abbreviations for most arguments.

```
> scripts/train_tokenizer.py \
--in-dataset delphi-suite/stories \
--feature story \
--split train \
--vocab-size 4096 \
--out-repo delphi-suite/stories-tokenizer
```

We use the only feature, named `story`, in the `train` split of [delphi-suite/stories](https://huggingface.co/datasets/delphi-suite/stories). We train a tokenizer with a vocabulary of 4096 tokens and upload it to the HF model repo [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer).
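
The resulting tokenizer is a regular HuggingFace tokenizer, so it can presumably be loaded back with `transformers` like any other; a minimal sketch, assuming the repo is public and `transformers` is installed:

```python
# Minimal sketch, not part of delphi itself: load the tokenizer trained above
# from the HF Hub and tokenize a sample string.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("delphi-suite/stories-tokenizer")
print(tokenizer("Once upon a time")["input_ids"])
```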

## Setup

1. Clone this repo and submodules: `git clone https://github.com/delphi-suite/delphi.git --recurse-submodules`
2. make python 3.10 virtual env in `.venv`
3. install dependencies `pip install -r requirements.txt`
4. install the project in editable state `pip install -e .`
5. run tests `pytest`
# Tokenizing a dataset

### Submodule Setup
If you cloned without `--recurse-submodules`, you can still install the submodules later with:
```bash
git submodule init
git submodule update
```
To turn a collection of text documents into sequences of tokens required for model training, you can use `scripts/tokenize_dataset.py`. All documents are tokenized and concatenated, with the `<eos>` token as a separator, e.g.
```
doc1_tok1, doc1_tok2, ..., doc1_tokX, <eos>, doc2_tok1, doc2_tok2, ..., doc2_tokX, <eos>, doc3_tok1, ...
```
Then this is divided into chunks, and the `<bos>` token is inserted at the beginning of each chunk, e.g.
```
<bos> doc1_tok1, doc1_tok2, ..., doc1_tokX, <eos>, doc2_tok1
<bos> doc2_tok2, ..., doc2_tok511
<bos> doc2_tok512, doc2_tok513, ..., doc2_tokX <eos>, doc3_tok1, ...
...
```
It produces sequences of the specified length, discarding the last chunk if it's too short. We don't use padding.
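
The concatenate-and-chunk scheme above can be sketched in a few lines of Python. This is only an illustration of the description, not delphi's actual implementation, and the function name is hypothetical:

```python
# Illustrative sketch of the concatenate-and-chunk scheme described above;
# not delphi's actual implementation. `docs` holds already-tokenized documents
# (lists of token ids), `bos`/`eos` are token ids.
def chunk_documents(
    docs: list[list[int]], seq_len: int, bos: int, eos: int
) -> list[list[int]]:
    # concatenate all documents into one stream, separated by <eos>
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos)
    # cut the stream into pieces and prepend <bos> to each;
    # the final piece is dropped if it's too short (no padding)
    chunks: list[list[int]] = []
    for i in range(0, len(stream), seq_len - 1):
        piece = stream[i : i + seq_len - 1]
        if len(piece) == seq_len - 1:
            chunks.append([bos] + piece)
    return chunks
```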

## Formatting
Script usage:

We're using black & isort to format the code. To make sure your changes adhere to the rules:
```
> scripts/tokenize_dataset.py --help
usage: tokenize_dataset.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT
--tokenizer TOKENIZER --seq-len SEQ_LEN
[--batch-size BATCH_SIZE] [--chunk-size CHUNK_SIZE]
[--out-dir OUT_DIR] [--out-repo OUT_REPO]
Tokenize a text dataset using a specific tokenizer
options:
-h, --help show this help message and exit
--in-dataset IN_DATASET, -i IN_DATASET
Dataset you want to tokenize. Local path or HF repo id
--feature FEATURE, -f FEATURE
Name of the feature (column) containing text documents in the input dataset
--split SPLIT, -s SPLIT
Split of the dataset to be tokenized, supports slicing like 'train[:10%]'
--tokenizer TOKENIZER, -t TOKENIZER
HF repo id or local directory containing the tokenizer
--seq-len SEQ_LEN, -l SEQ_LEN
Length of the tokenized sequences
--batch-size BATCH_SIZE, -b BATCH_SIZE
How many text documents to tokenize at once (default: 50)
--chunk-size CHUNK_SIZE, -c CHUNK_SIZE
Maximum number of tokenized sequences in a single parquet file (default: 200_000)
--out-dir OUT_DIR Local directory to save the resulting dataset
--out-repo OUT_REPO HF repo id to upload the resulting dataset
```

1. follow setup instructions above
2. install pre-commit `pre-commit install`
3. install recommended vscode extensions
Here's how we tokenized the dataset for our `stories-*` suite of models. Please note that you can use single-letter abbreviations for most arguments.

When you save a file vscode should automatically format it. Otherwise, pre-commit will do that, but you will need to add the changes and commit again.
For the `train` split:
```
> scripts/tokenize_dataset.py \
--in-dataset delphi-suite/stories \
--feature story \
--split train \
--tokenizer delphi-suite/stories-tokenizer \
--seq-len 512 \
--out-repo delphi-suite/stories-tokenized
```
For the `validation` split, repeated arguments omitted:
```
> scripts/tokenize_dataset.py \
...
--split validation \
...
```

## Pull Requests

1. make a branch
- if it relates to an existing issue
- go to the issue page and click _Create a branch_ under _Development_
- if the default name is not very long, keep it; otherwise, make it shorter, but keep the issue number in the front
- otherwise pick a short but descriptive name, a few hyphen-separated-words
2. make your changes
- include unit tests
- update README if needed
- if new huggingface datasets/models are added to testing, increment the cache number in `.github/workflows/checks.yml`
3. make a pull request
- if it isn't ready for review yet, mark it as draft
- check if CI is passing
- if the change is big, try to keep the commit history clean using interactive rebase
- don't push more often than it's needed, we're running github actions on a free tier
- if there were any changes to the main branch, rebase on top of it
- explain the change
- provide short description; focus on things that were not mentioned in the relevant issue
- comment important sections of the code in _Files changed_ tab
- when it's ready, add the relevant stakeholders as reviewers
4. after the comments are resolved and PR is approved, merge it using _Squash and merge_

## Incrementing Versions
When making a new release, increment the version in `delphi/__init__.py`
The input dataset is the same as in the tokenizer training example above. We tokenize it with our custom [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer) into sequences of length 512. We upload it to the HF dataset repo [delphi-suite/stories-tokenized](https://huggingface.co/datasets/delphi-suite/stories-tokenized).

Please note that you can use any HuggingFace tokenizer; you don't need to train a custom one.
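
Since the output lives in an ordinary HF dataset repo, it can presumably be pulled back down with the `datasets` library. A minimal sketch; the exact column layout isn't documented here, so inspect the features rather than assuming a column name:

```python
# Minimal sketch, not part of delphi itself: load the tokenized dataset from
# the HF Hub and inspect its schema.
from datasets import load_dataset

ds = load_dataset("delphi-suite/stories-tokenized", split="train")
print(ds)           # number of rows and feature names
print(ds.features)  # schema of the tokenized sequences
```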

# Training a model

To train a model, you'll need to create a config file. For examples see `configs/`, and for field descriptions see `delphi/train/config/training_config.py`. The training script is located in `scripts/train_model.py`.

Script usage:

```
> scripts/train_model.py --help
usage: train_model.py [-h] [--overrides [OVERRIDES ...]] [-v | -s] [config_files ...]
Train a delphi model
positional arguments:
config_files Path to json file(s) containing config values, e.g. 'primary_config.json secondary_config.json'.
options:
-h, --help show this help message and exit
--overrides [OVERRIDES ...]
Override config values with space-separated declarations. e.g. `--overrides model_config.hidden_size=42 run_name=foo`
-v, --verbose Increase verbosity level, repeatable (e.g. -vvv). Mutually exclusive with --silent, --loglevel
-s, --silent Silence all logging. Mutually exclusive with --verbose, --loglevel
```

You can specify a primary config and a secondary config, which is useful if you're training a suite of models that differ only in a few parameters. Additionally, you can override specific fields using the `--overrides` flag. If you don't want to push the model and its checkpoints to HF, you need to explicitly set `out_repo=""`. If you don't want to log to W&B, you need to set `wandb=""`. Please note that by default we save the optimizer state (2x model size) with every checkpoint.

Here is how we trained our `stories-mamba-100k` model:
```
> scripts/train_model.py \
configs/stories/mamba/base.json \
configs/stories/mamba/100k.json \
--overrides \
out_repo="delphi-suite/stories-mamba-100k" \
wandb="delphi-suite/delphi"
```
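
Assuming the finished model is pushed to the root of `out_repo` as a standard `CausalLM` checkpoint (an assumption, not something this README spells out), it should be loadable with `transformers`:

```python
# Minimal sketch, assuming the trained model was pushed to the repo root as a
# standard HF CausalLM checkpoint; not a documented delphi API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("delphi-suite/stories-mamba-100k")
tokenizer = AutoTokenizer.from_pretrained("delphi-suite/stories-tokenizer")

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids[0]))
```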

# Development

1. Install the `dev` and `notebooks` dependencies `pip install -e ."[dev,notebooks]"`.
2. Run the tests `pytest`.
3. Install pre-commit `pre-commit install`.
4. Install the recommended vscode extensions.

When you save a file vscode should automatically format it. Otherwise, pre-commit will do that, but you will need to add the changes and commit again.

# Citation

If you use `delphi` in your research, please cite using the following
If you use delphi in your research, please cite using the following

```bibtex
@software{delphi,
title = {delphi: small language models training made easy},
author = {Jett Janiak, Jai Dhyani, Jannik Brinkmann, Gonçalo Paulo, Joshua Wendland, Víctor Abia Alonso, Siwei Li, Rai (Phan Anh Duong), Alice Rigg},
author = {Jett Janiak, Jai Dhyani, Jannik Brinkmann, Gonçalo Paulo, Joshua Wendland, Víctor Abia Alonso, Siwei Li, Phan Anh Duong, Alice Rigg},
year = 2024,
url = {https://github.com/delphi-suite/delphi},
license = {apache-2.0}
}
```
20 changes: 0 additions & 20 deletions configs/debug.json

This file was deleted.

52 changes: 0 additions & 52 deletions configs/sample_config.json

This file was deleted.

22 changes: 0 additions & 22 deletions configs/sample_mamba.json

This file was deleted.

