* final evaluation notebook (#98)
  * model specify + nextlogprobs load + tok specify
  * Draft version of plotting
  * get prompt examples
  * add meeting todos
  * grouped imports and cells
  * redid typing and fixed bugs
  * unique token check
  * Remove token category in calculation function
  * Remove token category from vis function
  * removed token_group in token diff + calculate all loss
  * vis_pos_map highlight + optimization
  * fixes: tokenization, mask, typing
  * use interact_manual for resampling
  * update quantile function tests
  * small update
  * beartype fix
  * rm comment
  * var rename
  * eval notebook updates

  Co-authored-by: Siwei Li <[email protected]>
  Co-authored-by: VICTOR ABIA <[email protected]>
  Co-authored-by: Jett <[email protected]>
  Co-authored-by: JaiDhyani <[email protected]>
  Co-authored-by: Jai <[email protected]>
* remove stale notebooks, token map & labelling
* requirements revamp
* stale eval code purge
* remove constants
* use util in tokenize_dataset
* use load_dataset_split_* utils
* simpler structure: remove eval & dataset dirs
* src/delphi -> delphi
* remove stale test configs
* update HF cache version
* add notebooks deps to gh actions
* remove stale configs
* typo fix in train_tokenizer
* platformdirs dependency
* cosmetic changes in configs & args
* create HF repo once, outside try/except
* README: # Setup
* README: tokeniz*
* config help as comments
* dataset.name -> dataset.path
* README update
* README update
* validate_configs: overrides, init model
* README: out-repo typo

Co-authored-by: Rai <[email protected]>
Co-authored-by: Siwei Li <[email protected]>
Co-authored-by: VICTOR ABIA <[email protected]>
Co-authored-by: JaiDhyani <[email protected]>
Co-authored-by: Jai <[email protected]>
1 parent 8670bb8 · commit af79e9e · showing 80 changed files with 1,479 additions and 3,675 deletions.
# delphi

delphi is a set of tools for standardized and (mostly) reproducible training of small language models. You can use delphi to train a custom tokenizer, tokenize your dataset, and train your model. We build on top of HuggingFace, supporting every `CausalLM` architecture. Datasets, tokenizers and models (including checkpoints!) can be downloaded from and uploaded to HuggingFace automatically, with no need to manage local files.

# Setup

1. Clone the repo
   ```shell
   git clone https://github.com/delphi-suite/delphi.git
   cd delphi
   ```
2. Make & activate a python >= 3.10 virtual env
   ```shell
   python3.10 -m venv .venv
   source .venv/bin/activate
   ```
3. Install the project in editable state
   `pip install -e .`
   See the `[project.optional-dependencies]` section in `pyproject.toml` for additional dependencies, e.g. you may want to `pip install -e ."[dev,mamba_cuda]"`
4. Get your HuggingFace and W&B tokens and put them in the environment variables
   ```shell
   export HF_TOKEN=...
   export WANDB_API_KEY=...
   ```
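
To double-check the setup, you can confirm that the tokens are visible to the current process and that the editable install works. This is a minimal sketch, not an official delphi command:

```python
import os

import delphi  # should import cleanly after `pip install -e .`

# Check that the tokens from step 4 are visible to the current process
for var in ("HF_TOKEN", "WANDB_API_KEY"):
    print(var, "is set" if os.environ.get(var) else "is MISSING")

print("delphi imported from", delphi.__file__)
```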

# Training a tokenizer

If you want to train a small and efficient model on a narrow dataset, then we recommend using a custom tokenizer with a small vocabulary. To train a reversible, GPT2-style, BPE tokenizer you can use `scripts/train_tokenizer.py`.

Script usage:

```
> scripts/train_tokenizer.py --help
usage: train_tokenizer.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT
                          --vocab-size VOCAB_SIZE [--out-dir OUT_DIR] [--out-repo OUT_REPO]

Train a custom, reversible, BPE tokenizer (GPT2-like). You need to provide --out-repo or --out-dir.

options:
  -h, --help            show this help message and exit
  --in-dataset IN_DATASET, -i IN_DATASET
                        Dataset you want to train the tokenizer on. Local path or HF repo id
  --feature FEATURE, -f FEATURE
                        Name of the feature (column) containing text documents in the input dataset
  --split SPLIT, -s SPLIT
                        Split of the dataset to be used for tokenizer training, supports slicing like 'train[:10%]'
  --vocab-size VOCAB_SIZE, -v VOCAB_SIZE
                        Vocabulary size of the tokenizer
  --out-dir OUT_DIR     Local directory to save the resulting tokenizer
  --out-repo OUT_REPO   HF repo id to upload the resulting tokenizer
```

Here's how we trained the tokenizer for our `stories-*` suite of models. Please note that you can use single letter abbreviations for most arguments.

```
> scripts/train_tokenizer.py \
  --in-dataset delphi-suite/stories \
  --feature story \
  --split train \
  --vocab-size 4096 \
  --out-repo delphi-suite/stories-tokenizer
```

We use the only feature named `story` in the `train` split of [delphi-suite/stories](https://huggingface.co/datasets/delphi-suite/stories). We train a tokenizer with a vocabulary of 4096 tokens, and upload it to HF model repo [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer).
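
As a quick sanity check (a minimal sketch, not part of the delphi scripts, assuming the `transformers` library is available), you can load the resulting tokenizer like any other HuggingFace tokenizer and confirm that decoding inverts encoding:

```python
from transformers import AutoTokenizer

# Load the tokenizer trained above directly from the HF Hub
tokenizer = AutoTokenizer.from_pretrained("delphi-suite/stories-tokenizer")

text = "Once upon a time, there was a little girl named Lily."
# add_special_tokens=False keeps <bos>/<eos> out of the round-trip check
ids = tokenizer.encode(text, add_special_tokens=False)

print(f"{len(ids)} tokens: {ids}")
# "Reversible" here means decoding should reproduce the original text exactly
print(tokenizer.decode(ids) == text)  # expected: True
```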

# Tokenizing a dataset

To turn a collection of text documents into sequences of tokens required for model training, you can use `scripts/tokenize_dataset.py`. All documents are tokenized and concatenated, with the `<eos>` token as a separator, e.g.
```
doc1_tok1, doc1_tok2, ..., doc1_tokX, <eos>, doc2_tok1, doc2_tok2, ..., doc2_tokX, <eos>, doc3_tok1, ...
```
Then this is divided into chunks, and the `<bos>` token is inserted at the beginning of each chunk, e.g.
```
<bos> doc1_tok1, doc1_tok2, ..., doc1_tokX, <eos>, doc2_tok1
<bos> doc2_tok2, ..., doc2_tok511
<bos> doc2_tok512, doc2_tok513, ..., doc2_tokX <eos>, doc3_tok1, ...
...
```
It will produce sequences of the specified size, discarding the last chunk if it's too short. We don't use padding.
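
For illustration only, here is a toy Python version of the concatenate-then-chunk logic described above; the real implementation lives in `scripts/tokenize_dataset.py`, and the token ids below are made up:

```python
def chunk_documents(
    docs: list[list[int]], seq_len: int, bos: int, eos: int
) -> list[list[int]]:
    """Toy version of the chunking described above (not the delphi implementation)."""
    # 1. Concatenate all tokenized documents, separated by <eos>
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos)

    # 2. Cut the stream into pieces of seq_len - 1 tokens, prepend <bos> to each,
    #    and discard a final piece that comes out short (no padding)
    chunks = []
    for start in range(0, len(stream), seq_len - 1):
        body = stream[start : start + seq_len - 1]
        if len(body) == seq_len - 1:
            chunks.append([bos] + body)
    return chunks


# Tiny example: three "documents" of token ids, target sequence length 4
docs = [[11, 12, 13], [21, 22, 23, 24, 25], [31, 32]]
for seq in chunk_documents(docs, seq_len=4, bos=0, eos=1):
    print(seq)
```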

Script usage:

```
> scripts/tokenize_dataset.py --help
usage: tokenize_dataset.py [-h] --in-dataset IN_DATASET --feature FEATURE --split SPLIT
                           --tokenizer TOKENIZER --seq-len SEQ_LEN
                           [--batch-size BATCH_SIZE] [--chunk-size CHUNK_SIZE]
                           [--out-dir OUT_DIR] [--out-repo OUT_REPO]

Tokenize a text dataset using a specific tokenizer

options:
  -h, --help            show this help message and exit
  --in-dataset IN_DATASET, -i IN_DATASET
                        Dataset you want to tokenize. Local path or HF repo id
  --feature FEATURE, -f FEATURE
                        Name of the feature (column) containing text documents in the input dataset
  --split SPLIT, -s SPLIT
                        Split of the dataset to be tokenized, supports slicing like 'train[:10%]'
  --tokenizer TOKENIZER, -t TOKENIZER
                        HF repo id or local directory containing the tokenizer
  --seq-len SEQ_LEN, -l SEQ_LEN
                        Length of the tokenized sequences
  --batch-size BATCH_SIZE, -b BATCH_SIZE
                        How many text documents to tokenize at once (default: 50)
  --chunk-size CHUNK_SIZE, -c CHUNK_SIZE
                        Maximum number of tokenized sequences in a single parquet file (default: 200_000)
  --out-dir OUT_DIR     Local directory to save the resulting dataset
  --out-repo OUT_REPO   HF repo id to upload the resulting dataset
```

Here's how we tokenized the dataset for our `stories-*` suite of models. Please note that you can use single letter abbreviations for most arguments.

For the `train` split:
```
> scripts/tokenize_dataset.py \
  --in-dataset delphi-suite/stories \
  --feature story \
  --split train \
  --tokenizer delphi-suite/stories-tokenizer \
  --seq-len 512 \
  --out-repo delphi-suite/stories-tokenized
```
For the `validation` split, repeated arguments omitted:
```
> scripts/tokenize_dataset.py \
  ...
  --split validation \
  ...
```

The input dataset is the same as in the tokenizer training example above. We tokenize it with our custom [delphi-suite/stories-tokenizer](https://huggingface.co/delphi-suite/stories-tokenizer) into sequences of length 512. We upload it to HF dataset repo [delphi-suite/stories-tokenized](https://huggingface.co/datasets/delphi-suite/stories-tokenized).

Please note that you can use any HuggingFace tokenizer; you don't need to train a custom one.
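
If you want to inspect or reuse a tokenized dataset in your own code, it loads like any other HF dataset. A minimal sketch, assuming the `datasets` library is available; the column name `tokens` is an assumption, so check the dataset card for the actual feature name:

```python
from datasets import load_dataset

# Load the tokenized validation split produced above
ds = load_dataset("delphi-suite/stories-tokenized", split="validation")

print(ds)  # column names and number of sequences
seq = ds[0]["tokens"]  # NOTE: column name "tokens" is an assumption
print(len(seq), seq[:10])  # sequence length and the first few token ids
```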

# Training a model

To train a model, you'll need to create a config file. For examples see `configs/`, and for field descriptions see `delphi/train/config/training_config.py`. The training script is located in `scripts/train_model.py`.

Script usage:

```
> scripts/train_model.py --help
usage: train_model.py [-h] [--overrides [OVERRIDES ...]] [-v | -s] [config_files ...]

Train a delphi model

positional arguments:
  config_files          Path to json file(s) containing config values, e.g. 'primary_config.json secondary_config.json'.

options:
  -h, --help            show this help message and exit
  --overrides [OVERRIDES ...]
                        Override config values with space-separated declarations. e.g. `--overrides model_config.hidden_size=42 run_name=foo`
  -v, --verbose         Increase verbosity level, repeatable (e.g. -vvv). Mutually exclusive with --silent, --loglevel
  -s, --silent          Silence all logging. Mutually exclusive with --verbose, --loglevel
```

You can specify a primary config and a secondary config, which is useful if you're training a suite of models that only differ in a few parameters. Additionally, you can override specific fields using the `--overrides` flag. If you don't want to push the model and its checkpoints to HF, you need to explicitly set `out_repo=""`. If you don't want to log to W&B, you need to set `wandb=""`. Please note that by default we save the optimizer state (2x model size) with every checkpoint.

Here is how we trained our `stories-mamba-100k` model:
```
> scripts/train_model.py \
  configs/stories/mamba/base.json \
  configs/stories/mamba/100k.json \
  --overrides \
  out_repo="delphi-suite/stories-mamba-100k" \
  wandb="delphi-suite/delphi"
```
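
After a run like this finishes and is pushed to HF, the result is an ordinary `CausalLM` checkpoint. Here is a minimal sketch of loading and sampling from it, assuming the repo above was actually uploaded and your `transformers` version supports the mamba architecture:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the trained model and the tokenizer it was trained with
model = AutoModelForCausalLM.from_pretrained("delphi-suite/stories-mamba-100k")
tokenizer = AutoTokenizer.from_pretrained("delphi-suite/stories-tokenizer")

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```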

# Development

1. Install the `dev` and `notebooks` dependencies: `pip install -e ."[dev,notebooks]"`.
2. Run the tests: `pytest`.
3. Install pre-commit: `pre-commit install`.
4. Install the recommended vscode extensions.

When you save a file, vscode should automatically format it. Otherwise, pre-commit will do that, but you will need to add the changes and commit again.

# Citation

If you use delphi in your research, please cite using the following:

```bibtex
@software{delphi,
  title = {delphi: small language models training made easy},
  author = {Jett Janiak, Jai Dhyani, Jannik Brinkmann, Gonçalo Paulo, Joshua Wendland, Víctor Abia Alonso, Siwei Li, Phan Anh Duong, Alice Rigg},
  year = 2024,
  url = {https://github.com/delphi-suite/delphi},
  license = {apache-2.0}
}
```