Commit

README update
jettjaniak committed May 25, 2024
1 parent cf70f56 commit 50d4023
Showing 3 changed files with 43 additions and 18 deletions.
30 changes: 29 additions & 1 deletion README.md
@@ -1,3 +1,8 @@
# delphi

delphi is a set of tools for standardized and (mostly) reproducible training of small language models. You can use delphi to train a custom tokenizer, tokenize your dataset, and train your model. We build on top of HuggingFace, supporting every `CausalLM` architecture. Datasets, tokenizers and models (including checkpoints!) can be downloaded from and uploaded to HuggingFace automatically, with no need to manage local files.


# Setup

1. Clone the repo
@@ -155,7 +160,7 @@ options:
-s, --silent Silence all logging. Mutually exclusive with --verbose, --loglevel
```

-You can specify primary config and secondary config, which is useful if you're training a suite of models that only differ in a few parameters. Additionally, you can override specific fields using the `--overrides` flag. If you don't want to push the model and its checkpoints to HF, you need to explicitly set `out_repo=""`. If you don't want to log to W&B, you need to set `wandb=""`.
+You can specify primary config and secondary config, which is useful if you're training a suite of models that only differ in a few parameters. Additionally, you can override specific fields using the `--overrides` flag. If you don't want to push the model and its checkpoints to HF, you need to explicitly set `out_repo=""`. If you don't want to log to W&B, you need to set `wandb=""`. Please note that by default we save the optimizer state (2x model size) with every checkpoint.
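
As a rough illustration of the storage note about optimizer state, here is a back-of-the-envelope sketch in Python. It assumes an Adam-style optimizer that keeps two extra values per parameter (one plausible source of the 2x figure) and 4-byte float32 values. The parameter count and number of checkpoints are hypothetical, and the function is not part of delphi.

```python
def checkpoint_size_bytes(
    n_params: int,
    bytes_per_value: int = 4,       # assumption: float32 weights
    optimizer_multiplier: int = 2,  # assumption: optimizer state is 2x model size
) -> int:
    """Estimate the size of one checkpoint: model weights plus optimizer state."""
    model_bytes = n_params * bytes_per_value
    optimizer_bytes = optimizer_multiplier * model_bytes
    return model_bytes + optimizer_bytes


n_params = 100_000   # hypothetical model size
n_checkpoints = 50   # hypothetical number of saved checkpoints

per_checkpoint = checkpoint_size_bytes(n_params)
total = per_checkpoint * n_checkpoints
print(f"per checkpoint: {per_checkpoint / 1e6:.1f} MB, total: {total / 1e6:.1f} MB")
```

Under these assumptions each checkpoint takes roughly three times the space of the weights alone, which may matter when many checkpoints are pushed to a repository.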

Here is how we trained our `stories-mamba-100k` model
```
@@ -166,3 +171,26 @@ Here is how we trained our `stories-mamba-100k` model
out_repo="delphi-suite/stories-mamba-100k" \
wandb="delphi-suite/delphi"
```

# Development

1. Install the `dev` and `notebooks` dependencies: `pip install -e ."[dev,notebooks]"`.
2. Run the tests: `pytest`.
3. Install the pre-commit hooks: `pre-commit install`.
4. Install the recommended VS Code extensions.

When you save a file, VS Code should format it automatically. If it doesn't, pre-commit will format it when you commit, but you will need to add the changes and commit again.

# Citation

If you use delphi in your research, please cite it using the following:

```bibtex
@software{delphi,
title = {delphi: small language models training made easy},
author = {Jett Janiak and Jai Dhyani and Jannik Brinkmann and Gonçalo Paulo and Joshua Wendland and Víctor Abia Alonso and Siwei Li and Phan Anh Duong and Alice Rigg},
year = 2024,
url = {https://github.com/delphi-suite/delphi},
license = {apache-2.0}
}
```
13 changes: 6 additions & 7 deletions configs/stories/llama2/README.md
@@ -1,7 +1,6 @@
-not using padding, so pad_token_id not set
-use_cache - using default
-pretraining_tp - experimental parallelization we're not using, which is the default
-tie_word_embeddings - llama2 used False and this is better for interpretability, note that llama2.c is using True by default, which is probably more efficient use of parameters for very small models
-rope settings are widely used defaults
-attention_bias - no biases on QKV and output projection is the default and that's what we're using
-attention_dropout - this is the only dropout llama2 can use, it's set to prob=0 by default and that's what we're using
+- use_cache - using default
+- pretraining_tp - experimental parallelization we're not using, which is the default
+- tie_word_embeddings - llama2 used False and this is better for interpretability, note that llama2.c is using True by default, which is probably more efficient use of parameters for very small models
+- rope settings are widely used defaults
+- attention_bias - no biases on QKV and output projection is the default and that's what we're using
+- attention_dropout - this is the only dropout llama2 can use, it's set to prob=0 by default and that's what we're using
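
To make the `tie_word_embeddings` trade-off concrete, the sketch below counts the parameters in the embedding and unembedding matrices with and without tying. It assumes the output projection has no bias; the vocabulary size and hidden size are hypothetical values chosen only for illustration, not taken from the configs in this repo.

```python
def embedding_params(vocab_size: int, hidden_size: int, tie_word_embeddings: bool) -> int:
    """Count parameters in the input embedding plus the output (unembedding) projection."""
    embed = vocab_size * hidden_size
    # With tying, the unembedding reuses the embedding matrix and adds no parameters.
    unembed = 0 if tie_word_embeddings else vocab_size * hidden_size
    return embed + unembed


vocab_size, hidden_size = 4096, 48  # hypothetical small-model dimensions

untied = embedding_params(vocab_size, hidden_size, tie_word_embeddings=False)
tied = embedding_params(vocab_size, hidden_size, tie_word_embeddings=True)
print(f"untied: {untied:,} params, tied: {tied:,} params, saved by tying: {untied - tied:,}")
```

For very small models these matrices can be a large share of the total parameter budget, which is why tying may be the more parameter-efficient choice even though untied weights are easier to interpret.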
18 changes: 8 additions & 10 deletions configs/stories/mamba/README.md
@@ -1,10 +1,8 @@
-pad_token_id - we're not using pad tokens, do we don't set it
-layer_norm_eps - different than rms norm eps in mamba
-initializer_range - different in mamba & llama
-residual_in_fp32 - mamba specific parameter
-time_step_* - mamba specific, sane defaults
-there is no way to untie embeddings and unembeddings in mamba, they're tied by default
-https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/mamba/modeling_mamba.py#L602-L610
-rescale_prenorm_residual was True in original paper, so we set it to True, despite HF default being false
-using default for use_cache
-state_size is default
+- layer_norm_eps - different than rms norm eps in llama
+- initializer_range - different in mamba & llama
+- residual_in_fp32 - mamba specific parameter
+- time_step_* - mamba specific, sane defaults
+- there is no way to untie embeddings and unembeddings in mamba, they're tied by default https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/mamba/modeling_mamba.py#L602-L610
+- rescale_prenorm_residual was True in original paper, so we set it to True, despite HF default being false
+- using default for use_cache
+- state_size is default
