Commit 4c72efc (1 parent: de4555e) — showing 3 changed files with 43 additions and 18 deletions.
First changed file (notes on the llama2 config):

@@ -1,7 +1,6 @@
-not using padding, so pad_token_id not set
-use_cache - using default
-pretraining_tp - experimental parallelization we're not using, which is the default
-tie_word_embeddings - llama2 used False and this is better for interpretability, note that llama2.c is using True by default, which is probably more efficient use of parameters for very small models
-rope settings are widely used defaults
-attention_bias - no biases on QKV and output projection is the default and that's what we're using
-attention_dropout - this is the only dropout llama2 can use, it's set to prob=0 by default and that's what we're using
+- use_cache - using default
+- pretraining_tp - experimental parallelization we're not using, which is the default
+- tie_word_embeddings - llama2 used False and this is better for interpretability, note that llama2.c is using True by default, which is probably more efficient use of parameters for very small models
+- rope settings are widely used defaults
+- attention_bias - no biases on QKV and output projection is the default and that's what we're using
+- attention_dropout - this is the only dropout llama2 can use, it's set to prob=0 by default and that's what we're using
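Read together, these notes describe a llama2-style config that keeps the library defaults everywhere except for explicitly untied embeddings. A minimal sketch of what that looks like with a transformers-style LlamaConfig (v4.40-era parameter names; the small model dimensions are illustrative placeholders, not values taken from this commit):

```python
from transformers import LlamaConfig

config = LlamaConfig(
    # Illustrative small-model dimensions (placeholders, not from the commit).
    vocab_size=4096,
    hidden_size=512,
    intermediate_size=1376,
    num_hidden_layers=8,
    num_attention_heads=8,
    # pad_token_id is left unset: no padding is used.
    use_cache=True,             # library default
    pretraining_tp=1,           # experimental parallelization disabled (default)
    tie_word_embeddings=False,  # untied, as in llama2 (llama2.c ties them by default)
    rope_theta=10000.0,         # widely used RoPE default
    attention_bias=False,       # no biases on QKV/output projections (default)
    attention_dropout=0.0,      # the only dropout llama2 can use; kept at 0
)
```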
Second changed file (notes on the mamba config):

@@ -1,10 +1,8 @@
-pad_token_id - we're not using pad tokens, do we don't set it
-layer_norm_eps - different than rms norm eps in mamba
-initializer_range - different in mamba & llama
-residual_in_fp32 - mamba specific parameter
-time_step_* - mamba specific, sane defaults
-there is no way to untie embeddings and unembeddings in mamba, they're tied by default
-https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/mamba/modeling_mamba.py#L602-L610
-rescale_prenorm_residual was True in original paper, so we set it to True, despite HF default being false
-using default for use_cache
-state_size is default
+- layer_norm_eps - different than rms norm eps in llama
+- initializer_range - different in mamba & llama
+- residual_in_fp32 - mamba specific parameter
+- time_step_* - mamba specific, sane defaults
+- there is no way to untie embeddings and unembeddings in mamba, they're tied by default https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/mamba/modeling_mamba.py#L602-L610
+- rescale_prenorm_residual was True in original paper, so we set it to True, despite HF default being false
+- using default for use_cache
+- state_size is default
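And the mamba side, as a minimal sketch assuming the transformers v4.40 MambaConfig parameter names (where the notes' layer_norm_eps corresponds to layer_norm_epsilon); sizes are again illustrative placeholders, and embeddings stay tied because the linked modeling code offers no way to untie them:

```python
from transformers import MambaConfig

config = MambaConfig(
    # Illustrative small-model dimensions (placeholders, not from the commit).
    vocab_size=4096,
    hidden_size=512,
    num_hidden_layers=8,
    state_size=16,                  # library default
    layer_norm_epsilon=1e-5,        # mamba's norm eps; a different knob than llama's rms_norm_eps
    initializer_range=0.1,          # differs between the mamba and llama defaults
    residual_in_fp32=True,          # mamba-specific parameter
    time_step_rank="auto",          # time_step_* left at the sane defaults
    rescale_prenorm_residual=True,  # True in the original paper, False in HF by default
    use_cache=True,                 # library default
)
```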