Add Tiktoken as tokenizer #30

Merged: 4 commits, Mar 15, 2024
README.md: 85 changes (8 additions & 77 deletions)
@@ -1,82 +1,13 @@
# ⚡️ Nanotron
# ⚡️ FMEngine

The objective of this library is to provide easy distributed primitives in order to train a variety of models efficiently using 3D parallelism. For more information about the internal design of the library or 3D parallelism in general, please check out [[docs.md]](./docs/docs.md) and [[3d_parallelism.md]](./docs/3d_parallelism.md).
FMEngine is our opinionated take on a foundation model training framework. The first version of FMEngine was built on top of `PyTorch` and `DeepSpeed` and was designed as a drop-in replacement for `DeepSpeed` with a few additional features. For `v2`, we forked HuggingFace's `nanotron` and added some features to make it easier to use.


# Philosophy

- Make it fast. At least as fast as other open source versions.
- Make it minimal. We don't actually need to support all techniques and all versions of 3D parallelism. What matters is that we can efficiently use the "best" ones.
- Make everything explicit instead of transparent. Transparent abstractions work well when they work, but they are a horrible debugging experience when one doesn't understand the implications of the techniques used. To mitigate this, we choose to be explicit about what the library does.

# Core Features

We support the following:
- 3D parallelism, including one-forward-one-backward pipeline engine
- ZeRO-1 optimizer
- FP32 gradient accumulation (see the sketch after this list)
- Parameter tying/sharding
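
To make the FP32 gradient accumulation item concrete, here is a minimal, generic PyTorch sketch of the technique: the model computes in BF16 while gradients are accumulated into FP32 buffers before the optimizer step. This is an illustrative assumption of how such a scheme typically looks, not this library's actual implementation; the `fp32_grad_buffers` name and the toy model are made up for the example.

```python
# Generic sketch of FP32 gradient accumulation (illustrative, not the library's code).
# The model runs in bf16; gradients are accumulated into fp32 buffers before the
# optimizer step, which reduces rounding error across accumulation steps.
import torch

model = torch.nn.Linear(1024, 1024).to(torch.bfloat16)  # toy bf16 model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
fp32_grad_buffers = {n: torch.zeros_like(p, dtype=torch.float32)  # one fp32 buffer per param
                     for n, p in model.named_parameters()}
accumulation_steps = 4

for step in range(accumulation_steps):
    x = torch.randn(8, 1024, dtype=torch.bfloat16)
    loss = model(x).float().pow(2).mean() / accumulation_steps  # toy loss
    loss.backward()
    for n, p in model.named_parameters():
        fp32_grad_buffers[n] += p.grad.float()  # accumulate in fp32
        p.grad = None                           # clear the bf16 grad each micro-batch

# Copy the accumulated fp32 gradients back (cast to the param dtype) and step.
# In a real training loop you would also zero the fp32 buffers here.
for n, p in model.named_parameters():
    p.grad = fp32_grad_buffers[n].to(p.dtype)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```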

# Installation

Requirements:
- Python >= 3.10
- PyTorch >= 2.0.0
- Flash-Attention >= 2.5.0

To install (in a new env):
```bash
pip install torch
pip install packaging; pip install "flash-attn>=2.5.0" --no-build-isolation
git clone git@github.com:huggingface/nanotron.git
cd nanotron
pip install -e .
```

It is also nice to have `transformers`, `datasets`, `python-etcd`, and `tensorboardX`: `pip install transformers datasets python-etcd tensorboardX`

We also support a set of flavors that you can install using `pip install -e ".[$FLAVOR]"`:
- `dev`: Use this if you are developing in `nanotron`. In particular, it installs our linting tooling; on top of that, you have to run `pre-commit install` afterwards.
- `test`: We use `pytest` to run our test suite. To run tests in parallel, this flavor installs `pytest-xdist`, which you can leverage by running `pytest -n 12 tests` (12 being the number of parallel workers)


# Quick examples

In the `/examples` directory, you can find a few example configuration files and a script to run them.

You can run a sample training using:
```bash
torchrun --nproc_per_node=8 run_train.py --config-file examples/debug_run_train.yaml
```

And run a sample generation using:
```bash
torchrun --nproc_per_node=8 run_generation.py --ckpt-path checkpoints/text/4
```

# Development guidelines

If you plan on developing on `nanotron`, we suggest you install the `dev` flavor: `pip install -e ".[dev]"`

We use pre-commit to run a set of hooks on each commit, mostly code normalization, so that the codebase stays consistent. Please do run `pre-commit install`.

For the linting:
```bash
pre-commit install
pre-commit run --config .pre-commit-config.yaml --all-files
```

Features we would like to add:
- [ ] Support `torch.compile`
- [ ] Support `torch.distributed.rpc`
- [ ] More optimized kernels
- [ ] Support ZeRO-3
- [ ] Other PP schedules (such as Interleaved 1f1b...)
- [ ] Ring attention / Sequence Parallelism
- [ ] 3D Parallel MoEs
- [ ] Supporting more architectures (Mamba..)
- [ ] ...

# Credits

We would like to thank everyone working on LLMs, especially those sharing their work openly from which we took great inspiration: Nvidia for `Megatron-LM/apex`, Microsoft for `DeepSpeed`, HazyResearch for `flash-attn`
We would like to thank everyone working on LLMs, especially those sharing their work openly from which we took great inspiration:

- HuggingFace for `nanotron`,
- Nvidia for `Megatron-LM/apex`,
- Microsoft for `DeepSpeed`,
- HazyResearch for `flash-attn`
File renamed without changes.
@@ -9,7 +9,7 @@ data:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: default
hf_dataset_or_datasets: cerebras/SlimPajama-627B
hf_dataset_or_datasets: DKYoon/SlimPajama-6B
hf_dataset_splits: train
text_column_name: text
num_loading_workers: 1
@@ -51,7 +51,7 @@ model:
sliding_window: 4096
tie_word_embeddings: true
use_cache: true
vocab_size: 32000
vocab_size: 102000
optimizer:
accumulate_grad_in_fp32: true
adam_beta1: 0.9
@@ -70,16 +70,17 @@ optimizer:
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 2
dp: 1
pp: 1
pp_engine: 1f1b
tp: 1
tp_linear_async_communication: true
tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
tokenizer_type: openai
tokenizer_max_length: null
tokenizer_name_or_path: mistralai/Mistral-7B-v0.1
tokenizer_name_or_path: cl100k_base
tokenizer_revision: null
tokens:
batch_accumulation_per_replica: 1
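
This PR points the tokenizer config at tiktoken's `cl100k_base` encoding (`tokenizer_type: openai`, `tokenizer_name_or_path: cl100k_base`). As a hedged illustration, the sketch below shows how that encoding can be loaded and used standalone with the public `tiktoken` API (`get_encoding`, `encode`, `decode`, `n_vocab`); the `TiktokenWrapper` class is an illustrative assumption, not the actual interface wired into the config.

```python
# Minimal sketch: loading the cl100k_base encoding with tiktoken (pip install tiktoken).
# TiktokenWrapper is an illustrative assumption, not FMEngine's actual tokenizer class.
import tiktoken

class TiktokenWrapper:
    def __init__(self, name: str = "cl100k_base"):
        self.enc = tiktoken.get_encoding(name)

    @property
    def vocab_size(self) -> int:
        # cl100k_base has roughly 100k tokens, which is presumably why the config
        # bumps vocab_size from 32000 to 102000 (padded above enc.n_vocab).
        return self.enc.n_vocab

    def encode(self, text: str) -> list[int]:
        return self.enc.encode(text)

    def decode(self, ids: list[int]) -> str:
        return self.enc.decode(ids)

tok = TiktokenWrapper("cl100k_base")
ids = tok.encode("Hello, FMEngine!")
print(ids, tok.decode(ids), tok.vocab_size)
```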