Training on Your Own Data

This guide is meant to be a detailed walkthrough of training a model on your own data using Levanter.

The basic steps are:

1. Set up your environment (TPU or GPU, and WandB).
2. Prepare your data: get it into (compressed) JSONL files or a Huggingface dataset.
3. Write a training configuration (data, model, trainer, optimizer).
4. Launch training, then monitor it, evaluate, and optionally export to Huggingface.

If you're training on data that isn't text (or audio-to-text), you'll need to write a custom cache. See the section on Direct Cache Construction.

Environment Setup

TPU Setup

See the TPU guide for instructions on setting up a TPU VM instance. You should go through the installation steps in that guide before continuing. Don't spin up a TPU VM instance yet, though.

CUDA Setup

See the GPU guide for instructions on setting up a CUDA machine.

WandB Setup

Levanter mainly uses WandB for logging. You should create a WandB account and get an API key.
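
For example, on most setups you can either run wandb login once interactively or export the key as an environment variable before launching (the key itself is a placeholder here):

export WANDB_API_KEY=...   # or run `wandb login` and paste the key when prompted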

Data Preparation

The key ingredient for training an LM is a lot of plain-text data. We have two top-level ways of consuming training data: a single source and a mixture of sources. A single source is simpler and probably closer to what you're used to, while a mixture of sources allows you to have multiple evaluation sets or use techniques like DoReMi.

Data Sources

In Levanter, a data source can either be a list of training and validation URLs pointing to (possibly compressed) JSONL files, or a Huggingface Dataset. In either case, we assume there is a single field, by default called "text", that contains the text of the example. If you have a sequence-to-sequence task, then (as of September 2023) you should turn each example into a single text, e.g. by using a templating mechanism with a prompt (a la Alpaca).

Data Format: JSONL

The canonical format for training data in Levanter is (compressed) JSONL, or JSON Lines. Each line of the file is a JSON object, which is a dictionary of key-value pairs. The only required key is "text", which should map to a string of plain text. Other keys are ignored, but you can use them to store metadata about your data.
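
For example, a single line of such a file might look like this (the "source" field is illustrative metadata that Levanter will simply ignore):

{"text": "The quick brown fox jumps over the lazy dog.", "source": "example-web-crawl"}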

Once your data is in this format, you can create the data section of your training configuration:

    train_urls:
      - "gs://path/to/train_web_{1..32}.jsonl.gz"
      - "gs://path/to/train_web_crawl2.jsonl.gz"
    validation_urls:
      - "gs://path/to/valid_web.jsonl.gz"

Levanter uses fsspec to read data from files, so it can transparently handle compressed files and files in cloud storage (like Google Cloud Storage or AWS S3). Levanter uses braceexpand to expand the {1..32} syntax. You can also list more than one entry, as in the example above, if your URLs don't follow a naming scheme.
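
If you want to double-check what a pattern covers, the braceexpand package (the same one Levanter uses) can expand it for you; this small snippet is just for sanity-checking URL patterns:

from braceexpand import braceexpand

# "{1..3}" expands bash-style into one URL per shard
print(list(braceexpand("gs://path/to/train_web_{1..3}.jsonl.gz")))
# ['gs://path/to/train_web_1.jsonl.gz', 'gs://path/to/train_web_2.jsonl.gz', 'gs://path/to/train_web_3.jsonl.gz']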

!!! tip

    Levanter's preprocessing pipeline works best if you split your data into at least 1 shard for every machine
    (i.e. every 8 TPUs or GPUs). This isn't a big deal, but it helps.

Data Format: Huggingface Datasets

If you have a Huggingface Dataset, such as The Pile, you can use it directly in Levanter. It must have a "text" column, and it must be split into train and validation sets. To use it, you can specify the dataset name in the data section of your training configuration:

data:
    id: "EleutherAI/pile"
    # if needed:
    # name: "subset"

This will be passed to datasets.load_dataset. If the dataset supports streaming, you can use stream: true to stream the data instead of loading it all into memory. If a streaming dataset is sharded, we will attempt to exploit the sharded structure to preprocess more efficiently.
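
For example, a streaming version of the config above might look like this (only the stream field is new here; the other fields mirror the examples elsewhere in this guide):

data:
    id: "EleutherAI/pile"
    stream: true
    cache_dir: "gs://path/to/cache"
    tokenizer: "gpt2"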

Single Data Source

If you have a single source of data, you can use the data section of your training configuration to specify it:

data:
    train_urls:
      - "gs://path/to/train.{1..32}.jsonl.gz"
    validation_urls:
      - "gs://path/to/valid.{1..4}.jsonl.gz"
    cache_dir: "gs://path/to/cache"
    tokenizer: "gpt2"  # any HF tokenizer path, or GCS path to an HF tokenizer

Mixture of Sources

If you have multiple sources of data (e.g., multiple domains, or distinct subsets of data), you can use the data section of your training configuration to specify them:

data:
  configs:
    wikitext:
      id: dlwh/wikitext_103_detokenized
    web:
      train_urls:
        - "gs://path/to/train_web_{1..32}.jsonl.gz"
      validation_urls:
        - "gs://path/to/valid_web.jsonl.gz"
  train_weights:
    wikitext: 0.1
    web: 0.9
  cache_dir: "gs://path/to/cache"
  tokenizer: "gpt2"  # any HF tokenizer path, or GCS path to an HF tokenizer

train_weights is a dictionary mapping source names to weights. The weights need not sum to 1 (they are normalized to sum to 1), but they should be non-negative. You can include a weight of 0.0 to exclude a source from training, in which case it will only be used for evaluation (if present).

Evaluation losses are broken down by source, so you can see how each source is performing. Not every source needs to have validation data.

!!! tip

    If you only have one training source, but you want to use multiple evaluation sources, you can use the
    mixture of sources mechanism with a single source. Just set the weight of the training source to 1.0
    and the weights of the evaluation sources to 0.0.

Data Preprocessing

Levanter supports both online and offline preprocessing. Online preprocessing is done on-the-fly during training. With online preprocessing, you don't need to think about preprocessing your data, except to make sure it's in the right format and to decide where you'd like to store the cached preprocessing results.

Our data loading pipeline will automatically concatenate documents and break them into chunks of exactly the model's seq_len parameter. It will also automatically append special tokens to the end of each document.
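
As a rough illustration of what this packing does (a minimal sketch of the idea, not Levanter's actual implementation; eos_token_id and seq_len stand in for the tokenizer's end-of-text token and the model's sequence length):

def pack_documents(token_lists, eos_token_id, seq_len):
    """Concatenate tokenized documents (with EOS appended) and yield fixed-length chunks."""
    buffer = []
    for tokens in token_lists:
        buffer.extend(tokens)
        buffer.append(eos_token_id)  # special token marking the end of each document
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]   # one training example of exactly seq_len tokens
            buffer = buffer[seq_len:]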

Online Preprocessing

We have a sophisticated caching mechanism using Ray that builds a cache of preprocessed data on the fly. Online caching happens transparently in the background, using the mostly-idle CPU-cores of the machine(s) you are training on.

The cache that is built is fully reproducible, and can be used for future training runs. Training will start as soon as the system has the data it needs.

Offline Preprocessing

If you want, you can also preprocess your data offline, and then upload the preprocessed data to cloud storage.

Levanter has a script that basically runs the same online preprocessing code, but doesn't do any training. You can run it like this:

python -m levanter.main.cache_dataset --config_path my_config.yaml

You can actually connect this to a Ray cluster, and use the cluster to do the preprocessing. This lets you use any of Ray's autoscaling features to scale up the preprocessing job.

To do so:

python -m levanter.main.cache_dataset \
    --config_path my_config.yaml \
    --address <ray-cluster-address> \
    --start_workers false \
    --auto_start_cluster false

Direct Cache Construction

As a final option, you can directly construct a cache of preprocessed data without using Ray. This is useful if you have custom preprocessing logic or Ray isn't working for you for some reason. To do so, you can use [levanter.store.SerialCacheWriter][] to write batches directly. Here's an example:

import numpy as np

from levanter.store import SerialCacheWriter

exemplar = {
    "input_ids": np.zeros((0), dtype=np.int32),
    "attention_mask": np.zeros((0), dtype=np.int32),
    "labels": np.zeros((0), dtype=np.int32),
}

with SerialCacheWriter(cache_dir, exemplar) as writer:
    for batch in process_batches():
        # batch should be a list of dicts, each with keys "input_ids", "attention_mask", and "labels"
        writer.write_batch(batch)

In this case, batch should be a list of dicts, each with keys "input_ids", "attention_mask", and "labels". To work with train_lm, it should have an input_ids key that is a list of ints.
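
For concreteness, here is one hedged sketch of what a process_batches generator could look like; the tokenizer, file handling, and batch size are all illustrative, and only the structure of the yielded dicts matters:

import json

import numpy as np
from transformers import AutoTokenizer

def process_batches(jsonl_paths, batch_size=1024):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    batch = []
    for path in jsonl_paths:
        with open(path) as f:
            for line in f:
                enc = tokenizer(json.loads(line)["text"])
                ids = np.asarray(enc["input_ids"], dtype=np.int32)
                batch.append({
                    "input_ids": ids,
                    "attention_mask": np.asarray(enc["attention_mask"], dtype=np.int32),
                    "labels": ids,  # illustrative; train_lm only needs input_ids
                })
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch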

To use a cache like this, you can use the passthrough tokenizer:

data:
  cache_dir: "gs://path/to/cache"
  tokenizer: "passthrough"
  vocab_size: 5567

(Basically, you just need to tell Levanter what the vocab size is.)

Configuration

Levanter uses Draccus to configure training runs. It's a YAML-to-dataclass library that also supports argument parsing via argparse. A detailed guide to configuring Levanter is available in the Configuration Guide.

This section will cover the basics of configuring a training run.

TL;DR

Here's a configuration for a 1.4B parameter model with reasonable values for everything:

data:
    train_urls:
      - "gs://path/to/train.{1..32}.jsonl.gz" # TODO
    validation_urls:
      - "gs://path/to/valid.{1..4}.jsonl.gz" # TODO
    cache_dir: "gs://path/to/cache"  # TODO
    tokenizer: "gpt2"  # any HF tokenizer path, or GCS path to an HF tokenizer
model:
  type: gpt2
  hidden_dim: 1536
  num_heads: 24
  num_layers: 48
  seq_len: 1024
  gradient_checkpointing: true
  scale_attn_by_inverse_layer_idx: true
trainer:
  tracker:
    type: wandb
    project: "levanter" # TODO
    tags: ["gpt2"]

  mp: p=f32,c=bfloat16
  num_train_steps: 100000  # TODO
  train_batch_size: 512  # you may need to tune this or per_device_parallelism
  per_device_parallelism: -1
  per_device_eval_parallelism: 8

  max_eval_batches: null # set to a number to limit eval batches. useful if your eval set is enormous

  checkpointer:
    base_path: "gs://path/to/checkpoints"  # TODO
    save_interval: 15m
    keep:
      - every: 10000
optimizer:
  learning_rate: 1E-4
  weight_decay: 0.1
  min_lr_ratio: 0.1
# if you want:
hf_save_steps: 10000
hf_save_path: "gs://path/to/hf/checkpoints"  # TODO
hf_upload: null # set to an hf repo if you want to upload automatically. You need to have logged in to hf-cli

If you want a different model size or architecture, you can look at the config files in levanter/config.

Continued Pretraining

Levanter supports starting from an HF pretrained model. To do so, you should set your config like this:

model:
  type: mpt
initialize_from_hf: "mosaicml/mpt-7b" # can also reference a version, e.g. "mosaicml/mpt-7b@deadbeef"
use_hf_model_config: true

You should probably reduce the learning rate by a factor of 10 or so. TODO: figure out best practices here.
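
For example, relative to the TL;DR config above, a hedged starting point for the optimizer section might look like this (the value is a rule of thumb, not a tuned recommendation):

optimizer:
  learning_rate: 1E-5   # roughly 10x lower than the 1E-4 used when training from scratch
  weight_decay: 0.1
  min_lr_ratio: 0.1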

Llama 2

For Llama 2 specifically (or other gated models), you'll need a few extra steps:

If you haven't already, go to Llama 2's Hugging Face page and request access to the model.

Once you have access, go to Hugging Face's Tokens page to get an API token. Then, pass in the token as an environment variable:

HUGGING_FACE_HUB_TOKEN=hf...

Pass that in anywhere you're passing in a WANDB_API_KEY.
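
For example, mirroring the babysitting command shown later in this guide, the invocation might look like this (the paths and token are placeholders):

infra/babysit-tpu-vm my-tpu -z us-east1-d -t v3-128 -- \
    WANDB_API_KEY=... HUGGING_FACE_HUB_TOKEN=hf... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path gs://path/to/config.yaml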

Then, you can use the model like so:

model:
  type: llama
initialize_from_hf: "meta-llama/Llama-2-7b-hf"
use_hf_model_config: true

Checkpointing

See also the Checkpointing section of the Configuration Guide.

Levanter supports checkpointing to both local and Google Cloud Storage, backed by TensorStore. If you're using multiple machines, you should probably use cloud storage or NFS.

Levanter saves two kinds of checkpoints:

- time-based checkpoints: temporary checkpoints that are saved every save_interval minutes. The previous time-based checkpoint is deleted when a new one is saved.
- step-based checkpoints: permanent checkpoints that are saved according to a policy. These checkpoints are never deleted.

At a minimum, you should set a base_path for your checkpoints. This can be a local path or a GCS path.

Determining Batch Size

We don't have a mechanism for tuning batch size automatically, so you need to do this by hand. Memory usage is controlled by per_device_parallelism, train_batch_size, gradient checkpointing, model size, number of accelerators, and model_axis_size. (For most models, model_axis_size should be 1, so you can ignore it.)

per_device_parallelism is analogous to per_device_batch_size in other frameworks. It controls how many examples are processed at once on a single accelerator, but the name is a bit more "correct" in the presence of tensor or pipeline parallelism. A per_device_parallelism of -1 means "use as many as possible to not perform gradient accumulation."

Gradient accumulation is performed whenever num_accelerators * per_device_parallelism / model_axis_size < train_batch_size. Gradient checkpointing is enabled by default and highly recommended.

So, to find your batch size, you should modify either per_device_parallelism or train_batch_size until your job runs. Note that, due to FSDP, as you add more TPUs, you can increase the effective parallelism because you will use less memory per accelerator to store parameters and optimizer states.
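
As a worked example (using the TL;DR config above and assuming model_axis_size is 1): on 128 accelerators (e.g. a v3-128 slice), per_device_parallelism: 4 gives 128 * 4 = 512 examples per step, which matches train_batch_size: 512, so no gradient accumulation happens. If memory forces you down to per_device_parallelism: 2, you get 256 < 512, and Levanter performs 2 gradient accumulation steps per training step.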

Number of Training Steps

Levanter does not support specifying training length in epochs or in tokens/examples, so if you want to train for a certain number of epochs or tokens, you'll need to compute the number of steps yourself. You can use the following formula:

num_train_steps = num_epochs * num_tokens_per_epoch / train_batch_size / seq_len
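
For example, for a single pass over a hypothetical dataset of 10 billion tokens with train_batch_size: 512 and seq_len: 1024, each step consumes 512 * 1024 = 524,288 tokens, so num_train_steps = 10e9 / 524288 ≈ 19,073.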

Note however that epoch boundaries aren't really respected: our implementation of sharded data loading restarts from the beginning as soon as any machine finishes its shards.

Launching Training

TPU

First, we assume you've gone through the setup steps in the TPU guide, at least through setting up your gcloud account. We also strongly recommend setting up ssh keys and ssh-agent.

Upload Config To GCS

Once you have your config built, you should upload it to GCS. You could also scp it to all workers, but this is easier and works with the TPU babysitting script.

gsutil cp my_config.yaml gs://path/to/config.yaml

Using the Babysitting Script with a Preemptible or TRC TPU VM

If you are using a preemptible TPU VM, or a TRC TPU VM, you should use the babysitting script to automatically restart your VM if it gets preempted. A detailed guide to babysitting is available in the babysitting section of the TPU guide. Here is the upshot:

infra/babysit-tpu-vm my-tpu -z us-east1-d -t v3-128 -- \
    WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path gs://path/to/config.yaml

Spin up and manual launch

You should probably use the automated setup script, as described in the relevant section of the TPU guide. Here's what that looks like:

bash infra/spin-up-tpu-vm.sh my-tpu -z us-east1-d -t v3-128

This will spin up a TPU VM instance and install Levanter on it. You can then run a command like so:

cat > .config <<EOF
env:
    WANDB_API_KEY:
    WANDB_ENTITY:
    WANDB_PROJECT:
    HF_TOKEN:
    TPU_STDERR_LOG_LEVEL: 0
    TPU_MIN_LOG_LEVEL: 0
    LIBTPU_INIT_ARGS: <extra args to libtpu>

docker_repository: levanter
zone: us-west4-a
tpu_type: "v5litepod-16"
vm_image: "tpu-ubuntu2204-base"
preemptible: true
autodelete: false
subnetwork: "default"

EOF
python infra/launch.py --tpu_name=my_tpu -- python src/levanter/main/train_lm.py --config_path gs://path/to/config.yaml

Monitoring

Levanter integrates with WandB for logging. You can view your run on the WandB website. Levanter will also log to the console, and write logs to logs/$RUN_ID.log on each machine. Logs can be pretty verbose.

We recommend monitoring train/loss and eval/loss in log/log scale. You should be seeing roughly a linear decrease in loss followed by a gradual flattening. You can also monitor throughput.

Evaluation

Levanter will run evaluation every trainer.steps_per_eval steps.

You can also run evaluation manually by running the levanter/main/eval_lm.py script:

python -m levanter.main.eval_lm --config_path gs://path/to/config.yaml --checkpoint_path gs://path/to/checkpoint

You can also use this script to evaluate on other datasets by modifying the config.

Huggingface Export

Exporting during Training

You can export to HF during training using the hf_save_steps and hf_save_path options in your config. You can also set hf_upload to an HF repo to automatically upload your model to HF. See the config above for an example.

Typically, you will have saved checkpoints in a directory like gs://path/to/checkpoints/hf/my_run/step_10000/. Hugging Face Transformers can't read checkpoints directly from GCS, so you'll need to copy the files to a local directory:

gsutil -m cp gs://path/to/checkpoints/hf/my_run/step_10000/* /tmp/my_exported_model

Then you can use the model as you would expect:

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("/tmp/my_exported_model")
tokenizer = AutoTokenizer.from_pretrained("/tmp/my_exported_model")

Exporting after Training

After training, you can run a separate script to export levanter checkpoints to Huggingface:

python -m levanter.main.export_lm_to_hf --config_path my_config.yaml --output_dir gs://path/to/output