
Commit

Reorganize documentation with existing content
melissawm committed Oct 29, 2024
1 parent 2df72e7 commit 4d0ac8e
Showing 16 changed files with 153 additions and 33 deletions.
1 change: 0 additions & 1 deletion docs/advanced_usage.md
@@ -6,5 +6,4 @@ getting_started/Run_MaxText_via_multihost_runner.md
getting_started/Run_MaxText_via_xpk.md
getting_started/Use_Vertex_AI_Tensorboard.md
getting_started/Run_Llama2.md
data_loading.md
```
6 changes: 6 additions & 0 deletions docs/batch_size.md
@@ -0,0 +1,6 @@
# Per-device batch size

The `per_device_batch_size` parameter dictates how much training data is fed to
each chip per step. It can take a fractional value between 0 and 1. Tuning
`per_device_batch_size` can improve the MFU of your training run.
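
As an illustrative sketch (assuming the usual `python3 MaxText/train.py MaxText/configs/base.yml key=value ...` invocation pattern; the run name and bucket below are placeholders), a fractional batch size can be passed as a CLI override:

```
# Sketch only: override the per-device batch size for a short run.
# <your-bucket> and the run name are placeholders.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=batch-size-demo \
  base_output_directory=gs://<your-bucket>/maxtext-runs \
  per_device_batch_size=0.5 \
  steps=10
```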
13 changes: 13 additions & 0 deletions docs/checkpointing.md
@@ -0,0 +1,13 @@
# Checkpointing

MaxText can run training with the following checkpointing options:

- enabled/disabled
- asynchronous - true/false
- checkpointing frequency

They are controlled by the following parameters (an example invocation follows the list):

- `enable_checkpointing` (`True`/`False`)
- `checkpoint_period` (integer value)
- `async_checkpointing` (`True`/`False`)
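
As a hedged sketch (same assumed `python3 MaxText/train.py MaxText/configs/base.yml key=value ...` invocation; the run name and bucket are placeholders), checkpointing every 50 steps with asynchronous writes disabled could be requested like this:

```
# Sketch only: save a checkpoint every 50 steps, using synchronous writes.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=checkpointing-demo \
  base_output_directory=gs://<your-bucket>/maxtext-runs \
  enable_checkpointing=true \
  checkpoint_period=50 \
  async_checkpointing=false
```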
43 changes: 43 additions & 0 deletions docs/code_organization.md
@@ -0,0 +1,43 @@
# Codebase Walkthrough

MaxText is written purely in JAX and Python. Below is a high-level overview of
the code organization, highlighting key folders and files.

File/Folder | Description
---------|---------------------------------
`configs` | Contains all the config files, including model configs (Llama2, Mistral, etc.) and pre-optimized configs for different model sizes on different TPUs
`input_pipelines` | Code for the training data input pipelines
`layers` | Model layer implementations
`end_to_end` | Example scripts for running MaxText
`MaxText/train.py` | The main training script you will run directly
`MaxText/configs/base.yml` | The base configuration file covering checkpointing, model architecture, sharding schema, data input, learning rate, profiling, compilation, and decoding
`MaxText/decode.py` | Script to run offline inference with a sample prompt
`setup.sh` | Bash script that installs all needed library dependencies

## Training configuration

The [MaxText/configs/base.yml](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/base.yml)
file has a set of default configurations. These can be overridden directly via the
CLI when invoking the MaxText train scripts, with command-line parameters taking
precedence over the default values. A few of the key parameters are described
below, followed by an example invocation:

- `load_parameters_path`: Path to a MaxText checkpoint from which to load parameters.
- `base_output_directory`: Base path to save the outputs (logs and data).
- [`dataset_type`](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/base.yml#L273):
synthetic, tfds, grain or hf (hugging face)
- `dataset_path`: for `dataset_type=tfds`, path to the dataset.
- `tokenizer_path`: Path to a tokenizer for the model. The tokenizers are
present in ...
- `quantization`: Whether to use quantized training with AQT. Valid values are ['int8']
- `per_device_batch_size`: The batch size each TPU/device receives. Increasing
  this value can improve MFU, and it can also be a fraction. For this tutorial,
  we will use the default value of 1.
- `enable_checkpointing`: Boolean value; whether to save checkpoints.
- `checkpoint_period`: Number of steps between checkpoints.
- `async_checkpointing`: Boolean value; whether to use asynchronous
  checkpointing. Here, we set it to False.
- `attention`: On TPU v3 and earlier, set this to `dot_product`. Newer TPU
  generations support the flash attention value. On GPU, use `cudnn_flash_te`.
- `steps`: Number of steps to train. For this tutorial, we will use a small
value of 10 steps.
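
Putting several of these together, a minimal sketch of a short tutorial-style run (the run name and bucket are placeholders, and synthetic data is used so no dataset download is needed) might look like:

```
# Sketch only: a 10-step run overriding a few base.yml defaults from the CLI.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=walkthrough-demo \
  base_output_directory=gs://<your-bucket>/maxtext-runs \
  dataset_type=synthetic \
  per_device_batch_size=1 \
  enable_checkpointing=false \
  attention=dot_product \
  steps=10
```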
13 changes: 7 additions & 6 deletions docs/data_loading.md
@@ -1,10 +1,11 @@
# Data Loading
# How to load the data

Maxtext supports input data pipelines in the following ways:
Tf.data*
Grain
Hugging Face Datasets

*Tf.data is the most performant way of loading large scale datasets.
- Tf.data[^1]
- Grain
- Hugging Face Datasets

You can read more about the pipelines in [](getting_started/Data_Input_Pipeline.md).
[^1]: Tf.data is the most performant way of loading large scale datasets.

You can read more about the pipelines in [](getting_started/Data_Input_Pipeline.md).
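
As a hedged illustration of selecting a pipeline (the dataset path and run name are placeholders; the grain and Hugging Face pipelines take additional parameters described in the data input pipeline guide), the pipeline is chosen with the `dataset_type` parameter:

```
# Sketch only: select the tf.data (TFDS) pipeline; dataset_path is a placeholder.
# Setting dataset_type to grain or hf selects the other pipelines instead.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=data-loading-demo \
  dataset_type=tfds \
  dataset_path=gs://<your-bucket>/tfds
```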
5 changes: 5 additions & 0 deletions docs/full_finetuning.md
@@ -0,0 +1,5 @@
# Full Finetuning Llama2/Llama3 Optimized Configuration

## Parameters to achieve high MFU

This page is in progress.
3 changes: 3 additions & 0 deletions docs/gce_gke_xpk.md
@@ -0,0 +1,3 @@
# Getting started with GCE/GKE+XPK

This page is in progress.
2 changes: 2 additions & 0 deletions docs/getting_started/index.md
@@ -17,4 +17,6 @@ In addition to the getting started guides, there are always other MaxText capabi
First_run.md
steps_model.md
End-to-end example <https://www.kaggle.com/code/melissawm/maxtext-examples>
Data_Input_Pipeline.md
Data_Input_Perf.md
```
13 changes: 11 additions & 2 deletions docs/index.md
@@ -49,7 +49,7 @@ The key value proposition of using MaxText for pre-training or full fine tuning
Maxtext today only supports Pre-training and Full Fine Tuning of the models. It does not support PEFT/LoRA, Supervised Fine Tuning or RLHF.
```

## Who are the target users of Maxtext?
## Who are the target users of MaxText?

- Any individual or company interested in forking MaxText as a reference implementation of high-performance large language models and building their own LLMs on TPU and GPU.
- Any individual or company interested in pre-training or full fine-tuning of the supported open-source models can use MaxText as a black box to perform full fine tuning. MaxText attains an extremely high MFU, resulting in large savings in training costs.
@@ -203,6 +203,15 @@ MaxText supports automatic upload of logs collected in a directory to a Tensorbo
:hidden:
getting_started/index.md
code_organization.md
data_loading.md
sharding.md
remat_policy.md
batch_size.md
checkpointing.md
profiling.md
full_finetuning.md
inference.md
gce_gke_xpk.md
advanced_usage.md
reference/index.md
```
3 changes: 3 additions & 0 deletions docs/inference.md
@@ -0,0 +1,3 @@
# Inference (JetStream)

This page is in progress.
3 changes: 3 additions & 0 deletions docs/profiling.md
@@ -0,0 +1,3 @@
# Profiling and Pre-training: Xplane and Tensorboard

This page is in progress.
15 changes: 0 additions & 15 deletions docs/reference/code_organization.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/reference/config_options.md

This file was deleted.

8 changes: 0 additions & 8 deletions docs/reference/index.md

This file was deleted.

26 changes: 26 additions & 0 deletions docs/remat_policy.md
@@ -0,0 +1,26 @@
# Remat Policy and Host Offloading

For large-scale model training, accelerator memory is a limited resource, so we
often trade compute cycles for memory, for example through activation
re-materialization. Host offload is another technique, recently introduced in
the XLA compiler, that leverages host DRAM: activations computed during the
forward pass are offloaded to the host and reused during the backward pass for
gradient computation, saving activation recomputation cycles.

MaxText provides a parameter called `remat_policy`. It controls whether
activations are kept in HBM, offloaded to host memory, or recomputed during the
backward pass.

Activations in the forward pass are also needed in the backward pass. There are
three options for where in memory these activations are accessible for the
backward pass:

1. In HBM (MaxText remat policy "minimal")
2. On host (MaxText remat policy "minimal_offloaded")
3. Activations are re-computed during the backward pass (MaxText remat policy "full")

We can choose different remat policies for different activations (e.g. the FF
activations versus the QKV projection activations), which lets us optimize the
memory-versus-compute trade-off: generally we want to use all of our HBM. Both
host offloading (option 2) and re-computing (a.k.a. remat, option 3) use as
little HBM as possible; which one is faster depends on model size, device
compute speed, and host-to-device transfer speed.
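
As an illustrative sketch (same assumed `python3 MaxText/train.py MaxText/configs/base.yml key=value ...` invocation; the run name is a placeholder), the three options above map directly onto `remat_policy` values:

```
# Sketch only: keep forward-pass activations in HBM.
# Swap "minimal" for "minimal_offloaded" (host offload) or "full"
# (recompute in the backward pass) to trade HBM for transfers or compute.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=remat-demo remat_policy=minimal
```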
31 changes: 31 additions & 0 deletions docs/sharding.md
@@ -0,0 +1,31 @@
# Sharding

MaxText supports the following sharding mechanisms:

- Distributed Data Parallelism
- Tensor Parallelism
- Fully Sharded Data Parallelism (FSDP)
- Sequence Parallelism

They are controlled by the following parameters, shown here with their default values from base.yml. Use these parameters to configure sharding within a single TPU slice or a GPU slice.

```
ici_data_parallelism: 1
ici_fsdp_parallelism: -1 # recommended ICI axis to be auto-sharded
ici_fsdp_transpose_parallelism: 1
ici_sequence_parallelism: 1
ici_tensor_parallelism: 1
```

The following sharding values dictate how training is distributed across multiple TPU pods.

```
dcn_data_parallelism: -1 # recommended DCN axis to be auto-sharded
dcn_fsdp_parallelism: 1
dcn_fsdp_transpose_parallelism: 1
dcn_sequence_parallelism: 1 # never recommended
dcn_tensor_parallelism: 1 # never recommended
dcn_pipeline_parallelism: 1
dcn_expert_parallelism: 1
dcn_autoregressive_parallelism: 1 # never recommended
```
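
For instance, as a sketch under the same assumed invocation pattern (run name is a placeholder), a multi-pod run that uses FSDP within each slice and data parallelism across slices could override the defaults like this:

```
# Sketch only: FSDP auto-sharded over the ICI axis within each slice,
# data parallelism auto-sharded over the DCN axis across slices.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=multipod-demo \
  ici_fsdp_parallelism=-1 \
  ici_tensor_parallelism=1 \
  dcn_data_parallelism=-1 \
  dcn_fsdp_parallelism=1
```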
