Add basic documentation pages #988

Open · wants to merge 22 commits into base: main
24 changes: 24 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,24 @@
# Read the Docs configuration file for Sphinx projects
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.12"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/conf.py
  # Fail on all warnings to avoid broken references
  fail_on_warning: true

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
  install:
    - requirements: docs/requirements.txt
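
For reference, a minimal sketch of building the documentation locally under the same settings this file gives Read the Docs; it assumes `docs/requirements.txt` includes Sphinx, and the `-W` flag mirrors `fail_on_warning: true`:

```
# Install the documentation requirements declared above (assumed to include Sphinx).
pip install -r docs/requirements.txt

# Build the HTML docs using docs/conf.py; -W turns warnings into errors,
# mirroring fail_on_warning: true in the Read the Docs configuration.
sphinx-build -W -b html docs docs/_build/html
```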
33 changes: 17 additions & 16 deletions README.md
@@ -14,10 +14,11 @@
limitations under the License.
-->

# MaxText

[![Unit Tests](https://github.com/google/maxtext/actions/workflows/UnitTests.yml/badge.svg)](https://github.com/google/maxtext/actions/workflows/UnitTests.yml)

# Overview
## Overview

MaxText is a **high performance**, **highly scalable**, **open-source** LLM written in pure Python/Jax and targeting Google Cloud TPUs and GPUs for **training** and **inference**. MaxText achieves [high MFUs](#runtime-performance-results) and scales from single host to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler.

@@ -30,15 +31,15 @@ Key supported features:
* Training and Inference (in preview)
* Models: Llama2, Mistral and Gemma

# Table of Contents
## Table of Contents

* [Getting Started](getting_started/First_run.md)
* [Runtime Performance Results](#runtime-performance-results)
* [Comparison To Alternatives](#comparison-to-alternatives)
* [Development](#development)
* [Features and Diagnostics](#features-and-diagnostics)

# Getting Started
## Getting Started

For your first time running MaxText, we provide specific [instructions](getting_started/First_run.md).

@@ -51,11 +52,11 @@ Some extra helpful guides:

In addition to the getting started guides, other MaxText capabilities are constantly being added! The full suite of end-to-end tests is in [end_to_end](end_to_end). We run them on a nightly cadence, and they can be a good source for understanding MaxText. Alternatively, you can see the continuous [unit tests](.github/workflows/UnitTests.yml), which run almost continuously.

# Runtime Performance Results
## Runtime Performance Results

More details on reproducing these results can be found in [MaxText/configs/README.md](MaxText/configs/README.md).

## TPU v5p
### TPU v5p

| No. of params | Accelerator Type | TFLOP/chip/sec | Model flops utilization (MFU) |
|---|---|---|---|
@@ -70,7 +71,7 @@ More details on reproducing these results can be found in [MaxText/configs/READM
| 1160B | v5p-7680 | 2.95e+02 | 64.27% |
| 1160B | v5p-12288 | 3.04e+02 | 66.23% |

## TPU v5e
### TPU v5e

Results for the 16B, 32B, 64B, and 128B models. See the full run configs in [MaxText/configs/v5e/](MaxText/configs/v5e/) as `16b.sh`, `32b.sh`, `64b.sh`, `128b.sh`.

@@ -83,16 +84,16 @@ For 16B, 32B, 64B, and 128B models. See full run configs in [MaxText/configs/v5e
| 16x v5e-256 | 111 | 56.56% | 123 | 62.26% | 105 | 53.29% | 100 | 50.86% |
| 32x v5e-256 | 108 | 54.65% | 119 | 60.40% | 99 | 50.18% | 91 | 46.25% |

# Comparison to Alternatives
## Comparison to Alternatives

MaxText is heavily inspired by [MinGPT](https://github.com/karpathy/minGPT)/[NanoGPT](https://github.com/karpathy/nanoGPT), elegant standalone GPT implementations written in PyTorch and targeting Nvidia GPUs. MaxText is more complex, supporting more industry standard models and scaling to tens of thousands of chips. Ultimately MaxText has an MFU more than three times the [17%](https://twitter.com/karpathy/status/1613250489097027584?cxt=HHwWgIDUhbixteMsAAAA) reported most recently with that codebase, is massively scalable and implements a key-value cache for efficient auto-regressive decoding.

MaxText is more similar to [Nvidia/Megatron-LM](https://github.com/NVIDIA/Megatron-LM), a very well tuned LLM implementation targeting Nvidia GPUs. The two implementations achieve comparable MFUs. The difference in the codebases highlights the different programming strategies. MaxText is pure Python, relying heavily on the XLA compiler to achieve high performance. By contrast, Megatron-LM is a mix of Python and CUDA, relying on well-optimized CUDA kernels to achieve high performance.

MaxText is also comparable to [Pax](https://github.com/google/paxml). Like Pax, MaxText provides high-performance and scalable implementations of LLMs in Jax. Pax focuses on enabling powerful configuration parameters, enabling developers to change the model by editing config parameters. By contrast, MaxText is a simple, concrete implementation of various LLMs that encourages users to extend by forking and directly editing the source code.

# Features and Diagnostics
## Collect Stack Traces
## Features and Diagnostics
### Collect Stack Traces
When running a Single Program, Multiple Data (SPMD) job on accelerators, the overall process can hang if any error occurs or any VM hangs or crashes. In this scenario, capturing stack traces helps identify and troubleshoot the issues for the jobs running on TPU VMs.

The following configurations help debug faults and stuck or hung programs by collecting stack traces. Change the parameter values accordingly in `MaxText/configs/base.yml`:
@@ -106,10 +107,10 @@

Here is the related PyPI package: https://pypi.org/project/cloud-tpu-diagnostics.
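
As a hypothetical illustration, the stack-trace settings can also be overridden at launch time in the usual MaxText style instead of editing `base.yml`; the parameter names shown here are assumptions, so check `MaxText/configs/base.yml` for the exact names and defaults:

```
# Hypothetical sketch: enable stack trace collection via command-line overrides.
# collect_stack_trace, stack_trace_to_cloud and stack_trace_interval_seconds are
# assumed parameter names -- verify them against MaxText/configs/base.yml.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=stack-trace-demo \
  collect_stack_trace=true \
  stack_trace_to_cloud=true \
  stack_trace_interval_seconds=600
```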

## Ahead of Time Compilation (AOT)
### Ahead of Time Compilation (AOT)
To compile your training run ahead of time, we provide a tool `train_compile.py`. This tool allows you to compile the main `train_step` in `train.py` for target hardware (e.g. a large number of v5e devices) without using the full cluster.

### TPU Support
#### TPU Support

You may use only a CPU or a single VM from a different family to pre-compile for a TPU cluster. This compilation helps with two main goals:

@@ -119,7 +120,7 @@ You may use only a CPU or a single VM from a different family to pre-compile for

The tool `train_compile.py` is tightly linked to `train.py` and uses the same configuration file `configs/base.yml`. Although you don't need to run on a TPU, you do need to install `jax[tpu]` in addition to other dependencies, so we recommend running `setup.sh` to install these if you have not already done so.
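
For example, one way to get the dependencies in place (the pip fallback and its find-links URL follow the standard JAX install instructions and are included here as an assumption):

```
# Install MaxText dependencies, including jax[tpu], before running AOT compilation.
bash setup.sh

# Alternatively, on a CPU-only machine the TPU-enabled jax wheel can be installed
# directly; this follows the standard JAX install instructions (assumption).
pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
```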

#### Example AOT 1: Compile ahead of time basics
##### Example AOT 1: Compile ahead of time basics
After installing the dependencies listed above, you are ready to compile ahead of time:
```
# Run the below on a single machine, e.g. a CPU
@@ -129,7 +130,7 @@ global_parameter_scale=16 per_device_batch_size=4

This will compile a 16B parameter MaxText model on 2 v5e pods.
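
An illustrative sketch of such an invocation (the `compile_topology` value and the exact flag set are assumptions; the model-size flags match the context line above):

```
# Illustrative only: AOT-compile a 16B-parameter model for 2 v5e-256 pods.
# compile_topology=v5e-256 is an assumed topology name; consult the MaxText docs
# for the supported values.
python3 MaxText/train_compile.py MaxText/configs/base.yml \
  compile_topology=v5e-256 compile_topology_num_slices=2 \
  global_parameter_scale=16 per_device_batch_size=4
```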

#### Example AOT 2: Save compiled function, then load and run it
##### Example AOT 2: Save compiled function, then load and run it
Here is an example that saves then loads the compiled `train_step`, starting with the save:

**Step 1: Run AOT and save compiled function**
@@ -156,14 +157,14 @@ base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket

In the save step of example 2 above we included exporting the compiler flag `LIBTPU_INIT_ARGS` and `learning_rate` because those affect the compiled object `my_compiled_train.pickle`. The sizes of the model (e.g. `global_parameter_scale`, `max_sequence_length` and `per_device_batch_size`) are fixed when you initially compile via `train_compile.py`; you will see a size error if you try to run the saved compiled object with different sizes than you compiled with. A subtle note is that the **learning rate schedule**, which is determined by both `steps` and `learning_rate`, is also fixed at compile time. The optimizer parameters such as `adam_b1` are passed only as shaped objects to the compiler, so their real values are determined when you run `train.py`, not during compilation. If you do pass in different shapes (e.g. a different `per_device_batch_size`), you will get a clear error message reporting that the compiled signature has different expected shapes than what was input. If you attempt to run on different hardware than the compilation targets requested via `compile_topology`, you will get an error saying the devices from the compiled object cannot be mapped to your real devices. Using different XLA flags or a different LIBTPU than what was compiled will probably run silently in your real environment without raising an error; however, there is no guaranteed behavior in this case, so you should run in the same environment you compiled in.
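
Putting these constraints together, a hypothetical save/load sketch (`compiled_trainstep_file` is the flag name assumed for the saved pickle, and the `LIBTPU_INIT_ARGS` value is a placeholder):

```
# Hypothetical sketch. LIBTPU_INIT_ARGS, learning_rate, steps and the shape flags
# must match between the compile and run commands; optimizer values such as adam_b1
# may differ because they reach the compiler only as shaped objects.
export LIBTPU_INIT_ARGS="--some_libtpu_flag=true"   # placeholder value
python3 MaxText/train_compile.py MaxText/configs/base.yml \
  compile_topology=v5e-256 compile_topology_num_slices=2 \
  compiled_trainstep_file=my_compiled_train.pickle \
  global_parameter_scale=16 per_device_batch_size=4 learning_rate=1e-3 steps=10000

# Later, on the real v5e cluster:
export LIBTPU_INIT_ARGS="--some_libtpu_flag=true"   # same value as at compile time
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=my_run compiled_trainstep_file=my_compiled_train.pickle \
  global_parameter_scale=16 per_device_batch_size=4 learning_rate=1e-3 steps=10000 \
  adam_b1=0.9 \
  base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
```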

### GPU Support
#### GPU Support
Ahead-of-time compilation is also supported for GPUs with some differences from TPUs:

1. GPU does not support compilation across hardware: A GPU host is still required to run AoT compilation, but a single GPU host can compile a program for a larger cluster of the same hardware.

1. For [A3 Cloud GPUs](https://cloud.google.com/compute/docs/gpus#h100-gpus), the maximum "slice" size is a single host, and the `compile_topology_num_slices` parameter represents the number of A3 machines to precompile for.

#### Example
##### Example
This example illustrates the flags to use for a multihost GPU compilation targeting a cluster of 4 A3 hosts:

**Step 1: Run AOT and save compiled function**
@@ -191,5 +192,5 @@ base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
As in the TPU case, note that the compilation environment must match the execution environment, in this case by setting the same `XLA_FLAGS`.
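
A hypothetical sketch of that 4-host A3 flow; the `compile_topology` value for A3 machines and the `XLA_FLAGS` value are assumptions, and only the flags named in the surrounding text are taken from the source:

```
# Step 1 (hypothetical): compile on a single A3 host for a 4-host cluster.
export XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true"   # placeholder value
python3 MaxText/train_compile.py MaxText/configs/base.yml \
  compile_topology=a3 compile_topology_num_slices=4 \
  compiled_trainstep_file=my_compiled_train.pickle \
  global_parameter_scale=16 per_device_batch_size=4

# Step 2 (hypothetical): run on the real 4-host cluster with the same XLA_FLAGS,
# loading the saved train_step instead of recompiling it.
export XLA_FLAGS="--xla_gpu_enable_latency_hiding_scheduler=true"
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=my_gpu_run compiled_trainstep_file=my_compiled_train.pickle \
  global_parameter_scale=16 per_device_batch_size=4 \
  base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
```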


## Automatically Upload Logs to Vertex Tensorboard
### Automatically Upload Logs to Vertex Tensorboard
MaxText supports automatic upload of logs collected in a directory to a Tensorboard instance in Vertex AI. Follow the [user guide](getting_started/Use_Vertex_AI_Tensorboard.md) to learn more.
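
As a rough illustration (the flag names here are assumptions; the linked user guide is authoritative), enabling the upload amounts to pointing a run at a Vertex AI Tensorboard instance:

```
# Hypothetical flags for uploading logs to Vertex AI Tensorboard. The names
# use_vertex_tensorboard, vertex_tensorboard_project and vertex_tensorboard_region
# are assumptions -- see getting_started/Use_Vertex_AI_Tensorboard.md for the
# authoritative setup steps.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=my_run \
  use_vertex_tensorboard=true \
  vertex_tensorboard_project=my-gcp-project \
  vertex_tensorboard_region=us-central1
```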
Binary file added docs/_static/build_model.png
Binary file added docs/_static/quantization.png
@@ -1,21 +1,23 @@
## Performance of Data Input Pipeline
# Performance of Data Input Pipeline
* Overview of supported data input pipelines: https://github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md
* Performance data interpretation: for all three data pipelines, data prefetching runs in parallel with computation. The goal is to hide data loading behind computation: as long as the data loading step time is shorter than the training computation step time, the data pipeline performance is considered sufficient.

### Methods
## Methods
* The following results are measured by [standalone_dataloader.py](https://github.com/google/maxtext/blob/main/MaxText/standalone_dataloader.py), which performs data loading without computation (a hypothetical invocation is sketched after this list).
* c4 data in different formats, stored in a GCS bucket, is used. For the Grain pipeline only, the GCS bucket is mounted to a local path via GCSFUSE ([script](https://github.com/google/maxtext/blob/main/setup_gcsfuse.sh)).
* The GCS bucket is multi-region (US), and the VMs that read data can be in different US regions.
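
A hypothetical invocation of the measurement script, following the usual MaxText pattern of overriding `base.yml` values on the command line (the specific flag names below are assumptions):

```
# Hypothetical sketch: measure Grain data-loading throughput without training
# computation. dataset_type, grain_worker_count and the other flag names are
# assumptions -- check MaxText/configs/base.yml for the exact spellings.
python3 MaxText/standalone_dataloader.py MaxText/configs/base.yml \
  run_name=dataloader-perf \
  dataset_type=grain grain_worker_count=4 \
  per_device_batch_size=8 max_target_length=2048 steps=1000 \
  dataset_path=/tmp/gcsfuse \
  base_output_directory=gs://my-output-bucket
```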

### HuggingFace pipeline
## HuggingFace pipeline
The following data are collected using c4 data in Parquet format.

| Pipeline | seq_len | VM type | per_host_batch | # of host | # of batch | first step (s) | total time (s) |
| ----------- | ------- | ---------- | ----------------- | --------- | ---------- | ------------- | -------------- |
| HuggingFace | 2048 | TPU v4-8 | 32 (per_device=8) | 1 | 1000 | 6 | 72 |
| HuggingFace | 2048 | TPU v4-128 | 32 (per_device=8) | 16 | 1000 | 6 | 72 |

### Grain pipeline
## Grain pipeline
The following data are collected using c4 data in ArrayRecord format.

| Pipeline | seq_len | VM type | per_host_batch | # of host | # of batch | worker | first step (s) | total time (s) |
| ----------- | ------- | ---------- | ----------------- | --------- | ---------- | ----- | -------------- | --------------- |
| Grain | 2048 | TPU v4-8 | 32 (per_device=8) | 1 | 1000 | 1 | 7 | 1200 |
@@ -27,8 +29,9 @@ The following data are collected using c4 data in ArrayRecord format.
| Grain | 2048 | TPU v4-128 | 32 (per_device=8) | 16 | 1000 | 4 | 8 | 154 |
| Grain | 2048 | TPU v4-128 | 32 (per_device=8) | 16 | 1000 | 8 | 11 | 120 |

### TFDS pipeline
## TFDS pipeline
The following data are collected using c4 data in TFRecord format.

| Pipeline | seq_len | VM type | per_host_batch | # of host | # of batch | first step (s) | total time (s) |
| ----------- | ------- | ---------- | ----------------- | --------- | ---------- | ------------- | -------------- |
| TFDS | 2048 | TPU v4-8 | 32 (per_device=8) | 1 | 1000 | 2 | 17 |