From b345a969cd97b190e79bfcc0544d7699033f2d2c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Tue, 22 Oct 2024 11:28:24 -0300 Subject: [PATCH 01/22] Add basic documentation pages --- README.md | 33 +-- docs/advanced_usage.md | 10 + docs/conf.py | 42 ++++ docs/data_loading.md | 10 + .../getting_started}/Data_Input_Perf.md | 13 +- .../getting_started}/Data_Input_Pipeline.md | 34 +-- .../getting_started}/First_run.md | 15 +- .../getting_started}/Run_Llama2.md | 4 +- .../Run_MaxText_via_multihost_job.md | 2 +- .../Run_MaxText_via_multihost_runner.md | 2 +- .../getting_started}/Run_MaxText_via_xpk.md | 0 .../Use_Vertex_AI_Tensorboard.md | 8 +- docs/getting_started/build_model.png | Bin 0 -> 24524 bytes docs/getting_started/index.md | 20 ++ docs/getting_started/steps_model.md | 16 ++ docs/index.md | 208 ++++++++++++++++++ docs/reference/code_organization.md | 15 ++ docs/reference/config_options.md | 1 + docs/reference/index.md | 8 + docs/requirements.txt | 5 + requirements_docs.txt | 2 + 21 files changed, 395 insertions(+), 53 deletions(-) create mode 100644 docs/advanced_usage.md create mode 100644 docs/conf.py create mode 100644 docs/data_loading.md rename {getting_started => docs/getting_started}/Data_Input_Perf.md (96%) rename {getting_started => docs/getting_started}/Data_Input_Pipeline.md (94%) rename {getting_started => docs/getting_started}/First_run.md (93%) rename {getting_started => docs/getting_started}/Run_Llama2.md (94%) rename {getting_started => docs/getting_started}/Run_MaxText_via_multihost_job.md (97%) rename {getting_started => docs/getting_started}/Run_MaxText_via_multihost_runner.md (98%) rename {getting_started => docs/getting_started}/Run_MaxText_via_xpk.md (100%) rename {getting_started => docs/getting_started}/Use_Vertex_AI_Tensorboard.md (96%) create mode 100644 docs/getting_started/build_model.png create mode 100644 docs/getting_started/index.md create mode 100644 docs/getting_started/steps_model.md create mode 100644 docs/index.md create mode 100644 docs/reference/code_organization.md create mode 100644 docs/reference/config_options.md create mode 100644 docs/reference/index.md create mode 100644 docs/requirements.txt create mode 100644 requirements_docs.txt diff --git a/README.md b/README.md index e6edd0ffa..0f24bdcef 100644 --- a/README.md +++ b/README.md @@ -14,10 +14,11 @@ limitations under the License. --> +# MaxText [![Unit Tests](https://github.com/google/maxtext/actions/workflows/UnitTests.yml/badge.svg)](https://github.com/google/maxtext/actions/workflows/UnitTests.yml) -# Overview +## Overview MaxText is a **high performance**, **highly scalable**, **open-source** LLM written in pure Python/Jax and targeting Google Cloud TPUs and GPUs for **training** and **inference**. MaxText achieves [high MFUs](#runtime-performance-results) and scales from single host to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler. @@ -30,7 +31,7 @@ Key supported features: * Training and Inference (in preview) * Models: Llama2, Mistral and Gemma -# Table of Contents +## Table of Contents * [Getting Started](getting_started/First_run.md) * [Runtime Performance Results](#runtime-performance-results) @@ -38,7 +39,7 @@ Key supported features: * [Development](#development) * [Features and Diagnostics](#features-and-diagnostics) -# Getting Started +## Getting Started For your first time running MaxText, we provide specific [instructions](getting_started/First_run.md). 
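+
+A minimal sketch of what that first run setup looks like on a single host (the commands assume a fresh Cloud TPU or GPU VM; `setup.sh` is the dependency installer referenced later in this README, and the linked guide remains the authoritative walkthrough):
+
+```
+# Clone the repository and install dependencies before following the first-run guide.
+git clone https://github.com/google/maxtext.git
+cd maxtext
+bash setup.sh
+```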
@@ -51,11 +52,11 @@ Some extra helpful guides: In addition to the getting started guides, there are always other MaxText capabilities that are being constantly being added! The full suite of end-to-end tests is in [end_to_end](end_to_end). We run them with a nightly cadence. They can be a good source for understanding MaxText Alternatively you can see the continuous [unit tests](.github/workflows/UnitTests.yml) which are run almost continuously. -# Runtime Performance Results +## Runtime Performance Results More details on reproducing these results can be found in [MaxText/configs/README.md](MaxText/configs/README.md). -## TPU v5p +### TPU v5p | No. of params | Accelerator Type | TFLOP/chip/sec | Model flops utilization (MFU) | |---|---|---|---| @@ -70,7 +71,7 @@ More details on reproducing these results can be found in [MaxText/configs/READM | 1160B | v5p-7680 | 2.95e+02 | 64.27% | | 1160B | v5p-12288 | 3.04e+02 | 66.23% | -## TPU v5e +### TPU v5e For 16B, 32B, 64B, and 128B models. See full run configs in [MaxText/configs/v5e/](MaxText/configs/v5e/) as `16b.sh`, `32b.sh`, `64b.sh`, `128b.sh`. @@ -83,7 +84,7 @@ For 16B, 32B, 64B, and 128B models. See full run configs in [MaxText/configs/v5e | 16x v5e-256 | 111 | 56.56% | 123 | 62.26% | 105 | 53.29% | 100 | 50.86% | | 32x v5e-256 | 108 | 54.65% | 119 | 60.40% | 99 | 50.18% | 91 | 46.25% | -# Comparison to Alternatives +## Comparison to Alternatives MaxText is heavily inspired by [MinGPT](https://github.com/karpathy/minGPT)/[NanoGPT](https://github.com/karpathy/nanoGPT), elegant standalone GPT implementations written in PyTorch and targeting Nvidia GPUs. MaxText is more complex, supporting more industry standard models and scaling to tens of thousands of chips. Ultimately MaxText has an MFU more than three times the [17%](https://twitter.com/karpathy/status/1613250489097027584?cxt=HHwWgIDUhbixteMsAAAA) reported most recently with that codebase, is massively scalable and implements a key-value cache for efficient auto-regressive decoding. @@ -91,8 +92,8 @@ MaxText is more similar to [Nvidia/Megatron-LM](https://github.com/NVIDIA/Megatr MaxText is also comparable to [Pax](https://github.com/google/paxml). Like Pax, MaxText provides high-performance and scalable implementations of LLMs in Jax. Pax focuses on enabling powerful configuration parameters, enabling developers to change the model by editing config parameters. By contrast, MaxText is a simple, concrete implementation of various LLMs that encourages users to extend by forking and directly editing the source code. -# Features and Diagnostics -## Collect Stack Traces +## Features and Diagnostics +### Collect Stack Traces When running a Single Program, Multiple Data (SPMD) job on accelerators, the overall process can hang if there is any error or any VM hangs/crashes for some reason. In this scenario, capturing stack traces will help to identify and troubleshoot the issues for the jobs running on TPU VMs. The following configurations will help to debug a fault or when a program is stuck or hung somewhere by collecting stack traces. Change the parameter values accordingly in `MaxText/configs/base.yml`: @@ -106,10 +107,10 @@ jsonPayload.verb="stacktraceanalyzer" Here is the related PyPI package: https://pypi.org/project/cloud-tpu-diagnostics. -## Ahead of Time Compilation (AOT) +### Ahead of Time Compilation (AOT) To compile your training run ahead of time, we provide a tool `train_compile.py`. This tool allows you to compile the main `train_step` in `train.py` for target hardware (e.g. 
a large number of v5e devices) without using the full cluster. -### TPU Support +#### TPU Support You may use only a CPU or a single VM from a different family to pre-compile for a TPU cluster. This compilation helps with two main goals: @@ -119,7 +120,7 @@ You may use only a CPU or a single VM from a different family to pre-compile for The tool `train_compile.py` is tightly linked to `train.py` and uses the same configuration file `configs/base.yml`. Although you don't need to run on a TPU, you do need to install `jax[tpu]` in addition to other dependencies, so we recommend running `setup.sh` to install these if you have not already done so. -#### Example AOT 1: Compile ahead of time basics +##### Example AOT 1: Compile ahead of time basics After installing the dependencies listed above, you are ready to compile ahead of time: ``` # Run the below on a single machine, e.g. a CPU @@ -129,7 +130,7 @@ global_parameter_scale=16 per_device_batch_size=4 This will compile a 16B parameter MaxText model on 2 v5e pods. -#### Example AOT 2: Save compiled function, then load and run it +##### Example AOT 2: Save compiled function, then load and run it Here is an example that saves then loads the compiled `train_step`, starting with the save: **Step 1: Run AOT and save compiled function** @@ -156,14 +157,14 @@ base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket In the save step of example 2 above we included exporting the compiler flag `LIBTPU_INIT_ARGS` and `learning_rate` because those affect the compiled object `my_compiled_train.pickle.` The sizes of the model (e.g. `global_parameter_scale`, `max_sequence_length` and `per_device_batch`) are fixed when you initially compile via `compile_train.py`, you will see a size error if you try to run the saved compiled object with different sizes than you compiled with. However a subtle note is that the **learning rate schedule** is also fixed when you run `compile_train` - which is determined by both `steps` and `learning_rate`. The optimizer parameters such as `adam_b1` are passed only as shaped objects to the compiler - thus their real values are determined when you run `train.py`, not during the compilation. If you do pass in different shapes (e.g. `per_device_batch`), you will get a clear error message reporting that the compiled signature has different expected shapes than what was input. If you attempt to run on different hardware than the compilation targets requested via `compile_topology`, you will get an error saying there is a failure to map the devices from the compiled to your real devices. Using different XLA flags or a LIBTPU than what was compiled will probably run silently with the environment you compiled in without error. However there is no guaranteed behavior in this case; you should run in the same environment you compiled in. -### GPU Support +#### GPU Support Ahead-of-time compilation is also supported for GPUs with some differences from TPUs: 1. GPU does not support compilation across hardware: A GPU host is still required to run AoT compilation, but a single GPU host can compile a program for a larger cluster of the same hardware. 1. For [A3 Cloud GPUs](https://cloud.google.com/compute/docs/gpus#h100-gpus), the maximum "slice" size is a single host, and the `compile_topology_num_slices` parameter represents the number of A3 machines to precompile for. 
-#### Example +##### Example This example illustrates the flags to use for a multihost GPU compilation targeting a cluster of 4 A3 hosts: **Step 1: Run AOT and save compiled function** @@ -191,5 +192,5 @@ base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket As in the TPU case, note that the compilation environment must match the execution environment, in this case by setting the same `XLA_FLAGS`. -## Automatically Upload Logs to Vertex Tensorboard +### Automatically Upload Logs to Vertex Tensorboard MaxText supports automatic upload of logs collected in a directory to a Tensorboard instance in Vertex AI. Follow [user guide](getting_started/Use_Vertex_AI_Tensorboard.md) to know more. diff --git a/docs/advanced_usage.md b/docs/advanced_usage.md new file mode 100644 index 000000000..a02fa35b8 --- /dev/null +++ b/docs/advanced_usage.md @@ -0,0 +1,10 @@ +# Advanced usage + +```{toctree} +getting_started/Run_MaxText_via_multihost_job.md +getting_started/Run_MaxText_via_multihost_runner.md +getting_started/Run_MaxText_via_xpk.md +getting_started/Use_Vertex_AI_Tensorboard.md +getting_started/Run_Llama2.md +data_loading.md +``` diff --git a/docs/conf.py b/docs/conf.py new file mode 100644 index 000000000..e47349f7c --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,42 @@ +# Configuration file for the Sphinx documentation builder. +# +# For the full list of built-in configuration values, see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Project information ----------------------------------------------------- +# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information + +project = "MaxText" +copyright = "2024, MaxText developers" +author = "MaxText developers" + +# -- General configuration --------------------------------------------------- +# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration + +extensions = [ + "myst_nb", +] + +templates_path = ["_templates"] +exclude_patterns = [] +source_suffix = [".rst", ".ipynb", ".md"] + + +# -- Options for HTML output ------------------------------------------------- +# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output + +html_theme = "sphinx_book_theme" +html_static_path = [] + +# exclude_patterns = [ +# # Sometimes sphinx reads its own outputs as inputs! +# "build/html", +# "build/jupyter_execute", +# ] + +# -- Options for myst ---------------------------------------------- +myst_heading_anchors = 3 # auto-generate 3 levels of heading anchors +myst_enable_extensions = [ + "dollarmath", + "linkify", +] diff --git a/docs/data_loading.md b/docs/data_loading.md new file mode 100644 index 000000000..6771a8b6d --- /dev/null +++ b/docs/data_loading.md @@ -0,0 +1,10 @@ +# Data Loading + +Maxtext supports input data pipelines in the following ways: +Tf.data* +Grain +Hugging Face Datasets + +*Tf.data is the most performant way of loading large scale datasets. + +You can read more about the pipelines in [](getting_started/Data_Input_Pipeline.md). 
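+
+As a minimal sketch, the pipeline is chosen through the `dataset_type` setting, either in `MaxText/configs/base.yml` or as a command-line override (the value shown, the bucket paths, and the run name are illustrative assumptions; check `configs/base.yml` and the pipeline guide above for the authoritative options):
+
+```
+# Assumed example: train with the tf.data (TFDS) pipeline on data staged in a GCS bucket.
+python3 MaxText/train.py MaxText/configs/base.yml \
+  run_name=tfds_example \
+  dataset_type=tfds \
+  dataset_path=gs://my-dataset-bucket \
+  base_output_directory=gs://my-output-bucket \
+  steps=100
+```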
\ No newline at end of file diff --git a/getting_started/Data_Input_Perf.md b/docs/getting_started/Data_Input_Perf.md similarity index 96% rename from getting_started/Data_Input_Perf.md rename to docs/getting_started/Data_Input_Perf.md index 378ac7860..a81156597 100644 --- a/getting_started/Data_Input_Perf.md +++ b/docs/getting_started/Data_Input_Perf.md @@ -1,21 +1,23 @@ -## Performance of Data Input Pipeline +# Performance of Data Input Pipeline * Overview of supported data input pipelines: https://github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md * Perf data intepretation: for all three data pipelines, there are data prefetch running in parallel with computation. The goal is to hide data loading behind computation. As long as data loading step time < training computation step time, the data pipeline perf is considered sufficient. -### Methods +## Methods * The following results are measured by [standalone_dataloader.py](https://github.com/google/maxtext/blob/main/MaxText/standalone_dataloader.py), which performs data loading without computation. * c4 data of different formats in GCS bucket are used. For Grain pipeline only, the GCS bucket is mounted to a local path via GCSFUSE ([script](https://github.com/google/maxtext/blob/main/setup_gcsfuse.sh)) * The GCS bucket is multi-region (US) and the VMs that read data can be in different regions in the US. -### HuggingFace pipeline +## HuggingFace pipeline The following data are collected using c4 data in Parquet format. + | Pipeline | seq_len | VM type | per_host_batch | # of host | # of batch | first step (s) | total time (s) | | ----------- | ------- | ---------- | ----------------- | --------- | ---------- | ------------- | -------------- | | HuggingFace | 2048 | TPU v4-8 | 32 (per_device=8) | 1 | 1000 | 6 | 72 | | HuggingFace | 2048 | TPU v4-128 | 32 (per_device=8) | 16 | 1000 | 6 | 72 | -### Grain pipeline +## Grain pipeline The following data are collected using c4 data in ArrayRecord format. + | Pipeline | seq_len | VM type | per_host_batch | # of host | # of batch | worker | first step (s) | total time (s) | | ----------- | ------- | ---------- | ----------------- | --------- | ---------- | ----- | -------------- | --------------- | | Grain | 2048 | TPU v4-8 | 32 (per_device=8) | 1 | 1000 | 1 | 7 | 1200 | @@ -27,8 +29,9 @@ The following data are collected using c4 data in ArrayRecord format. | Grain | 2048 | TPU v4-128 | 32 (per_device=8) | 16 | 1000 | 4 | 8 | 154 | | Grain | 2048 | TPU v4-128 | 32 (per_device=8) | 16 | 1000 | 8 | 11 | 120 | -### TFDS pipeline +## TFDS pipeline The following data are collected using c4 data in TFRecord format. + | Pipeline | seq_len | VM type | per_host_batch | # of host | # of batch | first step (s) | total time (s) | | ----------- | ------- | ---------- | ----------------- | --------- | ---------- | ------------- | -------------- | | TFDS | 2048 | TPU v4-8 | 32 (per_device=8) | 1 | 1000 | 2 | 17 | diff --git a/getting_started/Data_Input_Pipeline.md b/docs/getting_started/Data_Input_Pipeline.md similarity index 94% rename from getting_started/Data_Input_Pipeline.md rename to docs/getting_started/Data_Input_Pipeline.md index fb53e63ee..87695460c 100644 --- a/getting_started/Data_Input_Pipeline.md +++ b/docs/getting_started/Data_Input_Pipeline.md @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. 
--> -## Data Input Pipeline +# Data Input Pipeline Currently MaxText has three data input pipelines: @@ -23,24 +23,24 @@ Currently MaxText has three data input pipelines: | Grain | ArrayRecord, available through Tensorflow Datasets | fully deterministic, regardless of preemption | only supports random access datasets | | TFDS | TFRecord, available through Tensorflow Datasets | | only supports TFRecords
non-deterministic with preemption
(deterministic without preemption) | -### Performance +## Performance * Perf data for all 3 input pipeline: https://github.com/google/maxtext/blob/main/getting_started/Data_Input_Perf.md -### Multihost dataloading best practice +## Multihost dataloading best practice In multihost environment, if use an input pipeline that reads data sequentially (HuggingFace or TFDS), the most performant way is to have each data file only accessed by one host, and each host access a subset of data files (shuffle is within the subset of files). This requires (# of data files) to be multiples of (# of hosts loading data). We recommand users to reshard the dataset or use a subset of hosts to load data by setting expansion_factor_real_data (only available for some topologies, will error out otherwise). In MaxText, since the goal is to demonstrate the most performant experience, the behaviors for different data pipelines are: -#### HuggingFace pipeline in multihost +### HuggingFace pipeline in multihost * When (# of data files) >= (# of hosts loading data), assign files to each host as evenly as possible, some host may ended up with 1 file more than the others. When some hosts run out of data, they will produce empty padding batches, so that you are able to utilize the data from the hosts that still have data. But in this stage, training/eval will be less effective, and you will see a decrease in total_weights and slower change in loss. If all hosts run out of data before the step number you set, you will see 0 total_weights and 0 loss. The training/eval will run until the steps/eval_steps set in the config. Note that even each host are assigned the same number of data files, due to the different example count in each data file, and example packing, you will still have different number of batches on each host near the end of the epoch. * When (# of data files) < (# of hosts loading data), files are read sequentially with multiple hosts accessing each file, perf can degrade quickly as # of host increases. -#### TFDS pipeline in multihost +### TFDS pipeline in multihost * When (# of data files) >= (# of hosts loading data), assign equal number of files to each host. The remainning files are skipped. Train/eval will hang if steps/eval_steps are not met but some hosts run out of data. Please set steps/eval_steps accordingly. * When (# of data files) < (# of hosts loading data), files are read sequentially with multiple hosts accessing each file, perf can degrade quickly as # of host increases. -#### Grain pipeline in multihost +### Grain pipeline in multihost * Perf not affected by (# of data files) vs (# of hosts loading data). [Data are shuffled globally](https://github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#global-shuffle-in-grain). Because grain uses a data format (ArrayRecord) that supports random access by index. Even with multiple hosts accessing the same file, they are accessing different indices and and won't have the issue seen with sequential reading. * At the end of the dataset, you may still have some hosts runing out of indices and hang, Please set steps/eval_steps accordingly. -### HuggingFace pipeline +## HuggingFace pipeline The HuggingFace pipeline supports streaming directly from HuggingFace Hub, or from GCS bucket in HuggingFace supported formats (parquet, json, etc.). This is through the HuggingFace [`datasets.load_dataset` API](https://huggingface.co/docs/datasets/en/loading) with `streaming=True`, which take in `hf_*` parameters. 
-#### Example config for streaming from HuggingFace Hub (no download needed): +### Example config for streaming from HuggingFace Hub (no download needed): ``` dataset_type: hf hf_path: 'allenai/c4' # for using https://huggingface.co/datasets/allenai/c4 @@ -56,7 +56,7 @@ tokenizer_path: 'google-t5/t5-large' # for using https://huggingface.co/google- hf_access_token: '' # provide token if using gated dataset or tokenizer ``` -#### Example config for streaming from downloaded data in a GCS bucket: +### Example config for streaming from downloaded data in a GCS bucket: ``` dataset_type: hf hf_path: 'parquet' # or json, arrow, etc. @@ -70,29 +70,29 @@ hf_eval_files: 'gs:////*-validation-*.parquet' # match the val # or a local path containing tokenizer in a format supported by transformers.AutoTokenizer tokenizer_path: 'google-t5/t5-large' # for using https://huggingface.co/google-t5/t5-large ``` -#### Limitations & Recommendations +### Limitations & Recommendations 1. Streaming data directly from HuggingFace Hub may be impacted by the traffic of the server. During peak hours you may encounter "504 Server Error: Gateway Time-out". It's recommended to download the HuggingFace dataset to a GCS bucket or disk for the most stable experience. 2. Streaming data directly from HuggingFace Hub works in multihost settings with a small number of hosts. We have encountered "read time out" error with host number > 16. 3. Only supports epoch=1 at this moment. -### Grain pipeline - for determinism +## Grain pipeline - for determinism -#### Why do we need determinism for data input pipeline? +### Why do we need determinism for data input pipeline? Determinism in a data input pipeline means that the same input data always results in the same sequence of batches at each step. This is typically achieved by setting a fixed shuffle seed during pipeline initialization. In an ideal scenario, where training runs uninterrupted, this determinism is straightforward (deterministic without preemption). However, real-world distributed training environments often face preemptions due to maintenance, hardware failures, or resource constraints. When a preempted training run resumes, the data input pipeline is re-initialized. If the same shuffle seed is used, the pipeline restarts from the beginning, potentially re-training the model on initial data. Conversely, a new seed produces a different batch sequence, making it difficult to track which data has been seen and how often each example is used for training. This lack of control can impact model performance and reproducibility. -#### How does Grain achieve determinism +### How does Grain achieve determinism Grain ensures determinism in data input pipelines by saving the pipeline's state, including dataset metadata and processed data indices, within a small JSON file in checkpoints. When a training run is resumed with the same dataset and shuffle seed, Grain restores the pipeline's exact state from the checkpoint. This enables fully deterministic, reproducible training that is resilient to disruptions. -#### Cases where determinism is crucial +### Cases where determinism is crucial * **Model sensitive to repetition.** When models are sensitive to the frequency with which they encounter specific examples, precise control over the order and repetition of data during training is essential. * **Convergence comparison.** In sensitive convergence experiments like testing quantization techniques, maintaining identical data batches between runs (e.g., quantized vs. 
unquantized) is essential for comparison. Determinism ensures consistency even when the runs are long and undergo saving/resuming at different steps. * **Debug training anomalies.** When troubleshooting training spikes or anomalies, the ability to replay the exact data sequence helps distinguish between bad data batches and underlying hardware or software issues. -#### Global shuffle in Grain +### Global shuffle in Grain In HF or TFDS data pipeline, global shuffle is performed by a shuffle buffer with limited size. Grain performs global shuffle of the indices in the beginning of each epoch and then reads the elements according to the random order. We have found this to be generally fast enough, even when using hard drives and distributed file systems. -#### Using Grain +### Using Grain 1. Dataset needs to be in a format that supports random access. The default format is [ArrayRecord](https://github.com/google/array_record). For converting a dataset into ArrayRecord, see [instructions](https://github.com/google/array_record/tree/main/beam). Additionally, other random accessible data sources can be supported via a custom data source class ([docs](https://github.com/google/grain/blob/main/docs/data_sources.md)). 2. ArrayRecord dataset, when hosted on GCS bucket, can only be read through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/setup.sh). User then needs to mount the GCS bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/setup_gcsfuse.sh). The script configs some parameters for the mount. ``` @@ -118,7 +118,7 @@ eval_interval: 10000 grain_eval_files: '/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record*' ``` -### TFDS pipeline +## TFDS pipeline 1. Download the Allenai c4 dataset in TFRecord format to a GCS bucket (will cost about $100, [details](https://github.com/allenai/allennlp/discussions/5056)) ``` diff --git a/getting_started/First_run.md b/docs/getting_started/First_run.md similarity index 93% rename from getting_started/First_run.md rename to docs/getting_started/First_run.md index ad34ce64e..595593375 100644 --- a/getting_started/First_run.md +++ b/docs/getting_started/First_run.md @@ -1,15 +1,15 @@ -# Getting Started +# First run We recommend starting with a single host first and then moving to multihost. -## Getting Started: Cloud Storage and Configure +## Cloud Storage and Configure 1. [Create a gcs buckets](https://cloud.google.com/storage/docs/creating-buckets) in your project for storing logs and checkpoints. To run maxtext the TPU/GPU VMs must have permission to read/write the gcs bucket. These permissions are granted by service account roles, such as the `STORAGE ADMIN` role. 2. MaxText reads a yaml file for configuration. We also recommend reviewing the configurable options in `configs/base.yml`, this config includes a decoder-only model of ~1B parameters. The configurable options can be overwritten from command lines. For instance you may change the `steps` or `log_period` by either modifying `configs/base.yml` or by passing in `steps` and `log_period` as additional args to the `train.py` call. `base_output_directory` should be set to a folder in the bucket you just created. 
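+
+For example, the configuration step above can be exercised with a command like the following sketch (the run name and bucket are placeholders, and `dataset_type=synthetic` is used only to avoid needing a dataset):
+
+```
+python3 MaxText/train.py MaxText/configs/base.yml \
+  run_name=my_first_run \
+  base_output_directory=gs://my-maxtext-logs \
+  dataset_type=synthetic \
+  steps=100 log_period=25
+```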
-## Getting Started: Local Development for single host +## Local Development for single host -#### Running on Cloud TPUs +### Running on Cloud TPUs Local development is a convenient way to run MaxText on a single host. It doesn't scale to multiple hosts. @@ -40,7 +40,7 @@ python3 MaxText/decode.py MaxText/configs/base.yml \ Be aware, these decodings will be random. To get high quality decodings you need pass in a checkpoint, typically via the `load_parameters_path` argument. -#### Running on NVIDIA GPUs +### Running on NVIDIA GPUs 1. Use `bash docker_build_dependency_image.sh DEVICE=gpu` can be used to build a container with the required dependencies. 2. After installation is completed, run training with the command on synthetic data: ``` @@ -64,13 +64,14 @@ python3 MaxText/decode.py MaxText/configs/base.yml \ Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:297: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details); current tracing scope: all-reduce-start.2; current profiling annotation: XlaModule:#hlo_module=jit__unnamed_wrapped_function_,program_id=7#. ``` -## Getting Starting: Multihost development +## Multihost development There are three patterns for running MaxText with more than one host. 1. [GKE, recommended] [Running Maxtext with xpk](Run_MaxText_via_xpk.md) - Quick Experimentation and Production support 2. [GCE] [Running Maxtext with Multihost Jobs](Run_MaxText_via_multihost_job.md) - Long Running Production Jobs with Queued Resources 3. [GCE] [Running Maxtext with Multihost Runner](Run_MaxText_via_multihost_runner.md) - Fast experiments via multiple ssh connections. -## Getting Starting: Preflight Optimizations +## Preflight Optimizations Once you've gotten workloads running, there are important optimizations you might want to put on your cluster. Please check the doc [PREFLIGHT.md](https://github.com/google/maxtext/blob/main/PREFLIGHT.md) + diff --git a/getting_started/Run_Llama2.md b/docs/getting_started/Run_Llama2.md similarity index 94% rename from getting_started/Run_Llama2.md rename to docs/getting_started/Run_Llama2.md index b120ee773..febeb73d8 100644 --- a/getting_started/Run_Llama2.md +++ b/docs/getting_started/Run_Llama2.md @@ -14,13 +14,13 @@ limitations under the License. --> -## About Llama2 +# About Llama2 MaxText supports [Llama2](https://llama.meta.com/llama2) pretraining, finetuning and decoding for its 7B and 70B flavors. To get started on decoding and finetuning of Llama2, you will first need to download weights along with its tokenizer from [Meta](https://llama.meta.com/llama-downloads). The file [test_llama2_7b.sh](https://github.com/google/maxtext/blob/main/end_to_end/tpu/llama2/7b/test_llama2_7b.sh) provides details on how to convert the PyTorch weights in orbax checkpoint format, and thereafter use it for running decoding and finetuning. [test_llama2_7b.sh](https://github.com/google/maxtext/blob/main/end_to_end/tpu/llama2/7b/test_llama2_7b.sh) also shows how to run pretraining and also how to run decoding on the finetuned model checkpoint. -### MaxText supports pretraining and finetuning with high performance. +## MaxText supports pretraining and finetuning with high performance. Model Flop utilization for training on v5e and v5p and v4 TPUs with MaxText. 
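+
+As a rough sketch of the decoding step described above (the checkpoint path, tokenizer asset, and run name are placeholders; the authoritative end-to-end flow, including the weight conversion itself, is in `test_llama2_7b.sh`):
+
+```
+# Assumes the Meta weights have already been converted to an Orbax checkpoint, as done in test_llama2_7b.sh.
+python3 MaxText/decode.py MaxText/configs/base.yml \
+  model_name=llama2-7b \
+  tokenizer_path=assets/tokenizer.llama2 \
+  load_parameters_path=gs://my-bucket/llama2-7b/orbax/checkpoint/0/items \
+  per_device_batch_size=1 \
+  run_name=llama2_decode base_output_directory=gs://my-output-bucket
+```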
diff --git a/getting_started/Run_MaxText_via_multihost_job.md b/docs/getting_started/Run_MaxText_via_multihost_job.md similarity index 97% rename from getting_started/Run_MaxText_via_multihost_job.md rename to docs/getting_started/Run_MaxText_via_multihost_job.md index 309f321d6..cb7a53598 100644 --- a/getting_started/Run_MaxText_via_multihost_job.md +++ b/docs/getting_started/Run_MaxText_via_multihost_job.md @@ -14,7 +14,7 @@ limitations under the License. --> -## Getting Started: `multihost_job.py` - Production Jobs On Multiple Slices +# Production Jobs On Multiple Slices (`multihost_job.py`) The workflow using `multihost_job.py` is optimized for long running experiments, providing resiliency against hardware failure and avoiding long running ssh connections. Its latency is much higher than `multihost_runner.py` because it needs to provision new capacity each time. The `multihost_job.py` script ends once the request to create the TPUs is issued. Logs are written both to gcloud in real time and also sent to GCS at the end of the job. diff --git a/getting_started/Run_MaxText_via_multihost_runner.md b/docs/getting_started/Run_MaxText_via_multihost_runner.md similarity index 98% rename from getting_started/Run_MaxText_via_multihost_runner.md rename to docs/getting_started/Run_MaxText_via_multihost_runner.md index 31a04e8e1..60f29cd4b 100644 --- a/getting_started/Run_MaxText_via_multihost_runner.md +++ b/docs/getting_started/Run_MaxText_via_multihost_runner.md @@ -14,7 +14,7 @@ limitations under the License. --> -## Getting Started: `multihost_runner.py` - Quick Experiments on Multiple Hosts (or Multiple Slices) +# Quick Experiments on Multiple Hosts or Multiple Slices (`multihost_runner.py`) This workflow using `multihost_runner.py` is optimized for quick experiments, repeatedly re-using the same TPUs. Because the `multihost_runner.py` script depends on long-lived `ssh` connections, we do not recommend it for any long-running jobs. diff --git a/getting_started/Run_MaxText_via_xpk.md b/docs/getting_started/Run_MaxText_via_xpk.md similarity index 100% rename from getting_started/Run_MaxText_via_xpk.md rename to docs/getting_started/Run_MaxText_via_xpk.md diff --git a/getting_started/Use_Vertex_AI_Tensorboard.md b/docs/getting_started/Use_Vertex_AI_Tensorboard.md similarity index 96% rename from getting_started/Use_Vertex_AI_Tensorboard.md rename to docs/getting_started/Use_Vertex_AI_Tensorboard.md index 856806cf8..113ee0b93 100644 --- a/getting_started/Use_Vertex_AI_Tensorboard.md +++ b/docs/getting_started/Use_Vertex_AI_Tensorboard.md @@ -13,19 +13,19 @@ See the License for the specific language governing permissions and limitations under the License. --> -## Use Vertex AI Tensorboard +# Use Vertex AI Tensorboard MaxText supports automatic upload of logs collected in a directory to a Tensorboard instance in Vertex AI. For more information on how MaxText supports this feature, visit [cloud-accelerator-diagnostics](https://pypi.org/project/cloud-accelerator-diagnostics) PyPI package documentation. -### What is Vertex AI Tensorboard and Vertex AI Experiment +## What is Vertex AI Tensorboard and Vertex AI Experiment Vertex AI Tensorboard is a fully managed and enterprise-ready version of open-source Tensorboard. To learn more about Vertex AI Tensorboard, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction). Vertex AI Experiment is a tool that helps to track and analyze an experiment run on Vertex AI Tensorboard. 
To learn more about Vertex AI Experiments, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments). You can use a single Vertex AI Tensorboard instance to track and compare metrics from multiple Vertex AI Experiments. While you can view metrics from multiple Vertex AI Experiments within a single Tensorboard instance, the underlying log data for each experiment remains separate. -### Prerequisites +## Prerequisites * Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console. * Assign [Vertex AI User IAM role](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.user) to the service account used by the TPU VMs. This is required to create and access the Vertex AI Tensorboard in Google Cloud console. If you are using XPK for MaxText, the necessary Vertex AI User IAM role will be automatically assigned to your node pools by XPK – no need to assign it manually. -### Upload Logs to Vertex AI Tensorboard +## Upload Logs to Vertex AI Tensorboard **Scenario 1: Using XPK to run MaxText on GKE** XPK simplifies MaxText's Vertex AI Tensorboard integration. A Vertex Tensorboard instance and Experiment are automatically created by XPK during workload scheduling. Also, XPK automatically sets the necessary environment variables, eliminating the need to manually configure this in MaxText. Set `use_vertex_tensorboard=False` to avoid setting up Vertex Tensorboard again in MaxText. This is how the configuration will look like for running MaxText via XPK: diff --git a/docs/getting_started/build_model.png b/docs/getting_started/build_model.png new file mode 100644 index 0000000000000000000000000000000000000000..db49cdc17b82608ecb56b0b3377a341ca7caa64a GIT binary patch literal 24524 zcmeFZRa6|&)&|&k2++8@6WrZBAwYmY2=49>+}&LgBxrDVhem@#kl^m_E>rwB_n%qw zGB5KqYie}^eY)z@w)1T{y9oQFAcOLj;4KIQLU}JMsSEraYtsitCgwSA!oRjlkFOVJ)y{cXKr34{g9R6IOX?E_ceFa-WqY|PpZ)fq0bmg}*? 
zOP;(zlOjXR4sKTSdKi7^$2t1>1MYn(toqo^2I|c7=mt4aYPyQW1u55d#MCA$_OyJl^>2cP6N z9eeD5XI*5%?mD^eUh(hN@$WkEn|;&Rt>LPLK&CU{~QVRTZEV}_6#hLGk^T!q@bYxT8BH*i;}c@z(*rm2VC|Ej*j6E8nwkjVa)aa3))F5} z(VOu)v2p{Jf9u2klT_Q5{g9oL+v*`S|AH0k?iC&fJ#P18NLYmQv@aZjLIKjRe#NZv zY()56oUk7>cps6G{>gPTae+EF7ZOzNjDWGe!K%M#)}ubWZlJ_Q1`$)0XNt}&^v{cn zi!}-4drtoz5ci*}b_sjUFS5#(5kp3z(o2ye2Z#LB2$z~bIRr$FOE!lh>XFgq3O&?> zXlXeiRPWaf{t<~KRqxC*Pe3#-tDzMuf}czmEgY(0H@;*ovF6u(+7&Q`Zen(YjQo!> z_=rQe)RH4R=ax8;Zxr#Exg2f=pZx=pXSj)YU18l5aEkb7WC?=d|4nWhP&XRQT!LgM zvqerTAYpY}!vO{tTUaP5J40`~5-V5d@L>7>oBF<>WE#%0jCjwy=)0hP>Fgq{DHah{ z3Q_;SygOAsA$pn>T3XmYRvGP`nkGJ5OL$rl<;(v7Wt98iGVi+BebUhJ8!S$c9jWTX zB3!~J{O{?*O&RQ|^wyzlnyilZSXE75I1RFS_6buzef@^t5?= z&!J@P+l`Gi!T4WliHxGB$;n4T4S#J_*4KxP6igkgb{<^iRPM4Gl|@cj=?MA#-K4(Y+FSP&v8|uGS@u zqh_(T6CseQcn9_Mt8skACcOW0W=e$^s>qR3-3$*dP_Y7se}BWNskvsykDk_ri2hHb z?O_H9m$AmHM%DqdSXgYlX6EKW(j|V7o+gxkoVOPZ+w*g3mKpHT8Q6_a{ee~?l7RUK zJ+|zQrQoBZ47-+T+V`JPz(inLD{cfvd;Y3_(}%W;eY9_X_g15{OgtjS0=V8FC@)Sg z-1xt6cPXDg0S#XRKjQTkoLa`J;hGBlAMz(Y=|xZG0@7xL2}L7va&|uFg!^|_@-D1` zfT6A6p)H)WsQ{ReNw~Hjl;D5l21B<&a8Y^F6|$wMdEewA5tLEY{u9~+Pm)4`UkwjL zj7)ocACKtYFbcLt20U1X#UG|%5-}<`8|ohcu{lRuMS#Iz5=3+VvsD7+T8+--G;{=T z0j5Sq&c}b6U(qATm})_P1)}(z5C0uB0;W|6kziy9fT>`VSSLdG_k~1wUfk-Ldv7)S ze}SIjt=@m-A^CrW|NkXXNfP6z{39D(s^|a7tNj1dQsjjG zzbS;h9_TAw#}IURVt1$6OS6NSoH9*>GT5FU+Om;XWTS`T?*5o(CMk{d|166y3=!O`rs*WWuM|!#0m3tWETUrIF=8tzS0V<~5zo)4nQ1PNA6%VMQFo?& z2PryG2>$3^BT7K1m}3M^H~Vy1ULf~2q_Ql<+>^t>mWQ=fR8q(phUsdfCsA zwFe+S2sCiy*JpyDFcUJpWL{(!O^mqU&v)9V-ZdXODiuv2q%qT*eNl;CPcK9?4bNMA z6>E%`Z$%q9$Nb$fJv8G`blqR!`$hUX%`Or7Ks93sPPklj_x!YfBBP4gkH>w@sM~!} zdV=|!hucl?K_B*=ylm|$Cp-bKyCSA?>axVg8-RjJsg-!_$OatfSKZIa z?4?3Y2pP;*qmsJF4OAQR=@D_(@3GTDcNxbuG|_9;E%9QK!{oiTx$yKfd2yGJm@(2q zN1Kl^D2sL8`_+-*To1N;$#9x9#kLm0@F{Y3v(J!oPm=A{Hm_PCY!9cRtC`x3BLrMXLBkmUYp?xo4$NH%)H8z{>R+v3>%Lz8Ph8?ei*h@KW&m_k5YcH8KQ zxo^c}Fo~a*%K7&1wiaw|y{+zYD4n~(jFqqPhI5r?pWxZqZtIl>ugq4sykx-WQnAz> zhk2nJrIgiv$}mD2+kOhlIJJ8Z4S^^!cyxTQ&74=+ao@2eieK;k4dRdY1U=tthg(#=D{jVN+1!QUFR@6W z$-d;lKXsD4$?=-!W$AmY2jkyV*>bAm5vBXNa2scrbDeLnk;8l!QHXdXdT*s{*Xa8+ z%|1hVoz+KQe~l;5io`QojyWXhW#FJj*7gjyyI9~|leDoTp`d~MU0QU8T70Ge>L;`8 zPc>N^hP+Xj*fL!)Ljjs{g8U&t6cqEA?vk}ldC9i3ZzDEV*1S~7-Bg5_c1rh7n+1^) zt-@e?cq9WSoZ2S-iM2z?h+Gs80 zQa+ho;XI($KB{nfv^TtmE^N^d{S8Xaah;kJNs zN^og_IpNEH{$w@f!&`Zw^J`@}Jj{o178LD?M|3%f>+#3#GVfkCAPm3rYO?P4dp<9{ ztnJhHdcEGL#i`RWI+?ank}A)y((~_g<$ox=3b*y(X&h zxfsgb9^;%o6x?ugk-7Hl7a~3E=!j~cZMuJM|JGlOoO<7A(c>8zHHMi+F+XPq(*0a2 zBmd<#&?4K0(Cr-!^>&ysML31FHVlT28zF&9V4Ngk7%O8tN%Nc|l*&oFRo_9D z-_#Pt+C6F`ej`*6(jezv2`;i#?fO2zw+4?95a2zl6dyR!J79qC%fhNI_{+!7Z#Sci zCY+b_P~YAuSanHXu@`&KzYy7hA5wQuNr~uvWT?wC}q%eGrzEUYxYhssHF!76-`nwDxozkR3aB)E6>SVOc zu!6*~%}L#*QRNoDH8>9Y*Z|Xu{W9@2#&`?P03XE4<;M8T^x{oLFlxo@k37UWtdJJo z78Qb3z~V~2p>BzNt|Hx?g+HHpB?&{mcJ*tVt2mv)jJ4}>b1W-ESk7vXFc+~IzlOY_ zpkYK$XSy%8!upfPG4o{O3ObQYRj!47yR&$f!fI1DqpT%l~p+TNdV^AktH_e939cour*y!CrQ9jk9#Wtoa$2%Dur)KNhYKstc z0GY9B^cgO>2KkDq67aW4p9DB_`@`g)Od2nAc5!rR0nJqDmipD#$unX~^{$nUYVZRdF8~JH_=>I2Mv2Dk z%g1C^YspKA0N;`e>dlfs45_3uv*XXC@RC>`4vxeP4t4-NtrA}`wkE$fxQiUbyQ^p3zUyZ$j%mw$1g`2HwmR z9%9DPRc&FKwGJwT@f^sU_624ZV976|cliE|XGun%Y{NB?2?JCTe@H^hJwuE()IR!D zi_qgRh0H;{5ps7E4(Mpm;>rytwD5${lBYx=*sO6w$E+_BcSy;w^#)zGiL#eH5+O#^ zhy7f#Nb*VPoZ(7Eul#m6Dz9j8vV83GA0zxPj9=QzL0L{=_vTxitawCkqW|`1b(ah7 zTwEh1)%oeo!2DHEfyd$+F^T5SQu*VN8asPdryl&59e-KhC3LGWK}Hxwc5uhcVX~tK zV2GdrmRDo}kBekz@N8f(bvnq1GTpFuEHQbnDpVGPpds{om06%qN zW6X-8M~7YLHc%ZJ)ZBb}(t>YP)S8QW=p$N&uTTlEn7%ZEL5XY&dkj zA+ca%qEpX~jIOLElsZx&ft5|A6r3*CFDI*Z2X@O5YqFJuvbR3yjaM8U>lyToz3N`Y 
zcP>y59{7rwP$SNWiNe>C79CC?=D=3D5`!wju>j=Gl$ilIiAF0H?n`B2txIm}O; zYvcs*Zt0MsOPoe=<%u3c%%6F;3Io`aJ_XQqTB+zPb*8yuw{Gvs_c-wbt|+!{@M2`R zKENQ%OqOriUYQXJbWB_!Eb?2F_26UUyAy}1$n$88DoIJNJQhCYOB|ecjWgM1DdCK8YE3eh!ey!3PJfz422ifAxGQ)?1-AcadYC6x4agX$lujRcr8>QD2Pm^3LX`V)hG`DAeGSx#1norwy>TZQv0&Ym@5V zvBdCm3#0x>=ByIT!Z9k;g)gtw7Mdl?nVfX`Ch>QcFT?uRMu=J(~ScUZy z)uq3B)TvYC=&<9Ygj37XaO9!88Mt+t)6$FFL>bfeU?^qQm8$CK+CNGJ;1a+_dVF%r zw0Mx7=wBt1Y*2A?7tKGnpbVhV-1M3t828fKoq*0v!>nq#fKpBzEl^0cxPUS-H+u9#<2Ih8nQL)NN1)}fi zA7rOAmt>M0O?#r^vSYAIU=%H(F=*&yh@FZLDWy`_`3brX=^X+2~EFYq>op zY%0b4k-zRE;)qki^Vtd9#t=PDykJs&g?~J_VI+Ng*C_Ng9>iCP<9`5q!f&UUOdhg) z&BI{Z*$_$O3Xyw$>9W0vh5Gt-In(wXVyu5hScOpBtzJkaK^($VxGv6$8 z2-AtKNabCu*Fh8N%e@ej&<)wuqfWL93qtCn%ckaXM=4fwt!;-9pV8AQyU;`4nR8G* zG{`8!A-)SjVTsvwgX_C_ghekY5zu{rAqN7rZ6A8V_U|WQ9?iD`MT2`eLzhtMP6wQ^ zrurUevo{Ryxs$x8_Nl+5op=5+KHYx$%xe1Ei~;i#e&}h*C$sl0J41vt zZ3s#9&Hx`-p+OzWNu#2p2GPfK$%S%!FVR@HKqc9eXHySBXnX?s(idjv#;H0$GfslP zjQj-#?T_Xa*YDUan}=IVt-mPmcyOb1YgN1{rJ8xRgp3t-g8&Xf4)g47oZJ>ujRytT*kbKJFoT;ri;yWNO zi-&jLU^Fj;GJ{ObKiS0uyhy%*^6Sh9DxJJ|fUkF(Cu6OhiDCUDzIv1L_!1;2t$*;F z_Z!)62g>TOxR`F4E=2G5k#|VIDbv)}XK#H)RytkE0BY$1WN^AI`#<{c z(aDe_6C_PyTq4?sa=`vB>t!}dYs?+VdaNGO?i?auVi4Sj%@>wOHh9qGVYE+L z_6Ww5xbRBb!->GH;%x4A3(LTb#|E3IupG#UttUYo;2CCDbF||p zmS?5s4U3@C(-6G(^~$+zzs&Jm{vb7)am#JCpsP_5I$E`uF&={wOA~)0#Lvxj*V4_d zCWB=v7u^U{Kd-H|M~^qG_8ajrY$Nb%X7z6lfUmJW)T5y&7Od=mNx6 zPH@y`d~_kA;~gfu8~BZwK75R@SR^-{(0)=>FDqH5*!bBoK8kjfcPI2l0yXcm*F2SjGLRJsobC@)rcADOqVN|#LSN8xgw_bqvMTd0@eBQ=DGy7oZw z&J4=(^eJQbdKM6|bO?0Yk8Z)eN3tPuv7`#$)q_vT0c{zgZz$nXE|{P}Ld0}0&i-pr z9Y{MX3U)(&y-zRjGLoXfs}n%;SO!+^#MXra&gP7%6R!oV0lmo82) z8Ap0)M7RAN>6-_&A&h>NO+SqI@D{INZJs;Ak#*6~?{6+KWj5}fsoFc`udkgjsL5_` z4(7yHm!BT6-?VyCXa1BD4}`@LZwtYyjv8MfB=`F;_VkX2@>tZ@;ZD?UrzFIB;6(t+ zw{#H;6e#GO-322|=Cma$jcUxt{mB4P)=VnO|D66i9?_v~umgH`U#F*1wR0I?T-%OD$-w;6ZYt%$`2-0uqoVdI$^|^1kJ%|3B4HCT*NqX0Xd(x zyg8VP48lU)2QrFUdx{!bv38Y%cVR{x?K0K<1)?f!`8n?O0oc3pt-cUepE^#1KspWW z$nY|}Q3d=Z$FDChw%e}(&_)cA^5Okie6m7F&b z22DMtRl)_~5uDy?C;u=&RPIJFFc*D)?I4@c(#w9x#_^iBu+uPWGD5cN4Lt(RcUMI< zRNAB%$%o?K)~K=75u#uDIj-|o!S3910ap%8W1l7yV-6vMcO%V*b*OZ+iNsW{g4FR{ zljv`L)K+3@B$aF^0pT7Ra)b*R<@+D%W%TD+&S=tJDN`^lgGuDQiRsRG-#o=W87;9* z_F_;y*qdnx;|{?{7vI;Y`+qT%ZZK~JvOeXH8bS_+ArZ!&>i5)I%Mg5>uXAn~ng>ZT zlzj8PyTg36(b#gaB@C-`b*hZ0cn=S@5!9|DCXEi&J?zkAN6r831yIWTrSl6J{9$FL ztEe@0#@D|7;aDZF-WJOX8T?s`P1RW174>u;{>Jxb^|*e@={@`vODq!2*Io>z?=}~o zd9Ma97`eDuL>MjQFjn{c^ABl#T~-yoO*oMT=kF^T;FCnb0cGq^;~a8x@vb*U{m7U@ zv&tO^2Z}a8ihOmR{j0EsE6jM9P#Bwyf2(Oua>_;fuNFOrfk&|)bOlPG^vT6OB`y&233z(Pn9DB z{2^knDLXureR-Z3HY^Ee6Xm7Jdj@@`EZ^${R$Du8WC7a zHwy!=doZequ;WWel;;$C*g>7Co&v4$VnE9J^~+A7PuR=fsWQ)VC$&B zlmq#GYd9EJ<+587`%)KJ*HEnciSv)J(j^;jXlLHdYBJJvMNrFHVI*#cb2qy8`(tnO z8YrKyTheiZ)TB}Ts5U>kabd3H4d*FOEfQtWeqE1@Cwh{e($RvBSfY;bID1jW&*?%! 
z=RR(X<1~B%$T$D7{drhEjS(oK_{`rBIP*xm;9UQr&fJ*9)Qte@7))*{qAc%XmeG`6 zzAZWvx568nF16UA9x3k?g8@mkY>jb#Zg(GGb<+t3+eY#k`ZL z;M4Tw;xSwM43_0JELYGwYM9zl{_Wm~za2;4U1u)J%F7#63oB4QY>PDn@-_U{Nh<7O zSTSgf=~0rU-bX6pB9rcPrinX4sbD+>Gf~u7`>s&##ZMg`WY+vPTj(A{Wm+~nWvaYD zV-_t0KalGQ-BLSH7*mNVDY1n8wT6{~Tb>aH^z6N>?Wyd&%~7==1b%c|IZ)TEk+*^d zF%G8tDcR#2{0JS5rR75;c3GqXq_Au#qxUd*Lf%k(>3jcMr4gyVLlDJ@fu&+UEJo%* znM%8159DZ)Jm&jn25070P{a3GB5Ri1sEZkp*tE1OmQY zA6TB71_Fd=jT$M_uk7X$X(Goyd*h*91#9vk$5R1YbU0~bZv0v1&qVQuepWv~vV)YA z1r&Vr*+uuZW(PV5SN65r>@?ZDg1; z>#0F#>9Y1K;3TVA&Z}pB_SQ-(#jy$~Akv-X(MX)TW>pm8r1_RBG&WcVZeEEe2w;p~ zTTjJkeL>p!!?Sw-VJp?i+d5_$reaVsN^*j1&p1zK((isN?9@4kG{Bitj_xT1{s)q= zI(v8OeW_bRU{ZwBZt>?}_lv7F-g9*bIDcU$J;&zQb5I`vH6P z7LC~WUx#$4#vVhMRnL0i>S}mDL3usURnmZk3mx}wb7~ys9*y7oTW#w(dt8$4QpfshjB3` z`DqDfowFr+Qx3=IIes`d$Nj9t!ZgVqJ$+(4UN}x0v|uKD<)o=GmPJyr_JEB%AI|s)A?zH${3HS2>h45pm$}qxi!S{qH8C$y!WVA zN6?+a6@zFzOEk&vKJb?@^Dg-D)tCfM;}czqNhQE#9)AoO!n_Ofq%rttv*@WW^anG? z<(W`PdB}mP!HYRrb>N<}GZCS}oQg^jSFggi&%X7`1%)d$=6>)C8h65ZyrX1Z`Se_v z%iv7bocBGZ1h7%W&S?HS6h9a!#O^nzeOg%Zbx%9T=(6bR-C3Q9OA)Q$8yV-N2v7Z9 z@m6oHR)Vo4K0ZDh|0klkE6wJy&F`ykr!%TB*ONXhiSp95OOAwEZu?M_6QY$ip;8l1 zl&eQ-CxqtmqNDOUM%v7jmQ}CY454}@DW7USbPZf#6bnyx60eDt^lkPg@|47SHl6Rg zUF9H(JFuBYS^E-GDPxM%t|`-8MHW)x2;iE5pgkV6IgI~Bd0KH zTT$dp+PS8?-rVwkb_Optr8-U)tcHKo8OzwUVkPE{Y((|IUA+kb*1a|AGNPh^|EwhE zYI96#<;$oRufosgsV~~ddLQHTiQvs@ealGfIN#f4KHp*?@7Zj2xIkAWD3az){Lr^U zGaku+x8hBVJP*dJDpI5|`KI^WgVU3iw9|wkm9V_zySa^uw$LVHwa-&wB&oU1RY{|T z)D+`Lji%J-HS&wDkV&$Ip({Jt`*&sDYHB*)?|QpY^qiXae~p5SuiWt>l2W9PSSlt! z5B_|+KY0kNqbb975ADC!@!FY^@L&Jv%`Xg1v+17th0U5*-NA{6M}!0vd%qA(!Z#ylwX;BW~J#9+M4v=>Si6u2LqhU+DcP zlSgIl8Y}l-T!?*m6RHC30A2_KyQzVJK{?QJWE$7rjKcUK;LE)0(lkX5G{lMIF-vOytYMoDKz zA%l``Ly&I6mClINKLZEDIVrI6GwAOv2F)HXu{KVb>f+9Q#GJquAk1K^0lE#sMrxwGfX25G>l`~RLUDOGu(HPaDnPB3`x%2p2*P3$|wTIsLuKpha zy{Ub4+e)&@;Ysw+Q5m4o!EAw_Y|bk7P{E7{{F?y9s0L9d_kZ@6BAr}E9&HqMQlh;~ zI6ZM<`o=KzDtND^JwH;W<2g$Z1~1J<-{Q9#bj|&5TMc~1P;pT1AV@QbB-%nZRBzb1 zEB(mdJ`nRhZ!)Gd7lQtdcl9Ivf0}(b^)BC&>?T-+j_jEEUnJd2@%*i_KhRPg)% zZ10Yf!M}dI;QH$s=fefn!1F(tl`acx4pt=jqv?^6rEBXsro9Ultr3m4>YfMXgg|!z z=)b*AbcL;(u+R6G5kUXz@8JQ_k6OsWZa0pe=f^vZ4{-vlFXv6`^Z~`dN%h!=Adr^x z$gfbvNZjNsGsFM3Q;D6l-V19n^BlV!E_`n=1B0^iRt4${=`=MnHY^f`k~e<4|- z{5))B{FugtEu9dT4I$$z@DDg5bkSdc*LA?k`g;(|!;hPgFW*^cPuV@+ZqDu126xi_ zOzwSKLd&YZ?^<^wa_SCTZr|_9gc&v36fmq@_Z89?d1~7s&h-#X)#>mhM(qH z%+5+@mBg>xBVF&X#z0>CB->KdXIs9vYSumRJKj;Xk@^N48PqNHhmh@W3w9?PF63NH zQdVll);3a*{is653TbV5;zMo(_UaHy4i8?Fk+d*GmACEC}#N?Ry#wx@g&T{&h-Vo^x4W_gS;@|4(XG)p9@n_^=>pMKH+{e!SufC&kT=1F7#I= zRBAl%Raa|#YEO&H@VM{L_aR!=jReT#J46zY`r9uj(j8VkSeZI!)wO$;8!O`Y_7BHo z*ZeDopw0lrWkNMv>st$s>*Lvwhs;KVfui>Y9H3RbHT}{%1LE?IBZ9~bUINgkMnjRe z?ekQObKPc~s~#dgjIJ%&<>;>_gtKDDN28b-+MuHozTKCs*CHJs8=L40NWanUxis`7 z%JK`&YUj@YeXiVz(20$6R`dHjm3Bqzdv~O0VrD7pZ^Q%#1aXO}fonhZ@PxdvaCF*0 z?rl0+L8F+ev=(;N#W*Qid@P4Tbe57dB#+N+v+ee@M!(4|k%ilIV)}i~D}n7PCj~f! zk*EWMT~af+oek;ByM_oJgxkI96_O;9cM88irhx>f70Y&26m2 z2G*2$i505ePaj)*oXyC|tpTmg8EHWSlQa-3wrpVC)#Ds?&iE65Z?OZnU)(nIsw(L? zK+B_y7~p#}oOJIDwf7`+04=YxE@ms}fPRrdv(+{X?N8%JT^Ju;<@=}i0t$3gMqGS? 
zFM~?b{Ux)kR|?5w?fJxuR=uLAR8?D;Ot_u)(z4J}aI5;-h7h!3*)ds_jwr@$O&=TW&;(|{H&;qmSF{W?F%4$Kc zSp1xUddL?#CZZmO;A3YyyQAqgfGA7w&glI&l;@q8p*@T=|M`V^C5J}N!@4O8Yn7!; zP{JIRbBo!SSRxT~uY}?<0;rI#MAG`3_AEP;uIDZFxv2s~D8`G>zU!c-JYVO^8x@D7 ztKg*K6LI{qiFLo+Si65QCKs#bM?4;L##f_DmS*@6le`9)*KDH+NBc`e|JXjd`MyZV z5)^3MtWF#$R=^$BK_&)mNG#-<7&1_-K}ADL23QiiBYW5Ug64vf?t<^(BBNQI#^GWs zp>^32o`@R}M*RB3{GWXRE(&fMD_@FqUSbf)KyXo+aFAbZPSS)5qXP0*Bx+!ZLS-f<4Di@1})~h zW?I?pL#9r53aL-f>q0Su9LD}a7~8IB$f;^K@AMByJ86e{aM{v7&WY&B*J--&)3ndq zrDJaTb3>glii$g$8HCWja$)?URlRpmCi_vLljY^XR9}~y#A{!SbBgJvRP zKYt+v;-$D408stiaUjS5GlTZ}gt?25r8Ouok;?O?AcihF|C@{R#*>eCU>V%P+F4^a z|2l|li2VOE!2e<_Hcs9o8`N+EUc6iA3K4_&2!w_tVfJUU0!b5}GQh(fAcH@1J zD6<)UO#%%9Tm@I@^C{{{|8RsUtgTC}%lZiJN?){mOI&hXFt8?xsyR>pwD_R3c+nRS z%3mBZykI*&G>DupJum6LP?mc0^DMg&mp{)W88pf8+x0&+bn3AqTbO}BF2b&a$DXJ4 zt4E3}x_;A{rI1&i!*_;6o9?$6;=ZT*B5i-fO{4o@^a4>nyHe8dB&vAa;%dErOuHS) zeDNBm_47Di=_tVy$}@R6rZCzzA|oH*k^J}$gJ{3CFXt}o=E7k|0}Rw zCy1=4ygaE;70@80APS=SxY{GeD`Ny~&8PA|KFxipx~xNuP5ntg6lB~mca}o^5|JNl zvgo`r(s-CE7jOW3IA4um+wsW!T)pXYMmqKkjre{)gNxtAIO_D8>sqc9@wpGck2?bIJ!8D|e?gLgr zc!^CjG>_NA{W#F^`3iU_B|K|8FIQOc%JHXwYZU!x1?*_d>4HGcHuF)hGJEK7K}~)+ zP;m+jbaZb6qk6vR3^x|;xD=HYb+7z%gJY6!_j>c2Pzc(Qx$VtK+ZlQL&4mn`ao=X; zb@qO1Baxx<2aKP$bGab2bl8nFI$9ROzboGIw}kGbkpw&#f0x|rDCBAOQ8gQgZ5jx3 z`(u;^3rw_lQNAnzPc2!~eP=Z$DsOUQ8%{o@6mjx4tEt~8OG%+1e@F82+@D$K49!%9 zBY3;n26zp|!`WKGmWCs2TqUlJ9P`Gr(F47M`i;(GiYojiZSKRRc2pIX?~xHKZD{q( zgy2IkDG4KgWy>4)fahAu@%44Xv)dR_@sC!+Jo_9DYtt!u$6eWOwL)A>Z~(zk>$1cs zt!$XmYs)9gD4l;1`Eci&t?8JAql<9Un>jcH8HguM#4R{Tky?>%<+5BmsT##{z_ z_l%5C`g0Y9-jF@E+<4eBN~lP196uZ{6=y?%hJW&4irj9wc-`!X^ahNwm_*>OxB<66 zp59sMd3n6&v(l=;FJoFO*oXU^Q3zSdd_n-=6M5WJ}^`u@P&+)U0X)NfsmLdu4M zH>apx(bGlHPHuKQWpG24j$SbEhocE>h;x}%JHb~CZFAtlGZ#CM&PFr@AjYf$3TvLn89ws`UVyv`l>dRR6n> zZT{?(ZHe$#m^2?V2f&4$^071@&y+yj28Y|OG-6Jg?sQ`q_L-8F_5cBfKI<-<^E%$* z({JK%{bW$3;UsgZ=2zeOefK=meEsvwu*Sz`sqid{sR;>cIwocbl{Q;EN4=H4k9bjM z3H|U3xLU$r*uIVe{tN4PF=nskB94m!6L_Y=Qdl=)Vndq3#fkY90+uvyI*-oDOpUmY zy&lUeDOp2qNX6XD;c#^7K!E-LTNa1x%fGrfAPB59m`kBsTocG32d+MTI@S(f>7B9e zzE>rza(NKu$u+ZhD1xMC<|r7d?X* zs^9M;eCIc0kEF?M4DAY4;sb#!U9IJnLkth)`ZRmU5WNUH2GOwbG3~c z@b~#+qX-}?od%TZzY%a>hP#um7L+P2C(FT;@Ew_dL;vs z6`)zCdYAR{1DvoDG~Y#~cKtEyE>D`B@PMEYDA1kPZRslzZZo9@FdZ~(%&uVo0fU;B zBsqYlz9*b-K|l-}o{{EaxDS`Ww)00Nrj<~@9GSVpnS7?4vGY2rg>@&shC}1~z3p)h zpX`VCJwRT>+uDJL-Oq5gOgLnB#PnqixObP;9{$&xv5q#B_qh;M?<>jK&=yPErng*4 zv`RoI4=my?>C@dyjmPyp(rye(#2L_PJj0HDZ?!7jyayd4(L$3OkJXa%qUTK%xm?K* zF&DIphywyhx6ylKBx1tEy1S1{^OLGhR8>Ns-dMf()by0$#B>VmLp10$BNPFg=IxT= zGoGMvfahr1YF6DZ6d+vcQy_u?f~kYfaN`nB;$#$opkIFFqs)RY54iR~m~Rq+NGWZS zdu#-g+j@zj6Qt4cvX@9`$BeHo**&EQ8xt^p^(BSh?-8lz?qx21==IJcLx zFu_6E>E8m~z`7=EAl8bD<04X}Ev~X0BEcdR4AvF7sL|WWuWo(IkbK7v_7h!=K<((M z{i;&+n4!2@_ISI61+IrsoZSRMbKSsUo&NS&7lo84C8>P^N?FW$8#H7%lk;& z3cv%*3tCoQJR`rId+fg~%BbMvvMvB)fafMlAI-QHZFoSsrVmD{WtD+r*ngS)fkokH z$nnZ!`KXPRb_emhq5-_*pWy%q&bEDH(MT=5-8*?^vRZP{@gminv4&YC#MrJBIGZ-L ztg`M_cr+S7jpNqxYIJ&CnNI_$t4Ey&XT`SKVE~%sH15)=`P_2>!L}+2`1KxBwOB?` zzc1nZ0ES1YYtCkdyoNoxlw;4^eE%boV3OAe8bAlOgLNc813a%+&TOk4E_?$4 z?yhLBh>nHN53usnf)&A-m6sD+|C<#PYgF*r#}Dr%k9W}lkH6Nm(&kHe(h6n(*as36 zHeNh3si2V8uzFyUHeCKKd@dBOt~>cFdEuxe2h%Hg)vW}+CuUV<3&F&!cqrewE$J`a zh4M#I!=G9r#$TuB0;}4$VrXY9{m$eAo%}|?t7gSv9TrV*;P6lLt!LfR(^Zpp87mZ!9WWG$T6L+ zqzACSirZ-p5u~)zv!ssJXn@D8$vsvFEfS*&Tsd7YQjdDm=> z2wz6w9$awAY3g^HCNv#cTJV5$xBs{-sO>GdSYZkT!gwkmI9yhH18O{+5Au^z*!xq0 zqjxk2eYm9adv#Ti3-F;d+lrzx9<@N&to)j5GXjUML+R=cE!X0+D-nT7}Ml*1}z1|=t#_&5wEM};QmM-0@<$9%;wp( z2e6r3E(B><^^{`Kdu4?6hdgAUO6P9qpt8P}epJrwA<3nxOb>y( 
zd~Wfe33Tj!>%4e%dUTtqD-zOoPxE+JClXwT5WHxm(}H~?WUqKSGO=gZWm7}RC4|kh z;zemP(ni*(ur~VSdUeNhwx&tCu|~Q~T3RyK(efs@FRF*-r#I&$&1-eE~YUksN9p|yM-xIn&R%esE4g9(j#qfUz# zyA1)|RW~*(AwZdA{vfP>_wlA{!ABaXN_gMzR6nK&Fpw8nmkF&lP z52T^xd)SqSrjtvm33(TtkhJ4gFw5iF2^}+kF|SCSNWJYMP&TTFSSQx!<3 znpt?#ok@ZKw`2(59AvH@SP{ZH_xP565>g-#DfH|A1>p9q!zcV}1)=^j3bQJ(zBDx+ z`RYxrZ?4i;+TE788LvV-{QBc_O6r@z3LzJXy1bGb1(DtRiXGD5q(;mG0*cQq0KmgG zFklKF^?n+F|D5#2@VJW)(s3u1GTisPBsE*FAF|){>^F%=qbTwe8qi6E@LFDbYz{JT zoHZ)Sh8BCgqd5JeesRYyc=~h#`;yEguhx2bDDtV&9TTYVRIjGLd3jvu0MhZW5N|u( z*`1Tu8d&PsU;U9C%YfW`_#q&QnzeR*JdY7U8Kt(bJVn!SzgG3UU*5SLh(Gn^N~C&E zC!i4xl%sA(YF?GrS9EE4#(~xPU$k9ea`rxiq=sc50D~v9X5H1sH43h0O~4%IKO~Y zvh}|8q~)d;$;;7J^~)`P&<1U$uoK9W+`_>^nKd-~;)|9iqlvaB)a(1zGjdlfg)6Q+ zK(DwxGILIGL7zs+x;@{pytTZ3Gfatg`!gYNkt2uWv!{Vz{_yv}SD2ie2m-2BKwTiG z_Lg^BSD5EkNM3&Rb5NkEJz# zUL`!_;$czq+>?_H?{AA)!EUu?C8)Ih3-#(Uqg-?fz7MaDBG^jUPq3A0e?>zEDc>^nj8XD^Lqf#7PO0FDAK%c7kDs5*o( z_8SJ7?;%+lqe1q%T3x6^5(8cB?>hgod5Z}RuVU*7i!$rFd4_UTSaUJ7xvpqm(N%HL zH}SE7R8-^xMkeeWKbm~f)5#sT(^*-9TFY8{MYo=^|^+8j)A0vWmpzN3deaw1!N%EI+cJn!f&ePGa0j1m)bU4~E z(UR%)H4`X}xbIanf^{yxe;?=}bLsWxYoGfCz{u$DrTA5zc77PO6w!^bmao`zXg+-^ zssZ^{C6uRlG5rcl*BU=+qGtJk3I+P9eO=;v?R0c@;#1b!@lvPej@K~S%9qucE8EH` z9YwmvgBuzh)ph&M@O00E!T#6^F~=as?p_iKd^FVEV~+Xg4-Y!ceXx;uj>DX7_SQ^6 z0E6^q7F(XSabQBgqCkqEkjXoF$Y^}?vZ>QCxZ2&1+&hnb;I*>ji>AefhoXKtrTF+K zc#>V^Y&8erbkJ7khPXv7%&&0!Z}r~|$W6gppyt{ zZ;;|}e`hw|ZxMG4<^rU@QE!+D-_uhJB3q99^0%k2`*eIrTJgkEOWtKD=B3@nGI`n_ zVsc+id~|iKwj=>X#?Fr+H6onmbcIGb|LfSpj3$~`IJz|-x_v+2h7z+tzO&LK2;o;x z$erFcA7*}Z_e4N{C9Mlf{xUn<|53Y3bOdq$DInaw#d1 zk_MMrdI3>FkX}>-qy?l)N>Ub({NKC}KJWYOzaRF){hOV8=FFLMu5-=oB_&EgGn>ly zZa~{I0MC(3wc0Kc-q$|hs?$hFpX;iba4QloxIJv@q;uQQVe~yB3}eB7ZS(7a=2YvH zWH9bm&9A{!s6da9n7WjX^(`W?1cC~ybf*|a{7~Oj6*n_@d{;R0{A2z$6COn$rn2&C zw3LaCxnrCg-@^K2S^m+y14-?d1~P%DsM8Q@LU~R^OCqgUZdim)N^S~kOLkUnHml4S z(>JONE?ISH^wM%$@Q)_~{`L#IOgf%?qT6S(UWcpzawWu$_=Pr`KTPsFMP>Ot{9#@+ zpWhY|hz2b#&N~TW`zCzdR-&qziXZRV4c?3-Wi|RiCFydGTV+wN5RfSuQ_i(#Zeqwm zqwk2Ml#L&WP3_yjcd`+MZi;KST&PZs{q*qrrKbFPD&qw&t$?36kh@|>Iq=U7W?d^> z^({>cn-YHqdC>r){BtSu&O?&@VT;0B>#ik4Sf?u+V0~CGMW&=nN|tnk+B}9C#>NFq zuX5#Swh8L4ity}X)UL1>Nz;H!YU^Oxn9G>O6O2H2DGT)&W4}4(>W@>K+jU?G<;9=; zQ|8$EZ}r9D96{`xNHjneB|;}il7k;;wG%3%`J-!6|Ga z*ZV0=&&96&sikp8C7X>hwpD5;1PmvQKWE|4vo>dH(VLN`)(Us05l?EcbBmqHC^X&o zUQ3rrAq(}b_r)a;gMRCpilf0nG96oM#t$whYF*uK83^ld+faX3tbjCL9BPxcNp#8R z8@JWsMNmT0G%1b7bWY`24P1ry{3GF1fq0)$SULe1(3-@Nv-~3s9gTx0R2%qsy{#0C ztSj!hK*wf=Vo1jrxRICb)$G1u9M_oq7Q@~Tt#Yk!Rpg7$fq#1I+oD{%pCyPO_msJd zQNAlKyjXWQ=z5~qQpg-^9d$HM5s>Wn;G!l3+YCKBo$wyQJiGhcSQI0Fr*MyEzSXy8 zUg>bLq?%BZaul&m%~vWy{-Jl&-RU^^9(6E|W1pkiZKa(S^-9VTA%;ZH*0`{*!BQkA z-;x2&cLEWHiJ-8kFWHct>bm%}h3?`M8U*no-(Ui8Op^o&1MW~8-!G=J*+TW_$V?wW z|JcYyY9dmIXv>T$@{I*loU_W~j?r-?OKne8b*ZOpr9OSUP~-PphrtF@_U6n(LYE?} z72geh6iYNGmOn~N>5iGOH~9%Auiyx>wRN2sZpq8CFU zjS_p;??tAZueVL_{oI{DK*m`*+qxiUeKZpiyoYEb zP?1NjD<g=P$W^V3UGu`WMr+W)C>eb`96BlEKH>QGlZTNDGP$EVIiEe_p8+={%ahujT0 z`<+qy>%#>9a`gtaup5ZEkhvC^WR+yZa zUDE5Y*j^bz@Kmw#5C;DK`i?V_=sYmCa-h?GGbS6(ed+BQV*NUstIB*0Olq8pctuaq zU|PRg&m|6IZyK1M^aNL`!}pm@-S7ELwYSCQq`C4P!^S?s2QWY%&(>B$J|lh6*DJhX z=4^VtrBZj56VPWbPM3S)H^>B1q-zJp-KKdyCTc6$ZeW}`bMd|&wN3Z5=8+}md~6A~ zvT_#mmGH9Z9Nq>gs36#VLjt;*bv|$bKO=L&?>T4>9z%HdK)}w~@%0W^R~8yaCc(Agp3S0^rIC3M1Gk)ILRqHzOSqt_q_#;|sdW(jGP_M;L?FpB zK*M(cJw1Q4z@~CO~{znoV5mAKRQ;9I&oy_x9LSqz5ki+6$-eJ#P2u)q%HnZ zhYX*}4h1zD87S)^wv6l*q`ro#wG1%c;DZnh#x`{9{5~YJO9@)p*!f+N!sdRzo1(9S zNCoeST2(_xJ$QAN^WhM^^QQ%}h`rG0SHMnIR5mv<(~qL;Gx-B9swO9MP>tce{W&og zciru6o0&ydrONJ@eeq)ibjxHq82LuoF9$O;a1|`wx-N#%UirG|u~pRw_Z>qeL}zvO 
zChadLmn-Tj2;-84WXOeA!4A6v07g;)EFydQxisYSwkXcr*;aJO=k+^=nRwUOZ*pt)$!Tr*zci><*kJm|jkh?2pu>Lw|t9I_b$?U2lQ)PP;E(bksU~g+>@Z*aDP%R9qdw z$;vF{p{wHs7_78;m_F8N%G3?kARDIkiq!So^{bF4; zt>&l0e(g4&Xi^0{*FlGxacVm+z_sJkWN>PnjL|3{{xPw&CNAL@(<7>|%hSKP${r2b zT_u>nuY~5FU&vHl1j(Ede^K<(=h6YoCFFKqc>eMX(UN+g9(E!Ya^=CO^ceSi=4F=n zkBlY%vnmTck9sIj#+=OZUB%3m_o{2`Vk(tkutl-8J$hd(8|%N)`)9zWL}k<+9;YCW0#2%7=vMH{amW2x(ufZ*}tRuijKIb7Z-p}z==}Vst;F#AaS}*O?UBz7-Z}8 ziT|Gs6Mj!q;*Sh`m%Bq=>khKfR?eTSM)ghgBEdO+i+gR-gtVfhqB%sLiPu=h?3E4R zC>RVM4?B@Ma?$R^5)cC?-#vlv&J`Gxd%mqu-4AFE?QF#eY>`T7sypeb3iHgC@&7%T)dy%6%dfXh<{@aYyWtD_di>Ce?K%SFE#1}C2sV{Xn8Dl zv<>;}H$&cN!LeYopf;495G+EiL>bqig$5^$+xF>r7bj6tzFePlnmScrc78iqmdV+3 z6bD@_fgziv8h?f9P=zNizreAiv$a@(G>BSrH$PgqCn}4bewoF8d;ebX=oVxH&SgP! zmvp3@;rUYxVZ6$Cg#rFXe12o=JPnyZX2Z*FEK_ z!^*kN@0(jylY8)ilk*Ec5H5N4bGe)d0tc%6VV9sNGL|Biq_pLEjDlSl4{NXT0ElU| z{@0v2=u5xOG50$4UH2}b_Ru>yZ8LG#&Uak7jf*$DRANyI z0hVLGFG?#I%~EzDOe+GsU18g7p)VN;I9u)a@v|S|G)hXzX?6-mZVT47PhyNuWKwxA z-Xsr`o%H-pg!UZp1y$C;lB@*0zvX56Wcf+KJfA$WD(iYP<8h7t4&56XgZ68vt|_Q9 zO4SyxPxddVg($e?PR05kUi6K-EIu=8Ru!xpy*Ty!%<-vvL~it0FMky8m>cS6iph!`oMWp#umrg93f0Md9ONoV`v9IOOb89&RG8 znsyK&umwVRn7@CXt?RB9~%#XVDL*aY9y>iP%iy zZlv(I8E$$Ln$z@?9wuP#xG4V|t;S*nx~Cf+U`pL(MZ{*Cs3KR!5Hq1ql5)h}%AdLQ{E(7~jQ+~HYQULkf6A|j!MUT-E z-AM^#l3>FFxDCo36~SFDrVmtrHQG*c)RG**h`u$Y|C4jtJ?ibYn5F&9f5MOLoQsu1*wQoY9&|1DhpGj~V ztpr^zr-TWbPU&6w`%HxnQtl#K3RJtn{}nAnQI$F_0SMpasYRmO3V*%o9u829RUTcg8IXRan)Gi}Rga zCwohec$n`OTFiV|B42JU9(ZK24V>u$o?;8+;t(?uP|(@r%Bj2I#OA2cm%7rS4d1-) zQ_7_iOx!UtK2{E+85VCn)-ZUU9}UBBjP_}_ra#TijQP#}Y10O?2=lyunDqk5)WE() z4CKm=xa0PY+^Bb$lEEvCGQfo|mzVTn{ky;cR92U^`rrXv?2Z+Y;F;q>FI76^qJl^~ z_<-;}cSTZmR`@Wvs;%X_r9{>y_FFgIzJu01_x$q4yP;meVD20KYW&&l5g+KQl zJ$=p2A)=F0hH(he`&-yYePbj?fH8Ds4}RAm0pbcd@R;b!5figZyj}F~{Ui8gEVf$2 zyce*w0pk3g`2=`j6AxtmN?OX80fcHmt4@F^yJ!Ar!SKM#ze1p{ z8b#9tR_ZZ_wb~2!SZ}+V6dm&4OacvIXv7J~v;NPI$O{RqN#6aPR5c~TF(OmmJ1z{< z-Q-4Mv)Qj{)_UDXVN9HV9r!nzF3Z33p1`m9OlSEC1LP!3D-LGZfc_%~%Dfq7=~wW* zKsIwJ1_U7%>V*1g{C6%ZWgd~{MXF9^`jV`S;xP#cFS@^e!@p7DoVXR(O6N5^4DePI zw{W}!x_XJwgbX*bplEvrl7;so%)AYhVVDsLB!*>VN0R>zbF-J;SPXQJUD801If;lK zByVX+NV_>@c5~mojKkY>Hs)0vOtmPP7{iInla!n&V>CgPD&p@-%RExbn>MMFe`ULH zPr5&HkDq+Af({4_|9r+UC%0x`nhDmeOT|jEfPFw;zG_F*kMvaiwKzjrLx1f1*Y-}7 z4P2lMJbGo-OfpKS>cc1{$(ufuP2Qfy^LA7!!dH|xlHhkSaX?8~$@mrtbk*32J~wiy zG1ZFJe0qj+u*3JVdX*N$MDL~XJLnl+OlES9y%yTK(WfrsD68~QtQ;i&USnqc;KMkYcvpW~YebWBGK5lVI8`1Fx4Xp@Jyi661~`BO{%U0#<*lv0qP zt)wAEj;NnNBapnGBkg;a2lwOMMwo6^v>D&B%1J(AAVLB2R3n-K3oJh5QcoQ)3u>ab zw&Zb5BL3^DVvdc?(AEW1N3L6BmLj4%E}WPi1RHkhNK&|fBA<3yZ>9gY?TY(ee#K;1 z{~-lRRZSz>VyvPA0|c0DSWb#Ot;es4p2z=2gL^FRxs0_XXB`W#C~bLg3Zgq>d=qHP z%8n}1-BmVvJ#Kbz}vONT1A`)oIQg9gZQ0 z@06AKL*&VPsXMFqgcHn=ry$m!Tl}SA*6HWxb{98w++W4^B!Sv5=f`0pAe^ zfpP|3dHjB*ScGGnk^s{Tz`og(z9rcQ*rt_MO}wRywuLN+?k6Kqz{xy6%!>Q3P}36; zXD%y}x)3M7QM_>`aKE#M^203hy)prNNNA>Z!HD6HDkV&u!le;GiXa=@v`_Id{xXN` z35OIni|E49I2w^%hb``wpst~#Xs#(Joc$^od-pXk2jKu|PM-UnQ-$6`H1#pR5D~`n zKiP@@`aWuOUBxPjz;~QTT{}k2F#z9(DK8a0?;M7i+_(BE>z#_Y$x{YwJKV!XTIvB> zCnaj-8|T7LFD0v;_%H-PcNtH>Cf7Ml1xHC;e$B9B(og|mtboKAGZz`7r4**61im|(ZZwS1%GMprM_tSNUCT#>Z-)tV zi*WYJ7Q8TtTL_Yq7h@7*RV*!q4<+*sDJ5qq@#xhhXlaKQJiF$GK$S>_SJ*i8#2mD= zxt@|%djZua7Oejmc9Q}SuMWr~1}AP%>M@dS#mLqn?Uxz g;r{>4BB!Ck^S6J1l+Gvt34M@;ijH!%A|(9(03ZiaQvd(} literal 0 HcmV?d00001 diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md new file mode 100644 index 000000000..b808cc4cb --- /dev/null +++ b/docs/getting_started/index.md @@ -0,0 +1,20 @@ +# Getting Started + +For your first time running MaxText, we provide specific [instructions](First_run.md). 
+
+MaxText supports training and inference of various open models.
+
+Some extra helpful guides:
+* [Gemma](https://ai.google.dev/gemma): a family of open-weights Large Language Models (LLMs) by [Google DeepMind](https://deepmind.google/), based on Gemini research and technology. You can run decode and finetuning using [these instructions](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end/tpu/gemma/Run_Gemma.md).
+* [Llama2](https://llama.meta.com/llama2/): a family of open-weights Large Language Models (LLMs) by Meta. You can run decode and finetuning using [these instructions](Run_Llama2.md).
+* [Mixtral](https://mistral.ai/news/mixtral-of-experts/): a family of open-weights sparse mixture-of-experts (MoE) models by Mistral AI. You can run decode and finetuning using [these instructions](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end/tpu/mixtral/Run_Mixtral.md).
+
+In addition to the getting started guides, there are always other MaxText capabilities that are constantly being added! The full suite of end-to-end tests is in [end_to_end](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end). We run them with a nightly cadence. They can be a good source for understanding MaxText. Alternatively, you can see the continuous [unit tests](https://github.com/AI-Hypercomputer/maxtext/blob/main/.github/workflows/UnitTests.yml), which are run almost continuously.
+
+```{toctree}
+:hidden:
+
+First_run.md
+steps_model.md
+End-to-end example
+```
diff --git a/docs/getting_started/steps_model.md b/docs/getting_started/steps_model.md
new file mode 100644
index 000000000..45c997249
--- /dev/null
+++ b/docs/getting_started/steps_model.md
@@ -0,0 +1,16 @@
+# Steps to build a Model
+
+![](build_model.png)
+_Fig1: Stages of LLM Model Development from pre-training to fine tuning and finally serving a model._
+
+Model building starts with pre-training a base model architecture. Pre-training is the process where you take a model architecture, starting from random weights, and train it on a very large corpus on the scale of trillions of tokens. For example, Google’s Gemma models were pre-trained on 6 trillion tokens; Llama 3 was trained on 15 trillion tokens.
+
+After pre-training, most model producers publish a checkpoint of the model weights. The corpora used for pre-training these models are usually large public sources such as Common Crawl, public code bases, books, etc.
+
+Though these models may be a great way to answer very general questions or prompts, they usually fall short on domain-specific questions and answers in areas such as medical and life sciences, engineering, etc.
+
+Customers and enterprises usually like to continue training a pre-trained model, or to perform a full fine-tuning of the model, using their own datasets. These datasets are usually on the order of billions of tokens. This allows the model to better understand prompts that use keywords and terms specific to their domain.
+
+After a full fine-tuning, most models go through a process of instruction fine-tuning (PEFT/LoRA), supervised fine-tuning, and RLHF to improve model quality and prompt-following behavior.
+
+PEFT/LoRA and supervised fine-tuning are less expensive operations compared to full fine-tuning.
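+
+A minimal, hypothetical sketch of how these stages map onto MaxText commands (run names, bucket paths, and step counts are placeholders; MaxText itself covers the pre-training and full fine-tuning stages):
+
+```
+# Stage 1: pre-training from randomly initialized weights (synthetic data keeps the sketch self-contained).
+python3 MaxText/train.py MaxText/configs/base.yml \
+  run_name=pretrain_demo base_output_directory=gs://my-output-bucket \
+  dataset_type=synthetic steps=100
+
+# Stage 2: full fine-tuning, i.e. the same entry point but starting from an existing checkpoint
+# (load_parameters_path is the checkpoint-loading argument used elsewhere in these docs).
+python3 MaxText/train.py MaxText/configs/base.yml \
+  run_name=finetune_demo base_output_directory=gs://my-output-bucket \
+  load_parameters_path=gs://my-bucket/pretrained/checkpoints/0/items \
+  dataset_path=gs://my-dataset-bucket steps=100
+```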
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 000000000..5487e1a62
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,208 @@
+
+
+# MaxText
+
+## Overview
+
+MaxText is a Google-initiated open source project: a **high performance**, **highly scalable**, **open-source** LLM written in pure Python/[JAX](https://jax.readthedocs.io/en/latest/index.html) and targeting Google Cloud TPUs and GPUs for **training** and **inference**. MaxText achieves [high MFUs](#runtime-performance-results) (Model Flop Utilization) and scales from single host to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler.
+
+MaxText aims to be a launching off point for ambitious LLM projects both in research and production. We encourage users to start by experimenting with MaxText out of the box and then fork and modify MaxText to meet their needs.
+
+We have used MaxText to [demonstrate high-performance, well-converging training in int8](https://cloud.google.com/blog/products/compute/accurate-quantized-training-aqt-for-tpu-v5e) and [scale training to ~51K chips](https://cloud.google.com/blog/products/compute/the-worlds-largest-distributed-llm-training-job-on-tpu-v5e).
+
+Key supported features:
+- TPUs and GPUs (in preview)
+- Training and Inference (in preview)
+
+MaxText additionally provides highly optimized reference implementations for popular open source models like:
+
+- Llama 2, 3 and 3.1
+- Mistral and Mixtral
+- Gemma and Gemma2
+- GPT
+
+These reference implementations support pre-training and full fine tuning. MaxText also allows you to create models of various sizes for benchmarking purposes.
+
+The key value propositions of using MaxText for pre-training or full fine tuning are:
+
+- Very high performance, averaging around 50% MFU
+- Open code base
+- Easy to understand: MaxText is purely written in JAX and Python, which makes it accessible to ML developers interested in inspecting the implementation or stepping through it. It is written at the block-by-block level, with code for Embeddings, Attention, Normalization etc. Different Attention mechanisms like MQA and GQA are all present. For quantization, it uses the JAX AQT library. The implementation is suitable for both GPUs and TPUs.
+
+```{note}
+Maxtext today only supports Pre-training and Full Fine Tuning of the models. It does not support PEFT/LoRA, Supervised Fine Tuning or RLHF.
+```
+
+## Who are the target users of Maxtext?
+
+- Any individual or company that is interested in forking MaxText, using it as a reference implementation of high performance Large Language Models, and building their own LLMs on TPU and GPU.
+- Any individual or company that is interested in performing pre-training or full fine tuning of the supported open source models can use MaxText as a black box to perform full fine tuning. MaxText attains an extremely high MFU, resulting in large savings in training costs.
+
+## Runtime Performance Results
+
+More details on reproducing these results can be found in [MaxText/configs/README.md](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/README.md).
+
+### TPU v5p
+
+| No.
of params | Accelerator Type | TFLOP/chip/sec | Model flops utilization (MFU) | +|---|---|---|---| +| 32B | v5p-128 | 3.28e+02 | 71.47% | +| 64B | v5p-128 | 3.23e+02 | 70.31% | +| 128B | v5p-256 | 3.15e+02 | 68.68% | +| 128B | v5p-512 | 3.15e+02 | 68.53% | +| 256B | v5p-1024 | 3.16e+02 | 68.82% | +| 512B | v5p-1024 | 2.94e+02 | 63.99% | +| 1024B | v5p-2048 | 2.49e+02 | 64.05% | +| 1024B | v5p-4096 | 2.97e+02 | 64.80% | +| 1160B | v5p-7680 | 2.95e+02 | 64.27% | +| 1160B | v5p-12288 | 3.04e+02 | 66.23% | + +### TPU v5e + +For 16B, 32B, 64B, and 128B models. See full run configs in [MaxText/configs/v5e/](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/v5e/) as `16b.sh`, `32b.sh`, `64b.sh`, `128b.sh`. + +| Hardware | 16B TFLOP/sec/chip | 16B MFU | 32B TFLOP/sec/chip | 32B MFU | 64B TFLOP/sec/chip | 64B MFU | 128B TFLOP/sec/chip | 128B MFU | +| ----------- | -----------------: | ------- | -----------------: | ------- | -----------------: | ------- | ------------------: | -------- | +| 1x v5e-256 | 120 | 61.10% | 132 | 66.86% | 118 | 59.90% | 110 | 56.06% | +| 2x v5e-256 | 117 | 59.37% | 128 | 64.81% | 112 | 56.66% | 110 | 55.82% | +| 4x v5e-256 | 117 | 59.14% | 126 | 64.10% | 110 | 55.85% | 108 | 54.93% | +| 8x v5e-256 | 115 | 58.27% | 125 | 63.67% | 108 | 54.96% | 104 | 52.93% | +| 16x v5e-256 | 111 | 56.56% | 123 | 62.26% | 105 | 53.29% | 100 | 50.86% | +| 32x v5e-256 | 108 | 54.65% | 119 | 60.40% | 99 | 50.18% | 91 | 46.25% | + +## Comparison to Alternatives + +MaxText is heavily inspired by [MinGPT](https://github.com/karpathy/minGPT)/[NanoGPT](https://github.com/karpathy/nanoGPT), elegant standalone GPT implementations written in PyTorch and targeting Nvidia GPUs. MaxText is more complex, supporting more industry standard models and scaling to tens of thousands of chips. Ultimately MaxText has an MFU more than three times the [17%](https://twitter.com/karpathy/status/1613250489097027584?cxt=HHwWgIDUhbixteMsAAAA) reported most recently with that codebase, is massively scalable and implements a key-value cache for efficient auto-regressive decoding. + +MaxText is more similar to [Nvidia/Megatron-LM](https://github.com/NVIDIA/Megatron-LM), a very well tuned LLM implementation targeting Nvidia GPUs. The two implementations achieve comparable MFUs. The difference in the codebases highlights the different programming strategies. MaxText is pure Python, relying heavily on the XLA compiler to achieve high performance. By contrast, Megatron-LM is a mix of Python and CUDA, relying on well-optimized CUDA kernels to achieve high performance. + +MaxText is also comparable to [Pax](https://github.com/google/paxml). Like Pax, MaxText provides high-performance and scalable implementations of LLMs in Jax. Pax focuses on enabling powerful configuration parameters, enabling developers to change the model by editing config parameters. By contrast, MaxText is a simple, concrete implementation of various LLMs that encourages users to extend by forking and directly editing the source code. + +## Features and Diagnostics +### Collect Stack Traces +When running a Single Program, Multiple Data (SPMD) job on accelerators, the overall process can hang if there is any error or any VM hangs/crashes for some reason. In this scenario, capturing stack traces will help to identify and troubleshoot the issues for the jobs running on TPU VMs. + +The following configurations will help to debug a fault or when a program is stuck or hung somewhere by collecting stack traces. 
Change the parameter values accordingly in `MaxText/configs/base.yml`: +1. Set `collect_stack_trace: True` to enable collection of stack traces on faults or when the program is hung. This setting will periodically dump the traces for the program to help in debugging. To disable this, set `collect_stack_trace: False`. +2. Set `stack_trace_to_cloud: False` to display stack traces on console. `stack_trace_to_cloud: True` will create a temporary file in `/tmp/debugging` in the TPUs to store the stack traces. There is an agent running on TPU VMs that will periodically upload the traces from the temporary directory to cloud logging in the gcp project. You can view the traces in Logs Explorer on Cloud Logging using the following query: +``` +logName="projects//logs/tpu.googleapis.com%2Fruntime_monitor" +jsonPayload.verb="stacktraceanalyzer" +``` +3. `stack_trace_interval_seconds` signifies the duration in seconds between each stack trace collection event. Setting `stack_trace_interval_seconds: 600` will collect the stack traces every 600 seconds (10 minutes). + +Here is the related PyPI package: https://pypi.org/project/cloud-tpu-diagnostics. + +### Ahead of Time Compilation (AOT) +To compile your training run ahead of time, we provide a tool `train_compile.py`. This tool allows you to compile the main `train_step` in `train.py` for target hardware (e.g. a large number of v5e devices) without using the full cluster. + +#### TPU Support + +You may use only a CPU or a single VM from a different family to pre-compile for a TPU cluster. This compilation helps with two main goals: + +* It will flag any out of memory (OOM) information, such as when the `per_device_batch_size` is set too high, with an identical OOM stack trace as if it was compiled on the target hardware. + +* The ahead of time compilation can be saved and then loaded for fast startup and restart times on the target hardware. + +The tool `train_compile.py` is tightly linked to `train.py` and uses the same configuration file `configs/base.yml`. Although you don't need to run on a TPU, you do need to install `jax[tpu]` in addition to other dependencies, so we recommend running `setup.sh` to install these if you have not already done so. + +##### Example AOT 1: Compile ahead of time basics +After installing the dependencies listed above, you are ready to compile ahead of time: +``` +# Run the below on a single machine, e.g. a CPU +python3 MaxText/train_compile.py MaxText/configs/base.yml compile_topology=v5e-256 compile_topology_num_slices=2 \ +global_parameter_scale=16 per_device_batch_size=4 +``` + +This will compile a 16B parameter MaxText model on 2 v5e pods. + +##### Example AOT 2: Save compiled function, then load and run it +Here is an example that saves then loads the compiled `train_step`, starting with the save: + +**Step 1: Run AOT and save compiled function** +``` +# Run the below on a single machine, e.g. a CPU +export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true" +python3 MaxText/train_compile.py MaxText/configs/base.yml compile_topology=v5e-256 \ +compile_topology_num_slices=2 \ +compiled_trainstep_file=my_compiled_train.pickle global_parameter_scale=16 \ +per_device_batch_size=4 steps=10000 learning_rate=1e-3 +``` + +**Step 2: Run train.py and load the compiled function** + +To load the compiled train_step, you just need to pass `compiled_trainstep_file=my_compiled_train.pickle` into `train.py`: +``` +# Run the below on each host of the target hardware, e.g. 
each host on 2 slices of v5e-256 +export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true" +python3 MaxText/train.py MaxText/configs/base.yml run_name=example_load_compile \ +compiled_trainstep_file=my_compiled_train.pickle \ +global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \ +base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket +``` + +In the save step of example 2 above we included exporting the compiler flag `LIBTPU_INIT_ARGS` and `learning_rate` because those affect the compiled object `my_compiled_train.pickle.` The sizes of the model (e.g. `global_parameter_scale`, `max_sequence_length` and `per_device_batch`) are fixed when you initially compile via `compile_train.py`, you will see a size error if you try to run the saved compiled object with different sizes than you compiled with. However a subtle note is that the **learning rate schedule** is also fixed when you run `compile_train` - which is determined by both `steps` and `learning_rate`. The optimizer parameters such as `adam_b1` are passed only as shaped objects to the compiler - thus their real values are determined when you run `train.py`, not during the compilation. If you do pass in different shapes (e.g. `per_device_batch`), you will get a clear error message reporting that the compiled signature has different expected shapes than what was input. If you attempt to run on different hardware than the compilation targets requested via `compile_topology`, you will get an error saying there is a failure to map the devices from the compiled to your real devices. Using different XLA flags or a LIBTPU than what was compiled will probably run silently with the environment you compiled in without error. However there is no guaranteed behavior in this case; you should run in the same environment you compiled in. + +#### GPU Support +Ahead-of-time compilation is also supported for GPUs with some differences from TPUs: + +1. GPU does not support compilation across hardware: A GPU host is still required to run AoT compilation, but a single GPU host can compile a program for a larger cluster of the same hardware. + +1. For [A3 Cloud GPUs](https://cloud.google.com/compute/docs/gpus#h100-gpus), the maximum "slice" size is a single host, and the `compile_topology_num_slices` parameter represents the number of A3 machines to precompile for. + +##### Example +This example illustrates the flags to use for a multihost GPU compilation targeting a cluster of 4 A3 hosts: + +**Step 1: Run AOT and save compiled function** +``` +# Run the below on a single A3 machine +export XLA_FLAGS="--xla_gpu_enable_async_collectives=true" +python3 MaxText/train_compile.py MaxText/configs/base.yml compile_topology=a3 \ +compile_topology_num_slices=4 \ +compiled_trainstep_file=my_compiled_train.pickle global_parameter_scale=16 \ +attention=dot_product per_device_batch_size=4 steps=10000 learning_rate=1e-3 +``` + +**Step 2: Run train.py and load the compiled function** + +To load the compiled train_step, you just need to pass `compiled_trainstep_file=my_compiled_train.pickle` into `train.py`: +``` +# Run the below on each of the 4 target A3 hosts. 
+export XLA_FLAGS="--xla_gpu_enable_async_collectives=true" +python3 MaxText/train.py MaxText/configs/base.yml run_name=example_load_compile \ +compiled_trainstep_file=my_compiled_train.pickle \ +attention=dot_product global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \ +base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket +``` + +As in the TPU case, note that the compilation environment must match the execution environment, in this case by setting the same `XLA_FLAGS`. + + +### Automatically Upload Logs to Vertex Tensorboard +MaxText supports automatic upload of logs collected in a directory to a Tensorboard instance in Vertex AI. Follow [user guide](getting_started/Use_Vertex_AI_Tensorboard.md) to know more. + + +```{toctree} +:maxdepth: 1 +:hidden: + +getting_started/index.md +advanced_usage.md +reference/index.md +``` diff --git a/docs/reference/code_organization.md b/docs/reference/code_organization.md new file mode 100644 index 000000000..6f78f11ad --- /dev/null +++ b/docs/reference/code_organization.md @@ -0,0 +1,15 @@ +# MaxText Code Organization + +Maxtext is purely written in JAX and python. Below are some folders and files +that show a high-level organization of the code and some key files. + +File/Folder | Description +---------|--------------------------------- + `configs` | Folder contains all the config file, including model configs (llama2, mistral etc) , and pre-optimized configs for different model size on different TPUs + `input_pipelines` | Input training data related code + `layers` | Model layer implementation + `end_to_end` | Example scripts to run Maxtext + `Maxtext/train.py` | The main training script you will run directly + `Maxtext/config/base.yaml` | The base configuration file containing all the related info: checkpointing, model arch, sharding schema, data input, learning rate, profile, compilation, decode + `Maxtext/decode.py` | This is a script to run offline inference with a sample prompt + `setup.sh`| Bash script used to install all needed library dependencies. diff --git a/docs/reference/config_options.md b/docs/reference/config_options.md new file mode 100644 index 000000000..3cb2b938b --- /dev/null +++ b/docs/reference/config_options.md @@ -0,0 +1 @@ +# Configuration options diff --git a/docs/reference/index.md b/docs/reference/index.md new file mode 100644 index 000000000..c5961f468 --- /dev/null +++ b/docs/reference/index.md @@ -0,0 +1,8 @@ +# Reference + +```{toctree} +code_organization.md +config_options.md +../getting_started/Data_Input_Pipeline.md +../getting_started/Data_Input_Perf.md +``` diff --git a/docs/requirements.txt b/docs/requirements.txt new file mode 100644 index 000000000..b14ee0fa4 --- /dev/null +++ b/docs/requirements.txt @@ -0,0 +1,5 @@ +# Sphinx-related requirements. 
+sphinx +myst-nb +myst-parser[linkify] +sphinx-book-theme \ No newline at end of file diff --git a/requirements_docs.txt b/requirements_docs.txt new file mode 100644 index 000000000..f819b01f5 --- /dev/null +++ b/requirements_docs.txt @@ -0,0 +1,2 @@ +mkdocs-material +mkdocs-include-markdown-plugin From 2df72e78194ca139cd6daf7ceaeb9d37efb27a59 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Fri, 25 Oct 2024 15:46:27 -0300 Subject: [PATCH 02/22] Add ReadTheDocs setup --- .readthedocs.yaml | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) create mode 100644 .readthedocs.yaml diff --git a/.readthedocs.yaml b/.readthedocs.yaml new file mode 100644 index 000000000..8fa1276c7 --- /dev/null +++ b/.readthedocs.yaml @@ -0,0 +1,24 @@ +# Read the Docs configuration file for Sphinx projects +# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details + +# Required +version: 2 + +# Set the OS, Python version and other tools you might need +build: + os: ubuntu-22.04 + tools: + python: "3.12" + +# Build documentation in the "docs/" directory with Sphinx +sphinx: + configuration: docs/conf.py + # Fail on all warnings to avoid broken references + fail_on_warning: true + +# Optional but recommended, declare the Python requirements required +# to build your documentation +# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html +python: + install: + - requirements: docs/requirements.txt From 4d0ac8e725d7581dabb2fb83cd3307fadec3d06a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Tue, 29 Oct 2024 15:12:21 -0300 Subject: [PATCH 03/22] Reorganize documentation with existing content --- docs/advanced_usage.md | 1 - docs/batch_size.md | 6 ++++ docs/checkpointing.md | 13 +++++++++ docs/code_organization.md | 43 +++++++++++++++++++++++++++++ docs/data_loading.md | 13 +++++---- docs/full_finetuning.md | 5 ++++ docs/gce_gke_xpk.md | 3 ++ docs/getting_started/index.md | 2 ++ docs/index.md | 13 +++++++-- docs/inference.md | 3 ++ docs/profiling.md | 3 ++ docs/reference/code_organization.md | 15 ---------- docs/reference/config_options.md | 1 - docs/reference/index.md | 8 ------ docs/remat_policy.md | 26 +++++++++++++++++ docs/sharding.md | 31 +++++++++++++++++++++ 16 files changed, 153 insertions(+), 33 deletions(-) create mode 100644 docs/batch_size.md create mode 100644 docs/checkpointing.md create mode 100644 docs/code_organization.md create mode 100644 docs/full_finetuning.md create mode 100644 docs/gce_gke_xpk.md create mode 100644 docs/inference.md create mode 100644 docs/profiling.md delete mode 100644 docs/reference/code_organization.md delete mode 100644 docs/reference/config_options.md delete mode 100644 docs/reference/index.md create mode 100644 docs/remat_policy.md create mode 100644 docs/sharding.md diff --git a/docs/advanced_usage.md b/docs/advanced_usage.md index a02fa35b8..55abfe796 100644 --- a/docs/advanced_usage.md +++ b/docs/advanced_usage.md @@ -6,5 +6,4 @@ getting_started/Run_MaxText_via_multihost_runner.md getting_started/Run_MaxText_via_xpk.md getting_started/Use_Vertex_AI_Tensorboard.md getting_started/Run_Llama2.md -data_loading.md ``` diff --git a/docs/batch_size.md b/docs/batch_size.md new file mode 100644 index 000000000..d7e955dd7 --- /dev/null +++ b/docs/batch_size.md @@ -0,0 +1,6 @@ +# Per-device batch size + +The value of the `per_device_batch_size` parameter dictates the amount of +training data fed into the chip. This can be of decimal value between 0 and 1. 
+Changing the value of `per_device_batch_size` can improve the MFU for your
+training run.
diff --git a/docs/checkpointing.md b/docs/checkpointing.md
new file mode 100644
index 000000000..5e9a43f0f
--- /dev/null
+++ b/docs/checkpointing.md
@@ -0,0 +1,13 @@
+# Checkpointing
+
+MaxText provides the ability to run training with the following checkpointing options:
+
+- enabled/disabled
+- asynchronous - true/false
+- checkpointing frequency
+
+They are dictated by the following parameters:
+
+- `Enable_checkpointing` (`True`/`False`)
+- `Checkpoint_period` (integer value)
+- `Async_checkpointing` (`True`/`False`)
diff --git a/docs/code_organization.md b/docs/code_organization.md
new file mode 100644
index 000000000..95deed42e
--- /dev/null
+++ b/docs/code_organization.md
@@ -0,0 +1,43 @@
+# Codebase Walkthrough
+
+MaxText is purely written in JAX and Python. Below are some folders and files
+that show a high-level organization of the code and some key files.
+
+File/Folder | Description
+---------|---------------------------------
+ `configs` | Contains all the config files, including model configs (Llama2, Mistral, etc.) and pre-optimized configs for different model sizes on different TPUs
+ `input_pipelines` | Input training data pipeline code
+ `layers` | Model layer implementations
+ `end_to_end` | Example scripts to run MaxText
+ `MaxText/train.py` | The main training script you will run directly
+ `MaxText/configs/base.yml` | The base configuration file containing all the related info: checkpointing, model architecture, sharding schema, data input, learning rate, profiling, compilation, decode
+ `MaxText/decode.py` | A script to run offline inference with a sample prompt
+ `setup.sh` | Bash script used to install all needed library dependencies.
+
+## Training configuration
+
+The [MaxText/configs/base.yml](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/base.yml)
+has a set of default configurations. These can be overridden directly via the CLI
+when invoking the MaxText train scripts; the command line parameters overwrite
+the default values. A few of the key parameters are described below, followed by
+a sample invocation:
+
+- `load_parameters_path`: MaxText checkpoint path.
+- `base_output_directory`: Base path to save the outputs (logs and data).
+- [`dataset_type`](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/base.yml#L273):
+  synthetic, tfds, grain or hf (Hugging Face)
+- `dataset_path`: for `dataset_type=tfds`, path to the dataset.
+- `tokenizer_path`: Path to a tokenizer for the model. The tokenizers are
+  present in ...
+- `quantization`: Whether to use quantized training with AQT. Valid values are ['int8']
+- `per_device_batch_size`: How many batches each TPU/device receives. To improve
+  the MFU, you can increase this value. This can also be a fraction. For this
+  tutorial, we will use the default value of 1.
+- `enable_checkpointing`: Boolean value. Whether we want to generate a checkpoint.
+- `checkpoint_period`: After how many steps checkpointing should be performed.
+- `async_checkpointing`: Accepts a boolean value to set whether to use
+  asynchronous checkpointing. Here, we set it to False.
+- `attention`: On TPU v3 and earlier, we need to set the attention to
+  `dot_product`. Newer versions support the flash attention value. On GPU use
+  `cudnn_flash_te`.
+- `steps`: Number of steps to train. For this tutorial, we will use a small
+  value of 10 steps.
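+
+For illustration, a training invocation that overrides several of these values from the
+command line might look like the sketch below. This is only a sketch: the run name, the
+GCS bucket, and the chosen values are placeholders, not a recommended configuration.
+
+```bash
+# Hypothetical example of overriding base.yml values on the command line.
+# run_name, the output bucket and the parameter values are placeholders.
+python3 MaxText/train.py MaxText/configs/base.yml \
+  run_name=my_first_run \
+  base_output_directory=gs://my-output-bucket \
+  dataset_type=synthetic \
+  per_device_batch_size=1 \
+  enable_checkpointing=True checkpoint_period=5 async_checkpointing=False \
+  attention=dot_product \
+  steps=10
+```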
diff --git a/docs/data_loading.md b/docs/data_loading.md
index 6771a8b6d..1c0f63c96 100644
--- a/docs/data_loading.md
+++ b/docs/data_loading.md
@@ -1,10 +1,11 @@
-# Data Loading
+# How to load the data
 
 Maxtext supports input data pipelines in the following ways:
 
-Tf.data*
-Grain
-Hugging Face Datasets
-*Tf.data is the most performant way of loading large scale datasets.
+- Tf.data[^1]
+- Grain
+- Hugging Face Datasets
 
-You can read more about the pipelines in [](getting_started/Data_Input_Pipeline.md). \ No newline at end of file
+[^1]: Tf.data is the most performant way of loading large scale datasets.
+
+You can read more about the pipelines in [](getting_started/Data_Input_Pipeline.md).
diff --git a/docs/full_finetuning.md b/docs/full_finetuning.md
new file mode 100644
index 000000000..df22b01a3
--- /dev/null
+++ b/docs/full_finetuning.md
@@ -0,0 +1,5 @@
+# Full Finetuning Llama2/Llama3 Optimized configuration
+
+## Parameters to achieve high MFU
+
+This page is in progress.
diff --git a/docs/gce_gke_xpk.md b/docs/gce_gke_xpk.md
new file mode 100644
index 000000000..89a356e32
--- /dev/null
+++ b/docs/gce_gke_xpk.md
@@ -0,0 +1,3 @@
+# Getting started with GCE/GKE+XPK
+
+This page is in progress.
\ No newline at end of file
diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md
index b808cc4cb..a1c445ee8 100644
--- a/docs/getting_started/index.md
+++ b/docs/getting_started/index.md
@@ -17,4 +17,6 @@ In addition to the getting started guides, there are always other MaxText capabi
 First_run.md
 steps_model.md
 End-to-end example
+Data_Input_Pipeline.md
+Data_Input_Perf.md
 ```
diff --git a/docs/index.md b/docs/index.md
index 5487e1a62..687182927 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -49,7 +49,7 @@
 Maxtext today only supports Pre-training and Full Fine Tuning of the models. It does not support PEFT/LoRA, Supervised Fine Tuning or RLHF.
 ```
 
-## Who are the target users of Maxtext?
+## Who are the target users of MaxText?
 
 - Any individual or company that is interested in forking MaxText, using it as a reference implementation of high performance Large Language Models, and building their own LLMs on TPU and GPU.
 - Any individual or company that is interested in performing pre-training or full fine tuning of the supported open source models can use MaxText as a black box to perform full fine tuning. MaxText attains an extremely high MFU, resulting in large savings in training costs.
@@ -203,6 +203,15 @@ :hidden:
 
 getting_started/index.md
+code_organization.md
+data_loading.md
+sharding.md
+remat_policy.md
+batch_size.md
+checkpointing.md
+profiling.md
+full_finetuning.md
+inference.md
+gce_gke_xpk.md
 advanced_usage.md
-reference/index.md
 ```
diff --git a/docs/inference.md b/docs/inference.md
new file mode 100644
index 000000000..deb2dce66
--- /dev/null
+++ b/docs/inference.md
@@ -0,0 +1,3 @@
+# Inference (JetStream)
+
+This page is in progress.
diff --git a/docs/profiling.md b/docs/profiling.md
new file mode 100644
index 000000000..d1ed7902c
--- /dev/null
+++ b/docs/profiling.md
@@ -0,0 +1,3 @@
+# Profiling and Pre-training: Xplane and Tensorboard
+
+This page is in progress.
diff --git a/docs/reference/code_organization.md b/docs/reference/code_organization.md deleted file mode 100644 index 6f78f11ad..000000000 --- a/docs/reference/code_organization.md +++ /dev/null @@ -1,15 +0,0 @@ -# MaxText Code Organization - -Maxtext is purely written in JAX and python. Below are some folders and files -that show a high-level organization of the code and some key files. - -File/Folder | Description ----------|--------------------------------- - `configs` | Folder contains all the config file, including model configs (llama2, mistral etc) , and pre-optimized configs for different model size on different TPUs - `input_pipelines` | Input training data related code - `layers` | Model layer implementation - `end_to_end` | Example scripts to run Maxtext - `Maxtext/train.py` | The main training script you will run directly - `Maxtext/config/base.yaml` | The base configuration file containing all the related info: checkpointing, model arch, sharding schema, data input, learning rate, profile, compilation, decode - `Maxtext/decode.py` | This is a script to run offline inference with a sample prompt - `setup.sh`| Bash script used to install all needed library dependencies. diff --git a/docs/reference/config_options.md b/docs/reference/config_options.md deleted file mode 100644 index 3cb2b938b..000000000 --- a/docs/reference/config_options.md +++ /dev/null @@ -1 +0,0 @@ -# Configuration options diff --git a/docs/reference/index.md b/docs/reference/index.md deleted file mode 100644 index c5961f468..000000000 --- a/docs/reference/index.md +++ /dev/null @@ -1,8 +0,0 @@ -# Reference - -```{toctree} -code_organization.md -config_options.md -../getting_started/Data_Input_Pipeline.md -../getting_started/Data_Input_Perf.md -``` diff --git a/docs/remat_policy.md b/docs/remat_policy.md new file mode 100644 index 000000000..ad5dea67b --- /dev/null +++ b/docs/remat_policy.md @@ -0,0 +1,26 @@ +# Remat Policy and Host Offloading + +For large-scale model training, accelerator memory is a limited resource and we +often make trade-offs such as activation re-materialization to trade off compute +cycles for accelerator memory resources. Host offload is another technique we +recently introduced in the XLA compiler to leverage host DRAM to offload +activations computed during the forward pass and reuse them during the backward +pass for gradient computation; this saves activation recomputation cycles. + +Maxtext provides a parameter called `remat_policy`. This parameter allows +offloading activation memory to host, HBM or recomputing on backward pass. + +Activations in the forward pass are also needed in the backward pass. There are +three options for where in memory these activations are accessible for the +backward pass: + +1. In HBM (MaxText remat policy "minimal") +2. On host (MaxText remat policy "minimal_offloaded") +3. Activations are re-computed during the backward pass (MaxText remat policy "full") + +We can choose different remat policies for different activations (e.g. the FF +activations versus the QKV proj activations), which allows us to optimize memory +usage vs compute trade-offs: Generally we want to use all of our HBM. Both host +offloading (option 2) and re-computing (Aka remat, option 3), use as little HBM +as possible - which is faster depends on model sizes, device compute speed and +host to device speed. 
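+
+As a rough sketch of how this is configured, the remat policy can be selected like any
+other `base.yml` override when launching training. The run name, bucket paths and batch
+size below are placeholders, not tuned values.
+
+```bash
+# Hypothetical example: offload forward-pass activations to host memory by
+# choosing the "minimal_offloaded" policy; "minimal" keeps them in HBM and
+# "full" recomputes them in the backward pass.
+python3 MaxText/train.py MaxText/configs/base.yml \
+  run_name=remat_offload_example \
+  remat_policy=minimal_offloaded \
+  per_device_batch_size=4 \
+  base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
+```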
diff --git a/docs/sharding.md b/docs/sharding.md new file mode 100644 index 000000000..03d34c198 --- /dev/null +++ b/docs/sharding.md @@ -0,0 +1,31 @@ +# Sharding + +Maxtext supports the following sharding mechanisms: + +- Distributed Data Parallelism +- Tensor Parallelism +- Fully Sharded Data Parallel +- Sequence Parallel + +They are covered in the following parameters. These are the default values from base.yml. Use the following sharding parameters for setting on a single TPU Slice or a GPU Slice. + +``` +ici_data_parallelism: 1 +ici_fsdp_parallelism: -1 # recommended ICI axis to be auto-sharded +ici_fsdp_transpose_parallelism: 1 +ici_sequence_parallelism: 1 +ici_tensor_parallelism: 1 +``` + +Following sharding values dictate how training will happen across multiple TPU Pods. + +``` +dcn_data_parallelism: -1 # recommended DCN axis to be auto-sharded +dcn_fsdp_parallelism: 1 +dcn_fsdp_transpose_parallelism: 1 +dcn_sequence_parallelism: 1 # never recommended +dcn_tensor_parallelism: 1 # never recommended +dcn_pipeline_parallelism: 1 +dcn_expert_parallelism: 1 +dcn_autoregressive_parallelism: 1 # never recommended +``` From 3c1ea57c2bd7346b778193e3701c83ed5e0e9925 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Tue, 29 Oct 2024 15:21:10 -0300 Subject: [PATCH 04/22] Improve content --- .../Run_MaxText_via_xpk.md | 2 +- docs/advanced_usage.md | 1 - docs/full_finetuning.md | 20 ++++++++++++++++++- docs/gce_gke_xpk.md | 3 --- docs/index.md | 3 ++- docs/terminologies.md | 13 ++++++++++++ 6 files changed, 35 insertions(+), 7 deletions(-) rename docs/{getting_started => }/Run_MaxText_via_xpk.md (99%) delete mode 100644 docs/gce_gke_xpk.md create mode 100644 docs/terminologies.md diff --git a/docs/getting_started/Run_MaxText_via_xpk.md b/docs/Run_MaxText_via_xpk.md similarity index 99% rename from docs/getting_started/Run_MaxText_via_xpk.md rename to docs/Run_MaxText_via_xpk.md index 20516fdcb..511f98431 100644 --- a/docs/getting_started/Run_MaxText_via_xpk.md +++ b/docs/Run_MaxText_via_xpk.md @@ -15,7 +15,7 @@ --> -# How to run MaxText with XPK? +# How to run MaxText with XPK This document focuses on steps required to setup XPK on TPU VM and assumes you have gone through the [README](https://github.com/google/xpk/blob/main/README.md) to understand XPK basics. diff --git a/docs/advanced_usage.md b/docs/advanced_usage.md index 55abfe796..d0362646d 100644 --- a/docs/advanced_usage.md +++ b/docs/advanced_usage.md @@ -3,7 +3,6 @@ ```{toctree} getting_started/Run_MaxText_via_multihost_job.md getting_started/Run_MaxText_via_multihost_runner.md -getting_started/Run_MaxText_via_xpk.md getting_started/Use_Vertex_AI_Tensorboard.md getting_started/Run_Llama2.md ``` diff --git a/docs/full_finetuning.md b/docs/full_finetuning.md index df22b01a3..a73f7e4c5 100644 --- a/docs/full_finetuning.md +++ b/docs/full_finetuning.md @@ -1,5 +1,23 @@ # Full Finetuninhg LLama2/LLama3 Optimized configuration +In the pre-training section you saw the steps on how to do pre-training with +MaxText. To perform full fine tuning, you need to pass the checkpoint to the +training script. + +Following is the parameter to assign a checkpoint to the training script. + +- `load_parameters_path`: Path to the checkpoint directory + +The high level steps involve: +- Converting the model checkpoints to MaxText formatted checkpoints +- Preparing the dataset so that data can be fed into the training script. + MaxText provides sample pipelines to load the data via tf.data or Pygrain from + a disk or gcs bucket. 
Or it can also input data directly from the hugging face + dataset. +- Running the training script with the checkpoint +- Note: You may need to change the training parameters to fit the model to the + TPU or GPU shape and to obtain an optimized performance. + ## Parameters to achieve high MFU -This page is in progress. +This content is in progress. diff --git a/docs/gce_gke_xpk.md b/docs/gce_gke_xpk.md deleted file mode 100644 index 89a356e32..000000000 --- a/docs/gce_gke_xpk.md +++ /dev/null @@ -1,3 +0,0 @@ -# Getting started with GCE/GKE+XPK - -This page is in progress. \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 687182927..8bcc84947 100644 --- a/docs/index.md +++ b/docs/index.md @@ -212,6 +212,7 @@ checkpointing.md profiling.md full_finetuning.md inference.md -gce_gke_xpk.md +Run_MaxText_via_xpk.md advanced_usage.md +terminologies.md ``` diff --git a/docs/terminologies.md b/docs/terminologies.md new file mode 100644 index 000000000..00509684e --- /dev/null +++ b/docs/terminologies.md @@ -0,0 +1,13 @@ +# Terminologies + +- **FLOP**: Floating Point Operation +- **FLOPS**: Plural form of FLOP +- **FLOP/s** or **FLOPs**: stands for Floating Point Operations Per Second. +- **MFU**: Model FLOP/s Utilization +- **ICI**: Interchip-interconnect. +- **HBM**: High Bandwidth Memory. Built with DRAM technology. Each chip usually has XX GiBs of HBM. +- **VMEM**: Vector Memory. Built with SRAM technology. Each chip usually has XX MiBs of VMEM. +- **DCN**: Data Center Network +- **PCIe**: Peripheral Component Interconnect Express. How the TPUs communicate with the CPU. +- **AI**: Arithmetic Intensity +- **Rank**: Position or ID of a worker within a group of workers. This is not the same as Rank in linear algebra. From bdbda94f68423cdf21fbb42164b70ccaf8be21c5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Tue, 29 Oct 2024 15:29:38 -0300 Subject: [PATCH 05/22] Fix broken link --- docs/getting_started/First_run.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/getting_started/First_run.md b/docs/getting_started/First_run.md index 595593375..959a0c63e 100644 --- a/docs/getting_started/First_run.md +++ b/docs/getting_started/First_run.md @@ -67,7 +67,7 @@ Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.a ## Multihost development There are three patterns for running MaxText with more than one host. -1. [GKE, recommended] [Running Maxtext with xpk](Run_MaxText_via_xpk.md) - Quick Experimentation and Production support +1. [GKE, recommended] [Running Maxtext with xpk](../Run_MaxText_via_xpk.md) - Quick Experimentation and Production support 2. [GCE] [Running Maxtext with Multihost Jobs](Run_MaxText_via_multihost_job.md) - Long Running Production Jobs with Queued Resources 3. [GCE] [Running Maxtext with Multihost Runner](Run_MaxText_via_multihost_runner.md) - Fast experiments via multiple ssh connections. 
From d4d041b981a965bf47be38b74a43ff26d9dd9b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?=
Date: Tue, 29 Oct 2024 15:43:11 -0300
Subject: [PATCH 06/22] Add full finetuning content

---
 docs/full_finetuning.md | 61 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/docs/full_finetuning.md b/docs/full_finetuning.md
index a73f7e4c5..6faada38a 100644
--- a/docs/full_finetuning.md
+++ b/docs/full_finetuning.md
@@ -18,6 +18,67 @@
+## MaxText Checkpoints
+
+MaxText checkpoints are in their own format. You can see the format in the Llama checkpoint conversion script.
+
+The conversion scripts for Llama work with Meta's original checkpoints, not with Hugging Face checkpoints.
+
+For example:
+
+```bash
+python3 MaxText/llama_or_mistral_ckpt.py --base-model-path <path-to-meta-checkpoint> \
+    --maxtext-model-path <path-for-maxtext-checkpoint> --model-size llama2-7b
+```
+
+The conversion scripts do not use accelerators but need large host memory to perform the conversion.
+
+- The base model checkpoints should be in the format `{name}.{chkpt_idx}.pth`
+  - For example: `mistral-7b.00.pth`
+- For large models (e.g. a 70B model), this script requires a large-memory VM.
+- The script loads and saves weights in a single pass.
+
+### Sample Full Fine Tuning Script
+
+Below is a sample training command for Llama2-7b.
+
+```bash
+python3 MaxText/train.py \
+MaxText/configs/base.yml \
+run_name="llama2-finetune-maxtext" \
+base_output_directory=${output_directory} \
+load_parameters_path=${path_to_checkpoint} \
+model_name='llama2-7b' \
+dataset_path=${dataset_path} \
+async_checkpointing=False \
+steps=10 per_device_batch_size=.25
+```
+
+You can find some [end-to-end scripts here](https://github.com/AI-Hypercomputer/maxtext/tree/main/end_to_end/tpu).
+They can provide a reference point for various workflows.
+
+### MaxText Checkpoint to Hugging Face
+
+After fine-tuning or pre-training, MaxText also provides scripts to convert MaxText-format weights back to [Hugging Face format](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/llama_mistral_mixtral_orbax_to_hf.py).
+
+#### Dataset
+
+MaxText provides examples that work with [Common Crawl](https://commoncrawl.org/). The dataset is available in TFRecord format in a cloud bucket, and MaxText provides scripts to copy it to a Google Cloud Storage bucket.
+
+##### Common Crawl (c4) Dataset Setup
+
+You need to run these steps once per project prior to any local development or cluster experiments.
+
+1. Create two GCS buckets in your project, one for downloading and retrieving the dataset and the other for storing the logs.
+2. Download the dataset into your GCS bucket
+
+MaxText assumes these GCS buckets are created in the same project and that it has permissions to read and write from them:
+
+```bash
+bash download_dataset.sh {GCS_PROJECT} {GCS_BUCKET_NAME}
+```
+
+The above will download the c4 dataset to your GCS bucket.
+
 ## Parameters to achieve high MFU
 
-This page is in progress.
From 72bc791583bd7346b778193e3701c83ed5e0e9925 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?=
Date: Tue, 29 Oct 2024 16:46:38 -0300
Subject: [PATCH 07/22] Add MaxText on GPU doc

---
 docs/index.md           |   1 +
 docs/single_host_gpu.md | 182 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 183 insertions(+)
 create mode 100644 docs/single_host_gpu.md

diff --git a/docs/index.md b/docs/index.md
index 8bcc84947..02c807211 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -213,6 +213,7 @@ profiling.md
 full_finetuning.md
 inference.md
 Run_MaxText_via_xpk.md
+single_host_gpu.md
 advanced_usage.md
 terminologies.md
 ```
diff --git a/docs/single_host_gpu.md b/docs/single_host_gpu.md
new file mode 100644
index 000000000..53a94c91a
--- /dev/null
+++ b/docs/single_host_gpu.md
@@ -0,0 +1,182 @@
+# MaxText on Single Host GPU
+
+This is a short guide to running MaxText on GPU. For this set of instructions, the GPUs used are A3 High; that is, a single node with 8 H100 GPUs.
+
+## Create a GPU VM
+
+Follow the instructions to create an A3 High or A3 Mega VM:
+- https://cloud.google.com/compute/docs/gpus/create-gpu-vm-accelerator-optimized#console
+- Add enough disk space to work through the examples (at least 500GB)
+
+SSH into your host:
+
+```bash
+gcloud compute ssh --zone "xxx" "hostname" --project "project name"
+```
+
+## Install the CUDA libraries
+
+Install CUDA prior to starting:
+
+- Follow the [instructions](https://cloud.google.com/compute/docs/gpus/install-drivers-gpu) to install CUDA
+- Check that `nvidia-smi` is working
+- Check `nvcc`
+
+Related NVIDIA content:
+
+- NVIDIA JAX Session:
+- Learn more about JAX on GPUs:
+  - https://www.nvidia.com/en-us/on-demand/session/gtc24-s62246/
+- NVIDIA JAX Toolbox:
+  - https://github.com/NVIDIA/JAX-Toolbox
+
+## Install Docker
+
+Follow these steps to install Docker:
+https://docs.docker.com/engine/install/debian/
+
+## Install NVIDIA Container Toolkit
+
+https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
+
+If you get an NVML error, please follow these instructions:
+
+https://stackoverflow.com/questions/72932940/failed-to-initialize-nvml-unknown-error-in-docker-after-few-hours
+
+## Install MaxText
+
+Clone MaxText:
+
+```bash
+git clone https://github.com/AI-Hypercomputer/maxtext.git
+```
+
+## Build the MaxText Docker Image
+
+This builds a Docker image called `maxtext_base_image`. You can retag it to a different name.
+
+1. Change into the repository directory:
+
+```bash
+cd maxtext
+```
+
+2. Run the following commands to build and push the Docker image:
+
+```bash
+export LOCAL_IMAGE_NAME=
+sudo bash docker_build_dependency_image.sh DEVICE=gpu
+docker tag maxtext_base_image $LOCAL_IMAGE_NAME
+docker push $LOCAL_IMAGE_NAME
+```
+
+Note that when running `bash docker_build_dependency_image.sh DEVICE=gpu`, it
+uses `MODE=stable` by default.
If you want to use other modes, you need to
+specify it explicitly:
+
+- using nightly mode: `bash docker_build_dependency_image.sh DEVICE=gpu MODE=nightly`
+- using pinned mode: `bash docker_build_dependency_image.sh DEVICE=gpu MODE=pinned`
+
+## Test
+
+Test the Docker image to see if JAX can see all 8 GPUs:
+
+```bash
+sudo docker run maxtext_base_image:latest python3 -c "import jax; print(jax.devices())"
+```
+
+You should see the following:
+
+```
+[CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3), CudaDevice(id=4), CudaDevice(id=5), CudaDevice(id=6), CudaDevice(id=7)]
+```
+
+Note: If you only see CPUDevice, that means there is an issue with the NVIDIA Container Toolkit and you need to stop and fix the issue.
+
+We will run the next commands from inside the Docker container for convenience.
+
+## SSH into the Docker container
+
+```bash
+sudo docker run --runtime=nvidia --gpus all -it maxtext_base_image:latest bash
+```
+
+If you do not wish to SSH in, execute the next set of commands by prepending the following:
+
+```bash
+sudo docker run --runtime=nvidia --gpus all -it maxtext_base_image:latest ....
+```
+
+### Test a 1B model training
+
+```bash
+export JAX_COORDINATOR_ADDRESS=localhost
+export JAX_COORDINATOR_PORT=2222
+export GPUS_PER_NODE=8
+export NODE_RANK=0
+export NNODES=1
+```
+
+Update the script and run the command with synthetic data, using the following settings:
+
+```
+base_output_directory: a GCS bucket
+dataset_type: synthetic, or pass a real dataset path
+attention: cudnn_flash_te (the MaxText default is flash, which does not work on GPUs)
+scan_layers: False
+use_iota_embed: True
+hardware: gpu
+per_device_batch_size: 12 (update this to get a better MFU)
+```
+
+```bash
+python3 MaxText/train.py MaxText/configs/base.yml run_name=gpu01 base_output_directory=/deps/output dataset_type=synthetic enable_checkpointing=True steps=10 attention=cudnn_flash_te scan_layers=False use_iota_embed=True hardware=gpu per_device_batch_size=12
+```
+
+### Test a Llama2-7B model training
+
+You can find optimized configurations for running Llama models on various host configurations here:
+
+https://github.com/AI-Hypercomputer/maxtext/tree/main/MaxText/configs/a3/llama_2_7b
+
+A modified version of the `1vm.sh` script is shown below:
+
+```bash
+echo "Running 1vm.sh"
+
+# Example command to invoke this script via XPK
+# python3 xpk/xpk.py workload create --cluster ${CLUSTER_NAME} \
+# --workload ${WORKLOAD_NAME} --docker-image=gcr.io/supercomputer-testing/${LOCAL_IMAGE_NAME} \
+# --device-type ${DEVICE_TYPE} --num-slices 1 \
+# --command "bash MaxText/configs/a3/llama_2_7b/1vm.sh"
+
+# Stop execution if any command exits with error
+set -e
+
+export OUTPUT_PATH="provide an output path"
+export RUN_NAME="llama-2-1vm-$(date +%Y-%m-%d-%H-%M)"
+
+# Set environment variables
+for ARGUMENT in "$@"; do
+    IFS='=' read -r KEY VALUE <<< "$ARGUMENT"
+    export "$KEY"="$VALUE"
+done
+
+export XLA_FLAGS="--xla_dump_to=$OUTPUT_PATH/$RUN_NAME/HLO_dumps/
+--xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_triton_gemm=false
+ --xla_gpu_graph_level=0 --xla_gpu_enable_highest_priority_async_stream=true
+ --xla_gpu_all_reduce_combine_threshold_bytes=134217728 --xla_gpu_all_gather_combine_threshold_bytes=134217728
+ --xla_gpu_reduce_scatter_combine_threshold_bytes=67108864 --xla_gpu_enable_pipelined_all_gather=true
+ --xla_gpu_enable_pipelined_reduce_scatter=true --xla_gpu_enable_pipelined_all_reduce=true
+ --xla_gpu_enable_while_loop_double_buffering=true --xla_gpu_enable_triton_softmax_fusion=false
+ --xla_gpu_enable_all_gather_combine_by_dim=false
--xla_gpu_enable_reduce_scatter_combine_by_dim=false + --xla_disable_hlo_passes=rematerialization" + + +# 1 node, DATA_DP=1, ICI_FSDP=8 +python MaxText/train.py MaxText/configs/models/gpu/llama2_7b.yml run_name=$RUN_NAME \ + dcn_data_parallelism=1 ici_fsdp_parallelism=8 base_output_directory=$OUTPUT_PATH attention=cudnn_flash_te scan_layers=False use_iota_embed=True hardware=gpu +``` From 72bc791583bd6bd704e5b4612f05b9af6a12fe4f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 11:13:51 -0300 Subject: [PATCH 08/22] Improve sharding content Also fixes capitalization in the checkpointing document. --- docs/checkpointing.md | 6 ++-- docs/sharding.md | 67 ++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 66 insertions(+), 7 deletions(-) diff --git a/docs/checkpointing.md b/docs/checkpointing.md index 5e9a43f0f..09ad04ae4 100644 --- a/docs/checkpointing.md +++ b/docs/checkpointing.md @@ -8,6 +8,6 @@ Maxtext provides the ability to run training with following checkpointing option They are dictated by the following parameters: -- `Enable_checkpointing` (`True`/`False`) -- `Checkpoint_period` (integer value) -- `Async_checkpointing` (`True`/`False`) +- `enable_checkpointing` (`True`/`False`) +- `checkpoint_period` (integer value) +- `async_checkpointing` (`True`/`False`) diff --git a/docs/sharding.md b/docs/sharding.md index 03d34c198..d146e4085 100644 --- a/docs/sharding.md +++ b/docs/sharding.md @@ -3,11 +3,70 @@ Maxtext supports the following sharding mechanisms: - Distributed Data Parallelism + - This is arguably the simplest parallelization strategy, where each device + can run the forward pass independently on a different set of data. The + devices must communicate the gradients during the backward pass. This + strategy works best with large per device batch sizes, and is suitable for + slower networks since it doesn't require much communication. +- Fully Sharded Data Parallelism + - Similar to data parallelism each device computes on a different set of + data. However additionally the optimizer state is sharded across devices, + which allows larger models to fit in this distributed memory. However now + the weights need to be all-gathered during the forward pass. This strategy + works best with large per device batch sizes. - Tensor Parallelism -- Fully Sharded Data Parallel -- Sequence Parallel + - Each device has the same data, but is responsible for computing a + different set of features. For the feed forward layer, this requires all + gathering the activations, performing the computations, and then + reduce-scattering the output, similar to the + [megatron strategy](https://parsa.epfl.ch/course-info/cs723/papers/Megatron.pdf). + Ideally these communications can be overlapped with the compute in a + pattern called a "collective matmul". This strategy works best for large + models (large intermediate or "mlp" dim), and is often used when the per + device batch size is small (which is where pure FSDP would not work well). + In MaxText we shard the heads by the tensor parallel axis for the + attention ops, since the heads act like a batch dimension it is easy to + use with efficient attention kernels such as flash attention. +- Sequence Parallelism + - Sequence parallelism as implemented in MaxText is similar to fully sharded + data parallelism. The optimizer state is sharded just like as in FSDP, and + we still shard the tokens, but on the "sequence" dimension instead of the + "batch" dimension. 
However for the attention component we shard the heads + by the sequence axis for the same reason as TP above - heads act like + batch dimension in the attention ops. Transition from sharding on sequence + to heads requires an all-to-all which should be cheap. Sequence + parallelism has strictly more communications than FSDP because of this + all-to-all, however it allows for a fractional per device batch size since + we shard the sequence dimension instead of the batch dimension. A + fractional per device batch size is needed to remain within memory limits + for longer sequence lengths. +- Pipeline parallelism + - Pipeline parallelism shards the optimizer state and computation by layers. + In MaxText we have implemented a "circular" pipeline which is able to + achieve smaller pipeline "bubbles" (idle time). Users can tradeoff bubble + versus communications by setting the layers per stage, more layers per + stage -> less communications required between layers, but also a larger + bubble due to fewer repeats. Pipeline parallelism is useful when the + gradient comms of data parallelism across the slower network cannot be + hidden, which generally occurs with "strong scaling" (fixed global batch + size of say 8M or 16M tokens) and a large number of "pods" or slower + network data parallel replicas. Pipeline parallelism is most useful for + large models when run on a huge cluster which drives the per device batch + size (per pod batch size) small. +- Expert parallelism + - Expert parallelism is specific to MoE models. It shards the optimizer + state and computation by experts for the MoE feedforward component. The + attention component is shared across experts, and thus in MaxText the + expert parallelism axis acts like FSDP in the attention layer. Moving + between this expert sharding and FSDP sharding requires an all-to-all, + which is generally cheap, and thus expert parallelism is often used in any + MoE configuration. However currently in MaxText we only support expert + parallelism with a dropping strategy (dropping tokens that exceed an + "expert capacity"), we are still improving the EP integrations. -They are covered in the following parameters. These are the default values from base.yml. Use the following sharding parameters for setting on a single TPU Slice or a GPU Slice. +These mechanisms are covered in the following parameters. These are the default +values from `base.yml`. Use the following sharding parameters for setting on a +single TPU Slice or a GPU Slice. ``` ici_data_parallelism: 1 @@ -17,7 +76,7 @@ ici_sequence_parallelism: 1 ici_tensor_parallelism: 1 ``` -Following sharding values dictate how training will happen across multiple TPU Pods. +The following sharding values dictate how training will happen across multiple TPU Pods. 
``` dcn_data_parallelism: -1 # recommended DCN axis to be auto-sharded From dc425f3f302ef1f20482c1fa5b2804e8e73de136 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 11:31:58 -0300 Subject: [PATCH 09/22] Reorganization of ToC --- docs/about.md | 6 ++ docs/comparison_to_alternatives.md | 7 ++ docs/concepts.md | 8 ++ docs/features_and_diagnostics.md | 103 ++++++++++++++++++++++++++ docs/index.md | 115 +---------------------------- 5 files changed, 126 insertions(+), 113 deletions(-) create mode 100644 docs/about.md create mode 100644 docs/comparison_to_alternatives.md create mode 100644 docs/concepts.md create mode 100644 docs/features_and_diagnostics.md diff --git a/docs/about.md b/docs/about.md new file mode 100644 index 000000000..717c1e5fe --- /dev/null +++ b/docs/about.md @@ -0,0 +1,6 @@ +# About MaxText + +```{toctree} +features_and_diagnostics.md +comparison_to_alternatives.md +``` \ No newline at end of file diff --git a/docs/comparison_to_alternatives.md b/docs/comparison_to_alternatives.md new file mode 100644 index 000000000..c9100c62b --- /dev/null +++ b/docs/comparison_to_alternatives.md @@ -0,0 +1,7 @@ +# Comparison to Alternatives + +MaxText is heavily inspired by [MinGPT](https://github.com/karpathy/minGPT)/[NanoGPT](https://github.com/karpathy/nanoGPT), elegant standalone GPT implementations written in PyTorch and targeting Nvidia GPUs. MaxText is more complex, supporting more industry standard models and scaling to tens of thousands of chips. Ultimately MaxText has an MFU more than three times the [17%](https://twitter.com/karpathy/status/1613250489097027584?cxt=HHwWgIDUhbixteMsAAAA) reported most recently with that codebase, is massively scalable and implements a key-value cache for efficient auto-regressive decoding. + +MaxText is more similar to [Nvidia/Megatron-LM](https://github.com/NVIDIA/Megatron-LM), a very well tuned LLM implementation targeting Nvidia GPUs. The two implementations achieve comparable MFUs. The difference in the codebases highlights the different programming strategies. MaxText is pure Python, relying heavily on the XLA compiler to achieve high performance. By contrast, Megatron-LM is a mix of Python and CUDA, relying on well-optimized CUDA kernels to achieve high performance. + +MaxText is also comparable to [Pax](https://github.com/google/paxml). Like Pax, MaxText provides high-performance and scalable implementations of LLMs in Jax. Pax focuses on enabling powerful configuration parameters, enabling developers to change the model by editing config parameters. By contrast, MaxText is a simple, concrete implementation of various LLMs that encourages users to extend by forking and directly editing the source code. diff --git a/docs/concepts.md b/docs/concepts.md new file mode 100644 index 000000000..3475e378b --- /dev/null +++ b/docs/concepts.md @@ -0,0 +1,8 @@ +# Concepts + +```{toctree} +sharding.md +remat_policy.md +batch_size.md +checkpointing.md +``` \ No newline at end of file diff --git a/docs/features_and_diagnostics.md b/docs/features_and_diagnostics.md new file mode 100644 index 000000000..0cd452835 --- /dev/null +++ b/docs/features_and_diagnostics.md @@ -0,0 +1,103 @@ +# Features and Diagnostics + +## Collect Stack Traces +When running a Single Program, Multiple Data (SPMD) job on accelerators, the overall process can hang if there is any error or any VM hangs/crashes for some reason. 
In this scenario, capturing stack traces will help to identify and troubleshoot the issues for the jobs running on TPU VMs. + +The following configurations will help to debug a fault or when a program is stuck or hung somewhere by collecting stack traces. Change the parameter values accordingly in `MaxText/configs/base.yml`: +1. Set `collect_stack_trace: True` to enable collection of stack traces on faults or when the program is hung. This setting will periodically dump the traces for the program to help in debugging. To disable this, set `collect_stack_trace: False`. +2. Set `stack_trace_to_cloud: False` to display stack traces on console. `stack_trace_to_cloud: True` will create a temporary file in `/tmp/debugging` in the TPUs to store the stack traces. There is an agent running on TPU VMs that will periodically upload the traces from the temporary directory to cloud logging in the gcp project. You can view the traces in Logs Explorer on Cloud Logging using the following query: +``` +logName="projects//logs/tpu.googleapis.com%2Fruntime_monitor" +jsonPayload.verb="stacktraceanalyzer" +``` +3. `stack_trace_interval_seconds` signifies the duration in seconds between each stack trace collection event. Setting `stack_trace_interval_seconds: 600` will collect the stack traces every 600 seconds (10 minutes). + +Here is the related PyPI package: https://pypi.org/project/cloud-tpu-diagnostics. + +## Ahead of Time Compilation (AOT) +To compile your training run ahead of time, we provide a tool `train_compile.py`. This tool allows you to compile the main `train_step` in `train.py` for target hardware (e.g. a large number of v5e devices) without using the full cluster. + +### TPU Support + +You may use only a CPU or a single VM from a different family to pre-compile for a TPU cluster. This compilation helps with two main goals: + +* It will flag any out of memory (OOM) information, such as when the `per_device_batch_size` is set too high, with an identical OOM stack trace as if it was compiled on the target hardware. + +* The ahead of time compilation can be saved and then loaded for fast startup and restart times on the target hardware. + +The tool `train_compile.py` is tightly linked to `train.py` and uses the same configuration file `configs/base.yml`. Although you don't need to run on a TPU, you do need to install `jax[tpu]` in addition to other dependencies, so we recommend running `setup.sh` to install these if you have not already done so. + +#### Example AOT 1: Compile ahead of time basics +After installing the dependencies listed above, you are ready to compile ahead of time: +``` +# Run the below on a single machine, e.g. a CPU +python3 MaxText/train_compile.py MaxText/configs/base.yml compile_topology=v5e-256 compile_topology_num_slices=2 \ +global_parameter_scale=16 per_device_batch_size=4 +``` + +This will compile a 16B parameter MaxText model on 2 v5e pods. + +#### Example AOT 2: Save compiled function, then load and run it +Here is an example that saves then loads the compiled `train_step`, starting with the save: + +**Step 1: Run AOT and save compiled function** +``` +# Run the below on a single machine, e.g. 
a CPU
+export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true"
+python3 MaxText/train_compile.py MaxText/configs/base.yml compile_topology=v5e-256 \
+compile_topology_num_slices=2 \
+compiled_trainstep_file=my_compiled_train.pickle global_parameter_scale=16 \
+per_device_batch_size=4 steps=10000 learning_rate=1e-3
+```
+
+**Step 2: Run train.py and load the compiled function**
+
+To load the compiled train_step, you just need to pass `compiled_trainstep_file=my_compiled_train.pickle` into `train.py`:
+```
+# Run the below on each host of the target hardware, e.g. each host on 2 slices of v5e-256
+export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true"
+python3 MaxText/train.py MaxText/configs/base.yml run_name=example_load_compile \
+compiled_trainstep_file=my_compiled_train.pickle \
+global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \
+base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
+```
+
+In the save step of example 2 above we included exporting the compiler flag `LIBTPU_INIT_ARGS` and setting `learning_rate` because both affect the compiled object `my_compiled_train.pickle`. The sizes of the model (e.g. `global_parameter_scale`, `max_sequence_length` and `per_device_batch`) are fixed when you initially compile via `train_compile.py`; you will see a size error if you try to run the saved compiled object with different sizes than you compiled with. A subtle note is that the **learning rate schedule** is also fixed when you run `train_compile.py`, because it is determined by both `steps` and `learning_rate`. The optimizer parameters such as `adam_b1` are passed only as shaped objects to the compiler, so their real values are determined when you run `train.py`, not during compilation. If you do pass in different shapes (e.g. `per_device_batch`), you will get a clear error message reporting that the compiled signature has different expected shapes than what was input. If you attempt to run on different hardware than the compilation target requested via `compile_topology`, you will get an error saying there is a failure to map the devices from the compiled topology to your real devices. Running with different XLA flags or a different LIBTPU than were used at compile time will probably run silently, without error, against the environment you compiled with. However, there is no guaranteed behavior in this case; you should run in the same environment you compiled in.
+
+### GPU Support
+Ahead-of-time compilation is also supported for GPUs with some differences from TPUs:
+
+1. GPU does not support compilation across hardware: A GPU host is still required to run AoT compilation, but a single GPU host can compile a program for a larger cluster of the same hardware.
+
+1. For [A3 Cloud GPUs](https://cloud.google.com/compute/docs/gpus#h100-gpus), the maximum "slice" size is a single host, and the `compile_topology_num_slices` parameter represents the number of A3 machines to precompile for.
+ +#### Example +This example illustrates the flags to use for a multihost GPU compilation targeting a cluster of 4 A3 hosts: + +**Step 1: Run AOT and save compiled function** +``` +# Run the below on a single A3 machine +export XLA_FLAGS="--xla_gpu_enable_async_collectives=true" +python3 MaxText/train_compile.py MaxText/configs/base.yml compile_topology=a3 \ +compile_topology_num_slices=4 \ +compiled_trainstep_file=my_compiled_train.pickle global_parameter_scale=16 \ +attention=dot_product per_device_batch_size=4 steps=10000 learning_rate=1e-3 +``` + +**Step 2: Run train.py and load the compiled function** + +To load the compiled train_step, you just need to pass `compiled_trainstep_file=my_compiled_train.pickle` into `train.py`: +``` +# Run the below on each of the 4 target A3 hosts. +export XLA_FLAGS="--xla_gpu_enable_async_collectives=true" +python3 MaxText/train.py MaxText/configs/base.yml run_name=example_load_compile \ +compiled_trainstep_file=my_compiled_train.pickle \ +attention=dot_product global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \ +base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket +``` + +As in the TPU case, note that the compilation environment must match the execution environment, in this case by setting the same `XLA_FLAGS`. + + +## Automatically Upload Logs to Vertex Tensorboard +MaxText supports automatic upload of logs collected in a directory to a Tensorboard instance in Vertex AI. Follow [user guide](getting_started/Use_Vertex_AI_Tensorboard.md) to know more. diff --git a/docs/index.md b/docs/index.md index 02c807211..b2d7ad25d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -86,116 +86,7 @@ For 16B, 32B, 64B, and 128B models. See full run configs in [MaxText/configs/v5e | 16x v5e-256 | 111 | 56.56% | 123 | 62.26% | 105 | 53.29% | 100 | 50.86% | | 32x v5e-256 | 108 | 54.65% | 119 | 60.40% | 99 | 50.18% | 91 | 46.25% | -## Comparison to Alternatives -MaxText is heavily inspired by [MinGPT](https://github.com/karpathy/minGPT)/[NanoGPT](https://github.com/karpathy/nanoGPT), elegant standalone GPT implementations written in PyTorch and targeting Nvidia GPUs. MaxText is more complex, supporting more industry standard models and scaling to tens of thousands of chips. Ultimately MaxText has an MFU more than three times the [17%](https://twitter.com/karpathy/status/1613250489097027584?cxt=HHwWgIDUhbixteMsAAAA) reported most recently with that codebase, is massively scalable and implements a key-value cache for efficient auto-regressive decoding. - -MaxText is more similar to [Nvidia/Megatron-LM](https://github.com/NVIDIA/Megatron-LM), a very well tuned LLM implementation targeting Nvidia GPUs. The two implementations achieve comparable MFUs. The difference in the codebases highlights the different programming strategies. MaxText is pure Python, relying heavily on the XLA compiler to achieve high performance. By contrast, Megatron-LM is a mix of Python and CUDA, relying on well-optimized CUDA kernels to achieve high performance. - -MaxText is also comparable to [Pax](https://github.com/google/paxml). Like Pax, MaxText provides high-performance and scalable implementations of LLMs in Jax. Pax focuses on enabling powerful configuration parameters, enabling developers to change the model by editing config parameters. By contrast, MaxText is a simple, concrete implementation of various LLMs that encourages users to extend by forking and directly editing the source code. 
- -## Features and Diagnostics -### Collect Stack Traces -When running a Single Program, Multiple Data (SPMD) job on accelerators, the overall process can hang if there is any error or any VM hangs/crashes for some reason. In this scenario, capturing stack traces will help to identify and troubleshoot the issues for the jobs running on TPU VMs. - -The following configurations will help to debug a fault or when a program is stuck or hung somewhere by collecting stack traces. Change the parameter values accordingly in `MaxText/configs/base.yml`: -1. Set `collect_stack_trace: True` to enable collection of stack traces on faults or when the program is hung. This setting will periodically dump the traces for the program to help in debugging. To disable this, set `collect_stack_trace: False`. -2. Set `stack_trace_to_cloud: False` to display stack traces on console. `stack_trace_to_cloud: True` will create a temporary file in `/tmp/debugging` in the TPUs to store the stack traces. There is an agent running on TPU VMs that will periodically upload the traces from the temporary directory to cloud logging in the gcp project. You can view the traces in Logs Explorer on Cloud Logging using the following query: -``` -logName="projects//logs/tpu.googleapis.com%2Fruntime_monitor" -jsonPayload.verb="stacktraceanalyzer" -``` -3. `stack_trace_interval_seconds` signifies the duration in seconds between each stack trace collection event. Setting `stack_trace_interval_seconds: 600` will collect the stack traces every 600 seconds (10 minutes). - -Here is the related PyPI package: https://pypi.org/project/cloud-tpu-diagnostics. - -### Ahead of Time Compilation (AOT) -To compile your training run ahead of time, we provide a tool `train_compile.py`. This tool allows you to compile the main `train_step` in `train.py` for target hardware (e.g. a large number of v5e devices) without using the full cluster. - -#### TPU Support - -You may use only a CPU or a single VM from a different family to pre-compile for a TPU cluster. This compilation helps with two main goals: - -* It will flag any out of memory (OOM) information, such as when the `per_device_batch_size` is set too high, with an identical OOM stack trace as if it was compiled on the target hardware. - -* The ahead of time compilation can be saved and then loaded for fast startup and restart times on the target hardware. - -The tool `train_compile.py` is tightly linked to `train.py` and uses the same configuration file `configs/base.yml`. Although you don't need to run on a TPU, you do need to install `jax[tpu]` in addition to other dependencies, so we recommend running `setup.sh` to install these if you have not already done so. - -##### Example AOT 1: Compile ahead of time basics -After installing the dependencies listed above, you are ready to compile ahead of time: -``` -# Run the below on a single machine, e.g. a CPU -python3 MaxText/train_compile.py MaxText/configs/base.yml compile_topology=v5e-256 compile_topology_num_slices=2 \ -global_parameter_scale=16 per_device_batch_size=4 -``` - -This will compile a 16B parameter MaxText model on 2 v5e pods. - -##### Example AOT 2: Save compiled function, then load and run it -Here is an example that saves then loads the compiled `train_step`, starting with the save: - -**Step 1: Run AOT and save compiled function** -``` -# Run the below on a single machine, e.g. 
a CPU -export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true" -python3 MaxText/train_compile.py MaxText/configs/base.yml compile_topology=v5e-256 \ -compile_topology_num_slices=2 \ -compiled_trainstep_file=my_compiled_train.pickle global_parameter_scale=16 \ -per_device_batch_size=4 steps=10000 learning_rate=1e-3 -``` - -**Step 2: Run train.py and load the compiled function** - -To load the compiled train_step, you just need to pass `compiled_trainstep_file=my_compiled_train.pickle` into `train.py`: -``` -# Run the below on each host of the target hardware, e.g. each host on 2 slices of v5e-256 -export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true" -python3 MaxText/train.py MaxText/configs/base.yml run_name=example_load_compile \ -compiled_trainstep_file=my_compiled_train.pickle \ -global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \ -base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket -``` - -In the save step of example 2 above we included exporting the compiler flag `LIBTPU_INIT_ARGS` and `learning_rate` because those affect the compiled object `my_compiled_train.pickle.` The sizes of the model (e.g. `global_parameter_scale`, `max_sequence_length` and `per_device_batch`) are fixed when you initially compile via `compile_train.py`, you will see a size error if you try to run the saved compiled object with different sizes than you compiled with. However a subtle note is that the **learning rate schedule** is also fixed when you run `compile_train` - which is determined by both `steps` and `learning_rate`. The optimizer parameters such as `adam_b1` are passed only as shaped objects to the compiler - thus their real values are determined when you run `train.py`, not during the compilation. If you do pass in different shapes (e.g. `per_device_batch`), you will get a clear error message reporting that the compiled signature has different expected shapes than what was input. If you attempt to run on different hardware than the compilation targets requested via `compile_topology`, you will get an error saying there is a failure to map the devices from the compiled to your real devices. Using different XLA flags or a LIBTPU than what was compiled will probably run silently with the environment you compiled in without error. However there is no guaranteed behavior in this case; you should run in the same environment you compiled in. - -#### GPU Support -Ahead-of-time compilation is also supported for GPUs with some differences from TPUs: - -1. GPU does not support compilation across hardware: A GPU host is still required to run AoT compilation, but a single GPU host can compile a program for a larger cluster of the same hardware. - -1. For [A3 Cloud GPUs](https://cloud.google.com/compute/docs/gpus#h100-gpus), the maximum "slice" size is a single host, and the `compile_topology_num_slices` parameter represents the number of A3 machines to precompile for. 
- -##### Example -This example illustrates the flags to use for a multihost GPU compilation targeting a cluster of 4 A3 hosts: - -**Step 1: Run AOT and save compiled function** -``` -# Run the below on a single A3 machine -export XLA_FLAGS="--xla_gpu_enable_async_collectives=true" -python3 MaxText/train_compile.py MaxText/configs/base.yml compile_topology=a3 \ -compile_topology_num_slices=4 \ -compiled_trainstep_file=my_compiled_train.pickle global_parameter_scale=16 \ -attention=dot_product per_device_batch_size=4 steps=10000 learning_rate=1e-3 -``` - -**Step 2: Run train.py and load the compiled function** - -To load the compiled train_step, you just need to pass `compiled_trainstep_file=my_compiled_train.pickle` into `train.py`: -``` -# Run the below on each of the 4 target A3 hosts. -export XLA_FLAGS="--xla_gpu_enable_async_collectives=true" -python3 MaxText/train.py MaxText/configs/base.yml run_name=example_load_compile \ -compiled_trainstep_file=my_compiled_train.pickle \ -attention=dot_product global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \ -base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket -``` - -As in the TPU case, note that the compilation environment must match the execution environment, in this case by setting the same `XLA_FLAGS`. - - -### Automatically Upload Logs to Vertex Tensorboard -MaxText supports automatic upload of logs collected in a directory to a Tensorboard instance in Vertex AI. Follow [user guide](getting_started/Use_Vertex_AI_Tensorboard.md) to know more. ```{toctree} @@ -205,15 +96,13 @@ MaxText supports automatic upload of logs collected in a directory to a Tensorbo getting_started/index.md code_organization.md data_loading.md -sharding.md -remat_policy.md -batch_size.md -checkpointing.md +concepts.md profiling.md full_finetuning.md inference.md Run_MaxText_via_xpk.md single_host_gpu.md advanced_usage.md +about.md terminologies.md ``` From 56aa722c6dd08ac063f7ddea097f8bf7b88181b2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 11:34:48 -0300 Subject: [PATCH 10/22] Reorganization of guides --- docs/guides.md | 10 ++++++++++ docs/index.md | 7 +------ 2 files changed, 11 insertions(+), 6 deletions(-) create mode 100644 docs/guides.md diff --git a/docs/guides.md b/docs/guides.md new file mode 100644 index 000000000..01b9305b1 --- /dev/null +++ b/docs/guides.md @@ -0,0 +1,10 @@ +# How-to guides + +```{toctree} +data_loading.md +profiling.md +full_finetuning.md +inference.md +Run_MaxText_via_xpk.md +single_host_gpu.md +``` \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index b2d7ad25d..2a6c7f90f 100644 --- a/docs/index.md +++ b/docs/index.md @@ -95,13 +95,8 @@ For 16B, 32B, 64B, and 128B models. 
See full run configs in [MaxText/configs/v5e getting_started/index.md code_organization.md -data_loading.md concepts.md -profiling.md -full_finetuning.md -inference.md -Run_MaxText_via_xpk.md -single_host_gpu.md +guides.md advanced_usage.md about.md terminologies.md From 367da74dc76b041d4c68120a2284cf309a00a5a5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 14:39:06 -0300 Subject: [PATCH 11/22] Remove extra documents --- docs/about.md | 6 ------ docs/{getting_started => advanced_usage}/Run_Llama2.md | 0 .../Run_MaxText_via_multihost_job.md | 0 .../Run_MaxText_via_multihost_runner.md | 0 .../Use_Vertex_AI_Tensorboard.md | 0 docs/{ => advanced_usage}/comparison_to_alternatives.md | 0 docs/{ => advanced_usage}/features_and_diagnostics.md | 0 docs/conf.py | 4 ++++ docs/getting_started/index.md | 2 +- docs/index.md | 2 -- 10 files changed, 5 insertions(+), 9 deletions(-) delete mode 100644 docs/about.md rename docs/{getting_started => advanced_usage}/Run_Llama2.md (100%) rename docs/{getting_started => advanced_usage}/Run_MaxText_via_multihost_job.md (100%) rename docs/{getting_started => advanced_usage}/Run_MaxText_via_multihost_runner.md (100%) rename docs/{getting_started => advanced_usage}/Use_Vertex_AI_Tensorboard.md (100%) rename docs/{ => advanced_usage}/comparison_to_alternatives.md (100%) rename docs/{ => advanced_usage}/features_and_diagnostics.md (100%) diff --git a/docs/about.md b/docs/about.md deleted file mode 100644 index 717c1e5fe..000000000 --- a/docs/about.md +++ /dev/null @@ -1,6 +0,0 @@ -# About MaxText - -```{toctree} -features_and_diagnostics.md -comparison_to_alternatives.md -``` \ No newline at end of file diff --git a/docs/getting_started/Run_Llama2.md b/docs/advanced_usage/Run_Llama2.md similarity index 100% rename from docs/getting_started/Run_Llama2.md rename to docs/advanced_usage/Run_Llama2.md diff --git a/docs/getting_started/Run_MaxText_via_multihost_job.md b/docs/advanced_usage/Run_MaxText_via_multihost_job.md similarity index 100% rename from docs/getting_started/Run_MaxText_via_multihost_job.md rename to docs/advanced_usage/Run_MaxText_via_multihost_job.md diff --git a/docs/getting_started/Run_MaxText_via_multihost_runner.md b/docs/advanced_usage/Run_MaxText_via_multihost_runner.md similarity index 100% rename from docs/getting_started/Run_MaxText_via_multihost_runner.md rename to docs/advanced_usage/Run_MaxText_via_multihost_runner.md diff --git a/docs/getting_started/Use_Vertex_AI_Tensorboard.md b/docs/advanced_usage/Use_Vertex_AI_Tensorboard.md similarity index 100% rename from docs/getting_started/Use_Vertex_AI_Tensorboard.md rename to docs/advanced_usage/Use_Vertex_AI_Tensorboard.md diff --git a/docs/comparison_to_alternatives.md b/docs/advanced_usage/comparison_to_alternatives.md similarity index 100% rename from docs/comparison_to_alternatives.md rename to docs/advanced_usage/comparison_to_alternatives.md diff --git a/docs/features_and_diagnostics.md b/docs/advanced_usage/features_and_diagnostics.md similarity index 100% rename from docs/features_and_diagnostics.md rename to docs/advanced_usage/features_and_diagnostics.md diff --git a/docs/conf.py b/docs/conf.py index e47349f7c..5eb3ee123 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -21,6 +21,10 @@ exclude_patterns = [] source_suffix = [".rst", ".ipynb", ".md"] +exclude_patterns = [ + "advanced_usage.md", + "advanced_usage/*", +] # -- Options for HTML output ------------------------------------------------- # 
https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output diff --git a/docs/getting_started/index.md b/docs/getting_started/index.md index a1c445ee8..ea3ab8f64 100644 --- a/docs/getting_started/index.md +++ b/docs/getting_started/index.md @@ -6,7 +6,7 @@ MaxText supports training and inference of various open models. Some extra helpful guides: * [Gemma](https://ai.google.dev/gemma): a family of open-weights Large Language Model (LLM) by [Google DeepMind](https://deepmind.google/), based on Gemini research and technology. You can run decode and finetuning using [these instructions](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end/tpu/gemma/Run_Gemma.md). -* [Llama2](https://llama.meta.com/llama2/): a family of open-weights Large Language Model (LLM) by Meta. You can run decode and finetuning using [these instructions](Run_Llama2.md). +* [Llama2](https://llama.meta.com/llama2/): a family of open-weights Large Language Model (LLM) by Meta. You can run decode and finetuning using [these instructions](https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/Run_Llama2.md). * [Mixtral](https://mistral.ai/news/mixtral-of-experts/): a family of open-weights sparse mixture-of-experts (MoE) model by Mistral AI. You can run decode and finetuning using [these instructions](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end/tpu/mixtral/Run_Mixtral.md) In addition to the getting started guides, there are always other MaxText capabilities that are being constantly being added! The full suite of end-to-end tests is in [end_to_end](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end). We run them with a nightly cadence. They can be a good source for understanding MaxText Alternatively you can see the continuous [unit tests](https://github.com/AI-Hypercomputer/maxtext/blob/main/.github/workflows/UnitTests.yml) which are run almost continuously. 
diff --git a/docs/index.md b/docs/index.md index 2a6c7f90f..000d04bda 100644 --- a/docs/index.md +++ b/docs/index.md @@ -97,7 +97,5 @@ getting_started/index.md code_organization.md concepts.md guides.md -advanced_usage.md -about.md terminologies.md ``` From 130539a051099dff42b8e1f9d3f4be2e85ca975a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 14:43:59 -0300 Subject: [PATCH 12/22] Reorganize ToC --- docs/concepts.md | 8 ++++---- docs/{ => concepts}/batch_size.md | 0 docs/{ => concepts}/checkpointing.md | 0 docs/{ => concepts}/remat_policy.md | 0 docs/{ => concepts}/sharding.md | 0 .../{getting_started/index.md => getting_started.md} | 12 ++++++------ docs/guides.md | 12 ++++++------ docs/{ => guides}/Run_MaxText_via_xpk.md | 0 docs/{ => guides}/data_loading.md | 2 +- docs/{ => guides}/full_finetuning.md | 0 docs/{ => guides}/inference.md | 0 docs/{ => guides}/profiling.md | 0 docs/{ => guides}/single_host_gpu.md | 0 docs/index.md | 2 +- 14 files changed, 18 insertions(+), 18 deletions(-) rename docs/{ => concepts}/batch_size.md (100%) rename docs/{ => concepts}/checkpointing.md (100%) rename docs/{ => concepts}/remat_policy.md (100%) rename docs/{ => concepts}/sharding.md (100%) rename docs/{getting_started/index.md => getting_started.md} (85%) rename docs/{ => guides}/Run_MaxText_via_xpk.md (100%) rename docs/{ => guides}/data_loading.md (70%) rename docs/{ => guides}/full_finetuning.md (100%) rename docs/{ => guides}/inference.md (100%) rename docs/{ => guides}/profiling.md (100%) rename docs/{ => guides}/single_host_gpu.md (100%) diff --git a/docs/concepts.md b/docs/concepts.md index 3475e378b..4559bf7ca 100644 --- a/docs/concepts.md +++ b/docs/concepts.md @@ -1,8 +1,8 @@ # Concepts ```{toctree} -sharding.md -remat_policy.md -batch_size.md -checkpointing.md +concepts/sharding.md +concepts/remat_policy.md +concepts/batch_size.md +concepts/checkpointing.md ``` \ No newline at end of file diff --git a/docs/batch_size.md b/docs/concepts/batch_size.md similarity index 100% rename from docs/batch_size.md rename to docs/concepts/batch_size.md diff --git a/docs/checkpointing.md b/docs/concepts/checkpointing.md similarity index 100% rename from docs/checkpointing.md rename to docs/concepts/checkpointing.md diff --git a/docs/remat_policy.md b/docs/concepts/remat_policy.md similarity index 100% rename from docs/remat_policy.md rename to docs/concepts/remat_policy.md diff --git a/docs/sharding.md b/docs/concepts/sharding.md similarity index 100% rename from docs/sharding.md rename to docs/concepts/sharding.md diff --git a/docs/getting_started/index.md b/docs/getting_started.md similarity index 85% rename from docs/getting_started/index.md rename to docs/getting_started.md index ea3ab8f64..29510c3c8 100644 --- a/docs/getting_started/index.md +++ b/docs/getting_started.md @@ -1,6 +1,6 @@ # Getting Started -For your first time running MaxText, we provide specific [instructions](First_run.md). +For your first time running MaxText, we provide specific [instructions](getting_started/First_run.md). MaxText supports training and inference of various open models. 
@@ -14,9 +14,9 @@ In addition to the getting started guides, there are always other MaxText capabi ```{toctree} :hidden: -First_run.md -steps_model.md -End-to-end example -Data_Input_Pipeline.md -Data_Input_Perf.md +getting_started/First_run.md +getting_started/steps_model.md +getting_started/End-to-end example +getting_started/Data_Input_Pipeline.md +getting_started/Data_Input_Perf.md ``` diff --git a/docs/guides.md b/docs/guides.md index 01b9305b1..2116ba6b7 100644 --- a/docs/guides.md +++ b/docs/guides.md @@ -1,10 +1,10 @@ # How-to guides ```{toctree} -data_loading.md -profiling.md -full_finetuning.md -inference.md -Run_MaxText_via_xpk.md -single_host_gpu.md +guides/data_loading.md +guides/profiling.md +guides/full_finetuning.md +guides/inference.md +guides/Run_MaxText_via_xpk.md +guides/single_host_gpu.md ``` \ No newline at end of file diff --git a/docs/Run_MaxText_via_xpk.md b/docs/guides/Run_MaxText_via_xpk.md similarity index 100% rename from docs/Run_MaxText_via_xpk.md rename to docs/guides/Run_MaxText_via_xpk.md diff --git a/docs/data_loading.md b/docs/guides/data_loading.md similarity index 70% rename from docs/data_loading.md rename to docs/guides/data_loading.md index 1c0f63c96..6d7eed71e 100644 --- a/docs/data_loading.md +++ b/docs/guides/data_loading.md @@ -8,4 +8,4 @@ Maxtext supports input data pipelines in the following ways: [^1]: Tf.data is the most performant way of loading large scale datasets. -You can read more about the pipelines in [](getting_started/Data_Input_Pipeline.md). +You can read more about the pipelines in [](../getting_started/Data_Input_Pipeline.md). diff --git a/docs/full_finetuning.md b/docs/guides/full_finetuning.md similarity index 100% rename from docs/full_finetuning.md rename to docs/guides/full_finetuning.md diff --git a/docs/inference.md b/docs/guides/inference.md similarity index 100% rename from docs/inference.md rename to docs/guides/inference.md diff --git a/docs/profiling.md b/docs/guides/profiling.md similarity index 100% rename from docs/profiling.md rename to docs/guides/profiling.md diff --git a/docs/single_host_gpu.md b/docs/guides/single_host_gpu.md similarity index 100% rename from docs/single_host_gpu.md rename to docs/guides/single_host_gpu.md diff --git a/docs/index.md b/docs/index.md index 000d04bda..c588c6333 100644 --- a/docs/index.md +++ b/docs/index.md @@ -93,7 +93,7 @@ For 16B, 32B, 64B, and 128B models. 
See full run configs in [MaxText/configs/v5e :maxdepth: 1 :hidden: -getting_started/index.md +getting_started.md code_organization.md concepts.md guides.md From 44d19bba3e76fcfb8e5f297a753681155ddf1895 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 14:50:12 -0300 Subject: [PATCH 13/22] Fix broken links --- docs/getting_started.md | 2 +- docs/getting_started/First_run.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/getting_started.md b/docs/getting_started.md index 29510c3c8..645cce834 100644 --- a/docs/getting_started.md +++ b/docs/getting_started.md @@ -16,7 +16,7 @@ In addition to the getting started guides, there are always other MaxText capabi getting_started/First_run.md getting_started/steps_model.md -getting_started/End-to-end example +End-to-end example getting_started/Data_Input_Pipeline.md getting_started/Data_Input_Perf.md ``` diff --git a/docs/getting_started/First_run.md b/docs/getting_started/First_run.md index 959a0c63e..ea05ee28b 100644 --- a/docs/getting_started/First_run.md +++ b/docs/getting_started/First_run.md @@ -28,7 +28,7 @@ python3 MaxText/train.py MaxText/configs/base.yml \ dataset_type=synthetic \ steps=10 ``` -Next, you can try training on a HugginFace dataset, see [Data Input Pipeline](https://github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md) for data input options. +Next, you can try training on a HugginFace dataset, see [Data Input Pipeline](Data_Input_Pipeline.md) for data input options. 5. If you want to decode, you can decode as follows. ``` @@ -67,9 +67,9 @@ Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.a ## Multihost development There are three patterns for running MaxText with more than one host. -1. [GKE, recommended] [Running Maxtext with xpk](../Run_MaxText_via_xpk.md) - Quick Experimentation and Production support -2. [GCE] [Running Maxtext with Multihost Jobs](Run_MaxText_via_multihost_job.md) - Long Running Production Jobs with Queued Resources -3. [GCE] [Running Maxtext with Multihost Runner](Run_MaxText_via_multihost_runner.md) - Fast experiments via multiple ssh connections. +1. [GKE, recommended] [Running Maxtext with xpk](../guides/Run_MaxText_via_xpk.md) - Quick Experimentation and Production support +2. [GCE] [Running Maxtext with Multihost Jobs](https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/Run_MaxText_via_multihost_job.md) - Long Running Production Jobs with Queued Resources +3. [GCE] [Running Maxtext with Multihost Runner](https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/Run_MaxText_via_multihost_runner.md) - Fast experiments via multiple ssh connections. 
## Preflight Optimizations From beff5723779e0bac5e09f794a6a5923273d75c3a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 19:23:28 -0300 Subject: [PATCH 14/22] Reorganize toctree --- .../Data_Input_Perf.md | 0 .../Data_Input_Pipeline.md | 0 .../Run_Llama2.md | 0 .../Run_MaxText_via_multihost_job.md | 0 .../Run_MaxText_via_multihost_runner.md | 0 .../Run_MaxText_via_xpk.md | 0 .../Use_Vertex_AI_Tensorboard.md | 0 .../comparison_to_alternatives.md | 0 .../features_and_diagnostics.md | 0 .../runtime_performance_results.md | 35 ++++++++++++++++ .../single_host_gpu.md | 0 docs/advanced_usage.md | 8 ---- docs/concepts.md | 5 ++- docs/concepts/attention.md | 9 +++++ .../build_model.png | Bin docs/{ => concepts}/code_organization.md | 0 docs/concepts/quantization.md | 17 ++++++++ docs/concepts/quantization.png | Bin 0 -> 30167 bytes .../steps_model.md | 0 docs/conf.py | 13 ++---- docs/getting_started.md | 6 +-- docs/getting_started/First_run.md | 8 ++-- docs/getting_started/end-to-end.md | 3 ++ .../full_finetuning.md | 0 docs/guides.md | 4 -- docs/guides/data_loading.md | 2 +- docs/index.md | 38 +----------------- 27 files changed, 80 insertions(+), 68 deletions(-) rename docs/{getting_started => advanced_docs}/Data_Input_Perf.md (100%) rename docs/{getting_started => advanced_docs}/Data_Input_Pipeline.md (100%) rename docs/{advanced_usage => advanced_docs}/Run_Llama2.md (100%) rename docs/{advanced_usage => advanced_docs}/Run_MaxText_via_multihost_job.md (100%) rename docs/{advanced_usage => advanced_docs}/Run_MaxText_via_multihost_runner.md (100%) rename docs/{guides => advanced_docs}/Run_MaxText_via_xpk.md (100%) rename docs/{advanced_usage => advanced_docs}/Use_Vertex_AI_Tensorboard.md (100%) rename docs/{advanced_usage => advanced_docs}/comparison_to_alternatives.md (100%) rename docs/{advanced_usage => advanced_docs}/features_and_diagnostics.md (100%) create mode 100644 docs/advanced_docs/runtime_performance_results.md rename docs/{guides => advanced_docs}/single_host_gpu.md (100%) delete mode 100644 docs/advanced_usage.md create mode 100644 docs/concepts/attention.md rename docs/{getting_started => concepts}/build_model.png (100%) rename docs/{ => concepts}/code_organization.md (100%) create mode 100644 docs/concepts/quantization.md create mode 100644 docs/concepts/quantization.png rename docs/{getting_started => concepts}/steps_model.md (100%) create mode 100644 docs/getting_started/end-to-end.md rename docs/{guides => getting_started}/full_finetuning.md (100%) diff --git a/docs/getting_started/Data_Input_Perf.md b/docs/advanced_docs/Data_Input_Perf.md similarity index 100% rename from docs/getting_started/Data_Input_Perf.md rename to docs/advanced_docs/Data_Input_Perf.md diff --git a/docs/getting_started/Data_Input_Pipeline.md b/docs/advanced_docs/Data_Input_Pipeline.md similarity index 100% rename from docs/getting_started/Data_Input_Pipeline.md rename to docs/advanced_docs/Data_Input_Pipeline.md diff --git a/docs/advanced_usage/Run_Llama2.md b/docs/advanced_docs/Run_Llama2.md similarity index 100% rename from docs/advanced_usage/Run_Llama2.md rename to docs/advanced_docs/Run_Llama2.md diff --git a/docs/advanced_usage/Run_MaxText_via_multihost_job.md b/docs/advanced_docs/Run_MaxText_via_multihost_job.md similarity index 100% rename from docs/advanced_usage/Run_MaxText_via_multihost_job.md rename to docs/advanced_docs/Run_MaxText_via_multihost_job.md diff --git a/docs/advanced_usage/Run_MaxText_via_multihost_runner.md 
b/docs/advanced_docs/Run_MaxText_via_multihost_runner.md similarity index 100% rename from docs/advanced_usage/Run_MaxText_via_multihost_runner.md rename to docs/advanced_docs/Run_MaxText_via_multihost_runner.md diff --git a/docs/guides/Run_MaxText_via_xpk.md b/docs/advanced_docs/Run_MaxText_via_xpk.md similarity index 100% rename from docs/guides/Run_MaxText_via_xpk.md rename to docs/advanced_docs/Run_MaxText_via_xpk.md diff --git a/docs/advanced_usage/Use_Vertex_AI_Tensorboard.md b/docs/advanced_docs/Use_Vertex_AI_Tensorboard.md similarity index 100% rename from docs/advanced_usage/Use_Vertex_AI_Tensorboard.md rename to docs/advanced_docs/Use_Vertex_AI_Tensorboard.md diff --git a/docs/advanced_usage/comparison_to_alternatives.md b/docs/advanced_docs/comparison_to_alternatives.md similarity index 100% rename from docs/advanced_usage/comparison_to_alternatives.md rename to docs/advanced_docs/comparison_to_alternatives.md diff --git a/docs/advanced_usage/features_and_diagnostics.md b/docs/advanced_docs/features_and_diagnostics.md similarity index 100% rename from docs/advanced_usage/features_and_diagnostics.md rename to docs/advanced_docs/features_and_diagnostics.md diff --git a/docs/advanced_docs/runtime_performance_results.md b/docs/advanced_docs/runtime_performance_results.md new file mode 100644 index 000000000..a9b7e10be --- /dev/null +++ b/docs/advanced_docs/runtime_performance_results.md @@ -0,0 +1,35 @@ +--- +orphan: true +--- + +# Runtime Performance Results + +More details on reproducing these results can be found in [MaxText/configs/README.md](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/README.md). + +## TPU v5p + +| No. of params | Accelerator Type | TFLOP/chip/sec | Model flops utilization (MFU) | +|---|---|---|---| +| 32B | v5p-128 | 3.28e+02 | 71.47% | +| 64B | v5p-128 | 3.23e+02 | 70.31% | +| 128B | v5p-256 | 3.15e+02 | 68.68% | +| 128B | v5p-512 | 3.15e+02 | 68.53% | +| 256B | v5p-1024 | 3.16e+02 | 68.82% | +| 512B | v5p-1024 | 2.94e+02 | 63.99% | +| 1024B | v5p-2048 | 2.49e+02 | 64.05% | +| 1024B | v5p-4096 | 2.97e+02 | 64.80% | +| 1160B | v5p-7680 | 2.95e+02 | 64.27% | +| 1160B | v5p-12288 | 3.04e+02 | 66.23% | + +## TPU v5e + +For 16B, 32B, 64B, and 128B models. See full run configs in [MaxText/configs/v5e/](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/v5e/) as `16b.sh`, `32b.sh`, `64b.sh`, `128b.sh`. 
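+
+As a quick note on reading the MFU columns in these tables: MFU is the achieved model FLOPs per chip divided by the chip's peak FLOPs. For example, taking the first v5e row below and assuming a peak of roughly 197 bf16 TFLOP/sec per v5e chip (an assumed figure, not one stated on this page), the arithmetic works out as:
+
+```{math}
+\mathrm{MFU} \approx \frac{120\ \mathrm{TFLOP/chip/sec}}{197\ \mathrm{TFLOP/chip/sec}} \approx 61\%
+```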
+ +| Hardware | 16B TFLOP/sec/chip | 16B MFU | 32B TFLOP/sec/chip | 32B MFU | 64B TFLOP/sec/chip | 64B MFU | 128B TFLOP/sec/chip | 128B MFU | +| ----------- | -----------------: | ------- | -----------------: | ------- | -----------------: | ------- | ------------------: | -------- | +| 1x v5e-256 | 120 | 61.10% | 132 | 66.86% | 118 | 59.90% | 110 | 56.06% | +| 2x v5e-256 | 117 | 59.37% | 128 | 64.81% | 112 | 56.66% | 110 | 55.82% | +| 4x v5e-256 | 117 | 59.14% | 126 | 64.10% | 110 | 55.85% | 108 | 54.93% | +| 8x v5e-256 | 115 | 58.27% | 125 | 63.67% | 108 | 54.96% | 104 | 52.93% | +| 16x v5e-256 | 111 | 56.56% | 123 | 62.26% | 105 | 53.29% | 100 | 50.86% | +| 32x v5e-256 | 108 | 54.65% | 119 | 60.40% | 99 | 50.18% | 91 | 46.25% | diff --git a/docs/guides/single_host_gpu.md b/docs/advanced_docs/single_host_gpu.md similarity index 100% rename from docs/guides/single_host_gpu.md rename to docs/advanced_docs/single_host_gpu.md diff --git a/docs/advanced_usage.md b/docs/advanced_usage.md deleted file mode 100644 index d0362646d..000000000 --- a/docs/advanced_usage.md +++ /dev/null @@ -1,8 +0,0 @@ -# Advanced usage - -```{toctree} -getting_started/Run_MaxText_via_multihost_job.md -getting_started/Run_MaxText_via_multihost_runner.md -getting_started/Use_Vertex_AI_Tensorboard.md -getting_started/Run_Llama2.md -``` diff --git a/docs/concepts.md b/docs/concepts.md index 4559bf7ca..348dc100f 100644 --- a/docs/concepts.md +++ b/docs/concepts.md @@ -1,8 +1,11 @@ # Concepts ```{toctree} +concepts/steps_model.md concepts/sharding.md concepts/remat_policy.md +concepts/attention.md concepts/batch_size.md -concepts/checkpointing.md +concepts/quantization.md +concepts/code_organization.md ``` \ No newline at end of file diff --git a/docs/concepts/attention.md b/docs/concepts/attention.md new file mode 100644 index 000000000..f69e950e3 --- /dev/null +++ b/docs/concepts/attention.md @@ -0,0 +1,9 @@ +# Attention + +MaxText supports multiple optimization options for attention calculations. The default is [flash](https://github.com/Dao-AILab/flash-attention). This has been optimized for maximum performance. + +MaxText supports the following values for the `attention` parameter: + +- `flash`: Default and most performant. This is written in Pallas to achieve maximum performance. +- `dot_product`: Works with older versions of TPU, e.g v2,v3. Should be used when flash does not work. +- `cudnn_flash_te`: This is a GPU specific setting. diff --git a/docs/getting_started/build_model.png b/docs/concepts/build_model.png similarity index 100% rename from docs/getting_started/build_model.png rename to docs/concepts/build_model.png diff --git a/docs/code_organization.md b/docs/concepts/code_organization.md similarity index 100% rename from docs/code_organization.md rename to docs/concepts/code_organization.md diff --git a/docs/concepts/quantization.md b/docs/concepts/quantization.md new file mode 100644 index 000000000..ea3841076 --- /dev/null +++ b/docs/concepts/quantization.md @@ -0,0 +1,17 @@ +# Quantization + +Accurated Quantized Training is another technique that maps a subset of matrix multiplications in the training step to int8 to boost training efficiency. + +Jax supports AQT quantization. You can read more about AQT quantization on this [Google Cloud blog](https://cloud.google.com/blog/products/compute/accurate-quantized-training-aqt-for-tpu-v5e). 
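+
+As a quick illustration before the list of accepted values that follows, here is a minimal sketch of what enabling int8 AQT on a training run might look like. It assumes the option is passed as a `quantization=...` override in the same key=value style as the other configuration options shown in these docs, and every other flag below is just a placeholder:
+
+```
+# Hedged sketch: enable int8 AQT on a small training run (all values are illustrative)
+python3 MaxText/train.py MaxText/configs/base.yml \
+run_name=example_aqt_run \
+per_device_batch_size=4 steps=100 \
+quantization=int8 \
+base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
+```
+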
+You can turn on the quantization by adding the following flag `--quantization` and passing one of the following values: + +- 'int8' for dynamic range quantization using 8-bits +- 'int8w' for weights only quantization using 8-bits +- 'int4w' for weights only quantization using 4-bits +- 'intmp' for mixed precision weight only quantization based on config file +- 'fp8' for 8-bit floating-point GeMMs on NVIDIA GPUs. + +```{figure} quantization.png + +EMFU measured using MaxText 128b, context length 2048, trained with synthetic data, using Cloud TPU v5e-256. Measured as of April, 2024. +``` diff --git a/docs/concepts/quantization.png b/docs/concepts/quantization.png new file mode 100644 index 0000000000000000000000000000000000000000..e180d68afed416262978a006f29623c39e54c833 GIT binary patch literal 30167 zcmeFZXH-;Mw=G(V2!aTTf@BfNQE~=Bi2{;yQgW6k8KnRf5hQ0MXF+mC1w@J@IfLXJ zBxmlZz3=zh`|(a|@6Wk^?pbZOT?kdR)*5rpG5YAe&-Lz^qBI^3DGmaGz>}4Etb#yX zDMlbJFJNDW-%LC{mVp-xCka_KY;5eYX{A5#PeTtiEoT*Dw|fqb_GXqgruUrP9Zc^% zdiLxg0)ZVjE2q>c&x(4g!N7o0YVVMpS3zLBAY>zyCzMZrjDdlHi-FC-&B4KYC0t6p zLO_k72Z0$&8mk>ol0XuV@$hXoCPJp-cAH?oBqsuK43xH!sAher$2zePliqT@|K+sZNl6^!6`40z zWF)YcZ6mxdVm%k*YHPK5bxW@8oSl{n3C(vTmU8VR2s}yeTfPE!h7k8oBu_&Bfbe@k zK!6T`pWMy=dHrvRLp8rulYQOaK!H*LA)(r;s#osr14V{n3=9my!gcOD3j+~(N@8dI zin@%9jJc}WA0>Hho$x!PL!@Tc*48#QHda^1GGrq(3U$h@hq{`Y(mle>KMJ{RBnmju z!6*3mDl;;gJ39*v>OC;7USAsinw}nUb?i5DCTYk*k?&c4Wo4y6`@-U4+^`-Ok^Dke%)cm4X?Y@)cEaBkdvbVc? z{rYvAfjspXHvRGO@eXZSXJ=w{L%2a7)GHWyT;7 z-z2{oUrLQbwWAMtIk}CATDSGla><|@0fB*viTq#TdOs@0)0XM1zaymI-<-PGsNED4 zWF2^%-Dg=odiu|yTWOahAkD$WY*xk9B#mkJytTM{df553l(mQtYBjHjp45}?CeScU+e1Xs;Ukr z>pj1I{aWj`Ny%v<)154$P%yX9nUJqp)P+K+DAu`c!W$0K($f>TEw?+ky1A|TbL50B zBmDFS8ko|DiVPv5n@*07J4kMhK8Y%u4_ z%64B5T*%4GM}&vx(aWxwWS7psX!)KWnHOqUFPzdVC95UlJr(lYx0$GM!NR93;H%Hb zFgeNH}G8NA&${;Kf^=_ds#JOwmtTIx4pfcmX@}=yW1Yap2Gh!t)Sp( zd~#%@^WMrpadGjAT>6B_l@O^98*db|^K)_-*w`TIf)GZy_IW->&x4JS44&#(IYf4%m^Cv_JbisHK79CK zCHuMhY2rh*Nz(P7A!O~*tnJ<1d`{;-en`yMKGxJ6LQlt~->ENWf5Hl*rKPpBv^4KZ ztag}wBi#MnAB%>D=Irb&YE{7Jw8}9ZX2C{Jy0rgQJR|{33XzvH%`j&(QZgcpiwx>n zCZ!Pwo18375*Jt3{r!Erk&^kX=_W((y@48+Rg@Td2EV@MAD5wM3e!$PLIN&`&wbm3 zKHE2{W@ct)cvuT7t1u@==KLvzauhZ`evD{yON)quS#^WO$^QC!EZwl8v-2*#*r|~Z zHv_WDVOrwl%dfj5LqiAiF@}@^jwZoGtimE9Nr{QV!ooj({7CjWUS?Fy`lS#IOHV0D zFtd`<^xaBNdTWkyIw_ys8k}=x<`w|~hgnDL7yVikS95c-R#gxJF|4D2j~zwIYs2q2 zCv#*iC-;7~_cE@_((gy^nkHQe)n^Q3m#b@P1UYDPcpb%K!@`Wpj&cgp(q^{j+dbq{ zf1=vjv>UvR;(4qyVrmrPxgR=xmycmXs>fI~f6W=Z4wCFBexjtx=AKPufO!pXS`IR-L~#9O$*xj7R; zCxb>GeUG5E^>vC{w^S6C{WC>R*0_#B>||tq6=;`s=xyzAdf8qK55L6~frap^A9%})Tu?Q5OvP{U2nNT-#zsy~&cMLH-MzZ9(#^=o z=+&!Ni(N^DBx-+F2J#^@%F4>NwY4#_u&}VRyScd?z=#hPXv1>*(bP20^yyZ5QveC? 
[... base85-encoded git binary patch data for docs/concepts/quantization.png trimmed for readability ...]
[... git binary patch data (image file) omitted ...]

-getting_started/Data_Input_Pipeline.md
-getting_started/Data_Input_Perf.md
+getting_started/end-to-end.md
 ```
diff --git a/docs/getting_started/First_run.md b/docs/getting_started/First_run.md
index ea05ee28b..657c99cc8 100644
--- a/docs/getting_started/First_run.md
+++ b/docs/getting_started/First_run.md
@@ -28,7 +28,7 @@ python3 MaxText/train.py MaxText/configs/base.yml \
 dataset_type=synthetic \
 steps=10
 ```
-Next, you can try training on a HugginFace dataset, see [Data Input Pipeline](Data_Input_Pipeline.md) for data input options.
+Next, you can try training on a Hugging Face dataset, see [Data Input Pipeline](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/advanced_docs/Data_Input_Pipeline.md) for data input options.

 5. If you want to decode, you can decode as follows.
 ```
@@ -67,9 +67,9 @@ Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.a
 ## Multihost development
 There are three patterns for running MaxText with more than one host.

-1. [GKE, recommended] [Running Maxtext with xpk](../guides/Run_MaxText_via_xpk.md) - Quick Experimentation and Production support
-2. [GCE] [Running Maxtext with Multihost Jobs](https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/Run_MaxText_via_multihost_job.md) - Long Running Production Jobs with Queued Resources
-3. [GCE] [Running Maxtext with Multihost Runner](https://github.com/AI-Hypercomputer/maxtext/blob/main/getting_started/Run_MaxText_via_multihost_runner.md) - Fast experiments via multiple ssh connections.
+1. [GKE, recommended] [Running Maxtext with xpk](https://github.com/AI-Hypercomputer/maxtext/blob/main/advanced_docs/Run_MaxText_via_xpk.md) - Quick Experimentation and Production support
+2. [GCE] [Running Maxtext with Multihost Jobs](https://github.com/AI-Hypercomputer/maxtext/blob/main/advanced_docs/Run_MaxText_via_multihost_job.md) - Long Running Production Jobs with Queued Resources
+3. [GCE] [Running Maxtext with Multihost Runner](https://github.com/AI-Hypercomputer/maxtext/blob/main/advanced_docs/Run_MaxText_via_multihost_runner.md) - Fast experiments via multiple ssh connections.
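+
+As a rough sketch of the third pattern (illustrative only — the authoritative flags are defined in `multihost_runner.py` and the multihost guide linked above), the same single-host training command can be fanned out to every worker in a slice; `$TPU_PREFIX` and the output bucket below are placeholders:
+
+```
+python3 multihost_runner.py \
+  --TPU_PREFIX=$TPU_PREFIX \
+  --COMMAND="python3 MaxText/train.py MaxText/configs/base.yml \
+    run_name=my_first_multihost_run \
+    base_output_directory=gs://<your-bucket> \
+    dataset_type=synthetic steps=10"
+```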
## Preflight Optimizations diff --git a/docs/getting_started/end-to-end.md b/docs/getting_started/end-to-end.md new file mode 100644 index 000000000..774623ccd --- /dev/null +++ b/docs/getting_started/end-to-end.md @@ -0,0 +1,3 @@ +# End-to-end example + +See the MaxText example Kaggle notebook. diff --git a/docs/guides/full_finetuning.md b/docs/getting_started/full_finetuning.md similarity index 100% rename from docs/guides/full_finetuning.md rename to docs/getting_started/full_finetuning.md diff --git a/docs/guides.md b/docs/guides.md index 2116ba6b7..673000eec 100644 --- a/docs/guides.md +++ b/docs/guides.md @@ -3,8 +3,4 @@ ```{toctree} guides/data_loading.md guides/profiling.md -guides/full_finetuning.md -guides/inference.md -guides/Run_MaxText_via_xpk.md -guides/single_host_gpu.md ``` \ No newline at end of file diff --git a/docs/guides/data_loading.md b/docs/guides/data_loading.md index 6d7eed71e..7c59ee0e4 100644 --- a/docs/guides/data_loading.md +++ b/docs/guides/data_loading.md @@ -8,4 +8,4 @@ Maxtext supports input data pipelines in the following ways: [^1]: Tf.data is the most performant way of loading large scale datasets. -You can read more about the pipelines in [](../getting_started/Data_Input_Pipeline.md). +You can read more about the pipelines in [Data Input Pipeline](https://github.com/AI-Hypercompute/maxtext/blob/main/docs/advanced_docs/Data_Input_Pipeline.md). diff --git a/docs/index.md b/docs/index.md index c588c6333..779df0e01 100644 --- a/docs/index.md +++ b/docs/index.md @@ -18,7 +18,7 @@ ## Overview -MaxText is a a Google initiated open source project for **high performance**, **highly scalable**, **open-source** LLM written in pure Python/[JAX](https://jax.readthedocs.io/en/latest/index.html) and targeting Google Cloud TPUs and GPUs for **training** and **inference**. MaxText achieves [high MFUs](#runtime-performance-results) and scales from single host to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler. +MaxText is a a Google initiated open source project for **high performance**, **highly scalable**, **open-source** LLM written in pure Python/[JAX](https://jax.readthedocs.io/en/latest/index.html) and targeting Google Cloud TPUs and GPUs for **training** and **inference**. MaxText achieves [high MFUs](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/advanced_topics/runtime_performance_results.md) and scales from single host to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler. MaxText achieves very high MFUs (Model Flop Utilization) and scales from single host to very large clusters while staying simple and "optimization-free". @@ -54,48 +54,12 @@ Maxtext today only supports Pre-training and Full Fine Tuning of the models. It - Any individual or a company that is interested in forking maxtext and seeing it as a reference implementation of a high performance Large Language Models and wants to build their own LLMs on TPU and GPU. - Any individual or a company that is interested in performing a pre-training or Full Fine Tuning of the supported open source models, can use Maxtext as a blackbox to perform full fine tuning. Maxtext attains an extremely high MFU, resulting in large savings in training costs. -## Runtime Performance Results - -More details on reproducing these results can be found in [MaxText/configs/README.md](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/README.md). - -### TPU v5p - -| No. 
of params | Accelerator Type | TFLOP/chip/sec | Model flops utilization (MFU) | -|---|---|---|---| -| 32B | v5p-128 | 3.28e+02 | 71.47% | -| 64B | v5p-128 | 3.23e+02 | 70.31% | -| 128B | v5p-256 | 3.15e+02 | 68.68% | -| 128B | v5p-512 | 3.15e+02 | 68.53% | -| 256B | v5p-1024 | 3.16e+02 | 68.82% | -| 512B | v5p-1024 | 2.94e+02 | 63.99% | -| 1024B | v5p-2048 | 2.49e+02 | 64.05% | -| 1024B | v5p-4096 | 2.97e+02 | 64.80% | -| 1160B | v5p-7680 | 2.95e+02 | 64.27% | -| 1160B | v5p-12288 | 3.04e+02 | 66.23% | - -### TPU v5e - -For 16B, 32B, 64B, and 128B models. See full run configs in [MaxText/configs/v5e/](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/v5e/) as `16b.sh`, `32b.sh`, `64b.sh`, `128b.sh`. - -| Hardware | 16B TFLOP/sec/chip | 16B MFU | 32B TFLOP/sec/chip | 32B MFU | 64B TFLOP/sec/chip | 64B MFU | 128B TFLOP/sec/chip | 128B MFU | -| ----------- | -----------------: | ------- | -----------------: | ------- | -----------------: | ------- | ------------------: | -------- | -| 1x v5e-256 | 120 | 61.10% | 132 | 66.86% | 118 | 59.90% | 110 | 56.06% | -| 2x v5e-256 | 117 | 59.37% | 128 | 64.81% | 112 | 56.66% | 110 | 55.82% | -| 4x v5e-256 | 117 | 59.14% | 126 | 64.10% | 110 | 55.85% | 108 | 54.93% | -| 8x v5e-256 | 115 | 58.27% | 125 | 63.67% | 108 | 54.96% | 104 | 52.93% | -| 16x v5e-256 | 111 | 56.56% | 123 | 62.26% | 105 | 53.29% | 100 | 50.86% | -| 32x v5e-256 | 108 | 54.65% | 119 | 60.40% | 99 | 50.18% | 91 | 46.25% | - - - ```{toctree} :maxdepth: 1 :hidden: getting_started.md -code_organization.md concepts.md guides.md -terminologies.md ``` From 52b2c9d969f27a95f643c8d0d6060e35b296b634 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 19:26:47 -0300 Subject: [PATCH 15/22] Add repository information to landing page --- docs/index.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/index.md b/docs/index.md index 779df0e01..5e01d6be5 100644 --- a/docs/index.md +++ b/docs/index.md @@ -54,6 +54,13 @@ Maxtext today only supports Pre-training and Full Fine Tuning of the models. It - Any individual or a company that is interested in forking maxtext and seeing it as a reference implementation of a high performance Large Language Models and wants to build their own LLMs on TPU and GPU. - Any individual or a company that is interested in performing a pre-training or Full Fine Tuning of the supported open source models, can use Maxtext as a blackbox to perform full fine tuning. Maxtext attains an extremely high MFU, resulting in large savings in training costs. +## Code repository + +You can find the latest version of MaxText at https://github.com/AI-Hypercomputer/maxtext + +## In-depth documentation + +You can find in-depth documentation at [the MaxText GitHub repository](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/advanced_docs/). 
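+
+As an illustrative sketch (not part of the original text), a local copy of the repository mentioned above is typically obtained by cloning it and running its `setup.sh` script, which in the default flow installs JAX and the other MaxText requirements:
+
+```
+git clone https://github.com/AI-Hypercomputer/maxtext.git
+cd maxtext
+bash setup.sh
+```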
 ```{toctree}
 :maxdepth: 1
 :hidden:

 getting_started.md
 concepts.md
 guides.md
 ```

From d6a654f498e4edbf5226ccfca8afa48f358ecea5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?=
Date: Wed, 30 Oct 2024 19:31:24 -0300
Subject: [PATCH 16/22] Add logo and move images to _static folder

---
 docs/{concepts => _static}/build_model.png  | Bin
 docs/_static/flax.png                       | Bin 0 -> 5802 bytes
 docs/{concepts => _static}/quantization.png | Bin
 docs/concepts/quantization.md               |   2 +-
 docs/concepts/steps_model.md                |   6 ++++--
 docs/conf.py                                |   1 +
 6 files changed, 6 insertions(+), 3 deletions(-)
 rename docs/{concepts => _static}/build_model.png (100%)
 create mode 100644 docs/_static/flax.png
 rename docs/{concepts => _static}/quantization.png (100%)

diff --git a/docs/concepts/build_model.png b/docs/_static/build_model.png
similarity index 100%
rename from docs/concepts/build_model.png
rename to docs/_static/build_model.png
diff --git a/docs/_static/flax.png b/docs/_static/flax.png
new file mode 100644
index 0000000000000000000000000000000000000000..e3b178824464e00e7fc1b1a609063c017a139837
GIT binary patch
literal 5802

[... base85-encoded PNG data omitted ...]

literal 0
HcmV?d00001

diff --git a/docs/concepts/quantization.png b/docs/_static/quantization.png
similarity index 100%
rename from docs/concepts/quantization.png
rename to docs/_static/quantization.png
diff --git a/docs/concepts/quantization.md b/docs/concepts/quantization.md
index ea3841076..aa3319ca2 100644
--- a/docs/concepts/quantization.md
+++ b/docs/concepts/quantization.md
@@ -11,7 +11,7 @@ You can turn on the quantization by adding the following flag `--quantization` a
 - 'intmp' for mixed precision weight only quantization based on config file
 - 'fp8' for 8-bit floating-point GeMMs on NVIDIA GPUs.

-```{figure} quantization.png
+```{figure} ../_static/quantization.png

 EMFU measured using MaxText 128b, context length 2048, trained with synthetic data, using Cloud TPU v5e-256. Measured as of April, 2024.
``` diff --git a/docs/concepts/steps_model.md b/docs/concepts/steps_model.md index 45c997249..5b1c96fcf 100644 --- a/docs/concepts/steps_model.md +++ b/docs/concepts/steps_model.md @@ -1,7 +1,9 @@ # Steps to build a Model -![](build_model.png) -_Fig1: Stages of LLM Model Development from pre-training to fine tuning and finally serving a model._ +```{figure} ../_static/build_model.png + +Stages of LLM Model Development from pre-training to fine tuning and finally serving a model. +``` Model building starts with Pre-training a base model architecture. Pre-training is the process where you take a model architecture, which starts with random weights and train with a very large corpus in the scale of trillions of tokens. E.g. Google’s Gemma models were pre-trained on 6 Trillion tokens; LLama 3 was trained with 15 Trillion tokens diff --git a/docs/conf.py b/docs/conf.py index a0d1e1acc..d777ee7b3 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -32,6 +32,7 @@ html_theme = "sphinx_book_theme" html_static_path = [] +html_logo = "_static/flax.png" # -- Options for myst ---------------------------------------------- myst_heading_anchors = 3 # auto-generate 3 levels of heading anchors From cc8440b22999ef78873205932e6bbf838cdaa029 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 19:43:15 -0300 Subject: [PATCH 17/22] Add sphinx design cards --- docs/conf.py | 1 + docs/index.md | 39 +++++++++++++++++++++++++++++++++++++++ docs/requirements.txt | 3 ++- 3 files changed, 42 insertions(+), 1 deletion(-) diff --git a/docs/conf.py b/docs/conf.py index d777ee7b3..0ddd5e01e 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -15,6 +15,7 @@ extensions = [ "myst_nb", + "sphinx_design", ] templates_path = ["_templates"] diff --git a/docs/index.md b/docs/index.md index 5e01d6be5..76b7aa327 100644 --- a/docs/index.md +++ b/docs/index.md @@ -54,6 +54,45 @@ Maxtext today only supports Pre-training and Full Fine Tuning of the models. It - Any individual or a company that is interested in forking maxtext and seeing it as a reference implementation of a high performance Large Language Models and wants to build their own LLMs on TPU and GPU. - Any individual or a company that is interested in performing a pre-training or Full Fine Tuning of the supported open source models, can use Maxtext as a blackbox to perform full fine tuning. Maxtext attains an extremely high MFU, resulting in large savings in training costs. 
+## Learn more + +::::{grid} 1 1 2 2 +:gutter: 2 +:::{grid-item-card} +Full finetuning and training with Llama3 +:link: getting_started/full_finetuning.md +:link-type: ref +::: + +:::{grid-item-card} +First run +:link: getting_started/First_run.md +:link-type: ref +::: +:::: + +::::{grid} 2 +:gutter: 1 + +:::{grid-item-card} +A +::: +:::{grid-item-card} +B +::: +:::: + +::::{grid} 2 +:gutter: 3 3 4 5 + +:::{grid-item-card} +A +::: +:::{grid-item-card} +B +::: +:::: + ## Code repository You can find the latest version of MaxText at https://github.com/AI-Hypercomputer/maxtext diff --git a/docs/requirements.txt b/docs/requirements.txt index b14ee0fa4..92c3ee019 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -2,4 +2,5 @@ sphinx myst-nb myst-parser[linkify] -sphinx-book-theme \ No newline at end of file +sphinx-book-theme +sphinx-design From a1474e2bc7d70f53d707535f7f44d7e364002bbf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 19:53:39 -0300 Subject: [PATCH 18/22] Add card styling --- docs/conf.py | 1 + docs/index.md | 33 +++++++-------------------------- 2 files changed, 8 insertions(+), 26 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index 0ddd5e01e..e1a6eb08a 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -40,4 +40,5 @@ myst_enable_extensions = [ "dollarmath", "linkify", + "colon_fence", ] diff --git a/docs/index.md b/docs/index.md index 76b7aa327..a56075fbc 100644 --- a/docs/index.md +++ b/docs/index.md @@ -58,38 +58,19 @@ Maxtext today only supports Pre-training and Full Fine Tuning of the models. It ::::{grid} 1 1 2 2 :gutter: 2 -:::{grid-item-card} -Full finetuning and training with Llama3 -:link: getting_started/full_finetuning.md -:link-type: ref -::: :::{grid-item-card} -First run -:link: getting_started/First_run.md -:link-type: ref -::: -:::: +:link: getting_started/full_finetuning.html +:class-card: sd-text-black sd-bg-light -::::{grid} 2 -:gutter: 1 - -:::{grid-item-card} -A +{material-regular}`settings;2em` Full finetuning and training with Llama3 ::: -:::{grid-item-card} -B -::: -:::: - -::::{grid} 2 -:gutter: 3 3 4 5 :::{grid-item-card} -A -::: -:::{grid-item-card} -B +:link: getting_started/First_run.html +:class-card: sd-text-black sd-bg-light + +{material-regular}`rocket_launch;2em` First run ::: :::: From 4cf9c863a7d11274190d0e0702a213577b501f50 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Wed, 30 Oct 2024 20:02:14 -0300 Subject: [PATCH 19/22] Add sharding implementation details section --- docs/concepts/sharding.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/concepts/sharding.md b/docs/concepts/sharding.md index d146e4085..2fe56d3af 100644 --- a/docs/concepts/sharding.md +++ b/docs/concepts/sharding.md @@ -88,3 +88,14 @@ dcn_pipeline_parallelism: 1 dcn_expert_parallelism: 1 dcn_autoregressive_parallelism: 1 # never recommended ``` + +## Sharding implementation details + +You may think of the sharding in the maxtext codebase as split into three levels +1. The physical mesh where e.g. `ici_fsdp_parallelism` is used - see [`create_device_mesh`](https://github.com/AI-Hypercomputer/maxtext/blob/e7c4824ee9cc13fd6db863796bbe7696b03eb448/MaxText/max_utils.py#L363) +2. The logical names, with physical <-> logical mappings [here](https://github.com/AI-Hypercomputer/maxtext/blob/e7c4824ee9cc13fd6db863796bbe7696b03eb448/MaxText/configs/base.yml#L211-L248) +3. 
Individual tensors which will use logical names, here is one [example](https://github.com/AI-Hypercomputer/maxtext/blob/e7c4824ee9cc13fd6db863796bbe7696b03eb448/MaxText/layers/linears.py#L243) + +Following this example we see the first axis is sharded by logical name "embed". This logical name maps the physical names "fsdp, fsdp_transpose, sequence, expert", thus this axes will get sharded by the product of these specified parallelisms. E.g. if `ici_fsdp_parallelism=4` and `ici_sequence_parallelism=2` then this array axis will get sharded 8 ways. + +This example showed a "kernel_axes" which is used to define a weight matrix. For activations we use shardings hints for the compiler such as `nn.with_logical_constraint` (example [here](https://github.com/AI-Hypercomputer/maxtext/blob/e7c4824ee9cc13fd6db863796bbe7696b03eb448/MaxText/layers/linears.py#L261)). This will generally shard the activations according to these constraints, but the compiler occasionally chooses a different sharding other that what we specified for these activations. From ebbb6317aed0e04b5dc8972d05c7b62d648f7a80 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Thu, 31 Oct 2024 14:33:08 -0300 Subject: [PATCH 20/22] Apply suggestions from code review --- docs/concepts/steps_model.md | 2 +- docs/getting_started/end-to-end.md | 2 +- docs/getting_started/full_finetuning.md | 2 +- docs/index.md | 6 +++--- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/concepts/steps_model.md b/docs/concepts/steps_model.md index 5b1c96fcf..079e587dd 100644 --- a/docs/concepts/steps_model.md +++ b/docs/concepts/steps_model.md @@ -5,7 +5,7 @@ Stages of LLM Model Development from pre-training to fine tuning and finally serving a model. ``` -Model building starts with Pre-training a base model architecture. Pre-training is the process where you take a model architecture, which starts with random weights and train with a very large corpus in the scale of trillions of tokens. E.g. Google’s Gemma models were pre-trained on 6 Trillion tokens; LLama 3 was trained with 15 Trillion tokens +Model building starts with Pre-training a base model architecture. Pre-training is the process where you take a model architecture, which starts with random weights and train with a very large corpus in the scale of trillions of tokens. E.g. Google’s Gemma models was pre-trained on 6 Trillion tokens; LLama 3 was trained with 15 Trillion tokens Post the pre-training most model producers will publish a checkpoint of the weights of the model. The corpus used for pre-training these models are usually a large public corpus like Common Crawl, public code bases, books etc. diff --git a/docs/getting_started/end-to-end.md b/docs/getting_started/end-to-end.md index 774623ccd..3f59fd65c 100644 --- a/docs/getting_started/end-to-end.md +++ b/docs/getting_started/end-to-end.md @@ -1,3 +1,3 @@ # End-to-end example -See the MaxText example Kaggle notebook. +See the MaxText example Kaggle notebook. diff --git a/docs/getting_started/full_finetuning.md b/docs/getting_started/full_finetuning.md index 6faada38a..0028633b9 100644 --- a/docs/getting_started/full_finetuning.md +++ b/docs/getting_started/full_finetuning.md @@ -1,4 +1,4 @@ -# Full Finetuninhg LLama2/LLama3 Optimized configuration +# Full Finetuning LLama3-8B Model In the pre-training section you saw the steps on how to do pre-training with MaxText. 
To perform full fine tuning, you need to pass the checkpoint to the diff --git a/docs/index.md b/docs/index.md index a56075fbc..49a5a6457 100644 --- a/docs/index.md +++ b/docs/index.md @@ -42,8 +42,8 @@ These reference implementations support pre-training and full fine tuning. Maxte The key value proposition of using MaxText for pre-training or full fine tuning is: - Very high performance of average of 50% MFU -- Open code base -- Easy to understand: MaxText is purely written in JAX and Python, which makes it accessible to ML developers interested in inspecting the implementation or stepping through it. It is written at the block-by-block level, with code for Embeddings, Attention, Normalization etc. Different Attention mechanisms like MQA and GQA are all present. For quantization, it uses the JAX AQT library. The implementation is suitable for both GPUs and TPUs. +- [Open code base](https://github.com/AI-Hypercomputer/maxtext) +- Easy to understand: MaxText is purely written in JAX and Python, which makes it accessible to ML developers interested in inspecting the implementation or stepping through it. It is written at the block-by-block level, with code for Embeddings, Attention, Normalization etc. Different Attention mechanisms like MQA and GQA are all present. For quantization, it uses the [JAX AQT](https://github.com/google/aqt) library. The implementation is suitable for both GPUs and TPUs. ```{note} Maxtext today only supports Pre-training and Full Fine Tuning of the models. It does not support PEFT/LoRA, Supervised Fine Tuning or RLHF. @@ -51,7 +51,7 @@ Maxtext today only supports Pre-training and Full Fine Tuning of the models. It ## Who are the target users of MaxText? -- Any individual or a company that is interested in forking maxtext and seeing it as a reference implementation of a high performance Large Language Models and wants to build their own LLMs on TPU and GPU. +- Any individual or a company that is interested in forking maxtext and seeing it as a reference implementation of a high performance Large Language Models and wants to build their own LLMs on TPUs or GPUs. - Any individual or a company that is interested in performing a pre-training or Full Fine Tuning of the supported open source models, can use Maxtext as a blackbox to perform full fine tuning. Maxtext attains an extremely high MFU, resulting in large savings in training costs. ## Learn more From 6036ca7894726c1e8f43619b2ba7540af13449e0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?= Date: Thu, 31 Oct 2024 14:37:10 -0300 Subject: [PATCH 21/22] Add link --- docs/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/index.md b/docs/index.md index 49a5a6457..f6c573e15 100644 --- a/docs/index.md +++ b/docs/index.md @@ -43,7 +43,7 @@ The key value proposition of using MaxText for pre-training or full fine tuning - Very high performance of average of 50% MFU - [Open code base](https://github.com/AI-Hypercomputer/maxtext) -- Easy to understand: MaxText is purely written in JAX and Python, which makes it accessible to ML developers interested in inspecting the implementation or stepping through it. It is written at the block-by-block level, with code for Embeddings, Attention, Normalization etc. Different Attention mechanisms like MQA and GQA are all present. For quantization, it uses the [JAX AQT](https://github.com/google/aqt) library. The implementation is suitable for both GPUs and TPUs. 
+- Easy to understand: MaxText is purely written in JAX and Python, which makes it accessible to ML developers interested in inspecting the implementation or stepping through it. It is written at the [block-by-block](https://github.com/AI-Hypercomputer/maxtext/tree/main/MaxText/layers) level, with code for Embeddings, Attention, Normalization etc. Different Attention mechanisms like MQA and GQA are all present. For quantization, it uses the [JAX AQT](https://github.com/google/aqt) library. The implementation is suitable for both GPUs and TPUs.

 ```{note}
 Maxtext today only supports Pre-training and Full Fine Tuning of the models. It does not support PEFT/LoRA, Supervised Fine Tuning or RLHF.

From 66d902a2ec87f5700305197721f17ecfcc87425e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Melissa=20Weber=20Mendon=C3=A7a?=
Date: Mon, 18 Nov 2024 12:23:24 -0300
Subject: [PATCH 22/22] Remove logo

---
 docs/_static/flax.png | Bin 5802 -> 0 bytes
 docs/conf.py          |   2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 delete mode 100644 docs/_static/flax.png

diff --git a/docs/_static/flax.png b/docs/_static/flax.png
deleted file mode 100644
index e3b178824464e00e7fc1b1a609063c017a139837..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 5802

[... base85-encoded PNG data omitted ...]

diff --git a/docs/conf.py b/docs/conf.py
index e1a6eb08a..5ca1b4878 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -33,7 +33,7 @@

 html_theme = "sphinx_book_theme"
 html_static_path = []
-html_logo = "_static/flax.png"
+# html_logo = "_static/flax.png"

 # -- Options for myst ----------------------------------------------
 myst_heading_anchors = 3  # auto-generate 3 levels of heading anchors
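
With the Sphinx configuration above and the `docs/requirements.txt` touched by this series in place, the pages build like any other Sphinx project. A minimal local build might look like the following (an illustrative sketch; the repository may also expose its own make target for this):

```
pip install -r docs/requirements.txt
sphinx-build -b html docs docs/_build/html
```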