diff --git a/examples/benchmarks/bert/README.md b/examples/benchmarks/bert/README.md
index 6832e5160..d1d2e1dae 100644
--- a/examples/benchmarks/bert/README.md
+++ b/examples/benchmarks/bert/README.md
@@ -1,15 +1,103 @@
-# Welcome!
-
-This benchmark covers both pre-training and fine-tuning a BERT model. With this starter code, you'll be able to do Masked Language Modeling (MLM) [pre-training](#mlm-pre-training) on the C4 dataset and classification [fine-tuning](#glue-fine-tuning) on GLUE benchmark tasks. We also provide the source code and recipe behind our [Mosaic BERT](#mosaic-bert) model, which you can train yourself using this repo.
+# 🚀 MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
+
+MosaicML x Databricks
+
+jacob.portes@databricks.com, *equal contribution
+
+## TL;DR: How to Speed Up Transformer Pretraining
+
+MosaicBERT is a custom BERT architecture optimized for fast pretraining. This study motivated many of the architecture choices behind MosaicML's MPT-7B and MPT-30B models. Below are the main architectural modifications used by MosaicBERT for rapid pretraining 👇
+
+*(Figure: the main architectural modifications used by MosaicBERT)*
+
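+As one concrete illustration of these architectural changes, here is a minimal sketch of ALiBi (Attention with Linear Biases), which MosaicBERT uses in place of learned position embeddings. The helper functions below are hypothetical and simplified (power-of-two head counts only); see `src/bert_layers.py` for the repo's actual implementation.
+
+```python
+import torch
+
+
+def alibi_slopes(n_heads: int) -> torch.Tensor:
+    # Per-head slopes from the ALiBi paper: a geometric sequence starting at
+    # 2**(-8/n_heads). (Simplified here to power-of-two head counts.)
+    start = 2 ** (-8.0 / n_heads)
+    return torch.tensor([start ** (i + 1) for i in range(n_heads)])
+
+
+def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
+    # Penalize attention scores linearly in the distance between query and key
+    # positions; a bidirectional encoder uses the symmetric (absolute) distance.
+    positions = torch.arange(seq_len)
+    distances = (positions[None, :] - positions[:, None]).abs()  # (seq_len, seq_len)
+    slopes = alibi_slopes(n_heads).view(-1, 1, 1)                # (n_heads, 1, 1)
+    return -slopes * distances                                   # (n_heads, seq_len, seq_len)
+
+
+# Added to the raw attention scores before the softmax, this bias replaces
+# position embeddings and lets the model extrapolate to longer sequences.
+print(alibi_bias(n_heads=8, seq_len=128).shape)  # torch.Size([8, 128, 128])
+```
+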
+And here are a few more efficiency tips used by MosaicBERT (a short code sketch illustrating them follows below):
+
+- Increase the Masked Language Modeling ratio to 30% (instead of the default 15%)
+- Remove dropout from the attention module (dropout often slows things down)
+- Use bfloat16!
+- Make your vocab size a multiple of 64 (Andrej Karpathy says so!)
+
+All of these modifications (except for the MLM masking ratio) can be applied to decoder architectures such as GPT and MPT.
+
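+Below is a minimal sketch of how these tips map onto a plain HuggingFace/PyTorch setup. It is illustrative only and is not this repo's code (which wires these settings through its YAML configs and the Composer Trainer); the `bert-base-uncased` checkpoint and all variable names here are placeholder choices.
+
+```python
+import math
+
+import torch
+from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
+                          DataCollatorForLanguageModeling)
+
+tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+
+# Tip 1: mask 30% of tokens instead of the default 15%.
+collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.30)
+
+# Tip 4: pad the vocab size up to the next multiple of 64 for friendlier GPU kernel shapes.
+padded_vocab_size = 64 * math.ceil(tokenizer.vocab_size / 64)
+
+# Tip 2: remove dropout from the attention module.
+config = BertConfig(vocab_size=padded_vocab_size, attention_probs_dropout_prob=0.0)
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = BertForMaskedLM(config).to(device)
+
+# Tip 3: train in bfloat16 via autocast (with Composer, setting `precision: amp_bf16` plays this role).
+batch = collator([tokenizer("MosaicBERT is a custom BERT architecture optimized for fast pretraining on the C4 dataset.")["input_ids"]])
+batch = {k: v.to(device) for k, v in batch.items()}
+with torch.autocast(device_type=device, dtype=torch.bfloat16):
+    loss = model(**batch).loss
+loss.backward()
+```
+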
+## Citation
+```
+@article{portes2023mosaicbert,
+ title={{MosaicBERT}: A Bidirectional Encoder Optimized for Fast Pretraining},
+ author={Portes, Jacob and Trott, Alexander and Havens, Sam and King, Daniel and Venigalla, Abhinav and Nadeem, Moin and Sardana, Nikhil and Khudia, Daya and Frankle, Jonathan},
+ journal={Advances in Neural Information Processing Systems},
+ volume={36},
+ pages={3106--3130},
+ year={2023}
+}
+```
## Contents
+This benchmark covers both pre-training and fine-tuning a BERT model. With this starter code, you'll be able to do Masked Language Modeling (MLM) [pre-training](#mlm-pre-training) on the C4 dataset and classification [fine-tuning](#glue-fine-tuning) on GLUE benchmark tasks. We also provide the source code and recipe behind our [MosaicBERT](#mosaic-bert) model, which you can train yourself using this repo.
You'll find in this folder:
### Pre-training
- `main.py` — A straightforward script for parsing YAMLs, building a [Composer](https://github.com/mosaicml/composer) Trainer, and kicking off an MLM pre-training job, locally or on the MosaicML platform.
-- `yamls/main/` - Pre-baked configs for pre-training both our sped-up Mosaic BERT as well as the reference HuggingFace BERT. These are used when running `main.py`.
+- `yamls/main/` - Pre-baked configs for pre-training both our sped-up MosaicBERT as well as the reference HuggingFace BERT. These are used when running `main.py`.
- `yamls/test/main.yaml` - A config for quickly verifying that `main.py` runs.
### Fine-tuning
@@ -18,17 +106,17 @@ You'll find in this folder:
- `glue.py` - A more complex script for parsing YAMLs and orchestrating the numerous fine-tuning training jobs across 8 GLUE tasks (we exclude the WNLI task here), locally or on the MosaicML platform.
- `src/glue/data.py` - Datasets used by `glue.py` in GLUE fine-tuning.
- `src/glue/finetuning_jobs.py` - Custom classes, one for each GLUE task, instantiated by `glue.py`. These handle individual fine-tuning jobs and task-specific hyperparameters.
-- `yamls/finetuning/` - Pre-baked configs for fine-tuning both our sped-up Mosaic BERT as well as the reference HuggingFace BERT. These are used when running `sequence_classification.py` and `glue.py`.
+- `yamls/finetuning/` - Pre-baked configs for fine-tuning both our sped-up MosaicBERT as well as the reference HuggingFace BERT. These are used when running `sequence_classification.py` and `glue.py`.
- `yamls/test/sequence_classification.yaml` - A config for quickly verifying that `sequence_classification.py` runs.
- `yamls/test/glue.yaml` - A config for quickly verifying that `glue.py` runs.
### Shared
- `src/hf_bert.py` — HuggingFace BERT models for MLM (pre-training) or classification (GLUE fine-tuning), wrapped in [`ComposerModel`s](https://docs.mosaicml.com/en/stable/api_reference/generated/composer.models.HuggingFaceModel.html) for compatibility with the [Composer Trainer](https://docs.mosaicml.com/en/stable/api_reference/generated/composer.Trainer.html#composer.Trainer).
-- `src/mosaic_bert.py` — Mosaic BERT models for MLM (pre-training) or classification (GLUE fine-tuning). See [Mosaic BERT](#mosaic-bert) for more.
-- `src/bert_layers.py` — The Mosaic BERT layers/modules with our custom speed up methods built in, with an eye towards HuggingFace API compatibility.
-- `src/bert_padding.py` — Utilities for Mosaic BERT that help avoid padding overhead.
-- `src/flash_attn_triton.py` - Source code for the [FlashAttention](https://arxiv.org/abs/2205.14135) implementation used in Mosaic BERT.
+- `src/mosaic_bert.py` — MosaicBERT models for MLM (pre-training) or classification (GLUE fine-tuning). See [MosaicBERT](#mosaic-bert) for more.
+- `src/bert_layers.py` — The MosaicBERT layers/modules with our custom speed-up methods built in, written with an eye towards HuggingFace API compatibility.
+- `src/bert_padding.py` — Utilities for MosaicBERT that help avoid padding overhead.
+- `src/flash_attn_triton.py` - Source code for the [FlashAttention](https://arxiv.org/abs/2205.14135) implementation used in MosaicBERT.
- `src/text_data.py`- a [MosaicML streaming dataset](https://streaming.docs.mosaicml.com/en/stable/) that can be used with a vanilla PyTorch dataloader.
- `src/convert_dataset.py` - A script to convert a text dataset from HuggingFace to a [MosaicML streaming dataset](https://streaming.docs.mosaicml.com/en/stable/).
- `requirements.txt` — All needed Python dependencies.
@@ -123,26 +211,26 @@ With our data prepared, we can now start training.
### Test pre-training
To verify that pre-training runs correctly, first prepare a local copy of the C4 validation split (see the above section), and then run the `main.py` pre-training script twice using our testing config.
-First, with the baseline HuggingFace BERT. Second, with the Mosaic BERT.
+First, with the baseline HuggingFace BERT; second, with MosaicBERT.
```bash
# Run the pre-training script with the test config and HuggingFace BERT
composer main.py yamls/test/main.yaml
-# Run the pre-training script with the test config and Mosaic BERT
+# Run the pre-training script with the test config and MosaicBERT
composer main.py yamls/test/main.yaml model.name=mosaic_bert
```
### Test fine-tuning
-To verify that fine-tuning runs correctly, run the fine-tuning script using our testing configs and both the HuggingFace and Mosaic BERT models.
-First, verify `sequence_classification.py` with the baseline HuggingFace BERT and again with the Mosaic BERT.
+To verify that fine-tuning runs correctly, run the fine-tuning script using our testing configs and both the HuggingFace and MosaicBERT models.
+First, verify `sequence_classification.py` with the baseline HuggingFace BERT and again with MosaicBERT.
```bash
# Run the fine-tuning script with the test config and HuggingFace BERT
composer sequence_classification.py yamls/test/sequence_classification.yaml
-# Run the fine-tuning script with the test config and Mosaic BERT
+# Run the fine-tuning script with the test config and MosaicBERT
composer sequence_classification.py yamls/test/sequence_classification.yaml model.name=mosaic_bert
```
@@ -152,7 +240,7 @@ Second, verify `glue.py` for both models.
# Run the GLUE script with the test config and HuggingFace BERT
python glue.py yamls/test/glue.yaml && rm -rf local-finetune-checkpoints
-# Run the GLUE script with the test config and Mosaic BERT
+# Run the GLUE script with the test config and MosaicBERT
python glue.py yamls/test/glue.yaml model.name=mosaic_bert && rm -rf local-finetune-checkpoints
```
@@ -168,7 +256,7 @@ This is already done in the testing YAML `yamls/test/main.py`, which you can als
### MLM pre-training
-To get the most out of your pre-training budget, we recommend using **Mosaic BERT**! You can read more [below](#mosaic-bert).
+To get the most out of your pre-training budget, we recommend using **MosaicBERT**! You can read more [below](#mosaic-bert).
We run the `main.py` pre-training script using our `composer` launcher, which generates N processes (1 process per GPU device).
If training on a single node, the `composer` launcher will autodetect the number of devices.
@@ -178,7 +266,7 @@ If training on a single node, the `composer` launcher will autodetect the number
# It takes about 11.5 hours on a single node with 8 A100_80g GPUs.
composer main.py yamls/main/hf-bert-base-uncased.yaml
-# This will pre-train a Mosaic BERT that reaches the same downstream accuracy in roughly 1/3 the time.
+# This will pre-train a MosaicBERT model that reaches the same downstream accuracy in roughly 1/3 the time.
composer main.py yamls/main/mosaic-bert-base-uncased.yaml
```
@@ -212,7 +300,7 @@ Once you have modified the YAMLs in `yamls/glue/` to reference your pre-trained
# This will run GLUE fine-tuning evaluation on your HuggingFace BERT
python glue.py yamls/finetuning/glue/hf-bert-base-uncased.yaml
-# This will run GLUE fine-tuning evaluation on your Mosaic BERT
+# This will run GLUE fine-tuning evaluation on your MosaicBERT
python glue.py yamls/finetuning/glue/mosaic-bert-base-uncased.yaml
```
@@ -247,7 +335,7 @@ Before using the configs in `yamls/main/` when running `main.py`, you'll need to
Before using the configs in `yamls/finetuning/` when running `sequence_classification.py`, you'll need to fill in:
-- `load_path` (optional) - If you have a checkpoint that you'd like to start from, this is how you set that. If you're fine-tuning a Mosaic BERT, this should not be left empty.
+- `load_path` (optional) - Set this to the checkpoint you'd like to start from, if you have one. If you're fine-tuning a MosaicBERT model, this should not be left empty.
- `save_folder` - This will determine where model checkpoints are saved. Note that it can depend on `run_name`. For example, if you set `save_folder` to `s3://mybucket/mydir/{run_name}/ckpt` it will replace `{run_name}` with the value of `run_name`. So you should avoid re-using the same run name across multiple training runs.
- `loggers.wandb` (optional) - If you want to log to W&B, fill in the `project` and `entity` fields, or comment out the `wandb` block if you don't want to use this logger.
- `algorithms` (optional) - Make sure to include any architecture-modifying algorithms that were applied to your starting checkpoint model before pre-training. For instance, if you turned on `gated_linear_units` in pre-training, make sure to do so during fine-tuning too!
@@ -337,10 +425,10 @@ composer main.py yamls/main/mosaic-bert-base-uncased.yaml
You should see logs being printed to your terminal.
You can also easily enable other experiment trackers like Weights and Biases or CometML by using [Composer's logging integrations](https://docs.mosaicml.com/en/stable/trainer/logging.html).
-## Mosaic BERT
+## MosaicBERT
-Our starter code supports both standard HuggingFace BERT models and our own **Mosaic BERT**. The latter incorporates numerous methods to improve throughput and training.
-Our goal in developing Mosaic BERT was to greatly reduce training time while making it easy for you to use on your own problems!
+Our starter code supports both standard HuggingFace BERT models and our own **MosaicBERT**. The latter incorporates numerous methods to improve throughput and training efficiency.
+Our goal in developing MosaicBERT was to greatly reduce training time while making it easy for you to use on your own problems!
To do this, we employ a number of techniques from the literature:
@@ -352,10 +440,8 @@ To do this, we employ a number of techniques from the literature:
... and get them to work together! To our knowledge, many of these methods have never been combined before.
-If you're reading this, we're still profiling the exact speedup and performance gains offered by Mosaic BERT compared to comparable HuggingFace BERT models. Stay tuned for incoming results!
+If you're reading this, we're still profiling the exact speedup and performance gains that MosaicBERT offers relative to comparable HuggingFace BERT models. Stay tuned for incoming results!
## Contact Us
If you run into any problems with the code, please file Github issues directly to this repo.
-
-If you want to train BERT-style models on MosaicML platform, reach out to us at [demo@mosaicml.com](mailto:demo@mosaicml.com)!