Update S3 download instructions (#701)
* Update S3 download instructions in README.md

* Add Rclone install instructions for Windows in README.md

* Minor language tweak in README.md

* Add Rclone download instructions in README.md
nathanw-mlc authored Feb 29, 2024
1 parent ab4ae1c commit d6b1389
Showing 2 changed files with 72 additions and 18 deletions.
58 changes: 49 additions & 9 deletions language_model/tensorflow/bert/README.md
@@ -1,8 +1,8 @@
# V1.0 Dataset and Training

# Location of the input files
## Location of the input files

This [Google Drive location](https://drive.google.com/drive/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT) contains the following.
The following files are available for download in a Cloudflare R2 bucket.
* tf1_ckpt folder: contains checkpoint files
- model.ckpt-28252.data-00000-of-00001
- model.ckpt-28252.index
@@ -18,6 +18,27 @@ This [Google Drive location](https://drive.google.com/drive/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT) contains the following.
* License.txt
* vocab.txt: Contains WordPiece to id mapping

### Download from bucket

You can access the bucket and download the files with Rclone.

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
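
Once the install script finishes, a quick way to confirm that Rclone is on your PATH is to print its version:
```
rclone version
```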
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
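To sanity-check the new `mlc-training` remote, you can list the top-level directories of the public bucket before downloading anything (the listing may change as the bucket is updated):
```
rclone lsd mlc-training:mlcommons-training-wg-public
```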
You can then navigate in the terminal to your desired download directory and run the following command to download the input files:

```
rclone copy mlc-training:mlcommons-training-wg-public/wikipedia_for_bert/input_files ./input_files -P
```
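
After the transfer completes, one optional way to verify the local copy is Rclone's built-in comparison, which checks file sizes and hashes against the bucket:
```
rclone check mlc-training:mlcommons-training-wg-public/wikipedia_for_bert/input_files ./input_files
```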

### Alternatively, generate the checkpoints

Alternatively, the TF2 checkpoint can be generated from the TF1 checkpoint using [tf2_encoder_checkpoint_converter.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/tf2_encoder_checkpoint_converter.py):

```shell
@@ -30,11 +51,30 @@ python3 tf2_encoder_checkpoint_converter.py \
Note that the checkpoint converter removes optimizer slot variables, so the resulting TF2 checkpoint is only about 1/3 the size of the TF1 checkpoint.
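
As a rough sanity check of that size reduction, you can compare the on-disk sizes of the two checkpoints; the TF1 file names come from the list above, while the TF2 output path is whatever you passed to the converter (the path below is illustrative):
```
# TF1 checkpoint shards (file names as listed above)
du -ch model.ckpt-28252.data-00000-of-00001 model.ckpt-28252.index
# converted TF2 checkpoint directory (illustrative path)
du -sh ./tf2_ckpt
```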


# Download and preprocess datasets
## Download and preprocess datasets

The dataset was prepared using Python 3.7.6, nltk 3.4.5 and the [tensorflow/tensorflow:1.15.2-gpu](https://hub.docker.com/layers/tensorflow/tensorflow/1.15.2-gpu/images/sha256-da7b6c8a63bdafa77864e7e874664acfe939fdc140cb99940610c34b8c461cd0?context=explore) docker image.

Files after the download, uncompress, extract, clean-up and dataset separation steps are provided at a [Google Drive location](https://drive.google.com/corp/drive/u/0/folders/1cywmDnAsrP5-2vsr8GDc6QUc7VWe-M3v). The main reason is that WikiExtractor.py replaces some of the tags present in the XML, such as {CURRENTDAY} and {CURRENTMONTHNAMEGEN}, with the current values obtained from time.strftime ([code](https://github.com/attardi/wikiextractor/blob/e4abb4cbd019b0257824ee47c23dd163919b731b/WikiExtractor.py#L632)). Hence, one might see slightly different preprocessed files after WikiExtractor.py is invoked. This means the md5sum hashes of these files will also be different each time WikiExtractor is called.
Files after the download, uncompress, extract, clean-up and dataset separation steps are available for download in a Cloudflare R2 bucket. The main reason is that WikiExtractor.py replaces some of the tags present in the XML, such as {CURRENTDAY} and {CURRENTMONTHNAMEGEN}, with the current values obtained from time.strftime ([code](https://github.com/attardi/wikiextractor/blob/e4abb4cbd019b0257824ee47c23dd163919b731b/WikiExtractor.py#L632)). Hence, one might see slightly different preprocessed files after WikiExtractor.py is invoked. This means the md5sum hashes of these files will also be different each time WikiExtractor is called.

### Download from bucket

You can access the bucket and download the files with Rclone.

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:

```
rclone copy mlc-training:mlcommons-training-wg-public/wikipedia_for_bert/processed_dataset ./processed_dataset -P
```
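
If you only need a subset of the processed dataset (for example, just the results shards described below), Rclone's filter flags can restrict the copy; the `results/**` pattern is an assumption about the bucket's directory layout:
```
rclone copy mlc-training:mlcommons-training-wg-public/wikipedia_for_bert/processed_dataset ./processed_dataset -P --include "results/**"
```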

### Files in ./results directory:

@@ -130,7 +170,7 @@ The examples in the TFRecords have the following key/values in their features dictionary
| part-00XXX-of-00500 | 391,434,110,129 |


# Stopping criteria
## Stopping criteria
A valid submission must achieve a masked LM accuracy >= 0.720.

The evaluation will be on the 10,000 samples in the evaluation set. The evaluation frequency, in terms of the number of samples trained, is determined by the following formula based on the global batch size, starting from 0 samples. Evaluation with 0 samples trained may be skipped, but it is a good place to verify that the initial checkpoint was loaded correctly for debugging purposes; the masked LM accuracy after loading the initial checkpoint and before any training should be very close to 0.34085. The evaluation can be either offline or online for v1.0. For more details, please refer to the training policy.
@@ -155,9 +195,9 @@ The purpose of this formula is to make the eval interval 1) not too large to ma

The generation of the evaluation set shard should follow the exact command shown above, using create_pretraining_data.py. **_In particular the seed (12345) must be set to ensure everyone evaluates on the same data._**

# Running the model
## Running the model

## On GPU-V100-8
### On GPU-V100-8

To run this model with batch size 24 on GPUs, use the following command.

@@ -221,7 +261,7 @@ The model has been tested using the following stack:
- NVIDIA Docker 2.5.0-1 + Docker 19.03.13
- docker image tensorflow/tensorflow:2.4.0-gpu

## On TPU-v3-128
### On TPU-v3-128

To run the training workload for batch size 8k on [Cloud TPUs](https://cloud.google.com/tpu), follow these steps:

@@ -388,7 +428,7 @@ for step_num in 0 $(seq 600 -3 3); do
done
```

## Gradient Accumulation
### Gradient Accumulation

The GradientAggregationOptimizer can accumulate gradients across multiple steps on each accelerator before actually applying the gradients. To use this feature, please note the following:

32 changes: 23 additions & 9 deletions large_language_model/megatron-lm/README.md
@@ -164,17 +164,31 @@ Evaluation on the validation subset that consists of 24567 examples.
# 6. Other

### S3 artifacts download
The dataset and the checkpoints are available to download from an S3 bucket.
To achieve the best download bandwidth (currently no more than 25 MB/s is expected), it's necessary to set up a third-party client capable of downloading the artifacts.
[Here are the instructions](https://help.lyvecloud.seagate.com/en/connecting-s3-clients-to-lyve-cloud.html).
The read-only access credentials are provided below.
The dataset and the checkpoints are available to download from an S3 bucket. You can download this data from the bucket using Rclone as follows:

#### Access details
- Bucket name: `mlcommons-training-wg-s3`
- Endpoint URL: https://s3.us-east-1.lyvecloud.seagate.com
- Access Key: `3ZC41B4Z2WHM5DT2`
- Secret Key: `AK4NQQZV0NKFEJWJUZVPX5XQ0QNTXCGW`
To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following commands to download the dataset and checkpoints:

**`dataset_c4_spm.tar`**
```
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/dataset_c4_spm.tar ./ -P
```
**`checkpoint_megatron_fp32.tar`**
```
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_megatron_fp32.tar ./ -P
```
**`checkpoint_nemo_bf16.tar`**
```
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_nemo_bf16.tar ./ -P
```
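
Once the downloads finish, the tar archives need to be unpacked before use; a typical extraction into the current directory looks like:
```
tar -xf dataset_c4_spm.tar
tar -xf checkpoint_megatron_fp32.tar
tar -xf checkpoint_nemo_bf16.tar
```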

### Model conversion from Paxml checkpoints
As an alternative to downloading the checkpoint in Megatron-ready format, it can be obtained by converting a Paxml checkpoint.
