Setup Docker for TPU execution and update infra scripts.
I tried to reduce the Docker image size a bit with a staged build, since Ray
currently requires a source build of Meson, which in turn requires a Clang
installation. Even so, jax and libtpu are each >250MB installs on their own,
so there's no avoiding a large image size at the moment.

Still, with this configuration, a v5-32 (the most I could get given GCP's stingy
IP address allocation) takes about 50 seconds to run setup-vm.sh and pull the
initial image. After the initial pull, new deployments take a few seconds to
package up the current source directory.

It's still possible to use the `git clone` approach via a volume mount, but the
permissions are a bit finicky at that point, and I'm not sure how many options
we want to have.
rjpower committed May 29, 2024
1 parent ed3c6f1 commit fd6333c
Showing 18 changed files with 432 additions and 276 deletions.
4 changes: 3 additions & 1 deletion .dockerignore
@@ -1,3 +1,5 @@
.git

scratch
cache
wandb
@@ -44,6 +46,7 @@ instance/

# Sphinx documentation
docs/_build/
docs/figures/

# PyBuilder
target/
@@ -105,7 +108,6 @@ dmypy.json
# JetBrains
.idea/


# dataset cache files
**/*.parquet
**/ledger.json
6 changes: 2 additions & 4 deletions .github/workflows/tpu_unit_tests.yaml
@@ -31,14 +31,12 @@ jobs:
export TPU_NAME=ci-run-${{ github.run_id }}
eval "$(ssh-agent -s)"
TRUE_SHA=${{ github.event.pull_request.head.sha }}
bash infra/spin-up-vm.sh $TPU_NAME -z ${TPU_ZONE} -t v4-8 --preemptible -s infra/helpers/setup-tpu-vm-tests.sh -b ${TRUE_SHA} --retries 1
# infra/babysit-tpu-vm.sh $TPU_NAME -z ${{ TPU_ZONE }} -t v4-8 --preemptible -s infra/helpers/setup-tpu-vm-tests.sh -b ${{ github.sha }} --retries 1 -- \
# PYTHONPATH=$PYTHONPATH:levanter/tests bash levanter/infra/run.sh pytest levanter/tests -m "not entry"
bash infra/spin-up-vm.sh $TPU_NAME -z ${TPU_ZONE} -t v4-8 --preemptible --retries 1
- name: Run most tests
run: |
export TPU_NAME=ci-run-${{ github.run_id }}
gcloud compute tpus tpu-vm ssh $TPU_NAME --zone ${TPU_ZONE} --command "PYTHONPATH=$PYTHONPATH:levanter/tests bash levanter/infra/run.sh pytest levanter/tests -m 'not entry'"
python infra/launch.py --foreground --tpu=$TPU_NAME --zone=$TPU_ZONE -- /opt/levanter/.venv/bin/pytest tests -m "not entry"
# Something's wrong with these
#
# - name: Run forked tests
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,5 +1,8 @@
/scratch

# Configuration for TPU launches/secrets
.config

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -140,6 +143,7 @@ dmypy.json
/wandb

# dataset cache files
/cache
*.parquet
ledger.json

18 changes: 18 additions & 0 deletions docker/tpu/Dockerfile.base
@@ -0,0 +1,18 @@
FROM python:3.10 AS build
RUN apt-get update && apt-get install -y clang
RUN pip install virtualenv

# venv binaries encode their directory, so we need to set up the venv in its final location
RUN virtualenv -p python3.10 /opt/levanter/.venv
RUN /opt/levanter/.venv/bin/pip install -U "jax[tpu]==0.4.26" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

WORKDIR /tmp

# Add only the requirements files to cache dependency build/installation
ADD pyproject.toml README.md /tmp/
RUN /opt/levanter/.venv/bin/pip install -e '.[test]'

FROM python:3.10

WORKDIR /opt/levanter
COPY --from=build /opt/levanter/.venv /opt/levanter/.venv
16 changes: 16 additions & 0 deletions docker/tpu/Dockerfile.incremental
@@ -0,0 +1,16 @@
ARG REPO_LOCATION=us-west4-docker.pkg.dev/beastmaster-408319/levanter
ARG BASE_VERSION=latest

FROM ${REPO_LOCATION}/levanter:${BASE_VERSION}

ENV TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=60 \
    TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=1024 \
    RAY_USAGE_STATS_ENABLED=0 \
    PATH=/opt/levanter/.venv/bin:$PATH

WORKDIR /opt/levanter

ADD pyproject.toml README.md /opt/levanter/
RUN pip install -e '.[test]'

ADD . /opt/levanter
43 changes: 33 additions & 10 deletions docs/Getting-Started-TPU-VM.md
@@ -85,18 +85,40 @@ the VM. That's explained down below in the [Running Levanter GPT-2](#running-lev
## Running Levanter GPT-2
Now that you have a TPU VM instance, you can follow the [Getting Started](Getting-Started-Training.md) steps, but here are a few shortcuts:

### Launch a GPT-2 Small in unattended mode (using nohup)
### Launch a GPT-2 Small in unattended mode

You will need a [Docker installation](https://docs.docker.com/engine/install/)
on your development machine to build and run images on TPUs.

First create a configuration file for future launches in your Levanter directory:

```
cat > .config <<EOF
env:
    WANDB_API_KEY: ...
    WANDB_ENTITY: ...
    WANDB_PROJECT: levanter
    HF_TOKEN: ...
docker_repository: levanter
zone: us-west4-a
tpu: test-tpu
EOF
```

Everything after the `--` is run on each worker.

```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'WANDB_API_KEY=... levanter/infra/launch.sh python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>'
python infra/launch.py -- python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>
```
`launch.sh` will run the command in the background and redirect stdout and stderr to a log file in the home directory
on each worker.
`launch.py` will package your directory and create and deploy a Docker image on each worker.
### Launch a GPT-2 Small in interactive mode
This version writes to the terminal, you should use tmux or something for long running jobs for this version. It's mostly for debugging.
To run in the foreground, pass `--foreground` to the `launch.py` script. For long-running jobs in this mode, run it inside tmux or a similar tool. It's mostly for debugging.
```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>'
python infra/launch.py --foreground -- python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>
```
### Babysitting Script
@@ -113,11 +135,12 @@ You can run it like this:
```bash
infra/babysit-tpu-vm <name> -z <zone> -t <type> [--preemptible] -- \
WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml
python infra/launch.py -- levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml
```
That `--` is important! It separates the spin up args from the running args. Also, you should never use `launch.sh`
with `babysit`, because nohup exits immediately with exit code 0.
That `--` is important! It separates the spin up args from the running args.
Also, you should always use `--foreground` with `babysit-tpu-vm`, as the
background mode will always return immediately.
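
As a rough illustration of how a wrapper script can separate its own arguments from the command it should run, the sketch below prints everything after the first `--`, one argument per line. The function name `args_after_separator` is hypothetical and is not taken from `babysit-tpu-vm.sh`, which may parse its arguments differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not part of the infra scripts):
# print every argument that follows the first "--", one per line.
args_after_separator() {
  local seen=false
  local arg
  for arg in "$@"; do
    if [ "$seen" = true ]; then
      # Already past the separator: pass the argument through unchanged,
      # including any later "--" that belongs to the inner command.
      printf '%s\n' "$arg"
    elif [ "$arg" = "--" ]; then
      seen=true
    fi
  done
}

# Arguments before "--" belong to the wrapper; the rest is the command to run.
args_after_separator my-tpu -z us-west4-a -t v4-8 -- python infra/launch.py --foreground
```

Only the first `--` is treated as the separator, so a nested command that carries its own `--` (as `launch.py` does) is passed through intact.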
### Running your own config
@@ -132,7 +155,7 @@ Afterward, you can use the config directly from the TPU VM instance, e.g.:

```bash
infra/babysit-tpu-vm <name> -z <zone> -t <type> [--preemptible] -- \
WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path gs://my_bucket/my_config.yaml \
python infra/launch.py -- python levanter/src/levanter/main/train_lm.py --config_path gs://my_bucket/my_config.yaml \
--trainer.checkpointer.base_path gs://path/to/checkpoints/
```

17 changes: 16 additions & 1 deletion docs/Training-On-Your-Data.md
@@ -395,8 +395,23 @@ bash infra/spin-up-tpu-vm.sh my-tpu -z us-east1-d -t v3-128

This will spin up a TPU VM instance and install Levanter on it. You can then run a command like so:


```
cat > .config <<EOF
env:
    WANDB_API_KEY: ...
    WANDB_ENTITY: ...
    WANDB_PROJECT: levanter
    HF_TOKEN: ...
docker_repository: levanter
zone: us-west4-a
tpu: test-tpu
EOF
```

```bash
gcloud compute tpus tpu-vm ssh my-tpu --zone us-east1-d --worker=all --command="WANDB_API_KEY=... levanter/infra/launch.sh python levanter/src/levanter/main/train_lm.py --config_path gs://path/to/config.yaml"
python infra/launch.py -- python levanter/src/levanter/main/train_lm.py --config_path gs://path/to/config.yaml
```

## Monitoring
2 changes: 1 addition & 1 deletion docs/tutorials/Training-On-Audio-Data.md
@@ -189,7 +189,7 @@ bash infra/spin-up-tpu-vm.sh my-tpu -z us-east1-d -t v3-128
This will spin up a TPU VM instance and install Levanter on it. You can then run a command like so:

```bash
gcloud compute tpus tpu-vm ssh my-tpu --zone us-east1-d --worker=all --command="WANDB_API_KEY=... levanter/infra/launch.sh python levanter/src/levanter/main/train_asr.py --config_path gs://path/to/config.yaml"
python infra/launch.py -- python levanter/src/levanter/main/train_asr.py --config_path gs://path/to/config.yaml
```

### GPU
Empty file added infra/__init__.py
Empty file.
3 changes: 1 addition & 2 deletions infra/babysit-tpu-vm.sh
@@ -77,8 +77,7 @@ while true; do
else
# run the command
echo "Running command on VM $VM_NAME"
echo "gcloud compute tpus tpu-vm ssh --zone=$ZONE $VM_NAME --command='$CMD_ARGS_STR' --worker=all"
gcloud compute tpus tpu-vm ssh --zone=$ZONE $VM_NAME --command="$CMD_ARGS_STR" --worker=all
$CMD_ARGS_STR
EXIT_CODE=$?
if [ $EXIT_CODE -eq 0 ]; then
echo "Command succeeded. Exiting"
30 changes: 5 additions & 25 deletions infra/helpers/parse-tpu-creation-args.sh
@@ -23,6 +23,9 @@ AUTODELETE=true
SETUP_SCRIPT="$SCRIPT_DIR/helpers/setup-tpu-vm.sh"
SUBNETWORK="default"
USE_ALPHA=false
DOCKER_REPOSITORY="levanter"
DOCKER_IMAGE="levanter"
PROJECT=$(gcloud info --format='value(config.project)')
RETRIES=-1 # how many times babysit-tpu-vm.sh should retry before giving up. -1 means infinite

if [ -z "$GIT_BRANCH" ]; then
@@ -119,28 +122,5 @@ while [[ $# -gt 0 ]]; do
esac
done

# check if the branch we chose has been pushed to the remote
# if not, warn
# if it's a commit sha/short-sha (or something that looks like one), check if it's in the remote
if [[ "$GIT_BRANCH" =~ ^[0-9a-f]{7,40}$ ]]; then
# if it's a commit, check if it's in the remote
BRANCHES=$(git branch -r --contains "$GIT_BRANCH")
if [ -z "$BRANCHES" ]; then
>&2 echo "Warning: commit $GIT_BRANCH not found on remote $GIT_REPO"
fi
else
# get the remote branch name
REMOTE_BRANCH=$(git ls-remote --heads origin "$GIT_BRANCH" | awk '{print $2}' | sed 's/refs\/heads\///g')
# if it's empty, warn
if [ -z "$REMOTE_BRANCH" ]; then
>&2 echo "Warning: branch $GIT_BRANCH not found on remote $GIT_REPO"
else
# make sure it's pushed
LOCAL_COMMIT=$(git rev-parse --short "$GIT_BRANCH")
REMOTE_COMMIT=$(git rev-parse --short "origin/$REMOTE_BRANCH")

if [ "$LOCAL_COMMIT" != "$REMOTE_COMMIT" ]; then
>&2 echo "Warning: branch $GIT_BRANCH not pushed to remote $GIT_REPO. Local commit: $LOCAL_COMMIT, remote commit: $REMOTE_COMMIT"
fi
fi
fi
# Extract region from ZONE (everything before the last "-")
REGION=$(echo "$ZONE" | sed 's/-[^-]*$//g')
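
For reference, the region extraction above can be checked in isolation. The sketch below wraps the same idea in a hypothetical `region_from_zone` function (the name is an assumption, not something defined in this commit) and uses Bash parameter expansion, which should give the same result as the `sed` expression for zone names of the form `<region>-<suffix>`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): derive a region from a zone name
# by dropping the final "-<suffix>" component.
region_from_zone() {
  local zone="$1"
  # "${zone%-*}" removes the shortest trailing match of "-*", i.e. everything
  # from the last "-" onward. This mirrors: echo "$zone" | sed 's/-[^-]*$//g'
  printf '%s\n' "${zone%-*}"
}

region_from_zone us-west4-a   # prints us-west4
region_from_zone us-east1-d   # prints us-east1
```

The parameter-expansion form avoids spawning `sed` for each call; either variant leaves a string without any `-` unchanged.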
126 changes: 0 additions & 126 deletions infra/helpers/setup-tpu-vm-tests.sh

This file was deleted.
