Setup Docker for TPU execution and update infra scripts.
I tried to optimize the Docker image size a bit using a staged build, as Ray
currently requires a source build of Meson, which requires a Clang
installation. Even with this, jax & libtpu are each >250MB installs, so
there's no avoiding a large image size at the moment.

Still, with this configuration, a v5-32 (the most I could get given GCP's stingy
IP address allocation) takes about 50 seconds to run setup-vm.sh and pull the
initial image.  After the initial pull, new deployments take a few seconds to
package up the current source directory.

It's still possible to use the `git clone` approach via a volume mount, but the
permissions are a bit finicky at that point, and I'm not sure how many options
we want to have.
rjpower committed Jun 5, 2024
1 parent ed3c6f1 commit 5906255
Showing 14 changed files with 588 additions and 19 deletions.
4 changes: 3 additions & 1 deletion .dockerignore
@@ -1,3 +1,5 @@
.git

scratch
cache
wandb
@@ -44,6 +46,7 @@ instance/

# Sphinx documentation
docs/_build/
docs/figures/

# PyBuilder
target/
@@ -105,7 +108,6 @@ dmypy.json
# JetBrains
.idea/


# dataset cache files
**/*.parquet
**/ledger.json
6 changes: 2 additions & 4 deletions .github/workflows/tpu_unit_tests.yaml
@@ -31,14 +31,12 @@ jobs:
export TPU_NAME=ci-run-${{ github.run_id }}
eval "$(ssh-agent -s)"
TRUE_SHA=${{ github.event.pull_request.head.sha }}
bash infra/spin-up-vm.sh $TPU_NAME -z ${TPU_ZONE} -t v4-8 --preemptible -s infra/helpers/setup-tpu-vm-tests.sh -b ${TRUE_SHA} --retries 1
# infra/babysit-tpu-vm.sh $TPU_NAME -z ${{ TPU_ZONE }} -t v4-8 --preemptible -s infra/helpers/setup-tpu-vm-tests.sh -b ${{ github.sha }} --retries 1 -- \
# PYTHONPATH=$PYTHONPATH:levanter/tests bash levanter/infra/run.sh pytest levanter/tests -m "not entry"
bash infra/spin-up-vm.sh $TPU_NAME -z ${TPU_ZONE} -t v4-8 --preemptible --retries 1
- name: Run most tests
run: |
export TPU_NAME=ci-run-${{ github.run_id }}
gcloud compute tpus tpu-vm ssh $TPU_NAME --zone ${TPU_ZONE} --command "PYTHONPATH=$PYTHONPATH:levanter/tests bash levanter/infra/run.sh pytest levanter/tests -m 'not entry'"
python infra/launch.py --foreground --tpu=$TPU_NAME --zone=$TPU_ZONE -- /opt/levanter/.venv/bin/pytest tests -m "not entry"
# Something's wrong with these
#
# - name: Run forked tests
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,5 +1,8 @@
/scratch

# Configuration for TPU launches/secrets
.config

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -140,6 +143,7 @@ dmypy.json
/wandb

# dataset cache files
/cache
*.parquet
ledger.json

17 changes: 17 additions & 0 deletions docker/tpu/Dockerfile.base
@@ -0,0 +1,17 @@
FROM python:3.10 AS build
RUN apt-get update && apt-get install -y clang
RUN pip install virtualenv

# venv binaries encode their directory, so we need to setup the venv in the final location
RUN virtualenv -p python3.10 /opt/levanter/.venv
RUN /opt/levanter/.venv/bin/pip install -U "jax[tpu]==0.4.26" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# Add only the requirements files to cache dependency build/installation
WORKDIR /tmp
ADD pyproject.toml README.md /tmp/
RUN /opt/levanter/.venv/bin/pip install -e '.[test]'

FROM python:3.10

WORKDIR /opt/levanter
COPY --from=build /opt/levanter/.venv /opt/levanter/.venv
17 changes: 17 additions & 0 deletions docker/tpu/Dockerfile.incremental
@@ -0,0 +1,17 @@
ARG IMAGE=ghcr.io/rjpower/levanter
ARG TAG=latest

FROM ${IMAGE}:${TAG}

ENV TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=60\
    TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=1024\
    RAY_USAGE_STATS_ENABLED=0\
    PATH=/opt/levanter/.venv/bin:$PATH\
    PYTHONPATH=/opt/levanter:/opt/levanter/src:/opt/levanter/examples:/opt/levanter/tests\
    HOME=/home/levanter

WORKDIR /opt/levanter

ADD pyproject.toml README.md /opt/levanter/
RUN pip install -e '.[test]'
ADD . /opt/levanter
43 changes: 33 additions & 10 deletions docs/Getting-Started-TPU-VM.md
@@ -85,18 +85,40 @@ the VM. That's explained down below in the [Running Levanter GPT-2](#running-lev
## Running Levanter GPT-2
Now that you have a TPU VM instance, you can follow the [Getting Started](Getting-Started-Training.md) steps, but here are a few shortcuts:

### Launch a GPT-2 Small in unattended mode (using nohup)
### Launch a GPT-2 Small in unattended mode

You will need a [Docker installation](https://docs.docker.com/engine/install/)
on your development machine to build and run images on TPUs.

First create a configuration file for future launches in your Levanter directory:

```
cat > .config <<EOF
env:
    WANDB_API_KEY: ...
    WANDB_ENTITY: ...
    WANDB_PROJECT: levanter
    HF_TOKEN: ...
docker_repository: levanter
zone: us-west4-a
tpu: test-tpu
EOF
```

Everything after the `--` is run on each worker.

```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'WANDB_API_KEY=... levanter/infra/launch.sh python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>'
python infra/launch.py -- python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>
```
`launch.sh` will run the command in the background and redirect stdout and stderr to a log file in the home directory
on each worker.
`launch.py` will package your directory and create and deploy a Docker image on each worker.
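
One way to picture the `--` convention: a launcher can split its own argument list at the first `--` and quote the remainder into a single shell string for the workers. The sketch below is illustrative only. `split_launcher_args` and `to_remote_command` are hypothetical helpers, not functions taken from `launch.py`, and the example arguments are made up.

```python
import shlex


def split_launcher_args(argv):
    """Split argv at the first "--" into (launcher_args, worker_command)."""
    if "--" not in argv:
        return list(argv), []
    idx = argv.index("--")
    return list(argv[:idx]), list(argv[idx + 1:])


def to_remote_command(worker_command):
    """Quote each token so the command survives being passed as one shell string."""
    return shlex.join(worker_command)


launcher_args, worker_command = split_launcher_args(
    ["--foreground", "--", "pytest", "tests", "-m", "not entry"]
)
remote = to_remote_command(worker_command)
print(launcher_args)  # flags consumed by the launcher itself
print(remote)         # single string suitable for a remote shell
```

Joining the tokens with a plain `" ".join(...)` instead would turn the single argument `not entry` into two words on the remote side, which is why the sketch uses `shlex.join`.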
### Launch a GPT-2 Small in interactive mode
This version writes to the terminal, you should use tmux or something for long running jobs for this version. It's mostly for debugging.
To run in the foreground, use `--foreground` with the `launch.py` script. You should use tmux or something for long running jobs for this version. It's mostly for debugging.
```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>'
python infra/launch.py --foreground -- python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>
```
### Babysitting Script
@@ -113,11 +135,12 @@ You can run it like this:
```bash
infra/babysit-tpu-vm <name> -z <zone> -t <type> [--preemptible] -- \
WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml
python infra/launch.py --foreground -- python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml
```
That `--` is important! It separates the spin up args from the running args. Also, you should never use `launch.sh`
with `babysit`, because nohup exits immediately with exit code 0.
That `--` is important! It separates the spin up args from the running args.
Also, you should always use `--foreground` with `babysit-tpu-vm`, as the
background mode will always return immediately.
### Running your own config
@@ -132,7 +155,7 @@ Afterward, you can use the config directly from the TPU VM instance, e.g.:

```bash
infra/babysit-tpu-vm <name> -z <zone> -t <type> [--preemptible] -- \
WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path gs://my_bucket/my_config.yaml \
python infra/launch.py --foreground -- python levanter/src/levanter/main/train_lm.py --config_path gs://my_bucket/my_config.yaml \
--trainer.checkpointer.base_path gs://path/to/checkpoints/
```

17 changes: 16 additions & 1 deletion docs/Training-On-Your-Data.md
@@ -395,8 +395,23 @@ bash infra/spin-up-tpu-vm.sh my-tpu -z us-east1-d -t v3-128

This will spin up a TPU VM instance and install Levanter on it. You can then run a command like so:


```
cat > .config <<EOF
env:
    WANDB_API_KEY: ...
    WANDB_ENTITY: ...
    WANDB_PROJECT: levanter
    HF_TOKEN: ...
docker_repository: levanter
zone: us-west4-a
tpu: test-tpu
EOF
```

```bash
gcloud compute tpus tpu-vm ssh my-tpu --zone us-east1-d --worker=all --command="WANDB_API_KEY=... levanter/infra/launch.sh python levanter/src/levanter/main/train_lm.py --config_path gs://path/to/config.yaml"
python infra/launch.py -- python levanter/src/levanter/main/train_lm.py --config_path gs://path/to/config.yaml
```

## Monitoring
2 changes: 1 addition & 1 deletion docs/tutorials/Training-On-Audio-Data.md
@@ -189,7 +189,7 @@ bash infra/spin-up-tpu-vm.sh my-tpu -z us-east1-d -t v3-128
This will spin up a TPU VM instance and install Levanter on it. You can then run a command like so:

```bash
gcloud compute tpus tpu-vm ssh my-tpu --zone us-east1-d --worker=all --command="WANDB_API_KEY=... levanter/infra/launch.sh python levanter/src/levanter/main/train_asr.py --config_path gs://path/to/config.yaml"
python infra/launch.py -- python levanter/src/levanter/main/train_asr.py --config_path gs://path/to/config.yaml
```

### GPU
Empty file added infra/__init__.py
Empty file.
74 changes: 74 additions & 0 deletions infra/helpers/cli.py
@@ -0,0 +1,74 @@
import argparse
import os
import subprocess
import typing

from google.cloud import storage
import yaml


def run_command(*args, **kwargs):
    print("Running:", " ".join(list(args)))
    return subprocess.check_call(args, **kwargs)


def add_ssh_key(ssh_key_filename):
    # format 3072 SHA256:... key-name (RSA)
    key_hash = subprocess.check_output(["ssh-keygen", "-lf", ssh_key_filename]).decode("utf-8").split()[1]
    existing_keys = subprocess.check_output(["ssh-add", "-l"]).decode("utf-8").split("\n")
    for key in existing_keys:
        if key_hash in key:
            print('Found existing key in ssh-agent, skipping "ssh-add"')
            return

    subprocess.check_call(["ssh-add", ssh_key_filename])


def tpu_ssh(tpu_name, zone, *args):
    add_ssh_key(os.path.expanduser("~/.ssh/google_compute_engine"))
    return run_command(
        "gcloud",
        "alpha",
        "compute",
        "tpus",
        "tpu-vm",
        "ssh",
        tpu_name,
        "--worker=all",
        f"--zone={zone}",
        "--command=%s" % " ".join(args),
    )


# Oddly enough, there's no API to simply fetch the current gcloud configuration...
def gcloud_config():
    client = storage.Client()
    return {
        "project": client.project,
    }


def add_arg(
    parser: argparse.ArgumentParser, config: typing.Dict, flags: typing.List[str], required=False, default=None, **kw
):
    """Add an argument to the parser, using `config` or the environment to resolve default values."""
    key = flags[0].lstrip("-").replace("-", "_")
    if key in config:
        default = config[key]

    if key.upper() in os.environ:
        default = os.environ[key.upper()]

    if default is not None:
        kw["default"] = default
    elif required:
        kw["required"] = True

    parser.add_argument(*flags, **kw)


def load_config():
    if os.path.exists(".config"):
        return yaml.load(open(".config", "r"), Loader=yaml.SafeLoader)
    else:
        return {}
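
The precedence in `add_arg` is easy to misread from the diff alone: a value in `.config` replaces the hard-coded default, an upper-cased environment variable replaces both, and an explicit command-line flag still wins because argparse only falls back to `default` when the flag is absent. The sketch below restates `add_arg` so it runs on its own. The `resolve` helper, the config dict, the `ZONE` values and the `fallback-tpu` default are all made-up examples, not part of this commit.

```python
import argparse
import os
import typing
from unittest import mock


def add_arg(parser, config: typing.Dict, flags: typing.List[str], required=False, default=None, **kw):
    """Same resolution logic as infra/helpers/cli.py, copied here to be runnable."""
    key = flags[0].lstrip("-").replace("-", "_")
    if key in config:
        default = config[key]
    if key.upper() in os.environ:
        default = os.environ[key.upper()]
    if default is not None:
        kw["default"] = default
    elif required:
        kw["required"] = True
    parser.add_argument(*flags, **kw)


def resolve(config, environ, argv):
    """Hypothetical helper: parse argv against a given config and a clean environment."""
    with mock.patch.dict(os.environ, environ, clear=True):
        parser = argparse.ArgumentParser()
        add_arg(parser, config, ["--zone"], required=True)
        add_arg(parser, config, ["--tpu"], default="fallback-tpu")
        return parser.parse_args(argv)


config = {"zone": "us-west4-a"}
from_config = resolve(config, {}, [])
from_env = resolve(config, {"ZONE": "us-east1-d"}, [])
from_flag = resolve(config, {"ZONE": "us-east1-d"}, ["--zone", "europe-west4-b"])
print(from_config.zone, from_env.zone, from_flag.zone, from_config.tpu)
```

Note that `required=True` only takes effect when neither the config, the environment, nor the `default` argument supplies a value, so a populated `.config` makes otherwise mandatory flags optional.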