Setup Docker for TPU execution and update infra scripts.

I tried to trim the Docker image size with a staged build, since Ray currently requires a source build of Meson, which in turn requires a Clang installation. Even so, jax and libtpu are each >250MB installs, so there's no avoiding a large image at the moment. Still, with this configuration, a v5-32 (the most I could get given GCP's stingy IP address allocation) takes about 50 seconds to run setup-vm.sh and pull the initial image. After the initial pull, new deployments take a few seconds to package up the current source directory. It's still possible to use the `git clone` approach via a volume mount, but the permissions are a bit finicky at that point, and I'm not sure how many options we want to have.
stanford-crfm · May 29, 2024 · fd6333c
1 parent ed3c6f1

Showing 18 changed files with 432 additions and 276 deletions.
diff --git a/.dockerignore b/.dockerignore
@@ -1,3 +1,5 @@
+.git
+
 scratch
 cache
 wandb
@@ -44,6 +46,7 @@ instance/
 
 # Sphinx documentation
 docs/_build/
+docs/figures/
 
 # PyBuilder
 target/
@@ -105,7 +108,6 @@ dmypy.json
 # JetBrains
 .idea/
 
-
 # dataset cache files
 **/*.parquet
 **/ledger.json

diff --git a/.github/workflows/tpu_unit_tests.yaml b/.github/workflows/tpu_unit_tests.yaml
@@ -31,14 +31,12 @@ jobs:
           export TPU_NAME=ci-run-${{ github.run_id }}
           eval "$(ssh-agent -s)"
           TRUE_SHA=${{ github.event.pull_request.head.sha }}
-          bash infra/spin-up-vm.sh $TPU_NAME -z ${TPU_ZONE} -t v4-8 --preemptible -s infra/helpers/setup-tpu-vm-tests.sh -b ${TRUE_SHA} --retries 1
-#          infra/babysit-tpu-vm.sh $TPU_NAME -z ${{ TPU_ZONE }} -t v4-8 --preemptible -s infra/helpers/setup-tpu-vm-tests.sh -b ${{ github.sha }} --retries 1 -- \
-#            PYTHONPATH=$PYTHONPATH:levanter/tests bash levanter/infra/run.sh pytest levanter/tests -m "not entry"
+          bash infra/spin-up-vm.sh $TPU_NAME -z ${TPU_ZONE} -t v4-8 --preemptible --retries 1
 
       - name: Run most tests
         run: |
           export TPU_NAME=ci-run-${{ github.run_id }}
-          gcloud compute tpus tpu-vm ssh $TPU_NAME --zone ${TPU_ZONE} --command "PYTHONPATH=$PYTHONPATH:levanter/tests bash levanter/infra/run.sh pytest levanter/tests -m 'not entry'"
+          python infra/launch.py --foreground --tpu=$TPU_NAME --zone=$TPU_ZONE -- /opt/levanter/.venv/bin/pytest tests -m "not entry"
 # Something's wrong with these
 #
 #      - name: Run forked tests

diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,8 @@
 /scratch
 
+# Configuration for TPU launches/secrets
+.config
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
@@ -140,6 +143,7 @@ dmypy.json
 /wandb
 
 # dataset cache files
+/cache
 *.parquet
 ledger.json
 

diff --git a/docker/tpu/Dockerfile.base b/docker/tpu/Dockerfile.base
@@ -0,0 +1,18 @@
+FROM python:3.10 AS build
+RUN apt-get update && apt-get install -y clang
+RUN pip install virtualenv
+
+# venv binaries encode their directory, so we need to setup the venv in the final location
+RUN virtualenv -p python3.10 /opt/levanter/.venv
+RUN /opt/levanter/.venv/bin/pip install -U "jax[tpu]==0.4.26" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
+
+WORKDIR /tmp
+
+# Add only the requirements files to cache dependency build/installation
+ADD pyproject.toml README.md /tmp/
+RUN /opt/levanter/.venv/bin/pip install -e '.[test]'
+
+FROM python:3.10
+
+WORKDIR /opt/levanter
+COPY --from=build /opt/levanter/.venv /opt/levanter/.venv
diff --git a/docker/tpu/Dockerfile.incremental b/docker/tpu/Dockerfile.incremental
@@ -0,0 +1,16 @@
+ARG REPO_LOCATION=us-west4-docker.pkg.dev/beastmaster-408319/levanter
+ARG BASE_VERSION=latest
+
+FROM ${REPO_LOCATION}/levanter:${BASE_VERSION}
+
+ENV TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=60\
+    TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=1024\
+    RAY_USAGE_STATS_ENABLED=0\
+    PATH=/opt/levanter/.venv/bin:$PATH
+
+WORKDIR /opt/levanter
+
+ADD pyproject.toml README.md /opt/levanter/
+RUN pip install -e '.[test]'
+
+ADD . /opt/levanter
diff --git a/docs/Getting-Started-TPU-VM.md b/docs/Getting-Started-TPU-VM.md
@@ -85,18 +85,40 @@ the VM. That's explained down below in the [Running Levanter GPT-2](#running-lev
 ## Running Levanter GPT-2
 Now that you have a TPU VM instance, you can follow the [Getting Started](Getting-Started-Training.md) steps, but here are a few shortcuts:
 
-### Launch a GPT-2 Small in unattended mode (using nohup)
+### Launch a GPT-2 Small in unattended mode
+
+You will need a [Docker installation](https://docs.docker.com/engine/install/)
+on your development machine to build and run images on TPUs.
+
+First create a configuration file for future launches in your Levanter directory:
+
+```
+cat > .config <<EOF
+env:
+    WANDB_API_KEY:  ...
+    WANDB_ENTITY: ...
+    WANDB_PROJECT: levanter
+    HF_TOKEN: ...
+
+docker_repository: levanter
+zone: us-west4-a
+tpu: test-tpu
+EOF
+```
+
+Everything after the `--` is run on each worker.
+
 ```bash
-gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'WANDB_API_KEY=... levanter/infra/launch.sh python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>'
+python infra/launch.py -- python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>
 ```
 
-`launch.sh` will run the command in the background and redirect stdout and stderr to a log file in the home directory
-on each worker.
+`launch.py` will package your directory into a Docker image and deploy it on each worker.
 
 ### Launch a GPT-2 Small in interactive mode
-This version writes to the terminal, you should use tmux or something for long running jobs for this version. It's mostly for debugging.
+
+To run in the foreground, use `--foreground` with the `launch.py` script. For long-running jobs, run it inside tmux or a similar terminal multiplexer. This mode is mostly for debugging.
 ```bash
-gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>'
+python infra/launch.py --foreground -- python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>
 ```
 
 ### Babysitting Script
@@ -113,11 +135,12 @@ You can run it like this:
 
 ```bash
 infra/babysit-tpu-vm <name> -z <zone> -t <type> [--preemptible]  -- \
-    WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml
+    python infra/launch.py --foreground -- python levanter/src/levanter/main/train_lm.py --config_path levanter/config/gpt2_small.yaml
 ```
 
-That `--` is important! It separates the spin up args from the running args. Also, you should never use `launch.sh`
-with `babysit`, because nohup exits immediately with exit code 0.
+That `--` is important! It separates the spin up args from the running args.
+Also, you should always use `--foreground` with `babysit-tpu-vm`, as the
+background mode returns immediately.
 
 ### Running your own config
 
@@ -132,7 +155,7 @@ Afterward, you can use the config directly from the TPU VM instance, e.g.:
 
 ```bash
 infra/babysit-tpu-vm <name> -z <zone> -t <type> [--preemptible] -- \
-    WANDB_API_KEY=... levanter/infra/run.sh python levanter/src/levanter/main/train_lm.py --config_path gs://my_bucket/my_config.yaml \
+    python infra/launch.py --foreground -- python levanter/src/levanter/main/train_lm.py --config_path gs://my_bucket/my_config.yaml \
     --trainer.checkpointer.base_path gs://path/to/checkpoints/
 ```
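
Review note: the docs above rely on the `--` convention ("Everything after the `--` is run on each worker"). The actual `infra/launch.py` is not shown in this view, so the following is only a minimal sketch of that convention; `split_launch_args` is a hypothetical helper name, not something taken from the commit.

```python
from typing import List, Tuple


def split_launch_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first bare `--`.

    Returns (launcher options, command to run on each worker).
    Hypothetical helper for illustration; not the real launch.py logic.
    """
    if "--" not in argv:
        # No separator: everything is treated as launcher options.
        return list(argv), []
    idx = argv.index("--")
    return argv[:idx], argv[idx + 1:]


launcher_opts, worker_cmd = split_launch_args(
    ["--foreground", "--tpu=test-tpu", "--", "python", "train_lm.py"]
)
```

Splitting at the first `--` only means a nested invocation (for example `babysit-tpu-vm ... -- python infra/launch.py --foreground -- ...`) passes the second `--` through untouched to the inner command.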
 

diff --git a/docs/Training-On-Your-Data.md b/docs/Training-On-Your-Data.md
@@ -395,8 +395,23 @@ bash infra/spin-up-tpu-vm.sh my-tpu -z us-east1-d -t v3-128
 
 This will spin up a TPU VM instance and install Levanter on it. You can then run a command like so:
 
+
+```
+cat > .config <<EOF
+env:
+    WANDB_API_KEY:  ...
+    WANDB_ENTITY: ...
+    WANDB_PROJECT: levanter
+    HF_TOKEN: ...
+
+docker_repository: levanter
+zone: us-west4-a
+tpu: test-tpu
+EOF
+```
+
 ```bash
-gcloud compute tpus tpu-vm ssh my-tpu   --zone us-east1-d --worker=all --command="WANDB_API_KEY=... levanter/infra/launch.sh python levanter/src/levanter/main/train_lm.py --config_path gs://path/to/config.yaml"
+python infra/launch.py -- python levanter/src/levanter/main/train_lm.py --config_path gs://path/to/config.yaml
 ```
 
 ## Monitoring

diff --git a/docs/tutorials/Training-On-Audio-Data.md b/docs/tutorials/Training-On-Audio-Data.md
@@ -189,7 +189,7 @@ bash infra/spin-up-tpu-vm.sh my-tpu -z us-east1-d -t v3-128
 This will spin up a TPU VM instance and install Levanter on it. You can then run a command like so:
 
 ```bash
-gcloud compute tpus tpu-vm ssh my-tpu   --zone us-east1-d --worker=all --command="WANDB_API_KEY=... levanter/infra/launch.sh python levanter/src/levanter/main/train_asr.py --config_path gs://path/to/config.yaml"
+python infra/launch.py -- python levanter/src/levanter/main/train_asr.py --config_path gs://path/to/config.yaml
 ```
 
 ### GPU

diff --git a/infra/__init__.py b/infra/__init__.py
diff --git a/infra/babysit-tpu-vm.sh b/infra/babysit-tpu-vm.sh
@@ -77,8 +77,7 @@ while true; do
     else
       # run the command
       echo "Running command on VM $VM_NAME"
-      echo "gcloud compute tpus tpu-vm ssh --zone=$ZONE $VM_NAME --command='$CMD_ARGS_STR' --worker=all"
-      gcloud compute tpus tpu-vm ssh --zone=$ZONE $VM_NAME --command="$CMD_ARGS_STR" --worker=all
+      $CMD_ARGS_STR
       EXIT_CODE=$?
       if [ $EXIT_CODE -eq 0 ]; then
         echo "Command succeeded. Exiting"
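
Review note: with this change the babysit loop runs `$CMD_ARGS_STR` locally and keys off its exit code, which is why the launched command has to block until the job finishes. A minimal Python sketch of that retry pattern follows. It assumes `retries=-1` means "retry forever" (as the `RETRIES` comment in `parse-tpu-creation-args.sh` states) and that `retries=N` allows N additional attempts after the first failure, which is an assumption. The VM re-creation step of the real script is omitted, and `babysit` and `run` are hypothetical names.

```python
import subprocess
from typing import Callable, List


def babysit(
    cmd: List[str],
    retries: int = -1,
    run: Callable[[List[str]], int] = subprocess.call,
) -> int:
    """Re-run cmd until it exits 0 or the retry budget is exhausted.

    retries=-1 retries indefinitely; retries=N allows N extra attempts
    (assumed semantics, see the note above).
    """
    failures = 0
    while True:
        exit_code = run(cmd)
        if exit_code == 0:
            return 0  # command succeeded, stop babysitting
        failures += 1
        if retries >= 0 and failures > retries:
            return exit_code  # give up and report the last failure
```

If the command returned 0 immediately, as a backgrounded launch would, this loop would exit after the first iteration without ever observing a later failure.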

diff --git a/infra/helpers/parse-tpu-creation-args.sh b/infra/helpers/parse-tpu-creation-args.sh
@@ -23,6 +23,9 @@ AUTODELETE=true
 SETUP_SCRIPT="$SCRIPT_DIR/helpers/setup-tpu-vm.sh"
 SUBNETWORK="default"
 USE_ALPHA=false
+DOCKER_REPOSITORY="levanter"
+DOCKER_IMAGE="levanter"
+PROJECT=$(gcloud info --format='value(config.project)')
 RETRIES=-1  # how many times babysit-tpu-vm.sh should retry before giving up. -1 means infinite
 
 if [ -z "$GIT_BRANCH" ]; then
@@ -119,28 +122,5 @@ while [[ $# -gt 0 ]]; do
   esac
 done
 
-# check if the branch we chose has been pushed to the remote
-# if not, warn
-# if it's a commit sha/short-sha (or something that looks like one), check if it's in the remote
-if [[ "$GIT_BRANCH" =~ ^[0-9a-f]{7,40}$ ]]; then
-  # if it's a commit, check if it's in the remote
-  BRANCHES=$(git branch -r --contains "$GIT_BRANCH")
-  if [ -z "$BRANCHES" ]; then
-    >&2 echo "Warning: commit $GIT_BRANCH not found on remote $GIT_REPO"
-  fi
-else
-  # get the remote branch name
-  REMOTE_BRANCH=$(git ls-remote --heads origin "$GIT_BRANCH" | awk '{print $2}' | sed 's/refs\/heads\///g')
-  # if it's empty, warn
-  if [ -z "$REMOTE_BRANCH" ]; then
-    >&2 echo "Warning: branch $GIT_BRANCH not found on remote $GIT_REPO"
-  else
-    # make sure it's pushed
-    LOCAL_COMMIT=$(git rev-parse --short "$GIT_BRANCH")
-    REMOTE_COMMIT=$(git rev-parse --short "origin/$REMOTE_BRANCH")
-
-    if [ "$LOCAL_COMMIT" != "$REMOTE_COMMIT" ]; then
-     >&2 echo "Warning: branch $GIT_BRANCH not pushed to remote $GIT_REPO. Local commit: $LOCAL_COMMIT, remote commit: $REMOTE_COMMIT"
-    fi
-  fi
-fi
+# Extract region from ZONE (everything before the last "-")
+REGION=$(echo "$ZONE" | sed 's/-[^-]*$//g')
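
Review note: the `sed` expression above strips the final `-suffix` from the zone. A Python equivalent is sketched below, together with one way the region might feed into an Artifact Registry image path of the same shape as `REPO_LOCATION` in `Dockerfile.incremental`. How the scripts actually combine these variables is not visible in this diff, so `image_url` and its defaults are illustrative assumptions.

```python
def region_from_zone(zone: str) -> str:
    """Mirror sed 's/-[^-]*$//': drop the last '-' and everything after it."""
    return zone.rsplit("-", 1)[0]


def image_url(
    zone: str,
    project: str,
    repository: str = "levanter",
    image: str = "levanter",
    tag: str = "latest",
) -> str:
    """Build a <region>-docker.pkg.dev/<project>/<repository>/<image>:<tag> path.

    Hypothetical helper; the defaults echo DOCKER_REPOSITORY and DOCKER_IMAGE.
    """
    region = region_from_zone(zone)
    return f"{region}-docker.pkg.dev/{project}/{repository}/{image}:{tag}"


print(region_from_zone("us-west4-a"))
print(image_url("us-west4-a", "my-project"))
```

Like the `sed` version, `region_from_zone` returns its input unchanged when the zone contains no `-`.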
diff --git a/infra/helpers/setup-tpu-vm-tests.sh b/infra/helpers/setup-tpu-vm-tests.sh