Setup Docker for TPU execution and update infra scripts.
I tried to reduce the Docker image size with a multi-stage build, since Ray
currently requires a source build of Meson, which in turn requires a Clang
installation. Even so, jax and libtpu are each >250MB installs, so there's no
avoiding a large image at the moment.

Still, with this configuration, a v5-32 (the largest I could get given GCP's
stingy IP address allocation) takes about 50 seconds to run setup-vm.sh and pull
the initial image. After the initial pull, new deployments take a few seconds to
package up the current source directory.

It's still possible to use the `git clone` approach via a volume mount, but the
permissions are a bit finicky at that point, and I'm not sure how many options
we want to have.
rjpower committed May 27, 2024
1 parent 81ba8c0 commit 544bb06
Showing 10 changed files with 290 additions and 128 deletions.
4 changes: 3 additions & 1 deletion .dockerignore
@@ -1,3 +1,5 @@
.git

scratch
cache
wandb
@@ -44,6 +46,7 @@ instance/

# Sphinx documentation
docs/_build/
docs/figures/

# PyBuilder
target/
@@ -105,7 +108,6 @@ dmypy.json
# JetBrains
.idea/


# dataset cache files
**/*.parquet
**/ledger.json
1 change: 1 addition & 0 deletions .gitignore
@@ -140,6 +140,7 @@ dmypy.json
/wandb

# dataset cache files
/cache
*.parquet
ledger.json

23 changes: 23 additions & 0 deletions docker/tpu/Dockerfile
@@ -0,0 +1,23 @@
FROM python:3.10 AS build
RUN apt-get update && apt-get install -y clang
RUN pip install virtualenv

# venv binaries encode their directory, so we need to setup the venv in the final location
RUN virtualenv -p python3.10 /opt/levanter/.venv
RUN /opt/levanter/.venv/bin/pip install -U "jax[tpu]==0.4.26" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

WORKDIR /tmp

# Add only the requirements files to cache dependency build/installation
ADD pyproject.toml README.md /tmp/
RUN /opt/levanter/.venv/bin/pip install -e .

FROM python:3.10

WORKDIR /opt/levanter
COPY --from=build /opt/levanter/.venv /opt/levanter/.venv
ENV TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=60 TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=1024
ADD . /opt/levanter/

# Setup venv Python as the default
ENV PATH=/opt/levanter/.venv/bin:$PATH
Empty file added infra/__init__.py
Empty file.
140 changes: 140 additions & 0 deletions infra/deploy.py
@@ -0,0 +1,140 @@
#!/usr/bin/python

"""
Build and deploy the Levanter base image to Artifact Registry.
It is not necessary to run this yourself unless you are deploying a new base image: the launch
script will automatically build and deploy an image based on your current code.
"""

import argparse
import json
import subprocess

CLEANUP_POLICY = [
    {
        "name": "delete-stale",
        "action": {"type": "Delete"},
        "condition": {
            "olderThan": "86400s",
            "tagState": "ANY",
        },
    },
    {
        "name": "keep-latest",
        "action": {"type": "Keep"},
        "mostRecentVersions": {
            "keepCount": 5,
        },
    },
]


def _run(*args, **kw):
    print("Running ", " ".join(args[0]))
    return subprocess.check_output(*args, **kw)


def build_and_push_docker_image(project_id, region, repository, image_name):
    """Builds a Docker image, enables artifact access, and pushes to Artifact Registry."""

    artifact_repo = f"{region}-docker.pkg.dev/{project_id}/{repository}"

    # Activate artifact registry and setup the repository.
    _run(["gcloud", "services", "enable", "artifactregistry.googleapis.com"])

    try:
        _run(
            [
                "gcloud",
                "artifacts",
                "repositories",
                "create",
                "levanter",
                f"--location={region}",
                "--repository-format=docker",
            ],
            stderr=subprocess.STDOUT,
        )
    except subprocess.CalledProcessError as e:
        # Ignore error if repository already exists.
        if b"ALREADY_EXISTS" not in e.output:
            print("Error creating repository: ", e.output)
            raise

    with open("/tmp/cleanup-policy.json", "w") as f:
        json.dump(CLEANUP_POLICY, f, indent=2)

    _run(
        [
            "gcloud",
            "artifacts",
            "repositories",
            "set-cleanup-policies",
            f"--location={region}",
            "--policy=/tmp/cleanup-policy.json",
            repository,
        ]
    )

    # Grant public read access ('allUsers') for TPU VMs
    _run(
        [
            "gcloud",
            "artifacts",
            "repositories",
            "add-iam-policy-binding",
            "--member=allUsers",
            "--role=roles/artifactregistry.reader",
            f"--location={region}",
            repository,
        ]
    )

    _run(
        [
            "gcloud",
            "--project",
            project_id,
            "artifacts",
            "repositories",
            "add-iam-policy-binding",
            repository,
            "--location",
            region,
            "--member",
            "allUsers",
            "--role",
            "roles/artifactregistry.reader",
        ]
    )

    _run(["gcloud", "auth", "configure-docker", f"{region}-docker.pkg.dev"])
    _run(
        [
            "docker",
            "buildx",
            "build",
            "--platform=linux/amd64",
            "-t",
            image_name,
            "-f",
            "docker/tpu/Dockerfile",
            ".",
        ]
    )

    full_image_name = f"{artifact_repo}/{image_name}:latest"
    _run(["docker", "tag", image_name, full_image_name])
    _run(["docker", "push", full_image_name])


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Build and push Docker image to Artifact Registry.")
    parser.add_argument("--project", required=True, help="GCP project ID")
    parser.add_argument("--region", required=True, help="Artifact Registry region (e.g., us-west4)")
    parser.add_argument("--repository", default="levanter", help="Artifact Registry repository name")
    parser.add_argument("--image", default="levanter", help="Docker image name.")
    args = parser.parse_args()

    build_and_push_docker_image(args.project, args.region, args.repository, args.image)
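
For reference, the image name that deploy.py pushes can be composed as in the sketch below. It mirrors the `artifact_repo` and `full_image_name` f-strings above. The helper name `compose_image_name` and its `tag` parameter are hypothetical and not part of this commit, and the project ID in the usage line is a placeholder.

```python
def compose_image_name(project_id, region, repository, image_name, tag="latest"):
    """Return the Artifact Registry image reference used for docker tag/push.

    Hypothetical helper: deploy.py builds this string inline and always uses
    the "latest" tag; the `tag` parameter is an assumption added for clarity.
    """
    artifact_repo = f"{region}-docker.pkg.dev/{project_id}/{repository}"
    return f"{artifact_repo}/{image_name}:{tag}"


# Placeholder values; substitute your own project and region.
print(compose_image_name("my-project", "us-west4", "levanter", "levanter"))
```

setup-tpu-vm.sh pulls an image reference of the same shape, so the region, project, repository, and image values have to agree between the two scripts.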
6 changes: 6 additions & 0 deletions infra/helpers/parse-tpu-creation-args.sh
@@ -23,6 +23,9 @@ AUTODELETE=true
SETUP_SCRIPT="$SCRIPT_DIR/helpers/setup-tpu-vm.sh"
SUBNETWORK="default"
USE_ALPHA=false
DOCKER_REPOSITORY="levanter"
DOCKER_IMAGE="levanter"
PROJECT=$(gcloud info --format='value(config.project)')
RETRIES=-1 # how many times babysit-tpu-vm.sh should retry before giving up. -1 means infinite

if [ -z "$GIT_BRANCH" ]; then
@@ -144,3 +147,6 @@ else
fi
fi
fi

# Extract region from ZONE (everything before the last "-")
REGION=$(echo "$ZONE" | sed 's/-[^-]*$//g')
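
A rough Python equivalent of the sed expression above, for readers who want to check the zone-to-region mapping. The function name `region_from_zone` is illustrative and is not something this commit defines.

```python
def region_from_zone(zone):
    """Drop the final "-<suffix>" from a zone name, e.g. the trailing "-a".

    Illustrative helper only. Like the sed expression, it returns the input
    unchanged when the zone contains no "-".
    """
    return zone.rsplit("-", 1)[0]


print(region_from_zone("us-west4-a"))
```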
113 changes: 10 additions & 103 deletions infra/helpers/setup-tpu-vm.sh
@@ -1,55 +1,17 @@
# broadly based on https://github.com/ayaka14732/tpu-starter

# parse some arguments
# usage: ./setup-tpu-vm.sh -b|--branch <git commit or branch for levanter> -r <git repo for levanter>
# usage: REGION=<tpu-region> PROJECT=<tpu-project> DOCKER_REPOSITORY=<artifact_repo> DOCKER_IMAGE=<base_image_name> ./setup-tpu-vm.sh

if [ "$DEBUG" == "1" ]; then
set -x
fi

REPO="https://github.com/stanford-crfm/levanter.git"
BRANCH=main

if [ "$GIT_BRANCH" != "" ]; then
BRANCH="$GIT_BRANCH"
if [[ -z "$REGION" || -z "$PROJECT" || -z "$DOCKER_REPOSITORY" || -z "$DOCKER_IMAGE" ]]; then
  echo "REGION, PROJECT, DOCKER_REPOSITORY, and DOCKER_IMAGE must be set."
  echo "Current values: REGION=$REGION, PROJECT=$PROJECT, DOCKER_REPOSITORY=$DOCKER_REPOSITORY, DOCKER_IMAGE=$DOCKER_IMAGE"
  exit 1
fi

while [[ $# -gt 0 ]]; do
key="$1"
case $key in
-b|--branch)
BRANCH="$2"
shift
shift
;;
-r|--repo)
REPO="$2"
shift
shift
;;
*)
>&2 echo "Unknown option $1"
exit 1
;;
esac
done

# we frequently deal with commands failing, and we like to loop until they succeed. this function does that for us
function retry {
for i in {1..5}; do
$@
if [ $? -eq 0 ]; then
break
fi
if [ $i -eq 5 ]; then
>&2 echo "Error running $*, giving up"
exit 1
fi
>&2 echo "Error running $*, retrying in 5 seconds"
sleep 5
done
}

# tcmalloc interferes with intellij remote ide
sudo patch -f -b /etc/environment << EOF
2c2
@@ -58,68 +20,13 @@ sudo patch -f -b /etc/environment << EOF
> #LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"
EOF



# don't complain if already applied
retCode=$?
[[ $retCode -le 1 ]] || exit $retCode

DOCKER_IMAGE=$REGION-docker.pkg.dev/$PROJECT/$DOCKER_REPOSITORY/$DOCKER_IMAGE:latest
sudo docker pull $DOCKER_IMAGE
sudo docker tag $DOCKER_IMAGE levanter:latest

# set these env variables b/c it makes tensorstore behave better
if ! grep -q TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS /etc/environment; then
# need sudo
echo "TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=60" | sudo tee -a /etc/environment > /dev/null
fi

if ! grep -q TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES /etc/environment; then
echo "TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=1024" | sudo tee -a /etc/environment > /dev/null
fi

# install python 3.10, latest git
sudo systemctl stop unattended-upgrades # this frequently holds the apt lock
sudo systemctl disable unattended-upgrades
sudo apt remove -y unattended-upgrades
# if it's still running somehow, kill it
if [ $(ps aux | grep unattended-upgrade | wc -l) -gt 1 ]; then
sudo kill -9 $(ps aux | grep unattended-upgrade | awk '{print $2}')
fi

# sometimes apt-get update fails, so retry a few times
retry sudo apt-get install -y software-properties-common
retry sudo add-apt-repository -y ppa:deadsnakes/ppa
retry sudo add-apt-repository -y ppa:git-core/ppa
retry sudo apt-get -qq update
retry sudo apt-get -qq install -y python3.10-full python3.10-dev git

VENV=~/venv310
# if the venv doesn't exist, make it
if [ ! -d "$VENV" ]; then
echo "Creating virtualenv at $VENV"
python3.10 -m venv $VENV
fi

source $VENV/bin/activate

pip install -U pip
pip install -U wheel

# jax and jaxlib
# libtpu sometimes has issues installing for clinical (probably firewall?)
#retry pip install -U "jax[tpu]==0.4.5" libtpu-nightly==0.1.dev20230216 -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
retry pip install -U "jax[tpu]==0.4.26" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# clone levanter
git clone $REPO levanter
echo $VENV > levanter/infra/venv_path.txt

cd levanter

# checkout the branch we want

echo "Checking out branch $BRANCH"

git checkout $BRANCH

# install levanter

pip install -e .
# let our user use docker without `sudo` from now on
sudo usermod -aG docker $USER
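
The required-variable check that the new setup-tpu-vm.sh performs could be sketched in Python as follows. `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names used only for illustration. Like the shell `-z` test, the sketch treats unset and empty values the same.

```python
import os

# The four variables the new setup-tpu-vm.sh insists on.
REQUIRED_SETTINGS = ("REGION", "PROJECT", "DOCKER_REPOSITORY", "DOCKER_IMAGE")


def missing_settings(env):
    """Return the required names that are unset or empty in `env` (a mapping)."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings:", ", ".join(missing))
```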