
Setup Docker for TPU execution and update infra scripts. #601

Merged (8 commits) on Jun 11, 2024

Conversation

@rjpower (Collaborator) commented May 27, 2024

(Still needs some cleanup & doc updates but sending for thoughts.

A few notes:

  • We could use draccus configs for these infra parameters, but need to adjust it to support remainder args. I think it's okay to do that as a separate cleanup pass.
  • I moved a few scripts into Python: trying to figure out how to preserve commands through 3 layers of bash is challenging... I think it's worth moving more of these over and cleaning it up a bit over time.
  • It's still possible to use the git clone with this via a volume mount, but I think this should work as-is for most use cases.
  • With the current setup each project needs to host the base Levanter image themselves. We can split the Docker build so that we have a base image and just add the current source on top of it. If this looks reasonable I'll do that.
FROM stanford/levanter # contains all of the setup and package installation
...

)
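The quoting pain mentioned above — preserving a command through several layers of bash — is the usual argument for building argv lists in Python instead of shell strings. A minimal sketch of the idea, assuming nothing about the actual launch.py API (`run_logged` and `ssh_argv` are illustrative names, not real helpers from this PR):

```python
import shlex
import subprocess


def run_logged(argv):
    """Run a command given as an argv list, echoing a copy-pasteable form.

    Passing a list (never a shell string) means no extra layer re-parses
    the arguments, so quoting survives intact end to end.
    """
    print("Running:", " ".join(shlex.quote(a) for a in argv))
    return subprocess.check_output(argv)


def ssh_argv(tpu_name, zone, remote_argv):
    """Wrap a remote command for one hop of `gcloud ... ssh`.

    shlex.quote does the escaping, so the remote command arrives on the
    TPU VM exactly as written, however many spaces or quotes it contains.
    """
    remote = " ".join(shlex.quote(a) for a in remote_argv)
    return [
        "gcloud", "alpha", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
        "--worker=all", f"--zone={zone}", f"--command={remote}",
    ]
```

The design choice is simply to keep commands as lists until the last possible moment; only the single ssh hop needs real quoting, and `shlex.quote` handles that mechanically.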

I tried to optimize the Docker image size a bit using a staged build, as Ray currently requires a source build of Meson, which requires a Clang installation... even with this, jax and libtpu are each >250 MB installs, so there's no avoiding a large image at the moment.

Still, with this configuration, a v5-32 (the most I could get given GCP's stingy IP address allocation) takes about 50 seconds to run setup-vm.sh and pull the initial image. After the initial pull, new deployments take a few seconds to package up the current source directory.

@dlwh (Member) left a comment

This is amazing and wonderful. Thank you

Mostly just making sure I understand: the new workflow is:

  • I hack on my own fork of levanter.
  • Then, when I'm ready, I just run the new launch command
  • It will automatically spin up a tpu
  • build a docker image from my current checkout
  • push the image
  • deploy and run the image on the tpu machine

One thing that I think might be lost (unless I'm misreading) is that ATM we log the git commit to wandb. Is that right?

@rjpower (Collaborator, Author) commented May 27, 2024

> This is amazing and wonderful. Thank you
>
> Mostly just making sure I understand: the new workflow is:
>
>   • I hack on my own fork of levanter.
>   • Then, when I'm ready, I just run the new launch command
>   • It will automatically spin up a tpu
>   • build a docker image from my current checkout
>   • push the image
>   • deploy and run the image on the tpu machine
>
> One thing that I think might be lost (unless I'm misreading) is that ATM we log the git commit to wandb. Is that right?

Yes, mostly: I don't spin up the new TPU yet; you still use spin-up-vm.sh, which pulls a "default" image just to warm up the VM. From that point on, when you run launch, it's exactly as you say. It takes negligible time to set up and push any changes, as it's just the last "layer" of the Docker image and <1MB of data.

I think having the combined workflow is a really good idea though and I can add that in a follow-up PR.

@rjpower (Collaborator, Author) left a comment

I cannot for the life of me figure out the review process on GitHub: I can only find the "Submit Review" button from inside VS Code, so here goes...

It's a bit weird: you go to Files Changed on the PR page, add review comments, and then there's a green button at the top right called Review Changes.

@rjpower (Collaborator, Author) commented May 27, 2024

I'll update the docs and validate testing works and then give you a ping.

@rjpower (Collaborator, Author) commented May 28, 2024

Okay, I think I caught all the docs etc. I updated the .github test integration to use launch.py, but it won't run directly for me: I don't think GitHub will inject credentials for external PRs, so it's just failing in spin-up.

I've verified it runs correctly (there are a few test failures, but they aren't my fault 😁) with the following command manually:

python infra/launch.py --foreground --tpu=test-spin-up-1 --zone=us-west4-a -- /opt/levanter/.venv/bin/pytest tests

We can of course tweak launch.py to use env/secrets as needed for that.

PTAL.

@dlwh (Member) left a comment

Still trying it out, but mostly looks good to me! Just thinking through the migration path and the first-time launch experience.

@dlwh (Member) commented May 28, 2024

Running directly off your branch I get

Running  docker tag levanter-dlwh us-west4-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716873733
Error response from daemon: No such image: levanter-dlwh:latest
Traceback (most recent call last):
  File "/Users/dlwh/src/levanter/infra/launch.py", line 79, in <module>
    full_image_id = deploy.push_to_gcp(
  File "/Users/dlwh/src/levanter/infra/deploy.py", line 133, in push_to_gcp
    _run(["docker", "tag", image_name, full_image_name])
  File "/Users/dlwh/src/levanter/infra/deploy.py", line 35, in _run
    return subprocess.check_output(*args, **kw)
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'tag', 'levanter-dlwh', 'us-west4-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716873733']' returned non-zero exit status 1.

@rjpower (Collaborator, Author) commented May 28, 2024

> Running directly off your branch I get
>
> Running  docker tag levanter-dlwh us-west4-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716873733
> Error response from daemon: No such image: levanter-dlwh:latest
> Traceback (most recent call last):
>   File "/Users/dlwh/src/levanter/infra/launch.py", line 79, in <module>
>     full_image_id = deploy.push_to_gcp(
>   File "/Users/dlwh/src/levanter/infra/deploy.py", line 133, in push_to_gcp
>     _run(["docker", "tag", image_name, full_image_name])
>   File "/Users/dlwh/src/levanter/infra/deploy.py", line 35, in _run
>     return subprocess.check_output(*args, **kw)
>   File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 421, in check_output
>     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
>   File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 526, in run
>     raise CalledProcessError(retcode, process.args,
> subprocess.CalledProcessError: Command '['docker', 'tag', 'levanter-dlwh', 'us-west4-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716873733']' returned non-zero exit status 1.

Doh, that's probably something I just screwed up: I added a timestamp to uniquely identify the tagged image, and neglected to update that command. So you can understand my pain:

Docker has a "latest" tag which works mostly as you'd expect, except with remote images. If you use

docker pull xyz.pkg.dev/levanter/levanter:latest

it works correctly the first time, but if you update the package on pkg.dev and pull again, there's no staleness check: Docker just assumes that if it has that tag locally, everything must be okay. So we basically can't use the "latest" tag except for things like a base image that we don't mind going stale.
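The workaround described here — a unique timestamp tag per push instead of `latest` — can be sketched roughly like this. This is a hedged illustration, not the actual push_docker.py code; the function names and registry path are made up:

```python
import subprocess
import time


def unique_tag(registry_repo, image, timestamp=None):
    """Build a per-push image name like repo/image:1716873733.

    A fresh tag per push sidesteps the stale-`latest` problem: the remote
    host has never seen the tag, so it is forced to pull the new image.
    """
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{registry_repo}/{image}:{ts}"


def tag_and_push(local_image, registry_repo):
    """Tag the locally built image with a timestamp and push that name."""
    full_name = unique_tag(registry_repo, local_image)
    # Tag whatever the local build produced -- not a hard-coded :latest,
    # which is exactly the mismatch behind the "No such image" error above.
    subprocess.check_call(["docker", "tag", local_image, full_name])
    subprocess.check_call(["docker", "push", full_name])
    return full_name
```

The timestamp doubles as a rough build identifier, which is handy when correlating a running TPU job with the source that produced it.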

@rjpower (Collaborator, Author) commented May 28, 2024

Okay, it's faster with the incremental build (and a little more predictable).

[+] Building 44.5s (10/10) FINISHED

vs building the base image:

[+] Building 169.1s (16/16) FINISHED                                                                                                                                                                                     docker:desktop-linux

In both cases, the build is cached, so you should only pay this once. Further deployments are almost instantaneous, e.g. to touch a single file I get:

[+] Building 0.1s (10/10) FINISHED 

@dlwh (Member) commented May 28, 2024

> it works correctly the first time, but if you update the package on pkg.dev and call pull again, there's no check for staleness: Docker just assumes if it has that tag locally everything must be okay. So basically we can't use the "latest" tag except for things like a base image that we don't mind going stale.

Ah yes, Docker is very annoying about things like that. :(

@dlwh (Member) commented May 28, 2024

ok amazing. I'll make sure I can get through this workflow (and push a base image to ghcr) but otherwise lgtm!

@dlwh (Member) commented May 28, 2024

I had to run

gcloud auth configure-docker us-central2-docker.pkg.dev

But now I get

name unknown: Repository "levanter" not found
Traceback (most recent call last):
  File "/Users/dlwh/src/levanter-dev/infra/launch.py", line 83, in <module>
    full_image_id = push_docker.push_to_gcp(
  File "/Users/dlwh/src/levanter-dev/infra/push_docker.py", line 153, in push_to_gcp
    _run(["docker", "push", full_image_name])
  File "/Users/dlwh/src/levanter-dev/infra/push_docker.py", line 35, in _run
    return subprocess.check_output(*args, **kw)
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'push', 'us-central2-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716924862']' returned non-zero exit status 1.

@rjpower (Collaborator, Author) commented May 28, 2024

> I had to run
>
> gcloud auth configure-docker us-central2-docker.pkg.dev
>
> But now I get
>
> name unknown: Repository "levanter" not found

Crap, that was me: I was trying to factor out the push setup to test pushing to Docker Hub and removed the GCP setup from the push_to_gcp flow. Can you try again now?

I'll also try testing with a different repository name to see if that breaks anything.

@rjpower force-pushed the faster-setup branch 5 times, most recently from 8f2689e to 726d0ab on June 3, 2024 at 23:44
@rjpower (Collaborator, Author) commented Jun 3, 2024

I went ahead and added a few cleanups:

  • The launch script now handles retries, TPU startup, and setup, so you can use it in place of babysit+spinup+setup+run
  • I added pushing to ghcr.io: https://ghcr.io/rjpower/levanter. The first push to GCP will take a while with this setup, as you have to push the ~500MB base image, but the build is still quick and after that everything is speedy.
  • Refactored so the scripts share some CLI helpers etc

I found this setup nice for dependencies: for instance, I had a few changes to Haliax, so I just copied haliax into my levanter directory and added `cd haliax && pip install -e .` to the Dockerfile.

Let me know if you see any other things you'd like addressed!

@rjpower force-pushed the faster-setup branch 4 times, most recently from 5906255 to be1326f on June 5, 2024 at 21:40
It's still possible to use the `git clone` approach via a volume mount, but the permissions are a bit finicky at that point, and I'm not sure how many options we want to have.
@dlwh (Member) commented Jun 11, 2024

sorry getting back to this, some thoughts:

  • it'd be nice if launch would be willing to reuse an existing tpu-vm if it already exists
  • small bytes-vs-str thing in configure_gcp_docker I'll add a suggestion for (mostly note to self)
  • after fixing that, i'm getting this:
Running: gcloud alpha compute tpus tpu-vm ssh dlwh-test --worker=all --zone=us-east1-d --command=docker volume create --driver=local levanter
SSH key found in project metadata; not updating instance.
Using ssh batch size of 4. Attempting to SSH into 1 nodes with a total of 4 workers.
SSH: Attempting to connect to worker 0...
SSH: Attempting to connect to worker 1...
SSH: Attempting to connect to worker 2...
SSH: Attempting to connect to worker 3...
levanter
levanter
levanter
levanter
Running  gcloud artifacts repositories describe --location=us-east1 levanter
ERROR: (gcloud.artifacts.repositories.describe) NOT_FOUND: Requested entity was not found.
Error running command.

@dlwh (Member) commented Jun 11, 2024

lol now i'm getting this completely useless error

~/src/levanter-dev (faster-setup)$ PYTHONPATH=. python infra/launch.py --foreground --tpu=test-spin-up-1 --zone=us-central2-b --tpu_type v4-8 -- /opt/levanter/.venv/bin/pytest tests
Listed 0 items.
Creating new TPU test-spin-up-1 in us-central2-b of type v4-8...
Running: gcloud alpha compute tpus tpu-vm create test-spin-up-1 --accelerator-type=v4-8 --version=tpu-ubuntu2204-base --zone=us-central2-b  --quiet
ERROR: (gcloud.alpha.compute.tpus.tpu-vm.create) unrecognized arguments:

To search the help text of gcloud commands, run:
  gcloud help -- SEARCH_TERMS
Error running command.

@dlwh (Member) commented Jun 11, 2024

`repr` helped me figure it out, lol:

Running: 'gcloud' 'alpha' 'compute' 'tpus' 'tpu-vm' 'create' 'test-spin-up-1' '--accelerator-type=v4-8' '--version=tpu-ubuntu2204-base' '--zone=us-central2-b' '' '--quiet'
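The stray empty string in that argv (invisible until printed with repr) is exactly the kind of bug a small guard prevents. A sketch of one way to do it, under the assumption that optional flags are modeled as `''` when unset — the function name and flag handling are illustrative, not the fix that actually landed:

```python
def build_create_command(tpu_name, tpu_type, zone, version, preemptible=False):
    """Assemble the gcloud TPU create command, dropping empty fragments.

    An optional flag left as '' when unset is filtered out before the
    command runs, so gcloud never sees a bogus empty positional argument
    (the cause of the "unrecognized arguments:" error above).
    """
    maybe_preemptible = "--preemptible" if preemptible else ""
    argv = [
        "gcloud", "alpha", "compute", "tpus", "tpu-vm", "create", tpu_name,
        f"--accelerator-type={tpu_type}",
        f"--version={version}",
        f"--zone={zone}",
        maybe_preemptible,
        "--quiet",
    ]
    argv = [a for a in argv if a]  # strip empty strings
    # repr-style echo makes an empty '' argument impossible to miss
    print("Running:", " ".join(repr(a) for a in argv))
    return argv
```

A cleaner variant is to never put `''` in the list at all (append optional flags conditionally), but the filter is a cheap belt-and-suspenders either way.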

@rjpower (Collaborator, Author) commented Jun 11, 2024

No worries, I've been super busy with other obligations the past week so it's honestly good timing :).

> it'd be nice if launch would be willing to reuse an existing tpu-vm if it already exists

It should already do this... (oh, I just saw your patch...)

@dlwh (Member) left a comment

lgtm, thanks!

@rjpower (Collaborator, Author) commented Jun 11, 2024

Okay, cleaned everything up. I think I can't submit because the TPU tests are failing for an unrelated reason so you may need to merge for me:

gcloud compute tpus tpu-vm delete ci-run-9473410622 --zone us-central2-b --quiet
ERROR: (gcloud.compute.tpus.tpu-vm.delete) Error parsing [tpu].
The [tpu] resource is not properly specified.
Failed to find attribute [project]. The attribute can be set in the following ways: 
- provide the argument `tpu` on the command line with a fully specified name
- provide the argument `--project` on the command line
- set the property `core/project`

It appears GitHub secrets aren't available in forked repos: https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions#using-secrets-in-a-workflow (even if approved). I couldn't find a good workaround, unfortunately; I'm sure there's some Rube Goldberg setup where you have GCP Cloud Run pick up approved requests and call web services, but I expect it would work 4% of the time.

@dlwh (Member) commented Jun 11, 2024

> I couldn't find a good workaround, unfortunately; I'm sure there's some Rube Goldberg setup where you have GCP Cloud Run pick up approved requests and call web services, but I expect it would work 4% of the time.

a very simple solution is for me to add you to the repo :-)

@dlwh (Member) commented Jun 11, 2024

(I don't care enough about pre-commit. we can fix it later)

@dlwh merged commit 87a8b81 into stanford-crfm:main on Jun 11, 2024
3 of 6 checks passed
@dlwh (Member) commented Jun 11, 2024

Thanks so much!
