
Setup Docker for TPU execution and update infra scripts. #601

Merged (8 commits) on Jun 11, 2024

Conversation

@rjpower (Collaborator) commented May 27, 2024

(Still needs some cleanup & doc updates but sending for thoughts.

A few notes:

  • We could use draccus configs for these infra parameters, but need to adjust it to support remainder args. I think it's okay to do that as a separate cleanup pass.
  • I moved a few scripts into Python: trying to figure out how to preserve commands through 3 layers of bash is challenging... I think it's worth moving more of these over and cleaning it up a bit over time.
  • It's still possible to use the git clone with this via a volume mount, but I think this should work as-is for most use cases.
  • With the current setup each project needs to host the base Levanter image themselves. We can split the Docker build so that we have a base image and just add the current source on top of it. If this looks reasonable I'll do that.
FROM stanford/levanter # contains all of the setup and package installation
...

)
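The quoting pain mentioned above — preserving a command through several layers of bash — is the usual argument for building argv lists in Python instead of shell strings. A minimal sketch of the idea, assuming nothing about the actual launch.py API (`run_logged` and `ssh_argv` are illustrative names, not real helpers from this PR):

```python
import shlex
import subprocess


def run_logged(argv):
    """Run a command given as an argv list, echoing a copy-pasteable form.

    Passing a list (never a shell string) means no extra layer re-parses
    the arguments, so quoting survives intact end to end.
    """
    print("Running:", " ".join(shlex.quote(a) for a in argv))
    return subprocess.check_output(argv)


def ssh_argv(tpu_name, zone, remote_argv):
    """Wrap a remote command for one hop of `gcloud ... ssh`.

    shlex.quote does the escaping, so the remote command arrives on the
    TPU VM exactly as written, however many spaces or quotes it contains.
    """
    remote = " ".join(shlex.quote(a) for a in remote_argv)
    return [
        "gcloud", "alpha", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
        "--worker=all", f"--zone={zone}", f"--command={remote}",
    ]
```

The design choice is simply to keep commands as lists until the last possible moment; only the single ssh hop needs real quoting, and `shlex.quote` handles that mechanically.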

I tried to optimize the Docker image size a bit using a staged build, as Ray currently requires a source build of Meson, which requires a Clang installation... even with this, jax and libtpu are each >250 MB installs, so there's no avoiding a large image at the moment.

Still, with this configuration, a v5-32 (the most I could get given GCP's stingy IP address allocation) takes about 50 seconds to run setup-vm.sh and pull the initial image. After the initial pull, new deployments take a few seconds to package up the current source directory.

@dlwh (Member) left a comment

This is amazing and wonderful. Thank you

Mostly just making sure I understand: the new workflow is:

  • I hack on my own fork of levanter.
  • Then, when I'm ready, I just run the new launch command
  • It will automatically spin up a tpu
  • build a docker image from my current checkout
  • push the image
  • deploy and run the image on the tpu machine

One thing that I think might be lost (unless I'm misreading) is that ATM we log the git commit to wandb. Is that right?

@rjpower (Collaborator, Author) commented May 27, 2024

> This is amazing and wonderful. Thank you
>
> Mostly just making sure I understand: the new workflow is:
>
>   • I hack on my own fork of levanter.
>   • Then, when I'm ready, I just run the new launch command
>   • It will automatically spin up a tpu
>   • build a docker image from my current checkout
>   • push the image
>   • deploy and run the image on the tpu machine
>
> One thing that I think might be lost (unless I'm misreading) is that ATM we log the git commit to wandb. Is that right?

Yes, mostly: I don't spin up the new TPU yet; you still use spin-up-vm.sh, which pulls a "default" image just to warm up the VM. From that point on, when you run launch, it's exactly as you say. It takes negligible time to set up and push any changes, as it's just the last "layer" of the Docker image and <1MB of data.

I think having the combined workflow is a really good idea though and I can add that in a follow-up PR.

@rjpower (Collaborator, Author) left a comment

I cannot for the life of me figure out the review process on GitHub: I can only find the "Submit Review" button from inside VS Code, so here goes...

It's a bit weird: you go to Files Changed on the PR page, add review comments, and then there's a green button at the top right called Review Changes.

@rjpower (Collaborator, Author) commented May 27, 2024

I'll update the docs and validate testing works and then give you a ping.

@rjpower (Collaborator, Author) commented May 28, 2024

Okay, I think I caught all the docs etc. I updated the .github test integration to use launch.py, but it won't run directly for me: I don't think GitHub will inject credentials for external PRs, so it's just failing in spin-up.

I've verified it runs correctly (there are a few test failures, but they aren't my fault 😁) with the following command manually:

python infra/launch.py --foreground --tpu=test-spin-up-1 --zone=us-west4-a -- /opt/levanter/.venv/bin/pytest tests

We can of course tweak launch.py to use env/secrets as needed for that.

PTAL.

@dlwh (Member) left a comment

Still trying it out, but mostly looks good to me! Just thinking through the migration path and the first-time launch experience.

@dlwh (Member) commented May 28, 2024

Running directly off your branch I get

Running  docker tag levanter-dlwh us-west4-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716873733
Error response from daemon: No such image: levanter-dlwh:latest
Traceback (most recent call last):
  File "/Users/dlwh/src/levanter/infra/launch.py", line 79, in <module>
    full_image_id = deploy.push_to_gcp(
  File "/Users/dlwh/src/levanter/infra/deploy.py", line 133, in push_to_gcp
    _run(["docker", "tag", image_name, full_image_name])
  File "/Users/dlwh/src/levanter/infra/deploy.py", line 35, in _run
    return subprocess.check_output(*args, **kw)
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'tag', 'levanter-dlwh', 'us-west4-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716873733']' returned non-zero exit status 1.

@rjpower (Collaborator, Author) commented May 28, 2024

> Running directly off your branch I get
>
> Running  docker tag levanter-dlwh us-west4-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716873733
> Error response from daemon: No such image: levanter-dlwh:latest
> Traceback (most recent call last):
>   File "/Users/dlwh/src/levanter/infra/launch.py", line 79, in <module>
>     full_image_id = deploy.push_to_gcp(
>   File "/Users/dlwh/src/levanter/infra/deploy.py", line 133, in push_to_gcp
>     _run(["docker", "tag", image_name, full_image_name])
>   File "/Users/dlwh/src/levanter/infra/deploy.py", line 35, in _run
>     return subprocess.check_output(*args, **kw)
>   File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 421, in check_output
>     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
>   File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 526, in run
>     raise CalledProcessError(retcode, process.args,
> subprocess.CalledProcessError: Command '['docker', 'tag', 'levanter-dlwh', 'us-west4-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716873733']' returned non-zero exit status 1.

Doh, that's probably something I just screwed up: I added a timestamp to uniquely identify the tagged image, and neglected to update that command. So you can understand my pain:

Docker has a "latest" tag which works mostly as you'd expect, except with remote images. If you use

docker pull xyz.pkg.dev/levanter/levanter:latest

it works correctly the first time, but if you update the package on pkg.dev and pull again, there's no staleness check: Docker just assumes that if it has that tag locally, everything must be okay. So we basically can't use the "latest" tag except for things like a base image that we don't mind going stale.
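The workaround described here — a unique timestamp tag per push instead of `latest` — can be sketched roughly like this. This is a hedged illustration, not the actual push_docker.py code; the function names and registry path are made up:

```python
import subprocess
import time


def unique_tag(registry_repo, image, timestamp=None):
    """Build a per-push image name like repo/image:1716873733.

    A fresh tag per push sidesteps the stale-`latest` problem: the remote
    host has never seen the tag, so it is forced to pull the new image.
    """
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{registry_repo}/{image}:{ts}"


def tag_and_push(local_image, registry_repo):
    """Tag the locally built image with a timestamp and push that name."""
    full_name = unique_tag(registry_repo, local_image)
    # Tag whatever the local build produced -- not a hard-coded :latest,
    # which is exactly the mismatch behind the "No such image" error above.
    subprocess.check_call(["docker", "tag", local_image, full_name])
    subprocess.check_call(["docker", "push", full_name])
    return full_name
```

The timestamp doubles as a rough build identifier, which is handy when correlating a running TPU job with the source that produced it.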

@rjpower (Collaborator, Author) commented May 28, 2024

Okay, it's faster with the incremental build (and a little more predictable).

[+] Building 44.5s (10/10) FINISHED

vs building the base image:

[+] Building 169.1s (16/16) FINISHED                                                                                                                                                                                     docker:desktop-linux

In both cases, the build is cached, so you should only pay this once. Further deployments are almost instantaneous, e.g. to touch a single file I get:

[+] Building 0.1s (10/10) FINISHED 

@dlwh (Member) commented May 28, 2024

> it works correctly the first time, but if you update the package on pkg.dev and call pull again, there's no check for staleness: Docker just assumes if it has that tag locally everything must be okay. So basically we can't use the "latest" tag except for things like a base image that we don't mind going stale.

Ah yes, Docker is very annoying about things like that. :(

@dlwh (Member) commented May 28, 2024

ok amazing. I'll make sure I can get through this workflow (and push a base image to ghcr) but otherwise lgtm!

@dlwh (Member) commented May 28, 2024

I had to run

gcloud auth configure-docker us-central2-docker.pkg.dev

But now I get

name unknown: Repository "levanter" not found
Traceback (most recent call last):
  File "/Users/dlwh/src/levanter-dev/infra/launch.py", line 83, in <module>
    full_image_id = push_docker.push_to_gcp(
  File "/Users/dlwh/src/levanter-dev/infra/push_docker.py", line 153, in push_to_gcp
    _run(["docker", "push", full_image_name])
  File "/Users/dlwh/src/levanter-dev/infra/push_docker.py", line 35, in _run
    return subprocess.check_output(*args, **kw)
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'push', 'us-central2-docker.pkg.dev/hai-gcp-models/levanter/levanter-dlwh:1716924862']' returned non-zero exit status 1.

@rjpower (Collaborator, Author) commented May 28, 2024

> I had to run
>
> gcloud auth configure-docker us-central2-docker.pkg.dev
>
> But now I get
>
> name unknown: Repository "levanter" not found

Crap, that was me: I was trying to factor out the push setup to test pushing to Docker Hub and removed the GCP setup from the push_to_gcp flow. Can you try again now?

I'll also try testing with a different repository name to see if that breaks anything.

@rjpower force-pushed the faster-setup branch 5 times, most recently from 8f2689e to 726d0ab on June 3, 2024 at 23:44
@rjpower (Collaborator, Author) commented Jun 3, 2024

I went ahead and added a few cleanups:

  • The launch script now handles retries, TPU startup, and setup, so you can use it in place of babysit+spinup+setup+run
  • I added pushing to ghcr.io: https://ghcr.io/rjpower/levanter. The first push to GCP will take a while with this setup, as you have to push the ~500MB base image, but the build is still quick and after that everything is speedy.
  • Refactored so the scripts share some CLI helpers etc

I found this setup nice for dependencies: for instance, I had a few changes to Haliax, so I just copied haliax into my levanter directory and added `cd haliax && pip install -e .` to the Dockerfile.

Let me know if you see any other things you'd like addressed!

@rjpower force-pushed the faster-setup branch 4 times, most recently from 5906255 to be1326f on June 5, 2024 at 21:40
It's still possible to use the `git clone` approach via a volume mount, but the permissions are a bit finicky at that point, and I'm not sure how many options we want to have.
@dlwh (Member) commented Jun 11, 2024

sorry getting back to this, some thoughts:

  • it'd be nice if launch would be willing to reuse an existing tpu-vm if it already exists
  • small bytes-vs-str thing in configure_gcp_docker I'll add a suggestion for (mostly note to self)
  • after fixing that, i'm getting this:
Running: gcloud alpha compute tpus tpu-vm ssh dlwh-test --worker=all --zone=us-east1-d --command=docker volume create --driver=local levanter
SSH key found in project metadata; not updating instance.
Using ssh batch size of 4. Attempting to SSH into 1 nodes with a total of 4 workers.
SSH: Attempting to connect to worker 0...
SSH: Attempting to connect to worker 1...
SSH: Attempting to connect to worker 2...
SSH: Attempting to connect to worker 3...
levanter
levanter
levanter
levanter
Running  gcloud artifacts repositories describe --location=us-east1 levanter
ERROR: (gcloud.artifacts.repositories.describe) NOT_FOUND: Requested entity was not found.
Error running command.

@dlwh (Member) commented Jun 11, 2024

lol now i'm getting this completely useless error

~/src/levanter-dev (faster-setup)$ PYTHONPATH=. python infra/launch.py --foreground --tpu=test-spin-up-1 --zone=us-central2-b --tpu_type v4-8 -- /opt/levanter/.venv/bin/pytest tests
Listed 0 items.
Creating new TPU test-spin-up-1 in us-central2-b of type v4-8...
Running: gcloud alpha compute tpus tpu-vm create test-spin-up-1 --accelerator-type=v4-8 --version=tpu-ubuntu2204-base --zone=us-central2-b  --quiet
ERROR: (gcloud.alpha.compute.tpus.tpu-vm.create) unrecognized arguments:

To search the help text of gcloud commands, run:
  gcloud help -- SEARCH_TERMS
Error running command.

@dlwh (Member) commented Jun 11, 2024

`repr` helped me figure it out, lol:

Running: 'gcloud' 'alpha' 'compute' 'tpus' 'tpu-vm' 'create' 'test-spin-up-1' '--accelerator-type=v4-8' '--version=tpu-ubuntu2204-base' '--zone=us-central2-b' '' '--quiet'
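The stray empty string in that argv (invisible until printed with repr) is exactly the kind of bug a small guard prevents. A sketch of one way to do it, under the assumption that optional flags are modeled as `''` when unset — the function name and flag handling are illustrative, not the fix that actually landed:

```python
def build_create_command(tpu_name, tpu_type, zone, version, preemptible=False):
    """Assemble the gcloud TPU create command, dropping empty fragments.

    An optional flag left as '' when unset is filtered out before the
    command runs, so gcloud never sees a bogus empty positional argument
    (the cause of the "unrecognized arguments:" error above).
    """
    maybe_preemptible = "--preemptible" if preemptible else ""
    argv = [
        "gcloud", "alpha", "compute", "tpus", "tpu-vm", "create", tpu_name,
        f"--accelerator-type={tpu_type}",
        f"--version={version}",
        f"--zone={zone}",
        maybe_preemptible,
        "--quiet",
    ]
    argv = [a for a in argv if a]  # strip empty strings
    # repr-style echo makes an empty '' argument impossible to miss
    print("Running:", " ".join(repr(a) for a in argv))
    return argv
```

A cleaner variant is to never put `''` in the list at all (append optional flags conditionally), but the filter is a cheap belt-and-suspenders either way.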

@rjpower (Collaborator, Author) commented Jun 11, 2024

No worries, I've been super busy with other obligations the past week so it's honestly good timing :).

> it'd be nice if launch would be willing to reuse an existing tpu-vm if it already exists

It should already do this... (oh, I just saw your patch...)

@dlwh (Member) left a comment

lgtm, thanks!

@rjpower (Collaborator, Author) commented Jun 11, 2024

Okay, cleaned everything up. I think I can't submit because the TPU tests are failing for an unrelated reason so you may need to merge for me:

gcloud compute tpus tpu-vm delete ci-run-9473410622 --zone us-central2-b --quiet
ERROR: (gcloud.compute.tpus.tpu-vm.delete) Error parsing [tpu].
The [tpu] resource is not properly specified.
Failed to find attribute [project]. The attribute can be set in the following ways: 
- provide the argument `tpu` on the command line with a fully specified name
- provide the argument `--project` on the command line
- set the property `core/project`

It appears GitHub secrets aren't available in forked repos: https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions#using-secrets-in-a-workflow (even if approved). I couldn't find a good workaround, unfortunately; I'm sure there's some Rube Goldberg setup where you have GCP Cloud Run pick up approved requests and call web services, but I expect it would work 4% of the time.

@dlwh (Member) commented Jun 11, 2024

> I couldn't find a good workaround, unfortunately; I'm sure there's some Rube Goldberg setup where you have GCP Cloud Run pick up approved requests and call web services, but I expect it would work 4% of the time.

a very simple solution is for me to add you to the repo :-)

@dlwh (Member) commented Jun 11, 2024

(I don't care enough about pre-commit. we can fix it later)

@dlwh merged commit 87a8b81 into stanford-crfm:main on Jun 11, 2024
3 of 6 checks passed
@dlwh (Member) commented Jun 11, 2024

Thanks so much!
