Setup Docker for TPU execution and update infra scripts. #601
This is amazing and wonderful. Thank you
Mostly just making sure I understand: the new workflow is:
- I hack on my own fork of levanter.
- Then, when I'm ready, I just run the new launch command
- It will automatically spin up a tpu
- build a docker image from my current checkout
- push the image
- deploy and run the image on the tpu machine
One thing that I think might be lost (unless I'm misreading) is that ATM we log the git commit to wandb. Is that right?
Yes, mostly: I don't spin up the new TPU yet; you still use […]. I think having the combined workflow is a really good idea though, and I can add that in a follow-up PR.
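To make the workflow discussed above concrete, here is a minimal sketch of the build, push, and run steps as a dry-run plan. It is not the PR's launch script: `plan_launch`, the image name, the TPU name, and the bare `docker run --rm` invocation are all hypothetical, and the real script may pass extra flags or use `sudo`. It only assembles the commands; it does not create the TPU, matching the reply above.

```python
import shlex


def plan_launch(image: str, tpu_name: str, zone: str, context_dir: str = ".") -> list[list[str]]:
    """Return the commands for a build -> push -> run deployment without executing them.

    All names here are illustrative assumptions, not the PR's actual interface.
    """
    # Command executed on the TPU VM; docker pulls the image if it is not present locally.
    remote = shlex.join(["docker", "run", "--rm", image])
    return [
        # 1. Build an image from the current checkout.
        ["docker", "build", "-t", image, context_dir],
        # 2. Push it to the registry named in the tag.
        ["docker", "push", image],
        # 3. Run it on every worker of an already-created TPU VM.
        [
            "gcloud", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
            f"--zone={zone}", "--worker=all", f"--command={remote}",
        ],
    ]


if __name__ == "__main__":
    # Hypothetical image and TPU names, printed for inspection only.
    for cmd in plan_launch("example-registry/levanter-dev:demo", "my-tpu", "us-central2-b"):
        print(shlex.join(cmd))
```

Returning the commands as lists keeps the plan easy to inspect or log before handing each one to `subprocess.run`.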
I cannot for the life of me figure out the review process on Github: I can only find the "Submit Review" button from inside vscode so here goes...
It's a bit weird. You go to Files Changed on the PR page then add review comments, then there's a green button at the top right called Review Changes.
I'll update the docs and validate testing works and then give you a ping.
Okay, I think I caught all the docs etc. I updated the .github test integration to use […]. I've verified it runs correctly (there are a few test failures, but they aren't my fault 😁) with the following command manually:
We can of course tweak […]. PTAL.
still trying it out, but mostly looks good to me! just thinking through the migration path and the first time launch experience.
Running directly off your branch I get
Doh, that's probably something I just screwed up: I added a timestamp to uniquely identify the tagged image, and neglected to update that command. So you can understand my pain: Docker has a "latest" tag which works mostly as you expect, except when you're using remote images.
it works correctly the first time, but if you update the package on
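As an illustration of the timestamp tagging mentioned above, here is a small sketch of one way to derive a unique image tag per build. The helper name `timestamped_tag` and the `YYYYMMDD-HHMMSS` format are assumptions; the thread does not show the format the PR actually uses. A tag that has never existed on the remote host cannot be satisfied from its local cache, so the new image has to be pulled.

```python
from datetime import datetime, timezone
from typing import Optional


def timestamped_tag(image: str, now: Optional[datetime] = None) -> str:
    """Return `image:<UTC timestamp>`, using only characters valid in a Docker tag.

    Hypothetical helper; the tag format is an assumption for illustration.
    """
    now = now or datetime.now(timezone.utc)
    return f"{image}:{now.strftime('%Y%m%d-%H%M%S')}"


if __name__ == "__main__":
    # Every command that refers to the image (build, push, run) must reuse this exact string.
    print(timestamped_tag("levanter-dev"))
```

Computing the tag once and threading it through every later command avoids the kind of mismatch described above, where one command was not updated to the new tag.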
Okay, it's faster with the incremental build (and a little more predictable).
vs building the base image:
In both cases, the build is cached, so you should only pay this once. Further deployments are almost instantaneous, e.g. to touch a single file I get:
Ah yes, Docker is very annoying about things like that. :(
ok amazing. I'll make sure I can get through this workflow (and push a base image to ghcr) but otherwise lgtm!
I had to run
But now I get
Crap, that was me: I was trying to factor out the push setup to test out pushing to Docker Hub and removed the GCP setup from the […]. I'll also try testing with a different repository name to see if that breaks anything.
8f2689e to 726d0ab
I went ahead and added a few cleanups:
I found this setup nice for dependencies. For instance, I had a few changes to Haliax, so I just copied […]. Let me know if you see any other things you'd like addressed!
5906255 to be1326f
I tried to optimize the Docker image size a bit using a staged build, as Ray currently requires a source build of Meson, which requires a Clang installation. Even with this, jax and libtpu are each >250MB installs, so there's no avoiding a large image size at the moment. Still, with this configuration, a v5-32 (the most I could get given GCP's stingy IP address allocation) takes about 50 seconds to run setup-vm.sh and pull the initial image. After the initial pull, new deployments take a few seconds to package up the current source directory. It's still possible to use the `git clone` approach via a volume mount, but the permissions are a bit finicky at that point, and I'm not sure how many options we want to have.
sorry getting back to this, some thoughts:
lol now i'm getting this completely useless error
repr helped me figure it out, lol: Running: 'gcloud' 'alpha' 'compute' 'tpus' 'tpu-vm' 'create' 'test-spin-up-1' '--accelerator-type=v4-8' '--version=tpu-ubuntu2204-base' '--zone=us-central2-b' '' '--quiet'
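The command printed above contains an empty `''` argument just before `--quiet`, which is presumably what the CLI rejected (the thread does not state the root cause or show the patch, so treat that as an assumption). A minimal sketch of guarding against it: drop empty strings when assembling the argument list, and log each argument with `repr` so a stray empty one is visible. `build_command` and `describe` are hypothetical helper names.

```python
def build_command(*parts: str) -> list[str]:
    """Drop empty strings so unset optional flags never reach the CLI as '' arguments."""
    return [p for p in parts if p]


def describe(cmd: list[str]) -> str:
    """Render each argument with repr so empty or whitespace-only arguments stand out in logs."""
    return "Running: " + " ".join(repr(p) for p in cmd)


if __name__ == "__main__":
    optional_flag = ""  # stands in for an optional flag that was left unset
    cmd = build_command(
        "gcloud", "alpha", "compute", "tpus", "tpu-vm", "create", "test-spin-up-1",
        "--accelerator-type=v4-8", "--version=tpu-ubuntu2204-base",
        "--zone=us-central2-b", optional_flag, "--quiet",
    )
    print(describe(cmd))
```

Passing the list straight to `subprocess.run(cmd)` (no shell) keeps the arguments exactly as logged.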
No worries, I've been super busy with other obligations the past week so it's honestly good timing :).
It should already do this... (oh, just saw your patch..)
lgtm, thanks!
Okay, cleaned everything up. I think I can't submit because the TPU tests are failing for an unrelated reason so you may need to merge for me:
It appears GitHub secrets aren't available in forked repos, even if approved: https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions#using-secrets-in-a-workflow. I couldn't find a good workaround, unfortunately; I'm sure there's some Rube Goldberg machine setup where you can have GCP Cloud Run pick up approved requests and call web services, but I expect it would work 4% of the time.
a very simple solution is for me to add you to the repo :-)
(I don't care enough about pre-commit. we can fix it later)
Thanks so much!
(Still needs some cleanup & doc updates but sending for thoughts.
A few notes:
[…] `git clone` with this via a volume mount, but I think this should work as-is for most use cases.)