Gpu improvements #72
Update on the PyTorch environment. Previously, I was having some trouble with the PyTorch environment defined by the environment file in 4f10904, my last commit. The problem was that, when I built the JupyterLab image using this environment file, pushed it to Docker Hub, added it as an available image in my development JupyterHub, and ran the PyTorch section of this notebook,
For the last few weeks I've been working on fixing this on and off, and I've come to the conclusion that when the Docker image is being built on a machine without a GPU, which I've been doing,
I've verified this is the case by creating the environment from the same environment file from within my hybrid development cluster. In one instance, I built the environment on a CPU node:
In another instance, I built it on a GPU node. It's important to note that in both instances I'm using the same JupyterLab image. If I then do a
It looks like I have to either:
Nevertheless, I think I've identified the problem and now "only" have to work on fixing it. I'll update again when I have something functional. Thanks,
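For comparing environments created on a CPU node versus a GPU node, a check like the following could help show whether the installed PyTorch was built with CUDA support and whether it can see a GPU at runtime. This is only a sketch and not taken from the PR: the function name `describe_torch_build` is hypothetical, and it assumes that `torch.version.cuda` being `None` indicates a build without CUDA support.

```python
import importlib
import importlib.util


def describe_torch_build():
    """Summarize the PyTorch build visible to the current Python environment.

    Hypothetical helper for illustration; not part of this PR.
    """
    # Avoid an ImportError in environments that do not ship PyTorch at all.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "cuda_build": None, "cuda_available": False}

    torch = importlib.import_module("torch")
    return {
        "installed": True,
        # CUDA version the package was compiled against; None when the
        # package was built without CUDA support.
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": bool(torch.cuda.is_available()),
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

Running this inside the same kernel on both node types would separate "the package has no CUDA support" from "the package has CUDA support but no GPU is visible on this node".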
@ana-v-espinoza what error do you get when you run on a CPU-only node?
Ah. I think I got my images mixed up. I believe I was originally using something based on the
An image based on
CC: @julienchastang
Hey Andrea,
This is a WIP, and I will ping again when it's ready for you to take a closer look at this PR.
Here are a variety of improvements to the image used in GPU-enabled JupyterHubs:
When we build our own images (with additional environments) on top of the current image (Image ID: edf1c6ebcc09 above), we occasionally run into the problem of filling more of the root disk on our worker nodes than Kubernetes is comfortable with. This results in Kubernetes evicting Pods and tainting the affected nodes with DiskPressure. The lower storage footprint of this new image allows us to operate within a more comfortable range of disk usage.
The current image is ultimately built on top of an NVIDIA image which expects a GPU to be available. If you're trying to use the current image in a CPU/GPU hybrid cluster scenario (2.21.0 hybrid jetstream_kubespray#29), you would need a separate CPU-only image, which will itself take up more disk space, potentially resulting in the DiskPressure mentioned above. As this new image is based off of jupyter/minimal-notebook, with any GPU-related packages installed in a conda environment, you do not run into this issue and can use the same image on both GPU and CPU-only worker nodes.
The addition of a PyTorch environment. Interestingly, during testing with the PyTorch section of the same notebook as noted above, this minimal PyTorch environment did not yield the desired result, while the non-minimal environment it's based on does. I'll have to take a closer look at why this is.
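To get a rough sense of how close a worker node's root disk is to the DiskPressure condition described above, something like the following could be run on the node itself. This is a sketch under assumptions: the 10% figure mirrors what is, to my understanding, the kubelet's default hard eviction threshold on Linux (`nodefs.available<10%`), which a cluster may override; the names `MIN_FREE_FRACTION`, `free_fraction`, and `near_disk_pressure` are hypothetical; and the kubelet measures the filesystem that backs its own root directory, which may not be `/`.

```python
import shutil

# Assumption: mirrors the kubelet's default hard eviction threshold
# (nodefs.available < 10%). Clusters can configure a different value.
MIN_FREE_FRACTION = 0.10


def free_fraction(path="/"):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def near_disk_pressure(path="/", min_free=MIN_FREE_FRACTION, margin=0.05):
    """Return True if free space is within `margin` of the assumed threshold."""
    return free_fraction(path) < min_free + margin


if __name__ == "__main__":
    print(f"free: {free_fraction('/'):.1%}, near pressure: {near_disk_pressure('/')}")
```

Checking before and after pulling an image gives an approximate measure of how much headroom a smaller image leaves on a given node.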
Let me know if you have any thoughts or questions,
Ana