
GPU improvements #72

Open · wants to merge 2 commits into master

Conversation

ana-v-espinoza
Contributor

CC: @julienchastang

Hey Andrea,

This is a WIP; I'll ping you again when it's ready for a closer look.

Here are a variety of improvements to the image used in GPU enabled JupyterHubs:

  1. A reduction in image size. This new image has been tested in a CPU/GPU hybrid cluster using the TensorFlow section of this notebook:
$ docker image ls | grep zonca
zonca/pytorch-jupyterhub             latest                  2ff2414d6d5e   46 minutes ago   5.96GB
zonca/nvidia-tensorflow-jupyterhub   gpu-improvements        62d85764099a   2 hours ago      9.03GB
zonca/nvidia-tensorflow-jupyterhub   latest                  edf1c6ebcc09   15 months ago    13.2GB

When we build our own images (with additional environments) on top of the current image (Image ID: edf1c6ebcc09 above), we occasionally run into the problem of filling more of the root disk on our worker nodes than Kubernetes is comfortable with. This results in Kubernetes evicting Pods and tainting the affected nodes with DiskPressure. The lower storage footprint of this new image allows us to operate within a more comfortable range of disk usage.

  2. The current image is ultimately built on top of an NVIDIA image which expects a GPU to be available. If you're trying to use the current image in a CPU/GPU hybrid cluster scenario (2.21.0 hybrid jetstream_kubespray#29), you would need a separate CPU-only image, which would itself take up more disk space, potentially resulting in the DiskPressure issue mentioned above. Because this new image is based off of jupyter/minimal-notebook, with any GPU-related packages installed in a conda environment, you avoid this problem and can use the same image on both GPU and CPU-only worker nodes.

  3. The addition of a PyTorch environment. Interestingly, during testing with the PyTorch section of the same notebook noted above, this minimal PyTorch environment did not yield the desired result, while the non-minimal environment it's based on does. I'll have to take a closer look at why that is.

Let me know if you have any thoughts or questions,

Ana

@ana-v-espinoza
Contributor Author

An update on the PyTorch environment.

Previously I was having trouble with the PyTorch environment defined by the environment file in 4f10904, my last commit. When I built the JupyterLab image using this environment file, pushed it to Dockerhub, added it as an available image in my development JupyterHub, and ran the PyTorch section of this notebook, torch.cuda.is_available() returned False, while the expected output is True.
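For reference, the check in question is only a couple of lines. Here is a minimal sketch (the guarded import and the function name are my additions, not from the notebook):

```python
def cuda_status():
    """Report whether PyTorch can see a CUDA device in this environment."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    # True on a GPU node with a CUDA build of pytorch; False for CPU-only builds
    return "cuda available" if torch.cuda.is_available() else "cpu only"

print(cuda_status())
```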

For the last few weeks I've been working on fixing this on and off, and I've concluded that when the Docker image is built on a machine without a GPU, as I've been doing, mamba "knows" this and solves the environment using the CPU build of torch. If I instead build the environment on a GPU-enabled machine, mamba similarly knows this and uses the appropriate torch build compiled with CUDA.

I've verified this is the case by creating the environment from the same environment file from within my hybrid development cluster. In one instance, I built the environment on a CPU node:
mamba env create -n cpu-pytorch -f environment-pytorch.yml

In another instance, I built it on a GPU node. It's important to note that in both instances I'm using the same JupyterLab image.
mamba env create -n gpu-pytorch -f environment-pytorch.yml

If I then do a diff on the resulting environments, you'll see the differing builds of libtorch and pytorch. I assume the additional packages in the gpu-pytorch environment (cudnn, libmagma, etc.) are simply dependencies of the CUDA-enabled builds:

~/jupyterhub-deploy-kubernetes-jetstream/gpu/nvidia-tensorflow-jupyterhub$ diff <(conda list -n cpu-pytorch) <(conda list -n gpu-pytorch)
1c1
< # packages in environment at /home/jovyan/additional-envs/cpu-pytorch:
---
> # packages in environment at /home/jovyan/additional-envs/gpu-pytorch:
23a24
> cudnn                     8.8.0.121            h264754d_4    conda-forge
84a86,87
> libmagma                  2.7.2                h173bb3b_2    conda-forge
> libmagma_sparse           2.7.2                h173bb3b_1    conda-forge
111c114
< libtorch                  2.1.2           cpu_mkl_hcefb67d_101    conda-forge
---
> libtorch                  2.1.2           cuda120_h2aa5df7_301    conda-forge
122a126
> magma                     2.7.2                h51420fd_1    conda-forge
129a134
> nccl                      2.20.3.1             h3a97aeb_0    conda-forge
161c166
< pytorch                   2.1.2           cpu_mkl_py311hc5c8824_101    conda-forge
---
> pytorch                   2.1.2           cuda120_py311h25b6552_301    conda-forge
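A quick way to spot this difference without reading the whole diff is to filter the conda list output for build strings that mention cuda. A small stdlib sketch, with sample lines taken from the listings above:

```python
def cuda_builds(conda_list_output):
    """Return {package: build_string} for packages whose build mentions cuda."""
    result = {}
    for line in conda_list_output.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.split()  # name, version, build, channel
        if len(fields) >= 3 and "cuda" in fields[2]:
            result[fields[0]] = fields[2]
    return result

# Sample lines copied from the diff above
cpu_env = """\
libtorch                  2.1.2           cpu_mkl_hcefb67d_101    conda-forge
pytorch                   2.1.2           cpu_mkl_py311hc5c8824_101    conda-forge
"""
gpu_env = """\
libtorch                  2.1.2           cuda120_h2aa5df7_301    conda-forge
pytorch                   2.1.2           cuda120_py311h25b6552_301    conda-forge
"""

print(cuda_builds(cpu_env))  # {}
print(cuda_builds(gpu_env))  # both libtorch and pytorch show cuda120 builds
```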

It looks like I have to do one of the following:

  1. Trick or tell docker build into thinking it's building for a GPU-enabled machine
  2. Trick or tell mamba into thinking it's solving for a GPU-enabled machine
  3. Explicitly tell mamba to solve the environment using a specific build of the pytorch package, although I would prefer not to "hardcode" anything
  4. Actually build our images on a GPU-enabled machine, bypassing the need for any trickery
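For what it's worth, option 2 may be achievable through conda's virtual packages: the solver's view of the GPU comes from the __cuda virtual package, and conda/mamba support a CONDA_OVERRIDE_CUDA environment variable that forces it. A sketch of wiring that into a build step, in Python for illustration (the CUDA version "12.0" and the function name are assumptions, not something from this PR):

```python
import os
import shlex

def mamba_create_cmd(env_name, env_file, cuda_override="12.0"):
    """Build a mamba command plus an environment dict that sets
    CONDA_OVERRIDE_CUDA, so the solver sees the __cuda virtual package
    and picks CUDA builds even on a CPU-only build machine.
    ("12.0" is an assumed CUDA version, not taken from this PR.)"""
    env = dict(os.environ, CONDA_OVERRIDE_CUDA=cuda_override)
    cmd = ["mamba", "env", "create", "-n", env_name, "-f", env_file]
    return env, cmd

env, cmd = mamba_create_cmd("gpu-pytorch", "environment-pytorch.yml")
print("CONDA_OVERRIDE_CUDA=" + env["CONDA_OVERRIDE_CUDA"], shlex.join(cmd))
```

One would pass env to subprocess.run(cmd, env=env) (or export the variable in a Dockerfile RUN line) so the override is visible to the solver.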

Nevertheless, I think I've identified the problem and now "only" have to work on fixing it. I'll update again when I have something functional.

Thanks,
Ana

@zonca
Owner

zonca commented Feb 21, 2024

@ana-v-espinoza what error do you get when you run on a CPU-only node?
I am testing zonca/nvidia-tensorflow-jupyterhub:23.1.5 on a CPU-only node and the tensorflow example works (slow) but fine https://gist.github.com/zonca/3da7896544da9881fe9081a441964a26.
It complains about there not being a GPU, but it can run without one and does not segfault:

2024-02-21 17:00:14.856937: W tensorflow/stream_executor/platform/default/dso_loader.cc:65] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2024-02-21 17:00:14.856990: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2024-02-21 17:00:14.857010: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (jupyter-zonca): /proc/driver/nvidia/version does not exist
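Those messages are warnings rather than fatal errors. As a quick confirmation from inside the notebook, one can ask TensorFlow directly which devices it sees; this sketch guards the import so it also runs where TensorFlow isn't installed:

```python
def tf_gpu_status():
    """Ask TensorFlow how many GPUs it can see; degrade gracefully without TF."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed"
    # Empty list on a CPU-only node; TF still runs its ops on the CPU
    gpus = tf.config.list_physical_devices("GPU")
    return f"{len(gpus)} GPU(s) visible"

print(tf_gpu_status())
```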

@ana-v-espinoza
Contributor Author

Ah, I think I got my images mixed up. I believe I was originally using something based on the tensorflow/tensorflow image when trying to reduce the image size, and that was what had produced the error about drivers being unavailable.

An image based on nvcr.io/nvidia/tensorflow:22.04-tf2-py3 does run on a CPU node, as you've just observed.
