Notes and examples on setting up a reproducible ML dev environment.
- Ubuntu (22.04)
- NVIDIA driver (535)
- Docker (24.0.6)
- NVIDIA Container Toolkit (1.14.2)
- NVIDIA GPU Cloud (NGC) Container (nvcr.io/nvidia/pytorch:23.09-py3)
sudo apt-get install nvidia-driver-535
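If you are unsure which driver package applies to your GPU, ubuntu-drivers can list the recommended one, and a reboot is typically required before the new driver is loaded,
sudo ubuntu-drivers devices
sudo reboot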
After installing the NVIDIA driver, the nvidia-smi command should show CUDA version 12.2,
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
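For a script-friendly version of the same check, nvidia-smi supports a query mode, e.g.
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv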
https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04
After installing Docker, you should be able to run the hello-world image,
docker run hello-world
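If docker only works with sudo, the tutorial above also covers adding your user to the docker group (log out and back in for it to take effect),
sudo usermod -aG docker $USER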
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
see install-nct.sh
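A minimal sketch of what a script like install-nct.sh might contain, assuming the apt repository has already been configured as described in the install guide above,
# install the toolkit packages
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# register the NVIDIA runtime with docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker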
After installing the NVIDIA Container Toolkit, you should be able to run nvidia-smi from within a docker container,
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
docker run --rm --gpus all ubuntu nvidia-smi
NVIDIA GPU Cloud (NGC) provides many Docker containers,
https://catalog.ngc.nvidia.com/orgs/nvidia/containers
We tested with the nvcr.io/nvidia/pytorch:23.09-py3 container,
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags
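To fetch the image ahead of time,
docker pull nvcr.io/nvidia/pytorch:23.09-py3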
Default base flags for docker run are,
--gpus all
--ipc=host
or --shm-size 1gb
--ulimit memlock=-1
--ulimit stack=67108864
--ipc=host (or a larger --shm-size) gives the container enough shared memory for PyTorch DataLoader workers, and the memlock/stack ulimits follow the NGC container documentation.
An example interactive session that will remove the container on exit is,
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/pytorch:23.09-py3
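Inside the container, a quick sanity check that PyTorch sees the GPU,
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"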
see Dockerfile
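A minimal sketch of such a Dockerfile, assuming project dependencies are listed in a requirements.txt (the file name is illustrative),
# build on the tested NGC image
FROM nvcr.io/nvidia/pytorch:23.09-py3
# /workspace is the default working directory in the NGC PyTorch images
WORKDIR /workspace
# layer project-specific dependencies on top of the NGC stack
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
It can be built and run with the same base flags, e.g. docker build -t ml-dev . and then the docker run invocation above with ml-dev in place of the NGC image name.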
Further reading on performance tuning and quantization with Hugging Face Transformers,
https://huggingface.co/docs/transformers/perf_train_gpu_one
https://huggingface.co/docs/transformers/perf_infer_gpu_one
https://huggingface.co/docs/transformers/perf_infer_cpu
https://huggingface.co/blog/hf-bitsandbytes-integration
https://huggingface.co/blog/4bit-transformers-bitsandbytes
https://huggingface.co/docs/transformers/main_classes/quantization
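As a quick example of the 4-bit loading path those posts describe (the model name is illustrative; requires bitsandbytes and accelerate),
python -c "from transformers import AutoModelForCausalLM; m = AutoModelForCausalLM.from_pretrained('facebook/opt-350m', load_in_4bit=True, device_map='auto'); print(m.get_memory_footprint())"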