Containerization has long been of interest to the Arm community. Today, Arm64 CPUs are ideal for container-based workloads.
The NVIDIA Container Toolkit allows users to build and run GPU-accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs. Follow this installation guide to get started: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
As an example, here are the steps for installing the NVIDIA Container Toolkit on Ubuntu 20.04 with Docker. Note that other container frameworks, such as Podman, are also supported.
# Install Docker dependencies
sudo apt-get install ca-certificates curl gnupg lsb-release
# Add Docker's official GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
# Add Docker repo
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
# Enable docker service
sudo systemctl --now enable docker
# Add the NVIDIA Container Toolkit repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install NVIDIA Container Toolkit
sudo apt update
sudo apt install nvidia-docker2
# Restart docker services to enable GPU support
sudo systemctl restart docker
# Run a simple test using the CUDA multi-arch container
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu18.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.0.3-base-ubuntu18.04' locally
11.0.3-base-ubuntu18.04: Pulling from nvidia/cuda
e196da37f904: Pull complete
0b7ba59c359b: Pull complete
84bc5f8689bc: Pull complete
b926124172ef: Pull complete
fef6c6f16e98: Pull complete
Digest: sha256:f7b595695b06ad8590aed1accd6437ba068ca44e71c5cf9c11c8cb799c2d8335
Status: Downloaded newer image for nvidia/cuda:11.0.3-base-ubuntu18.04
Thu Jul 7 17:57:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 0000000C:01:00.0 Off | 0 |
| N/A 44C P0 65W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 0000000D:01:00.0 Off | 0 |
| N/A 36C P0 63W / 300W | 0MiB / 81920MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The first step in adopting Arm64 systems as container hosts is to ensure that all production software dependencies support the Arm64 architecture, since images built for an x86_64 host cannot run on an Arm64 host, and vice versa.
Most of the container ecosystem supports both architectures, and often does so transparently through multiple-architecture (multi-arch) images, where the correct image for the host architecture is deployed automatically.
The major container image registries, including Docker Hub, Quay, and Amazon Elastic Container Registry (ECR), all support multi-arch images.
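As a quick check, you can inspect an image's manifest to see which architectures it provides. The example below uses the same multi-arch CUDA image pulled earlier; any multi-arch image works, and docker buildx imagetools inspect gives equivalent output if the manifest subcommand is unavailable in your Docker version.
# List the platforms published for a multi-arch image; the output contains one
# manifest entry per architecture (e.g. linux/amd64 and linux/arm64), and the
# container runtime pulls the entry matching the host automatically
docker manifest inspect nvidia/cuda:11.0.3-base-ubuntu18.04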
While most images already support multi-arch (i.e. arm64 and x86_64/amd64), we describe a couple of ways for developers to create a multi-arch image if needed.
- Docker Buildx
- Using a CI/CD build pipeline such as AWS CodePipeline to coordinate native builds and manifest generation.
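As a minimal sketch of the first option, a multi-arch build with Docker Buildx looks like the following; the builder name multiarch-builder and the image name myrepo/myapp:latest are placeholders for your own values.
# Create and select a builder instance capable of multi-platform builds
docker buildx create --name multiarch-builder --use
# Build the image for both architectures and push the resulting multi-arch manifest
docker buildx build --platform linux/amd64,linux/arm64 -t myrepo/myapp:latest --push .
Note that building for a non-native platform with Buildx typically relies on QEMU emulation, which is why coordinating native builds in a CI/CD pipeline can be faster for large images.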
Most container orchestration platforms support both arm64 and x86_64 hosts. As an example, here is a non-exhaustive list of popular software within the container ecosystem that explicitly supports Arm64.
If your software isn't listed above, it doesn't mean it won't work!
Many products work on arm64 but don't explicitly distribute arm64 binaries or build multi-arch images (yet). NVIDIA, AWS, Arm, and many developers in the community are working with maintainers and contributing expertise and code to enable full binary or multi-arch support.
Kubernetes fully supports Arm64.
If all of your containerized workloads support Arm64, then you can run your cluster with Arm64 nodes exclusively. However, if you have some workloads that can only run on x86, or if you just want to be able to run both x86 and Arm64 nodes in the same cluster, then there are a couple of ways to accomplish that:
- Multi-arch images: If you are able to use multi-arch images (see above) for all containers in your cluster, then you can simply run a mix of x86 and Arm64 nodes without any further action. The multi-arch image manifest ensures that the correct image layers are pulled for a given node's architecture.
- Built-in labels: You can schedule pods on nodes according to the kubernetes.io/arch label. Kubernetes adds this label to nodes automatically, so you can place pods on the desired architecture with a node selector like this:
  nodeSelector:
    kubernetes.io/arch: amd64
- Using taints: Taints are especially helpful when adding Arm64 nodes to an existing cluster with mostly x86-only containers. While using the built-in kubernetes.io/arch label requires you to explicitly use a node selector to place x86-only containers on the right instances, tainting Arm64 instances prevents Kubernetes from scheduling incompatible containers on them without requiring you to change any existing configuration. For example, you can do this with a managed node group using eksctl by adding --kubelet-extra-args '--register-with-taints=arm=true:NoSchedule' to the kubelet startup arguments as documented here. Note that if you only taint Arm64 instances and don't specify any node selectors, then you will need to ensure that the images you build for Arm64 are multi-arch images that can also run on x86 instance types. Alternatively, you can build Arm64-only images and ensure that they are only scheduled onto Arm64 nodes using node selectors.
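For illustration, here is a minimal sketch of the taint approach applied manually with kubectl; the node name my-arm64-node and the taint key arm are placeholders, and any pod that should run on the tainted node must carry a matching toleration in its spec.
# Taint an Arm64 node so that only pods tolerating the taint are scheduled onto it
kubectl taint nodes my-arm64-node arm=true:NoSchedule
# Confirm the node's architecture label and taint
kubectl get node my-arm64-node -L kubernetes.io/arch
kubectl describe node my-arm64-node | grep Taints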