Merge pull request #772 from javipolo/cloud-providers
Add customizations per cloud provider
rhatdan authored Sep 3, 2024
2 parents 2d03084 + e7e7c96 commit e96a0a9
Showing 69 changed files with 446 additions and 16 deletions.
17 changes: 1 addition & 16 deletions training/README.md
@@ -9,9 +9,6 @@ In order to run accelerated AI workloads, we've prepared [bootc](https://github.
|-----------------|---------------------------------------------------------------------|
| amd | Create bootable container for AMD platform |
| deepspeed | DeepSpeed container for optimizing deep learning |
| cloud-amd | Add cloud-init to bootable container for AMD platform |
| cloud-intel | Add cloud-init to bootable container for Intel platform |
| cloud-nvidia | Add cloud-init to bootable container for Nvidia platform |
| disk-amd | Create disk image from bootable container for AMD platform |
| disk-intel | Create disk image from bootable container for Intel platform |
| disk-nvidia | Create disk image from bootable container for Nvidia platform |
@@ -86,18 +83,6 @@ Of course, the other Makefile variables are still available, so the following is
make nvidia REGISTRY=myregistry.com REGISTRY_ORG=ai-training IMAGE_NAME=nvidia IMAGE_TAG=v1 FROM=registry.redhat.io/rhel9/rhel-bootc:9.4
```

# How to build Cloud ready images

Bootc container images can be installed on physical machines, virtual machines and in the cloud. Often it is useful to add the cloud-init package when running the operating systems in the cloud.

To add cloud-init to your existing bootc container image, executing `make cloud-<platform>` should be enough. For example to build the `cloud-nvidia`, `cloud-amd` and `cloud-intel` bootc containers, respectively:

```
make cloud-nvidia
make cloud-amd
make cloud-intel
```

# How to build disk images
bootc-image-builder produces disk images using a bootable container as input. Disk images can be used to directly provision a host.
The process writes the disk image to `<platform>-bootc/build`.
@@ -110,7 +95,7 @@ make disk-nvidia
```
or
```
make disk-nvidia DISK_TYPE=ami BOOTC_IMAGE=quay.io/ai-lab/nvidia-bootc-cloud:latest
make disk-nvidia DISK_TYPE=ami BOOTC_IMAGE=quay.io/ai-lab/nvidia-bootc-custom:latest
```

In addition to the variables common to all targets, a few extra variables can be defined to customize disk image creation:
8 changes: 8 additions & 0 deletions training/cloud/Containerfile
@@ -0,0 +1,8 @@
ARG BASEIMAGE=quay.io/ai-labs/bootc-nvidia:latest
FROM ${BASEIMAGE}

ARG CLOUD

COPY $CLOUD/cloud-setup.sh /tmp
RUN /tmp/cloud-setup.sh && rm -f /tmp/cloud-setup.sh
COPY $CLOUD/files/ /
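
The Containerfile takes two build arguments: `CLOUD` selects the provider directory to copy from, and `BASEIMAGE` optionally overrides the default base. A minimal sketch of how a direct build command could be composed, assuming `podman` and a working directory of `training/cloud`; the `build_cmd` helper is hypothetical and only prints the command instead of running it:

```shell
# Hypothetical helper: print (not run) the build command for one provider.
build_cmd() {
    cloud="$1"
    baseimage="${2:-}"
    if [ -z "$cloud" ]; then
        echo "usage: build_cmd <cloud> [baseimage]" >&2
        return 1
    fi
    cmd="podman build --build-arg CLOUD=${cloud}"
    # BASEIMAGE is optional; the Containerfile supplies a default when omitted.
    if [ -n "$baseimage" ]; then
        cmd="${cmd} --build-arg BASEIMAGE=${baseimage}"
    fi
    echo "${cmd} --file Containerfile ."
}

build_cmd aws
build_cmd azure quay.io/ai-lab/nvidia-bootc:1.1
```

Printing the command first makes it easy to confirm which provider directory and base image a build would use before spending time on the build itself.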
34 changes: 34 additions & 0 deletions training/cloud/Makefile
@@ -0,0 +1,34 @@
CLOUD ?=
VERSION ?= 1.1
HARDWARE ?= nvidia
REGISTRY ?= quay.io
REGISTRY_ORG ?= ai-lab
IMAGE_NAME ?= bootc-${HARDWARE}-rhel9-${CLOUD}
IMAGE_TAG ?= ${VERSION}
CONTAINER_TOOL ?= podman
CONTAINER_TOOL_EXTRA_ARGS ?=

BOOTC_IMAGE_CLOUD ?= ${REGISTRY}/${REGISTRY_ORG}/${IMAGE_NAME}:${IMAGE_TAG}

default: help

-include $(CLOUD)/Makefile.env

cloud-image: ## Create bootc image for a cloud, using stable RHEL AI as base
	"${CONTAINER_TOOL}" build \
		$(BASEIMAGE:%=--build-arg BASEIMAGE=%) \
		$(CLOUD:%=--build-arg CLOUD=%) \
		${CONTAINER_TOOL_EXTRA_ARGS} \
		--tag ${BOOTC_IMAGE_CLOUD} \
		--file Containerfile \
		.

cloud-disk: ## Create disk image for a cloud, using the image built with cloud-image target
	make -f ../common/Makefile.common bootc-image-builder \
		BOOTC_IMAGE=${BOOTC_IMAGE_CLOUD} \
		DISK_TYPE=${DISK_TYPE} \
		IMAGE_BUILDER_CONFIG=$(abspath $(CLOUD))/config.toml

help: ## Shows this message.
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(shell echo "$(MAKEFILE_LIST) " | tac -s' ') | perl -pe 's/^.*Makefile.*?://g' | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'
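
The `cloud-image` recipe uses Make substitution references such as `$(CLOUD:%=--build-arg CLOUD=%)`. These rewrite each word of the variable, so an empty variable expands to nothing and the flag is dropped entirely. The sketch below shows a rough shell analogue of that behavior for a single value; the `optional_build_arg` function is hypothetical and not part of the Makefile:

```shell
# Shell analogue of the Make pattern substitution; illustration only.
optional_build_arg() {
    # $1 = build-arg name, $2 = value (possibly empty)
    # ${2:+...} expands to the text only when $2 is set and non-empty.
    printf '%s\n' "${2:+--build-arg $1=$2}"
}

optional_build_arg CLOUD aws      # prints: --build-arg CLOUD=aws
optional_build_arg BASEIMAGE ""   # prints an empty line
```

This is why leaving `BASEIMAGE` unset on the `make` command line lets the default declared in the Containerfile take effect.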

47 changes: 47 additions & 0 deletions training/cloud/README.md
@@ -0,0 +1,47 @@
Customizing RHEL AI for the different cloud providers
===

To create images for the different cloud providers, we need to add some extra packages and configuration, and build provider-specific disk images.

Please refer to the official RHEL AI documentation on how to create machine images for different clouds.

# Makefile targets

| Target | Description |
|-----------------|-----------------------------------------------------------------------|
| cloud-image | Create bootc image for a cloud, using stable RHEL AI as base |
| cloud-disk | Create disk image for a cloud, using the image built with cloud-image |

# Makefile variables

| Variable | Description | Default |
|---------------------------|-------------------------------------------------|--------------------------------------------------------------|
| CLOUD | Sets the name of the cloud: aws, gcp, azure, ...| ` ` |
| HARDWARE | Hardware accelerator RHEL AI source image | `nvidia` |
| VERSION | RHEL AI version | `1.1` |
| REGISTRY | Container Registry for storing container images | `quay.io` |
| REGISTRY_ORG | Container Registry organization | `ai-lab` |
| IMAGE_NAME | Container image name | `bootc-${HARDWARE}-rhel9-${CLOUD}` |
| IMAGE_TAG | Container image tag | `${VERSION}` |
| CONTAINER_TOOL | Container tool used for build | `podman` |
| CONTAINER_TOOL_EXTRA_ARGS | Container tool extra arguments | ` ` |
| BASEIMAGE | Source RHEL AI image | `registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:latest` |
| BOOTC_IMAGE_CLOUD | Override cloud image name | `${REGISTRY}/${REGISTRY_ORG}/${IMAGE_NAME}:${IMAGE_TAG}` |
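
As a rough illustration of how these variables combine, the sketch below rebuilds the default `BOOTC_IMAGE_CLOUD` value in shell, following the defaults declared in `training/cloud/Makefile` in this commit. The `bootc_image_cloud` function is hypothetical; the Makefile itself does the composition.

```shell
# Mirrors the Makefile defaults; variables already set in the shell override them.
bootc_image_cloud() {
    hardware="${HARDWARE:-nvidia}"
    version="${VERSION:-1.1}"
    registry="${REGISTRY:-quay.io}"
    org="${REGISTRY_ORG:-ai-lab}"
    name="${IMAGE_NAME:-bootc-${hardware}-rhel9-${CLOUD:-}}"
    tag="${IMAGE_TAG:-${version}}"
    echo "${registry}/${org}/${name}:${tag}"
}

# Run in a subshell so the assignment does not leak into the caller.
( CLOUD=aws; bootc_image_cloud )   # prints: quay.io/ai-lab/bootc-nvidia-rhel9-aws:1.1
```

Note that `CLOUD` has no default, so leaving it unset produces an image name with a trailing hyphen.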


# Example on how to build your own AI Bootc disk image

Simply execute `make cloud-image CLOUD=<cloud_provider> BASEIMAGE=<rhel_ai_base_image>`. For example:

* `make cloud-image CLOUD=azure BASEIMAGE=quay.io/ai-lab/nvidia-bootc:1.1`
* `make cloud-image CLOUD=gcp BASEIMAGE=quay.io/ai-lab/nvidia-bootc:1.1`

Once you have the bootc image, you can use it to create a disk image.
Simply execute `make cloud-disk CLOUD=<cloud_provider>`. For example:

* `make cloud-disk CLOUD=azure`
* `make cloud-disk CLOUD=gcp`


This will produce an image in the `build/output` directory.
Then, you can follow the RHEL AI documentation on how to create a machine image in your cloud provider.
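
The two steps above can be chained in a small wrapper. This is a minimal sketch, assuming it is run from `training/cloud`; the `build_cloud` function and the `DRY_RUN` variable are hypothetical names introduced for illustration. By default it only prints the commands.

```shell
# Hypothetical wrapper around the two targets; run it from training/cloud.
# DRY_RUN defaults to 1, so the commands are only printed unless DRY_RUN=0.
build_cloud() {
    cloud="$1"
    baseimage="$2"
    if [ -z "$cloud" ] || [ -z "$baseimage" ]; then
        echo "usage: build_cloud <cloud_provider> <rhel_ai_base_image>" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "0" ]; then
        run=""
    else
        run="echo"
    fi
    # The disk step only runs if the image step succeeded.
    $run make cloud-image CLOUD="$cloud" BASEIMAGE="$baseimage" &&
        $run make cloud-disk CLOUD="$cloud"
}

build_cloud azure quay.io/ai-lab/nvidia-bootc:1.1
```

Setting `DRY_RUN=0` in the shell before calling `build_cloud` would execute the `make` targets for real.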
1 change: 1 addition & 0 deletions training/cloud/aws/Makefile.env
@@ -0,0 +1 @@
DISK_TYPE=ami
34 changes: 34 additions & 0 deletions training/cloud/aws/README.md
@@ -0,0 +1,34 @@
# Amazon Web Services modifications for RHEL AI
These changes mimic, as closely as possible, the [changes on RHEL for AWS](https://github.com/osbuild/images/blob/main/pkg/distro/rhel/rhel9/ami.go).

## Changes

- Extra kernel parameters

```
console=ttyS0,115200n8 net.ifnames=0 nvme_core.io_timeout=4294967295
```

- Timezone: UTC
- Chrony configuration:
  - Change server
  - LeapsecTz
- Locale: en_US.UTF-8
- Keymap: us
- X11 layout: us

- Getty configuration
  - NAutoVTs false

- Cloud init default user: `ec2-user`

- Packages
  - @core metapackage
  - authselect-compat
  - langpacks-en
  - tuned

- Services
  - nm-cloud-setup.service
  - nm-cloud-setup.timer
  - tuned
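
One way to confirm that the extra kernel parameters listed above reached a booted instance is to compare them against the kernel command line. The following is a minimal sketch; the `missing_kargs` function and the sample command line are hypothetical, and on a real instance the first argument would typically be the contents of `/proc/cmdline`.

```shell
# Print every expected argument that is absent from the given command line.
missing_kargs() {
    cmdline="$1"
    shift
    for want in "$@"; do
        # Pad with spaces so only whole arguments match.
        case " $cmdline " in
            *" $want "*) ;;
            *) echo "$want" ;;
        esac
    done
}

# Sample input; on an instance, pass "$(cat /proc/cmdline)" instead.
missing_kargs "root=UUID=example console=ttyS0,115200n8 net.ifnames=0" \
    console=ttyS0,115200n8 net.ifnames=0 nvme_core.io_timeout=4294967295
# prints: nvme_core.io_timeout=4294967295
```

An empty result means every expected parameter is present.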
14 changes: 14 additions & 0 deletions training/cloud/aws/cloud-setup.sh
@@ -0,0 +1,14 @@
#!/bin/bash

set -o errexit

dnf install -y --nobest \
    cloud-init \
    langpacks-en \
    tuned

# Chrony configuration
sed -i \
    -e '/^pool /c\server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4' \
    -e '/^leapsectz /d' \
    /etc/chrony.conf
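
The effect of the `sed` expressions in this script can be checked on a scratch file before building an image. The sketch below assumes GNU `sed` (for `-i` and the one-line `c\` form); the sample file content is a hypothetical stand-in for a stock `/etc/chrony.conf`, and `demo_chrony_edit` is an illustrative name.

```shell
demo_chrony_edit() {
    tmp="$(mktemp)"
    # Hypothetical stand-in for a stock /etc/chrony.conf.
    printf '%s\n' \
        'pool 2.rhel.pool.ntp.org iburst' \
        'driftfile /var/lib/chrony/drift' \
        'leapsectz right/UTC' > "$tmp"
    # Same expressions as the setup script, applied to the scratch file.
    sed -i \
        -e '/^pool /c\server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4' \
        -e '/^leapsectz /d' \
        "$tmp"
    cat "$tmp"
    rm -f "$tmp"
}

demo_chrony_edit
```

With this sample input, the `pool` line is replaced by the `server 169.254.169.123 ...` line, the `leapsectz` line is removed, and the `driftfile` line is left untouched.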
4 changes: 4 additions & 0 deletions training/cloud/aws/config.toml
@@ -0,0 +1,4 @@
[customizations.kernel]
name = "customizations-for-aws"
append = "console=ttyS0,115200n8 net.ifnames=0 nvme_core.io_timeout=4294967295"

6 changes: 6 additions & 0 deletions training/cloud/aws/files/etc/X11/xorg.conf.d/00-keyboard.conf
@@ -0,0 +1,6 @@
# Do not edit manually, use localectl(1).
Section "InputClass"
        Identifier "system-keyboard"
        MatchIsKeyboard "on"
        Option "XkbLayout" "us"
EndSection
@@ -0,0 +1,3 @@
system_info:
  default_user:
    name: ec2-user
1 change: 1 addition & 0 deletions training/cloud/aws/files/etc/locale.conf
@@ -0,0 +1 @@
LANG=en_US.UTF-8
1 change: 1 addition & 0 deletions training/cloud/aws/files/etc/localtime
@@ -0,0 +1,7 @@
# In some cloud-init enabled images the sshd-keygen template service may race
# with cloud-init during boot causing issues with host key generation. This
# drop-in config adds a condition to sshd-keygen@.service if it exists and
# prevents the sshd-keygen units from running *if* cloud-init is going to run.
#
[Unit]
ConditionPathExists=!/run/systemd/generator.early/multi-user.target.wants/cloud-init.target
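
In systemd, a `!` prefix on `ConditionPathExists=` negates the check, so the unit is started only when the path is absent. The sketch below emulates that logic in shell for illustration; systemd evaluates the real condition itself, and the `condition_path_not_exists` function is a hypothetical name.

```shell
# Emulates the negated path check only; illustration, not a systemd interface.
condition_path_not_exists() {
    [ ! -e "$1" ]
}

marker="/run/systemd/generator.early/multi-user.target.wants/cloud-init.target"
if condition_path_not_exists "$marker"; then
    echo "sshd-keygen units would run"
else
    echo "sshd-keygen units would be skipped"
fi
```

On a host where cloud-init is enabled for the current boot, the marker exists and the key generation is left to cloud-init.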
1 change: 1 addition & 0 deletions training/cloud/aws/files/etc/vconsole.conf
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
KEYMAP=us
@@ -0,0 +1,2 @@
[install]
kargs = ["console=tty0", "console=ttyS0,115200n8", "net.ifnames=0", "nvme_core.io_timeout=4294967295"]
@@ -0,0 +1,3 @@
[Login]
NAutoVTs=0

@@ -0,0 +1,2 @@
[Service]
Environment="NM_CLOUD_SETUP_EC2=yes"
1 change: 1 addition & 0 deletions training/cloud/azure/Makefile.env
@@ -0,0 +1 @@
DISK_TYPE=raw
60 changes: 60 additions & 0 deletions training/cloud/azure/README.md
@@ -0,0 +1,60 @@
# Azure for RHEL AI
These changes mimic, as closely as possible, the [changes on RHEL for Azure](https://github.com/osbuild/images/blob/main/pkg/distro/rhel/rhel9/azure.go).

# Summary
- Extra kernel parameters

The [kernel parameters on RHEL for Azure](https://github.com/osbuild/images/blob/a4ae81dc3eed3e86c359635e3135fc8a07f411dd/pkg/distro/rhel/rhel9/azure.go#L454) differ from the ones seen on a running RHEL instance in Azure. We take the parameters from the running instance as our reference:
```
loglevel=3 console=tty1 console=ttyS0,115200n8 earlyprintk=ttyS0,115200 net.ifnames=0 cloud-init=disabled
```

Note that we also disable cloud-init via the `cloud-init=disabled` kernel parameter.

- Timezone: UTC
- Locale: en_US.UTF-8
- Keymap: us
- X11 layout: us

- sshd config
  - ClientAliveInterval: 180

- Packages
  - hyperv-daemons
  - langpacks-en
  - NetworkManager-cloud-setup
  - nvme-cli
  - patch
  - rng-tools
  - uuid
  - WALinuxAgent

- Services
  - nm-cloud-setup.service
  - nm-cloud-setup.timer
  - waagent

- Systemd
  - nm-cloud-setup.service: `Environment=NM_CLOUD_SETUP_AZURE=yes`

- Kernel Modules
  - blacklist amdgpu
  - blacklist intel_cstate
  - blacklist floppy
  - blacklist nouveau
  - blacklist lbm-nouveau
  - blacklist skx_edac

- Cloud Init
  - 10-azure-kvp.cfg
  - 91-azure_datasource.cfg

- PwQuality
  - /etc/security/pwquality.conf

- WaAgentConfig
  - RDFormat false
  - RDEnableSwap false

- udev rules
  - /etc/udev/rules.d/68-azure-sriov-nm-unmanaged.rules
36 changes: 36 additions & 0 deletions training/cloud/azure/cloud-setup.sh
@@ -0,0 +1,36 @@
#!/bin/bash

set -o errexit

dnf install -y --nobest \
    cloud-init \
    hyperv-daemons \
    langpacks-en \
    NetworkManager-cloud-setup \
    nvme-cli \
    patch \
    rng-tools \
    uuid \
    WALinuxAgent

# sshd configuration
cat << EOF >> /etc/ssh/sshd_config
ClientAliveInterval 180
EOF

# pwquality configuration
cat << EOF >> /etc/security/pwquality.conf
minlen = 6
dcredit = 0
ucredit = 0
lcredit = 0
ocredit = 0
minclass = 3
EOF

# WAAgent configuration
sed -i \
    -e '/^ResourceDisk.Format=y/c\ResourceDisk.Format=n' \
    -e '/^ResourceDisk.EnableSwap=y/c\ResourceDisk.EnableSwap=n' \
    -e '/^Provisioning.RegenerateSshHostKeyPair=y/c\Provisioning.RegenerateSshHostKeyPair=n' \
    /etc/waagent.conf
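
The WAAgent edits only touch lines that are currently set to `y`; anything already set to `n` is left as it is. A minimal sketch that demonstrates this on a scratch file, assuming GNU `sed`; the sample content is a hypothetical excerpt of `/etc/waagent.conf`, and `demo_waagent_edit` is an illustrative name.

```shell
demo_waagent_edit() {
    tmp="$(mktemp)"
    # Hypothetical excerpt; a real waagent.conf contains many more keys.
    printf '%s\n' \
        'ResourceDisk.Format=y' \
        'ResourceDisk.EnableSwap=n' \
        'Provisioning.RegenerateSshHostKeyPair=y' > "$tmp"
    # Same expressions as the setup script, applied to the scratch file.
    sed -i \
        -e '/^ResourceDisk.Format=y/c\ResourceDisk.Format=n' \
        -e '/^ResourceDisk.EnableSwap=y/c\ResourceDisk.EnableSwap=n' \
        -e '/^Provisioning.RegenerateSshHostKeyPair=y/c\Provisioning.RegenerateSshHostKeyPair=n' \
        "$tmp"
    cat "$tmp"
    rm -f "$tmp"
}

demo_waagent_edit
```

With this sample input, all three keys end up set to `n`.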
6 changes: 6 additions & 0 deletions training/cloud/azure/config.toml
@@ -0,0 +1,6 @@
[customizations.kernel]
name = "customizations-for-azure"
# This is suggested by https://github.com/osbuild/images/blob/a4ae81dc3eed3e86c359635e3135fc8a07f411dd/pkg/distro/rhel/rhel9/azure.go#L454
# append = "ro loglevel=3 console=tty1 console=ttyS0 earlyprintk=ttyS0 rootdelay=300"
# However, starting a RHEL instance in azure shows this one, and I'll be using it
append = "loglevel=3 console=tty1 console=ttyS0,115200n8 earlyprintk=ttyS0,115200 net.ifnames=0"
@@ -0,0 +1,6 @@
# Do not edit manually, use localectl(1).
Section "InputClass"
        Identifier "system-keyboard"
        MatchIsKeyboard "on"
        Option "XkbLayout" "us"
EndSection
@@ -0,0 +1,6 @@
# This configuration file is used to enable logging to Hyper-V kvp
reporting:
  logging:
    type: log
  telemetry:
    type: hyperv
@@ -0,0 +1,4 @@
datasource_list: [ Azure ]
datasource:
  Azure:
    apply_network_config: False
1 change: 1 addition & 0 deletions training/cloud/azure/files/etc/locale.conf
@@ -0,0 +1 @@
LANG=en_US.UTF-8
1 change: 1 addition & 0 deletions training/cloud/azure/files/etc/localtime
@@ -0,0 +1 @@
blacklist amdgpu
@@ -0,0 +1 @@
blacklist floppy
@@ -0,0 +1 @@
blacklist intel_cstate
@@ -0,0 +1,2 @@
blacklist nouveau
blacklist lbm-nouveau
@@ -0,0 +1 @@
blacklist skx_edac
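
The blacklist entries above use the `blacklist <module>` syntax of modprobe configuration files; their file paths are not shown on this page. A minimal sketch for listing which modules a set of such files blacklists, reading the text from standard input; the `blacklisted_modules` function is a hypothetical helper.

```shell
# Reads modprobe.d-style text on stdin and prints the blacklisted module names.
blacklisted_modules() {
    awk '$1 == "blacklist" { print $2 }'
}

# Sample input mirroring two of the entries above, plus a comment line.
printf '%s\n' 'blacklist nouveau' 'blacklist lbm-nouveau' '# a comment' |
    blacklisted_modules
```

On a built image, the same function could be fed the concatenated drop-in files to compare against the list in the Azure README.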