Automated installation of a GH200 system with Ubuntu 22.04 from a USB drive
Much of this comes from the official Nvidia Ubuntu 22.04 Grace Installation Guide. If you run into problems or want to install manually, see that guide for more details. The Grace Performance Tuning Guide may also be helpful.
This repo was created based on testing of the GH200 system (specifically the Supermicro ARS-111GL-NHR). The systems I'm testing on also have a single BlueField-3 installed (two BF-3 cards have also been tested), but this should not matter for the installation.
The contained scripts will perform the following actions:
- Create a new user (as defined by user)
- Make linux-nvidia-64k-hwe-22.04 the default kernel
- Install the Nvidia MLNX-OFED repo
- Install the Nvidia CUDA SBSA repo
- Update/Upgrade all packages
- Install the first-boot service, which runs the following steps on the first boot:
  (NOTE: Nvidia requires MLNX-OFED to be installed BEFORE the GPU drivers for NCCL to work correctly)
- Install mlnx-fw-updater mlnx-ofed-all
- Install cuda-drivers and nvidia-kernel-open drivers
- Install cuda-toolkit-12-4 nvidia-container-toolkit
- Update and enable the nvidia-persistenced service to work around a bug
- Disable irqbalance service (recommended by performance tuning guide)
- Disable NUMA balancing (recommended by performance tuning guide)
- Automatically load the nvidia-peermem kernel module (required for NCCL and GPUDirect)
- Disable the first-boot service
- Reboot the system
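The tuning steps in the list above (irqbalance, NUMA balancing, nvidia-peermem) could be made persistent roughly as follows. This is a minimal sketch and not the repo's actual first-boot.sh: the function names, the ROOT prefix variable, and the two config file names are assumptions chosen for illustration.

```shell
#!/bin/sh
# Minimal sketch of the tuning steps; NOT the repo's actual first-boot.sh.
# ROOT is a hypothetical prefix so the functions can be tried against a
# scratch directory; leave it empty to target the real filesystem (as root).
ROOT="${ROOT:-}"

# Load nvidia-peermem on every boot via systemd-modules-load.
# The file name nvidia-peermem.conf is an assumption.
enable_peermem_autoload() {
    mkdir -p "${ROOT}/etc/modules-load.d"
    echo "nvidia-peermem" > "${ROOT}/etc/modules-load.d/nvidia-peermem.conf"
}

# Keep automatic NUMA balancing off across reboots.
# The file name 99-numa-balancing.conf is an assumption.
disable_numa_balancing() {
    mkdir -p "${ROOT}/etc/sysctl.d"
    echo "kernel.numa_balancing = 0" > "${ROOT}/etc/sysctl.d/99-numa-balancing.conf"
}

# Stop irqbalance now and keep it from starting at boot (needs root).
disable_irqbalance() {
    systemctl disable --now irqbalance
}
```

Writing drop-in files rather than editing existing ones keeps the changes easy to find and to revert; the sysctl setting takes effect at the next boot, or immediately after running sysctl --system.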
There are multiple ways to create an installable Ubuntu 22.04 USB drive. I used Rufus on a Windows-11 system.
- Download Rufus Portable and the latest Ubuntu-22.04.4 ISO.
- Select USB drive.
- Install as ISO (ensuring you can modify the files on the USB drive afterwards)
Before copying over the files, customize the cidata/user-data file with your installation details by replacing the following items:
Item | Description |
---|---|
<HOSTNAME> | Hostname of the system |
<PASSWORD> | SHA-512 password hash (generate with openssl passwd -6) |
<USERNAME> | Initial User name |
<ADDRESS-CIDR> | Address of network port in CIDR form (e.g. 10.1.1.10/24) |
<GATEWAY-ADDRESS> | Address of network gateway |
<NAMESERVER-N> | Address of DNS Nameservers |
<SEARCH-DOMAIN> | DNS Search domain (e.g. my-domain.com) |
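The single-valued placeholders in the table can also be filled in from the command line instead of by hand. The sketch below assumes the placeholders appear literally in cidata/user-data as written in the table; fill_user_data is a hypothetical helper, not part of this repo, and it leaves <NAMESERVER-N> and <SEARCH-DOMAIN> for manual editing.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of the user-data template with the
# single-valued placeholders replaced. The template itself is not modified.
# Arguments: template hostname username password-hash address/CIDR gateway
fill_user_data() {
    local template="$1"
    local host="$2"
    local user="$3"
    local hash="$4"
    local cidr="$5"
    local gateway="$6"
    # '|' is the sed delimiter because the hash and the CIDR contain '/'.
    sed -e "s|<HOSTNAME>|${host}|g" \
        -e "s|<USERNAME>|${user}|g" \
        -e "s|<PASSWORD>|${hash}|g" \
        -e "s|<ADDRESS-CIDR>|${cidr}|g" \
        -e "s|<GATEWAY-ADDRESS>|${gateway}|g" \
        "${template}"
}

# Example (the values are placeholders for illustration):
#   hash="$(openssl passwd -6)"    # prompts for the password
#   fill_user_data cidata/user-data gh200-01 admin "$hash" \
#       10.1.1.10/24 10.1.1.1 > user-data.filled
```

Review the generated file before copying it to the USB drive under the name user-data, since the nameserver and search-domain entries still need to be set.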
After creating a bootable Ubuntu installation drive, copy the files from cidata to the USB drive:
- Create directory cidata in the root of the Ubuntu USB drive
- Copy all files over to the cidata directory on the Ubuntu USB drive
  - user-data : Ubuntu Autoinstall file
  - meta-data : Ubuntu Meta-data files
  - first-boot.service : One-shot service to launch the first-boot.sh script
  - first-boot.sh : Script that will run on first boot
- Update the boot/grub/grub.cfg file and add the following menuentry to the list:
menuentry "Install GH200 System (Requires Internet)" {
set gfxpayload=keep
linux /casper/vmlinuz quiet autoinstall 'ds=nocloud-net;s=file:///cdrom/cidata/'
initrd /casper/initrd
}
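If you prepare several USB drives, adding the entry can be scripted. The following is a rough sketch that appends the menuentry shown above to a grub.cfg only when it is not already present; add_gh200_menuentry is a hypothetical helper, and the path to grub.cfg on the mounted USB drive is whatever your system assigns.

```shell
#!/bin/sh
# Hypothetical helper: append the GH200 autoinstall menuentry to the given
# grub.cfg unless an entry with the same title already exists.
add_gh200_menuentry() {
    local grub_cfg="$1"
    if grep -qF 'menuentry "Install GH200 System (Requires Internet)"' "${grub_cfg}"; then
        return 0    # already present, nothing to do
    fi
    # Leading blank line guards against a file that lacks a final newline.
    printf '\n' >> "${grub_cfg}"
    cat >> "${grub_cfg}" <<'EOF'
menuentry "Install GH200 System (Requires Internet)" {
    set gfxpayload=keep
    linux /casper/vmlinuz quiet autoinstall 'ds=nocloud-net;s=file:///cdrom/cidata/'
    initrd /casper/initrd
}
EOF
}

# Example (the mount point is an assumption):
#   add_gh200_menuentry /media/usb/boot/grub/grub.cfg
```

Because the entry is appended, it shows up last in the boot menu and has to be selected manually; place it first in the file by hand if you want it to be the default.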
NOTE: There is a bug with the Supermicro ARS-111GL-NHR that causes the system to hang when rebooting with a USB drive connected. It is unknown whether this affects all GH200 systems. After installation, the screen will likely appear stuck. Simply remove the USB drive (or power cycle from the BMC) and the system should boot right into the OS.
Run the following commands to validate all drivers are working:
- Confirm the nvidia drivers have loaded correctly:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   33C    P0             76W /  900W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
- Ensure the nvidia-peermem kernel module has loaded correctly
lsmod | grep nvidia_peermem
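For unattended checks, the module test above can be wrapped so that it returns an exit status instead of relying on someone reading the output. This is a small sketch; module_loaded is a hypothetical helper that reads lsmod-style text on standard input and matches the module name exactly, which avoids false positives from modules whose names merely contain the search string.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named module appears in lsmod-style
# input on stdin. Only the first column is compared, and it must match exactly.
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Example:
#   lsmod | module_loaded nvidia_peermem && echo "nvidia_peermem is loaded"
```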