NVIDIA A16 GPU with talos LTS driver does not work! #9861

Open
rbuffi opened this issue Dec 2, 2024 — with Slack · 5 comments

rbuffi commented Dec 2, 2024

Hi, we cannot get the NVIDIA A16 GPU to work with Talos 1.8.1 (VMware OVA) in passthrough mode:

Logging:
NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:25b6)
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 535.183.06 driver release.

Get extensions:
NODE            NAMESPACE   TYPE              ID   VERSION   NAME                           VERSION
10.242.249.50   runtime     ExtensionStatus   0    1         nonfree-kmod-nvidia-lts        535.183.06-v1.8.1
10.242.249.50   runtime     ExtensionStatus   1    1         nvidia-container-toolkit-lts   535.183.06-v1.16.1

GET PCIDEVICES:
10.242.249.50 hardware PCIDevice 0000:02:00.0 1 Display controller VGA compatible controller NVIDIA Corporation GA107GL [A2 / A16]

The driver version is supported; the support matrix is attached.
On the same ESXi host, Ubuntu with the same NVIDIA driver version works fine, so the problem must be in the NVIDIA extension.
Does the extension need an internet connection to download the extension image from a registry? We are on a dark site.

Slack Message

@smira
Member

smira commented Dec 2, 2024

Have you tried open-source drivers? 550.x version?

Which hardware is supported is up to NVIDIA; we just repackage the driver.

@rbuffi
Author

rbuffi commented Dec 2, 2024

[Attached image: GPU]

The NVIDIA A16 is supported by this driver and works with the same driver version on Ubuntu. We also tried the open-source driver, with the same result. Could there be an issue with the extension in the VMware OVA from Image Factory?

@smira
Member

smira commented Dec 2, 2024

I don't have a specific idea; the message is printed by NVIDIA's kernel module.

It might need some extra flags to the module (?).

@rbuffi
Author

rbuffi commented Dec 2, 2024

Thank you for your reply. When we set the following advanced settings on the VM:

pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB=32

The modules are being loaded:

read /proc/modules

nvidia_uvm 1884160 - - Live 0xffffffffc3fbb000 (PO)
nvidia_drm 94208 - - Live 0xffffffffc3fa0000 (PO)
nvidia_modeset 1531904 - - Live 0xffffffffc3e07000 (PO)
nvidia 62754816 - - Live 0xffffffffc022c000 (PO)

So the card is now working in passthrough mode!
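
For anyone repeating this check, here is a minimal Python sketch that scans `/proc/modules` text for the four NVIDIA modules listed above and reports any that are not in the `Live` state. The function names and the `EXPECTED` tuple are made up for illustration; the column layout (name, size, refcount, dependencies, state, address) is the standard `/proc/modules` format.

```python
# Hypothetical helper: names below are illustrative, not part of Talos or the NVIDIA driver.
EXPECTED = ("nvidia", "nvidia_modeset", "nvidia_drm", "nvidia_uvm")

# Sample input copied from the /proc/modules output in this comment.
SAMPLE = """\
nvidia_uvm 1884160 - - Live 0xffffffffc3fbb000 (PO)
nvidia_drm 94208 - - Live 0xffffffffc3fa0000 (PO)
nvidia_modeset 1531904 - - Live 0xffffffffc3e07000 (PO)
nvidia 62754816 - - Live 0xffffffffc022c000 (PO)
"""


def live_modules(proc_modules_text):
    """Return the set of module names whose state column is 'Live'."""
    live = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Columns: name size refcount deps state address [taint flags]
        if len(fields) >= 5 and fields[4] == "Live":
            live.add(fields[0])
    return live


def missing_nvidia_modules(proc_modules_text, expected=EXPECTED):
    """Return the expected module names that are not loaded, in EXPECTED order."""
    live = live_modules(proc_modules_text)
    return [name for name in expected if name not in live]


if __name__ == "__main__":
    missing = missing_nvidia_modules(SAMPLE)
    print("missing modules:", missing)
```

Note that loaded modules only show that the driver accepted the device; they do not by themselves confirm that workloads can use the GPU.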

We would also like to get the GPU working in vGPU (GRID) mode, but when we connect it in vGPU (GRID) mode the A16 is not recognized. Do you have any pointers for getting this working?

@nebula-it
Contributor

What's the error you are getting in dmesg?
Also, if you are doing PCIe passthrough of a GPU to a VM, pciPassthru.64bitMMIOSizeGB needs to be at least as large as the GPU's vRAM, which I believe is 64 for the A16, rather than 32.
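
The sizing rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and rounding up to a power of two is an assumption based on commonly cited VMware guidance, not something stated in this thread. The 64 GB figure for the A16 is the value suggested in the comment above.

```python
def mmio_size_gb(total_vram_gb):
    """Smallest power of two (in GB) that is >= the total passed-through GPU memory.

    Assumption: power-of-two rounding follows commonly cited VMware guidance.
    """
    if total_vram_gb <= 0:
        raise ValueError("total_vram_gb must be positive")
    size = 1
    while size < total_vram_gb:
        size *= 2
    return size


def advanced_settings(total_vram_gb):
    """Return the two VM advanced settings from this thread as key/value strings."""
    return {
        "pciPassthru.use64bitMMIO": "TRUE",
        "pciPassthru.64bitMMIOSizeGB": str(mmio_size_gb(total_vram_gb)),
    }


if __name__ == "__main__":
    # 64 GB is the A16 value suggested in the comment above.
    for key, value in advanced_settings(64).items():
        print(f"{key}={value}")
```

If several GPUs are passed through to one VM, the input would be their combined memory.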
