Integrate EGM fixes + sysfs linkage required for libvirt #33

Open
wants to merge 12 commits into
base: 24.04_linux-nvidia-adv-6.8-next

nvmochs
Collaborator

@nvmochs nvmochs commented Nov 22, 2024

Small series of patches to support vEGM with libvirt. Tested on CG1 and CG4.

To regression test the EGM patches, I booted the host with the 4k and 64k tech preview kernels plus these patches and launched a VM backed by the EGM character device. The guest VM ran the same tech preview kernel used for the vCMDQ tests in PR 32, in both the 4k and 64k configurations, with the same tests and success criteria. All tests passed.

To test the sysfs linkage patch, with EGM configured on the host, I verified the presence of the PCI dev -> EGM chardev and EGM chardev -> PCI dev links, their removal upon unconfiguring the device, and their recreation when configuring the device again.
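The link check described above could be automated along these lines. This is a minimal userspace sketch and not part of the series; `link_resolves_to` is a hypothetical helper that wraps the POSIX `realpath()` call, which is also what the `realpath` command used in the example further down relies on.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patches).
 * Returns 1 if link_path resolves to expected_target, 0 if it resolves
 * somewhere else, and -1 if it cannot be resolved at all (for example,
 * the link is absent because the device is unconfigured).
 */
int link_resolves_to(const char *link_path, const char *expected_target)
{
    /* With a NULL buffer, realpath() allocates the result with malloc(). */
    char *resolved = realpath(link_path, NULL);
    int match;

    if (resolved == NULL)
        return -1;

    match = strcmp(resolved, expected_target) == 0;
    free(resolved);
    return match;
}
```

On a host configured like the one in the sysfs example, `link_resolves_to("/sys/bus/pci/devices/0009:01:00.0/egm4", "/sys/devices/virtual/egm/egm4")` would be expected to return 1 while the device is configured and -1 after it is unconfigured.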

The memory free, registration error handling, and invalid retired pages base patches were unit tested with scaffolding during development. The retired pages base patch was added because the system I initially used had an invalid firmware image that presented that node without an address.

Free the kmalloc'd region when the EGM is unregistered.

Signed-off-by: Matthew R. Ochs <[email protected]>
Move region hash initialization alongside the other region initialization
statements to avoid situations where the hash table was not properly
initialized.

Signed-off-by: Matthew R. Ochs <[email protected]>
Update error handling within the EGM registration routine to catch and
return errors to the caller.

Signed-off-by: Matthew R. Ochs <[email protected]>
Detect and handle a failure from the EGM registration service.

Signed-off-by: Matthew R. Ochs <[email protected]>
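The four fixes above (freeing on unregister, early hash initialization, returning errors from registration, and handling a registration failure in the caller) follow a common kernel pattern. The sketch below models that pattern in userspace C and is not the driver code: `struct egm_region_sketch`, `register_service`, and the `fail_service` knob are hypothetical stand-ins, and `calloc`/`free` stand in for the kernel allocators.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's region structure. */
struct egm_region_sketch {
    void **hash;      /* stand-in for the region's hash table */
    int registered;
};

/* Hypothetical registration service; fails on request so the error
 * path can be exercised. */
int register_service(struct egm_region_sketch *r, int fail)
{
    if (fail)
        return -EIO;
    r->registered = 1;
    return 0;
}

int egm_register_sketch(struct egm_region_sketch **out, int fail_service)
{
    struct egm_region_sketch *r;
    int ret;

    r = calloc(1, sizeof(*r));
    if (!r)
        return -ENOMEM;

    /* Initialize the hash together with the rest of the region state,
     * before anything can look the region up. */
    r->hash = calloc(16, sizeof(*r->hash));
    if (!r->hash) {
        ret = -ENOMEM;
        goto err_free_region;
    }

    /* Propagate a registration failure instead of ignoring it. */
    ret = register_service(r, fail_service);
    if (ret)
        goto err_free_hash;

    *out = r;
    return 0;

err_free_hash:
    free(r->hash);
err_free_region:
    free(r);
    return ret;
}

void egm_unregister_sketch(struct egm_region_sketch *r)
{
    if (!r)
        return;
    free(r->hash);
    free(r);  /* release the allocation made at registration */
}
```

A caller in this model checks the return value of `egm_register_sketch()` and treats any nonzero result as a failed probe; `*out` is only written on success.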
Fix source to resolve checkpatch warnings

Signed-off-by: Matthew R. Ochs <[email protected]>
Fix minor syntax errors from sparse.

Signed-off-by: Matthew R. Ochs <[email protected]>
Return the intended errno upon a copyout fault, remove unnecessary
checks following container_of pointer derivation, and use the correct
macro and types for overflow checking.

Signed-off-by: Matthew R. Ochs <[email protected]>
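For readers less familiar with overflow-checked range validation, here is a minimal userspace sketch of the idea. In the kernel this is typically written with `check_add_overflow()` from `linux/overflow.h`; the sketch uses the GCC/Clang builtin `__builtin_add_overflow` instead. The function name `egm_range_check` and its parameters are assumptions made for illustration and do not come from the patches.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: validate that [offset, offset + count) lies inside
 * a region of region_size bytes. Returns 0 if it does, -EINVAL otherwise.
 */
int egm_range_check(uint64_t offset, uint64_t count, uint64_t region_size)
{
    uint64_t end;

    /* All three operands share one type, so the builtin reports exactly
     * the wraparound that would occur in the later comparison. */
    if (__builtin_add_overflow(offset, count, &end))
        return -EINVAL;

    if (end > region_size)
        return -EINVAL;

    return 0;
}
```

Keeping the operands and the result in the same unsigned type matters here: if the result variable were narrower or signed, the overflow check would answer a different question than the bounds comparison that follows it.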
Use the correct macro and types for overflow checking.

Signed-off-by: Matthew R. Ochs <[email protected]>
Ensure ACPI table reads are successful prior to using the value.

Signed-off-by: Matthew R. Ochs <[email protected]>
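The pattern behind this change can be sketched as follows. This is a userspace model under stated assumptions, not the driver code: `struct prop`, `read_prop_u64`, `egm_get_size`, and the `"egm-size"` property name are all hypothetical and only illustrate checking a read's status before consuming its output.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a firmware property table. */
struct prop {
    const char *name;
    uint64_t value;
};

/* Hypothetical reader: 0 on success, -ENOENT if the property is absent.
 * *out is written only on success. */
int read_prop_u64(const struct prop *props, size_t n,
                  const char *name, uint64_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(props[i].name, name) == 0) {
            *out = props[i].value;
            return 0;
        }
    }
    return -ENOENT;
}

int egm_get_size(const struct prop *props, size_t n, uint64_t *size)
{
    uint64_t val;
    int ret;

    ret = read_prop_u64(props, n, "egm-size", &val);
    if (ret)
        return ret;  /* val is uninitialized here and must not be used */

    *size = val;
    return 0;
}
```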
Some environments may provide a "nvidia,egm-retired-pages-data-base" but
fail to populate it with a base address, leaving it NULL. Mapping this
invalid value results in a synchronous exception when the region is first
touched. Detect a NULL value, generate a warning to draw attention to the
firmware bug, and return without mapping.

INFO:    th500_ras_intr_handler: External Abort reason=1 syndrome=0x92000410 flags=0x1
[   82.104493] Internal error: synchronous external abort: 0000000096000410 [#1] SMP
[   82.114898] Modules linked in: nvgrace_gpu_vfio_pci(E) nvgrace_egm(E)
[   82.257218] CPU: 0 PID: 10 Comm: kworker/0:1 Tainted: G           OE      6.8.12+ #5
[   82.265135] Hardware name: NVIDIA GH200 P5042, BIOS 24103110 20241031
[   82.271720] Workqueue: events work_for_cpu_fn
[   82.276180] pstate: 03400009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[   82.283298] pc : register_egm_node+0x2cc/0x440 [nvgrace_egm]
[   82.289087] lr : register_egm_node+0x2c4/0x440 [nvgrace_egm]
[   82.294872] sp : ffff8000802ebc30
[   82.298254] x29: ffff8000802ebc60 x28: 00000000000000ff x27: 0000000000000000
[   82.305550] x26: ffff000087a320c8 x25: ffff0000a5700000 x24: ffff000087a32000
[   82.312846] x23: ffffa77cd758e368 x22: 0000000000000000 x21: ffffa77cd758c640
[   82.320141] x20: ffffa77cd758e170 x19: ffff800081e7d000 x18: ffff800080293038
[   82.327437] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   82.334732] x14: 0000000000000000 x13: 65203a65646f6e5f x12: 0000000000000000
[   82.342027] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
[   82.349322] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[   82.356618] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[   82.363913] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff800081e7d000
[   82.371210] Call trace:
[   82.373705]  register_egm_node+0x2cc/0x440 [nvgrace_egm]
[   82.379135]  nvgrace_gpu_probe+0x2ac/0x528 [nvgrace_gpu_vfio_pci]
[   82.385366]  local_pci_probe+0x4c/0xe0
[   82.389198]  work_for_cpu_fn+0x28/0x58
[   82.393026]  process_one_work+0x168/0x3f0
[   82.397123]  worker_thread+0x360/0x480
[   82.400952]  kthread+0x11c/0x128
[   82.404248]  ret_from_fork+0x10/0x20
[   82.407906] Code: d2820001 940002b3 aa0003f3 b4fffac0 (f9400017)
[   82.414134] ---[ end trace 0000000000000000 ]---

Signed-off-by: Matthew R. Ochs <[email protected]>
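The guard described in this commit message can be modelled in a few lines. The sketch below is a userspace illustration, not the patch itself: `check_retired_pages_base` and the `retired_pages_action` enum are hypothetical names, and `fprintf(stderr, ...)` stands in for a kernel warning.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical decision type: what the caller should do with the base. */
enum retired_pages_action {
    RETIRED_PAGES_SKIP,  /* base is unusable; do not map */
    RETIRED_PAGES_MAP    /* base looks populated; safe to attempt mapping */
};

enum retired_pages_action check_retired_pages_base(uint64_t base)
{
    if (base == 0) {
        /* The property exists but firmware left it unpopulated. Warn so
         * the firmware bug is visible, then skip the mapping. */
        fprintf(stderr,
                "egm: retired pages base is NULL, skipping (firmware bug?)\n");
        return RETIRED_PAGES_SKIP;
    }
    return RETIRED_PAGES_MAP;
}
```

In this model the caller maps and reads the retired pages region only when `RETIRED_PAGES_MAP` is returned, which avoids the first touch of an invalid mapping that produced the abort shown in the log above.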
Create a sysfs link between the EGM character device and its associated
GPU (PCI device) for correlation.

Example:
$ realpath /sys/class/egm/egm4/0009\:01\:00.0
/sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0

$ realpath /sys/bus/pci/devices/0009:01:00.0/egm4
/sys/devices/virtual/egm/egm4

Signed-off-by: Matthew R. Ochs <[email protected]>
…ve error handling

Small series of fixes/improvements to the nvgrace VFIO modules.

Signed-off-by: Matthew R. Ochs <[email protected]>