Integrate EGM fixes + sysfs linkage required for libvirt #33
Open: nvmochs wants to merge 12 commits into NVIDIA:24.04_linux-nvidia-adv-6.8-next from nvmochs:adv_ghvirt_11222024
Free the kmalloc'd region when the EGM is unregistered. Signed-off-by: Matthew R. Ochs <[email protected]>
Move region hash initialization alongside the other region initialization statements to avoid situations where the hash table was not properly initialized. Signed-off-by: Matthew R. Ochs <[email protected]>
Update error handling within the EGM registration routine to catch and return errors to the caller. Signed-off-by: Matthew R. Ochs <[email protected]>
Detect and handle a failure from the EGM registration service. Signed-off-by: Matthew R. Ochs <[email protected]>
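The diffs for these registration commits are not shown on this page, so the following is only a userspace sketch of the general pattern they describe: initialize all region state together, propagate a failure to the caller with cleanup, and free the allocation on unregister. The struct, the function names, and the `fail_chardev` failure-injection parameter are all hypothetical and do not come from the nvgrace_egm module.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's per-region state. */
struct egm_region {
    int hash_ready;  /* models the region hash table being initialized */
    int registered;  /* models successful character device setup */
};

/*
 * Hypothetical registration routine. fail_chardev injects a failure so
 * the error path can be exercised. Returns 0 or a negative errno.
 */
int egm_register(struct egm_region **out, int fail_chardev)
{
    struct egm_region *r = calloc(1, sizeof(*r));
    int ret;

    if (!r)
        return -ENOMEM;

    /* Initialize the hash alongside the other region fields. */
    r->hash_ready = 1;

    if (fail_chardev) {
        ret = -ENODEV;
        goto err_free;
    }

    r->registered = 1;
    *out = r;
    return 0;

err_free:
    /* Undo the allocation so the caller sees an error and no leak. */
    free(r);
    return ret;
}

/* Hypothetical unregister routine: release the region allocated above. */
void egm_unregister(struct egm_region *r)
{
    free(r);
}
```

A caller would check the return value of `egm_register()` and abort its own setup on a negative result, which is the behavior the "detect and handle a failure" commit describes at a high level.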
Fix source to resolve checkpatch warnings Signed-off-by: Matthew R. Ochs <[email protected]>
Fix minor syntax errors from sparse. Signed-off-by: Matthew R. Ochs <[email protected]>
Return the intended errno upon a copyout fault, remove unnecessary checks following container_of pointer derivation, and use the correct macro and types for overflow checking. Signed-off-by: Matthew R. Ochs <[email protected]>
Use the correct macro and types for overflow checking. Signed-off-by: Matthew R. Ochs <[email protected]>
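For readers unfamiliar with kernel overflow checking, here is a small userspace sketch. The kernel's `check_add_overflow()` helper in `<linux/overflow.h>` is built on the compiler builtin `__builtin_add_overflow`, which GCC and Clang also expose to ordinary programs. The function `range_end` and its parameters are hypothetical; the actual expressions changed by these commits are not shown on this page.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute offset + length for an access into a
 * region. Returns true and stores the sum in *end when it fits in a
 * uint64_t, or false when the addition would wrap around.
 */
bool range_end(uint64_t offset, uint64_t length, uint64_t *end)
{
    /* The builtin returns true when the result does not fit in *end. */
    if (__builtin_add_overflow(offset, length, end))
        return false;
    return true;
}
```

The operands and the result pointer all share one type here, which avoids any implicit conversion before the check is made.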
Ensure ACPI table reads are successful prior to using the value. Signed-off-by: Matthew R. Ochs <[email protected]>
Some environments may provide a "nvidia,egm-retired-pages-data-base" but fail to populate it with a base address, leaving it NULL. Mapping this invalid value results in a synchronous exception when the region is first touched. Detect a NULL value, generate a warning to draw attention to the firmware bug, and return without mapping.

INFO: th500_ras_intr_handler: External Abort reason=1 syndrome=0x92000410 flags=0x1
[ 82.104493] Internal error: synchronous external abort: 0000000096000410 [#1] SMP
[ 82.114898] Modules linked in: nvgrace_gpu_vfio_pci(E) nvgrace_egm(E)
[ 82.257218] CPU: 0 PID: 10 Comm: kworker/0:1 Tainted: G OE 6.8.12+ #5
[ 82.265135] Hardware name: NVIDIA GH200 P5042, BIOS 24103110 20241031
[ 82.271720] Workqueue: events work_for_cpu_fn
[ 82.276180] pstate: 03400009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 82.283298] pc : register_egm_node+0x2cc/0x440 [nvgrace_egm]
[ 82.289087] lr : register_egm_node+0x2c4/0x440 [nvgrace_egm]
[ 82.294872] sp : ffff8000802ebc30
[ 82.298254] x29: ffff8000802ebc60 x28: 00000000000000ff x27: 0000000000000000
[ 82.305550] x26: ffff000087a320c8 x25: ffff0000a5700000 x24: ffff000087a32000
[ 82.312846] x23: ffffa77cd758e368 x22: 0000000000000000 x21: ffffa77cd758c640
[ 82.320141] x20: ffffa77cd758e170 x19: ffff800081e7d000 x18: ffff800080293038
[ 82.327437] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 82.334732] x14: 0000000000000000 x13: 65203a65646f6e5f x12: 0000000000000000
[ 82.342027] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
[ 82.349322] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 82.356618] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 82.363913] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff800081e7d000
[ 82.371210] Call trace:
[ 82.373705] register_egm_node+0x2cc/0x440 [nvgrace_egm]
[ 82.379135] nvgrace_gpu_probe+0x2ac/0x528 [nvgrace_gpu_vfio_pci]
[ 82.385366] local_pci_probe+0x4c/0xe0
[ 82.389198] work_for_cpu_fn+0x28/0x58
[ 82.393026] process_one_work+0x168/0x3f0
[ 82.397123] worker_thread+0x360/0x480
[ 82.400952] kthread+0x11c/0x128
[ 82.404248] ret_from_fork+0x10/0x20
[ 82.407906] Code: d2820001 940002b3 aa0003f3 b4fffac0 (f9400017)
[ 82.414134] ---[ end trace 0000000000000000 ]---

Signed-off-by: Matthew R. Ochs <[email protected]>
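The guard described in this commit message can be sketched in userspace as below. This is an illustration only, not the driver code: the function name `retired_pages_base_usable` and the warning text are hypothetical, and a zero `uint64_t` stands in for the NULL base address reported by firmware.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical check run before mapping the retired pages data region.
 * Returns false (and warns) when the firmware left the base unset, so
 * the caller can return without mapping instead of touching address 0.
 */
bool retired_pages_base_usable(uint64_t base)
{
    if (base == 0) {
        /* Warn loudly: this points at a firmware bug, not a driver bug. */
        fprintf(stderr,
                "egm: retired pages base is NULL (firmware bug?), not mapping\n");
        return false;
    }
    return true;
}
```

In the driver, the equivalent check would sit before the mapping call that faulted at `register_egm_node+0x2cc` in the trace above.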
Create a sysfs link between the egm character device and its associated GPU (PCI device) for correlation.
Example:
$ realpath /sys/class/egm/egm4/0009\:01\:00.0
/sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0
$ realpath /sys/bus/pci/devices/0009:01:00.0/egm4
/sys/devices/virtual/egm/egm4
Signed-off-by: Matthew R. Ochs <[email protected]>
…ve error handling Small series of fixes/improvements to the nvgrace VFIO modules. Signed-off-by: Matthew R. Ochs <[email protected]>
Small series of patches to support vEGM with libvirt. Tested on CG1 and CG4.
To regression test the EGM patches, I booted the host with the 4k and 64k tech preview kernels plus these patches, and launched a VM backed by the EGM character device. The guest VM ran the same tech preview kernel used for the vCMDQ tests in PR 32, and I tested both 4k and 64k with the same tests and success criteria. All tests passed.
To test the sysfs linkage patch, with EGM configured on the host, I verified the presence of the PCI dev -> EGM chardev and EGM chardev -> PCI dev links, their removal upon unconfiguring the device, and their recreation when configuring the device again.
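A check like the one described could be scripted by first building the two link paths for a given EGM device and PCI address. The sketch below follows the path layout from the sysfs linkage commit message example (`/sys/class/egm/<egm>/<bdf>` and `/sys/bus/pci/devices/<bdf>/<egm>`); the function names are hypothetical and this is not the test tooling used for the PR.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build the EGM chardev -> PCI device link path,
 * for example /sys/class/egm/egm4/0009:01:00.0.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int egm_to_pci_link(char *buf, size_t len, const char *egm, const char *bdf)
{
    int n = snprintf(buf, len, "/sys/class/egm/%s/%s", egm, bdf);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Build the PCI device -> EGM chardev link path,
 * for example /sys/bus/pci/devices/0009:01:00.0/egm4.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int pci_to_egm_link(char *buf, size_t len, const char *egm, const char *bdf)
{
    int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/%s", bdf, egm);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Each resulting path could then be passed to `realpath(3)` or `lstat(2)` to confirm that the link exists after configuring the device and is gone after unconfiguring it.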
The memory free, registration error handling, and invalid retired pages base patches were unit tested with scaffolding while being developed. The retired pages base patch in particular was added because the system I was initially using had an invalid firmware image that presented that node without an address.