Integrate EGM fixes + sysfs linkage required for libvirt #33

Open
wants to merge 12 commits into
base: 24.04_linux-nvidia-adv-6.8-next

Commits on Nov 22, 2024

  1. vfio/nvgrace-egm: Free region memory during unregistration

    Free the kmalloc'd region when the EGM is unregistered.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    488ba9d
  2. vfio/nvgrace-egm: Move region hash initialization

    Move region hash initialization alongside the other region initialization
    statements to avoid situations where the hash table was not properly
    initialized.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    5a2a63a
  3. vfio/nvgrace-egm: Handle and convey EGM registration errors

    Update error handling within the EGM registration routine to catch and
    return errors to the caller.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    f05e845
  4. vfio/nvgrace-gpu: Handle EGM registration failure

    Detect and handle a failure from the EGM registration service.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    9f2e0ad
  5. vfio/nvgrace-gpu: Address checkpatch warnings

    Fix the source to resolve checkpatch warnings.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    1981e05
  6. vfio/nvgrace-egm: Address sparse errors

    Fix minor syntax errors from sparse.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    eeef16d
  7. vfio/nvgrace-egm: Address smatch errors

    Return the intended errno upon a copyout fault, remove unnecessary
    checks following container_of pointer derivation, and use the correct
    macro and types for overflow checking.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    299bc85
  8. vfio/nvgrace-gpu: Address smatch errors

    Use the correct macro and types for overflow checking.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    228d53d
  9. vfio/nvgrace-egm: Ensure ACPI value reads are successful

    Ensure ACPI table reads are successful prior to using the value.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    dcbb01f
  10. vfio/nvgrace-egm: Avoid invalid retired pages base

    Some environments may provide a "nvidia,egm-retired-pages-data-base" but
    fail to populate it with a base address, leaving it NULL. Mapping this
    invalid value results in a synchronous exception when the region is first
    touched. Detect a NULL value, generate a warning to draw attention to the
    firmware bug, and return without mapping.
    
    INFO:    th500_ras_intr_handler: External Abort reason=1 syndrome=0x92000410 flags=0x1
    [   82.104493] Internal error: synchronous external abort: 0000000096000410 [#1] SMP
    [   82.114898] Modules linked in: nvgrace_gpu_vfio_pci(E) nvgrace_egm(E)
    [   82.257218] CPU: 0 PID: 10 Comm: kworker/0:1 Tainted: G           OE      6.8.12+ #5
    [   82.265135] Hardware name: NVIDIA GH200 P5042, BIOS 24103110 20241031
    [   82.271720] Workqueue: events work_for_cpu_fn
    [   82.276180] pstate: 03400009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
    [   82.283298] pc : register_egm_node+0x2cc/0x440 [nvgrace_egm]
    [   82.289087] lr : register_egm_node+0x2c4/0x440 [nvgrace_egm]
    [   82.294872] sp : ffff8000802ebc30
    [   82.298254] x29: ffff8000802ebc60 x28: 00000000000000ff x27: 0000000000000000
    [   82.305550] x26: ffff000087a320c8 x25: ffff0000a5700000 x24: ffff000087a32000
    [   82.312846] x23: ffffa77cd758e368 x22: 0000000000000000 x21: ffffa77cd758c640
    [   82.320141] x20: ffffa77cd758e170 x19: ffff800081e7d000 x18: ffff800080293038
    [   82.327437] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
    [   82.334732] x14: 0000000000000000 x13: 65203a65646f6e5f x12: 0000000000000000
    [   82.342027] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
    [   82.349322] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
    [   82.356618] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
    [   82.363913] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff800081e7d000
    [   82.371210] Call trace:
    [   82.373705]  register_egm_node+0x2cc/0x440 [nvgrace_egm]
    [   82.379135]  nvgrace_gpu_probe+0x2ac/0x528 [nvgrace_gpu_vfio_pci]
    [   82.385366]  local_pci_probe+0x4c/0xe0
    [   82.389198]  work_for_cpu_fn+0x28/0x58
    [   82.393026]  process_one_work+0x168/0x3f0
    [   82.397123]  worker_thread+0x360/0x480
    [   82.400952]  kthread+0x11c/0x128
    [   82.404248]  ret_from_fork+0x10/0x20
    [   82.407906] Code: d2820001 940002b3 aa0003f3 b4fffac0 (f9400017)
    [   82.414134] ---[ end trace 0000000000000000 ]---
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    db5fdd3
  11. vfio/nvgrace-egm: Link egm and PCI devices

    Create a sysfs link between the egm character device and its associated
    GPU (PCI device) for correlation.
    
    Example:
    $ realpath /sys/class/egm/egm4/0009\:01\:00.0
    /sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0
    
    $ realpath /sys/bus/pci/devices/0009:01:00.0/egm4
    /sys/devices/virtual/egm/egm4
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    cd68d43
  12. cover-letter: vfio/nvgrace-egm: Support EGM/GPU correlation and improve error handling

    Small series of fixes/improvements to the nvgrace VFIO modules.
    
    Signed-off-by: Matthew R. Ochs <[email protected]>
    nvmochs committed Nov 22, 2024
    420d79b
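
For reviewers who want to check the sysfs linkage from commit 11 in user space, the following is a minimal sketch of how a management tool might correlate EGM devices with their GPUs. It assumes only the layout shown in that commit's example (a symlink named after the PCI address inside each `/sys/class/egm/egmN` directory). The function name `egm_to_gpu_map` and the `sysfs_root` parameter are hypothetical and are not part of this series or of libvirt.

```python
import re
from pathlib import Path

# PCI address in domain:bus:device.function form, e.g. 0009:01:00.0
PCI_BDF = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def egm_to_gpu_map(sysfs_root="/sys"):
    """Return {egm_name: [pci_address, ...]} based on links under class/egm.

    Hypothetical helper: it relies only on the link layout shown in the
    commit message example, not on any documented ABI.
    """
    egm_class = Path(sysfs_root) / "class" / "egm"
    mapping = {}
    if not egm_class.is_dir():
        # Module not loaded, or no EGM regions were registered.
        return mapping
    for egm_dir in sorted(egm_class.iterdir()):
        # Keep only symlinks whose name looks like a PCI address; other
        # entries (uevent, subsystem, power, ...) are ignored.
        gpus = [
            entry.name
            for entry in egm_dir.iterdir()
            if entry.is_symlink() and PCI_BDF.match(entry.name)
        ]
        mapping[egm_dir.name] = sorted(gpus)
    return mapping
```

On a system like the one in the commit 11 example, the entry for `egm4` would be expected to list `0009:01:00.0`. Passing a different `sysfs_root` allows the function to be exercised against a mock directory tree without the hardware.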