bug: rdma exclusive handling #603

Merged: 1 commit, Nov 3, 2024

@rollandf rollandf commented Oct 21, 2024

When an RDMA device in exclusive mode was in use by a Pod, the device plugin (DP) did not report it as a resource after a DP restart.

The following changes are introduced in RdmaSpec:

  • isRdma: if no RDMA resources are found, check whether the netlink "enable_rdma" param is available.
  • GetRdmaDeviceSpec: the device specs are now retrieved dynamically rather than at the discovery stage as before.
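
A minimal sketch of the isRdma fallback described above, assuming an in-memory model of the host. The `host` type, its fields, and the PCI addresses are hypothetical stand-ins for illustration; they are not the plugin's actual types or its netlink calls.

```go
package main

// host is a hypothetical model of what the device plugin can observe
// from the host namespace. It is not the plugin's real data structure.
type host struct {
	// rdmaResources maps a PCI address to the RDMA character devices
	// currently visible on the host. In exclusive mode the entry is
	// empty while a Pod owns the device.
	rdmaResources map[string][]string
	// enableRdmaParam records which PCI devices expose the
	// "enable_rdma" param (assumed to be queried over netlink).
	enableRdmaParam map[string]bool
}

// isRdma reports whether a PCI device should be treated as RDMA capable.
func (h *host) isRdma(pciAddr string) bool {
	// Normal case: RDMA resources are visible on the host.
	if len(h.rdmaResources[pciAddr]) > 0 {
		return true
	}
	// Exclusive-mode fallback: the resources may be hidden because a Pod
	// is using the device, so consult the "enable_rdma" param instead.
	return h.enableRdmaParam[pciAddr]
}
```

With this shape, a device whose RDMA resources are hidden by a running Pod is still classified as RDMA capable after a restart, as long as the param is present.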

Computing the RDMA specs dynamically, rather than at discovery, solves the following scenario for exclusive mode:

  • Discover RDMA device
  • Allocate to Pod (resources are hidden on host)
  • Restart DP pod
  • Discovery
  • Deallocate
  • Reallocate
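
The scenario above can be sketched as follows, contrasting specs cached at discovery with specs computed on each call. This is an illustration under assumptions: `hostView`, `cachedDevice`, and `dynamicDevice` are hypothetical names, and the host is modelled as a plain map rather than real sysfs or netlink lookups.

```go
package main

// hostView is a hypothetical model of the RDMA character devices that
// are visible on the host, keyed by PCI address.
type hostView map[string][]string

// cachedDevice snapshots the specs once, at discovery time
// (the behaviour described as "before").
type cachedDevice struct {
	specs []string
}

// discoverCached copies whatever is visible on the host right now.
func discoverCached(h hostView, pciAddr string) cachedDevice {
	return cachedDevice{specs: append([]string(nil), h[pciAddr]...)}
}

// Specs returns the snapshot taken at discovery, even if it is stale.
func (d cachedDevice) Specs() []string { return d.specs }

// dynamicDevice keeps a reference to the host view and looks the
// specs up again on every call.
type dynamicDevice struct {
	host    hostView
	pciAddr string
}

// Specs returns a copy of what is visible on the host at call time.
func (d dynamicDevice) Specs() []string {
	return append([]string(nil), d.host[d.pciAddr]...)
}
```

If discovery runs while a Pod still holds the device, the cached variant records an empty spec list and keeps returning it after the Pod releases the device, while the dynamic variant picks up the resources once they reappear on the host.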

Fixes #565

coveralls commented Oct 21, 2024

Pull Request Test Coverage Report for Build 11577952041

Details

  • 28 of 58 (48.28%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.6%) to 74.628%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/utils/utils.go | 0 | 6 | 0.0% |
| pkg/devices/rdma.go | 27 | 36 | 75.0% |
| pkg/utils/netlink_provider.go | 0 | 15 | 0.0% |

Totals

  • Change from base Build 11458515878: -0.6%
  • Covered Lines: 2109
  • Relevant Lines: 2826

💛 - Coveralls

rollandf commented Oct 22, 2024

@SchSeba @zeeke PTAL

Note that exclusive mode will be exposed in SRIOV-Network-Operator with this PR:
k8snetworkplumbingwg/sriov-network-operator#666

	default:
		return false
	}
	// Checking for netlink param for exclusive RDMA use case
Contributor

Can you add more information here on why we need to check the netlink param in this case? (Requested by Sebastian.)

Contributor Author

Done.
@SchSeba PTAL

When an RDMA device in exclusive mode was in use
by a Pod, the DP did not report it as a resource
after a DP restart.

The following changes are introduced in RdmaSpec:

- isRdma: if no RDMA resources are found,
  check whether the netlink "enable_rdma" param is available.
- GetRdmaDeviceSpec: the device specs are now retrieved
  dynamically rather than at the discovery stage as before.

Computing the RDMA specs dynamically, rather than at discovery,
solves the following scenario for exclusive mode:
- Discover RDMA device
- Allocate to Pod (resources are hidden on host)
- Restart DP pod
- Deallocate
- Reallocate

Fixes k8snetworkplumbingwg#565

Signed-off-by: Fred Rolland <[email protected]>
@rollandf
Contributor Author

@SchSeba can you PTAL?

@SchSeba SchSeba left a comment

I tested this and also added a functional test covering it in the operator: https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/799/files#diff-909069834ea269a01a51f28b8830efebec16799766767ebcd01b58f966ddc5c5R226 (a real mlx device is needed for the test to run).

@SchSeba SchSeba merged commit a380ca5 into k8snetworkplumbingwg:master Nov 3, 2024
10 of 11 checks passed
Successfully merging this pull request may close these issues.

Capacity and Allocatable number shows wrong if sriov-network-device-plugin restarts