
cluster-up, kind, common: Enable TopologyManager for kind-sriov #1347

Merged
merged 1 commit into kubevirt:main on Jan 31, 2025

Conversation

@nirdothan (Contributor) commented Jan 16, 2025

What this PR does / why we need it:
SR-IOV tests check topology alignment. Currently, in kind/sriov, kubelet does not attempt to align resources.
The SR-IOV alignment test calls the function hardware.LookupDeviceVCPUAffinity(). This function returns a slice of aligned CPUs, i.e. CPU numbers that are assigned to the guest and that share a NUMA node with SR-IOV VFs which are also passed to the guest.
In other words: if an app running in the guest uses a network interface, that NIC is physically close to the app's CPU, so traffic doesn't have to cross the system bus to reach it.
Back to LookupDeviceVCPUAffinity(): if it finds no alignment, it doesn't return an error; it just returns an empty slice.
As the test is currently written, an empty list, i.e. no alignment, is fine: the test simply validates that the guest knows there is no alignment, by checking the cloud-init metadata file in the guest.
This change adds a topology manager policy of single-numa-node, which forces alignment. If alignment cannot be achieved, scheduling will fail.
We will also assert in the test that the alignment slice is not empty.
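
For illustration, a minimal sketch (assuming the Ginkgo/Gomega style used by the kubevirt functional tests) of the strengthened assertion; the arguments and exact signature of hardware.LookupDeviceVCPUAffinity here are assumptions, only the function name comes from this PR:

```go
// Sketch only: vfPCIAddress and domainSpec are hypothetical arguments;
// the real signature of LookupDeviceVCPUAffinity may differ.
alignedCPUs, err := hardware.LookupDeviceVCPUAffinity(vfPCIAddress, domainSpec)
Expect(err).NotTo(HaveOccurred())

// Before this change, an empty slice (no alignment) passed the test.
// With topologyManagerPolicy=single-numa-node, an unalignable pod is
// rejected at admission, so the test can now demand actual alignment.
Expect(alignedCPUs).NotTo(BeEmpty())
```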

Add the topology manager[1] to the kubelet config and set its policy to single-numa-node.
Together with the cpu-manager policy=static, which we already set,
kubelet will reject a pod that it is unable to align.
[1] https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
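
As a rough sketch of the resulting kubelet settings (the field names are the upstream KubeletConfiguration ones; how the kind-sriov provider injects them is not shown here):

```yaml
# Sketch of the two kubelet policies that together enforce alignment;
# the actual change is applied through the kind-sriov provider scripts.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                 # already set before this PR
topologyManagerPolicy: single-numa-node  # added by this PR
```

For the later discussion in this thread: restricted also rejects pods whose NUMA affinity cannot be satisfied, but single-numa-node is stricter, admitting a pod only when all of its requested resources can be served from a single NUMA node.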

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist

This checklist is not enforced, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

NONE


@kubevirt-bot added the dco-signoff: yes label (Indicates the PR's author has DCO signed all their commits) on Jan 16, 2025
@nirdothan (Contributor, author)

/retest-required

@nirdothan (Contributor, author)

/retest-required

@nirdothan (Contributor, author) commented Jan 16, 2025

Changed the topology manager policy to restricted, as the SR-IOV tests failed; I suspect the scheduler could not allocate all the resources on a single NUMA node.

@nirdothan (Contributor, author)

/retest-required

@brianmcarey (Member)

/test check-up-kind-sriov

@nirdothan (Contributor, author)

/test check-up-kind-sriov

@brianmcarey @ormergi It's probably failing because of #1348

@ormergi (Contributor) commented Jan 20, 2025

> /test check-up-kind-sriov
>
> @brianmcarey @ormergi It's probably failing because of #1348

I don't see why #1348 would cause this PR to fail.
Looking at the last failure logs, it seems the kubevirt deployment failed:
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1347/check-up-kind-sriov/1879983806077734912#1:build-log.txt%3A2473

If anything, I hope we can get #1348 merged as soon as possible so that the lane can be made required again.

@brianmcarey (Member)

/test check-up-kind-sriov

@ormergi (Contributor) commented Jan 30, 2025

/test check-up-kind-sriov

EDIT:
the job failed due to a failure downloading the sonobuoy binary

@ormergi (Contributor) left a comment

Hi Nir, thanks for fixing this!

It seems the PR mentioned in the commit message (kubevirt/kubevirt#13685), which tests this change against kubevirt/kubevirt, is outdated.
Could you please verify it again?

Regarding the commit message, could you please mention why enabling TopologyManager makes the mentioned test meaningful?
I think you can drop mentioning that it was tested against kubevirt/kubevirt; you can comment about that in the PR instead.
Please consider wrapping the commit message body lines; they are very long.

Other than that, LGTM

SRIOV tests check topology alignment. Currently in kind/sriov kubelet
does not attempt to align resources.
In the SRIOV alignment test, we call the function
hardware.LookupDeviceVCPUAffinity(). This function returns a slice of
aligned CPUs, i.e. CPU numbers that are assigned to the guest and that
share a NUMA node with SRIOV VFs which are also passed to the guest.
In other words: if an app running in the guest uses a network
interface, that NIC is physically close to the app's CPU, so traffic
doesn't have to cross the system bus to reach it.
Back to LookupDeviceVCPUAffinity(): if it finds no alignment, it
doesn't return an error; it just returns an empty slice.
As the test is currently written, an empty list, i.e. no alignment,
is fine: the test simply validates that the guest knows there's no
alignment, by checking the cloud-init metadata file in the guest.
This change adds a topology manager policy of single-numa-node,
forcing alignment. If alignment is not achieved, scheduling will fail.
We will also assert in the test that the alignment slice is not empty.

Add the topology manager[1] to the kubelet config and set its policy to single-numa-node.
Together with the cpu-manager policy=static, which we already set,
kubelet will reject a pod that it is unable to align.
[1] https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/

Signed-off-by: Nir Dothan <ndothan@redhat.com>
@nirdothan (Contributor, author) commented Jan 30, 2025

This change addresses @ormergi's comment.

In addition, I changed the TopologyManager policy from restricted back to single-numa-node, which is the policy I was originally aiming for; I had only switched it to restricted to see whether that resolved the failed checks. However, those failures were related to the CPU manager and have been fixed by #1348.

@ormergi (Contributor) left a comment

LGTM

@kubevirt-bot added the lgtm label (Indicates that a PR is ready to be merged) on Jan 30, 2025
@brianmcarey (Member) left a comment

/approve

thanks @nirdothan

@kubevirt-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brianmcarey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Jan 31, 2025
@kubevirt-bot merged commit 1e31064 into kubevirt:main on Jan 31, 2025
13 checks passed
kubevirt-bot added a commit to kubevirt-bot/kubevirt that referenced this pull request Feb 1, 2025
[1e31064 cluster-up, kind, common: Enable TopologyManager for kind-sriov](kubevirt/kubevirtci#1347)
[00a9d12 Kubelet drop in](kubevirt/kubevirtci#1299)
[6b7d14b build(deps): bump golang.org/x/net in /cluster-provision/gocli](kubevirt/kubevirtci#1364)

```release-note
NONE
```

Signed-off-by: kubevirt-bot <kubevirtbot@redhat.com>
kubevirt-bot added a commit to kubevirt-bot/kubevirt that referenced this pull request Feb 1, 2025
[997cbff k8s-provider: Remove KUBEVIRT_CPU_MANAGER_POLICY](kubevirt/kubevirtci#1289)
[c2e547b fix: gocli startup in s390x architecture](kubevirt/kubevirtci#1354)
[1e31064 cluster-up, kind, common: Enable TopologyManager for kind-sriov](kubevirt/kubevirtci#1347)
[00a9d12 Kubelet drop in](kubevirt/kubevirtci#1299)
[6b7d14b build(deps): bump golang.org/x/net in /cluster-provision/gocli](kubevirt/kubevirtci#1364)

```release-note
NONE
```

Signed-off-by: kubevirt-bot <kubevirtbot@redhat.com>
nirdothan pushed a commit to nirdothan/kubevirt that referenced this pull request Feb 2, 2025
[997cbff k8s-provider: Remove KUBEVIRT_CPU_MANAGER_POLICY](kubevirt/kubevirtci#1289)
[c2e547b fix: gocli startup in s390x architecture](kubevirt/kubevirtci#1354)
[1e31064 cluster-up, kind, common: Enable TopologyManager for kind-sriov](kubevirt/kubevirtci#1347)
[00a9d12 Kubelet drop in](kubevirt/kubevirtci#1299)
[6b7d14b build(deps): bump golang.org/x/net in /cluster-provision/gocli](kubevirt/kubevirtci#1364)

```release-note
NONE
```

Signed-off-by: kubevirt-bot <kubevirtbot@redhat.com>
nirdothan pushed a commit to nirdothan/kubevirt that referenced this pull request Feb 4, 2025
[997cbff k8s-provider: Remove KUBEVIRT_CPU_MANAGER_POLICY](kubevirt/kubevirtci#1289)
[c2e547b fix: gocli startup in s390x architecture](kubevirt/kubevirtci#1354)
[1e31064 cluster-up, kind, common: Enable TopologyManager for kind-sriov](kubevirt/kubevirtci#1347)
[00a9d12 Kubelet drop in](kubevirt/kubevirtci#1299)
[6b7d14b build(deps): bump golang.org/x/net in /cluster-provision/gocli](kubevirt/kubevirtci#1364)

```release-note
NONE
```

Signed-off-by: kubevirt-bot <kubevirtbot@redhat.com>
HarshithaMS005 pushed a commit to HarshithaMS005/kubevirt that referenced this pull request Feb 19, 2025
[997cbff k8s-provider: Remove KUBEVIRT_CPU_MANAGER_POLICY](kubevirt/kubevirtci#1289)
[c2e547b fix: gocli startup in s390x architecture](kubevirt/kubevirtci#1354)
[1e31064 cluster-up, kind, common: Enable TopologyManager for kind-sriov](kubevirt/kubevirtci#1347)
[00a9d12 Kubelet drop in](kubevirt/kubevirtci#1299)
[6b7d14b build(deps): bump golang.org/x/net in /cluster-provision/gocli](kubevirt/kubevirtci#1364)

```release-note
NONE
```

Signed-off-by: kubevirt-bot <kubevirtbot@redhat.com>