OPNET-470: Collect host networking logs #404

mkowalski · 2024-02-20T15:07:22Z

Very often, must-gather is the only bundle provided to the networking team when we are asked to help debug an issue. In the majority of those cases we would rather get a sosreport than a must-gather, because the former contains useful output from the ip command.

With this PR we are trying to get the most basic networking information into must-gather to speed up debugging.

Contributes-to: OPNET-470
Contributes-to: OCPBUGS-26217
Contributes-to: OCPBUGS-29624

openshift-ci-robot · 2024-02-20T15:07:26Z

@mkowalski: This pull request references OPNET-470 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

openshift-ci · 2024-02-20T15:07:34Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

mkowalski · 2024-02-20T15:07:37Z

/test all

mkowalski · 2024-02-20T15:14:56Z

/cc @cybertron

/cc @dgoodwin @stbenjam
This is a closure on the last issues we were debugging. If we had had this from the very beginning (i.e. inside must-gather and not only on the live system), we would have saved a bit of time.

mkowalski · 2024-02-20T15:15:19Z

/test all

mkowalski · 2024-02-21T08:57:55Z

/test all

openshift-ci · 2024-02-21T13:10:28Z

@mkowalski: all tests passed!

RickJWagner

Looks good.

cybertron · 2024-03-01T21:45:34Z

/lgtm

openshift-ci · 2024-03-01T21:46:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cybertron, mkowalski
Once this PR has been reviewed and has the lgtm label, please assign rickjwagner for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

collection-scripts/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sferich888 · 2024-04-01T14:00:19Z

collection-scripts/gather_host_network_logs

+echo "INFO: Collecting host networking logs"
+
+collect_service_logs --role=master "on-prem-resolv-prepender"
+collect_service_logs --role=worker "on-prem-resolv-prepender"


Gathering logs from the workers (when there are 500 nodes) might be a problem. Is there any way to limit this?

mkowalski · 2024-04-05

Are clusters with 500 nodes a real thing? From which perspective might it be a problem: are we talking about the time oc must-gather [...] takes, or about the number of files collected?

sferich888 · 2024-06-28

@mkowalski we need to watch for both: we don't want our collection tool to take too long to run, or to collect gigabytes of data that are impossible to review or process.
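To make the two concerns above (run time and bundle size) measurable, a collection step could be wrapped in a small helper that reports both. This is only a sketch: `report_cost` is a hypothetical name and is not part of this PR, and the output directory argument is whatever directory the step writes into.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR): run a collection command in the
# current shell and report elapsed wall-clock time and output directory size.
report_cost() {
    local out_dir="$1"; shift
    local start="${SECONDS}"
    # Run the wrapped command; shell functions work because nothing is exec'ed.
    "$@"
    local rc=$?
    local size_kb
    # du -sk prints "<kilobytes><TAB><path>"; keep only the number.
    size_kb="$(du -sk "${out_dir}" | cut -f1)"
    echo "elapsed=$((SECONDS - start))s size_kb=${size_kb}"
    # Preserve the exit status of the wrapped command.
    return "${rc}"
}
```

Because the wrapped command runs in the current shell, a shell function such as the `collect_service_logs` call from the diff could be passed as the command, with the collection directory as the first argument.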

sferich888 · 2024-04-01T14:00:28Z

collection-scripts/gather_host_network_logs

+collect_service_logs --role=worker "on-prem-resolv-prepender"
+
+collect_service_logs --role=master "nodeip-configuration"
+collect_service_logs --role=worker "nodeip-configuration"


Gathering logs from the workers (when there are 500 nodes) might be a problem. Is there any way to limit this?

mkowalski · 2024-04-05

This service should run only once per boot. No matter how many nodes the customer has, unless they reboot once per minute, we don't expect it to be a big log bundle.

sferich888 · 2024-06-28

Yes, but we still have to reach out to N nodes and copy files off them (from the service). That process takes both time and space, and that is the worry here.

sferich888 · 2024-04-01T14:00:35Z

collection-scripts/gather_host_network_logs

+collect_service_logs --role=master "nodeip-configuration"
+collect_service_logs --role=worker "nodeip-configuration"
+
+CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"


Gathering logs from the workers (when there are 500 nodes) might be a problem. Is there any way to limit this?

sferich888 · 2024-04-01

Can we make this something the user passes in (the list of nodes to collect this data from)?

mkowalski · 2024-04-05

I feel like this would defeat the purpose. Maybe I never defined our problem correctly. What happens today is that people with "some networking problem" (as they see it) come to our team with the most basic must-gather and ask for help. At that point we start playing ping-pong, because our first reply is "please bring sosreport from all the nodes". Only based on the sosreport can we say whether the problem is really on us or not at all. If it is, we ask for the next set of logs (NetworkManager with level=TRACE).

With this PR I aimed to at least bring must-gather and sosreport a bit closer, so that a customer coming with a must-gather collected using this PR would already receive as the next instruction either "yes, it's networking, please bring sosreport and NM trace logs" or "no, it's not networking".

If we now make it configurable, I do not believe any customer will ever come with a correct must-gather in the first place (that is, they will first give us a must-gather without logs from any worker, which renders it useless).

sferich888 · 2024-06-28

I think we need to time-box this (you only get N (<5) minutes to collect all of this, or the script exits), or find some way to limit or target this collection at a list of nodes.

IMO the exposure risk of trying to collect all of this on hundreds of nodes is that it adds too much to the collection and/or takes too much time, so we have to address that.

mkowalski · 2024-07-01

As a trade-off I am okay with limiting this to master nodes only. Is this acceptable for you? It would still put us in a better situation than we are in today, and if after some months we still feel that what support and customers bring us does not make sense, we can revisit this discussion.
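The time-box idea raised in this thread could look roughly like the sketch below. It assumes GNU coreutils `timeout` is available in the must-gather image; `run_time_boxed` is a hypothetical name and nothing here is part of this PR.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of this PR): run a command under a wall-clock
# budget using GNU coreutils `timeout`, which exits with status 124 on expiry.
run_time_boxed() {
    local budget="$1"; shift
    timeout "${budget}" "$@"
    local rc=$?
    if [ "${rc}" -eq 124 ]; then
        # The command was killed; whatever it wrote so far is partial data.
        echo "WARNING: '$*' exceeded ${budget}; collected data may be partial" >&2
    fi
    return "${rc}"
}

# Note: `timeout` can only start executables, not shell functions, so a
# function such as collect_service_logs would have to be invoked through a
# script file, for example:
#   run_time_boxed 5m ./gather_host_network_logs
```

A budget such as `5m` would match the "N (<5) minutes" figure mentioned above. The trade-off is that a timed-out run leaves an incomplete bundle, so the warning matters for whoever reads the output later.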

sferich888 · 2024-04-01T14:00:56Z

collection-scripts/gather_host_network_logs

+collect_service_logs --role=master "nodeip-configuration"
+collect_service_logs --role=worker "nodeip-configuration"
+
+CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"


Suggested change

-CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
+CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -o name)}"
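For readers unfamiliar with the `${...:-...}` expansion on this line: it uses the script's positional arguments when any are given and otherwise falls back to the command substitution, so the script already accepts an explicit node list. The sketch below illustrates that behavior under stated assumptions: `default_nodes` is a hypothetical stand-in for the `oc get node` call so it runs without a cluster, `pick_nodes` is an illustrative name, and it uses `$*` rather than `$@` to make the joining of arguments into one string explicit.

```shell
#!/bin/bash
# Hypothetical stand-in for
#   oc get node -l node-role.kubernetes.io/master -o name
# so that the sketch runs without a cluster.
default_nodes() {
    printf '%s\n' "node/master-0" "node/master-1"
}

# Illustrative function: arguments win, otherwise use the default list.
pick_nodes() {
    # "${*:-word}" expands to the arguments joined by spaces when any are
    # given, and to word (here a command substitution) when there are none.
    local nodes="${*:-$(default_nodes)}"
    echo "${nodes}"
}

# With explicit nodes the fallback command is never executed:
pick_nodes node/worker-3 node/worker-7
# Without arguments the default list is used:
pick_nodes
```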

sferich888 · 2024-09-18T16:42:04Z

@mkowalski are you still working on this issue? Have you had a chance to look at my suggestions/recommendations, and are we able to limit this in some way?

@ingvagabund this is another example where your comments (in #416 (comment)) about being more diligent in what we collect/pull in apply.

openshift-bot · 2024-12-18T01:00:22Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2025-01-17T08:30:15Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2025-02-17T00:00:29Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2025-02-17T00:01:33Z

@openshift-bot: Closed this PR.

openshift-ci-robot added the jira/valid-reference label Feb 20, 2024

openshift-ci bot added the do-not-merge/work-in-progress label Feb 20, 2024

mkowalski force-pushed the OPNET-470 branch from 0a5f54e to 9425026 February 20, 2024 15:14

openshift-ci bot requested review from cybertron, dgoodwin and stbenjam February 20, 2024 15:15

mkowalski force-pushed the OPNET-470 branch from 9425026 to 29c88b0 February 21, 2024 08:54

mkowalski marked this pull request as ready for review February 21, 2024 11:18

openshift-ci bot removed the do-not-merge/work-in-progress label Feb 21, 2024

openshift-ci bot requested review from RickJWagner and sferich888 February 21, 2024 11:19

mkowalski force-pushed the OPNET-470 branch from 29c88b0 to 237430e February 21, 2024 11:21

RickJWagner reviewed Feb 22, 2024

openshift-ci bot assigned cybertron Mar 1, 2024

openshift-ci bot added the lgtm label Mar 1, 2024

sferich888 suggested changes Apr 1, 2024

openshift-ci bot added the lifecycle/stale label Dec 18, 2024

openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 17, 2025

openshift-ci bot closed this Feb 17, 2025
