OPNET-470: Collect host networking logs #404

Closed
wants to merge 1 commit into from

@mkowalski mkowalski commented Feb 20, 2024

Very often, must-gather is the only bundle the networking team receives when asked to help debug an issue. In the majority of those cases we would rather get a sosreport than a must-gather, because the former contains useful output from the ip command.

This PR adds the most basic networking information to must-gather to speed up debugging.

Contributes-to: OPNET-470
Contributes-to: OCPBUGS-26217
Contributes-to: OCPBUGS-29624

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 20, 2024

openshift-ci-robot commented Feb 20, 2024

@mkowalski: This pull request references OPNET-470 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2024

openshift-ci bot commented Feb 20, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@mkowalski
Author

/test all

@mkowalski
Author

/cc @cybertron

/cc @dgoodwin @stbenjam
This closes the loop on the last issues we were debugging. If we had had this from the very beginning (i.e. inside must-gather and not only on the live system), we would have saved a bit of time.

@mkowalski
Author

/test all

@mkowalski
Author

/test all

@mkowalski mkowalski marked this pull request as ready for review February 21, 2024 11:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 21, 2024

openshift-ci bot commented Feb 21, 2024

@mkowalski: all tests passed!

Contributor

@RickJWagner RickJWagner left a comment

Looks good.

@cybertron
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 1, 2024

openshift-ci bot commented Mar 1, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cybertron, mkowalski
Once this PR has been reviewed and has the lgtm label, please assign rickjwagner for approval. For more information see the Kubernetes Code Review Process.

echo "INFO: Collecting host networking logs"

collect_service_logs --role=master "on-prem-resolv-prepender"
collect_service_logs --role=worker "on-prem-resolv-prepender"

Contributor

Gathering logs from the workers (when there are 500 nodes) might be a problem. Is there any way to limit this?

Author

Are clusters with 500 nodes a real thing? From which perspective would it be a problem: are we talking about how long oc must-gather [...] takes, or about the number of files collected?

Contributor

@mkowalski we need to watch for both: we don't want our collection tool to take too long to run, or to collect gigabytes of data that are impossible to review or process.

collect_service_logs --role=worker "on-prem-resolv-prepender"

collect_service_logs --role=master "nodeip-configuration"
collect_service_logs --role=worker "nodeip-configuration"

Contributor

Gathering logs from the workers (when there are 500 nodes) might be a problem. Is there any way to limit this?

Author

This service should run only once per boot. No matter how many nodes the customer has, unless they reboot once per minute, we don't expect it to produce a big log bundle.

Contributor

Yes, but we still have to reach out to N nodes and copy files off them (from the service). That process takes both time and space, and that is the worry here.

collect_service_logs --role=master "nodeip-configuration"
collect_service_logs --role=worker "nodeip-configuration"

CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"

Contributor

Gathering logs from the workers (when there are 500 nodes) might be a problem. Is there any way to limit this?

Contributor

Can we make this something a user passes in (the list of nodes to collect this data from)?
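
A minimal sketch of the node-list idea raised here, assuming the script accepts node names as positional arguments and falls back to the master nodes otherwise. The names select_nodes and default_nodes are hypothetical and do not come from the PR; default_nodes is a stub standing in for the oc query so the sketch runs without a cluster.

```shell
# Hypothetical stub. In a real gather script this would be something like:
#   oc get node -l node-role.kubernetes.io/master -o name
default_nodes() {
    printf '%s\n' node/master-0 node/master-1 node/master-2
}

# Print one node per line: the caller's arguments if any were given,
# otherwise the default (master) nodes.
select_nodes() {
    if [ "$#" -gt 0 ]; then
        printf '%s\n' "$@"
    else
        default_nodes
    fi
}
```

Called as select_nodes node/worker-7 it prints only that node; called with no arguments it prints the defaults. This is the same fallback behaviour that the "${@:-...}" expansion in the reviewed line aims for, written out explicitly.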

Author

I feel like this would defeat the purpose. Maybe I never defined our problem correctly. What happens today is that people with what they feel is "some networking problem" come to our team with the most basic must-gather and ask for help. At that point we start playing ping-pong, because our first reply is "please bring sosreport from all the nodes". Only based on the sosreport can we say whether the problem is really on us or not at all. If it is, we ask for the next set of logs (NetworkManager with level=TRACE).

With this PR I aimed to at least bring must-gather and sosreport a bit closer, so that a customer coming with a must-gather collected using this PR would already receive, as the next instruction, either "yes, it's networking, please bring sosreport and NM trace logs" or "no, it's not networking".

If we now make it configurable, I do not believe any customer will ever come with a correct must-gather in the first place (as in, they will first give us a must-gather without logs from any worker, which renders it useless).

Contributor

I think we need to either time-box this (you only get N (<5) minutes to collect all of this, or the script exits) or find some way to limit or target this collection at a list of nodes.

IMO, the risk of trying to collect all of this on hundreds of nodes is that it adds too much to the collection and/or takes too much time, so we have to address that.
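
One way the time-box suggestion could be sketched, assuming coreutils timeout is available in the gather image. The helper name run_time_boxed is hypothetical and not part of must-gather; it only illustrates the idea of giving a collection step a wall-clock budget.

```shell
# Run a command with a wall-clock limit, using coreutils `timeout`.
# Usage: run_time_boxed SECONDS COMMAND [ARGS...]
# Returns the command's exit status, or 124 if the limit was reached
# (124 is what `timeout` itself reports on expiry).
run_time_boxed() {
    local limit="$1"
    shift
    local rc=0
    timeout "${limit}" "$@" || rc=$?
    if [ "${rc}" -eq 124 ]; then
        echo "WARNING: time limit of ${limit}s reached, gave up on: $*" >&2
    fi
    return "${rc}"
}
```

A five-minute budget would be run_time_boxed 300 followed by the collection command. Note that timeout can only run executables, so if a step such as collect_service_logs is a shell function rather than a standalone program, it would first need to be moved into its own script or invoked through bash -c.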

Author

As a trade-off, I am okay with limiting this to master nodes only. Is that acceptable to you? It would still put us in a better situation than we are in today, and if after some months we still feel that what support and customers bring us does not make sense, we can revisit this discussion.

collect_service_logs --role=master "nodeip-configuration"
collect_service_logs --role=worker "nodeip-configuration"

CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"

Contributor

Suggested change
- CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
+ CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -o name)}"

@sferich888
Contributor

@mkowalski are you still working on this issue? Have you had a chance to look at my suggestions, and can we limit this in some way?

@ingvagabund this is another example where your comments (in #416 (comment)) about being more diligent in what we collect apply.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2024
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 17, 2025
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Feb 17, 2025

openshift-ci bot commented Feb 17, 2025

@openshift-bot: Closed this PR.

Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
6 participants