OPNET-470: Collect host networking logs #404
@mkowalski: This pull request references OPNET-470 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Skipping CI for Draft Pull Request.
/test all
/cc @cybertron /cc @dgoodwin @stbenjam
Very often, must-gather is the only bundle provided to the networking team when we are asked to help debug an issue. In the majority of those cases we would rather get a sosreport than a must-gather, because the former contains useful output from the `ip` command. With this PR we are trying to get the most basic networking information into must-gather to speed up debugging.
Contributes-to: OPNET-470
Contributes-to: OCPBUGS-26217
Contributes-to: OCPBUGS-29624
@mkowalski: all tests passed!
Looks good.
/lgtm |
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: cybertron, mkowalski
echo "INFO: Collecting host networking logs"

collect_service_logs --role=master "on-prem-resolv-prepender"
collect_service_logs --role=worker "on-prem-resolv-prepender"
Gathering logs from the workers (when there are 500 nodes) might be a problem. Is there any way to limit this?
Are clusters with 500 nodes a real thing? From which perspective may it be a problem: are we talking about the time oc must-gather [...] or about the number of files collected?
@mkowalski we need to watch for both: we don't want our collection tool to take too long to run, or to collect GBs of data that are impossible to review or process.
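To make both concerns measurable, a collection step could be wrapped so that it reports how long it ran and how much data it left behind. The following is only a sketch: `measure_collection` and its output format are hypothetical names introduced here for illustration, not part of must-gather.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of must-gather): run a command, then report
# the elapsed wall-clock time and the size of the directory it wrote into.
measure_collection() {
    local out_dir="$1"
    shift
    local start end rc=0 size_kb
    start="$(date +%s)"
    # Run the wrapped command and remember its exit status.
    "$@" || rc=$?
    end="$(date +%s)"
    # du -sk prints "<kilobytes><TAB><path>"; keep only the number.
    size_kb="$(du -sk "${out_dir}" | cut -f1)"
    echo "elapsed_seconds=$(( end - start )) size_kb=${size_kb}"
    return "${rc}"
}
```

Calling it as `measure_collection "${out_dir}" some-collection-script` on a large test cluster would give concrete numbers for both time and size to discuss.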
collect_service_logs --role=worker "on-prem-resolv-prepender"

collect_service_logs --role=master "nodeip-configuration"
collect_service_logs --role=worker "nodeip-configuration"
Gathering logs from the workers (when there are 500 nodes) might be a problem. Is there any way to limit this?
This service should run only once per boot. No matter how many nodes the customer has, unless they reboot once per minute, we don't expect it to be a big log bundle.
Yes, but we still have to reach out to N nodes and copy files off of them (from the service). That process takes both time and space, and that is the worry here.
collect_service_logs --role=master "nodeip-configuration"
collect_service_logs --role=worker "nodeip-configuration"

CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
Gathering logs from the workers (when there are 500 nodes) might be a problem. Is there any way to limit this?
Can we make this something the user passes in (the list of nodes to collect this data from)?
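For reference, the `${@:-...}` expansion in the `CLUSTER_NODES` line under review already has this shape: any arguments given to the script win, and the master-node query is only a fallback. A minimal sketch of that behavior follows. `list_master_nodes` and `select_nodes` are hypothetical names, and the stub replaces the real `oc get node` call so the example runs without a cluster. The sketch uses `$*` rather than `$@` because the result is assigned to a single string.

```bash
#!/usr/bin/env bash
# Stand-in for `oc get node -l node-role.kubernetes.io/master -o name`;
# the node names below are made up for illustration.
list_master_nodes() {
    printf '%s\n' "node/master-0" "node/master-1" "node/master-2"
}

# Print the nodes to collect from: the arguments if any were given,
# otherwise the default list of master nodes.
select_nodes() {
    # ${*:-word} expands to the arguments joined by spaces, or to `word`
    # when no (non-empty) arguments were passed.
    printf '%s\n' "${*:-$(list_master_nodes)}"
}

# Example: an explicit list overrides the default.
select_nodes node/worker-7 node/worker-9
# Example: with no arguments, the master nodes are used.
select_nodes
```

A loop such as `for node in $(select_nodes "$@"); do ...; done` would then iterate over either list, since word splitting handles both the space-separated and the newline-separated form.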
I feel like this would defeat the purpose. Maybe I never defined our problem correctly: what happens today is that people with "some networking problem" (as they feel) come to our team with the most basic must-gather and ask for help. At this point we start playing ping-pong, because our first reply is "please bring sosreport from all the nodes". Only based on the sosreport can we say whether the problem is really on us or not at all. If it is, we ask for the next set of logs (NetworkManager with level=TRACE).
With this PR I aimed to at least bring must-gather and sosreport a bit closer, so that a customer coming with a must-gather collected using this PR would already receive, as the next instruction, either "yes, it's networking, please bring sosreport and NM trace logs" or "no, it's not networking".
If we now make it configurable, I do not believe any customer will ever come with a correct must-gather in the first place (as in, they will first give us a must-gather without logs from any worker, which renders it useless).
I think we need to time-box this (you only get N (<5) minutes to collect all of this, or the script exits), or find some way to limit or target this collection at a list of nodes.
IMO the exposure risk of trying to collect all of this on hundreds of nodes is that it adds too much to the collection and/or takes too much time, so we have to address that.
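One way to express such a time box in a shell script is the `timeout` utility. The sketch below assumes GNU coreutils `timeout` is available in the image. `run_time_boxed` and `COLLECTION_TIMEOUT` are hypothetical names, and the 240-second default is only an illustrative value under the "N (<5) minutes" budget mentioned above.

```bash
#!/usr/bin/env bash
# Hypothetical default budget in seconds (an assumption, not from the PR).
COLLECTION_TIMEOUT="${COLLECTION_TIMEOUT:-240}"

# Run a command, stopping it if it exceeds the given number of seconds.
run_time_boxed() {
    local limit="$1"
    shift
    local rc=0
    # -k 10: send SIGKILL if the command is still alive 10s after SIGTERM.
    timeout -k 10 "${limit}" "$@" || rc=$?
    # GNU timeout exits with status 124 when the time limit was reached.
    if [ "${rc}" -eq 124 ]; then
        echo "WARNING: '$*' exceeded ${limit}s and was stopped" >&2
    fi
    return "${rc}"
}
```

Note that `timeout` runs external commands only. If a collection step is a shell function rather than a separate script, it would have to be invoked through a script or `bash -c` before it could be time-boxed this way.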
As a trade-off I am okay with limiting this to master nodes only; is this acceptable for you? It would still put us in a better situation than we are in today, and if after some months we still feel that what support and customers bring us does not make sense, we can revisit this discussion.
collect_service_logs --role=master "nodeip-configuration"
collect_service_logs --role=worker "nodeip-configuration"

CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
Suggested change:
- CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
+ CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -o name)}"
@mkowalski are you still working on this issue? Have you had a chance to look at my suggestions/recommendations, and are we able to limit this in some way? @ingvagabund this is another example where your comments (in #416 (comment)) about being more diligent in what we collect/pull in apply.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close |
@openshift-bot: Closed this PR.