task/internal/syslog: Add capability to ignore kernel failures #1666
base: main
run.Raw('|'),
'egrep', '-v', '\\btcmu-runner\\b.*\\bINFO\\b',
run.Raw('|'),
'head', '-n', '1',
Which of these need to be moved to the qa suite? Please post a ceph PR. It needs to be backported to octopus/pacific before this can be merged.
I think most of this is stale. We will have to run all relevant suites and find out.
We could do that but I'm not sure it is worth the effort. If the objective is to make the exclude list configurable, is there a problem with leaving these in (meaning that these would always be on the exclude list)?
@batrick What's the best filter to exercise only kernel code? Which of the following covers all?
1. `-s fs --filter mount/kclient` This listed more than 600 jobs
2. `-s fs --filter kernel` This listed around 54 jobs
> We could do that but I'm not sure it is worth the effort. If the objective is to make the exclude list configurable, is there a problem with leaving these in (meaning that these would always be on the exclude list)?

In the past, this has been confusing to newcomers. There's all sorts of magic defaults in teuthology (e.g. the log ignorelist). Best to move these to the ceph.git qa/ when possible.
> @batrick What's the best filter to exercise only kernel code? Which of the following covers all?
> 1. `-s fs --filter mount/kclient` This listed more than 600 jobs

This will have the best coverage. Add `--subset x/16`.
Since this started out as "begin grepping kernel logs for kclient warnings/failures", is there a suspicion that the existing greps don't work (or perhaps work but not on all distros)? Have you attempted to trigger a WARNING or BUG and observe the test failing?
Force-pushed from adab615 to 958c1ed.
Honestly, we didn't know about this syslog grep at the start. I was concerned there was no detection of kernel faults at all since I would occasionally see warnings that should fail tests in a syslog. So I guess we're both concerned that this grep wasn't catching genuine faults/warnings, that it's not easily configurable, and that it's not easily discovered.
I remember it breaking a few years ago -- something to do with traditional syslog vs systemd journal interaction but it was fixed then. It might be the case that it broke again, which is why I asked. The issue is not with the grep invocation but rather with what is being grepped (i.e. does that kern.log file actually contain any kernel log messages at the time grep is invoked, etc). Removing outdated exclude items is obviously lower priority than resurrecting the core functionality in case it is broken.
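One cheap way to tell "grep found nothing" apart from "there was nothing to grep" would be a sanity check on the log file itself. The following is only a sketch: `log_has_content` is a hypothetical helper, not part of teuthology, and the temporary files stand in for a real kern.log.

```python
import os
import tempfile

def log_has_content(path: str) -> bool:
    """Return True if the file exists and holds at least one non-blank line."""
    if not os.path.isfile(path):
        return False
    with open(path, "r", errors="replace") as f:
        return any(line.strip() for line in f)

# Demo with temporary files standing in for kern.log (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    empty_log = os.path.join(tmp, "empty.log")
    full_log = os.path.join(tmp, "kern.log")
    open(empty_log, "w").close()
    with open(full_log, "w") as f:
        f.write("kernel: libceph: mon0 session established\n")

    empty_ok = log_has_content(empty_log)                      # empty file
    full_ok = log_has_content(full_log)                        # has a message
    missing_ok = log_has_content(os.path.join(tmp, "absent"))  # no such file
```

If a check like this reported an empty or missing log on some distro, a clean grep result there would mean nothing.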
Force-pushed from 958c1ed to baa60a8.
@kotreshhr what's the status on this PR?
@batrick I think it's good to be merged. Here is the run triggered and Jeff's comments on the same. Sorry, I should have discussed here. It was in the cephfs-team mailing list. Copying it here for reference.
Many thanks for doing this. Better testing with teuthology is definitely
The kernel is built from commit 7c1f4b5e3842, so this is missing some of
I suspect this is fixed with a recent patch from Xiubo (1a45b99820e7).
Doesn't look familiar. It looks like the client is just waiting on an
Also doesn't look familiar. The task is hung waiting for dirty pages to
Looks unrelated to ceph at all. This one happened at boot time. Probably
Not much to go on here. There are no stack traces, AFAICT and it's
Hung (briefly) while waiting on ceph_fsync. It looks like this one might
otherwise LGTM; please post a teuthology run with the ceph PR requested for qa/cephfs/begin.yaml.
'egrep', '--binary-files=text',
'\\bBUG\\b|\\bINFO\\b|\\bDEADLOCK\\b',
'\\bBUG\\b|\\bINFO\\b|\\bDEADLOCK\\b|\\bOops\\b|\\bWARNING\\b|\\bKASAN\\b',
@jtlayton are we missing anything here or LGTY?
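To make it easier to reason about what the longer pattern in the diff above would flag, here is a rough Python translation of it using the `re` module. It is a sketch, not the teuthology code: the sample log lines are made up, and Python's `\b` is assumed to behave like egrep's for these ASCII keywords.

```python
import re

# Keywords taken from the updated egrep pattern in the diff.
KERNEL_FAULT_RE = re.compile(
    r"\bBUG\b|\bINFO\b|\bDEADLOCK\b|\bOops\b|\bWARNING\b|\bKASAN\b"
)

def is_kernel_fault(line: str) -> bool:
    """Return True if the line contains one of the fault keywords as a whole word."""
    return KERNEL_FAULT_RE.search(line) is not None

# Hypothetical kernel log lines, for illustration only.
samples = [
    "kernel: WARNING: CPU: 1 PID: 42 at fs/ceph/caps.c:100 foo+0x1/0x2",
    "kernel: BUG: KASAN: use-after-free in bar+0x10/0x20",
    "kernel: libceph: mon0 session established",
    "kernel: DEBUGINFO enabled",  # INFO is not a whole word here
]
flagged = [s for s in samples if is_kernel_fault(s)]
```

Because of the word boundaries, only the first two samples are flagged. The match is case-sensitive, so a lowercase "warning" would slip through.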
Force-pushed from baa60a8 to 6f99b63.
I killed this, @kotreshhr. It has 1000+ jobs. Please use a subset.
@batrick I am not sure I understand
Here is what I usually run and it gets about 300 jobs:
Thanks @batrick, here is the teuthology run:
There's some ... weird failures in that run. Please do
Rerun QA link:
Good to merge IMO from CephFS side. @idryomov do you want to have another look? rbd run needed?
I'd like to see a confirmation that the whole grep invocation is actually working. Could you please build a kernel that would emit a custom WARNING or ERROR on e.g. CephFS mount and run this against that kernel (and make sure that all supported distros are covered by that run because there used to be annoying differences in how dmesg was persisted between distros). And then another run with that custom log message added to ignorelist, again on all distros. Or maybe you already did something like that?
Would it be crazy to add a patch to upstream kernel where a mount option triggers a warning/error so we can continuously verify this works?
Definitely crazy IMO. But we could find a way to trigger something with one of the "bad" keywords in it on vanilla kernels. For example, take all OSDs out against some in-flight I/O -- that should trigger "INFO: task ... blocked for more than %ld seconds".
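The two-run validation discussed here (first confirm the fault is detected, then confirm an ignorelist entry suppresses it) can be modelled offline before burning test-cluster time. This is a minimal sketch under assumptions: `surviving_faults` is a hypothetical helper, the log lines are invented, and the hung-task line only follows the message format quoted above.

```python
import re

FAULT_RE = re.compile(
    r"\bBUG\b|\bINFO\b|\bDEADLOCK\b|\bOops\b|\bWARNING\b|\bKASAN\b"
)

def surviving_faults(lines, ignorelist=()):
    """Return fault lines that are not matched by any ignorelist regex."""
    ignore = [re.compile(p) for p in ignorelist]
    return [
        line for line in lines
        if FAULT_RE.search(line) and not any(r.search(line) for r in ignore)
    ]

# Hypothetical log excerpt after taking all OSDs out during in-flight I/O.
log = [
    "kernel: libceph: osd3 down",
    "kernel: INFO: task dd:4242 blocked for more than 120 seconds.",
]

# Run 1: no ignorelist, the hung-task message should fail the job.
run1 = surviving_faults(log)
# Run 2: the same message is on the ignorelist, so nothing should be left.
run2 = surviving_faults(log, ignorelist=[r"blocked for more than [0-9]+ seconds"])
```

The ignorelist entry uses `[0-9]+` rather than `\d+` so the same expression would stay valid if passed to `egrep -v`.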
... or we could carry the kind of patch you envision in our testing branch.
I wouldn't use a mount option. This sounds more like a job for debugfs or maybe consider the fault injection framework documented here: https://www.kernel.org/doc/html/latest/fault-injection/fault-injection.html
Hmm, I think we can just use the network namespaces to network partition the kernel mount (
Force-pushed from 6f99b63 to 4464035.
I brought this up in the last cephfs standup while discussing https://tracker.ceph.com/issues/64471 with @batrick and recalled this change. So, this was close to getting merged, but @idryomov suggested a validation before merging. Can we agree on a way forward to validate this?
To start, @kotreshhr needs to rebase. Then I suggest trying Ilya's suggestion of stopping OSDs while some application is writing a large file. This shouldn't be hard to test...
Adds capability to ignore kernel failures to jobs. This is done by adding 'syslog' dict to config dictionary which holds the 'ignorelist' of kernel failures. Also removes old kernel failures from exclude list. Fixes: https://tracker.ceph.com/issues/50150 Signed-off-by: Kotresh HR <[email protected]>
Force-pushed from 4464035 to 4c0bd70.
@batrick Does that mean I should write a teuthology test that kills OSDs while writing a large file?
As discussed during standup - that's a good start to validate the filter.
Adds capability to ignore kernel failures to jobs. This is done by
adding 'syslog' dict to config dictionary which holds the 'ignorelist'
of kernel failures.
Also removes old kernel failures from exclude list.
Fixes: https://tracker.ceph.com/issues/50150
Signed-off-by: Kotresh HR [email protected]
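
As an illustration of the description above, a configurable ignorelist could be expanded into one `egrep -v` stage per entry, mirroring the hard-coded stages in the diff. This is a sketch, not the code in this PR: `build_exclude_args` is a hypothetical name, the plain `'|'` string stands in for `run.Raw('|')` so the example runs without teuthology, and the ignorelist entry is invented.

```python
def build_exclude_args(config):
    """Expand config['syslog']['ignorelist'] into egrep -v pipeline stages.

    Returns a flat argument list; '|' stands in for run.Raw('|').
    A missing 'syslog' dict or 'ignorelist' key yields no stages.
    """
    ignorelist = config.get('syslog', {}).get('ignorelist', [])
    args = []
    for pattern in ignorelist:
        args.extend(['|', 'egrep', '-v', pattern])
    return args

# Hypothetical job config with a single ignorelist entry.
job_config = {
    'syslog': {
        'ignorelist': ['blocked for more than [0-9]+ seconds'],
    },
}
stages = build_exclude_args(job_config)
no_stages = build_exclude_args({})
```

Defaulting to an empty list means jobs that do not set the key would get no exclusions at all.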