raft errors aren't masked sufficiently for disrupt_decommission_streaming_err #9490

Open
lersek opened this issue Dec 5, 2024 · 1 comment
lersek commented Dec 5, 2024

Report against SCT as of commit 08636e9.

Context: longevity-large-partition-200k-pks-4days-gce-test/6 | argus.

disrupt_decommission_streaming_err masks some raft errors through ignore_raft_topology_cmd_failing, for the purpose of invoking start_and_interrupt_decommission_streaming.

In the test run noted above, start_and_interrupt_decommission_streaming rebooted DB Node 1. This caused DB Node 7, which was at the time in a drain RPC with Node 1, to log:

raft_topology - tablets draining failed with std::runtime_error (raft topology: exec_global_command(barrier) failed with seastar::rpc::closed_error (connection is closed)). Aborting the topology operation

(See the more complete log snippet here.)

Because ignore_raft_topology_cmd_failing did not mask this particular raft error, SCT considered the whole test a failure.

I think ignore_raft_topology_cmd_failing should downgrade the above-noted error, too, to a warning.
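A rough sketch of the gap, using only the log line quoted above. The `CANDIDATE` regex is a hypothetical suggestion modeled on the existing "drain rpc failed" pattern, not something that exists in SCT today, and `is_masked` is an illustrative helper that assumes patterns are applied with `re.search`; SCT's actual matching code may differ.

```python
import re

# Log line quoted in this report (DB Node 7, after Node 1 was rebooted).
LOG_LINE = (
    "raft_topology - tablets draining failed with std::runtime_error "
    "(raft topology: exec_global_command(barrier) failed with "
    "seastar::rpc::closed_error (connection is closed)). "
    "Aborting the topology operation"
)

# Pattern already present in ignore_raft_topology_cmd_failing (see note 2 below).
EXISTING = re.compile(
    r".*raft_topology - drain rpc failed, proceed to fence old writes.*connection is closed"
)

# Hypothetical new pattern (an assumption, not taken from SCT).
CANDIDATE = re.compile(
    r".*raft_topology - tablets draining failed.*connection is closed"
)


def is_masked(line, patterns):
    """Illustrative helper: True if any pattern is found in the line."""
    return any(p.search(line) for p in patterns)


print(is_masked(LOG_LINE, [EXISTING]))             # False: the error is not masked today
print(is_masked(LOG_LINE, [EXISTING, CANDIDATE]))  # True: masked once the candidate is added
```

Whether the new pattern should also require the `exec_global_command(barrier)` fragment is a judgment call; the sketch keeps it as loose as the existing pattern.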

Notes:

  1. ignore_raft_topology_cmd_failing already masks a related error (drain rpc failed, proceed to fence old writes ... connection is closed); however, the specific error above doesn't seem to be masked.

  2. Independently, the same drain rpc failed, proceed to fence old writes ... connection is closed error seems to be masked twice (redundantly). Commit 03eb8b0 and commit 8b9a75f added the following regexes, respectively:

    .*raft_topology - drain rpc failed, proceed to fence old writes:.*connection is closed
    .*raft_topology - drain rpc failed, proceed to fence old writes.*connection is closed
    

    Note the single-character difference: the first pattern contains a colon (:). That makes the first pattern redundant, because the second pattern matches a superset of what the first pattern matches.

    Arguably, the first pattern should be removed in a follow-up patch to the one that adds the new log pattern (tablets draining failed...).
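The redundancy in note 2 can be checked with a short sketch. The two regexes are copied from the commits above; the sample lines are made up for illustration (the real messages are elided with "..." in this report), and the use of `re.search` is an assumption about how the patterns are applied.

```python
import re

# Regex added by commit 03eb8b0 (with colon).
WITH_COLON = re.compile(
    r".*raft_topology - drain rpc failed, proceed to fence old writes:.*connection is closed"
)
# Regex added by commit 8b9a75f (without colon).
WITHOUT_COLON = re.compile(
    r".*raft_topology - drain rpc failed, proceed to fence old writes.*connection is closed"
)

# Invented sample lines, for illustration only.
samples = [
    # Colon after "old writes": both patterns match.
    "raft_topology - drain rpc failed, proceed to fence old writes: (connection is closed)",
    # No colon: only the colon-less pattern matches.
    "raft_topology - drain rpc failed, proceed to fence old writes (connection is closed)",
    # Different message: neither pattern matches.
    "raft_topology - tablets draining failed (connection is closed)",
]

# One (with_colon, without_colon) pair per sample line.
results = [
    (bool(WITH_COLON.search(s)), bool(WITHOUT_COLON.search(s))) for s in samples
]
print(results)  # [(True, True), (False, True), (False, False)]
```

In the colon-less pattern, the `.*` that follows "old writes" can absorb the colon, so no line matches the first pattern without also matching the second.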
