cassandra-stress gets stuck when a node is restarted while the stress is already running #30
There are two slightly different failure modes:
In the above, the c-s output ends soon after the "Restarting second node ..." log message; there is no further output and the wait times out.
In this case, the c-s seems to actually complete (the last output comes approx. 2m after the c-s test started), and it even prints output after the wait for it to finish has started. But it then still gets stuck and the wait times out. Also, the c-s doesn't print the stats at the end, which it normally does.
cc @fruch
cc @dkropachev
@emdroid, thanks for the reproducer, we will take a look at it.
Note that it might also have something to do with this: scylladb/scylladb#19645 (comment)
Are the branches mentioned here scylla or dtest branches?
The branches are scylla; for dtest I used "next" with the scylla master and "branch-2024.1" with the scylla enterprise 2024.1.
@emdroid Can you try running it again with Cassandra-Stress 3.15? Support in it got merged 2 days ago and I'm not sure if you ran it with the latest.
I tried to run the dtest after I built the c-s in docker and tagged the image as 3.15 locally, but I'm not sure whether it is actually being used (via docker). When I run the dtest locally (I'm using the "dist-unified-dev" build of scylla and "--scylla-version=local_tarball" when running the dtest), I can see the c-s processes directly on the machine, i.e. it doesn't seem to be run via docker there. For example:
I know that the above is just the runner script, but I also have a java process running on the machine which seems to be the actual c-s process:
So I'm not really sure how to run the dtest so that it picks up the docker image (and I also don't know where the unified dist tarball eventually gets the c-s from).
What branch of dtest is it?
Here is the trace when c-s gets stuck in dtest:
As you can see, the main loop is stuck at
Which means at least one of the worker threads is stuck.
Which means that thread is stuck waiting
Thanks @dkropachev for following up on this. Yes, I was eventually able to debug this as well and found the same place. In particular, one of the read threads gets stuck after this message:
The current op completes, but the next one gets stuck. Curiously, this only happens when there are 2 Consumer read threads: if I change the thread count to just 1, the problem cannot be reproduced.
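As a rough illustration of the hang pattern discussed above (this is not the actual cassandra-stress code, which is Java), the Python sketch below shows how a main loop that joins its workers stays blocked when a single worker waits with no timeout. The names `run_workers`, `stuck_worker` and `consumer-N` are made up for the example.

```python
import threading


def run_workers(num_workers, stuck_worker=None, join_timeout=0.5):
    """Start workers and return the names of those still alive after joining.

    stuck_worker is a hypothetical knob: the worker with that index blocks on
    an event that is never set, standing in for a wait that never completes.
    """
    never_set = threading.Event()

    def worker(index):
        if index == stuck_worker:
            # Wait with no timeout: this thread never returns.
            never_set.wait()
        # A healthy worker finishes its "operation" and returns immediately.

    threads = [
        threading.Thread(target=worker, args=(i,), name=f"consumer-{i}", daemon=True)
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    # The "main loop": without a timeout here, one stuck worker blocks it forever.
    for t in threads:
        t.join(timeout=join_timeout)
    return [t.name for t in threads if t.is_alive()]
```

Calling `run_workers(2, stuck_worker=1)` reports the one thread that never finished, which is the kind of information a thread dump of the stuck process gives.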
This is the behavior observed in scylladb/scylladb#16002 and also the likely cause of scylladb/scylladb#16219 (although that one hasn't been reproduced recently).
In particular, when a node is restarted while c-s is running, it works if the node restart is initiated immediately (i.e. before the actual stress test begins to run; there is a 2s delay at the beginning of the c-s).
But if the node restart is initiated only after the c-s is already running (after the 2s delay), the c-s often gets stuck (not every time, but fairly often).
I prepared a simple reproducer in https://github.com/scylladb/scylla-dtest/pull/5261. It fails 1-3 times out of 10 on the older branches (2024.1 enterprise, scylla 5.4). With the current master the failure rate is much lower (sometimes just 1 in 25), probably because of some fixes that have already been made, but I am still seeing the error sometimes.
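The timing of the scenario can be sketched roughly as follows. This is not the code from the dtest PR; `run_and_restart`, `restart_node`, `restart_after` and `wait_timeout` are hypothetical names, and the 3s default is only an assumption chosen to be longer than the 2s startup delay mentioned above.

```python
import subprocess
import time


def run_and_restart(stress_cmd, restart_node, restart_after=3.0, wait_timeout=180.0):
    """Start stress_cmd, restart a node while it runs, then wait with a timeout.

    Returns the process exit code, or None if the process was still running
    when wait_timeout expired (i.e. it looks stuck).
    restart_node is a caller-supplied callable (hypothetical here).
    """
    proc = subprocess.Popen(stress_cmd)
    try:
        # Sleep past the initial delay so the restart hits a running test.
        time.sleep(restart_after)
        restart_node()
        try:
            return proc.wait(timeout=wait_timeout)
        except subprocess.TimeoutExpired:
            return None
    finally:
        # Do not leave a hung process behind.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
```

A `None` result corresponds to the "wait times out" symptom; repeating the call in a loop and counting `None` results gives a rough failure rate.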