
cassandra-stress gets stuck when a node is restarted while the stress is already running #30

Open
emdroid opened this issue Nov 5, 2024 · 13 comments

emdroid commented Nov 5, 2024

This is the behavior observed in scylladb/scylladb#16002 and also likely the reason for scylladb/scylladb#16219 (although that one hasn't been reproduced recently).

In particular, when a node is restarted while c-s is running, things work if the restart is initiated immediately (i.e. before the actual stress test begins to run; there is a 2s delay at the beginning of the c-s run).

But if the node restart is initiated only after the c-s is already running (after the 2s delay), the c-s often gets stuck (not every time, but fairly often).

I prepared a simple reproducer in https://github.com/scylladb/scylla-dtest/pull/5261 - it fails 1-3 times out of 10 on the older branches (2024.1 enterprise, scylla 5.4). With the current master the failure rate is much lower (sometimes just 1 of 25), probably because of some fixes that have already been made, but the error still appears sometimes.


emdroid commented Nov 5, 2024

There are two slightly different failure modes:

  1. The c-s gets stuck immediately when the node is shut down:
17:45:52,271 tools.stress    DEBUG stress.py      :160  | c-s: Sleeping 2s...
17:45:54,275 tools.stress    DEBUG stress.py      :160  | c-s: Running READ with 2 threads 2 minutes
17:45:54,344 tools.stress    DEBUG stress.py      :160  | c-s: ===== Using optimized driver!!! =====
17:45:54,717 tools.stress    DEBUG stress.py      :160  | c-s: Connected to cluster: test, max pending requests per connection null, max connections per host 8
17:45:54,718 tools.stress    DEBUG stress.py      :160  | c-s: Datatacenter: datacenter1; Host: /127.0.93.1; Rack: rack1
17:45:54,718 tools.stress    DEBUG stress.py      :160  | c-s: Datatacenter: datacenter1; Host: /127.0.93.2; Rack: rack1
17:45:54,718 tools.stress    DEBUG stress.py      :160  | c-s: Datatacenter: datacenter1; Host: /127.0.93.3; Rack: rack1
17:45:54,855 tools.stress    DEBUG stress.py      :160  | c-s: type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
17:45:55,027 tools.stress    DEBUG stress.py      :160  | c-s: total,           279,     279,     279,     279,     0.8,     0.8,     1.1,     1.7,     4.2,     4.2,    1.0,  0.00000,      0,      0,       0,       0,       0,       0
17:45:56,014 tools.stress    DEBUG stress.py      :160  | c-s: total,          4540,    4261,    4261,    4261,     0.5,     0.4,     0.8,     1.3,     3.0,     4.0,    2.0,  0.62018,      0,      0,       0,       0,       0,       0
17:45:57,008 tools.stress    DEBUG stress.py      :160  | c-s: total,         10293,    5753,    5753,    5753,     0.3,     0.3,     0.6,     0.9,     1.8,     4.2,    3.0,  0.38883,      0,      0,       0,       0,       0,       0
17:45:58,006 tools.stress    DEBUG stress.py      :160  | c-s: total,         18942,    8649,    8649,    8649,     0.2,     0.2,     0.3,     0.5,     1.3,     2.1,    4.0,  0.31837,      0,      0,       0,       0,       0,       0
17:45:59,006 tools.stress    DEBUG stress.py      :160  | c-s: total,         28904,    9962,    9962,    9962,     0.2,     0.2,     0.3,     0.4,     0.8,     1.1,    5.0,  0.26379,      0,      0,       0,       0,       0,       0
17:46:00,005 tools.stress    DEBUG stress.py      :160  | c-s: total,         39487,   10583,   10583,   10583,     0.2,     0.1,     0.3,     0.4,     1.3,    27.1,    6.0,  0.22436,      0,      0,       0,       0,       0,       0
17:46:01,004 tools.stress    DEBUG stress.py      :160  | c-s: total,         51734,   12247,   12247,   12247,     0.2,     0.1,     0.3,     0.3,     0.5,     4.6,    7.0,  0.19871,      0,      0,       0,       0,       0,       0
17:46:01,792 stress_restart_test DEBUG stress_restart_test.py:54   | Restarting second node...
17:46:01,796 tools.stress    DEBUG stress.py      :160  | c-s: com.datastax.driver.core.exceptions.DriverInternalError: An unexpected error happened while sending requests
17:46:01,796 tools.stress    DEBUG stress.py      :160  | c-s: com.datastax.driver.core.exceptions.TransportException: [127.0.93.2/127.0.93.2:9042] Connection has been closed
17:46:02,125 tools.stress    DEBUG stress.py      :160  | c-s: total,         62065,   10331,   10331,   10331,     0.1,     0.1,     0.2,     0.3,     0.4,     3.2,    8.0,  0.17141,      0,      0,       0,       0,       0,       0
17:46:02,805 tools.stress    DEBUG stress.py      :160  | c-s: WARN  17:46:02,804 Error creating netty channel to 127.0.93.2/127.0.93.2:9042
17:46:02,805 tools.stress    DEBUG stress.py      :160  | c-s: com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 127.0.93.2/127.0.93.2:9042
17:46:02,806 tools.stress    DEBUG stress.py      :160  | c-s: Caused by: java.net.ConnectException: Connection refused

...

17:46:29,788 ccm             INFO  cluster.py     :760  | node2: Started scylla: pid: 1212336
17:46:29,788 ccm             DEBUG cluster.py     :757  | node2: Starting scylla-jmx: args=['/home/emilm/.dtest/dtest-a68r2l6w/test/node2/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.93.2', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.93.2', '-Dcom.sun.ma
nagement.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.93.2', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.managem
ent.jmxremote.ssl=false', '-jar', '/home/emilm/.dtest/dtest-a68r2l6w/test/node2/bin/scylla-jmx-1.0.jar']
17:46:30,242 stress_restart_test DEBUG stress_restart_test.py:79   | Waiting until read stress thread will finish running

In the above, note that the c-s output ends soon after the "Restarting second node..." log message; there is no further output and the wait times out.

  2. The c-s completes after the scheduled 2 minutes, but gets stuck afterwards (still causing the test to time out):
18:53:09,245 tools.stress    DEBUG stress.py      :160  | c-s: Sleeping 2s...
18:53:11,248 tools.stress    DEBUG stress.py      :160  | c-s: Running READ with 2 threads 2 minutes
18:53:11,307 tools.stress    DEBUG stress.py      :160  | c-s: ===== Using optimized driver!!! =====
18:53:11,602 tools.stress    DEBUG stress.py      :160  | c-s: Connected to cluster: test, max pending requests per connection null, max connections per host 8
18:53:11,602 tools.stress    DEBUG stress.py      :160  | c-s: Datatacenter: datacenter1; Host: /127.0.15.1; Rack: rack1
18:53:11,602 tools.stress    DEBUG stress.py      :160  | c-s: Datatacenter: datacenter1; Host: /127.0.15.2; Rack: rack1
18:53:11,602 tools.stress    DEBUG stress.py      :160  | c-s: Datatacenter: datacenter1; Host: /127.0.15.3; Rack: rack1
18:53:11,695 tools.stress    DEBUG stress.py      :160  | c-s: type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
18:53:12,014 tools.stress    DEBUG stress.py      :160  | c-s: total,           904,     904,     904,     904,     0.6,     0.6,     1.0,     1.2,     2.9,     2.9,    1.0,  0.00000,      0,      0,       0,       0,       0,       0
18:53:13,007 tools.stress    DEBUG stress.py      :160  | c-s: total,          6001,    5097,    5097,    5097,     0.4,     0.4,     0.6,     0.8,     1.6,     1.8,    2.0,  0.49388,      0,      0,       0,       0,       0,       0
18:53:14,005 tools.stress    DEBUG stress.py      :160  | c-s: total,         11348,    5347,    5347,    5347,     0.4,     0.3,     0.6,     0.8,     2.5,     8.2,    3.0,  0.31138,      0,      0,       0,       0,       0,       0
18:53:15,005 tools.stress    DEBUG stress.py      :160  | c-s: total,         17614,    6266,    6266,    6266,     0.3,     0.3,     0.5,     0.6,     0.9,     1.3,    4.0,  0.23457,      0,      0,       0,       0,       0,       0
18:53:16,006 tools.stress    DEBUG stress.py      :160  | c-s: total,         26322,    8708,    8708,    8708,     0.2,     0.2,     0.4,     0.5,     0.8,     1.8,    5.0,  0.21432,      0,      0,       0,       0,       0,       0
18:53:17,004 tools.stress    DEBUG stress.py      :160  | c-s: total,         33304,    6982,    6982,    6982,     0.3,     0.2,     0.5,     0.6,     0.9,     1.2,    6.0,  0.17578,      0,      0,       0,       0,       0,       0
18:53:18,005 tools.stress    DEBUG stress.py      :160  | c-s: total,         42037,    8733,    8733,    8733,     0.2,     0.2,     0.4,     0.5,     0.6,     1.8,    7.0,  0.15586,      0,      0,       0,       0,       0,       0
18:53:18,731 stress_restart_test DEBUG stress_restart_test.py:54   | Restarting second node...
18:53:18,735 tools.stress    DEBUG stress.py      :160  | c-s: com.datastax.driver.core.exceptions.DriverInternalError: An unexpected error happened while sending requests
18:53:19,062 tools.stress    DEBUG stress.py      :160  | c-s: total,         48701,    6664,    6664,    6664,     0.2,     0.2,     0.4,     0.5,     0.7,     6.8,    8.0,  0.13510,      0,      0,       0,       0,       0,       0
18:53:19,734 tools.stress    DEBUG stress.py      :160  | c-s: WARN  18:53:19,734 Error creating netty channel to /127.0.15.2:9042

...

18:53:20,004 tools.stress    DEBUG stress.py      :160  | c-s: total,         55181,    6480,    6480,    6480,     0.1,     0.1,     0.2,     0.3,     2.5,    12.1,    9.0,  0.11935,      0,      0,       0,       0,       0,       0
18:53:21,003 tools.stress    DEBUG stress.py      :160  | c-s: total,         60081,    4900,    4900,    4900,     0.2,     0.2,     0.3,     0.4,     0.5,     0.7,   10.0,  0.11141,      0,      0,       0,       0,       0,       0
18:53:21,740 tools.stress    DEBUG stress.py      :160  | c-s: WARN  18:53:21,735 Error creating netty channel to /127.0.15.2:9042
18:53:21,746 tools.stress    DEBUG stress.py      :160  | c-s: com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.15.2:9042
18:53:21,751 tools.stress    DEBUG stress.py      :160  | c-s: Caused by: java.net.ConnectException: Connection refused
18:53:21,751 tools.stress    DEBUG stress.py      :160  | c-s: WARN  18:53:21,746 Error creating netty channel to 127.0.15.2/127.0.15.2:9042

...

18:53:44,002 tools.stress    DEBUG stress.py      :160  | c-s: total,        185153,    5940,    5940,    5940,     0.2,     0.2,     0.2,     0.3,     0.6,     0.9,   33.0,  0.04030,      0,      0,       0,       0,       0,       0
18:53:45,003 tools.stress    DEBUG stress.py      :160  | c-s: total,        190028,    4875,    4875,    4875,     0.2,     0.2,     0.2,     0.3,     1.2,     3.8,   34.0,  0.03945,      0,      0,       0,       0,       0,       0
18:53:46,003 tools.stress    DEBUG stress.py      :160  | c-s: total,        195488,    5460,    5460,    5460,     0.2,     0.2,     0.2,     0.3,     0.3,     1.0,   35.0,  0.03836,      0,      0,       0,       0,       0,       0
18:53:46,807 ccm             INFO  cluster.py     :760  | node2: Started scylla: pid: 1234981
18:53:46,808 ccm             DEBUG cluster.py     :757  | node2: Starting scylla-jmx: args=['/home/emilm/.dtest/dtest-nsuqpind/test/node2/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.15.2', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.15.2', '-Dcom.sun.ma
nagement.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.15.2', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.managem
ent.jmxremote.ssl=false', '-jar', '/home/emilm/.dtest/dtest-nsuqpind/test/node2/bin/scylla-jmx-1.0.jar']
18:53:47,002 tools.stress    DEBUG stress.py      :160  | c-s: total,        201321,    5833,    5833,    5833,     0.2,     0.2,     0.2,     0.2,     0.6,     2.7,   36.0,  0.03727,      0,      0,       0,       0,       0,       0
18:53:47,110 stress_restart_test DEBUG stress_restart_test.py:79   | Waiting until read stress thread will finish running
18:53:48,013 tools.stress    DEBUG stress.py      :160  | c-s: total,        207096,    5775,    5775,    5775,     0.2,     0.1,     0.2,     0.3,     0.4,     0.7,   37.0,  0.03624,      0,      0,       0,       0,       0,       0
18:53:49,004 tools.stress    DEBUG stress.py      :160  | c-s: total,        211181,    4085,    4085,    4085,     0.2,     0.2,     0.4,     0.6,     0.8,     1.2,   38.0,  0.03625,      0,      0,       0,       0,       0,       0
18:53:50,003 tools.stress    DEBUG stress.py      :160  | c-s: total,        215121,    3940,    3940,    3940,     0.2,     0.2,     0.4,     0.6,     1.0,     1.5,   39.0,  0.03636,      0,      0,       0,       0,       0,       0

...

18:55:10,003 tools.stress    DEBUG stress.py      :160  | c-s: total,        556028,    5482,    5482,    5482,     0.2,     0.2,     0.3,     0.4,     0.6,     2.4,  119.0,  0.02023,      0,      0,       0,       0,       0,       0
18:55:11,003 tools.stress    DEBUG stress.py      :160  | c-s: total,        560133,    4105,    4105,    4105,     0.2,     0.2,     0.4,     0.4,     0.5,     1.5,  120.0,  0.02010,      0,      0,       0,       0,       0,       0
18:55:12,003 tools.stress    DEBUG stress.py      :160  | c-s: total,        563287,    3154,    3154,    3154,     0.2,     0.2,     0.4,     0.4,     0.5,     1.3,  121.0,  0.02017,      0,      0,       0,       0,       0,       0
---------------------------- Captured log teardown -----------------------------
18:58:47,191 tools.stress    DEBUG stress.py      :160  | c-s: ERROR 18:58:47,190 [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
18:58:47,193 tools.stress    DEBUG stress.py      :160  | c-s: ERROR 18:58:47,192 [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
18:58:48,191 tools.stress    DEBUG stress.py      :160  | c-s: ERROR 18:58:48,191 [Control connection] Cannot connect to any host, scheduling retry in 2000 milliseconds
18:58:48,192 tools.stress    DEBUG stress.py      :160  | c-s: WARN  18:58:48,191 Error creating netty channel to 127.0.15.1/127.0.15.1:9042

In this case, the c-s seems to actually complete (the last output comes approximately 2 minutes after the stress test started), even printing output after the wait-to-finish began. But then it still gets stuck and the wait times out. Also, the c-s doesn't print the stats at the end, which it normally does.

@kbr-scylla

cc @fruch

@kbr-scylla

cc @dkropachev

@dkropachev

@emdroid, thanks for the reproducer, we will take a look at it

@emdroid

emdroid commented Nov 5, 2024

Note that it might also have something to do with this: scylladb/scylladb#19645 (comment)

@fruch

fruch commented Nov 5, 2024

> I prepared a simple reproducer in scylladb/scylla-dtest#5261 - fails 1-3 out of 10 in the older branches (2024.1 enterprise, scylla 5.4). With the current master the failure rate is much lower (sometimes just 1 of 25), probably because of some fixes that have been already done, but still seeing the error sometimes.

Are the branches mentioned here scylla or dtest?

@emdroid

emdroid commented Nov 5, 2024

The branches are scylla; for dtest I used "next" with the scylla master and "branch-2024.1" with the scylla enterprise 2024.1.

@CodeLieutenant

CodeLieutenant commented Nov 5, 2024

> The branches are scylla, for dtest I used "next" with the scylla master and "branch-2024.1" with the scylla enterprise 2024.1

@emdroid Can you try running it again with Cassandra-Stress 3.15? Support for it was merged 2 days ago and I'm not sure whether you ran with the latest.

@emdroid

emdroid commented Nov 7, 2024

I tried to run the dtest after building the c-s in Docker and tagging the image as 3.15 locally, but I'm not sure whether it is actually being used (via Docker). When I run the dtest locally (I'm using the "dist-unified-dev" build of scylla and "--scylla-version=local_tarball" when running the dtest), I can see the c-s processes directly on the machine, i.e. it seems they aren't being run via Docker.

For example:

$ ps aux | grep "cassandra-stress"
emilm     895297  0.0  0.0 228692  3324 pts/31   S+   21:40   0:00 /bin/sh /home/emilm/.ccm/scylla-repository/local_tarball/scylla-core-package/scylla-tools/tools/bin/cassandra-stress read duration=2m no-warmup -rate threads=2 -pop seq=1...10000 -node 127.0.13.1 -port jmx=7199

I know that the above is just the runner script; however, there is also a Java process running on the machine which seems to be the actual c-s process:

$ ps aux | grep Stress
emilm     895335 19.7  0.4 40184360 535204 pts/31 Sl+ 21:40   0:54 /usr/lib/jvm/java-21-openjdk-21.0.4.0.7-2.fc40.x86_64//bin/java -server -ea -cp <some_very_large_classpath_and_params> -Dcassandra.storagedir=/home/emilm/.ccm/scylla-repository/local_tarball/scylla-core-package/scylla-tools/tools/bin/../../data -Dlogback.configurationFile=logback-tools.xml org.apache.cassandra.stress.Stress read duration=2m no-warmup -rate threads=2 -pop seq=1...10000 -node 127.0.13.1 -port jmx=7199

So I'm not really sure how to run the dtest so that it picks up the Docker image (and I also don't know where the unified dist tarball eventually takes the c-s from).

@dkropachev

dkropachev commented Nov 7, 2024

> I tried to run the dtest after I built the c-s in docker + tagged the image as 3.15 locally, but I'm not sure whether it is actually being used (via the docker). Because if I run the dtest locally (I'm using the "dist-unified-dev" build of scylla and the "--scylla-version=local_tarball" when running the dtest), I can see the c-s processes directly on the machine (i.e. it seems it isn't being run there via the docker). [...]

What branch of dtest is it?
Commit that introduced docker-based c-s: https://github.com/scylladb/scylla-dtest/commit/3e2457d856919c3fed724b8a5d01b5903727a76c

@dkropachev

Here is the trace from when c-s got stuck in dtest:
c-s-stuck-trage.log

@dkropachev

As you can see, the main loop is stuck at app//org.apache.cassandra.stress.StressAction.run(StressAction.java:295), which is:

            done.await();

Which means one of the worker threads is stuck.
Here is the stacktrace of the stuck worker:

"Thread-2" prio=5 Id=59 WAITING on com.datastax.driver.core.DefaultResultSetFuture@2ad5ebfd
  at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
  -  waiting on com.datastax.driver.core.DefaultResultSetFuture@2ad5ebfd
  at [email protected]/java.util.concurrent.locks.LockSupport.park(Unknown Source)
  at app//com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:561)
  at app//com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:247)
  at app//com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:308)
  at app//com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
  at app//org.apache.cassandra.stress.util.JavaDriverClient.executePrepared(JavaDriverClient.java:247)
  at app//org.apache.cassandra.stress.operations.predefined.CqlOperation$JavaDriverWrapper.execute(CqlOperation.java:308)

Which means that thread is stuck waiting for the DefaultResultSetFuture result from the driver.
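
The interaction above can be sketched in isolation (all names here are hypothetical; this is not the actual c-s or driver code): the main thread waits on a `CountDownLatch` the way `StressAction.run` does with `done.await()`, while a worker blocks on a future whose response was lost when the connection to the restarted node closed. Since the worker never reaches `countDown()`, `await()` can never return. The sketch uses bounded waits so it terminates:

```java
import java.util.concurrent.*;

public class StressHangSketch {
    static boolean runSketch() throws InterruptedException {
        // done.await() in StressAction.run: the main thread waits for
        // every worker to count down before printing the final stats.
        CountDownLatch done = new CountDownLatch(1);

        // A future that is never completed -- standing in for a
        // DefaultResultSetFuture whose response was lost when the
        // connection closed.
        CompletableFuture<String> lostResponse = new CompletableFuture<>();

        Thread worker = new Thread(() -> {
            try {
                // The real worker calls getUninterruptibly() with no
                // timeout and parks forever; bounded here so the
                // sketch terminates.
                lostResponse.get(200, TimeUnit.MILLISECONDS);
                done.countDown();
            } catch (TimeoutException e) {
                System.out.println("worker parked on pending future");
                // returns without countDown(), like the parked thread
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        });
        worker.start();
        worker.join();

        // With one countDown() missing, the latch never opens:
        return done.await(200, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("stress finished: " + runSketch());
    }
}
```

Running this prints "worker parked on pending future" and "stress finished: false", mirroring the main loop waiting forever on `done.await()`.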

@emdroid

emdroid commented Nov 11, 2024

Thanks @dkropachev for following up on this.

Yes, I was eventually able to debug this as well and found the same place. In particular, one of the read threads gets stuck after this message:

10:36:27,972 tools.stress_thread_docker DEBUG stress_thread_docker.py:167  | com.datastax.driver.core.exceptions.DriverInternalError: An unexpected error happened while sending requests
10:36:27,972 tools.stress_thread_docker DEBUG stress_thread_docker.py:167  | com.datastax.driver.core.exceptions.TransportException: [127.0.7.2/127.0.7.2:9042] Connection has been closed

The current op completes, but the next one gets stuck.

Curiously, this also only happens when there are 2 Consumer read threads; if I change the thread count to just 1, the problem cannot be reproduced.
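
As a side note, a generic client-side pattern that avoids this class of hang (a sketch under my own assumptions, not a proposed fix for c-s; `awaitOrFail` is a hypothetical helper, not a driver API) is to bound the wait on the future and surface the expiry as an error instead of parking forever:

```java
import java.util.concurrent.*;

public class BoundedWaitSketch {
    // Hypothetical helper: wait for an async result with a deadline
    // instead of an unbounded get()/getUninterruptibly().
    static <T> T awaitOrFail(Future<T> f, long millis) {
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // give up on the lost response
            throw new RuntimeException(
                "request timed out after " + millis + " ms", e);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A response that never arrives (connection closed mid-request):
        CompletableFuture<String> lost = new CompletableFuture<>();
        try {
            awaitOrFail(lost, 100);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With this shape, a request whose response was lost fails loudly after the deadline instead of leaving the worker parked.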
