DecommissionStreamingErr is waiting for non-existing log lines #9501

Open

fruch opened this issue Dec 8, 2024 · 3 comments

Comments


fruch commented Dec 8, 2024

Packages

Scylla version: 6.3.0~dev-20241206.7e2875d6489d with build-id 5227dd2a3fce4d2beb83ec6c17d47ad2e8ba6f5c

Kernel Version: 6.8.0-1019-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Not all of the log lines that the DecommissionStreamingErr nemesis waits for exist when raft topology is enabled.

see https://github.com/scylladb/scylladb/blame/f744007e13657491082ccf7023d07947f2b23ea1/service/storage_service.cc#L3711

all the "DECOMMISSIONING: .*" are in the branch used when raft topology is disable,

ABORT_DECOMMISSION_LOG_PATTERNS: Iterable[MessagePosition] = [
    MessagePosition("api - decommission", LogPosition.BEGIN),
    MessagePosition("DECOMMISSIONING: unbootstrap starts", LogPosition.BEGIN),
    MessagePosition("DECOMMISSIONING: unbootstrap done", LogPosition.END),
    MessagePosition("becoming a group 0 non-voter", LogPosition.END),
    MessagePosition("became a group 0 non-voter", LogPosition.END),
    MessagePosition("leaving token ring", LogPosition.END),
    MessagePosition("left token ring", LogPosition.END),
    MessagePosition("raft_topology - decommission: waiting for completion", LogPosition.BEGIN),
    MessagePosition("repair - decommission_with_repair", LogPosition.END)
]

Either those log patterns should be removed, or they should be put under the appropriate logic.
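As a sketch of the second option (this is not the SCT implementation, and which message belongs to which group is an assumption that still needs to be verified against storage_service.cc), the list could be split into a legacy group and a raft-topology group:

from typing import Iterable

# MessagePosition and LogPosition are the same SCT helpers used in the list above.
# The classification below is a guess based on the blame link in the description
# and must be verified before use.

ABORT_DECOMMISSION_LOG_PATTERNS_LEGACY: Iterable[MessagePosition] = [
    # "DECOMMISSIONING: .*" is only printed on the non-raft (legacy topology) path.
    MessagePosition("DECOMMISSIONING: unbootstrap starts", LogPosition.BEGIN),
    MessagePosition("DECOMMISSIONING: unbootstrap done", LogPosition.END),
]

ABORT_DECOMMISSION_LOG_PATTERNS_RAFT: Iterable[MessagePosition] = [
    MessagePosition("raft_topology - decommission: waiting for completion", LogPosition.BEGIN),
    MessagePosition("becoming a group 0 non-voter", LogPosition.END),
    MessagePosition("became a group 0 non-voter", LogPosition.END),
]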

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-master-db-node-00441d41-9 (54.194.53.160 | 10.4.9.27) (shards: 7)
  • longevity-twcs-48h-master-db-node-00441d41-8 (34.244.107.75 | 10.4.8.101) (shards: 7)
  • longevity-twcs-48h-master-db-node-00441d41-7 (63.33.71.172 | 10.4.10.42) (shards: 7)
  • longevity-twcs-48h-master-db-node-00441d41-6 (52.211.182.14 | 10.4.10.239) (shards: -1)
  • longevity-twcs-48h-master-db-node-00441d41-5 (34.251.145.212 | 10.4.8.74) (shards: 7)
  • longevity-twcs-48h-master-db-node-00441d41-4 (54.217.134.87 | 10.4.8.135) (shards: 7)
  • longevity-twcs-48h-master-db-node-00441d41-3 (18.200.239.48 | 10.4.8.44) (shards: 7)
  • longevity-twcs-48h-master-db-node-00441d41-2 (34.243.168.32 | 10.4.11.104) (shards: 7)
  • longevity-twcs-48h-master-db-node-00441d41-10 (34.245.46.14 | 10.4.11.13) (shards: 7)
  • longevity-twcs-48h-master-db-node-00441d41-1 (54.78.241.215 | 10.4.9.219) (shards: 7)

OS / Image: ami-0c7b4b0835c9342f7 (aws: undefined_region)

Test: longevity-twcs-48h-test
Test id: 00441d41-0edb-47a9-bbab-8f9e7a5b5821
Test name: scylla-master/tier1/longevity-twcs-48h-test
Test method: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 00441d41-0edb-47a9-bbab-8f9e7a5b5821
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 00441d41-0edb-47a9-bbab-8f9e7a5b5821

Logs:

Jenkins job URL
Argus

@fruch fruch removed their assignment Dec 8, 2024

fruch commented Dec 8, 2024

@temichus @aleksbykov, it seems like this logic has been there for almost a year now.

Can all of those log lines be validated to actually exist in the code, and when should they be waited for?
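One rough, hypothetical way to answer the first question is to grep a local checkout of the scylla sources for each awaited message. Note that this is only a first pass: messages built from format strings may not match a literal substring search, and some patterns include the logger-name prefix (e.g. "raft_topology - "), which is added at log time rather than appearing verbatim in the source, so misses still need a manual look.

from pathlib import Path

# Messages the nemesis currently waits for (copied from the list in the description).
PATTERNS = [
    "api - decommission",
    "DECOMMISSIONING: unbootstrap starts",
    "DECOMMISSIONING: unbootstrap done",
    "becoming a group 0 non-voter",
    "became a group 0 non-voter",
    "leaving token ring",
    "left token ring",
    "raft_topology - decommission: waiting for completion",
    "repair - decommission_with_repair",
]

scylla_src = Path("scylladb")  # assumed path to a local checkout of scylladb/scylladb
sources = list(scylla_src.rglob("*.cc"))
for pattern in PATTERNS:
    hits = [src for src in sources if pattern in src.read_text(errors="ignore")]
    print(f"{pattern!r}: {', '.join(map(str, hits)) if hits else 'NOT FOUND'}")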

@aleksbykov aleksbykov self-assigned this Dec 10, 2024
@aleksbykov commented:

> @temichus @aleksbykov, it seems like this logic has been there for almost a year now.
>
> Can all of those log lines be validated to actually exist in the code, and when should they be waited for?

The reason was that jobs are running with and without raft, and for different versions. Now that raft is always enabled, the log messages will be set to the actual state.


fruch commented Dec 10, 2024

> @temichus @aleksbykov, it seems like this logic has been there for almost a year now.
>
> Can all of those log lines be validated to actually exist in the code, and when should they be waited for?

> The reason was that jobs are running with and without raft, and for different versions. Now that raft is always enabled, the log messages will be set to the actual state.

But the code wasn't split to select only the correct group of log patterns according to whether raft is on or not, and if it was, that change didn't reach the 6.1 branch.
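For illustration, the selection being described could look roughly like the sketch below, reusing the split sketched earlier in this issue; the raft-detection helper is hypothetical, not an existing SCT API:

def abort_decommission_log_patterns(node) -> Iterable[MessagePosition]:
    # node_has_raft_topology_enabled() is a placeholder for however SCT actually
    # detects consistent topology changes (raft topology) on the cluster.
    if node_has_raft_topology_enabled(node):
        return ABORT_DECOMMISSION_LOG_PATTERNS_RAFT
    return ABORT_DECOMMISSION_LOG_PATTERNS_LEGACY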
