disrupt_add_remove_dc nemesis reported failure but all operations within it were successful #9470

Open
timtimb0t opened this issue Dec 3, 2024 · 1 comment
Assignees
Labels
on_core_qa tasks that should be solved by Core QA team tests/longevity-tier1

Comments

@timtimb0t
Contributor

Packages

Scylla version: 6.3.0~dev-20241129.65949ce60780 with build-id d0921e78443678667ebaf5d8cdfda19428d03e6c

Kernel Version: 6.8.0-1019-aws

Issue description

The disrupt_add_remove_dc nemesis adds a new DC and then removes it. The DC was added and removed successfully, but on exit from temporary_replication_strategy_setter the following error was raised:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5430, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4787, in disrupt_add_remove_dc
    with temporary_replication_strategy_setter(node) as replication_strategy_setter:
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/replication_strategy_utils.py", line 114, in __exit__
    self(**self.preserved)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/replication_strategy_utils.py", line 125, in __call__
    strategy.apply(self.node, keyspace)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/replication_strategy_utils.py", line 47, in apply
    session.execute(cql)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1318, in execute_verbose
    return execute_orig(*args, **kwargs)
  File "cassandra/cluster.py", line 2729, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 5120, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={<Host: 10.4.10.165:9042 eu-west>: ConnectionException('Host has been marked down or removed')}, last_host=10.4.9.78:9042

The target node (i.e. the new DC) was removed successfully.
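
For illustration, here is a minimal sketch of why the nemesis is reported as failed even though every step inside the `with` block succeeded. It does not use the real SCT code or the Cassandra driver: `TemporarySetting`, `FakeTimeout`, `run_nemesis` and `flaky_apply` are hypothetical stand-ins, and the only assumption taken from the traceback is that the preserved settings are re-applied in `__exit__`.

```python
class FakeTimeout(Exception):
    """Hypothetical stand-in for cassandra.OperationTimedOut."""


class TemporarySetting:
    """Hypothetical analogue of temporary_replication_strategy_setter."""

    def __init__(self, apply):
        self.apply = apply            # callable that pushes a setting to the cluster
        self.preserved = "original"   # value to restore when the block exits

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # The restore runs only after the body has finished, so an error
        # raised here surfaces as a failure of the whole `with` statement.
        self.apply(self.preserved)
        return False


def run_nemesis(apply):
    """Run the two body steps and report how the block as a whole ended."""
    steps = []
    try:
        with TemporarySetting(apply):
            steps.append("add dc")
            steps.append("remove dc")
    except FakeTimeout as err:
        return steps, f"failed: {err}"
    return steps, "passed"


def flaky_apply(value):
    # Simulates the restore query being routed to a host that is already gone.
    raise FakeTimeout("Host has been marked down or removed")


steps, outcome = run_nemesis(flaky_apply)
print(steps, outcome)
```

In this sketch both body steps are recorded before the exception appears, which matches the observation above: the failure comes from the restore step on exit, not from adding or removing the DC.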

Impact

No scylla impact

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-50gb-12h-master-db-node-899fa2dc-7 (34.246.171.132 | 10.4.11.139) (shards: 8)
  • longevity-50gb-12h-master-db-node-899fa2dc-6 (18.201.126.28 | 10.4.10.177) (shards: 8)
  • longevity-50gb-12h-master-db-node-899fa2dc-5 (52.209.17.188 | 10.4.11.7) (shards: 8)
  • longevity-50gb-12h-master-db-node-899fa2dc-4 (34.247.85.22 | 10.4.10.165) (shards: 11)
  • longevity-50gb-12h-master-db-node-899fa2dc-3 (3.253.80.20 | 10.4.9.78) (shards: 14)
  • longevity-50gb-12h-master-db-node-899fa2dc-2 (52.209.150.157 | 10.4.10.62) (shards: 9)
  • longevity-50gb-12h-master-db-node-899fa2dc-1 (18.201.192.26 | 10.4.8.24) (shards: 11)

OS / Image: ami-09d8c22a0006e46a8 (aws: undefined_region)

Test: longevity-150gb-asymmetric-cluster-12h-test
Test id: 899fa2dc-a66d-4948-bb82-ede7d1fba930
Test name: scylla-master/tier1/longevity-150gb-asymmetric-cluster-12h-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 899fa2dc-a66d-4948-bb82-ede7d1fba930
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 899fa2dc-a66d-4948-bb82-ede7d1fba930

Logs:

Jenkins job URL
Argus

@timtimb0t timtimb0t self-assigned this Dec 3, 2024
@timtimb0t timtimb0t added tests/longevity-tier1 on_core_qa tasks that should be solved by Core QA team labels Dec 3, 2024
@soyacz
Contributor

soyacz commented Dec 3, 2024

This should be fixed once #9430 is fixed.
