Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tablets 4.x #241

Merged
merged 1 commit into from
Jun 17, 2024
Merged

Conversation

Bouncheck
Copy link
Collaborator

@Bouncheck Bouncheck commented Aug 21, 2023

(Copied over from commit message)
Introduces basic tablets support for version 4.x of the driver.
Metadata about tablets will be kept in TabletMap that gets continuously updated
through the tablets-routing-v1 extension. Each time the BoundStatement targets
the wrong node and shard combination the server supporting tablets should
respond with tablet metadata inside custom payload of its response.
This metadata will be transparently processed and used for future queries.

Tablets metadata will by enabled by default. Until now driver was using
TokenMaps to choose replicas and appropriate shards. Having a token was enough
information to do that. Now driver will first attempt tablet-based lookup
and only after failing to find corresponding tablet it will defer to TokenMap
lookup. Since to find a correct tablet besides the token we need the keyspace
and table names, many of the methods were extended to also accept those
as parameters.

RequestHandlerTestHarness was adjusted to mock also MetadataManager.
Before it used to mock only session.getMetadata() call but the same can
be obtained by context.getMetadataManager().getMetadata(). Using the
second method was causing test failures.

Fixes #268

@roydahan
Copy link
Collaborator

@Bouncheck can you please describe the current status of it?
Where we're in terms of code completion / testing / what's left / any estimation for completion?

Also, if review is already required or not and if any blockers exist at the moment.

@Bouncheck
Copy link
Collaborator Author

Currently I've made almost complete switch to the solution that uses custom payloads instead of querying system.tablets.
I've ran some tests on local 4 node cluster and so far there were no issues with reading the payloads and selecting the correct replica node. The only thing current code lacks is choosing a shard correctly.
I think the code will be complete during this week.

@bhalevy bhalevy added this to the Java 4.x Q2-2023 release milestone Mar 13, 2024
@Bouncheck Bouncheck force-pushed the scylla-4.x-tablets-1 branch from 6425a4f to 6669f85 Compare March 15, 2024 19:13
@Bouncheck Bouncheck marked this pull request as ready for review March 15, 2024 19:14
@Bouncheck Bouncheck requested a review from avelanarius March 15, 2024 19:15
@Bouncheck Bouncheck changed the title Tablets 4.x (WIP) Tablets 4.x Mar 15, 2024
@Bouncheck Bouncheck force-pushed the scylla-4.x-tablets-1 branch from 6669f85 to c6730c6 Compare March 15, 2024 19:44
@Bouncheck Bouncheck force-pushed the scylla-4.x-tablets-1 branch 2 times, most recently from c24d841 to 8d5575b Compare March 25, 2024 06:08
@roydahan
Copy link
Collaborator

roydahan commented Mar 28, 2024

@Bouncheck can you please update what’s the status and what’s this PR waiting for?

@Bouncheck
Copy link
Collaborator Author

Currently it's waiting for review. I've done some manual testing, but haven't done any stress testing under heavy load.

@roydahan
Copy link
Collaborator

Thanks @Bouncheck, can you please check the failing CI check?
Also, please ping someone from the team to help you with a peer review.

Regarding stress testing, I don't know if we have a tool/app that utilizes this driver version, IIRC there was a problem to build c-s with this driver version. However, if it's not the case anymore or we can utilize newer c-s from the upstream that we can build with this driver, we can have a longevity test using SCT.

@Bouncheck
Copy link
Collaborator Author

Will do. At first i thought 2024.1.2 failures were a one off, since it looked liked there's a lot of similar test errors relating to starting the cluster (and only for that scylla version), but it seems they reproduce after all.

@avelanarius avelanarius requested a review from Lorak-mmk April 22, 2024 11:23
@roydahan
Copy link
Collaborator

@Bouncheck what's the status of this PR?

@Lorak-mmk can you please help reviewing this?

Copy link

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. Some things that fit better here:

  • Why are there no tests for tablets? This is a complex feature so it should have proper tests (both unit tests of the new data structure and integration tests that check that tablet routing works properly in various scenarios)
  • I see you decided to store replicas Uuids instead of Node objects. This has a performance cost, because you need to allocate a new HashSet of all reploca Node objects for each query. I assume it was done to simplify the code, because you don't need to handle the case where we can't map some id to Node, like I do in maintenance procedures in Rust Driver Tablets PR (Introduce support for Tablets scylla-rust-driver#937). But the maintenance is still required for cases like node being replaces (lack of handling of this case is another problem with this PR), so handling the missing-UUID case doesn't really add much complexity.
  • The PR is one being commit. It should be split into small atomic commits to make it easier to analyze fopr reviewers and future readers.

@bhalevy
Copy link
Member

bhalevy commented May 21, 2024

ping
I see review comments from @Lorak-mmk from ~3 weeks ago that weren't discussed.

@Bouncheck
Copy link
Collaborator Author

Yes, I've seen them and I'm reworking those parts. Should be ready soon

@Bouncheck
Copy link
Collaborator Author

I'll push the code likely today - I want to run 1 more concurrency experiment.

@Bouncheck Bouncheck force-pushed the scylla-4.x-tablets-1 branch from 8ccda18 to 17b1f40 Compare May 28, 2024 16:21
@Lorak-mmk Lorak-mmk self-assigned this Jun 1, 2024
@roydahan
Copy link
Collaborator

roydahan commented Jun 2, 2024

@Bouncheck is it ready for re-review?
If so, please request @avelanarius and @Lorak-mmk reviewing it ASAP.

@Bouncheck Bouncheck force-pushed the scylla-4.x-tablets-1 branch from 17b1f40 to da118b2 Compare June 3, 2024 01:01
@Bouncheck Bouncheck requested a review from Lorak-mmk June 3, 2024 07:52
@mykaul
Copy link

mykaul commented Jun 5, 2024

@dani-tweig - this important PR is not tracked anywhere - there's no 'tablets' label for it, etc.

@dani-tweig dani-tweig added the enhancement New feature or request label Jun 5, 2024
@Bouncheck Bouncheck force-pushed the scylla-4.x-tablets-1 branch from da118b2 to cabda4d Compare June 10, 2024 06:41
Introduces basic tablets support for version 4.x of the driver.
Metadata about tablets will be kept in TabletMap that gets continuously updated
through the tablets-routing-v1 extension. Each time the BoundStatement targets
the wrong node and shard combination the server supporting tablets should
respond with tablet metadata inside custom payload of its response.
This metadata will be transparently processed and used for future queries.

Tablets metadata will by enabled by default. Until now driver was using
TokenMaps to choose replicas and appropriate shards. Having a token was enough
information to do that. Now driver will first attempt tablet-based lookup
and only after failing to find corresponding tablet it will defer to TokenMap
lookup. Since to find a correct tablet besides the token we need the keyspace
and table names, many of the methods were extended to also accept those
as parameters.

RequestHandlerTestHarness was adjusted to mock also MetadataManager.
Before it used to mock only `session.getMetadata()` call but the same can
be obtained by `context.getMetadataManager().getMetadata()`. Using the
second method was causing test failures.
@Bouncheck Bouncheck force-pushed the scylla-4.x-tablets-1 branch from cabda4d to e0fbffd Compare June 10, 2024 06:48
@Bouncheck
Copy link
Collaborator Author

Bouncheck commented Jun 10, 2024

Pushed another version:

  • Removed redundant/unused fields and collections from DefaultTablet
  • Added NonNull annotation to pair structures
  • Removed NodeShardPair altogether
  • Added comment about thread safety to DefaultTabletMap
  • Fixed comments about sweep during tabletAdd in DefaultTabletMap
  • Removed unused elements from tests
  • Removed inferTable, added getRoutingTable in Request; Provided overrides for BatchStatement and BoundStatement. Ultimately I did not add setRoutingTable or routingTable fields. I do not have a usecase for them right now and this saves me from adjusting statement builders and tests.
  • Removed unnecessary nullity checks and other misc. items pointed out
  • Reduced 1 partitioner.hash(key) call in BasicLoadBalancingPolicy in case when falling back to TokenMap
  • Switched to CqlIdentifiers instead of Strings when handling keyspace and table in tablet contexts

@Bouncheck Bouncheck requested a review from Lorak-mmk June 10, 2024 08:31
@Lorak-mmk
Copy link

I see CI is failing, do you know why? Also, I remember you mentioned that Tablet tests were not being executed, because of some edge case with version comparing. Is hat fixed?

@Bouncheck
Copy link
Collaborator Author

Last time I checked this test these same two methods (should_apply_node_filter, should_use_round_robin_on_local_dc_when_not_enough_routing_information) were failing and it was due to one node failing to start. This changes actual number of nodes seen in local dc and changes the output in second method to different than expected because round robin does not include that downed node.

The problem with tablets integration test was that the CI used to pull release candidate version for which I've noticed it was skipping this test. It's pulling 6.0.0 now which is the minimal required version and if you search the logs for DefaultMetadataTabletMapIT you can see that it indeed does run the test and launch 2 nodes it's configured for.

@Bouncheck Bouncheck merged commit 07dfa3e into scylladb:scylla-4.x Jun 17, 2024
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants