Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNDB-12905: Allow usage of ReadQuery#execute rather than only executeInternal in CQL utests #1564

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

adelapena
Copy link

Many if not most of our unit tests extend CQLTester. CQLTester#execute executes CQL queries with ReadQuery#executeInternal, which is what is used for internal system queries, and it's pretty similar to the local execution done by replica nodes. This misses the full validation and all the reconciliation patch used by ReadQuery#execute.

Missing reconciliation is particularly problematic for utests querying secondary indexes, because they miss the application of the full RowFilter by the coordinator.

A more realistic behaviour can be achieved by using CQLTester#executeNet, which will use a driver session that will exercise the entire path. However, this has to be done manually in every test and it's easy to miss when executeNet is necessary. Also, executeNet enables more serialization stuff that is costly in terms of resources and not that interesting for most cases.

This PR adds a new CQLTester#executeDistributed method that makes the query go through ReadQuery#execute rather than ReadQuery#executeInternal.

It also adds a new CQLTester.enableDistributedExecution method that changes the default behaviour of CQLTester#execute, so we can easily modify the query execution behaviour of entire tests. I haven't been brave enough to update all CQLTester-based unit tests, but I have changed all SAI tests to use this new way of running queries. Let's see how many things this breaks in CI.

CC @michaeljmarshall

@adelapena adelapena self-assigned this Feb 12, 2025
Copy link

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

@adelapena adelapena marked this pull request as draft February 12, 2025 15:37
@adelapena
Copy link
Author

There are lots of failures on CI. Most will need to adapt the tests a bit, but I'll also check if the new test execution has caught some real bug.

@adelapena
Copy link
Author

Mmh, the failures of InetTest might be a real bug due to a difference between the filtering used by SAI and ALLOW FILTERING. Looking into it to confirm.

@adelapena adelapena marked this pull request as ready for review February 12, 2025 17:02
@adelapena adelapena force-pushed the CNDB-12905-main branch 3 times, most recently from 6a9c88e to 78dff15 Compare February 14, 2025 11:57
@@ -516,19 +516,28 @@ public static UntypedResultSet execute(String query, ConsistencyLevel cl, Object
public static UntypedResultSet execute(String query, ConsistencyLevel cl, QueryState state, Object... values)
throws RequestExecutionException
{
try
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the try-catch block because the only caller of this method is SystemDistributedKeyspace#viewStatus, who simply swallows the exception making this wrapping of RequestValidationException no-op.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who simply swallows the exception making this wrapping of RequestValidationException no-op

Could this be an actual bug?
Also, it seems Sonar complains about test coverage in this area.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, how do you think this could fail?

Copy link

@ekaterinadimitrova2 ekaterinadimitrova2 Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems parseAndPrepare also throws that exception, and it is called by this method.
It seems that parseStatement (called by parseAndPrepare) is throwing SyntaxException that extends the RequestValidationException.
Interesting history - this execute method was added with the re-write of the storage engine. The method did not have any callers and the execute method implementations were throwing that exception. They stopped throwing it in CASSANDRA-13426 so I guess that was left over.

I haven't written particular tests to verify what is going on, just looking around the code. Shall we leave the cleaning for follow-up ticket?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the generic try-catch block in that variant of execute looks like a reminder of the original implementation. The method is only used by SystemDistributedKeyspace#viewStatus, which shallows the exception because it's only interested in collecting view statuses for a nodetool command. So it's a no-op. Now we want to re-use the method for QueryTester, but the useless exception replacement gets in the way and hides the original exception, which we want for our test checks.

Do you agree on simply removing the try-catch block, since it doesn't seem to use any purpose? Or I can add a variant of the QueryProcessor#execute method without the exception replacement, or even embed it in CQLTester#executeFormattedQuery, given how simple it is.

Copy link

@ekaterinadimitrova2 ekaterinadimitrova2 Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's remove it, convinced. Just trying not to break something and being extra cautious.

Copy link

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, left some minor questions.

@@ -516,19 +516,28 @@ public static UntypedResultSet execute(String query, ConsistencyLevel cl, Object
public static UntypedResultSet execute(String query, ConsistencyLevel cl, QueryState state, Object... values)
throws RequestExecutionException
{
try

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who simply swallows the exception making this wrapping of RequestValidationException no-op

Could this be an actual bug?
Also, it seems Sonar complains about test coverage in this area.

@ekaterinadimitrova2
Copy link

I am looking into CI results again now - seems like NodeStartupTest is a new regression even if it was not marked by Butler?
This one also seems like something that needs a ticket:
https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-1564/9/testReport/org.apache.cassandra.index.sai.cql/VectorTypeTest/tests_stage_1___jvm_unit_tests___jvm_unit_compression_tests___tracingTest_ca__compression_jdk11/

Butler marked a lot of tests for regressions. We need to ensure they match previous failures on main and they do not fail in a new way.

@ekaterinadimitrova2
Copy link

ekaterinadimitrova2 commented Feb 24, 2025

The latest changes LGTM but what do we do about NodeStartupTest? It is not failing on main, but it fails consistently with this patch.
On a quick look, we probably discovered another bug. So, here are two options if that is the case:

@adelapena
Copy link
Author

The latest changes LGTM but what do we do about NodeStartupTest? It is not failing on main, but it fails consistently with this patch.

Ah, I've missed that one! I think the node restart simulated by SAITester#simulateNodeRestart doesn't work with CQLTester#requireNetwork nor CQLTester#requireNetworkWithoutDriver. If we start the services required for networking even calls to executeInternal will fail. The failures can be reproduced on unpatched main by simply making a call to requireNetwork anywhere before the failing query, even if we don't use the network calls at all. I suspect this is probably a test issue rather than a product defect.

I'm simply disabling coordinator execution on NodeStartupTest so it doesn't call requireNetworkWithoutDriver, and we can try to improve the test separately. Or we could simply port the utest to a trusty dtest, so we don't have to do too much black magic with a particular test. wdyt?

@ekaterinadimitrova2
Copy link

we could simply port the utest to a trusty dtest

+1, let's do that, please

@adelapena
Copy link
Author

we could simply port the utest to a trusty dtest

+1, let's do that, please

Do you mean here or in a follow-up patch?

@ekaterinadimitrova2
Copy link

we could simply port the utest to a trusty dtest

+1, let's do that, please

Do you mean here or in a follow-up patch?

Hm, sorry, I thought you suggest to make it here so we are sure no regressions. If it is not too much work, let's do it here while on top of it, please.

@adelapena
Copy link
Author

Hm, sorry, I thought you suggest to make it here so we are sure no regressions. If it is not too much work, let's do it here while on top of it, please.

I don't see how calling the same method in the same context is going to produce a regression, this seems more about using this to improve a previously existing test. I think it's likely I'll have to switch contexts this week, but I'll retake this patch with the port to dtests as soon as possible.

@ekaterinadimitrova2
Copy link

I don't see how calling the same method in the same context is going to produce a regression, this seems more about using this to improve a previously existing test.

I didn't see you were so quick to add a patch. :-) If you don't have time to add distributed test now let's commit the current PR and get to test improvement when the heat goes down a bit. I don't want to keep this improvement back.

@adelapena
Copy link
Author

adelapena commented Feb 25, 2025

Ok, I've created #13125 as a follow-up for investigating and improving/porting NodeStartupTest. I'll merge this one in a bit if the PR gets an approving review.

@cassci-bot
Copy link

✔️ Build ds-cassandra-pr-gate/PR-1564 approved by Butler


Approved by Butler
See build details here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants