CNDB-12905: Allow usage of ReadQuery#execute rather than only executeInternal in CQL utests #1564

adelapena · 2025-02-12T13:58:52Z

Many if not most of our unit tests extend CQLTester. CQLTester#execute executes CQL queries with ReadQuery#executeInternal, which is what is used for internal system queries, and it's pretty similar to the local execution done by replica nodes. This misses the full validation and all the reconciliation patch used by ReadQuery#execute.

Missing reconciliation is particularly problematic for utests querying secondary indexes, because they miss the application of the full RowFilter by the coordinator.

A more realistic behaviour can be achieved by using CQLTester#executeNet, which will use a driver session that will exercise the entire path. However, this has to be done manually in every test and it's easy to miss when executeNet is necessary. Also, executeNet enables more serialization stuff that is costly in terms of resources and not that interesting for most cases.

This PR adds a new CQLTester#executeDistributed method that makes the query go through ReadQuery#execute rather than ReadQuery#executeInternal.

It also adds a new CQLTester.enableDistributedExecution method that changes the default behaviour of CQLTester#execute, so we can easily modify the query execution behaviour of entire tests. I haven't been brave enough to update all CQLTester-based unit tests, but I have changed all SAI tests to use this new way of running queries. Let's see how many things this breaks in CI.

CC @michaeljmarshall

github-actions · 2025-02-12T13:59:08Z

Checklist before you submit for review

Make sure there is a PR in the CNDB project updating the Converged Cassandra version
Use NoSpamLogger for log lines that may appear frequently in the logs
Verify test results on Butler
Test coverage for new/modified code is > 80%
Proper code formatting
Proper title for each commit staring with the project-issue number, like CNDB-1234
Each commit has a meaningful description
Each commit is not very long and contains related changes
Renames, moves and reformatting are in distinct commits

adelapena · 2025-02-12T15:40:19Z

There are lots of failures on CI. Most will need to adapt the tests a bit, but I'll also check if the new test execution has caught some real bug.

adelapena · 2025-02-12T15:48:45Z

Mmh, the failures of InetTest might be a real bug due to a difference between the filtering used by SAI and ALLOW FILTERING. Looking into it to confirm.

adelapena · 2025-02-14T16:32:19Z

src/java/org/apache/cassandra/cql3/QueryProcessor.java

@@ -516,19 +516,28 @@ public static UntypedResultSet execute(String query, ConsistencyLevel cl, Object
    public static UntypedResultSet execute(String query, ConsistencyLevel cl, QueryState state, Object... values)
    throws RequestExecutionException
    {
-        try


I have removed the try-catch block because the only caller of this method is SystemDistributedKeyspace#viewStatus, who simply swallows the exception making this wrapping of RequestValidationException no-op.

who simply swallows the exception making this wrapping of RequestValidationException no-op

Could this be an actual bug?
Also, it seems Sonar complains about test coverage in this area.

I don't know, how do you think this could fail?

It seems parseAndPrepare also throws that exception, and it is called by this method.
It seems that parseStatement (called by parseAndPrepare) is throwing SyntaxException that extends the RequestValidationException.
Interesting history - this execute method was added with the re-write of the storage engine. The method did not have any callers and the execute method implementations were throwing that exception. They stopped throwing it in CASSANDRA-13426 so I guess that was left over.

I haven't written particular tests to verify what is going on, just looking around the code. Shall we leave the cleaning for follow-up ticket?

Yes, the generic try-catch block in that variant of execute looks like a reminder of the original implementation. The method is only used by SystemDistributedKeyspace#viewStatus, which shallows the exception because it's only interested in collecting view statuses for a nodetool command. So it's a no-op. Now we want to re-use the method for QueryTester, but the useless exception replacement gets in the way and hides the original exception, which we want for our test checks.

Do you agree on simply removing the try-catch block, since it doesn't seem to use any purpose? Or I can add a variant of the QueryProcessor#execute method without the exception replacement, or even embed it in CQLTester#executeFormattedQuery, given how simple it is.

Ok, let's remove it, convinced. Just trying not to break something and being extra cautious.

test/unit/org/apache/cassandra/cql3/CQLTester.java

test/unit/org/apache/cassandra/index/sai/cql/GenericOrderByTest.java

test/unit/org/apache/cassandra/index/sai/cql/InetAddressTypeEquivalencyTest.java

test/unit/org/apache/cassandra/index/sai/functional/CompactionTest.java

test/unit/org/apache/cassandra/nodes/virtual/LegacySystemKeyspaceToNodesTest.java

ekaterinadimitrova2

Overall LGTM, left some minor questions.

ekaterinadimitrova2 · 2025-02-18T18:35:52Z

src/java/org/apache/cassandra/cql3/QueryProcessor.java

@@ -516,19 +516,28 @@ public static UntypedResultSet execute(String query, ConsistencyLevel cl, Object
    public static UntypedResultSet execute(String query, ConsistencyLevel cl, QueryState state, Object... values)
    throws RequestExecutionException
    {
-        try


who simply swallows the exception making this wrapping of RequestValidationException no-op

Could this be an actual bug?
Also, it seems Sonar complains about test coverage in this area.

test/unit/org/apache/cassandra/cql3/CQLTester.java

src/java/org/apache/cassandra/cql3/QueryProcessor.java

test/unit/org/apache/cassandra/index/sai/SAITester.java

test/unit/org/apache/cassandra/index/sai/cql/LuceneAnalyzerTest.java

test/unit/org/apache/cassandra/cql3/CQLTester.java

test/unit/org/apache/cassandra/index/sai/cql/GenericOrderByTest.java

ekaterinadimitrova2 · 2025-02-18T21:40:29Z

I am looking into CI results again now - seems like NodeStartupTest is a new regression even if it was not marked by Butler?
This one also seems like something that needs a ticket:
https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-1564/9/testReport/org.apache.cassandra.index.sai.cql/VectorTypeTest/tests_stage_1___jvm_unit_tests___jvm_unit_compression_tests___tracingTest_ca__compression_jdk11/

Butler marked a lot of tests for regressions. We need to ensure they match previous failures on main and they do not fail in a new way.

ekaterinadimitrova2 · 2025-02-24T16:06:28Z

The latest changes LGTM but what do we do about NodeStartupTest? It is not failing on main, but it fails consistently with this patch.
On a quick look, we probably discovered another bug. So, here are two options if that is the case:

open a ticket and leave it failing
do similar TO DO to what we did for the other test - https://github.com/datastax/cassandra/pull/1564/files#diff-a59706dac4edd4d75341dddec74d94c880566e188518bacb3bafeaf6c2c3b09eR36-R44 and still open a ticket, ofc
80% test coverage is because the code where we remove the try-catch is not tested. Did you consider adding some test? I mean 80% is boarder line and we are trying not to regress

adelapena · 2025-02-24T17:43:59Z

The latest changes LGTM but what do we do about NodeStartupTest? It is not failing on main, but it fails consistently with this patch.

Ah, I've missed that one! I think the node restart simulated by SAITester#simulateNodeRestart doesn't work with CQLTester#requireNetwork nor CQLTester#requireNetworkWithoutDriver. If we start the services required for networking even calls to executeInternal will fail. The failures can be reproduced on unpatched main by simply making a call to requireNetwork anywhere before the failing query, even if we don't use the network calls at all. I suspect this is probably a test issue rather than a product defect.

I'm simply disabling coordinator execution on NodeStartupTest so it doesn't call requireNetworkWithoutDriver, and we can try to improve the test separately. Or we could simply port the utest to a trusty dtest, so we don't have to do too much black magic with a particular test. wdyt?

ekaterinadimitrova2 · 2025-02-24T17:48:25Z

we could simply port the utest to a trusty dtest

+1, let's do that, please

adelapena · 2025-02-24T20:23:46Z

we could simply port the utest to a trusty dtest

+1, let's do that, please

Do you mean here or in a follow-up patch?

ekaterinadimitrova2 · 2025-02-24T20:30:47Z

we could simply port the utest to a trusty dtest

+1, let's do that, please

Do you mean here or in a follow-up patch?

Hm, sorry, I thought you suggest to make it here so we are sure no regressions. If it is not too much work, let's do it here while on top of it, please.

adelapena · 2025-02-24T20:35:47Z

Hm, sorry, I thought you suggest to make it here so we are sure no regressions. If it is not too much work, let's do it here while on top of it, please.

I don't see how calling the same method in the same context is going to produce a regression, this seems more about using this to improve a previously existing test. I think it's likely I'll have to switch contexts this week, but I'll retake this patch with the port to dtests as soon as possible.

ekaterinadimitrova2 · 2025-02-24T20:48:48Z

I don't see how calling the same method in the same context is going to produce a regression, this seems more about using this to improve a previously existing test.

I didn't see you were so quick to add a patch. :-) If you don't have time to add distributed test now let's commit the current PR and get to test improvement when the heat goes down a bit. I don't want to keep this improvement back.

…Internal in CQL utests

SAI queries using inet should be fixed independently by CNDB-12978

This tries not to give the idea that the execution will involve multiple nodes.

This fixes failures in VectorSiftSmallTest due to parallel writes

adelapena · 2025-02-25T12:03:17Z

Ok, I've created #13125 as a follow-up for investigating and improving/porting NodeStartupTest. I'll merge this one in a bit if the PR gets an approving review.

sonarqubecloud · 2025-02-25T12:23:24Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
80.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2025-02-25T12:26:07Z

✔️ Build ds-cassandra-pr-gate/PR-1564 approved by Butler

Approved by Butler
See build details here

adelapena self-assigned this Feb 12, 2025

adelapena force-pushed the CNDB-12905-main branch from 979aba7 to d5bb70f Compare February 12, 2025 14:07

adelapena marked this pull request as draft February 12, 2025 15:37

adelapena force-pushed the CNDB-12905-main branch from d5bb70f to 38bdb05 Compare February 12, 2025 15:55

adelapena marked this pull request as ready for review February 12, 2025 17:02

adelapena force-pushed the CNDB-12905-main branch 3 times, most recently from 6a9c88e to 78dff15 Compare February 14, 2025 11:57