Fix encoding for Elasticsearch count pushdown #23425

Merged: 2 commits, Sep 19, 2024

Member

@bvolpato bvolpato commented Sep 14, 2024

Description

When a COUNT(*) query was pushed down and its predicate contained non-ASCII characters, the serialized QueryBuilder string was sent as ISO-8859-1, which caused parsing failures in Elasticsearch.

Additional context and related issues

For example, this query:

SELECT COUNT(*) FROM catalog.default.users where country = 'Türkiye';

If the "country" field is a keyword, the query would fail with:

{"error":{"root_cause":[{"type":"parsing_exception","reason":"Failed to parse","line":1,"col":53}],"type":"parsing_exception","reason":"Failed to parse","line":1,"col":53,"caused_by":{"type":"x_content_parse_exception","reason":"[1:53] [bool] failed to parse field [filter]","caused_by":{"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0x73\n at [Source: (org.elasticsearch.common.io.stream.ByteBufferStreamInput); line: 1, column: 64]"}}},"status":400}

The source of the problem was new StringEntity(sourceBuilder.toString()), which uses the default content type defined at https://github.com/apache/httpcomponents-core/blob/rel/v4.4.16/httpcore/src/main/java/org/apache/http/entity/ContentType.java#L106-L107 and therefore encodes the body as ISO-8859-1.
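
To illustrate why the charset matters, here is a minimal sketch that uses only the JDK; it is not the code changed in this PR. The class name EncodingSketch, the helper isValidUtf8, and the sample JSON body are hypothetical. It encodes the same query string as ISO-8859-1 and as UTF-8, then checks whether each byte sequence is well-formed UTF-8, which is what a server parsing the body as UTF-8 JSON expects.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // Hypothetical helper: true only if the bytes decode as strict UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Illustrative request body; \u00fc is the "ü" in "Türkiye".
        String body = "{\"term\":{\"country\":\"T\u00fcrkiye\"}}";

        // ISO-8859-1 writes "ü" as the single byte 0xFC, which is never valid in UTF-8.
        byte[] latin1 = body.getBytes(StandardCharsets.ISO_8859_1);
        // UTF-8 writes "ü" as the two bytes 0xC3 0xBC.
        byte[] utf8 = body.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes are valid UTF-8: " + isValidUtf8(latin1)); // false
        System.out.println("UTF-8 bytes are valid UTF-8: " + isValidUtf8(utf8));        // true
    }
}
```

With Apache HttpComponents 4.x, one way to avoid the mismatch is to pass an explicit charset or content type when constructing the entity, for example new StringEntity(json, ContentType.APPLICATION_JSON); whether the merged fix uses exactly that form is not shown in this conversation.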

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# ElasticSearch, OpenSearch
* Fix query failure for some queries when a predicate contains unicode text. ({issue}`issuenumber`)

cla-bot bot commented Sep 14, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Member

@martint martint left a comment

The OpenSearch connector probably has the same issue. Can you apply the fix there, too?

@bvolpato
Member Author

bvolpato commented Sep 15, 2024

> The OpenSearch connector probably has the same issue. Can you apply the fix there, too?

Good call. Yes, I just reproduced it there with the same set of tests and pushed the same fix.

I also applied the text block suggestions; I agree that it looks much cleaner. For context, I followed the structure of the surrounding tests, so they could likely get a similar refactoring.

@bvolpato
Member Author

Lastly, I've submitted the CLA, but I guess it might take a couple of days to hear back.

@pettyjamesm
Member

@martint - looks like this is just pending CLA processing before it can merge.

@martint
Member

martint commented Sep 18, 2024

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Sep 18, 2024
cla-bot bot commented Sep 18, 2024

The cla-bot has been summoned, and re-checked this pull request!

@hashhar hashhar merged commit 3c1b11c into trinodb:master Sep 19, 2024
17 checks passed
@github-actions github-actions bot added this to the 459 milestone Sep 19, 2024
@bvolpato bvolpato deleted the count-utf8-encoding branch September 20, 2024 12:55

5 participants