-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clickhouse string ordering and string filtering by UTF8 instead of bytes #6143
base: master
Are you sure you want to change the base?
Conversation
Hey @casab ! Thanks for contributing! Could you please add an integration test for it? |
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 8 Skipped Deployments
|
645512c
to
63b935c
Compare
@casab is attempting to deploy a commit to the Cube Dev Team on Vercel. A member of the Team first needs to authorize it. |
…e_utf8_filter_order
@casab May I kindly ask you to test with the latest release and rebase your changes on top of it? We have migrated to a new ClickHouse client library recently. Thanks in advance! |
@igorlukanin Can you approve running the workflows? |
Hi @casab Could you please sync with the latest master, resolve conflicts and fix warnings/errors? Ping me whenever you need to approve running workflows! |
@KSDaemon of course, done it. |
@casab Hey! Thnx for the updates! Unfortunately, some lint errors still exists:
Maybe thats because of merge... But anyway... Can you fix them, please? |
Co-authored-by: Konstantin Burkalev <[email protected]>
const orderByString = R.pipe( | ||
R.map((order) => { | ||
let orderString = this.orderHashToString(order); | ||
if (this.getFieldType(order) === 'string') && collation !== '' { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the mess, after thinking for a while, I agree with your original suggestion to fall back to en
locale as it will work more correctly even for data in other locales...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written before it didn't allow removing COLLATE
at all, and I think it's important to have a way to disable it. At the very least, to keep old behaviour for users, who for any reason does not expect this change.
I think we can enable it by default, but way to opt-out from using collations is a must: without opt-out there's no way to make ordering case-sensitive, and even if we want to force it, users would have only one way to fix this during migration - revert back to previous release and keep waiting.
I, personally, would not expect it to use collation by default, but I also find default collations in PostgreSQL or MySQL strange, so maybe it's just me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I wrote when reviewing previous version:
Any collation in order by makes ordering case-insensitive in ClickHouse, and it would not match orderBy
for any other DB adapter in Cube, at least from ops perspective.
Consider query like this:
SELECT
s
FROM (
SELECT
'a' AS s
UNION ALL
SELECT
'A' AS s
UNION ALL
SELECT
'a' AS s
UNION ALL
SELECT
'b' AS s
UNION ALL
SELECT
'B' AS s
UNION ALL
SELECT
'b' AS s
) t
ORDER BY
s COLLATE 'en'
;
With COLLATE 'en'
ClickHouse would return all a
and A
, and then all b
and B
.
But for DBMS like PostgreSQL this would depend on type of column (and encoding, for that matter), so admin of warehouse can decide how to order those, not admin of Cube, and they can decide whether it should or shouldn't be case sensitive.
Also keep in mind that Cube can issue ungrouped queries with order and limit, that would need to sort very large set of rows, so performance impact from enabling it can (I think) be huge in some specific cases, and specifically it can be unexpected for ClickHouse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on being able to disable collation behavior. We might document it to set the env variable to a specific value. Or I might add an another env variable.
The problem with Clickhouse is that it only allows collation in the order by clause. Nothing else, so there is no database level, table level, column level, session level collation setting anywhere. Otherwise I would agree.
@@ -18,7 +20,7 @@ class ClickHouseFilter extends BaseFilter { | |||
public likeIgnoreCase(column, not, param, type) { | |||
const p = (!type || type === 'contains' || type === 'ends') ? '%' : ''; | |||
const s = (!type || type === 'contains' || type === 'starts') ? '%' : ''; | |||
return `lower(${column}) ${not ? 'NOT' : ''} LIKE CONCAT('${p}', lower(${this.allocateParam(param)}), '${s}')`; | |||
return `lowerUTF8(toValidUTF8(${column})) ${not ? 'NOT' : ''} LIKE CONCAT('${p}', lowerUTF8(toValidUTF8(${this.allocateParam(param)})), '${s}')`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use ILIKE
instead of toValidUTF8
? I mean, it was written like this, but since you are messing around here - why not? Both lowerUTF8
and ILIKE
can handle UTF-8, but ILIKE
is a bit more direct, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if ILIKE is UTF-8 compatible. It might just be calling lower(). Let me ask this to the Clickhouse folk first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, for UTF-8 strings _
has a "single code point" meaning
https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions#ilike
@@ -168,6 +184,35 @@ export class ClickHouseQuery extends BaseQuery { | |||
return `${fieldAlias} ${direction}`; | |||
} | |||
|
|||
public override orderBy() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to mention, it's not exactly clear to me how to fix it, so maybe comment with TODO/caveat would be enough
It's not enough to override orderBy
, templates.expressions.order_by
and templates.statements.select
should be adjusted as well: they can be used by SQL API while generating complex queries, and it would not pick up changes here.
I think, expressions.order_by
is not used at all at the moment, but, probably, it should. And even then, COLLATE
can only be used with strings, so that would require plumbing type information into template.
Consider query like this coming to Cube's SQL API
SELECT
"measure_count_distinct",
COUNT(DISTINCT "dim_str0") AS "distinct_dimension_values"
FROM (
SELECT
COUNT(DISTINCT "countDistinct") AS "measure_count_distinct",
CAST("dim_str0" AS TEXT) AS "dim_str0"
FROM "MultiTypeCube"
-- this filter is important
WHERE
LOWER("dim_str1") = 'foo'
GROUP BY
2
) t
GROUP BY 1
ORDER BY 1
Inner query can (almost) be represented as regular query to Cube.
But because of filter it would be executed via SQL pushdown. And ORDER BY in query to ClickHouse will be generated around here:
cube/rust/cubesql/cubesql/src/compile/engine/df/wrapper.rs
Lines 1415 to 1440 in 92a4b8e
let resulting_sql = generator | |
.get_sql_templates() | |
.select( | |
sql.sql.to_string(), | |
projection, | |
group_by, | |
group_descs, | |
aggregate, | |
// TODO | |
from_alias.unwrap_or("".to_string()), | |
if !filter.is_empty() { | |
Some( | |
filter | |
.iter() | |
.map(|f| f.expr.to_string()) | |
.join(" AND "), | |
) | |
} else { | |
None | |
}, | |
None, | |
order, | |
limit, | |
offset, | |
distinct, | |
) |
This code path will not use
orderBy
methods, but will use templates, that can be overridden by adapter.
But there's more!
If there's no filter this whole query would execute as data load through Cube for inner query, and post-processing on SQL API side for outer, including sorting. This sorting would not be done by ClickHouse, so it would completely ignore collation setting, and, in some sense, may result in unexpected ordering change for seemingly.
This issue is not specific to ClickHouse, or collations. I just wanted to highlight that Cube already can have different orderings for same source DB and dataset, but different queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mcheshkov As far as I can see expressions.order_by
was not used. But expressions.sort
is used. I have made changes to the templates. And tried to give data_type for the column in the expression from the Rust part. I would appreciate it if you let me know if I'm on the right path.
To be honest, I got a little bit overwhelmed by cubesql code. It took sometime reading the code to understand what you meant. I think I've got to understand in which cases SQL API pushes down queries, and how SQL API handles sorting.
const orderByString = R.pipe( | ||
R.map((order) => { | ||
let orderString = this.orderHashToString(order); | ||
if (this.getFieldType(order) === 'string') && collation !== '' { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written before it didn't allow removing COLLATE
at all, and I think it's important to have a way to disable it. At the very least, to keep old behaviour for users, who for any reason does not expect this change.
I think we can enable it by default, but way to opt-out from using collations is a must: without opt-out there's no way to make ordering case-sensitive, and even if we want to force it, users would have only one way to fix this during migration - revert back to previous release and keep waiting.
I, personally, would not expect it to use collation by default, but I also find default collations in PostgreSQL or MySQL strange, so maybe it's just me.
const orderByString = R.pipe( | ||
R.map((order) => { | ||
let orderString = this.orderHashToString(order); | ||
if (this.getFieldType(order) === 'string') && collation !== '' { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I wrote when reviewing previous version:
Any collation in order by makes ordering case-insensitive in ClickHouse, and it would not match orderBy
for any other DB adapter in Cube, at least from ops perspective.
Consider query like this:
SELECT
s
FROM (
SELECT
'a' AS s
UNION ALL
SELECT
'A' AS s
UNION ALL
SELECT
'a' AS s
UNION ALL
SELECT
'b' AS s
UNION ALL
SELECT
'B' AS s
UNION ALL
SELECT
'b' AS s
) t
ORDER BY
s COLLATE 'en'
;
With COLLATE 'en'
ClickHouse would return all a
and A
, and then all b
and B
.
But for DBMS like PostgreSQL this would depend on type of column (and encoding, for that matter), so admin of warehouse can decide how to order those, not admin of Cube, and they can decide whether it should or shouldn't be case sensitive.
Also keep in mind that Cube can issue ungrouped queries with order and limit, that would need to sort very large set of rows, so performance impact from enabling it can (I think) be huge in some specific cases, and specifically it can be unexpected for ClickHouse
…add data_type to AliasedColumn and TemplateColumn to be able to use the data type in the templates
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6143 +/- ##
==========================================
- Coverage 83.44% 80.65% -2.79%
==========================================
Files 227 227
Lines 81626 81654 +28
==========================================
- Hits 68113 65860 -2253
- Misses 13513 15794 +2281
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@KSDaemon @mcheshkov Hey, I have fixed all the errors. And implemented all the required parts. I would appreciate it if you can review it. Especially the rust part. |
Clickhouse defaults to using bytes to order by and string manipulation functions such as
lower
,upper
uses ascii. To overcome this limitation they haveCOLLATE
keyword, andlowerUTF8
,upperUTF8
functions.Check List
Description of Changes Made (if issue reference is not provided)
CONCAT
SQL function with js template literal to prevent unnecessary DB function calllowerUTF8
instead oflower
to support utf8 compatible searchCOLLATE ‘en’
when ordering by strings to order incasesensitive. Clickhouse orders by bytes on default.