Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clickhouse string ordering and string filtering by UTF8 instead of bytes #6143

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

casab
Copy link

@casab casab commented Feb 9, 2023

Clickhouse defaults to using bytes to order by and string manipulation functions such as lower, upper uses ascii. To overcome this limitation they have COLLATE keyword, and lowerUTF8, upperUTF8 functions.

Check List

  • Tests has been run in packages where changes made if available
  • Linter has been run for changed code
  • Tests for the changes have been added if not covered yet
  • Docs have been added / updated if required

Description of Changes Made (if issue reference is not provided)

  • Replaced CONCAT SQL function with js template literal to prevent unnecessary DB function call
  • Used lowerUTF8 instead of lower to support utf8 compatible search
  • Added COLLATE ‘en’ when ordering by strings to order incasesensitive. Clickhouse orders by bytes on default.

@casab casab requested a review from a team as a code owner February 9, 2023 14:32
@github-actions github-actions bot added the pr:community Contribution from Cube.js community members. label Feb 9, 2023
@paveltiunov
Copy link
Member

Hey @casab ! Thanks for contributing! Could you please add an integration test for it?

@paveltiunov paveltiunov self-assigned this Mar 4, 2023
Copy link

vercel bot commented Jun 7, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

8 Skipped Deployments
Name Status Preview Comments Updated (UTC)
examples-angular-dashboard ⬜️ Ignored (Inspect) Visit Preview Nov 22, 2024 1:40pm
examples-react-d3 ⬜️ Ignored (Inspect) Visit Preview Nov 22, 2024 1:40pm
examples-react-dashboard ⬜️ Ignored (Inspect) Visit Preview Nov 22, 2024 1:40pm
examples-react-data-table ⬜️ Ignored (Inspect) Visit Preview Nov 22, 2024 1:40pm
examples-react-highcharts ⬜️ Ignored (Inspect) Visit Preview Nov 22, 2024 1:40pm
examples-react-material-ui ⬜️ Ignored (Inspect) Visit Preview Nov 22, 2024 1:40pm
examples-react-pivot-table ⬜️ Ignored (Inspect) Visit Preview Nov 22, 2024 1:40pm
examples-vue-query-builder ⬜️ Ignored (Inspect) Visit Preview Nov 22, 2024 1:40pm

@casab casab force-pushed the feature/clickhouse_utf8_filter_order branch from 645512c to 63b935c Compare June 7, 2024 14:56
Copy link

vercel bot commented Jun 7, 2024

@casab is attempting to deploy a commit to the Cube Dev Team on Vercel.

A member of the Team first needs to authorize it.

@igorlukanin
Copy link
Member

@casab May I kindly ask you to test with the latest release and rebase your changes on top of it? We have migrated to a new ClickHouse client library recently. Thanks in advance!

@casab
Copy link
Author

casab commented Nov 22, 2024

@igorlukanin Can you approve running the workflows?

@KSDaemon
Copy link
Member

KSDaemon commented Jan 17, 2025

Hi @casab Could you please sync with the latest master, resolve conflicts and fix warnings/errors? Ping me whenever you need to approve running workflows!
Btw, for a quick linter fixes you can run yarn run lint:fix

@casab
Copy link
Author

casab commented Jan 24, 2025

@KSDaemon of course, done it.

@KSDaemon
Copy link
Member

KSDaemon commented Jan 24, 2025

@casab Hey! Thnx for the updates! Unfortunately, some lint errors still exists:

  /home/runner/work/cube/cube/packages/cubejs-schema-compiler/src/adapter/ClickHouseQuery.ts
    178:1   error    Expected indentation of 6 spaces but found 8  indent
    178:16  warning  Strings must use singlequote                  quotes
    182:1   error    Expected indentation of 6 spaces but found 8  indent
    182:26  error    Unexpected string concatenation               prefer-template
    182:58  warning  Strings must use singlequote                  quotes
    183:1   error    Expected indentation of 6 spaces but found 8  indent
    184:1   error    Expected indentation of 6 spaces but found 8  indent
    184:16  warning  Strings must use singlequote                  quotes
    188:1   error    Expected indentation of 6 spaces but found 8  indent
    188:16  warning  Strings must use singlequote                  quotes
    192:1   error    Expected indentation of 2 spaces but found 0  indent

Maybe thats because of merge... But anyway...

Can you fix them, please?

@casab casab requested a review from KSDaemon January 28, 2025 13:57
const orderByString = R.pipe(
R.map((order) => {
let orderString = this.orderHashToString(order);
if (this.getFieldType(order) === 'string') && collation !== '' {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the mess, after thinking for a while, I agree with your original suggestion to fall back to en locale as it will work more correctly even for data in other locales...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written before it didn't allow removing COLLATE at all, and I think it's important to have a way to disable it. At the very least, to keep old behaviour for users, who for any reason does not expect this change.
I think we can enable it by default, but way to opt-out from using collations is a must: without opt-out there's no way to make ordering case-sensitive, and even if we want to force it, users would have only one way to fix this during migration - revert back to previous release and keep waiting.

I, personally, would not expect it to use collation by default, but I also find default collations in PostgreSQL or MySQL strange, so maybe it's just me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I wrote when reviewing previous version:

Any collation in order by makes ordering case-insensitive in ClickHouse, and it would not match orderBy for any other DB adapter in Cube, at least from ops perspective.

Consider query like this:

SELECT
    s
FROM (
    SELECT
        'a' AS s
    UNION ALL
    SELECT
        'A' AS s
    UNION ALL
    SELECT
        'a' AS s
    UNION ALL
    SELECT
        'b' AS s
    UNION ALL
    SELECT
        'B' AS s
    UNION ALL
    SELECT
        'b' AS s
    ) t
ORDER BY
    s COLLATE 'en'
;

With COLLATE 'en' ClickHouse would return all a and A, and then all b and B.
But for DBMS like PostgreSQL this would depend on type of column (and encoding, for that matter), so admin of warehouse can decide how to order those, not admin of Cube, and they can decide whether it should or shouldn't be case sensitive.

Also keep in mind that Cube can issue ungrouped queries with order and limit, that would need to sort very large set of rows, so performance impact from enabling it can (I think) be huge in some specific cases, and specifically it can be unexpected for ClickHouse

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on being able to disable collation behavior. We might document it to set the env variable to a specific value. Or I might add an another env variable.

The problem with Clickhouse is that it only allows collation in the order by clause. Nothing else, so there is no database level, table level, column level, session level collation setting anywhere. Otherwise I would agree.

@@ -18,7 +20,7 @@ class ClickHouseFilter extends BaseFilter {
public likeIgnoreCase(column, not, param, type) {
const p = (!type || type === 'contains' || type === 'ends') ? '%' : '';
const s = (!type || type === 'contains' || type === 'starts') ? '%' : '';
return `lower(${column}) ${not ? 'NOT' : ''} LIKE CONCAT('${p}', lower(${this.allocateParam(param)}), '${s}')`;
return `lowerUTF8(toValidUTF8(${column})) ${not ? 'NOT' : ''} LIKE CONCAT('${p}', lowerUTF8(toValidUTF8(${this.allocateParam(param)})), '${s}')`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use ILIKE instead of toValidUTF8? I mean, it was written like this, but since you are messing around here - why not? Both lowerUTF8 and ILIKE can handle UTF-8, but ILIKE is a bit more direct, IMO.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if ILIKE is UTF-8 compatible. It might just be calling lower(). Let me ask this to the Clickhouse folk first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, for UTF-8 strings _ has a "single code point" meaning

https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions#ilike

@@ -168,6 +184,35 @@ export class ClickHouseQuery extends BaseQuery {
return `${fieldAlias} ${direction}`;
}

public override orderBy() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to mention, it's not exactly clear to me how to fix it, so maybe comment with TODO/caveat would be enough

It's not enough to override orderBy, templates.expressions.order_by and templates.statements.select should be adjusted as well: they can be used by SQL API while generating complex queries, and it would not pick up changes here.

I think, expressions.order_by is not used at all at the moment, but, probably, it should. And even then, COLLATE can only be used with strings, so that would require plumbing type information into template.

Consider query like this coming to Cube's SQL API

SELECT
    "measure_count_distinct",
    COUNT(DISTINCT "dim_str0") AS "distinct_dimension_values"
FROM (
    SELECT
        COUNT(DISTINCT "countDistinct") AS "measure_count_distinct",
        CAST("dim_str0" AS TEXT) AS "dim_str0"
    FROM "MultiTypeCube"
-- this filter is important
    WHERE
        LOWER("dim_str1") = 'foo'
    GROUP BY
        2
) t
GROUP BY 1
ORDER BY 1

Inner query can (almost) be represented as regular query to Cube.
But because of filter it would be executed via SQL pushdown. And ORDER BY in query to ClickHouse will be generated around here:

let resulting_sql = generator
.get_sql_templates()
.select(
sql.sql.to_string(),
projection,
group_by,
group_descs,
aggregate,
// TODO
from_alias.unwrap_or("".to_string()),
if !filter.is_empty() {
Some(
filter
.iter()
.map(|f| f.expr.to_string())
.join(" AND "),
)
} else {
None
},
None,
order,
limit,
offset,
distinct,
)

This code path will not use orderBy methods, but will use templates, that can be overridden by adapter.

But there's more!
If there's no filter this whole query would execute as data load through Cube for inner query, and post-processing on SQL API side for outer, including sorting. This sorting would not be done by ClickHouse, so it would completely ignore collation setting, and, in some sense, may result in unexpected ordering change for seemingly.
This issue is not specific to ClickHouse, or collations. I just wanted to highlight that Cube already can have different orderings for same source DB and dataset, but different queries.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcheshkov As far as I can see expressions.order_by was not used. But expressions.sort is used. I have made changes to the templates. And tried to give data_type for the column in the expression from the Rust part. I would appreciate it if you let me know if I'm on the right path.

To be honest, I got a little bit overwhelmed by cubesql code. It took sometime reading the code to understand what you meant. I think I've got to understand in which cases SQL API pushes down queries, and how SQL API handles sorting.

const orderByString = R.pipe(
R.map((order) => {
let orderString = this.orderHashToString(order);
if (this.getFieldType(order) === 'string') && collation !== '' {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written before it didn't allow removing COLLATE at all, and I think it's important to have a way to disable it. At the very least, to keep old behaviour for users, who for any reason does not expect this change.
I think we can enable it by default, but way to opt-out from using collations is a must: without opt-out there's no way to make ordering case-sensitive, and even if we want to force it, users would have only one way to fix this during migration - revert back to previous release and keep waiting.

I, personally, would not expect it to use collation by default, but I also find default collations in PostgreSQL or MySQL strange, so maybe it's just me.

const orderByString = R.pipe(
R.map((order) => {
let orderString = this.orderHashToString(order);
if (this.getFieldType(order) === 'string') && collation !== '' {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I wrote when reviewing previous version:

Any collation in order by makes ordering case-insensitive in ClickHouse, and it would not match orderBy for any other DB adapter in Cube, at least from ops perspective.

Consider query like this:

SELECT
    s
FROM (
    SELECT
        'a' AS s
    UNION ALL
    SELECT
        'A' AS s
    UNION ALL
    SELECT
        'a' AS s
    UNION ALL
    SELECT
        'b' AS s
    UNION ALL
    SELECT
        'B' AS s
    UNION ALL
    SELECT
        'b' AS s
    ) t
ORDER BY
    s COLLATE 'en'
;

With COLLATE 'en' ClickHouse would return all a and A, and then all b and B.
But for DBMS like PostgreSQL this would depend on type of column (and encoding, for that matter), so admin of warehouse can decide how to order those, not admin of Cube, and they can decide whether it should or shouldn't be case sensitive.

Also keep in mind that Cube can issue ungrouped queries with order and limit, that would need to sort very large set of rows, so performance impact from enabling it can (I think) be huge in some specific cases, and specifically it can be unexpected for ClickHouse

casab added 2 commits February 6, 2025 04:21
…add data_type to AliasedColumn and TemplateColumn to be able to use the data type in the templates
@casab casab requested a review from a team as a code owner February 6, 2025 01:24
@casab casab requested a review from mcheshkov February 6, 2025 06:45
Copy link

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 96.55172% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.65%. Comparing base (b3777c3) to head (a62eb42).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...t/cubesql/cubesql/src/compile/engine/df/wrapper.rs 96.15% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6143      +/-   ##
==========================================
- Coverage   83.44%   80.65%   -2.79%     
==========================================
  Files         227      227              
  Lines       81626    81654      +28     
==========================================
- Hits        68113    65860    -2253     
- Misses      13513    15794    +2281     
Flag Coverage Δ
cubesql 80.65% <96.55%> (-2.79%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@casab casab requested a review from KSDaemon February 11, 2025 18:58
@casab
Copy link
Author

casab commented Feb 11, 2025

@KSDaemon @mcheshkov Hey, I have fixed all the errors. And implemented all the required parts. I would appreciate it if you can review it. Especially the rust part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr:community Contribution from Cube.js community members.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants