Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement LIKE for StringView arrays #11024

Closed
Tracked by #10918
alamb opened this issue Jun 20, 2024 · 3 comments
Closed
Tracked by #10918

Implement LIKE for StringView arrays #11024

alamb opened this issue Jun 20, 2024 · 3 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 20, 2024

Is your feature request related to a problem or challenge?

Part of #10918 where we are integrating StringView into DataFusion, initially targeting making ClickBench queries faster

In the ClickBench queries there are several LIKE predicates on String columns such as

SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%';

SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%';
SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

Describe the solution you'd like

Given a table with StringView data such as:

> CREATE OR REPLACE TABLE string_views AS VALUES (arrow_cast('Andrew', 'Utf8View'), 'A Much Longer String Than 12 Characters',  'Nonsense', 'A Much', NULL);
0 row(s) fetched.
Elapsed 0.006 seconds.

> select arrow_typeof(column1) from string_views limit 1;
+------------------------------------+
| arrow_typeof(string_views.column1) |
+------------------------------------+
| Utf8View                           |
+------------------------------------+
1 row(s) fetched.
Elapsed 0.012 seconds.

I would like queries using LIKE and NOT LIKE to work:

> select column1 from string_views WHERE column1 LIKE 'A%';
Arrow error: Invalid argument error: Invalid string operation: Utf8View LIKE Utf8View
> select column1 from string_views WHERE column1 NOT LIKE 'A%';
Arrow error: Invalid argument error: Invalid string operation: Utf8View NLIKE Utf8View
> select column1 from string_views WHERE column1 LIKE '%e';
Arrow error: Invalid argument error: Invalid string operation: Utf8View LIKE Utf8View
> select column1 from string_views WHERE column1 LIKE '%A%';
Arrow error: Invalid argument error: Invalid string operation: Utf8View LIKE Utf8View
>

Describe alternatives you've considered

We could add a coercion rule to LIKE to automatically coerce Utf8View to Utf8, however that is inefficient as it will involve copying all the strings. It would be much better to actually implement LIKE for Utf8View arrays directly

Currently LIKE and ILIKE are implemented in arrow-rs kernels, such as https://docs.rs/arrow/latest/arrow/compute/kernels/comparison/fn.like.html

The DataFusion implementation is here:
https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/expressions/like.rs

I think there are two potential implementations:

  1. Add special case code to datafusion (at least temporarily)
  2. Add support upstream in arrow-rs

Additional context

Please remember to target the string-view branch in DataFusion, rather than main with your PR

@alamb alamb added the enhancement New feature or request label Jun 20, 2024
@XiangpengHao
Copy link
Contributor

take

@alamb
Copy link
Contributor Author

alamb commented Jul 31, 2024

I think this was done as a kernel in arrow and datafusion uses the kernel:

(false, false) => apply_cmp(&lhs, &rhs, like),
(false, true) => apply_cmp(&lhs, &rhs, ilike),
(true, false) => apply_cmp(&lhs, &rhs, nlike),
(true, true) => apply_cmp(&lhs, &rhs, nilike),

@alamb
Copy link
Contributor Author

alamb commented Jul 31, 2024

I wrote a test in #11753 which verifies this is already done

@alamb alamb closed this as completed Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants