-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support Utf8View
type in starts_with
function
#11787
Conversation
let string_types = vec![Utf8, LargeUtf8, Utf8View]; | ||
let mut type_signatures = vec![]; | ||
|
||
for left in &string_types { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if there's a better way to do this? Maybe I should just list them out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a interesting idea, i liked it compared to earlier implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the underlying kernel supports different string types (you can't actually call it with (Utf8
, LargeUtf8
) so the existing function actually has a bug
In fact here is reproducer showing the probelm on main
DataFusion CLI v40.0.0
> create table foo as values (arrow_cast('foo', 'Utf8'), arrow_cast('bar', 'LargeUtf8'));
0 row(s) fetched.
Elapsed 0.046 seconds.
> select * from foo;
+---------+---------+
| column1 | column2 |
+---------+---------+
| foo | bar |
+---------+---------+
1 row(s) fetched.
Elapsed 0.010 seconds.
> select starts_with(column1, column2) from foo;
Internal error: could not cast value to arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericStringType<ii32>>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
So in other words, I think the correct signature is
signature: Signature::one_of(
vec![
Exact(vec![Utf8, Utf8]),
Exact(vec![LargeUtf8, LargeUtf8]),
Exact(vec![Utf8View, Utf8View]),
],
Volatility::Immutable,
),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this and link to the tests -- very helpful!
I made this update along with simplifying the arrow starts_with call (#11787 (comment)), and I'm noticing that STARTS_WITH(column1_utf8view, 'foo')
results in a cast of the first column to utf8 while keeping the second one utf8, but I think we'd want to keep the first argument as a utf8view while coercing the second argument to utf8view? It feels like calling starts_with
with a second argument as a scalar is common.
For example, with the changes, this test passes:
query TT
EXPLAIN SELECT
STARTS_WITH(column1_utf8view, 'foo') as c,
STARTS_WITH(column1_utf8view, column2_utf8view) as c2
FROM test;
----
logical_plan
01)Projection: starts_with(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS c, starts_with(test.column1_utf8view, test.column2_utf8view) AS c2
02)--TableScan: test projection=[column1_utf8view, column2_utf8view]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree figuring out how to coerce the constant to Utf8view is important
I think the coercion logic just goes linearly through the signature and stops at the first type that works.
Thus I wonder if you can make Utf8View
preferred by listing the types in a different order and put the Utf8View
signature at the first option? Like
signature: Signature::one_of(
vec![
// make `Utf8View` first
Exact(vec![Utf8View, Utf8View]),
Exact(vec![Utf8, Utf8]),
Exact(vec![LargeUtf8, LargeUtf8]),
],
Volatility::Immutable,
),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still seems to be happening with this latest push 🤔 (222263c).
I'll try to poke around and see if I can understand what's happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it needed this update in the argument coercion logic... before this even with Utf8View being preferred in the signature it wasn't recognized in the type to/from statement: https://github.com/apache/datafusion/pull/11787/files#diff-6e1fb265597317a8256c60670ff3ea7be6896b2df1199a40ca79419ce29b4ce3
#[test] | ||
fn test_functions() -> Result<()> { | ||
// Generate test cases for starts_with | ||
let test_cases = vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is the best way to test this change. It tests the function but doesn't a view wouldn't be coerced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tests for the coercion in #11753 -- you beat me to filing the ticket ❤️ -- I plan to make an epic / start filing items for the various other functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend using a slt test (example of sql test above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to throw it out there, if you make the top level epic, I'd be happy to work though that PR and make individual issues for the problematic functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deal! I filed #11790
What I recommend doing is starting with a single ticket and making it a complete template / description with the idea being that that someone that is not familar DataFusion or Utf8View can work on it
Then when we file tickets for the rest of the functions we can reuse the same description over again. This
- Reduces the barrier to contribution
- I have found people love working on the code when there is a clear description
I took a shot at updating #11786 with background for use a template. Let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. I'll have a go at replicating one of the other functions following #11786, and then if we're happy with it, I can grind through the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good -- thank you. What I have found works well in the past is to do like 10 at a time (if you make more than that the number of parallel threads / work I found overwhelmed our ability to review them)
Utf8View
for starts_with
Utf8View
type in starts_with
function
} | ||
_ => internal_err!("Unsupported data type"), | ||
DataType::LargeUtf8 => make_scalar_function(starts_with::<i64>, vec![])(args), | ||
DataType::Utf8View => make_scalar_function(starts_with::<i32>, vec![])(args), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do |
operator to handle Utf8 and Utf8View together.
( DataType::Utf8View | DataType::Utf8 ) => make_scalar_function(starts_with::<i32>, vec![])(args)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let left = as_generic_string_array::<T>(&args[0])?; | ||
let right = as_generic_string_array::<T>(&args[1])?; | ||
let bool_result = match (args[0].data_type(), args[1].data_type()) { | ||
(DataType::Utf8View, DataType::Utf8View) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since https://docs.rs/arrow/latest/arrow/compute/kernels/comparison/fn.starts_with.html already does this downcasting, the I think you could simply call this with two &dyn Array
rather than having to check the types
In particular I think you could simply call
arrow::compute::kernels::comparison::starts_with(left, right)?
and remove the <T: OffsetSizeTrait>
from this call
BTW I double checked that arrow has native StringView support (thanks to @XiangpengHao ):
https://docs.rs/arrow-string/52.2.0/src/arrow_string/like.rs.html#129-148 ✅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super cool.
let string_types = vec![Utf8, LargeUtf8, Utf8View]; | ||
let mut type_signatures = vec![]; | ||
|
||
for left in &string_types { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the underlying kernel supports different string types (you can't actually call it with (Utf8
, LargeUtf8
) so the existing function actually has a bug
In fact here is reproducer showing the probelm on main
DataFusion CLI v40.0.0
> create table foo as values (arrow_cast('foo', 'Utf8'), arrow_cast('bar', 'LargeUtf8'));
0 row(s) fetched.
Elapsed 0.046 seconds.
> select * from foo;
+---------+---------+
| column1 | column2 |
+---------+---------+
| foo | bar |
+---------+---------+
1 row(s) fetched.
Elapsed 0.010 seconds.
> select starts_with(column1, column2) from foo;
Internal error: could not cast value to arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericStringType<ii32>>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
So in other words, I think the correct signature is
signature: Signature::one_of(
vec![
Exact(vec![Utf8, Utf8]),
Exact(vec![LargeUtf8, LargeUtf8]),
Exact(vec![Utf8View, Utf8View]),
],
Volatility::Immutable,
),
}
#[test] | ||
fn test_functions() -> Result<()> { | ||
// Generate test cases for starts_with | ||
let test_cases = vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend using a slt test (example of sql test above)
Gonna mark this ready for review, though happy to make any additional updates. |
This pr is really nicely done! Looks good to me! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this nice work, it looks good to me.
Given this PR can become an example for supporting StringView
for remaining string functions, I suggest also:
- Add more sqllogictests, since they're easier to understand and follow.
- In sqllogictest end-to-end tests, include more edge cases like empty strings, mismatched argument types like
starts_with(utf8_col, utf8view_col)
, this way we can encourage other contibutors to follow the pattern and write better tests for the remaining functions.
@@ -573,6 +573,8 @@ fn coerced_from<'a>( | |||
(Interval(_), _) if matches!(type_from, Utf8 | LargeUtf8) => { | |||
Some(type_into.clone()) | |||
} | |||
// We can go into a Utf8View from a Utf8 or LargeUtf8 | |||
(Utf8View, _) if matches!(type_from, Utf8 | LargeUtf8) => Some(type_into.clone()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LargeUtf8 has 64 bit length, Utf8View only got 32 bit length.
Loos like LargeUtf8 -> Utf8View
is not possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The arrow cast kernel supports this conversion
Though I am not sure if we should allow the automatic coercion as it could potentially fail if the strings were over 2GB 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, the semantics is < 32 bit length LargeUtf8
conversion is supported, otherwise, some runtime error will occur. This makes sense.
let result = arrow::compute::kernels::comparison::starts_with(left, right)?; | ||
|
||
pub fn starts_with(args: &[ArrayRef]) -> Result<ArrayRef> { | ||
let result = arrow::compute::kernels::comparison::starts_with(&args[0], &args[1])?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I think this is a great suggestion @2010YOUY01 -- I can help with this too Here are the instructions: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest Ideally you should be able to extend |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective this PR could be merged as is (thank you @tshauck @dharanad @2010YOUY01 for the work)
I plan to leave it open until @tshauck has a chance to respond to the outstanding comments. While I think they could be done as follow on work, @2010YOUY01 has a good point that if this PR will be used as an example / template for how to add native StringView support getting all the changes (esp the testing) in a single PR has a lot of value
Co-authored-by: Yongting You <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Thanks @alamb @2010YOUY01 @dharanad for the feedback. Please give me a bit to add additional tests in this PR, then I'll re-request review. |
🚀 |
Which issue does this PR close?
Closes #11786
Rationale for this change
Utf8Views don't appear to be supported
starts_with
yet.What changes are included in this PR?
Update the type signature of the UDF and percolate the new data type support in the arrow interactions.
Are these changes tested?
Yes, added some unittests and did manual testing via the CLI.
Are there any user-facing changes?
No