Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: substr() on StringView column's behavior is inconsistent with the old version #12383

Merged
merged 1 commit into from
Sep 10, 2024

Conversation

2010YOUY01
Copy link
Contributor

@2010YOUY01 2010YOUY01 commented Sep 8, 2024

Which issue does this PR close?

N.A.

Rationale for this change

Recently native StringView type support is implemented on substr() string function, but its behaviour is not consistent with the older implementation on StringArray string physical column type
The function semantics on the original StringArray type is consistent with postgres https://www.postgresql.org/docs/9.1/functions-string.html

The postgres version of substr(s, start, count)'s behavior is: start is 1-indexed, then we have a maybe out of bounds , and 0-indexed character range [start-1, start-1+count) , the valid char range is [0, s.chars().count()), and this function will take the intersection of two.

They handle negative start index differently and also handle invalid range differently

DataFusion CLI v41.0.0
> select substr('foo', -2,4), substr(arrow_cast('foo', 'Utf8View'), -2, 4);
+----------------------------------------+---------------------------------------------------------------------+
| substr(Utf8("foo"),Int64(-2),Int64(4)) | substr(arrow_cast(Utf8("foo"),Utf8("Utf8View")),Int64(-2),Int64(4)) |
+----------------------------------------+---------------------------------------------------------------------+
| f                                      | foo                                                                 |
+----------------------------------------+---------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.104 seconds.

> select substr('', -10, 5)='', substr(arrow_cast('', 'Utf8View'), -10, 5)='';
+-------------------------------------------------+------------------------------------------------------------------------------+
| substr(Utf8(""),Int64(-10),Int64(5)) = Utf8("") | substr(arrow_cast(Utf8(""),Utf8("Utf8View")),Int64(-10),Int64(5)) = Utf8("") |
+-------------------------------------------------+------------------------------------------------------------------------------+
| true                                            |                                                                              |
+-------------------------------------------------+------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.010 seconds.

What changes are included in this PR?

substr's behavior is already consistent with postgres if input is StringArray
This PR changes implementation to keep StringView version also consistent with postgres

Are these changes tested?

Added more sqllogictests

Are there any user-facing changes?

No

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Sep 8, 2024
match args.len() {
1 => {
for (idx, (raw, start)) in string_view_array
.views()
for ((str_opt, raw_view), start_opt) in string_view_array
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old implementation will treat empty string in view as NULL (within make_and_append_view()), so it handled NULL case correctly but handled the empty string case wrong.
Since it's not possible to use view to check if the current element is null, so here we also have to iterate on the string array to check whether each element is NULL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential thing that might be faster is to iterate on the null array directly (rather than the Option<&str>) so the code doesn't have to figure out where the &str comes from

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential thing that might be faster is to iterate on the null array directly (rather than the Option<&str>) so the code doesn't have to figure out where the &str comes from

Great point, there are also other potential optimizations like specializing not null columns, scalar literal arguments, and ASCII case. So I prefer only to make a simple and correct implementation in this PR, and evaluate those optimizations as a future task

@alamb
Copy link
Contributor

alamb commented Sep 9, 2024

FYI @Kev1n8 who I think added substring support in #12044

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @2010YOUY01 and @dmitrybugakov for the review

match args.len() {
1 => {
for (idx, (raw, start)) in string_view_array
.views()
for ((str_opt, raw_view), start_opt) in string_view_array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential thing that might be faster is to iterate on the null array directly (rather than the Option<&str>) so the code doesn't have to figure out where the &str comes from

@@ -118,15 +118,37 @@ pub fn substr(args: &[ArrayRef]) -> Result<ArrayRef> {
}
}

// Return the exact byte index for [start, end), set count to -1 to ignore count
fn get_true_start_end(input: &str, start: usize, count: i64) -> (usize, usize) {
// Convert the given `start` and `count` to valid byte indices within `input` string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍 for the comments

Copy link
Contributor

@Kev1n8 Kev1n8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing my mistakes and all the improvements!

match args.len() {
1 => {
for (idx, (raw, start)) in string_view_array
.views()
for ((str_opt, raw_view), start_opt) in string_view_array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.

@alamb
Copy link
Contributor

alamb commented Sep 9, 2024

Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.

Agreed -- thanks @2010YOUY01 (and sorry I didn't catch this in review either)

In general I wonder if there is a better way to test the string functions (since many of them now have multiple implementations for different input type combinations, but each implementation should get the same answer)

Comment on lines +178 to +183
let sub_view = if substr_len > 12 {
let view = ByteView::from(*raw);
make_view(substr.as_bytes(), view.buffer_index, view.offset + start)
} else {
let sub_view = if substr_len > 12 {
let view = ByteView::from(*raw);
make_view(substr.as_bytes(), view.buffer_index, view.offset + start)
} else {
// inline value does not need block id or offset
make_view(substr.as_bytes(), 0, 0)
};
views_buffer.push(sub_view);
null_builder.append_non_null();
}
// inline value does not need block id or offset
make_view(substr.as_bytes(), 0, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the inline check here is duplicated with the one within make_view?
I guess inline len(12) may be the internal logic in string view, and we should not expose and use it in datafusion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say the point we are checking size here is to help with the operation of "modifying the views directly". But I agree that maybe we could add more API upstream in arrow to hide the logic.

Copy link
Contributor

@Rachelint Rachelint Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say the point we are checking size here is to help with the operation of "modifying the views directly". But I agree that maybe we could add more API upstream in arrow to hide the logic.

Maybe we can just let it run as same as the > 12 branch? ByteView::from(*raw) seems cheap, maybe it is not more expensive than a if else? I am not sure.

Copy link
Contributor

@Kev1n8 Kev1n8 Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see what you mean. I think ByteVIew is not equal to inlined_view. The former one indicates the str whose size > 12, while the inlined str is represented by a simple u128.

Maybe we can just let it run as same as the > 12 branch?

It might actually work since from<u128> for ByteView would load meaningless(which are actually the str) prefix, buffer_index, offset, which does no harm. But I think it would be ambiguous. It's a good suggestion tho.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see what you mean. I think ByteVIew is not equal to inlined_view. The former one indicates the str whose size > 12, while the inlined str is represented by a simple u128.

Maybe we can just let it run as same as the > 12 branch?

It might actually work since from<u128> for ByteView would load meaningless prefix, buffer_index, offset, which does no harm. But I think it would be ambiguous. It's a good suggestion tho.

Ok, it is indeed more clear to distinguish inlined and not inlned.

@Rachelint
Copy link
Contributor

It is really cool, and also fix the potential problem in #12395 !

@2010YOUY01
Copy link
Contributor Author

Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.

Agreed -- thanks @2010YOUY01 (and sorry I didn't catch this in review either)

In general I wonder if there is a better way to test the string functions (since many of them now have multiple implementations for different input type combinations, but each implementation should get the same answer)

One thing to do is to duplicate existing SQL test cases and convert StringArray columns to StringView columns, I think not every PR for string function native StringView support did this 🤔
I can do a sample PR and open several issues for string functions missing StringView SQL level tests

@alamb
Copy link
Contributor

alamb commented Sep 10, 2024

One thing to do is to duplicate existing SQL test cases and convert StringArray columns to StringView columns, I think not every PR for string function native StringView support did this 🤔
I can do a sample PR and open several issues for string functions missing StringView SQL level tests

Thanks @2010YOUY01

The potential downside of doing that is that it will not evolve as we add new string functions (e.g. we add a new string function that doesn't handle StringView, how will we know?)

I wonder if we can potentially do something more automated like have the same queries and expected files and run them against different test cases.

I think we could do this in sqllogictest by separating the table definitions into different .slt files and then includeing the same queries file. Here is an example:

https://github.com/apache/datafusion/blob/02eab80cd62e02fcb68dee8b99d63aaac680a66c/datafusion/sqllogictest/test_files/join_disable_repartition_joins.slt#L26-L25

@alamb
Copy link
Contributor

alamb commented Sep 10, 2024

I filed #12415 to track the idea of a more systematic testing

@alamb alamb merged commit 376a0b8 into apache:main Sep 10, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 10, 2024

Thanks again @2010YOUY01 @dmitrybugakov @Kev1n8 and @Rachelint

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants