Initial support for regex_replace on StringViewArray
#11556
Thanks @XiangpengHao -- I think the code looks reasonable. I am not quite sure about the comments
Implement very limited support for regex_replace on StringViewArray.
What do you mean by "very limited"? Do you mean that there is potentially more room for optimization? Or that the functionality is limited in some way (e.g. it will error or panic on some inputs)?
It seems to me the difference is "could be faster", but I may be missing something.
    let pattern = fetch_string_arg!(
        &args[1],
        "pattern",
-       T,
+       i32,
I don't understand why this was changed from T to i32 🤔
Because the second input must always be Utf8 (as shown in the Signature above). I'm trying to make T apply only to arg[0], so that later we can test different types for arg[0], e.g., Utf8View or LargeUtf8.
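To illustrate the intent described above, here is a minimal sketch in which only the first argument is generic and the pattern and replacement are always plain strings. Everything in it is an assumption made for illustration: `StringColumn`, `Utf8Column`, `ViewColumn`, and `replace_all` are hypothetical stand-ins, not Arrow or DataFusion types, and literal `str::replace` is swapped in for real regex matching so the example needs only the standard library.

```rust
/// Hypothetical stand-in for "some kind of string array" (not an Arrow type).
trait StringColumn {
    fn len(&self) -> usize;
    fn value(&self, i: usize) -> Option<&str>;
}

/// Stand-in for a Utf8-like array: one owned string per row.
struct Utf8Column(Vec<Option<String>>);

/// Stand-in for a Utf8View-like array: (offset, length) views into a shared buffer.
struct ViewColumn {
    buffer: String,
    views: Vec<Option<(usize, usize)>>,
}

impl StringColumn for Utf8Column {
    fn len(&self) -> usize {
        self.0.len()
    }
    fn value(&self, i: usize) -> Option<&str> {
        self.0[i].as_deref()
    }
}

impl StringColumn for ViewColumn {
    fn len(&self) -> usize {
        self.views.len()
    }
    fn value(&self, i: usize) -> Option<&str> {
        self.views[i].map(|(off, len)| &self.buffer[off..off + len])
    }
}

/// Generic only over the first argument; `pattern` and `replacement`
/// have one fixed type no matter which column type is passed in.
/// NOTE: literal replacement is used here instead of a regex.
fn replace_all<C: StringColumn>(
    input: &C,
    pattern: &str,
    replacement: &str,
) -> Vec<Option<String>> {
    (0..input.len())
        .map(|i| input.value(i).map(|s| s.replace(pattern, replacement)))
        .collect()
}
```

With this shape, supporting another input representation only means adding another `StringColumn` implementation; the pattern handling does not change.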
    let result_array = GenericStringArray::<T>::from(data);
    Ok(Arc::new(result_array) as ArrayRef)
    DataType::Utf8View => {
        let string_view_array = as_string_view_array(&args[0])?;
Is the idea that we could specialize this implementation to potentially do in-place updates or something similar when the string replacement allows it (e.g. replacing larger strings with smaller ones)?
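The in-place idea raised in this question could look roughly like the sketch below. It is not what the PR does. It assumes a heavily simplified, hypothetical `Views` type (one byte buffer plus `(offset, length)` pairs), exclusive ownership of the buffer, and views that never overlap. Real Arrow string views are 16-byte structs that inline short strings, which is omitted here.

```rust
/// Hypothetical, simplified model of a string view array.
struct Views {
    buffer: Vec<u8>,
    views: Vec<(usize, usize)>, // (offset, length) into `buffer`
}

impl Views {
    fn value(&self, i: usize) -> &str {
        let (off, len) = self.views[i];
        std::str::from_utf8(&self.buffer[off..off + len]).expect("valid UTF-8")
    }

    /// Overwrite row `i` in place when `new` fits into its existing slot.
    /// Returns false and leaves the row untouched when `new` is longer,
    /// in which case a caller would have to fall back to copying.
    fn try_replace_in_place(&mut self, i: usize, new: &str) -> bool {
        let (off, len) = self.views[i];
        if new.len() > len {
            return false;
        }
        self.buffer[off..off + new.len()].copy_from_slice(new.as_bytes());
        // Shrink the view; the leftover bytes in the slot become unused.
        self.views[i] = (off, new.len());
        true
    }
}
```

The leftover bytes are never reclaimed in this sketch, so repeated shrinking would leave the buffer increasingly sparse.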
@@ -54,6 +57,7 @@ impl RegexpReplaceFunc {
             signature: Signature::one_of(
                 vec![
                     Exact(vec![Utf8, Utf8, Utf8]),
+                    Exact(vec![Utf8View, Utf8, Utf8]),
Here I added Exact(vec![Utf8View, Utf8, Utf8]), but I wonder if other combinations might also work, e.g., Exact(vec![LargeUtf8, Utf8, Utf8]), Exact(vec![Utf8View, Utf8, LargeUtf8]), etc.
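As a rough model of why every combination has to be listed, the sketch below treats a signature as a list of exact argument-type tuples. The `DataType` enum, `Signature` struct, and `regexp_replace_signature` function are hypothetical simplifications written for this note, not DataFusion's real types, and only the two entries visible in the diff are modeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical model of "one of several exact signatures".
struct Signature {
    accepted: Vec<Vec<DataType>>,
}

impl Signature {
    /// An argument list is accepted only if it equals one listed tuple exactly.
    fn accepts(&self, args: &[DataType]) -> bool {
        self.accepted.iter().any(|sig| sig.as_slice() == args)
    }
}

/// Only the two entries visible in the diff above are modeled here.
fn regexp_replace_signature() -> Signature {
    use DataType::*;
    Signature {
        accepted: vec![
            vec![Utf8, Utf8, Utf8],
            vec![Utf8View, Utf8, Utf8],
        ],
    }
}
```

Under exact matching like this, three string types across three arguments would need up to 27 entries, which is the kind of growth the question is pointing at.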
Yeah -- this is the kind of thing that I think the LogicalType proposal #11513 (https://github.com/notfilippo) would make easier.
Sorry, I could have been more explicit. I added more comments; let me know your thoughts!
Looks good to me -- thank you @XiangpengHao
… some ClickBench queries (not on by default) (#11667)
* Pin to pre-release version of arrow 52.2.0
* Update for deprecated method
* Add a config to force using string view in benchmark (#11514)
* add a knob to force string view in benchmark
* fix sql logic test
* update doc
* fix ci
* fix ci only test
* Update benchmarks/src/util/options.rs Co-authored-by: Andrew Lamb <[email protected]>
* Update datafusion/common/src/config.rs Co-authored-by: Andrew Lamb <[email protected]>
* update tests --------- Co-authored-by: Andrew Lamb <[email protected]>
* Add String view helper functions (#11517)
* add functions
* add tests for hash util
* Add ArrowBytesViewMap and ArrowBytesViewSet (#11515)
* Update `string-view` branch to arrow-rs main (#10966)
* Pin to arrow main
* Fix clippy with latest arrow
* Uncomment test that needs new arrow-rs to work
* Update datafusion-cli Cargo.lock
* Update Cargo.lock
* tapelo
* merge
* update cast
* consistent dep
* fix ci
* add more tests
* make doc happy
* update new implementation
* fix bug
* avoid unused dep
* update dep
* update
* fix cargo check
* update doc
* pick up the comments change again --------- Co-authored-by: Andrew Lamb <[email protected]>
* Enable `GroupValueBytesView` for aggregation with StringView types (#11519)
* add functions
* Update `string-view` branch to arrow-rs main (#10966)
* Pin to arrow main
* Fix clippy with latest arrow
* Uncomment test that needs new arrow-rs to work
* Update datafusion-cli Cargo.lock
* Update Cargo.lock
* tapelo
* merge
* update cast
* consistent dep
* fix ci
* avoid unused dep
* update dep
* update
* fix cargo check
* better group value view aggregation
* update --------- Co-authored-by: Andrew Lamb <[email protected]>
* Initial support for regex_replace on `StringViewArray` (#11556)
* initial support for string view regex
* update tests
* Add support for Utf8View for date/temporal codepaths (#11518)
* Add StringView support for date_part and make_date funcs
* run cargo update in datafusion-cli
* cargo fmt --------- Co-authored-by: Andrew Lamb <[email protected]>
* GC `StringViewArray` in `CoalesceBatchesStream` (#11587)
* gc string view when appropriate
* make clippy happy
* address comments
* make doc happy
* update style
* Add comments and tests for gc_string_view_batch
* better herustic
* update test
* Update datafusion/physical-plan/src/coalesce_batches.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]>
* [Bug] fix bug in return type inference of `utf8_to_int_type` (#11662)
* fix bug in return type inference
* update doc
* add tests --------- Co-authored-by: Andrew Lamb <[email protected]>
* Fix clippy
* Increase ByteViewMap block size to 2MB (#11674)
* better default block size
* fix related test
* Change `--string-view` to only apply to parquet formats (#11663)
* use inferenced schema, don't load schema again
* move config to parquet-only
* update
* update
* better format
* format
* update
* Implement native support StringView for character length (#11676)
* native support for character length
* Update datafusion/functions/src/unicode/character_length.rs --------- Co-authored-by: Andrew Lamb <[email protected]>
* Remove uneeded patches
* cargo fmt --------- Co-authored-by: Xiangpeng Hao <[email protected]> Co-authored-by: Xiangpeng Hao <[email protected]> Co-authored-by: Andrew Duffy <[email protected]>
(targets string-view2 branch)
Which issue does this PR close?
part of #11025
Rationale for this change
Implement very limited support for regex_replace on StringViewArray.
It is the minimal implementation needed to run ClickBench queries. I believe fully supporting StringView requires some discussion and much more work. This PR should help us understand the remaining work.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?