deduplicate variadic buffers in MutableArrayData::extend for ByteView arrays #6808
_ => vec![],
let (variadic_data_buffers, buffer_to_idx) = match &data_type {
    DataType::BinaryView | DataType::Utf8View => {
        let mut buffer_to_idx = HashMap::new();
I wonder if building a hashmap / vec would be overly expensive (though we would need to run benchmarks to be sure)
Happy to run benchmarks. Do you have any particular ones in mind, or should I create one with criterion specific to this?
I think the ones in cast are probably a good place to start
@alamb @tustvold I did add a string view case for the interleave benchmark and ran on main, this PR (interleave-deduplicated), and #6779 (interleave-specific-impl)
I believe the penalty introduced by this PR would be mitigated for interleave's case if we also merge #6779, for other cases it feels like the read / transfer over the wire improvements might outweigh the cost. Happy to hear your thoughts
Thank you @onursatici -- I hope to find time to review this PR this weekend or early next week
I (again) apologize for the delay in reviewing this PR. We are stretched quite thin as always
In general, I think this PR needs some tests to show it is working as well as ensure we don't break this functionality with some future PR.
Thank you for running the benchmarks. They seem promising and I will give them a more careful look if we proceed with this PR
@alamb no worries and thank you for having a look. I added some tests now checking the deduplication and remapping behaviour, let me know whenever you have time if this looks good, happy holidays!
I have merged #6779 now. I think one of the potential performance concerns is that
Thank you for looking into this, I am inclined to agree with your assessment that the returns of this are probably not worthwhile to include as part of the general purpose MutableArrayData. I do think this sort of optimisation is possibly relevant in some places, e.g. DataFusion when coalescing multiple RecordBatch, but potentially something to be included as part of a more holistic rework of how StringViewArray "compaction" occurs. I am not sure where that leaves this PR, but I would be inclined to close it.
Which issue does this PR close?
Closes #.
Rationale for this change
MutableArrayData adds all variadic buffers from input arrays together, potentially duplicating the same buffers in the output array.
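To make the duplication concrete, here is a minimal sketch of the idea using only the standard library. It is not the arrow-rs implementation: `View`, `Input`, and `concat_dedup` are hypothetical stand-ins, buffers are modelled as `Arc<Vec<u8>>`, and buffer identity is assumed to mean "same allocation". Inline views (short values that reference no data buffer) are ignored for brevity.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for an out-of-line view: which data buffer it
/// points into, at what byte offset, and for how many bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct View {
    pub buffer_index: u32,
    pub offset: u32,
    pub len: u32,
}

/// Hypothetical stand-in for one input array: its data buffers and views.
pub struct Input {
    pub buffers: Vec<Arc<Vec<u8>>>,
    pub views: Vec<View>,
}

/// Concatenate the inputs, keeping a single copy of each distinct buffer
/// and rewriting every view's buffer index to the deduplicated position.
/// Assumption: two buffers are "the same" when they share an allocation.
pub fn concat_dedup(inputs: &[Input]) -> (Vec<Arc<Vec<u8>>>, Vec<View>) {
    let mut out_buffers: Vec<Arc<Vec<u8>>> = Vec::new();
    let mut buffer_to_idx: HashMap<*const Vec<u8>, u32> = HashMap::new();
    let mut out_views: Vec<View> = Vec::new();

    for input in inputs {
        // Per-input table: local buffer index -> output buffer index.
        let mut remap: Vec<u32> = Vec::with_capacity(input.buffers.len());
        for buffer in &input.buffers {
            let key = Arc::as_ptr(buffer);
            let idx = match buffer_to_idx.get(&key) {
                // Already present in the output: reuse its index.
                Some(&idx) => idx,
                // First occurrence: append it and remember where it went.
                None => {
                    let idx = out_buffers.len() as u32;
                    out_buffers.push(Arc::clone(buffer));
                    buffer_to_idx.insert(key, idx);
                    idx
                }
            };
            remap.push(idx);
        }

        // Copy the views, pointing each at its deduplicated buffer.
        for view in &input.views {
            out_views.push(View {
                buffer_index: remap[view.buffer_index as usize],
                ..*view
            });
        }
    }

    (out_buffers, out_views)
}
```

Without the `buffer_to_idx` lookup, every input would contribute all of its buffers again, so two slices of the same array would carry two references to the same data. The lookup and per-input remap table are also the extra work whose cost the review asks to benchmark.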
What changes are included in this PR?
extend now checks whether the same buffer has already been added from another input array, and rewrites the views being appended so that they point to the new, deduplicated buffer indices.
Are there any user-facing changes?