-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43573: [C++] Copy bitmap when casting from string-view to offset string and binary types #44822
base: main
Are you sure you want to change the base?
GH-43573: [C++] Copy bitmap when casting from string-view to offset string and binary types #44822
Conversation
|
if (input.offset == output->offset) { | ||
output->buffers[0] = input.GetBuffer(0); | ||
} else { | ||
if (input.buffers[0].data != NULLPTR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// When the offsets are different (e.g., due to slice operation), we need to check if
// the null bitmap buffer is not null before copying it. The null bitmap buffer can be
// null if the input array value does not contain any null value.
Do we also need a comment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you copy and paste this utility function [1] to this compilation unit and call it from here instead?
[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc#L42
(Later the utility could be moved to a .h so it's callable from anywhere and inlinable. But I'm suggesting a copy because it's tricky to name this function in an informative and non-error-prone way.)
} | ||
|
||
// Set up offset and data buffer | ||
DataBuilder data_builder(ctx->memory_pool()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous implementation caculate a sum_of_binary_view_sizes
, and ReserveData
for it. Why did here doesn't use same way to reserve data? Would blindly check append the buffer being faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this optimization in the original code. Just added it back. With this change, I am also updating the data_builder.Append
call to UnsafeAppend
Co-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Asking for some tweaks.
if (input.offset == output->offset) { | ||
output->buffers[0] = input.GetBuffer(0); | ||
} else { | ||
if (input.buffers[0].data != NULLPTR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you copy and paste this utility function [1] to this compilation unit and call it from here instead?
[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc#L42
(Later the utility could be moved to a .h so it's callable from anywhere and inlinable. But I'm suggesting a copy because it's tricky to name this function in an informative and non-error-prone way.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General LGTM, thanks!
RETURN_NOT_OK(VisitArraySpanInline<I>( | ||
batch[0].array, | ||
[&](std::string_view s) { | ||
// for non-null value, append string view to buffer and calculate offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does calculate offset means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taking it from the length of the buffer builder after the append
@mapleFU don't you want take this to the finish line? Unless @CrystalZhou0529 is available to implement the final changes. |
Thanks for reviewing it! Sorry for falling behind on this PR. I will implement the final changes now. |
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
Hi @felipecrv @mapleFU, I have just committed the requested changes. Please take another look and let me know if it looks reasonable! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General LGTM!
return in_array.GetBuffer(0); | ||
} | ||
|
||
// If a non-zero offset, we need to shift the bitmap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felipecrv Just curious can here uses Slicing Buffer if offset % 8 == 0
Rationale for this change
Use
CopyBitmap
to optimize performance in string casting from string-view to offset string.What changes are included in this PR?
Originally, the way we create the bitmap is by appending one bit at a time, which is slow. Since casting should not change the values in bitmap, this feature takes advantage of
CopyBitmap
to create the entire bitmap at once.Then, to create offsets and buffer array, I use
TypedBufferBuilder
as suggested in the original comment #43302 (comment).Are these changes tested?
The original unit tests have passed.
Are there any user-facing changes?
No, the casting behavior should remain unchanged.
closes #43573