-
Notifications
You must be signed in to change notification settings - Fork 846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent Slicing of ArrayData for StructArray #1750
Comments
Hi @tustvold
fn from(data: ArrayData) -> Self {
let boxed_fields = data
.child_data()
.iter()
.map(|cd| make_array(cd.slice(data.offset(), data.len())))
.collect();
Self { data, boxed_fields }
} I am afraid this cannot solve the problem. Because
After calling
We still have 3
pub fn value_offsets(&self) -> &[OffsetSize] {
unsafe {
std::slice::from_raw_parts(
self.value_offsets.as_ptr().add(self.data.offset()),
self.len() + 1,
)
}
} Slice when getting the child array. (API change, because we own the pub fn column(&self, pos: usize) -> ArrayRef {
self.boxed_fields[pos].slice(self.offset, self.length)
}
|
The suggestion is never to push offsets down within ArrayData, and so you should never have this ambiguity? When constructing an instance of StructArray you "hydrate" the offsets to boxed fields to avoid having to recompute slices on access, but this does not modify the underlying ArrayData? |
Although we don't push down the offset within As a result, I guess your suggestions (slicing child arrays within |
Aah, I see the confusion. I'm suggesting storing slices of the child data as boxed_fields on StructArray. So in StructArray::from it would inspect the ArrayData, compute slices, and store those as members. I am not suggesting StructArray::from make any modifications to ArrayData. |
Implement as this: fn from(data: ArrayData) -> Self {
let boxed_fields = data
.child_data()
.iter()
.map(|cd| make_array(cd.slice(data.offset(), data.len())))
.collect();
Self { data, boxed_fields }
} ? |
I think that should work, but there could be some arcane edge case I've failed to consider |
This will not modify the parent ArrayData, which is great. |
I don't find |
We have the same "inconsistency" on other array types, e.g.
Yeah this is just a usability thing, the Rust implementation is just calling through to
I would definitely support this change, and it would be the better long-term solution. It has been discussed at length, all that really remains is someone to volunteer to do it 😅. The only outstanding sticking point has been handling offsets for BooleanArrays, but I don't imagine there would be resistance to special casing these as a Bitmap in ArrayData... |
cc @jhorstmann and @bjchambers who I think have interest in this area |
Sorry for being late to this discussion, the proposed change makes sense to me. If I understand it correctly, this would be a silent behavior change of the |
Is this not a bug? |
Difficult to say without a spec or comment describing the expected behavior. Someone might rely on the current behavior and acccess a StructArray using something like |
I've read this thread a few times, but I'm still hazy on what a good approach is, given how out of the loop I have been for so long. We had discussed with Jorge many moons ago that passing the offset and length to Buffer and Bitmap would be a good solution (as is done in arrow2 like you mention @tustvold). I haven't written arrow code in very long, so I can't quite remember the details. However, what I recall was having a challenge figuring out what happens in the below scenario. An array is of type The trouble was that if we don't pass down the offset and length to the So in principle I favoured pushing down the offset at the time. Which I suppose has led us here:
This makes sense to implement as a solution (interim?), but yea perhaps first-prize would be propagating offsets and value lengths to a redesigned Buffer and Bitmap |
I think caching the sliced array on
Agreed, and I think the first step of this is removing the remaining places that "leak" offsets into the Array-level interfaces 😄 |
Closing in favor of removing the offsets entirely (#1799) |
Closed by #4061 |
Problem
ArrayData::Slice contains a special case for StructArray where it recurses the offset into its children. However, it preserves the offset on the parent ArrayData, in order for the validity buffer to work correctly.
The result is the offset has been pushed down to the child arrays, but is still present on the parent ArrayData. This makes it unclear if an offset should or should not be applied to the child_array.
Proposal
There are longer term suggestions around handling offsets in ArrayData differently, but until then I would like to propose:
StructArray
when constructingboxed_fields
This is consistent with how we handle offsets elsewhere, with ArrayData a dumb container, and the
Array
providing the logic to interpret the offset correctly.Thoughts @nevi-me
The text was updated successfully, but these errors were encountered: