-
Notifications
You must be signed in to change notification settings - Fork 837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Full UnionArray validation #1444
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1444 +/- ##
==========================================
+ Coverage 82.71% 82.73% +0.02%
==========================================
Files 187 187
Lines 54255 54365 +110
==========================================
+ Hits 44877 44979 +102
- Misses 9378 9386 +8
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks correct to me.
arrow/src/array/data.rs
Outdated
self.validate_offsets_full::<i8>(self.child_data.len())?; | ||
} | ||
UnionMode::Dense => { | ||
self.validate_dense_union_full()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to suggest that this should validate that the null bitmasks are disjoint, but this may not even be a requirement - the specification says "All “unselected” values are ignored and could be any semantically correct array value."
arrow/src/array/data.rs
Outdated
@@ -1117,6 +1119,44 @@ impl ArrayData { | |||
) | |||
} | |||
|
|||
/// Ensures that for each union element, the offset is correct for | |||
/// the corresponding child array | |||
fn validate_dense_union_full(&self) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think should also check that offsets are monotonic for a given array type, but that could definitely be left as a todo
arrow/src/array/data.rs
Outdated
// https://github.com/apache/arrow-rs/issues/85 | ||
// | ||
// TODO file follow on ticket for full union validation | ||
DataType::Union(_fields, mode) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm missing something, we should probably also add buffer length checks into ArrayData::validate
as I don't think these are currently present anywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean "buffer length checks"? Which buffers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this got lost in my various rewording of this, I mean child arrays 🤦♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The length of the ArrayData::buffers
are checked as part of the 'layout' in ArrayData::validate
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L1210
The child arrays ArrayData::child_data
are validated recursively in ArrayData::validate_child_data
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L788
arrow/src/array/data.rs
Outdated
/// Ensures that for each union element, the offset is correct for | ||
/// the corresponding child array | ||
fn validate_dense_union_full(&self) -> Result<()> { | ||
// safety justification is that the size of the buffers was validated in self.validate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could potentially make validate
also check that all child arrays have the length of the parent in the case of a dense representation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think actually that all child arrays need to have the same length for a sparse representation (which is checked in cheap validate
)
Which is covered by test_validate_sparse_union_different_child_len
arrow/src/array/data.rs
Outdated
child_offset, i)) | ||
})?; | ||
|
||
if let Some(last_offset) = last_offset.as_ref() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect, the requirement is that for a given type the offsets must be increasing. But it is perfectly correct, in fact desirable, for the offsets array itself to not be increasing.
e.g. the following would be valid
types: [0, 1, 1, 0],
offsets: [0, 0, 1, 0],
Otherwise the dense representation is no different from the sparse representation
#[should_panic( | ||
expected = "Offset invariant failure: position 2 invalid. 0 should be >= 1" | ||
)] | ||
fn test_validate_dense_union_non_increasing_offset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test should not panic, it is valid as no single child-array has non-increasing offsets.
This is not valid
type_ids = [0, 0, 0],
offsets = [0, 1, 0],
But this is
type_ids = [0, 0, 1],
offsets = [0, 1, 0],
As is the example from https://arrow.apache.org/docs/format/Columnar.html#dense-union
type_ids = [0, 0, 0, 1]
offsets = [0, 1, 2, 0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 2769573
I am thinking I may try and add a CI check and a feature that runs all the tests with full data checking enabled to double check this isn't introducing any issues |
FWIW I tested this PR locally with full validation enabled and several tests failed.
Putting back to draft until is sort that out |
I plan to implement this in #1799 |
Draft (need to write tests and file a ticket)Which issue does this PR close?
Closes #1486
Rationale for this change
While reviewing #1412, I remembered to check the validation code for UnionArray and it turns out it was not implemented yet.
Since I had the union layout already loaded in my head, I figured I would pound out this PR:
What changes are included in this PR?
Validate all Union array data in
validate_full
Are there any user-facing changes?
More data validation. The APIs are all the same