Support Utf8View and BinaryView in substrait serialization. #12199
/// Arrow-cast does not currently handle direct casting from utf8 to binaryView.
#[tokio::test]
async fn binaryview_type_literal_needs_casting_fix() -> Result<()> {
    let err = roundtrip_all_types(
        "select * from data where
        view_binary_col = arrow_cast('binary_view', 'BinaryView');",
    )
    .await;

    assert!(
        matches!(err, Err(e) if e.to_string().contains("Unsupported CAST from Utf8 to BinaryView"))
    );
    Ok(())
Looks like we have a few missing arrow_cast implementations for BinaryView (explicit casting). Going to file a ticket in arrow and put up a PR; I'll be assessing possible changes in cast_with_options and can_cast_types.
Note that datafusion's type coercion was previously updated to prefer coercion to the view types. It's the explicit casting that has coverage gaps.
I see sqllogictests that demonstrate what is supported by arrow_cast. My follow-ups will then be: (a) add sqllogictests showing what is, and is not, supported for the new view types, and then (b) make the upstream arrow-rs changes (with some correctness guidance during code review).
Sqllogictests added: #12200
Turns out the arrow-cast changes are already made, but not in the current release used in datafusion.
We have now updated to the latest arrow-rs (#12032), so we'll have the correct code.
Sweet. I've removed the work-arounds and deleted this (no longer applicable) test. Thank you.
@@ -716,12 +716,29 @@ async fn all_type_literal() -> Result<()> {
    date32_col = arrow_cast('2020-01-01', 'Date32') AND
    binary_col = arrow_cast('binary', 'Binary') AND
    large_binary_col = arrow_cast('large_binary', 'LargeBinary') AND
    view_binary_col = arrow_cast(arrow_cast('binary_view', 'Binary'), 'BinaryView') AND
See the test binaryview_type_literal_needs_casting_fix() below for the reason behind the double casting.
I removed this workaround in a5bfedd
@@ -52,6 +52,7 @@ pub const DATE_32_TYPE_VARIATION_REF: u32 = 0;
pub const DATE_64_TYPE_VARIATION_REF: u32 = 1;
pub const DEFAULT_CONTAINER_TYPE_VARIATION_REF: u32 = 0;
pub const LARGE_CONTAINER_TYPE_VARIATION_REF: u32 = 1;
pub const VIEW_CONTAINER_TYPE_VARIATION_REF: u32 = 2;
FWIW, hardcoding the numbers isn't really the proper way to do type variations. (Rather we should add the variation as an extension and refer to the extension's id.) However, given this is already used for default vs large, I guess adding view makes sense - and they can all be migrated at once to the proper way someday.
Thank you. I'll craft a follow up ticket later today, and link here (for future reference).
I filed #12355 to track this.
@@ -2351,8 +2371,10 @@ mod test {
    round_trip_type(DataType::Binary)?;
    round_trip_type(DataType::FixedSizeBinary(10))?;
    round_trip_type(DataType::LargeBinary)?;
    round_trip_type(DataType::BinaryView)?;
👍
Which issue does this PR close?
Closes #12118
Rationale for this change
We have two new view data types, Utf8View and BinaryView. Support in datafusion is part of this epic, and this specific PR is about adding support for the (de-)serialization of logical and physical plans into the substrait format.
This PR adds new substrait variations on existing type classes. For example, the substrait "string" class can carry different variations representing different physical types (Utf8, LargeUtf8, or Utf8View). If we serialize a string using variation=2 (i.e. the view physical type), then deserializing variation=2 gives us back Utf8View. More background is given here.
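As a rough illustration of the variation round trip described above, here is a minimal, self-contained sketch. It reuses the three constant names from this PR, but `PhysicalType`, `TypeClass`, `to_substrait`, and `from_substrait` are hypothetical stand-ins invented for the example; they are not the actual DataFusion or substrait APIs.

```rust
// Variation references, using the same names and values as the constants in this PR.
pub const DEFAULT_CONTAINER_TYPE_VARIATION_REF: u32 = 0;
pub const LARGE_CONTAINER_TYPE_VARIATION_REF: u32 = 1;
pub const VIEW_CONTAINER_TYPE_VARIATION_REF: u32 = 2;

/// Hypothetical stand-in for the Arrow physical types involved.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Utf8,
    LargeUtf8,
    Utf8View,
    Binary,
    LargeBinary,
    BinaryView,
}

/// Hypothetical stand-in for a substrait type class.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeClass {
    String,
    Binary,
}

/// Serialize: pick the type class plus the variation that records the physical type.
pub fn to_substrait(t: PhysicalType) -> (TypeClass, u32) {
    match t {
        PhysicalType::Utf8 => (TypeClass::String, DEFAULT_CONTAINER_TYPE_VARIATION_REF),
        PhysicalType::LargeUtf8 => (TypeClass::String, LARGE_CONTAINER_TYPE_VARIATION_REF),
        PhysicalType::Utf8View => (TypeClass::String, VIEW_CONTAINER_TYPE_VARIATION_REF),
        PhysicalType::Binary => (TypeClass::Binary, DEFAULT_CONTAINER_TYPE_VARIATION_REF),
        PhysicalType::LargeBinary => (TypeClass::Binary, LARGE_CONTAINER_TYPE_VARIATION_REF),
        PhysicalType::BinaryView => (TypeClass::Binary, VIEW_CONTAINER_TYPE_VARIATION_REF),
    }
}

/// Deserialize: recover the physical type, or None for an unknown variation.
pub fn from_substrait(class: TypeClass, variation: u32) -> Option<PhysicalType> {
    match (class, variation) {
        (TypeClass::String, DEFAULT_CONTAINER_TYPE_VARIATION_REF) => Some(PhysicalType::Utf8),
        (TypeClass::String, LARGE_CONTAINER_TYPE_VARIATION_REF) => Some(PhysicalType::LargeUtf8),
        (TypeClass::String, VIEW_CONTAINER_TYPE_VARIATION_REF) => Some(PhysicalType::Utf8View),
        (TypeClass::Binary, DEFAULT_CONTAINER_TYPE_VARIATION_REF) => Some(PhysicalType::Binary),
        (TypeClass::Binary, LARGE_CONTAINER_TYPE_VARIATION_REF) => Some(PhysicalType::LargeBinary),
        (TypeClass::Binary, VIEW_CONTAINER_TYPE_VARIATION_REF) => Some(PhysicalType::BinaryView),
        _ => None,
    }
}
```

In this sketch, `to_substrait(PhysicalType::Utf8View)` yields the string class with variation 2, and feeding that pair back into `from_substrait` recovers `Utf8View`. An unrecognized variation is reported as `None` rather than silently falling back to the default type.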
What changes are included in this PR?
Are these changes tested?
Logical plan: The Utf8View and BinaryView are covered in the logical plan roundtrip serialization tests.
Physical plan: However, the physical plan roundtrip serialization tests are not yet implemented. There is an ongoing epic to finish the physical plan serialization. As such, I added code for the physical plan substrait handling of Utf8View and BinaryView (to avoid incurring more tech debt) -- but this code is not tested.
Are there any user-facing changes?
No API contract change.
Removes the unimplemented errors previously raised when using these new data types in substrait serialization.