feat: support array_insert #1073
Conversation
As I was able to realize,
@andygrove Sorry for tagging, but I have questions about the ticket (
Thanks in advance! This is my first serious attempt to contribute to the project, so sorry if I'm being annoying.
+ fix tests for spark < 3.4
- added test for the negative index
- added test for the legacy Spark mode
Thanks, @SemyonSinchenko. I think it's fine to skip the test for Spark 3.3. I plan on reviewing this PR in more detail tomorrow, but it looks good from an initial read.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1073      +/-   ##
============================================
- Coverage     34.46%   34.27%    -0.20%
  Complexity      888      888
============================================
  Files           113      113
  Lines         43580    43355     -225
  Branches       9658     9488     -170
============================================
- Hits          15021    14860     -161
- Misses        25507    25596      +89
+ Partials       3052     2899     -153

☔ View full report in Codecov by Sentry.
This PR is ready for review. The only failing check failed due to an internal GHA error: "GitHub Actions has encountered an internal error when running your job."
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala
Co-authored-by: Andy Grove <[email protected]>
I am not sure if this should be done together with this PR, but it would add some "free" tests. Spark 3.5 introduced array_prepend, which it implements with array_insert, and starting with 4.0 it also implements array_append with array_insert. If you want, you can copy the array_append tests and replace array_append with array_prepend for Spark 3.5+, and enable the array_append tests for Spark 4.0. You can also ignore this comment if you do not agree or if it leads to unrelated errors.
- fixes;
- tests;
- comments in the code;
In one case the index is zero, and the test fails due to a Spark error.
Thanks for the comments and suggestions! This PR is ready for review again. What changed since the last round of review:
These tests drew my attention to uncovered parts of the code, which I fixed, and I also added a couple of additional tests. At the moment this PR is tested in multiple ways:
So it looks to me that all the possible cases are covered and the behavior is the same as in Spark.
Done!
native/spark-expr/src/list.rs
let src_element_type = match src_value.data_type() {
    DataType::List(field) => field.data_type(),
    DataType::LargeList(field) => field.data_type(),
    data_type => {
        return Err(DataFusionError::Internal(format!(
            "Unexpected src array type in ArrayInsert: {:?}",
            data_type
        )))
    }
};
minor nit: this logic for extracting a list type is repeated a few times and could be factored out into a function
@andygrove Thanks for the suggestion!
I moved the check of the array type (and the error logic) into a method:
pub fn array_type(&self, data_type: &DataType) -> DataFusionResult<DataType> {
    match data_type {
        DataType::List(field) => Ok(DataType::List(Arc::clone(field))),
        DataType::LargeList(field) => Ok(DataType::LargeList(Arc::clone(field))),
        data_type => {
            return Err(DataFusionError::Internal(format!(
                "Unexpected src array type in ArrayInsert: {:?}",
                data_type
            )))
        }
    }
}
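A hypothetical call site (assuming src_value is the evaluated source column, as in the snippet above) would then look like:

let list_type = self.array_type(src_value.data_type())?;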
At least this avoids returning the same error in multiple places. Is this what you suggested? Or should I move this logic into a helper function and also refactor GetArrayStructFields to use it?
P.S. Sorry for the stupid question... But can you please explain why we always check both List and LargeList, while Apache Spark only supports i32 indexes for arrays (max length is Integer.MAX_VALUE - 15), which corresponds to List as far as I understand? All the code in list.rs might become a bit simpler if we make it non-generic (it would also make implementing other missing functions like array_zip simpler).
That's a good question. I wonder if the existing code for LargeList is actually being tested. It would be interesting to try removing it and see if there are any regressions. It makes sense to only handle List if Spark only supports i32 indexes.
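As a rough sketch of that simplification (the helper name and error type are assumptions, not actual list.rs code), a List-only extraction helper would no longer need to be generic over the offset type:

use std::sync::Arc;
use arrow_schema::{DataType, Field};

// Hypothetical helper: accept only i32-offset List arrays, which matches
// Spark's array length limit of Integer.MAX_VALUE - 15.
fn list_field(data_type: &DataType) -> Result<Arc<Field>, String> {
    match data_type {
        DataType::List(field) => Ok(Arc::clone(field)),
        other => Err(format!("expected List, got {other:?}")),
    }
}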
LGTM. Thanks @SemyonSinchenko!
Which issue does this PR close?
Related to #1042
array_insert: SELECT array_insert(array(1, 2, 3, 4), 5, 5)
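For reference, Spark's array_insert uses 1-based positions, so the example above inserts 5 at position 5 and yields [1, 2, 3, 4, 5]. Below is a minimal sketch of these semantics on a plain Vec, not the actual Arrow kernel from this PR (the function name is illustrative, and the legacy negative-index mode is only noted in comments):

// 1-based insert: position 0 is an error, positions past the end pad
// with nulls, and -1 inserts after the last element. Overly negative
// positions are clamped to the front here, whereas Spark pads with
// nulls; the legacy flag discussed in this PR also shifts negative
// positions and is omitted.
fn array_insert_sketch<T: Clone>(
    arr: &[Option<T>],
    pos: i32,
    item: T,
) -> Result<Vec<Option<T>>, String> {
    if pos == 0 {
        return Err("array_insert: position 0 is invalid".to_string());
    }
    let mut out = arr.to_vec();
    if pos > 0 {
        let idx = (pos - 1) as usize;
        if idx >= out.len() {
            out.resize(idx, None); // null-pad up to the target slot
            out.push(Some(item));
        } else {
            out.insert(idx, Some(item));
        }
    } else {
        let idx = (out.len() as i64 + i64::from(pos) + 1).max(0) as usize;
        out.insert(idx.min(out.len()), Some(item));
    }
    Ok(out)
}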
Rationale for this change
As described in #1042
What changes are included in this PR?
- QueryPlanSerde.scala: I added an additional case for the array insert;
- expr.proto: I added a new message for the ArrayInsert;
- planner.rs: I added a case for the array_insert;
- list.rs: the ArrayInsert struct (a rough sketch follows below); PhysicalExpr, Display, and PartialEq implementations for it; fn array_insert.
How are these changes tested?
At the moment I added simple tests for fn array_insert and a test for QueryPlanSerde.