
feat: support array_insert #1073

Merged — 20 commits into apache:main, Nov 22, 2024

Conversation

SemyonSinchenko
Member

@SemyonSinchenko SemyonSinchenko commented Nov 9, 2024

Which issue does this PR close?

Related to #1042

array_insert: SELECT array_insert(array(1, 2, 3, 4), 5, 5)

Rationale for this change

As described in #1042

What changes are included in this PR?

  • QueryPlanSerde.scala: I added an additional case for the array insert;
  • expr.proto: I added a new message for the ArrayInsert;
  • planner.rs: I added a case for the array_insert;
  • list.rs:
    • I added a new ArrayInsert struct;
    • I implemented PhysicalExpr, Display and PartialExpr for it;
    • The main logic of insertion is in fn array_insert

How are these changes tested?

At the moment I have added simple tests for fn array_insert and a test for QueryPlanSerde.
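As background for reviewers, Spark's array_insert semantics can be sketched on plain vectors. This is a simplified illustration, not the Arrow-based code in list.rs, and it assumes the non-legacy (Spark 3.4+) handling of negative indices, where -1 appends after the last element:

```rust
// Simplified sketch of Spark-style array_insert on plain vectors.
// Positions are 1-based; 0 is invalid; out-of-range positions pad with nulls.
fn array_insert<T: Clone>(
    arr: &[Option<T>],
    pos: i32,
    item: Option<T>,
) -> Result<Vec<Option<T>>, String> {
    if pos == 0 {
        return Err("array_insert: position must not be zero".to_string());
    }
    let len = arr.len() as i32;
    // Map a negative position to a 1-based slot counted from the end (-1 -> len + 1).
    let pos = if pos > 0 { pos } else { len + pos + 2 };
    let mut out: Vec<Option<T>> = Vec::with_capacity(arr.len() + 1);
    if pos <= 0 {
        // Before the start: item first, then null padding, then the original array.
        out.push(item);
        out.extend(std::iter::repeat(None).take((1 - pos) as usize));
        out.extend(arr.iter().cloned());
    } else if (pos as usize) > arr.len() {
        // Past the end: original array, null padding, then the item.
        out.extend(arr.iter().cloned());
        out.extend(std::iter::repeat(None).take(pos as usize - arr.len() - 1));
        out.push(item);
    } else {
        // In range: splice the item into the 1-based slot.
        let i = pos as usize - 1;
        out.extend(arr[..i].iter().cloned());
        out.push(item);
        out.extend(arr[i..].iter().cloned());
    }
    Ok(out)
}
```

For the example from the description, `array_insert(&[Some(1), Some(2), Some(3), Some(4)], 5, Some(5))` yields `[1, 2, 3, 4, 5]`.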

@SemyonSinchenko
Member Author

As far as I can tell, array_insert is not supported in DataFusion. Is list.rs a good place for an implementation of ArrayInsert and PhysicalExpr for it?

@SemyonSinchenko
Member Author

SemyonSinchenko commented Nov 11, 2024

@andygrove Sorry for the tag, but I have questions about the ticket (array_insert).

  1. [RESOLVED] array_insert was added in Spark 3.4, so all the 3.3.x tests obviously fail. I checked, and it looks like the EoL for 3.3 is around the end of 2024. Technically I could work around the 3.3.x tests via the reflection API; my question is mostly whether I should, given the upcoming EoL of 3.3.x.
  2. array_insert is not supported in DataFusion. I made an implementation (and it looks like it works, except for negative indices and corner cases). Is list.rs a good place for it, or should I move my code somewhere else?
  3. Spark does not support anything except Int32 for the position argument; is it OK if I support only Int32 too? In theory, other types could be supported, but I'm still trying to figure out how to achieve that, and it may become complex...

Thanks in advance! This is my first serious attempt to contribute to the project, so sorry if I'm being annoying.

@SemyonSinchenko SemyonSinchenko changed the title [WIP][DO-NOT-MERGE] feat: support array_insert [WIP] feat: support array_insert Nov 13, 2024
@SemyonSinchenko SemyonSinchenko changed the title [WIP] feat: support array_insert feat: support array_insert Nov 13, 2024
@SemyonSinchenko SemyonSinchenko changed the title feat: support array_insert [WIP] feat: support array_insert Nov 13, 2024
- added test for the negative index
- added test for the legacy spark mode
@SemyonSinchenko SemyonSinchenko changed the title [WIP] feat: support array_insert feat: support array_insert Nov 13, 2024
@SemyonSinchenko SemyonSinchenko marked this pull request as ready for review November 13, 2024 18:04
@andygrove
Member

Thanks, @SemyonSinchenko. I think it's fine to skip the test for Spark 3.3. I plan on reviewing this PR in more detail tomorrow, but it looks good from an initial read.

@codecov-commenter

codecov-commenter commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 59.09091% with 9 lines in your changes missing coverage. Please review.

Project coverage is 34.27%. Comparing base (845b654) to head (8b58d8d).
Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 59.09% 7 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1073      +/-   ##
============================================
- Coverage     34.46%   34.27%   -0.20%     
  Complexity      888      888              
============================================
  Files           113      113              
  Lines         43580    43355     -225     
  Branches       9658     9488     -170     
============================================
- Hits          15021    14860     -161     
- Misses        25507    25596      +89     
+ Partials       3052     2899     -153     


@SemyonSinchenko
Member Author

This PR is ready for review. The only failed check was due to an internal GHA error:

GitHub Actions has encountered an internal error when running your job.

@NoeB
Contributor

NoeB commented Nov 19, 2024

I am not sure if this should be done together with this PR, but it would add some "free" tests. Spark 3.5 introduced array_prepend, which it implements with array_insert, and starting with 4.0 it also implements array_append with array_insert. If you want, you can copy the array_append tests, replace array_append with array_prepend for Spark 3.5+, and enable the array_append tests for Spark 4.0. You can also ignore this comment if you do not agree or if it leads to unrelated errors.
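To illustrate the relationship this comment relies on: for in-range positions, prepend and append are just array_insert at the first slot and one past the last slot. A hypothetical sketch (names here are illustrative, not the Comet code):

```rust
// Illustrative sketch: prepend/append expressed via a 1-based insert,
// mirroring how Spark rewrites array_prepend/array_append to array_insert.
fn insert_at<T: Clone>(arr: &[T], pos_1based: usize, item: T) -> Vec<T> {
    let mut out = arr.to_vec();
    out.insert(pos_1based - 1, item);
    out
}

fn array_prepend<T: Clone>(arr: &[T], item: T) -> Vec<T> {
    insert_at(arr, 1, item) // insert before the first element
}

fn array_append<T: Clone>(arr: &[T], item: T) -> Vec<T> {
    insert_at(arr, arr.len() + 1, item) // insert after the last element
}
```

This is why tests written against array_prepend and array_append also exercise the array_insert path for free on those Spark versions.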

- fixes;
- tests;
- comments in the code;
In one case there is a zero in index and test fails due to spark error
@SemyonSinchenko
Member Author

Thanks for comments and suggestions!

This PR is ready for the review again.

What changed since the last round of review:

  • I added tests for array_prepend (Spark 3.5+) and enabled the test for array_append (Spark 4.0+);
  • I added tests for the corner cases:
    • Index is negative;
    • Index is bigger than the array length;
    • Index is negative and its absolute value is greater than the array length;
    • Fallback to Spark (a UDF as a child);
    • Value to insert is actually null;
  • I fixed the native-side behavior for some of the corner cases;

These tests drew my attention to uncovered parts of the code, which I fixed, and I also added a couple of additional tests to the native part. I revisited how Spark does array_insert, and it is trickier than I realized at first glance. I added a few additional comments to the native part of the implementation and fixed the behavior.

At the moment this PR is tested in multiple ways:

  • Basic tests in the native part, which are useful because they are easy to debug and very fast to run;
  • Basic tests in the native part for the so-called "legacy mode" in Spark;
  • Basic tests on the Scala side;
  • Logical tests on small data for corner cases (negative, positive, long, short, null, etc.) on the Scala side;
  • Tests of array_prepend and array_append, which call array_insert under the hood (NULLs, different data types, etc.);
  • A test of the fallback to Spark when one of the children is not supported by Comet;

So it looks to me that all the possible cases are covered and the behavior matches Spark's.

@SemyonSinchenko
Member Author

I am not sure if this should be done together with this PR, but it would add some "free" tests. Spark 3.5 introduced array_prepend, which it implements with array_insert, and starting with 4.0 it also implements array_append with array_insert. If you want, you can copy the array_append tests, replace array_append with array_prepend for Spark 3.5+, and enable the array_append tests for Spark 4.0. You can also ignore this comment if you do not agree or if it leads to unrelated errors.

Done!

Comment on lines 493 to 501
let src_element_type = match src_value.data_type() {
DataType::List(field) => field.data_type(),
DataType::LargeList(field) => field.data_type(),
data_type => {
return Err(DataFusionError::Internal(format!(
"Unexpected src array type in ArrayInsert: {:?}",
data_type
)))
}
Member


minor nit: this logic for extracting a list type is repeated a few times and could be factored out into a function

Member Author

@SemyonSinchenko SemyonSinchenko Nov 21, 2024


@andygrove Thanks for the suggestion!
I moved the array-type check (and the error logic) into a method:

    pub fn array_type(&self, data_type: &DataType) -> DataFusionResult<DataType> {
        match data_type {
            DataType::List(field) => Ok(DataType::List(Arc::clone(field))),
            DataType::LargeList(field) => Ok(DataType::LargeList(Arc::clone(field))),
            data_type => {
                return Err(DataFusionError::Internal(format!(
                    "Unexpected src array type in ArrayInsert: {:?}",
                    data_type
                )))
            }
        }
    }

At least it avoids returning the same error in multiple places. Is this what you suggested? Or should I turn this method into a helper function and also refactor GetArrayStructFields to use it?

P.S. Sorry for the naive question, but can you please explain why we always check both List and LargeList, while Apache Spark only supports i32 indexes for arrays (max length is Integer.MAX_VALUE - 15), which corresponds to List as far as I understand? All the code in list.rs might become a bit simpler if we made it non-generic (it would also simplify implementing other missing methods like array_zip).

Member


That's a good question. I wonder if the existing code for LargeList is actually being tested. It would be interesting to try removing it and see if there are any regressions. It makes sense to only handle List if Spark only supports i32 indexes.
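For context on this exchange: the List/LargeList split comes from Arrow's offset encoding, where list boundaries are stored as offsets into a flat child array, i32 for List and i64 for LargeList. A minimal sketch with plain vectors (not the arrow crate's actual types):

```rust
// Minimal sketch of Arrow's list layout: element i spans
// values[offsets[i] .. offsets[i + 1]]. List stores offsets as i32
// (so the flattened child length is capped near i32::MAX, matching
// Spark's i32-indexed arrays); LargeList uses i64 offsets for the
// same layout, which is what the generic code paths handle.
fn list_at<'a>(offsets: &[i32], values: &'a [i64], i: usize) -> &'a [i64] {
    &values[offsets[i] as usize..offsets[i + 1] as usize]
}
```

With offsets `[0, 2, 5]` over values `[10, 20, 30, 40, 50]`, element 0 is `[10, 20]` and element 1 is `[30, 40, 50]`.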

Member

@andygrove andygrove left a comment


LGTM. Thanks @SemyonSinchenko!

@andygrove andygrove merged commit 9990b34 into apache:main Nov 22, 2024
74 checks passed
@SemyonSinchenko SemyonSinchenko deleted the array-insert branch November 23, 2024 09:02
4 participants