Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow: add support for null vectors #10953
Arrow: add support for null vectors #10953
Changes from 47 commits
ac6440a
becf6f7
4e2cb86
1193d02
12bc3de
d8f3e13
bb4e010
6e7a1aa
28451a5
24a9932
9bcb2b1
7a25b52
a31bf94
44a7f91
c2eaf24
e323db7
061ab02
bf0c905
5610dd4
a13415d
2eaa63f
62108da
7115e93
08bb07c
442b381
5e7668e
83913a0
e2b428e
7ffa7ed
cda0423
9aec9e5
0c87dc7
e5eebd0
fe60793
1a3896b
5c3b460
a2df95c
bbc776d
2bf5b2f
1edd680
e1b3931
e574623
c8bcc1c
da9e514
fe83726
01f96f0
f509e47
163ee62
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we keeping this constructor? Is it needed if we can make a NullVectorHolder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constructor could be deleted but that would be an API breaking change because this is a public constructor in a public class. Which would you prefer, an extra constructor or a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of placing an if-statement in the constructor itself, it feels more like we should just have a different constructor or class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer In an earlier draft of this PR I did create a NullVectorHolder class but was told to instead adapt ConstantVectorHolder to work for my case, so I did that.
Here's a patch to add a new constructor, but it has at least one issue. Passing
null
as the constant value in the existing constructor is no longer allowed. This is a semantic breaking change, though not an API breaking change. What this means is that third party code that upgrades to a version of Apache Iceberg containing this patch that was previously using this constructor and passing in a null constant value will still compile fine but will have a runtime failure. That doesn't seem like an appealing solution to me.I think adding a new class such as NullVectorHolder has its own issue. What would be the use case for a ConstantVectorHolder containing a null value versus a NullVectorHolder containing a null value?
Which solution would you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok having gone through the code I think I may have a strategy here but please let me know what i'm missing in this interpretation.
ConstantVectorHolder is used as a dummy holder prior to this commit. It has no vector , column descriptor, etc ... everything is null. The current approach sometimes makes a ConstantVectorHolder which sometimes does have something inside of it. This is done so that GenericArrowVectorAccessorFactory will have something to work with.
From what I can see in GenericArrowVectorAccessorFactory is that it doesn't work with ConstantVectorHolders at all. It currently can only handle cases in which holder.vector() is a non null type and matches some known vector type. Spark on the other hand, does not actually use this path and when it sees a ConstantVectorHolder it instead creates a o.a.i.s.ConstantColumnVector. This is why I don't think Spark has any issue with this. (cc @amogh-jahagirdar )
Now this makes me think what we really need is to just fix GenericArrowVectorAccessorFactory to handle the case where the "vector" is null and in that circumstance attempt to cast the VectorHolder to a ConstantVectorHolder and create the appropriate vector type or accessor. I think we have a few options to do this.
I think the easiest may be to just create an accessor that looks like the Spark ConstantColumnVector and just use that as our generic accessor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The another approach is to directly mimic the Spark implementation and fork here
iceberg/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java
Line 53 in c07f2aa
And if the holder is a ConstantVectorHolder set the vector to a new ConstantVector which we define similar to the Spark ConstantColumnVector code again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be confusing vectorized read paths but I'm curious why this isn't reproducible in Spark vectorized reads? Or is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can try and repro in a unit test for Spark and see if it's the case. To be clear I don't want to hold up this PR on that though since it does seem like a legitimate problem based on the test being done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never tried reproducing the issue in Spark. My guess as to why there was no existing test case is that null vectors isn't a bug so much as it is a subfeature that was never implemented. No sense in creating a test for a subfeature that was knowingly never implemented.
What makes me say null vector support was knowingly never implemented? Look at the removed comment in arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowBatchReader.java. That comment wasn't a "this is how this code works" type of comment. In my opinion that comment is a TODO comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to explore and see how this looks in Spark's vectorized path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please also update all other places in this test class