-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Update variant class visibility #12105
Core: Update variant class visibility #12105
Conversation
a5229ee
to
68069b4
Compare
PrimitiveWrapper(Variants.PhysicalType type, T value) { | ||
this.type = type; | ||
PrimitiveWrapper(PhysicalType type, T value) { | ||
if (value instanceof Boolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: seems I prefer the existing implementation which is cleaner and consistent with other types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The trade-off is that this would require a separate reader just for boolean values. I'm open to that if there are strong objections, but it seems to me that allowing the actual type to be fixed up depending on the value is a reasonable change that saves quite a bit of code. If you object to this, I can revert it and add the boolean-specific reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. You are right. BOOLEAN_TRUE
and BOOLEAN_FALSE
physical types need special handling. When we shred them, they should be grouped together, considered as one type.
And also we may need to have a type list {NULL
, BOOLEAN
, INT8
, INT16
, etc}, which is almost same as physical type list with BOOLEAN for BOOLEAN_TRUE
and BOOLEAN_FALSE
. Otherwise, we can't represent a shredded column type for true/false.
I'm fine to keep it simple for now. We can revisit if needed.
@VisibleForTesting | ||
int numElements() { | ||
@Override | ||
public int numElements() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing it. I was trying to make Parquet change in (https://github.com/apache/iceberg/pull/11653/files#diff-b8e8443fcec3843e538dbc702d4c131ff58359cb83ccdb211d8679c1d77c16bd) and we need to expose this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed it for the readers and for the updates here to ShreddedObject
, too.
return nameSet().size(); | ||
} | ||
|
||
public void remove(String field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this remove() try to support or to be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this class is to create objects from an unshredded, serialized variant in value
and the fields in its corresponding typed_value
. The serialized object is used to construct the ShreddedObject
instance and then the shredded fields are set through put
.
This is intended to handle fields that are "missing" because the field's value
and typed_value
are null. In those cases, we need to basically add a null
value to the shreddedFields
map. We could do that, but the map implementations that we use (from Guava) don't allow null
values. Even if we used a map that could handle null
, we would have to handle those nulls in places like nameSet
and in serialization. That way we correctly store that the field was missing according to the shredding spec, rather than defined and equal to a Variant null.
I thought it was cleaner to handle missing fields by calling remove
for the field name to show that it is not present in the shredded fields. I also think that using a separate set of field names makes the most sense for handling these instead of using null
as a sentinel value in the shreddedFields
map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explanation.
If a field is missing and we remove the field from shreddedFields
, why do we still need removedFields
to keep track of it? Would the following get the correct field list?
private Set<String> nameSet() {
Set<String> names = Sets.newHashSet(shreddedFields.keySet());
if (unshredded != null) {
Iterables.addAll(names, unshredded.fieldNames());
}
return names;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't handle the case where unshredded
incorrectly includes the field. We need to keep track of the shredded fields, whether present or missing, so that the shredded fields are always used.
This updates variants with changes needed for Parquet readers.
numFields
to objects,numElements
to arrays, anddictionarySize
to metadataVariantMetadata
instead of direct references toSerializedMetadata
VariantObject
for unshredded fields inShreddedObject
to avoid leakingSerializedObject
ShreddedObject
to handle shredded fields that are missing, not overriddenVariants.from