Core: Update variant class visibility #12105

rdblue · 2025-01-25T21:40:39Z

This updates variants with changes needed for Parquet readers.

Add numFields to objects, numElements to arrays, and dictionarySize to metadata
Use VariantMetadata instead of direct references to SerializedMetadata
Support any VariantObject for unshredded fields in ShreddedObject to avoid leaking SerializedObject
Support suppressing fields in ShreddedObject to handle shredded fields that are missing, not overridden
For boolean variants, alter the type based on the value to simplify creating booleans
Rename Variants.from

aihuaxu · 2025-01-27T18:20:27Z

core/src/main/java/org/apache/iceberg/variants/PrimitiveWrapper.java

-  PrimitiveWrapper(Variants.PhysicalType type, T value) {
-    this.type = type;
+  PrimitiveWrapper(PhysicalType type, T value) {
+    if (value instanceof Boolean


nit: I prefer the existing implementation, which is cleaner and consistent with the other types.

The trade-off is that this would require a separate reader just for boolean values. I'm open to that if there are strong objections, but it seems to me that allowing the actual type to be fixed up depending on the value is a reasonable change that saves quite a bit of code. If you object to this, I can revert it and add the boolean-specific reader.

Yeah, you are right. The BOOLEAN_TRUE and BOOLEAN_FALSE physical types need special handling. When we shred them, they should be grouped together and treated as one type.

We may also need a type list {NULL, BOOLEAN, INT8, INT16, etc}, which is almost the same as the physical type list but with BOOLEAN in place of BOOLEAN_TRUE and BOOLEAN_FALSE. Otherwise, we can't represent a shredded column type for true/false.

I'm fine with keeping it simple for now. We can revisit if needed.
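
To make the trade-off in this thread concrete, here is a minimal sketch of the "fix up the type from the value" idea. It is not the code in this PR: the diff above is truncated after `if (value instanceof Boolean`, so the rest of the condition is an assumption, and `PhysicalTypeSketch` and `PrimitiveWrapperSketch` are hypothetical, simplified stand-ins for the real classes.

```java
// Hypothetical stand-in for the variant physical type enum; only a few values are listed.
enum PhysicalTypeSketch {
  NULL,
  BOOLEAN_TRUE,
  BOOLEAN_FALSE,
  INT8,
  INT16,
  STRING
}

// Hypothetical stand-in for PrimitiveWrapper; the real class has more behavior.
final class PrimitiveWrapperSketch<T> {
  private final PhysicalTypeSketch type;
  private final T value;

  PrimitiveWrapperSketch(PhysicalTypeSketch type, T value) {
    // Assumption: when a boolean type is requested, the stored type is derived from the
    // value, so a caller may pass either BOOLEAN_TRUE or BOOLEAN_FALSE for any boolean.
    if (value instanceof Boolean
        && (type == PhysicalTypeSketch.BOOLEAN_TRUE
            || type == PhysicalTypeSketch.BOOLEAN_FALSE)) {
      this.type =
          ((Boolean) value) ? PhysicalTypeSketch.BOOLEAN_TRUE : PhysicalTypeSketch.BOOLEAN_FALSE;
    } else {
      this.type = type;
    }
    this.value = value;
  }

  PhysicalTypeSketch type() {
    return type;
  }

  T get() {
    return value;
  }
}
```

With a constructor like this, a single reader can construct wrappers with one fixed boolean type and still end up with the correct physical type, which is the code saving described above.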

aihuaxu · 2025-01-27T18:23:55Z

core/src/main/java/org/apache/iceberg/variants/SerializedArray.java

-  @VisibleForTesting
-  int numElements() {
+  @Override
+  public int numElements() {


Thanks for addressing it. I was working on the Parquet change in (https://github.com/apache/iceberg/pull/11653/files#diff-b8e8443fcec3843e538dbc702d4c131ff58359cb83ccdb211d8679c1d77c16bd) and we need this to be exposed.

I needed it for the readers and for the updates here to ShreddedObject, too.
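
As a rough illustration of why a public size accessor on the interface helps, the sketch below iterates an array through the interface alone, without referring to a serialized implementation class. The names `VariantArraySketch`, `ListBackedArray`, and `ArrayReaders` are hypothetical and are not Iceberg classes; only the `numElements()` method name comes from the diff above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: numElements() is part of the public contract.
interface VariantArraySketch {
  Object get(int index);

  int numElements();
}

// Hypothetical implementation backed by a list, standing in for a serialized array.
final class ListBackedArray implements VariantArraySketch {
  private final List<Object> elements;

  ListBackedArray(List<Object> elements) {
    this.elements = new ArrayList<>(elements);
  }

  @Override
  public Object get(int index) {
    return elements.get(index);
  }

  @Override
  public int numElements() {
    return elements.size();
  }
}

// A consumer (for example, a reader) that depends only on the interface.
final class ArrayReaders {
  private ArrayReaders() {}

  static List<Object> toList(VariantArraySketch array) {
    List<Object> out = new ArrayList<>(array.numElements());
    for (int i = 0; i < array.numElements(); i += 1) {
      out.add(array.get(i));
    }
    return out;
  }
}
```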

aihuaxu · 2025-01-27T18:42:10Z

core/src/main/java/org/apache/iceberg/variants/ShreddedObject.java

+    return nameSet().size();
+  }
+
+  public void remove(String field) {


What is remove() meant to support, and where will it be used?

The purpose of this class is to create objects from an unshredded, serialized variant in value and the fields in its corresponding typed_value. The serialized object is used to construct the ShreddedObject instance and then the shredded fields are set through put.

This is intended to handle fields that are "missing" because the field's value and typed_value are both null. In those cases, we essentially need to add a null value to the shreddedFields map. We could do that, but the map implementations that we use (from Guava) don't allow null values. Even if we used a map that could handle null, we would have to handle those nulls in places like nameSet and in serialization. The goal is to correctly record that the field was missing according to the shredding spec, rather than defined and equal to a Variant null.

I thought it was cleaner to handle missing fields by calling remove for the field name to show that it is not present in the shredded fields. I also think that using a separate set of field names makes the most sense for handling these instead of using null as a sentinel value in the shreddedFields map.

Thanks for the explanation.

If a field is missing and we remove the field from shreddedFields, why do we still need removedFields to keep track of it? Would the following get the correct field list?

private Set<String> nameSet() {
  Set<String> names = Sets.newHashSet(shreddedFields.keySet());
  if (unshredded != null) {
    Iterables.addAll(names, unshredded.fieldNames());
  }
  return names;
}

That doesn't handle the case where unshredded incorrectly includes the field. We need to keep track of the shredded fields, whether present or missing, so that the shredded fields are always used.
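
The point of this thread can be sketched as follows. This is a simplified illustration under stated assumptions, not the ShreddedObject implementation: `ShreddedObjectSketch` is hypothetical, plain `java.util` maps stand in for VariantObject and the Guava collections, and values are plain `Object`s. The field names `shreddedFields` and `removedFields` and the methods `put`, `remove`, `nameSet`, and `numFields` follow the discussion above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of an object built from unshredded and shredded fields.
final class ShreddedObjectSketch {
  // Fields from the serialized `value`; may be null when there is no unshredded object.
  private final Map<String, Object> unshredded;
  // Shredded fields that are present (set through put).
  private final Map<String, Object> shreddedFields = new HashMap<>();
  // Shredded fields that are missing (both value and typed_value were null).
  private final Set<String> removedFields = new HashSet<>();

  ShreddedObjectSketch(Map<String, Object> unshredded) {
    this.unshredded = unshredded;
  }

  void put(String field, Object value) {
    if (value == null) {
      throw new IllegalArgumentException("Use remove for missing fields: " + field);
    }
    removedFields.remove(field);
    shreddedFields.put(field, value);
  }

  void remove(String field) {
    shreddedFields.remove(field);
    removedFields.add(field);
  }

  Set<String> nameSet() {
    Set<String> names = new HashSet<>(shreddedFields.keySet());
    if (unshredded != null) {
      for (String name : unshredded.keySet()) {
        // Skip names that shredding marked as missing, even if unshredded has them.
        if (!removedFields.contains(name)) {
          names.add(name);
        }
      }
    }
    return names;
  }

  Object get(String field) {
    if (removedFields.contains(field)) {
      return null; // missing according to the shredded columns
    }
    Object shredded = shreddedFields.get(field);
    if (shredded != null) {
      return shredded; // shredded fields take precedence over unshredded ones
    }
    return unshredded != null ? unshredded.get(field) : null;
  }

  int numFields() {
    return nameSet().size();
  }
}
```

If `remove` only deleted the entry from `shreddedFields`, a field that the unshredded object incorrectly included would reappear in `nameSet()`. Tracking it in `removedFields` keeps the shredded state authoritative.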

amogh-jahagirdar · 2025-01-28T16:46:01Z

Thanks @rdblue , and @aihuaxu for reviewing!

github-actions bot added the core label Jan 25, 2025

Core: Update variant class visibility.

68069b4

rdblue force-pushed the variant-update-visibility branch from a5229ee to 68069b4 on January 25, 2025 21:44

Fix ShreddedObject#nameSet and update tests.

98cc7b4

aihuaxu reviewed Jan 27, 2025

amogh-jahagirdar approved these changes Jan 27, 2025

amogh-jahagirdar merged commit 61241ed into apache:main Jan 28, 2025
46 checks passed
