Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: add variant builder implementation #11857

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

aihuaxu
Copy link
Contributor

@aihuaxu aihuaxu commented Dec 22, 2024

This PR adds the implementation to build a variant for primitive, object and array.

The usage:

  • Build from Json
    VariantBuilder.parseJson(json_string);

  • build a primitive

VariantBuilder builder = new VariantBuilder();
VariantPrimitiveBuilder primBuilder = builder.createPrimitive();
Variant variant = primBuilder.writeNumeric(10).build();

  • build an object
 VariantBuilder builder = new VariantBuilder();
 VariantObjectBuilder objBuilder = builder.startObject();
 objBuilder.writeString("key1", "value1").writeNumeric("key2", 3).endObject();
 Variant variant = builder.build();
  • build an array
 VariantBuilder builder = new VariantBuilder();
 VariantArrayBuilder arrBuilder = builder.createArray();
 arrBuilder.writerNumeric(10).writerNumeric(20).endArray();
 Variant variant = builder.build();

Nested arrays and objects can be supported as well. Please see the test cases for the usage. Compared to build from JSON string, we also support some additional types such as timestamp, date.

Part of: #10392

@github-actions github-actions bot added the core label Dec 22, 2024
@aihuaxu aihuaxu marked this pull request as ready for review December 22, 2024 20:11
@aihuaxu aihuaxu force-pushed the variant-add-writer-impl branch 6 times, most recently from f8f8e8f to d806c66 Compare December 24, 2024 03:54
@aihuaxu aihuaxu force-pushed the variant-add-writer-impl branch from d806c66 to 16210a0 Compare January 6, 2025 22:26
@aihuaxu
Copy link
Contributor Author

aihuaxu commented Jan 6, 2025

@rdblue, @RussellSpitzer Please help review. Thanks.

@aihuaxu aihuaxu force-pushed the variant-add-writer-impl branch from 16210a0 to d4f1d0e Compare January 6, 2025 22:57
@@ -20,6 +20,10 @@

/** An variant array value. */
public interface VariantArray extends VariantValue {
default int numElements() {
throw new UnsupportedOperationException();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has VariantArray been in a release?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked. It's not in any release. What is the question for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been added so I just rebased so we don't need this anymore.

*/
package org.apache.iceberg.variants;

public class VariantSizeLimitException extends RuntimeException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there an exception here? Do we need to limit the size of variants that are being constructed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we had the discussion that we will not limit the size. Let me remove that.

private final VariantMetadata metadata;
private final VariantValue value;

public VariantImpl(byte[] metadata, byte[] value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variants should use ByteBuffer to avoid copying problems from byte[]. That's why the current implementation is built around ByteBuffer.

*/
package org.apache.iceberg.variants;

public class VariantConstants {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other constants in Variants. Why add a new class for these? Should we standardize where they live?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. You intend to place the constants in Variants? That works.


public static final int MAX_DECIMAL4_PRECISION = 9;
public static final int MAX_DECIMAL8_PRECISION = 18;
public static final int MAX_DECIMAL16_PRECISION = 38;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all of these constants need to be public? What is the intended use outside of this package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored to VariantBuilderBase since they are only used there.

Preconditions.checkArgument(
json != null && !json.isEmpty(), "Input JSON string cannot be null or empty.");

try (JsonParser parser = new JsonFactory().createParser(json)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are existing utilities in JSONUtil for a lot of the boilerplate around parsing and generating JSON. Is there a reason not to use the same approach here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to parse the JSON token by token so we can encode them in binary. Seems we only have the version to parse the JSON string into a full JSONNode and then create the object.

/**
* Attempts to parse a JSON number as a decimal and write it. The input must meet the following
* criteria: - Be in a valid decimal format (integer with an optional '.'). - Not in scientific
* notation. - Fit within the precision and scale limits of decimal types.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: lines seem to have been accidentally wrapped.

*
* @param value The numeric value to append.
*/
protected void writeNumericInternal(long value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this duplicate all of the serialization that is already implemented in PrimitiveWrapper?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added writePrimitive() in ByteBufferWrapper which will call the serialization in PrimitiveWrapper.

dataSize, fieldOffsetSize, offsetStart + numElements * fieldOffsetSize);
}

protected ByteBufferWrapper getBuffer() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: Iceberg does not use get in accessor names and discourages its use in general. If there's a more helpful verb then it is generally better to use it. Otherwise we usually omit get.

* @param value The numeric value to write.
* @param numBytes The number of bytes to write (e.g., 2 for short, 4 for int, 8 for long).
*/
void writeLittleEndianUnsigned(long value, int numBytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These utilities already exist. Why add them a second time? Are the existing ones incorrect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to work on the resizable ByteBuffer and just implement the logic. But should be able to reuse. Let me update.

@aihuaxu aihuaxu force-pushed the variant-add-writer-impl branch 2 times, most recently from 9cba102 to 86bf18b Compare February 2, 2025 03:59
@aihuaxu aihuaxu force-pushed the variant-add-writer-impl branch 3 times, most recently from fa34686 to 3ad94d3 Compare February 3, 2025 19:10
@aihuaxu aihuaxu requested a review from rdblue February 3, 2025 19:17
@aihuaxu aihuaxu force-pushed the variant-add-writer-impl branch from 3ad94d3 to 38e6f2f Compare February 3, 2025 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants