-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Add InternalData read and write builders #12060
base: main
Are you sure you want to change the base?
Changes from all commits
afd08b1
1262417
3835e04
183cd19
3373d06
4127a4d
acb5778
799914b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,164 @@ | ||||||
/* | ||||||
* Licensed to the Apache Software Foundation (ASF) under one | ||||||
* or more contributor license agreements. See the NOTICE file | ||||||
* distributed with this work for additional information | ||||||
* regarding copyright ownership. The ASF licenses this file | ||||||
* to you under the Apache License, Version 2.0 (the | ||||||
* "License"); you may not use this file except in compliance | ||||||
* with the License. You may obtain a copy of the License at | ||||||
* | ||||||
* http://www.apache.org/licenses/LICENSE-2.0 | ||||||
* | ||||||
* Unless required by applicable law or agreed to in writing, | ||||||
* software distributed under the License is distributed on an | ||||||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||||||
* KIND, either express or implied. See the License for the | ||||||
* specific language governing permissions and limitations | ||||||
* under the License. | ||||||
*/ | ||||||
package org.apache.iceberg; | ||||||
|
||||||
import java.io.IOException; | ||||||
import java.util.Map; | ||||||
import java.util.function.Function; | ||||||
import org.apache.iceberg.avro.Avro; | ||||||
import org.apache.iceberg.avro.InternalReader; | ||||||
import org.apache.iceberg.avro.InternalWriter; | ||||||
import org.apache.iceberg.common.DynMethods; | ||||||
import org.apache.iceberg.io.CloseableIterable; | ||||||
import org.apache.iceberg.io.FileAppender; | ||||||
import org.apache.iceberg.io.InputFile; | ||||||
import org.apache.iceberg.io.OutputFile; | ||||||
import org.apache.iceberg.relocated.com.google.common.collect.Maps; | ||||||
import org.slf4j.Logger; | ||||||
import org.slf4j.LoggerFactory; | ||||||
|
||||||
public class InternalData { | ||||||
private InternalData() {} | ||||||
|
||||||
private static final Logger LOG = LoggerFactory.getLogger(InternalData.class); | ||||||
private static final Map<FileFormat, Function<OutputFile, WriteBuilder>> WRITE_BUILDERS = | ||||||
Maps.newConcurrentMap(); | ||||||
private static final Map<FileFormat, Function<InputFile, ReadBuilder>> READ_BUILDERS = | ||||||
Maps.newConcurrentMap(); | ||||||
|
||||||
static { | ||||||
InternalData.register( | ||||||
FileFormat.AVRO, | ||||||
outputFile -> Avro.write(outputFile).createWriterFunc(InternalWriter::create), | ||||||
inputFile -> Avro.read(inputFile).createResolvingReader(InternalReader::create)); | ||||||
|
||||||
try { | ||||||
DynMethods.StaticMethod registerParquet = | ||||||
DynMethods.builder("register") | ||||||
.impl("org.apache.iceberg.InternalParquet") | ||||||
.buildStaticChecked(); | ||||||
|
||||||
registerParquet.invoke(); | ||||||
|
||||||
} catch (NoSuchMethodException e) { | ||||||
// failing to load Parquet is normal and does not require a stack trace | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This if normal for now? Don't we expect this to be failing bug in the future? I'm also a little interested in when we would actually fail here if we are using the Iceberg repo as is. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wouldn't say that it's normal to fail. I'm actually not aware of any situations where the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This would be normal whenever the |
||||||
LOG.info("Unable to register Parquet for metadata files: {}", e.getMessage()); | ||||||
} | ||||||
} | ||||||
|
||||||
static void register( | ||||||
FileFormat format, | ||||||
Function<OutputFile, WriteBuilder> writeBuilder, | ||||||
Function<InputFile, ReadBuilder> readBuilder) { | ||||||
WRITE_BUILDERS.put(format, writeBuilder); | ||||||
READ_BUILDERS.put(format, readBuilder); | ||||||
} | ||||||
|
||||||
public static WriteBuilder write(FileFormat format, OutputFile file) { | ||||||
Function<OutputFile, WriteBuilder> writeBuilder = WRITE_BUILDERS.get(format); | ||||||
if (writeBuilder != null) { | ||||||
return writeBuilder.apply(file); | ||||||
} | ||||||
|
||||||
throw new UnsupportedOperationException( | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: Personally I think it may be a bit clearer to extract the handling the missing writer/reader maybe have writerFor(File format) {
writer = WRITE_BUILDERS.get(format)
if (writer == null) {
throw new Unsupported Exception
} else {
return writer;
}
} So that this code is just return writerFor(format).apply(file) Mostly I feel a little unease about the implicit else in the current logic so having an |
||||||
"Cannot write using unregistered internal data format: " + format); | ||||||
} | ||||||
|
||||||
public static ReadBuilder read(FileFormat format, InputFile file) { | ||||||
Function<InputFile, ReadBuilder> readBuilder = READ_BUILDERS.get(format); | ||||||
if (readBuilder != null) { | ||||||
return readBuilder.apply(file); | ||||||
} | ||||||
|
||||||
throw new UnsupportedOperationException( | ||||||
"Cannot read using unregistered internal data format: " + format); | ||||||
} | ||||||
|
||||||
public interface WriteBuilder { | ||||||
/** Set the file schema. */ | ||||||
WriteBuilder schema(Schema schema); | ||||||
|
||||||
/** Set the file schema's root name. */ | ||||||
WriteBuilder named(String name); | ||||||
|
||||||
/** | ||||||
* Set a writer configuration property. | ||||||
* | ||||||
* <p>Write configuration affects writer behavior. To add file metadata properties, use {@link | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "This" doesn't refer to a writer. This is configuring a builder that creates the writer, so I think that the existing language is correct. |
||||||
* #meta(String, String)}. | ||||||
* | ||||||
* @param property a writer config property name | ||||||
* @param value config value | ||||||
* @return this for method chaining | ||||||
*/ | ||||||
WriteBuilder set(String property, String value); | ||||||
|
||||||
/** | ||||||
* Set a file metadata property. | ||||||
* | ||||||
* <p>Metadata properties are written into file metadata. To alter a writer configuration | ||||||
* property, use {@link #set(String, String)}. | ||||||
* | ||||||
* @param property a file metadata property name | ||||||
* @param value config value | ||||||
* @return this for method chaining | ||||||
*/ | ||||||
WriteBuilder meta(String property, String value); | ||||||
|
||||||
/** | ||||||
* Set a file metadata properties from a Map. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
* | ||||||
* <p>Metadata properties are written into file metadata. To alter a writer configuration | ||||||
* property, use {@link #set(String, String)}. | ||||||
* | ||||||
* @param properties a map of file metadata properties | ||||||
* @return this for method chaining | ||||||
*/ | ||||||
default WriteBuilder meta(Map<String, String> properties) { | ||||||
properties.forEach(this::meta); | ||||||
return this; | ||||||
} | ||||||
|
||||||
/** Overwrite the file if it already exists. */ | ||||||
WriteBuilder overwrite(); | ||||||
|
||||||
/** Build the configured {@link FileAppender}. */ | ||||||
<D> FileAppender<D> build() throws IOException; | ||||||
} | ||||||
|
||||||
public interface ReadBuilder { | ||||||
/** Set the projection schema. */ | ||||||
ReadBuilder project(Schema projectedSchema); | ||||||
|
||||||
/** Read only the split that is {@code length} bytes starting at {@code start}. */ | ||||||
ReadBuilder split(long start, long length); | ||||||
|
||||||
/** Reuse container classes, like structs, lists, and maps. */ | ||||||
ReadBuilder reuseContainers(); | ||||||
|
||||||
/** Set a custom class for in-memory objects at the schema root. */ | ||||||
ReadBuilder setRootType(Class<? extends StructLike> rootClass); | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll probably get to this later in the PR but i'm interested in why we need this and setCustomType There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok I see how it's used below, I'm wondering if instead of needing this, could we just automatically set these readers based on the root type? Ie setRootType(ManifestEntry) --- Automatically sets field types based on Manifest entry? Or do we have a plan for using this in a more custom manner in the future? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The problem this is solving is that we don't have an assigned ID for the root type. We could use a sentinel value like |
||||||
|
||||||
/** Set a custom class for in-memory objects at the given field ID. */ | ||||||
ReadBuilder setCustomType(int fieldId, Class<? extends StructLike> structClass); | ||||||
|
||||||
/** Build the configured reader. */ | ||||||
<D> CloseableIterable<D> build(); | ||||||
} | ||||||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This uses
DynMethods
to call Parquet'sregister
method directly, rather than using aServiceLoader
. There is no need for the complexity because we want to keep the number of supported formats small rather than plugging in custom formats.I'm also considering refactoring so that the
register
method here is package-private so that no one can easily call it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand this one, why can we call Avro Register() but not Parquet.register(). I'm also not clear on the Service Loader comment, is that just to note we don't want to make this dynamic and only want hardcoded formats to be supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to the gradle project level isolation. Avro is currently included in core, but Parquet is in a separate subproject. I'm in favor of being explicit about what is supported (i.e. hard-coded), but we would like to keep parquet in a separate project to reduce dependency proliferation from api/core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about using the Java ServiceLoader to load the internal readers and writers?
I have created a WIP for testing out how the DataFile readers could work: #12069
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fist I implemented using the registry method like on this PR (7e171cc), then moved to the ServiceLoader method.
In my head the 2 problems are very similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
ServiceLoader
framework is error prone and commonly broken because of Jar bundling. In addition, we do not want anything else registered so it is not needed. That would make it easier to plug in here, which we specifically are trying to avoid.