Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial commit of Iceberg integration. #5277

Merged
merged 31 commits into from
Jun 5, 2024
Merged

Conversation

lbooker42
Copy link
Contributor

@lbooker42 lbooker42 commented Mar 22, 2024

Highlights
Can select specific snapshots or automatically choose the latest when loading a table. Supports single-file, multi-file and key-value partitioned data.

TODO (known limits):

  • Support delete files (for merge-on-read support)
  • Refreshing tables (automatic when load latest snapshot?)

@lbooker42 lbooker42 self-assigned this Mar 22, 2024
@lbooker42 lbooker42 added this to the 1. March 2024 milestone Mar 22, 2024
@lbooker42 lbooker42 requested a review from malhotrashivam April 2, 2024 18:25
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a discussion with @malhotrashivam and @devinrsmith about configuration strategies to unify with Parquet and existing S3 support.

devinrsmith added a commit that referenced this pull request May 1, 2024
As part of this, the bootstrapping S3 client was changed from sync to async to support S3TransferManager. Deleting keys has also been automated (unit tests no longer responsible for keeping track of keys used).

SingletonContainers, LocalStack, and MinIO have been made public; this allows the container bootstrapping logic to be used by other projects' tests.

These changes should make it easier for Iceberg testing in #5277
}

// Add the S3 instructions.
if (instructions.s3Instructions().isPresent()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we require s3 instructions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to verify that (e.g.) an all-AWS install would not need explicit instructions but could function with defaults.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm good with this, modulo a few quibbles. Maybe @niloc132 should look as a gradle expert.

extensions/iceberg/build.gradle Show resolved Hide resolved
buildSrc/src/main/groovy/Classpaths.groovy Outdated Show resolved Hide resolved
settings.gradle Outdated
@@ -270,6 +270,9 @@ project(':extensions-s3').projectDir = file('extensions/s3')
include(':extensions-iceberg')
project(':extensions-iceberg').projectDir = file('extensions/iceberg')

include(':extensions-iceberg-aws')
project(':extensions-iceberg-aws').projectDir = file('extensions/iceberg-aws')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be iceberg/aws?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nest subdir vs nested subproject - technically orthogonal. ryan's comment is asking for the former, which is a way to structure the directories (replace the string passed to the file() method). I would argue that not only do we want that, we also want nested subprojects, using : delimiters.

The Gradle convention is that nested directories also always means nested projects. Deephaven (starting with DHE, inherited in DHC) doesn't follow this as a rule, though DHC sometimes uses this if the parent directory is itself a project.

That said, using subprojects also conflicts a bit with our naming policies (which is "deephaven-" plus the project's own name), since nesting projects removes the parent project from the name.

(for the rest of this i'm assuming we're calling it s3 rather than aws, but aws-s3 might also make sense)

From talking to Larry it sounds like we anticipate other iceberg implementations that aren't s3 specific. So, I would propose something like:

extensions/
  iceberg/ # name this project deephaven-extensions-iceberg?
    build.gradle
    src/
      s3/ # name this project deephaven-iceberg-s3? 
         build.gradle
         src/
      r2/ # possible other impl?
        build.gradle
        src/

extensions/iceberg/build.gradle Show resolved Hide resolved
chipkent
chipkent previously approved these changes Jun 4, 2024
Copy link
Member

@chipkent chipkent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only reviewed the gradle file in the py dir.

buildSrc/src/main/groovy/Classpaths.groovy Outdated Show resolved Hide resolved
buildSrc/src/main/groovy/Classpaths.groovy Outdated Show resolved Hide resolved
Comment on lines 3 to 4
testcontainers.localstack.image=localstack/localstack:3.1.0
testcontainers.minio.image=minio/minio:RELEASE.2024-02-04T22-36-13Z
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned about these version numbers, since unlike maven versions, these are docker tags that are mutable. Typically we use our docker-registry project to pin to a particular image hash and ensure that even if a tag moves, our old builds will behave the same.

That said, Devin is out, and this is how he wrote the S3 testing code, so we should be consistent until he can tell us why we want it that way (or offer another way forward).

settings.gradle Outdated
@@ -270,6 +270,9 @@ project(':extensions-s3').projectDir = file('extensions/s3')
include(':extensions-iceberg')
project(':extensions-iceberg').projectDir = file('extensions/iceberg')

include(':extensions-iceberg-aws')
project(':extensions-iceberg-aws').projectDir = file('extensions/iceberg-aws')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nest subdir vs nested subproject - technically orthogonal. ryan's comment is asking for the former, which is a way to structure the directories (replace the string passed to the file() method). I would argue that not only do we want that, we also want nested subprojects, using : delimiters.

The Gradle convention is that nested directories also always means nested projects. Deephaven (starting with DHE, inherited in DHC) doesn't follow this as a rule, though DHC sometimes uses this if the parent directory is itself a project.

That said, using subprojects also conflicts a bit with our naming policies (which is "deephaven-" plus the project's own name), since nesting projects removes the parent project from the name.

(for the rest of this i'm assuming we're calling it s3 rather than aws, but aws-s3 might also make sense)

From talking to Larry it sounds like we anticipate other iceberg implementations that aren't s3 specific. So, I would propose something like:

extensions/
  iceberg/ # name this project deephaven-extensions-iceberg?
    build.gradle
    src/
      s3/ # name this project deephaven-iceberg-s3? 
         build.gradle
         src/
      r2/ # possible other impl?
        build.gradle
        src/

Copy link
Member

@chipkent chipkent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Py dir changes look ok.

@lbooker42 lbooker42 merged commit 6881afb into deephaven:main Jun 5, 2024
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jun 5, 2024
@deephaven-internal
Copy link
Contributor

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#226

@lbooker42 lbooker42 deleted the lab-iceberg branch June 26, 2024 19:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants