Added offset index based parquet reading support #4844
This is not my full review. Just took a quick look at the tests for now. I'll wait until after Ryan's review to dig down deeper.
Mostly high-level review pass. Be sure that we keep enterprise in the loop regarding forwards compatibility issue, and look for more external validation of our new approach to writing and interpreting offset indexes.
Will take another pass soon; need to take a break.
@NotNull final ToPage<ATTR, ?> toPage,
@NotNull final ColumnDefinition<?> columnDefinition) throws IOException {
This is technically a "breaking" change; another reason I don't like how our implementation is structured right now. Not your issue, but just noting. #4850
We don't expect anyone to use our parquet libraries at this level but us. Java does not make it easy to specify that.
Currently, while reading a particular row from a parquet file,
All of this happens inside VariablePageSizeColumnChunkPageStore. Some of the above operations can be optimized using the OffsetIndex data structure.
OffsetIndex is an optional field in Parquet which gives us three things for each page in a parquet file:
So to locate the page containing a row, we can binary search directly inside the offset index, without extending pages up to that page. This speeds up sparse reads and improves parallelization.
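The lookup described above can be sketched roughly as follows. In the Parquet format, each page's entry in the offset index (a PageLocation) records the page's file offset, its compressed size, and the index of its first row within the row group. This sketch assumes those first row indexes have already been copied into a sorted `long[]`; the class and method names are hypothetical and are not taken from this PR.

```java
import java.util.Arrays;

// Hypothetical helper, not the PR's implementation: maps a row to the page that holds it.
final class OffsetIndexLookupSketch {

    /**
     * Returns the index of the page containing {@code row}.
     *
     * @param firstRowIndexes first row index of each page, in ascending order
     * @param row row position, relative to the same origin as firstRowIndexes
     */
    static int findPageIndex(long[] firstRowIndexes, long row) {
        if (firstRowIndexes.length == 0 || row < firstRowIndexes[0]) {
            throw new IllegalArgumentException("row " + row + " precedes the first page");
        }
        // Note: the caller is assumed to have checked that row is below the total row count;
        // the first row indexes alone do not say where the last page ends.
        final int pos = Arrays.binarySearch(firstRowIndexes, row);
        // A non-negative result means row is the first row of page pos.
        // Otherwise binarySearch returns -(insertionPoint) - 1, and the containing
        // page is the one just before the insertion point, i.e. -pos - 2.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

With first row indexes {0, 100, 250}, row 99 falls in page 0, row 100 in page 1, and row 1000 in page 2. No earlier page has to be read to reach that answer, which is what makes sparse reads cheaper.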
This PR makes that change and also fixes a bug in writing the offset index for array and vector columns. Because of this bug fix, the old code cannot read parquet files generated with the new code. So the PR is backwards compatible but not forwards compatible.
Also, as part of this PR, we are deleting FixedPageSizeColumnChunkPageStore, because detecting whether page sizes are fixed requires the offset index. So fixed page size support is added in the new OffsetIndexBasedColumnChunkPageStore.
Benchmarking results: https://docs.google.com/document/d/19AIp6UpHTWpbmxo0ldkBYyuRKOIhlGLP2NHU26jOeOM/edit?usp=sharing
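To illustrate why fixed page size detection depends on the offset index, here is one possible check, written as a sketch under assumptions rather than as the logic in OffsetIndexBasedColumnChunkPageStore. It assumes the first row index of every page is available as a sorted `long[]` and treats the pages as fixed size when every consecutive gap is equal (the last page may still be shorter, since its length is not visible from first row indexes alone). All names are hypothetical.

```java
import java.util.OptionalLong;

// Hypothetical helper, not the PR's implementation.
final class FixedPageSizeSketch {

    /** Returns the common rows-per-page value if all gaps between first row indexes match. */
    static OptionalLong fixedRowsPerPage(long[] firstRowIndexes) {
        if (firstRowIndexes.length < 2) {
            return OptionalLong.empty(); // a single page gives no gap to compare
        }
        final long size = firstRowIndexes[1] - firstRowIndexes[0];
        if (size <= 0) {
            return OptionalLong.empty(); // guard against empty or unsorted pages
        }
        for (int i = 2; i < firstRowIndexes.length; i++) {
            if (firstRowIndexes[i] - firstRowIndexes[i - 1] != size) {
                return OptionalLong.empty();
            }
        }
        return OptionalLong.of(size);
    }

    /** With a fixed page size, the page for a row is a division instead of a search. */
    static int pageIndexForRow(long row, long firstRowOfFirstPage, long rowsPerPage) {
        return Math.toIntExact((row - firstRowOfFirstPage) / rowsPerPage);
    }
}
```

When the check succeeds, a row maps to its page in constant time; when it returns empty, a reader would fall back to searching the first row indexes.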
Closes #4717, #4718
Related to #4879