Improvements to buffered reading for parquet #5611
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java
private Dictionary readDictionary(long dictionaryPageOffset, SeekableChannelContext channelContext) {
    // Use the context object provided by the caller, or create (and close) a new one
    try (
            final ContextHolder holder = SeekableChannelContext.ensureContext(channelsProvider, channelContext);
In the original code, the calling method created the channel and the stream, and this method just used that stream without touching the underlying channel.
Now we create two streams over the same channel: one for the header and one for the data.
Note that the channel's position advances once the header has been read.
So I wanted to limit the channel's lifecycle to this method, so that nothing else depends on or uses this channel. That is why I moved the channel creation into this method.
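To illustrate the point about the shared channel position, here is a minimal sketch using only `java.nio`, not the Deephaven classes. The class name `SharedChannelSketch`, the helper names, and the header and data lengths are all hypothetical. It opens one channel inside the method, reads a "header" through one stream and the "data" through a second stream over the same channel, and records the channel position after each step.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedChannelSketch {

    /** Returns {channel position after the header, channel position after the data}. */
    static long[] readHeaderThenData(Path file, int headerLen, int dataLen) throws IOException {
        // The channel is opened and closed here, so no caller can observe its moving position.
        try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
            // First stream: header only. It is deliberately not closed, because closing a
            // stream obtained from Channels.newInputStream also closes the channel.
            InputStream headerStream = Channels.newInputStream(channel);
            new DataInputStream(headerStream).readFully(new byte[headerLen]);
            long afterHeader = channel.position();

            // Second stream over the same channel: it starts where the header read stopped.
            InputStream dataStream = Channels.newInputStream(channel);
            new DataInputStream(dataStream).readFully(new byte[dataLen]);
            long afterData = channel.position();

            return new long[] {afterHeader, afterData};
        }
    }

    /** Hypothetical demo: a 64-byte temp file with a 4-byte header and 32 bytes of data. */
    static long[] demo() {
        try {
            Path file = Files.createTempFile("shared-channel", ".bin");
            try {
                Files.write(file, new byte[64]);
                return readHeaderThenData(file, 4, 32);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] positions = demo();
        System.out.println("after header: " + positions[0] + ", after data: " + positions[1]);
    }
}
```

Because the second stream implicitly depends on where the first one left the channel, keeping the channel local to one method avoids any other code relying on that position.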
Currently, when reading bytes from parquet files, we read in chunks of 8K bytes.
So in cases where we need fewer bytes (like reading page headers), this reads extra bytes.
And in cases where we need to read data in bulk (like reading the actual data bytes), this can lead to repeated reads in 8K chunks until we get the required number of bytes.
This PR adds a size hint to the stream creation method, so that we can accurately size the internal buffered input stream based on how much data the user actually wants to read from the stream.
This PR leads to minor parquet read performance improvements.
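The size-hint idea described above can be sketched with the JDK's `java.io.BufferedInputStream` as a stand-in. This is not the PR's implementation: the class `SizeHintSketch`, the method names, the cap `MAX_BUFFER_SIZE`, and the reading of "8K bytes" as 8192 are assumptions made for illustration. The sketch measures how many bytes are pulled from the underlying source when only a small header is wanted.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class SizeHintSketch {
    // Assumption: "8K bytes" in the description is taken to mean 8192.
    static final int DEFAULT_BUFFER_SIZE = 8192;
    // Hypothetical cap so that a very large hint cannot allocate an unbounded buffer.
    static final int MAX_BUFFER_SIZE = 1 << 20;

    /** Picks a buffer size from the hint; a non-positive hint means "no hint". */
    static int bufferSizeFor(int sizeHint) {
        if (sizeHint <= 0) {
            return DEFAULT_BUFFER_SIZE;
        }
        return Math.min(sizeHint, MAX_BUFFER_SIZE);
    }

    /** Wraps the raw stream in a buffer sized from the hint. */
    static InputStream newBufferedStream(InputStream raw, int sizeHint) {
        return new BufferedInputStream(raw, bufferSizeFor(sizeHint));
    }

    /** Reads bytesWanted bytes and reports how many bytes were consumed from the source. */
    static int bytesPulledFromSource(int sourceSize, int bytesWanted, int sizeHint) {
        ByteArrayInputStream raw = new ByteArrayInputStream(new byte[sourceSize]);
        InputStream in = newBufferedStream(raw, sizeHint);
        byte[] dest = new byte[bytesWanted];
        try {
            int off = 0;
            while (off < bytesWanted) {
                int n = in.read(dest, off, bytesWanted - off);
                if (n < 0) {
                    break; // source exhausted
                }
                off += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // available() on the raw in-memory stream tells us what was not yet consumed.
        return sourceSize - raw.available();
    }

    public static void main(String[] args) {
        // A 16-byte "page header" read without a hint, then with a matching hint.
        System.out.println(bytesPulledFromSource(100_000, 16, 0));
        System.out.println(bytesPulledFromSource(100_000, 16, 16));
    }
}
```

In this stand-in, the unhinted 16-byte read pulls a full 8192-byte buffer from the source, while the hinted read pulls only the 16 bytes requested. How the real channel-backed stream behaves for bulk reads depends on its own implementation, which this sketch does not model.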