Add peek_next_page_offset to SerializedPageReader #6945
base: main
Thanks @XiangpengHao -- I think this looks good to me. I think the only thing needed is to avoid adding this as a pub API, otherwise LGTM 👍
    next_page_header,
} => {
    loop {
        if *remaining_bytes == 0 {
My only real concern is that this body has so much duplication with peek_next_page (especially in the SerializedPageReaderState::Values block).
It is also somewhat strange that it is in a different impl block than peek_next_page (I would have expected it to be next to it), but maybe I missed some generic subtlety.
I tried a few ways to avoid the duplication and didn't really find any good way to do so.
> has so much duplication with peek_next_page
Agreed. I tried to make peek_next_page return an offset as well, but had no luck doing it easily.
> in a different impl block than peek_next_page
I think that's because peek_next_page is in the PageReader trait.
Co-authored-by: Andrew Lamb <[email protected]>
Let's keep pushing -- thank you @XiangpengHao
Which issue does this PR close?
Part of #6921
Rationale for this change
The current Parquet reader, when used with a row filter, decompresses pages twice. To avoid that, we want to cache the first decompressed page, as described in #6921.
This PR adds a peek_next_page_offset function, so that we can use the offset of the page to determine whether we have decompressed this page before.
I'm not sure this is the best place to add the functionality; please let me know if there are better ways to do it.
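To illustrate the idea, here is a minimal sketch of an offset-keyed page cache. Everything in it is hypothetical: MockPageReader, skip_next_page, rewind and next_page_cached are stand-ins invented for this example (only the name peek_next_page_offset comes from this PR), and reversing the bytes merely simulates decompression. It is not the parquet crate's API or the caching planned in #6921.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a page reader; not the parquet crate's API.
struct MockPageReader {
    /// (file offset, compressed bytes) for each page, in file order.
    pages: Vec<(u64, Vec<u8>)>,
    /// Index of the next page to read.
    next: usize,
    /// How many times the expensive "decompression" ran.
    decompress_calls: usize,
}

impl MockPageReader {
    fn new(pages: Vec<(u64, Vec<u8>)>) -> Self {
        Self { pages, next: 0, decompress_calls: 0 }
    }

    /// Offset of the next page, without reading it; `None` at the end.
    fn peek_next_page_offset(&self) -> Option<u64> {
        self.pages.get(self.next).map(|(offset, _)| *offset)
    }

    /// Advance past the next page without decompressing it.
    fn skip_next_page(&mut self) {
        if self.next < self.pages.len() {
            self.next += 1;
        }
    }

    /// Read the next page; reversing the bytes stands in for decompression.
    fn read_next_page(&mut self) -> Option<Vec<u8>> {
        let (_, compressed) = self.pages.get(self.next)?;
        let page: Vec<u8> = compressed.iter().rev().copied().collect();
        self.next += 1;
        self.decompress_calls += 1;
        Some(page)
    }

    /// Start over, as a second pass over the same column chunk would.
    fn rewind(&mut self) {
        self.next = 0;
    }
}

/// Return the next decompressed page, reusing a cached copy when the page
/// at this offset has been decompressed before.
fn next_page_cached(
    reader: &mut MockPageReader,
    cache: &mut HashMap<u64, Vec<u8>>,
) -> Option<Vec<u8>> {
    let offset = reader.peek_next_page_offset()?;
    if let Some(page) = cache.get(&offset) {
        // Cache hit: skip the page instead of decompressing it again.
        reader.skip_next_page();
        return Some(page.clone());
    }
    let page = reader.read_next_page()?;
    cache.insert(offset, page.clone());
    Some(page)
}

fn main() {
    let mut reader = MockPageReader::new(vec![(4, vec![1, 2, 3]), (64, vec![4, 5])]);
    let mut cache = HashMap::new();

    // First pass (e.g. evaluating the row filter) decompresses each page once.
    while next_page_cached(&mut reader, &mut cache).is_some() {}

    // Second pass (e.g. reading the selected rows) is served from the cache.
    reader.rewind();
    while next_page_cached(&mut reader, &mut cache).is_some() {}

    println!("decompress calls: {}", reader.decompress_calls);
}
```

In this sketch the second pass never calls read_next_page, so each of the two pages is decompressed once rather than twice. The offset works as a cache key here because each page in the mock has a distinct offset.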
We could alternatively add this method to the PageReader trait, but that's a much larger change. Or we could add an offset field to peek_next_page, but that would break the PageHeader -> PageMetadata conversion.
What changes are included in this PR?
Are there any user-facing changes?
No