Add peek_next_page_offset to SerializedPageReader #6945

Open: 2 commits to merge into main

XiangpengHao
Contributor

Which issue does this PR close?

Part of #6921

Rationale for this change

The current parquet reader, when used with a row filter, decompresses pages twice. To avoid that, we want to cache the first decompressed page, as described in #6921.

This PR adds a peek_next_page_offset function so that we can use a page's offset to determine whether we have already decompressed that page.
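A minimal sketch of the idea, not the code in this PR: a cache keyed by page offset that only decompresses on a miss. `PageCache`, `DecompressedPage`, `get_or_decompress`, and the `u64` offset type are all hypothetical names and choices made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a decompressed page (the real reader has its own page type).
#[derive(Debug, Clone, PartialEq)]
pub struct DecompressedPage {
    pub bytes: Vec<u8>,
}

/// Hypothetical cache keyed by the byte offset at which a page starts.
#[derive(Default)]
pub struct PageCache {
    pages: HashMap<u64, DecompressedPage>,
    /// Number of times the decompression closure was actually run.
    pub decompress_calls: usize,
}

impl PageCache {
    /// Returns the page cached for `offset`, running `decompress` only on a cache miss.
    pub fn get_or_decompress<F>(&mut self, offset: u64, decompress: F) -> &DecompressedPage
    where
        F: FnOnce() -> DecompressedPage,
    {
        if !self.pages.contains_key(&offset) {
            self.decompress_calls += 1;
            self.pages.insert(offset, decompress());
        }
        &self.pages[&offset]
    }
}
```

With a cache like this, asking for the same offset twice runs the decompression closure once; the second request is served from the map.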

I'm not sure this is the best place to add the functionality; please let me know if there is a better way to do it.
We could alternatively add this method to the PageReader trait, but that's a much larger change. Or we could add an offset field to the result of peek_next_page, but that would break the PageHeader -> PageMetadata conversion.
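To illustrate the second alternative, here is a sketch using simplified stand-in types, not the crate's real definitions. It assumes the metadata is built from the header alone; the field names, the `String` error type, and `metadata_with_offset` are hypothetical. Since a header does not know where it sits in the file, an offset would have to be supplied from the reader's own state.

```rust
use std::convert::TryFrom;

/// Simplified stand-in for a page header: it describes the page but carries no file position.
pub struct PageHeader {
    pub num_values: i32,
    pub is_dictionary: bool,
}

/// Simplified stand-in for the metadata derived from a header.
#[derive(Debug, PartialEq)]
pub struct PageMetadata {
    pub num_values: usize,
    pub is_dict: bool,
}

impl TryFrom<&PageHeader> for PageMetadata {
    type Error = String;

    /// Header-only conversion: there is nothing here to fill an offset field from.
    fn try_from(header: &PageHeader) -> Result<Self, Self::Error> {
        if header.num_values < 0 {
            return Err(format!("negative value count: {}", header.num_values));
        }
        Ok(PageMetadata {
            num_values: header.num_values as usize,
            is_dict: header.is_dictionary,
        })
    }
}

/// Hypothetical wrapper pairing the metadata with an offset tracked by the reader.
#[derive(Debug, PartialEq)]
pub struct PageMetadataWithOffset {
    pub metadata: PageMetadata,
    pub offset: u64,
}

/// The offset is passed in separately because the header cannot provide it.
pub fn metadata_with_offset(
    header: &PageHeader,
    offset: u64,
) -> Result<PageMetadataWithOffset, String> {
    let metadata = PageMetadata::try_from(header)?;
    Ok(PageMetadataWithOffset { metadata, offset })
}
```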

What changes are included in this PR?

Are there any user-facing changes?

No

github-actions bot added the parquet (Changes to the parquet crate) label on Jan 6, 2025
Contributor

alamb left a comment

Thanks @XiangpengHao -- this looks good to me. The only thing needed is to avoid adding this as a pub API; otherwise LGTM 👍

parquet/src/file/serialized_reader.rs (outdated)

        next_page_header,
    } => {
        loop {
            if *remaining_bytes == 0 {
Contributor

My only real concern is that this body duplicates so much of peek_next_page (especially in the SerializedPageReaderState::Values block).

It is also somewhat strange that it is in a different impl block than peek_next_page (I would have expected it to be next to it), but maybe I missed some generic subtlety.

I tried a few ways to avoid the duplication and didn't find a good way to do so.

Contributor Author

> has so much duplication with peek_next_page

Agreed. I tried to make peek_next_page return an offset as well, but had no luck finding an easy way to do it.

> in a different impl block than peek_next_page

I think it's because peek_next_page is part of the PageReader trait.
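For context, a minimal sketch of why the two methods end up in separate impl blocks, using stand-in types rather than the crate's real ones. A method declared by a trait must live in the `impl Trait for Type` block, while a helper the trait does not declare goes in an inherent `impl Type` block. `SimpleReader`, its fields, and `advance` are hypothetical. The `pub(crate)` visibility is one possible way to keep the helper out of the public API, as suggested in the review.

```rust
/// Simplified stand-in for the trait that declares `peek_next_page`.
pub trait PageReader {
    /// Returns the value count of the next page without consuming it.
    fn peek_next_page(&self) -> Option<usize>;
}

/// Hypothetical reader that tracks upcoming pages as (offset, num_values) pairs.
pub struct SimpleReader {
    pages: Vec<(u64, usize)>,
    next: usize,
}

// Inherent impl: methods that the trait does not declare must live here.
impl SimpleReader {
    pub fn new(pages: Vec<(u64, usize)>) -> Self {
        Self { pages, next: 0 }
    }

    /// Crate-private helper, so it does not become part of the public API.
    pub(crate) fn peek_next_page_offset(&self) -> Option<u64> {
        self.pages.get(self.next).map(|&(offset, _)| offset)
    }

    /// Moves on to the next page, if there is one.
    pub fn advance(&mut self) {
        if self.next < self.pages.len() {
            self.next += 1;
        }
    }
}

// Trait impl: only methods declared by `PageReader` may appear in this block.
impl PageReader for SimpleReader {
    fn peek_next_page(&self) -> Option<usize> {
        self.pages.get(self.next).map(|&(_, num_values)| num_values)
    }
}
```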

Contributor

alamb commented Jan 6, 2025

FYI @tustvold and @etseidl

Contributor

alamb commented Jan 8, 2025

Let's keep pushing -- thank you @XiangpengHao
