Improve parallelization of reads through VariablePageSizeColumnChunkPageStore #4718

Closed
malhotrashivam opened this issue Oct 25, 2023 · 2 comments
Labels: bug, core, parquet

Comments


malhotrashivam commented Oct 25, 2023

It looks like while reading a Parquet file using VariablePageSizeColumnChunkPageStore, we’re materializing every prior page sequentially to reach a particular page. This makes reads almost entirely serial, since pages cannot be read independently in parallel.

We should check whether we can avoid materializing the pages before the target page by reading only their headers to determine the page boundaries.

We can also look at widening the lock so there is less switching; right now it’s taken in extendOnePage.

malhotrashivam added the bug, triage, core, and parquet labels Oct 25, 2023
malhotrashivam added this to the October 2023 milestone Oct 25, 2023
malhotrashivam self-assigned this Oct 25, 2023
rcaudy removed the triage label Oct 25, 2023
@malhotrashivam
Contributor Author

On further inspection, it's clear that we don't materialize whole pages to reach a particular page; we only iterate over the page headers to calculate the number of rows. We are addressing this issue in #4844 by introducing a new offset-index-based page store for faster page access.
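For readers following along, here is a rough sketch of what locating a page by walking headers amounts to. It is not the code in VariablePageSizeColumnChunkPageStore: the class name, method name, and the `rowsPerPage` array are hypothetical stand-ins for the row counts read from successive page headers, and it assumes a flat (non-repeated) column so that a header's value count equals its row count.

```java
final class HeaderScanSketch {
    /**
     * Returns the index of the page containing targetRow, or -1 if the row is out of range.
     * rowsPerPage[i] stands in for the row count taken from the i-th page header
     * (hypothetical input; no page payload is read or decoded here).
     */
    static int findPageByHeaderScan(int[] rowsPerPage, long targetRow) {
        if (targetRow < 0) {
            return -1;
        }
        long firstRowOfPage = 0;
        for (int page = 0; page < rowsPerPage.length; page++) {
            long firstRowOfNextPage = firstRowOfPage + rowsPerPage[page];
            if (targetRow < firstRowOfNextPage) {
                return page; // target row falls inside this page
            }
            firstRowOfPage = firstRowOfNextPage;
        }
        return -1; // ran past the last header without reaching targetRow
    }
}
```

Even without decoding any page data, the scan is linear in the number of pages before the target, and each step depends on the previous header, which is why access through this store stays sequential.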

@malhotrashivam
Contributor Author

As noted earlier, VariablePageSizeColumnChunkPageStore does not materialize full pages, only the page headers.
Also, going forward, we use OffsetIndexBasedColumnChunkPageStore whenever an offset index is present in the file, since it provides faster access to pages.
So we don't have any immediate need or clear path to optimize VariablePageSizeColumnChunkPageStore any further.
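To illustrate why an offset index helps, here is a minimal sketch of the lookup it makes possible. It is not the implementation of OffsetIndexBasedColumnChunkPageStore: the class, method, and parameter names are hypothetical, and `firstRowIndex` is assumed to hold the first row index of each page as recorded in the Parquet offset index, sorted ascending and starting at 0.

```java
import java.util.Arrays;

final class OffsetIndexLookupSketch {
    /**
     * Returns the index of the page containing targetRow, or -1 if the row is out of range.
     * firstRowIndex[i] is the first row of page i (assumed sorted, firstRowIndex[0] == 0);
     * totalRows is the row count of the column chunk. Both are hypothetical inputs.
     */
    static int findPageByOffsetIndex(long[] firstRowIndex, long totalRows, long targetRow) {
        if (targetRow < 0 || targetRow >= totalRows) {
            return -1;
        }
        int pos = Arrays.binarySearch(firstRowIndex, targetRow);
        // Exact hit: targetRow is the first row of page pos.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the containing page
        // is the one just before the insertion point, i.e. -pos - 2.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

Because the lookup needs no state from earlier pages, any page can be located directly (and the offset index also records where each page starts in the file), so pages can be fetched independently rather than one after another.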


3 participants