Feature Request: Seek to RowGroup #461
Comments
Something like this, maybe? (Note that this lacks a lot of nil/empty checks.) My personal opinion is that this is kind of "easy":
There are definitely valid use cases for this, though I have never encountered one. Note that min and max are not mandatory, so this functionality only works for parquet files that actually include them.
Yeah, that is (more or less) how you'd identify row groups you care about. To clarify, though, the issue is that having done those checks there's no (easy, non-super-invasive) way to seek the ParquetReader into the right spot to consume from the beginning of the row-group. That's what the
True, but it's most likely that the files being consumed were generated using this library, and it does set the Min/MaxValue fields.
I posted a draft PR of my PoC here: #469
In theory, one of the advantages of the parquet format is the ability to use metadata in the footer to avoid processing the entire file in order to locate specific records of interest. Specifically, one wants to use the RowGroup's min/max values per column to avoid processing RowGroups that don't contain records with particular values.
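To make the pruning idea concrete, here is a minimal sketch in Go. It assumes the footer statistics have already been decoded into a simple struct; `rowGroupStats` and `candidateRowGroups` are hypothetical names for illustration, not types or functions from this library. Because min/max are optional in the format, a row group without them has to be treated as a possible match.

```go
package main

import "fmt"

// rowGroupStats is a hypothetical stand-in for the per-column statistics
// stored in a Parquet footer; it is not a type from this library.
type rowGroupStats struct {
	HasMinMax bool // min/max are optional in the format
	Min, Max  int64
}

// candidateRowGroups returns the indexes of row groups that might contain
// target. Row groups without min/max cannot be ruled out, so they are kept.
func candidateRowGroups(stats []rowGroupStats, target int64) []int {
	var out []int
	for i, s := range stats {
		if !s.HasMinMax || (s.Min <= target && target <= s.Max) {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	stats := []rowGroupStats{
		{HasMinMax: true, Min: 0, Max: 99},
		{HasMinMax: true, Min: 100, Max: 199},
		{HasMinMax: false}, // no statistics: must still be scanned
	}
	fmt.Println(candidateRowGroups(stats, 150)) // prints [1 2]
}
```

Only the row groups returned here would need to be read; everything else could be skipped entirely if the reader could jump straight to a row group.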
In practice, I can't see a way to do that using this library. SkipRows does almost what is needed, but the API doesn't make it possible (or at least easy) to navigate between row groups, and it needs to process every page so it doesn't provide the performance benefit.
I propose a new method on the Reader and ColumnReader types:
SeekRowGroup(index int64) error
that logically moves the reader to the start of the row group. This, in conjunction with the metadata in the footer, can be used to efficiently skip RowGroups that are known not to contain desired records. If you have any interest in including a feature like this, I have a proof-of-concept that seems to work and that I can flesh out.
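As a rough illustration of the intended semantics, the sketch below models the proposed method against an in-memory stand-in. Only the `SeekRowGroup(index int64) error` signature comes from the proposal above; `rowGroupSeeker`, `fakeReader`, and its fields are hypothetical and exist only for this example. The fake tracks just the logical row position, whereas a real implementation would presumably jump to the row group's offset in the file rather than count rows.

```go
package main

import (
	"errors"
	"fmt"
)

// rowGroupSeeker captures only the method proposed in this issue.
type rowGroupSeeker interface {
	SeekRowGroup(index int64) error
}

// fakeReader is a hypothetical in-memory stand-in, not the library's reader.
type fakeReader struct {
	rowGroupSizes []int64 // number of rows in each row group
	nextRow       int64   // absolute index of the next row to be read
}

// SeekRowGroup positions the reader at the first row of the given row group.
func (r *fakeReader) SeekRowGroup(index int64) error {
	if index < 0 || index >= int64(len(r.rowGroupSizes)) {
		return errors.New("row group index out of range")
	}
	var start int64
	for _, n := range r.rowGroupSizes[:index] {
		start += n
	}
	r.nextRow = start
	return nil
}

func main() {
	r := &fakeReader{rowGroupSizes: []int64{1000, 1000, 500}}
	var s rowGroupSeeker = r
	if err := s.SeekRowGroup(2); err != nil {
		fmt.Println("seek failed:", err)
		return
	}
	fmt.Println(r.nextRow) // prints 2000
}
```

After a successful seek, subsequent reads would start at the first row of the chosen row group, without touching the pages of the row groups before it.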