
Add support to correct offset index in old parquet files #4879

Open
malhotrashivam opened this issue Nov 22, 2023 · 0 comments
Labels: core (Core development tasks), feature request (New feature or request), parquet (Related to the Parquet integration)

@malhotrashivam
Contributor

As part of #4844, we found and fixed a bug in how the offset index is calculated when writing vector and array columns to Parquet files. We also updated the reading code to use a new OffsetIndexBasedColumnChunkPageStore class, which expects these values to be correct.

The drawback is that the new reading logic cannot read old Parquet files that were written with the buggy offset index calculation. For that reason, we kept a fallback branch in the code that uses VariablePageSizeColumnChunkPageStore when we detect such files.
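
The fallback can be pictured as a small selection step. This is only a minimal sketch: the issue does not say how old files are detected, so the `offsetIndexTrusted` flag, the `PageStoreKind` enum, and the `select` method are hypothetical names for illustration, not the actual reading code. Only the two page store class names come from this issue.

```java
public final class PageStoreSelectionSketch {

    /** Stand-ins for the two page store classes named in this issue. */
    public enum PageStoreKind {
        OFFSET_INDEX_BASED,  // represents OffsetIndexBasedColumnChunkPageStore
        VARIABLE_PAGE_SIZE   // represents VariablePageSizeColumnChunkPageStore
    }

    /**
     * Picks a reading strategy for a column chunk.
     *
     * @param offsetIndexTrusted assumed to be derived from file metadata; true when
     *                           the file was written after the offset index fix
     */
    public static PageStoreKind select(boolean offsetIndexTrusted) {
        if (offsetIndexTrusted) {
            // New, more efficient path: relies on correct offset index values.
            return PageStoreKind.OFFSET_INDEX_BASED;
        }
        // Fallback path kept for files written with the buggy calculation.
        return PageStoreKind.VARIABLE_PAGE_SIZE;
    }

    private PageStoreSelectionSketch() {}
}
```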

The old reading logic is inefficient and more complicated. The long-term goal should be to remove it and, on detecting old files with incorrect offset indices, log a more appropriate error message such as "Please use this tool to re-write the parquet file". Before that can happen, we need to add support for updating such Parquet files.
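
Once the fallback is removed, the read path could reject old files up front. The following is a rough sketch under assumptions, not a proposed implementation: the class, the method, the `offsetIndexTrusted` flag, and the choice of `UnsupportedOperationException` are all hypothetical, and the rewrite tool mentioned in the message does not exist yet.

```java
public final class LegacyOffsetIndexCheckSketch {

    /** Message text suggested in this issue; the tool it refers to is still to be built. */
    static final String REWRITE_HINT = "Please use this tool to re-write the parquet file";

    /**
     * Fails fast for files whose offset index cannot be trusted.
     *
     * @param filePath           path used only to make the error message actionable
     * @param offsetIndexTrusted assumed to come from a metadata-based detection step
     */
    public static void requireCorrectOffsetIndex(String filePath, boolean offsetIndexTrusted) {
        if (!offsetIndexTrusted) {
            throw new UnsupportedOperationException(
                    "Parquet file " + filePath + " has an incorrect offset index. " + REWRITE_HINT);
        }
    }

    private LegacyOffsetIndexCheckSketch() {}
}
```

Failing before any page is read keeps the reader on a single code path, at the cost of requiring users to rewrite affected files first.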

@malhotrashivam added the feature request, core, and parquet labels Nov 22, 2023
@malhotrashivam added this to the Backlog milestone Nov 22, 2023
@malhotrashivam self-assigned this Nov 22, 2023