Add support to correct offset index in old parquet files #4879
Labels
core
Core development tasks
feature request
New feature or request
parquet
Related to the Parquet integration
As part of #4844, we found and fixed a bug in how the offset index is calculated when writing vector and array columns to parquet files. We also updated the reading code to use a new `OffsetIndexBasedColumnChunkPageStore` class, which expects these values to be correct.

The drawback is that the new reading logic cannot read old parquet files that were written with the buggy offset index calculation. That is why we had to keep a fallback branch in the code that uses `VariablePageSizeColumnChunkPageStore` for reading whenever we detect an old parquet file.

The old reading logic is inefficient and more complicated. The long-term goal should therefore be to remove it and, on detecting an old file with an incorrect offset index, log a more appropriate error message such as "Please use this tool to re-write the parquet file". Naturally, this requires first adding support for updating such parquet files.
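To make the two behaviors concrete, here is a minimal sketch of the selection logic described above. It is not the project's actual code: `PageStoreSelection`, `Kind`, both method names, and the `offsetIndexIsCorrect` flag are hypothetical stand-ins, and how an old file is actually detected is left out.

```java
// Hypothetical stand-ins for illustration; not the real classes or signatures.
final class PageStoreSelection {

    /** The two reading strategies named in this issue. */
    enum Kind { OFFSET_INDEX_BASED, VARIABLE_PAGE_SIZE }

    private PageStoreSelection() {}

    /** Current behavior: old files fall back to the older, slower reader. */
    static Kind chooseWithFallback(boolean offsetIndexIsCorrect) {
        return offsetIndexIsCorrect ? Kind.OFFSET_INDEX_BASED : Kind.VARIABLE_PAGE_SIZE;
    }

    /** Proposed behavior: no fallback; old files are rejected with an actionable message. */
    static Kind chooseStrict(boolean offsetIndexIsCorrect) {
        if (!offsetIndexIsCorrect) {
            // Assumed wording; the final message would name the actual re-write tool.
            throw new IllegalStateException(
                    "Parquet file has an incorrect offset index. "
                            + "Please use this tool to re-write the parquet file.");
        }
        return Kind.OFFSET_INDEX_BASED;
    }
}
```

With the strict variant, `VariablePageSizeColumnChunkPageStore` would no longer be reachable, so it could be deleted once the re-write tool exists.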