[Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 #45283
Comments
Do you know how the parquet file was generated? This is related to the newly implemented size statistics: cc @wgtmac |
With version 53.3.0 of the Rust My code to generate it is: https://gitlab.softwareheritage.org/swh/devel/swh-graph/-/blob/master/tools/provenance/src/bin/list-provenance-nodes.rs?ref_type=heads I can work on a smaller repro code if you think that would help |
From my understanding, parquet-rs builds statistics by default, so now that we are processing those in parquet-cpp you might have found an incompatibility. I guess the read stops failing if you disable statistics on the Writer, right? Could you validate that, please? edit: correct API to set statistics |
Confirmed, this happens both with |
The file schema is as below. All columns are
|
## Summary & Motivation I was probably partially premature in blaming Polars; they are probably just using the Rust crate under the hood to write Parquet, but the issue actually needs to be fixed on the Arrow side. Linking the [issue](apache/arrow#45283) so we have something to track.
…5285)

### Rationale for this change
The level histogram of size statistics can be omitted if its max level is 0. We haven't implemented this yet and enforce the histogram size to be equal to `max_level + 1`. As a result, an exception is thrown when reading a Parquet file with an omitted level histogram.

### What changes are included in this PR?
Omit level histogram when max level is 0.

### Are these changes tested?
Yes, a test case has been added to reflect the change.

### Are there any user-facing changes?
No.

* GitHub Issue: #45283

Lead-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
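To make the rule in the PR description concrete, here is a minimal Python sketch of the two checks. It is not the Arrow C++ code: the function names `check_histogram_strict` and `check_histogram_relaxed` are hypothetical, and treating every other length mismatch as an error is an assumption made for illustration. Only the `max_level + 1` size rule and the "may be omitted when max level is 0" exception come from the description above.

```python
from typing import Sequence


def check_histogram_strict(histogram: Sequence[int], max_level: int,
                           kind: str = "Repetition") -> None:
    """Hypothetical pre-fix check: size must always equal max_level + 1."""
    expected = max_level + 1
    if len(histogram) != expected:
        raise OSError(f"{kind} level histogram size mismatch: "
                      f"expected {expected}, got {len(histogram)}")


def check_histogram_relaxed(histogram: Sequence[int], max_level: int,
                            kind: str = "Repetition") -> None:
    """Hypothetical post-fix check: an omitted (empty) histogram is
    accepted when max_level is 0; otherwise the strict rule applies."""
    if max_level == 0 and len(histogram) == 0:
        return  # histogram legitimately omitted by the writer
    check_histogram_strict(histogram, max_level, kind)
```

Under these assumptions, a flat column (`max_level == 0`) whose writer omitted the histogram fails the strict check with a message like the one in the issue title, while the relaxed check lets it through.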
Issue resolved by pull request 45285 |
@wgtmac I marked this as a possible backport candidate, even though I don't think there are currently plans for a patch or minor release for v19 |
I ran into this issue with parquet files written by the delta-rs library. |
Given the severity of the issue, it seems that a patch release must happen regardless? Otherwise people get the following idea (somewhat justifiably):
|
Describe the bug, including details regarding any error messages, version, and platform.
Since pyarrow v19, this file cannot be read anymore:
test.parquet.gz (decompress it with gunzip; I had to compress it for GitHub to accept the upload)
With pyarrow 18.1.0:
With pyarrow 19.0.0:
Component(s)
Parquet