
[Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 #45283

Closed
progval opened this issue Jan 16, 2025 · 9 comments

progval commented Jan 16, 2025

Describe the bug, including details regarding any error messages, version, and platform.

Since pyarrow v19, this file cannot be read anymore:

test.parquet.gz (decompress it with gunzip; I had to compress it for Github to accept the upload)

with pyarrow 18.1.0:

$ python3                     
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.dataset
>>> dataset = pyarrow.dataset.dataset("test.parquet", format="parquet")
>>> dataset.to_table().to_pylist()
[{'id': 0, 'type': 'ori', 'sha1_git': b'\x8fP\xd3\xf6\x0e\xae7\r\xdb\xf8\\\x86!\x9cU\x10\x8a5\x01e'}, {'id': 2, 'type': 'ori', 'sha1_git': b'\x83@O\x99Q\x18\xbd%wOJ\xc1D"\xa8\xf1u\xe7\xa0T'}, {'id': 3, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\t'}, {'id': 4, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10'}, {'id': 6, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03'}, {'id': 7, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02'}, {'id': 8, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05'}, {'id': 9, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06'}, {'id': 10, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04'}, {'id': 11, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'}, {'id': 12, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08'}, {'id': 13, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x07'}, {'id': 14, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12'}, {'id': 15, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x11'}, {'id': 16, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x13'}, {'id': 17, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16'}, {'id': 18, 'type': 'cnt', 'sha1_git': 
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15'}, {'id': 19, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00!'}, {'id': 20, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18'}, {'id': 21, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x19'}, {'id': 22, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x17'}, {'id': 23, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14'}]
>>> 

with pyarrow 19.0.0:

$ python3                     
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.dataset
>>> dataset = pyarrow.dataset.dataset("test.parquet", format="parquet")
>>> dataset.to_table().to_pylist()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 574, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3865, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Repetition level histogram size mismatch

Component(s)

Parquet

raulcd (Member) commented Jan 16, 2025

Do you know how the parquet file was generated? This is related to the newly implemented size statistics:
https://github.com/raulcd/arrow/blob/f93004f23f7cb1a641abb805b10fb845c77bb23f/cpp/src/parquet/size_statistics.cc#L57-L60

cc @wgtmac

@raulcd raulcd changed the title "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 [Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 Jan 16, 2025
progval (Author) commented Jan 16, 2025

With version 53.3.0 of the Rust parquet crate.

My code to generate it is: https://gitlab.softwareheritage.org/swh/devel/swh-graph/-/blob/master/tools/provenance/src/bin/list-provenance-nodes.rs?ref_type=heads

I can work on a smaller reproduction if you think that would help.

raulcd (Member) commented Jan 16, 2025

From my understanding, parquet-rs writes statistics by default, so now that parquet-cpp processes them you might have found an incompatibility. I guess the issue stops occurring if you disable statistics on the writer, right? Could you validate that, please?

https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_statistics_enabled

edit: correct API to set statistics

progval (Author) commented Jan 16, 2025

Confirmed: this happens with both EnabledStatistics::Page and EnabledStatistics::Chunk, but not with EnabledStatistics::None. Specifically, it happens iff statistics are enabled on the type column, which is defined as a non-nullable Dictionary(Int8.into(), Utf8.into())

wgtmac (Member) commented Jan 16, 2025

The file schema is below. All columns are required, so their max_repetition_level is 0 and the corresponding repetition level histograms are omitted. parquet-cpp does not handle omitted histograms yet. Let me fix this.

message arrow_schema {
  required int64 id (INTEGER(64,false));
  required binary type (STRING);
  required fixed_len_byte_array(20) sha1_git;
}
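The failing check can be modeled in pure Python. This is an illustrative sketch of the reader-side validation, not the actual parquet-cpp code; the function name validate_rep_level_histogram is hypothetical.

```python
def validate_rep_level_histogram(histogram, max_rep_level):
    """Simplified model of the size-statistics histogram check.

    A level histogram has one bucket per possible level, i.e.
    max_level + 1 entries. parquet-rs omits the histogram entirely
    when max_level == 0, which the strict pre-fix check rejected.
    """
    if not histogram:
        # Omitted histogram: legal iff every value is at level 0,
        # i.e. the column is required and non-repeated.
        return max_rep_level == 0
    return len(histogram) == max_rep_level + 1

# Required column (max_rep_level == 0) with an omitted histogram
# is valid under the fixed behavior.
assert validate_rep_level_histogram([], 0)
# A present histogram must still cover levels 0..max_level.
assert validate_rep_level_histogram([3, 2], 1)
assert not validate_rep_level_histogram([3], 1)
```

The pre-fix parquet-cpp code unconditionally required len(histogram) == max_level + 1, so the empty (omitted) histogram written by parquet-rs for required columns tripped the "Repetition level histogram size mismatch" error.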

deepyaman added a commit to dagster-io/dagster that referenced this issue Jan 17, 2025
## Summary & Motivation

I was probably partially premature in blaming Polars; they are likely just using the Rust crate under the hood to write Parquet, but the issue actually needs to be fixed on the Arrow side. Linking the [issue](apache/arrow#45283) so we have something to track.
wgtmac added a commit that referenced this issue Jan 21, 2025
…5285)

### Rationale for this change

The level histogram in size statistics can be omitted if its max level is 0. We had not implemented this and enforced the histogram size to equal `max_level + 1`, so reading a Parquet file with an omitted level histogram threw an exception.

### What changes are included in this PR?

Omit level histogram when max level is 0.

### Are these changes tested?

Yes, a test case has been added to reflect the change.

### Are there any user-facing changes?

No.
* GitHub Issue: #45283

Lead-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
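The writer-side change described in the PR can be sketched in pure Python. This is an illustrative model, not the actual C++ implementation; the function name build_level_histogram is hypothetical.

```python
def build_level_histogram(levels, max_level):
    # Per the fix: when max_level == 0 every value sits in the single
    # level-0 bucket, so the histogram carries no information and is
    # omitted from the size statistics.
    if max_level == 0:
        return []
    # Otherwise the histogram has max_level + 1 buckets, one per level.
    histogram = [0] * (max_level + 1)
    for lvl in levels:
        histogram[lvl] += 1
    return histogram

# Required (non-repeated) column: histogram omitted entirely.
assert build_level_histogram([0, 0, 0], 0) == []
# Column with max_level 1: counts per level are recorded.
assert build_level_histogram([0, 1, 1, 0], 1) == [2, 2]
```

This matches what parquet-rs already does, which is why files it wrote were readable by pyarrow 18 (which ignored size statistics) but not by the stricter pyarrow 19 reader.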
@wgtmac wgtmac added this to the 20.0.0 milestone Jan 21, 2025
wgtmac (Member) commented Jan 21, 2025

Issue resolved by pull request #45285.

@wgtmac wgtmac closed this as completed Jan 21, 2025
raulcd (Member) commented Jan 21, 2025

@wgtmac I marked this as a possible backport candidate, even though I don't think there are currently plans for a patch or minor release for v19.

ldacey commented Jan 21, 2025

Do you know how the parquet file was generated? This is related to the newly implemented size statistics: https://github.com/raulcd/arrow/blob/f93004f23f7cb1a641abb805b10fb845c77bb23f/cpp/src/parquet/size_statistics.cc#L57-L60

cc @wgtmac

I ran into this issue with parquet files written by the delta-rs library.

marijncv pushed a commit to marijncv/dagster that referenced this issue Jan 21, 2025
h-vetinari (Contributor) commented:

@raulcd: I marked this as a possible backport-candidate, even though, I don't think there are current plans for a patch or a minor release for v19

Given the severity of the issue, it seems that a patch release must happen regardless? Otherwise people get the following idea (somewhat justifiably):

We probably don't want [an upgrade] as arrow 19.0.0 is broken across the ecosystem due to incompatibility with arrow-rs
