
[Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 #45283

Closed
progval opened this issue Jan 16, 2025 · 9 comments

progval commented Jan 16, 2025

Describe the bug, including details regarding any error messages, version, and platform.

Since pyarrow v19, this file cannot be read anymore:

test.parquet.gz (decompress it with gunzip; I had to compress it for Github to accept the upload)

with pyarrow 18.1.0:

$ python3                     
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.dataset
>>> dataset = pyarrow.dataset.dataset("test.parquet", format="parquet")
>>> dataset.to_table().to_pylist()
[{'id': 0, 'type': 'ori', 'sha1_git': b'\x8fP\xd3\xf6\x0e\xae7\r\xdb\xf8\\\x86!\x9cU\x10\x8a5\x01e'}, {'id': 2, 'type': 'ori', 'sha1_git': b'\x83@O\x99Q\x18\xbd%wOJ\xc1D"\xa8\xf1u\xe7\xa0T'}, {'id': 3, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\t'}, {'id': 4, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10'}, {'id': 6, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03'}, {'id': 7, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02'}, {'id': 8, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05'}, {'id': 9, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06'}, {'id': 10, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04'}, {'id': 11, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'}, {'id': 12, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08'}, {'id': 13, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x07'}, {'id': 14, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12'}, {'id': 15, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x11'}, {'id': 16, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x13'}, {'id': 17, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16'}, {'id': 18, 'type': 'cnt', 'sha1_git': 
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15'}, {'id': 19, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00!'}, {'id': 20, 'type': 'rev', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18'}, {'id': 21, 'type': 'rel', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x19'}, {'id': 22, 'type': 'dir', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x17'}, {'id': 23, 'type': 'cnt', 'sha1_git': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14'}]
>>> 

with pyarrow 19.0.0:

$ python3                     
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.dataset
>>> dataset = pyarrow.dataset.dataset("test.parquet", format="parquet")
>>> dataset.to_table().to_pylist()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 574, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3865, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Repetition level histogram size mismatch

Component(s)

Parquet

raulcd (Member) commented Jan 16, 2025

Do you know how the parquet file was generated? This is related to the newly implemented size statistics:
https://github.com/raulcd/arrow/blob/f93004f23f7cb1a641abb805b10fb845c77bb23f/cpp/src/parquet/size_statistics.cc#L57-L60

cc @wgtmac

@raulcd raulcd changed the title "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 [Python][C++][Parquet] "OSError: Repetition level histogram size mismatch" when reading parquet file in pyarrow since 19.0.0 Jan 16, 2025
progval (Author) commented Jan 16, 2025

With version 53.3.0 of the Rust parquet crate.

My code to generate it is: https://gitlab.softwareheritage.org/swh/devel/swh-graph/-/blob/master/tools/provenance/src/bin/list-provenance-nodes.rs?ref_type=heads

I can work on a smaller reproduction if you think that would help.

raulcd (Member) commented Jan 16, 2025

From my understanding, parquet-rs writes statistics by default, so now that parquet-cpp processes them you might have found an incompatibility. I guess the issue stops occurring if you disable statistics on the writer, right? Could you validate that, please?

https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_statistics_enabled

edit: correct API to set statistics

progval (Author) commented Jan 16, 2025

Confirmed: this happens with both EnabledStatistics::Page and EnabledStatistics::Chunk, but not with EnabledStatistics::None. Specifically, it happens iff statistics are enabled on the type column, which is defined as a non-nullable Dictionary(Int8.into(), Utf8.into())

wgtmac (Member) commented Jan 16, 2025

The file schema is below. All columns are required, so their max_repetition_level is 0 and the corresponding repetition level histograms are omitted. parquet-cpp does not handle omitted histograms yet. Let me fix this.

message arrow_schema {
  required int64 id (INTEGER(64,false));
  required binary type (STRING);
  required fixed_len_byte_array(20) sha1_git;
}
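The failing check can be modeled in pure Python. This is an illustrative sketch of the reader-side validation, not the actual parquet-cpp code; the function name validate_rep_level_histogram is hypothetical.

```python
def validate_rep_level_histogram(histogram, max_rep_level):
    """Simplified model of the size-statistics histogram check.

    A level histogram has one bucket per possible level, i.e.
    max_level + 1 entries. parquet-rs omits the histogram entirely
    when max_level == 0, which the strict pre-fix check rejected.
    """
    if not histogram:
        # Omitted histogram: legal iff every value is at level 0,
        # i.e. the column is required and non-repeated.
        return max_rep_level == 0
    return len(histogram) == max_rep_level + 1

# Required column (max_rep_level == 0) with an omitted histogram
# is valid under the fixed behavior.
assert validate_rep_level_histogram([], 0)
# A present histogram must still cover levels 0..max_level.
assert validate_rep_level_histogram([3, 2], 1)
assert not validate_rep_level_histogram([3], 1)
```

The pre-fix parquet-cpp code unconditionally required len(histogram) == max_level + 1, so the empty (omitted) histogram written by parquet-rs for required columns tripped the "Repetition level histogram size mismatch" error.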

deepyaman added a commit to dagster-io/dagster that referenced this issue Jan 17, 2025
## Summary & Motivation

I was probably partially premature in blaming Polars; they are likely just using the Rust crate under the hood to write Parquet, but the issue actually needs to be fixed on the Arrow side. Linking the [issue](apache/arrow#45283) so we have something to track.
wgtmac added a commit that referenced this issue Jan 21, 2025
…5285)

### Rationale for this change

The level histogram in size statistics can be omitted if its max level is 0. We had not implemented this and enforced the histogram size to equal `max_level + 1`, so reading a Parquet file with an omitted level histogram threw an exception.

### What changes are included in this PR?

Omit level histogram when max level is 0.

### Are these changes tested?

Yes, a test case has been added to reflect the change.

### Are there any user-facing changes?

No.
* GitHub Issue: #45283

Lead-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
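The writer-side change described in the PR can be sketched in pure Python. This is an illustrative model, not the actual C++ implementation; the function name build_level_histogram is hypothetical.

```python
def build_level_histogram(levels, max_level):
    # Per the fix: when max_level == 0 every value sits in the single
    # level-0 bucket, so the histogram carries no information and is
    # omitted from the size statistics.
    if max_level == 0:
        return []
    # Otherwise the histogram has max_level + 1 buckets, one per level.
    histogram = [0] * (max_level + 1)
    for lvl in levels:
        histogram[lvl] += 1
    return histogram

# Required (non-repeated) column: histogram omitted entirely.
assert build_level_histogram([0, 0, 0], 0) == []
# Column with max_level 1: counts per level are recorded.
assert build_level_histogram([0, 1, 1, 0], 1) == [2, 2]
```

This matches what parquet-rs already does, which is why files it wrote were readable by pyarrow 18 (which ignored size statistics) but not by the stricter pyarrow 19 reader.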
@wgtmac wgtmac added this to the 20.0.0 milestone Jan 21, 2025
wgtmac (Member) commented Jan 21, 2025

Issue resolved by pull request #45285.

@wgtmac wgtmac closed this as completed Jan 21, 2025
raulcd (Member) commented Jan 21, 2025

@wgtmac I marked this as a possible backport candidate, even though I don't think there are currently plans for a patch or minor release for v19.

ldacey commented Jan 21, 2025

Do you know how the parquet file was generated? This is related to the newly implemented size statistics: https://github.com/raulcd/arrow/blob/f93004f23f7cb1a641abb805b10fb845c77bb23f/cpp/src/parquet/size_statistics.cc#L57-L60

cc @wgtmac

I ran into this issue with parquet files written by the delta-rs library.

marijncv pushed a commit to marijncv/dagster that referenced this issue Jan 21, 2025
h-vetinari (Contributor) commented:

@raulcd: I marked this as a possible backport-candidate, even though, I don't think there are current plans for a patch or a minor release for v19

Given the severity of the issue, it seems that a patch release must happen regardless? Otherwise people get the following idea (somewhat justifiably):

We probably don't want [an upgrade] as arrow 19.0.0 is broken across the ecosystem due to incompatibility with arrow-rs
