
GH-31303: [Python] Remove the legacy ParquetDataset custom python-based implementation #39112

Merged

Conversation

@AlenkaF AlenkaF commented Dec 6, 2023

Rationale for this change

Legacy ParquetDataset has been deprecated for a while now, see #31529. This PR is removing the legacy implementation from the code.

What changes are included in this PR?

The PR is removing:

  • ParquetDatasetPiece
  • ParquetManifest
  • _ParquetDatasetMetadata
  • ParquetDataset

The PR is renaming _ParquetDatasetV2 to ParquetDataset, taking over the name of the removed legacy class. It is also updating the docstrings.

The PR is updating:

  • read_table
  • write_to_dataset

The PR is updating all the tests to stop using the use_legacy_dataset keyword and the legacy parametrisation.

Are these changes tested?

Yes.

Are there any user-facing changes?

Deprecated code is removed.

@jorisvandenbossche jorisvandenbossche left a comment


(already posting whatever I have right now)

The PartitionSet and ParquetPartitions classes can also be removed?

There are a few helper methods, like _get_filesystem_and_path and _mkdir_if_not_exists, that are no longer used and can be removed as well.

AlenkaF commented Dec 11, 2023

It is very hard to review this PR due to the way the diff is presented in GitHub. I tried to summarise the main changes in the description of the PR, hope it helps a bit.

@jorisvandenbossche after the last review I have updated the marks in the tests and added use_legacy_dataset=None to the ParquetDataset class, read_table and write_to_dataset.

I have also removed **kwargs from ParquetDataset (previously _ParquetDatasetV2), which meant I had to remove the code connected to metadata, split_row_groups and validate_schema (raising an error instead), so I added a note in the docstrings in b6799cf. Not sure if that is well done, though.
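The user-visible effect of dropping **kwargs can be sketched with plain Python (the function names below are illustrative, not pyarrow's actual code): an explicit keyword-only signature makes previously-swallowed options fail fast with a TypeError.

```python
def dataset_with_kwargs(path, **kwargs):
    # Old style: unknown options such as validate_schema are accepted
    # (and here silently ignored).
    return path

def dataset_explicit(path, *, filters=None):
    # New style: only the explicitly declared keyword-only options exist.
    return path

dataset_with_kwargs("data/", validate_schema=True)  # accepted silently

removed_kwarg_raises = False
try:
    dataset_explicit("data/", validate_schema=True)
except TypeError:
    # Python itself rejects the unexpected keyword argument.
    removed_kwarg_raises = True

assert removed_kwarg_raises
```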

AlenkaF commented Dec 21, 2023

Yeah, missed some metadata there. Thanks, will correct!

AlenkaF commented Dec 21, 2023

@jorisvandenbossche if I am understanding correctly, read_parquet from the failed build is part of a legacy Filesystem and will be removed? So for the time being I can add a ValueError if metadata is specified, or I can just remove it from the signature?

AlenkaF commented Dec 21, 2023

@github-actions crossbow submit -g python-*-hdfs


Invalid group(s) {'python-*-hdfs'}. Must be one of {'integration', 'example-cpp', 'verify-rc-source', 'fuzz', 'nightly', 'python', 'nightly-release', 'verify-rc-binaries', 'cpp', 'c-glib', 'conan', 'java', 'linux', 'go', 'example', 'verify-rc-wheels', 'r', 'packaging', 'test', 'nightly-tests', 'example-python', 'vcpkg', 'verify-rc-jars', 'linux-amd64', 'conda', 'homebrew', 'ruby', 'linux-arm64', 'verify-rc', 'verify-rc-source-macos', 'wheel', 'verify-rc-source-linux', 'nightly-packaging'}
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/7289911234

AlenkaF commented Dec 21, 2023

@github-actions crossbow submit hdfs

@jorisvandenbossche
I am not sure if raising for the metadata will be sufficient? My understanding is that those tests (for the legacy HDFS filesystem) are using hdfs.read_parquet -> pq.read_table by passing a legacy HDFS filesystem object, and with removing the legacy parquet code, we no longer support the legacy filesystems (and now as a next step, we can actually also remove those legacy filesystems).
So maybe we can just "xfail" those tests for now? And then when removing the HDFS legacy filesystem, we can see if we just remove those tests, or if we rewrite them using the new hdfs filesystem.
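A minimal sketch of that suggestion (the test name, body and reason string are made up for illustration):

```python
import pytest

@pytest.mark.xfail(
    reason="pq.read_table no longer supports the legacy HDFS filesystem"
)
def test_read_parquet_legacy_hdfs():
    # Placeholder standing in for the real legacy-filesystem test;
    # it is expected to fail until the legacy filesystem is removed
    # or the test is rewritten against the new HDFS filesystem.
    raise NotImplementedError("uses the removed legacy filesystem")
```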


Revision: 77b4ecb

Submitted crossbow builds: ursacomputing/crossbow @ actions-bb763d7a57

Task Status
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions

AlenkaF commented Dec 21, 2023

Still need to look into the failures with the "Unrecognized filesystem" error.

AlenkaF commented Dec 21, 2023

Ah, I guess the issue is the same; it only fails in _ensure_filesystem and not in read_parquet. Will add xfail marks there also.

AlenkaF commented Dec 21, 2023

@github-actions crossbow submit hdfs


Revision: 481a85c

Submitted crossbow builds: ursacomputing/crossbow @ actions-fc72b69813

Task Status
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions

@jorisvandenbossche jorisvandenbossche merged commit b70ad0b into apache:main Dec 21, 2023
11 checks passed
@jorisvandenbossche

Thanks a lot @AlenkaF for the work here!


After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit b70ad0b.

There were 7 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

@jorisvandenbossche

The wide-dataframe case seems a genuine perf regression (and not a flaky outlier as the other listed cases). That might mean that for wide dataframes, the new code path is slower compared to the legacy dataset reader (since with this commit, also when specifying use_legacy_dataset=True, the new code path will be used).
That seems to match with the timing in the use_legacy_dataset=False case of the wide-dataframe benchmark, as now both benchmarks more or less show the same timing.

However, I can't reproduce this locally with pyarrow 14.0 (where the legacy reader still exists):

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataframe = pd.DataFrame(np.random.rand(100, 10000))
table = pa.Table.from_pandas(dataframe)
pq.write_table(table, "test_wide_dataframe.parquet")
```

```
In [7]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=True)
392 ms ± 4.67 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)

In [8]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=False)
350 ms ± 11.5 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)
```

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024