
GH-31303: [Python] Remove the legacy ParquetDataset custom python-based implementation #39112

Merged

Conversation

@AlenkaF AlenkaF commented Dec 6, 2023

Rationale for this change

Legacy ParquetDataset has been deprecated for a while now, see #31529. This PR is removing the legacy implementation from the code.

What changes are included in this PR?

The PR is removing:

  • ParquetDatasetPiece
  • ParquetManifest
  • _ParquetDatasetMetadata
  • ParquetDataset

The PR is renaming _ParquetDatasetV2 to ParquetDataset, taking over the name of the removed legacy class. It is also updating the docstrings.

The PR is updating:

  • read_table
  • write_to_dataset

The PR is updating all the tests to stop using the use_legacy_dataset keyword and the legacy parametrisation.

Are these changes tested?

Yes.

Are there any user-facing changes?

Deprecated code is removed.

@jorisvandenbossche jorisvandenbossche left a comment


(already posting whatever I have right now)

The PartitionSet and ParquetPartitions classes can also be removed?

There are a few helper methods, like _get_filesystem_and_path and _mkdir_if_not_exists, that are no longer used and can be removed as well.

AlenkaF commented Dec 11, 2023

It is very hard to review this PR due to the way the diff is presented in GitHub. I tried to summarise the main changes in the description of the PR, hope it helps a bit.

@jorisvandenbossche after the last review I have updated the marks in the tests and added use_legacy_dataset=None to the ParquetDataset class, read_table and write_to_dataset.

I have also removed **kwargs from ParquetDataset (previously _ParquetDatasetV2), which meant I had to remove the code connected to metadata, split_row_groups and validate_schema (raising an error instead), so I added a note in the docstrings in b6799cf. Not sure if that is well done, though.
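The user-visible effect of dropping **kwargs can be sketched with plain Python (the function names below are illustrative, not pyarrow's actual code): an explicit keyword-only signature makes previously-swallowed options fail fast with a TypeError.

```python
def dataset_with_kwargs(path, **kwargs):
    # Old style: unknown options such as validate_schema are accepted
    # (and here silently ignored).
    return path

def dataset_explicit(path, *, filters=None):
    # New style: only the explicitly declared keyword-only options exist.
    return path

dataset_with_kwargs("data/", validate_schema=True)  # accepted silently

removed_kwarg_raises = False
try:
    dataset_explicit("data/", validate_schema=True)
except TypeError:
    # Python itself rejects the unexpected keyword argument.
    removed_kwarg_raises = True

assert removed_kwarg_raises
```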

AlenkaF commented Dec 21, 2023

Yeah, missed some metadata there. Thanks, will correct!

AlenkaF commented Dec 21, 2023

@jorisvandenbossche if I am understanding correctly, read_parquet from the failed build is part of a legacy Filesystem and will be removed? So for the time being I can add a ValueError if metadata is specified, or I can just remove it from the signature?

AlenkaF commented Dec 21, 2023

@github-actions crossbow submit -g python-*-hdfs


Invalid group(s) {'python-*-hdfs'}. Must be one of {'integration', 'example-cpp', 'verify-rc-source', 'fuzz', 'nightly', 'python', 'nightly-release', 'verify-rc-binaries', 'cpp', 'c-glib', 'conan', 'java', 'linux', 'go', 'example', 'verify-rc-wheels', 'r', 'packaging', 'test', 'nightly-tests', 'example-python', 'vcpkg', 'verify-rc-jars', 'linux-amd64', 'conda', 'homebrew', 'ruby', 'linux-arm64', 'verify-rc', 'verify-rc-source-macos', 'wheel', 'verify-rc-source-linux', 'nightly-packaging'}
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/7289911234

AlenkaF commented Dec 21, 2023

@github-actions crossbow submit hdfs

@jorisvandenbossche
I am not sure if raising for the metadata will be sufficient? My understanding is that those tests (for the legacy HDFS filesystem) are using hdfs.read_parquet -> pq.read_table by passing a legacy HDFS filesystem object, and with removing the legacy parquet code, we no longer support the legacy filesystems (and now as a next step, we can actually also remove those legacy filesystems).
So maybe we can just "xfail" those tests for now? And then when removing the HDFS legacy filesystem, we can see if we just remove those tests, or if we rewrite them using the new hdfs filesystem.
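A minimal sketch of that suggestion (the test name, body and reason string are made up for illustration):

```python
import pytest

@pytest.mark.xfail(
    reason="pq.read_table no longer supports the legacy HDFS filesystem"
)
def test_read_parquet_legacy_hdfs():
    # Placeholder standing in for the real legacy-filesystem test;
    # it is expected to fail until the legacy filesystem is removed
    # or the test is rewritten against the new HDFS filesystem.
    raise NotImplementedError("uses the removed legacy filesystem")
```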


Revision: 77b4ecb

Submitted crossbow builds: ursacomputing/crossbow @ actions-bb763d7a57

Task Status
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions

AlenkaF commented Dec 21, 2023

Still need to look into the failures with the "Unrecognized filesystem" error.

AlenkaF commented Dec 21, 2023

Ah, I guess the issue is the same; it only fails in _ensure_filesystem and not in read_parquet. Will add xfail marks there also.

AlenkaF commented Dec 21, 2023

@github-actions crossbow submit hdfs


Revision: 481a85c

Submitted crossbow builds: ursacomputing/crossbow @ actions-fc72b69813

Task Status
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions

@jorisvandenbossche jorisvandenbossche merged commit b70ad0b into apache:main Dec 21, 2023
11 checks passed
@jorisvandenbossche

Thanks a lot @AlenkaF for the work here!


After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit b70ad0b.

There were 7 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

@jorisvandenbossche

The wide-dataframe case seems a genuine perf regression (and not a flaky outlier as the other listed cases). That might mean that for wide dataframes, the new code path is slower compared to the legacy dataset reader (since with this commit, also when specifying use_legacy_dataset=True, the new code path will be used).
That seems to match with the timing in the use_legacy_dataset=False case of the wide-dataframe benchmark, as now both benchmarks more or less show the same timing.

However, I can't reproduce this locally with pyarrow 14.0 (where the legacy reader still exists):

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataframe = pd.DataFrame(np.random.rand(100, 10000))
table = pa.Table.from_pandas(dataframe)
pq.write_table(table, "test_wide_dataframe.parquet")
```

```
In [7]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=True)
392 ms ± 4.67 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)

In [8]: %timeit -r 50 pq.read_table("test_wide_dataframe.parquet", use_legacy_dataset=False)
350 ms ± 11.5 ms per loop (mean ± std. dev. of 50 runs, 1 loop each)
```

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024