GH-31303: [Python] Remove the legacy ParquetDataset custom python-based implementation #39112
(already posting whatever I have right now)
The `PartitionSet` and `ParquetPartitions` classes can also be removed?
There are a few helper methods, like `_get_filesystem_and_path` and `_mkdir_if_not_exists`, that are no longer used and can be removed as well.
It is very hard to review this PR due to the way the diff is presented in GitHub. I tried to summarise the main changes in the description of the PR, hope it helps a bit. @jorisvandenbossche after the last review I have updated the marks in the tests, added I have also removed
The HDFS failures seem to be related: https://github.com/ursacomputing/crossbow/actions/runs/7288206604/job/19860362598#step:6:9473
Yeah, missed some metadata there. Thanks, will correct!
@jorisvandenbossche if I am understanding correctly
@github-actions crossbow submit -g python-*-hdfs
@github-actions crossbow submit hdfs
I am not sure if raising for the
Revision: 77b4ecb Submitted crossbow builds: ursacomputing/crossbow @ actions-bb763d7a57
Still need to look into the failures with
Ah, I guess the issue is the same it only fails in
@github-actions crossbow submit hdfs
Revision: 481a85c Submitted crossbow builds: ursacomputing/crossbow @ actions-fc72b69813
Thanks a lot @AlenkaF for the work here!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit b70ad0b. There were 7 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
The However, I can't reproduce this locally with pyarrow 14.0 (where the legacy reader still exists):
### Rationale for this change

Legacy ParquetDataset has been deprecated for a while now, see #31529. This PR removes the legacy implementation from the code.

### What changes are included in this PR?

The PR removes:

- `ParquetDatasetPiece`
- `ParquetManifest`
- `_ParquetDatasetMetadata`
- `ParquetDataset`

The PR renames `_ParquetDatasetV2` to `ParquetDataset`, taking over the name of the removed legacy class. It also updates the docstrings.

The PR updates:

- `read_table`
- `write_to_dataset`

The PR updates all the tests so they no longer use the `use_legacy_dataset` keyword or the legacy parametrisation.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Deprecated code is removed.

* Closes: #31303
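
For downstream code that still forwards `use_legacy_dataset`, a small shim along the following lines might ease the transition. This is only a sketch: `without_legacy_keyword` and `read_parquet_dataset` are hypothetical helper names that are not part of pyarrow or of this PR, and the choice of `FutureWarning` is an assumption. The only pyarrow calls used are `pyarrow.parquet.ParquetDataset(...)` and its `.read()` method.

```python
import warnings

# Keyword named in the PR description; it selected the legacy implementation.
LEGACY_KEYWORD = "use_legacy_dataset"


def without_legacy_keyword(kwargs):
    """Return a copy of kwargs without the legacy keyword (hypothetical helper)."""
    cleaned = dict(kwargs)  # copy, so the caller's dict is left untouched
    if LEGACY_KEYWORD in cleaned:
        cleaned.pop(LEGACY_KEYWORD)
        warnings.warn(
            f"'{LEGACY_KEYWORD}' is ignored: the legacy ParquetDataset "
            "implementation has been removed",
            FutureWarning,  # assumption: any warning category would do here
            stacklevel=2,
        )
    return cleaned


def read_parquet_dataset(path, **kwargs):
    """Read a dataset into a Table via ParquetDataset (hypothetical wrapper)."""
    # Imported lazily so the helper above can be used without pyarrow installed.
    import pyarrow.parquet as pq

    return pq.ParquetDataset(path, **without_legacy_keyword(kwargs)).read()
```

A caller would then write something like `read_parquet_dataset("data/", use_legacy_dataset=False)`, which warns and drops the keyword before anything reaches pyarrow. The path `"data/"` is a placeholder.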