GH-31303: [Python] Remove the legacy ParquetDataset custom python-based implementation #39112

Merged

Changes from all commits (29 commits)
8070c7a
Initial commit
AlenkaF Dec 5, 2023
e0228fb
Remove additional code connected to ParquetDatasetPiece
AlenkaF Dec 6, 2023
94ad4e9
Merge _ParquetDatasetV2 and ParquetDataset
AlenkaF Dec 6, 2023
e5ff41d
Remove metadata_collector duplicate in write_to_dataset
AlenkaF Dec 6, 2023
31ad7d7
Linter
AlenkaF Dec 6, 2023
d9cc412
Remove PartitionSet, ParquetPartitions and few helper methods
AlenkaF Dec 7, 2023
abf58b9
Remove partition_filename_cb
AlenkaF Dec 7, 2023
8e72811
Keep use_legacy_dataset but deprecate it
AlenkaF Dec 7, 2023
863b15e
Lint .rst file
AlenkaF Dec 7, 2023
3e784b4
Update test marks
AlenkaF Dec 7, 2023
e13768e
Clean up the docstrings
AlenkaF Dec 11, 2023
bf0ce99
Remove a test for unsupported keywords and update docstrings
AlenkaF Dec 11, 2023
c9a7924
Change how we deal with unsupported keywords in ParquetDataset
AlenkaF Dec 11, 2023
b6799cf
Some more changes to the docstrings
AlenkaF Dec 11, 2023
45a2409
Update ParquetDataset docstrings
AlenkaF Dec 11, 2023
4573479
Fix docstring examples
AlenkaF Dec 11, 2023
88c340e
Remove metadata from read_table
AlenkaF Dec 12, 2023
0cbd03d
One more metadata case to remove in read_table
AlenkaF Dec 12, 2023
5687a1d
Remove dataset marks and some unused tempdir parameters
AlenkaF Dec 20, 2023
abe8ab0
Removed pandas marks
AlenkaF Dec 20, 2023
5b1f6ed
Add test for duplicate column selection in read_table and remove coup…
AlenkaF Dec 21, 2023
27c9f78
Use public attribute
AlenkaF Dec 21, 2023
cbd91cd
Keep issue reference for test_empty_directory
AlenkaF Dec 21, 2023
360d762
Leave out text in docstring of ParquetDataset
AlenkaF Dec 21, 2023
33bf492
Add back one pandas mark that should have been removed
AlenkaF Dec 21, 2023
77b4ecb
Add ValueError for metadata in FileSystem.read_parquet
AlenkaF Dec 21, 2023
4c89276
xfail test_read_multiple_parquet_files
AlenkaF Dec 21, 2023
481a85c
Add more xfails for _ensure_filesystem error
AlenkaF Dec 21, 2023
a2e75a4
Linter fixed
AlenkaF Dec 21, 2023
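
Note on the commit "Keep use_legacy_dataset but deprecate it" above: after this PR the keyword is still accepted, but only to warn; the legacy code path it used to select is gone. A minimal sketch of the expected behaviour (the exact warning class and message are assumptions, not verified against the merged code):

```python
import warnings

import pyarrow.parquet as pq

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Passing the deprecated keyword should still work but emit a warning;
    # it no longer switches between implementations.
    dataset = pq.ParquetDataset("dataset_name/", use_legacy_dataset=False)
    # Expected: a FutureWarning (assumed) telling users to drop the keyword.
    print([str(w.message) for w in caught])
```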
38 changes: 11 additions & 27 deletions docs/source/python/parquet.rst
@@ -511,36 +511,20 @@ from a remote filesystem into a pandas dataframe you may need to run
 ``sort_index`` to maintain row ordering (as long as the ``preserve_index``
 option was enabled on write).
 
-.. note::
-
-   The ParquetDataset is being reimplemented based on the new generic Dataset
-   API (see the :ref:`dataset` docs for an overview). This is not yet the
-   default, but can already be enabled by passing the ``use_legacy_dataset=False``
-   keyword to :class:`ParquetDataset` or :func:`read_table`::
-
-      pq.ParquetDataset('dataset_name/', use_legacy_dataset=False)
-
-   Enabling this gives the following new features:
-
-   - Filtering on all columns (using row group statistics) instead of only on
-     the partition keys.
-   - More fine-grained partitioning: support for a directory partitioning scheme
-     in addition to the Hive-like partitioning (e.g. "/2019/11/15/" instead of
-     "/year=2019/month=11/day=15/"), and the ability to specify a schema for
-     the partition keys.
-   - General performance improvement and bug fixes.
-
-   It also has the following changes in behaviour:
-
-   - The partition keys need to be explicitly included in the ``columns``
-     keyword when you want to include them in the result while reading a
-     subset of the columns
-
-   This new implementation is already enabled in ``read_table``, and in the
-   future, this will be turned on by default for ``ParquetDataset``. The new
-   implementation does not yet cover all existing ParquetDataset features (e.g.
-   specifying the ``metadata``, or the ``pieces`` property API). Feedback is
-   very welcome.
+Other features:
+
+- Filtering on all columns (using row group statistics) instead of only on
+  the partition keys.
+- Fine-grained partitioning: support for a directory partitioning scheme
+  in addition to the Hive-like partitioning (e.g. "/2019/11/15/" instead of
+  "/year=2019/month=11/day=15/"), and the ability to specify a schema for
+  the partition keys.
+
+Note:
+
+- The partition keys need to be explicitly included in the ``columns``
+  keyword when you want to include them in the result while reading a
+  subset of the columns
 
 
 Using with Spark
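
For illustration, the features described in the updated docs above as a minimal sketch (the dataset path and column names here are hypothetical, not taken from the docs):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Filtering on any column via row group statistics, not just partition keys:
dataset = pq.ParquetDataset("dataset_name/", filters=[("value", ">", 100)])

# Directory partitioning ("/2019/11/15/") with an explicit partition schema,
# instead of the Hive-style "/year=2019/month=11/day=15/":
part = ds.partitioning(pa.schema([("year", pa.int16()),
                                  ("month", pa.int8()),
                                  ("day", pa.int8())]))
dataset = pq.ParquetDataset("dataset_name/", partitioning=part)

# Partition keys must be listed explicitly in ``columns`` to appear in the
# result when reading a subset of the columns:
table = dataset.read(columns=["value", "year"])
```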
29 changes: 0 additions & 29 deletions python/benchmarks/parquet.py
@@ -29,35 +29,6 @@
     pq = None
 
 
-class ParquetManifestCreation(object):
-    """Benchmark creating a parquet manifest."""
-
-    size = 10 ** 6
-    tmpdir = None
-
-    param_names = ('num_partitions', 'num_threads')
-    params = [(10, 100, 1000), (1, 8)]
-
-    def setup(self, num_partitions, num_threads):
-        if pq is None:
-            raise NotImplementedError("Parquet support not enabled")
-
-        self.tmpdir = tempfile.mkdtemp('benchmark_parquet')
-        rnd = np.random.RandomState(42)
-        num1 = rnd.randint(0, num_partitions, size=self.size)
-        num2 = rnd.randint(0, 1000, size=self.size)
-        output_df = pd.DataFrame({'num1': num1, 'num2': num2})
-        output_table = pa.Table.from_pandas(output_df)
-        pq.write_to_dataset(output_table, self.tmpdir, ['num1'])
-
-    def teardown(self, num_partitions, num_threads):
-        if self.tmpdir is not None:
-            shutil.rmtree(self.tmpdir)
-
-    def time_manifest_creation(self, num_partitions, num_threads):
-        pq.ParquetManifest(self.tmpdir, metadata_nthreads=num_threads)
-
-
 class ParquetWriteBinary(object):
 
     def setup(self):
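
The removed benchmark exercised ``pq.ParquetManifest``, which belongs to the legacy implementation deleted by this PR. A rough sketch of discovering the same partitioned directory with the generic Dataset API instead (assuming a Hive-partitioned layout like the one the old benchmark wrote with ``write_to_dataset``):

```python
import pyarrow.dataset as ds

# Dataset discovery enumerates the files and fragments of the directory,
# which is roughly the work the old manifest-creation benchmark measured.
dataset = ds.dataset("benchmark_parquet", format="parquet",
                     partitioning="hive")  # path stands in for the tmpdir
print(dataset.files)  # file paths found during discovery
```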