Add python tests for Parquet DELTA_BINARY_PACKED encoder #14316

etseidl · 2023-10-23T18:57:44Z

Description

During the review of #14100 there was a suggestion to add a test that writes a file with cudf and then reads the result back with pyarrow. This PR adds the Python bindings needed to perform this test.
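For illustration, the round trip described above might look roughly like the following. This is a minimal sketch, not the test added by this PR: the function name, file path, and column contents are hypothetical, and it assumes the second new `to_parquet()` argument is `header_version` (the thread only shows `use_dictionary`). The imports are deferred into the function so the snippet can be loaded on a machine without a GPU.

```python
def roundtrip_cudf_to_pyarrow(path="delta_roundtrip.parquet"):
    """Write a small frame with cudf, read it back with pyarrow, compare.

    Hypothetical helper for illustration only; requires cudf, pandas and
    pyarrow to be installed when it is actually called.
    """
    # Deferred imports: defining this function needs no GPU libraries.
    import cudf
    import pandas as pd
    import pyarrow.parquet as pq

    gdf = cudf.DataFrame({"a": list(range(1000))})

    # Assumption: V2 page headers with dictionary encoding disabled is what
    # lets the writer pick a delta encoding for integer columns.
    gdf.to_parquet(
        path,
        index=False,
        compression=None,
        header_version="2.0",
        use_dictionary=False,
    )

    # pyarrow must be able to read the file and see the same values.
    expected = gdf.to_pandas()
    actual = pq.read_table(path).to_pandas()
    pd.testing.assert_frame_equal(actual, expected)

    # Return the encodings recorded for the column so a caller can inspect
    # them; the exact contents are not asserted here.
    metadata = pq.ParquetFile(path).metadata
    return metadata.row_group(0).column(0).encodings
```

Calling `roundtrip_cudf_to_pyarrow()` on a machine with cudf installed should raise if pyarrow reads back different data, and otherwise returns the column's recorded encodings for inspection.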

NOTE: there is currently an issue with encoding 32-bit values where the deltas exceed 32 bits. parquet-mr and Arrow truncate the deltas for the INT32 physical type and allow values to overflow, whereas cudf currently uses 64-bit deltas, which avoids the overflow but can require 33 bits when encoding. The current cudf behavior is allowed by the specification (and is in fact readable by parquet-mr), but using the extra bit goes against the Parquet spirit of minimizing output file size. This will be addressed in follow-on work.
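To make the 33-bit case concrete, here is a simplified pure-Python model of the two approaches. It is a sketch under stated assumptions, not the cudf or parquet-mr implementation: it treats the whole input as one block, ignores miniblocks and zigzag/varint headers, and all function names are hypothetical. Each delta has the block's minimum delta subtracted, and the bit width is whatever the largest adjusted delta needs.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def wrap32(x):
    """Wrap an integer into signed 32-bit two's-complement range."""
    return (x + (1 << 31)) % (1 << 32) - (1 << 31)


def encode(values, wrap):
    """Return (first_value, min_delta, adjusted_deltas) for one block.

    wrap=False models exact (64-bit style) deltas.
    wrap=True models deltas truncated to 32 bits, allowing overflow.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    if wrap:
        deltas = [wrap32(d) for d in deltas]
    min_delta = min(deltas, default=0)
    # Adjusted deltas are non-negative, so they can be bit-packed unsigned.
    adjusted = [d - min_delta for d in deltas]
    return values[0], min_delta, adjusted


def bit_width(adjusted):
    """Bits needed to bit-pack the largest adjusted delta."""
    return max(adjusted, default=0).bit_length()


def decode(first, min_delta, adjusted):
    """Rebuild INT32 values; additions wrap as 32-bit arithmetic would."""
    out = [first]
    for adj in adjusted:
        out.append(wrap32(out[-1] + min_delta + adj))
    return out


# Worst case for INT32: swing between the two extremes.
values = [INT32_MIN, INT32_MAX, INT32_MIN]

wide = encode(values, wrap=False)
narrow = encode(values, wrap=True)

print("exact deltas need", bit_width(wide[2]), "bits")
print("wrapped deltas need", bit_width(narrow[2]), "bits")
```

In this model both encodings decode back to the original values, because the decoder's 32-bit additions overflow in the same way the truncated deltas did; the only difference is the packed width.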

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2023-10-23T18:57:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vuule · 2023-10-26T17:09:05Z

/ok to test

vuule

a rounding concern

cpp/src/io/parquet/page_enc.cu

Co-authored-by: Vukasin Milovanovic <[email protected]>

vuule · 2023-10-26T18:47:01Z

/ok to test


GregoryKimball · 2023-11-03T20:54:04Z

@galipremsagar Would you please take a look at this testing improvement?

python/cudf/cudf/tests/test_parquet.py

mythrocks

Not a cuIO dev, but this looks good to me.

A minor nitpick regarding the naming of a function, but that's unrelated to this change.

mythrocks · 2023-11-06T23:38:38Z

python/cudf/cudf/_lib/cpp/io/parquet.pxd

@@ -191,6 +199,8 @@ cdef extern from "cudf/io/parquet.hpp" namespace "cudf::io" nogil:
        void set_row_group_size_rows(size_type val) except +
        void set_max_page_size_bytes(size_t val) except +
        void set_max_page_size_rows(size_type val) except +
+        void enable_write_v2_headers(bool val) except +


I realize it isn't the fault of this current PR, but one does wish enable_write_v2_headers were named set_write_v2_headers.

We use enable_ for bool options, so this should be consistent (for better or for worse, apparently).

galipremsagar · 2023-11-07T17:17:47Z

python/cudf/cudf/core/dataframe.py

@@ -6370,6 +6370,8 @@ def to_parquet(
        max_page_size_rows=None,
        storage_options=None,
        return_metadata=False,
+        use_dictionary=True,


Can we document these two parameters here:

cudf/python/cudf/cudf/utils/ioutils.py

Line 222 in 16051a7

_docstring_to_parquet = """

Thanks @galipremsagar. I would have never thought to look there for the docstring 😅

First time hearing about it as well 🤷‍♂️ (just don't git blame ioutils.py :P )

python/cudf/cudf/tests/test_parquet.py

Co-authored-by: GALI PREM SAGAR <[email protected]>

vuule · 2023-11-07T18:18:24Z

/ok to test

vuule · 2023-11-08T20:23:58Z

/ok to test

vuule · 2023-11-08T22:34:49Z

/merge

vuule and others added 7 commits October 23, 2023 11:33

v2

98b0f79

dictionary policy

a8427f7

get delta writing to work

09983f0

almost works

41f827f

forgot to pass another param

772a275

fix undercount on page sizes for delta binary

0379b4d

fix up delta test

b6a97e8

etseidl requested review from a team as code owners October 23, 2023 18:57

etseidl requested review from galipremsagar, isVoid, mythrocks and vuule October 23, 2023 18:57

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Oct 23, 2023

etseidl and others added 2 commits October 23, 2023 12:08

fix comment

5d5268e

Merge branch 'branch-23.12' into delta_encode_python

7a2ee66

vuule added feature request New feature or request non-breaking Non-breaking change labels Oct 23, 2023

etseidl and others added 3 commits October 24, 2023 12:05

Merge branch 'rapidsai:branch-23.12' into delta_encode_python

13330e7

Merge branch 'branch-23.12' into delta_encode_python

6592442

Merge branch 'branch-23.12' into delta_encode_python

f1d3c88

vuule requested changes Oct 26, 2023


cpp/src/io/parquet/page_enc.cu (outdated)

cpp/src/io/parquet/page_enc.cu (outdated)

etseidl and others added 2 commits October 26, 2023 11:11

address review comments

eeea1b9

Merge branch 'branch-23.12' into delta_encode_python

e7b3694

vuule reviewed Oct 26, 2023


cpp/src/io/parquet/page_enc.cu (outdated)

fix typo

11e6c5e

Co-authored-by: Vukasin Milovanovic <[email protected]>

vuule approved these changes Oct 26, 2023


etseidl and others added 3 commits October 27, 2023 09:47

Merge branch 'branch-23.12' into delta_encode_python

eeac17a

Merge branch 'branch-23.12' into delta_encode_python

a49f606

Merge remote-tracking branch 'origin/branch-23.12' into delta_encode_…

0776811

…python

galipremsagar reviewed Nov 3, 2023


python/cudf/cudf/tests/test_parquet.py

mythrocks approved these changes Nov 6, 2023


Merge branch 'branch-23.12' into delta_encode_python

3965125

galipremsagar requested changes Nov 7, 2023


etseidl and others added 2 commits November 7, 2023 09:23

implement suggestion from review

5f85be0

Co-authored-by: GALI PREM SAGAR <[email protected]>

add documentation for new arguments to to_parquet()

7a770d6

etseidl requested a review from galipremsagar November 7, 2023 17:48

Merge branch 'rapidsai:branch-23.12' into delta_encode_python

1e0cc58

galipremsagar approved these changes Nov 7, 2023


vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Nov 7, 2023

Merge branch 'branch-23.12' into delta_encode_python

6dabbcb

rapids-bot bot merged commit c4e6c09 into rapidsai:branch-23.12 Nov 8, 2023
61 checks passed

etseidl deleted the delta_encode_python branch November 8, 2023 22:35

vuule mentioned this pull request Nov 17, 2023

[FEA] Support V2 encodings in Parquet reader and writer #13501

Closed
