[FEA] Parquet reader filter improvements #17142

wence- · 2024-10-22T16:09:45Z

Is your feature request related to a problem? Please describe.

In cudf-polars, predicate pushdown can result in arbitrary expressions being part of the parquet read phase. Not all of these expressions make sense for discarding rows at the row group level based on statistics, however, they can still be applied in a post-filtering stage.

If I naively translate the generic expression I get from polars to a libcudf expression and use it in the parquet reader, libcudf might throw at runtime with an unsupported operation. I must therefore encode in my transliteration, exactly which ast expressions the parquet reader does support in its statistics filters and only deliver the filter to the parquet reader if it is one that is understood.

For example, column_name_reference("a") < literal(...) is a supported expression, but literal(...) > column_name_reference("a") is not (this one I translate to something that is supported). But if the parquet reader were extended to handle both types, I'd now be doing unnecessary work.

This is suboptimal in two ways:

Exactly which filter expressions are supported is now encoded in two places, and I might have got it wrong
If parts of the whole filter expression are supported by the parquet reader, they are still applied in a post-filter stage, rather than being applied at the row group level.

Describe the solution you'd like

I'd like the parquet reader to accept arbitrary expressions as filters and do the right thing. Much of the facility already exists since for correctness, the filter must always be applied as a post-filter even after row groups have been discarded.
Bonus points if expressions where only part of the expression is supported by the statistics filters still uses statistics to discard row groups.

Describe alternatives you've considered

For point one, I can do the thing I'm doing right now and just bail if I hit a feature I've determined as unsupported.

For point two, I can convert to some kind of normal form and pick apart the pieces that are supported and deliver those to the parquet reader. However, I'd love not to have to write another propositional formula -> CNF converter :), and this still suffers from point 1: the final decision to discard things encodes information in two places.

Additional context

What I'm doing now: #17141

Additional feature req: support filtering row groups based on nulls, i.e. support is_null(column_name_reference(...)) in the statistics reader.

The text was updated successfully, but these errors were encountered:

GregoryKimball · 2024-10-22T16:27:45Z

@wence- Thank you for mapping out this request.

What would be a good unit test of an AST that needs to be modified to run as a row group filter? I checked your PR, does test_evaluation contain such an example? I would like your help to gather a few examples of input AST to output AST that this feature would need.
Picking up on the additional feature request, I believe we added support for is_null in Add IS_NULL operator to AST #13145. Is something not working correctly?

wence- · 2024-10-22T17:11:24Z

Picking up on the additional feature request, I believe we added support for is_null in #13145. Is something not working correctly?

That's supported in ASTs, but not in the statistics filter (e.g. discarding a row group where the filter is column.is_not_null() because the row group's null count is equal to the number of rows in the group).

wence- · 2024-10-22T17:44:21Z

What would be a good unit test of an AST that needs to be modified to run as a row group filter? I checked your PR, does test_evaluation contain such an example? I would like your help to gather a few examples of input AST to output AST that this feature would need.

Here are some examples, in pseudo-code, I can write things out in C++ once we converge on something that makes sense:

expr = (10 < a)

This would be supported by the statistics reader if the user had instead written:

expr = (a > 10)

Something more complicated:

expr = (a < 10) or ((a + b < c) and (b < 1))

Here two parts of the expression can discard row groups a < 10 and b < 1, but the middle part of the expression a + b < c is probably only worthwhile applying as a post-filter (we could support some row group filtering by checking if a_max + b_max < c_min), and is currently unsupported by the statistics row group filtering.

That is, I can discard row-groups with a < 10 or b < 1 and then apply the full expression as a post-filter.

expr_stat = (a < 10) or (b < 1)
expr = (a < 10) or ((a + b < c) and (b < 1))
df = read_parquet(..., expr_stat).filter(expr)

For arbitrary expressions, determining which bits can be applied as statistics filters is programmatically achievable by converting the input expression to one of DNF or CNF and then taking those terms which the reader supports.

wence- · 2024-10-24T10:50:58Z

but the middle part of the expression a + b < c is probably only worthwhile applying as a post-filter (we could support some row group filtering by checking if a_max + b_max < c_min)

For inequality comparisons, one could imagine writing an ast visit that can handle a larger number of such expressions. To take < as an example, lexpr < rexpr as a statistics filter needs to turn into lexpr.max() < rexpr.min(). One can write distribution rules for max and min over binary operators (at least some of them), but the bounds get pessimistic quite quickly. For example, for the addition, multiplication, ..., we have:

(a + b).max() <= a.max() + b.max()
(a - b).max() <= a.max() - b.min()
(a * b).max() <= a.max() * b.max()
(a / b).max() <= a.max() / b.min()

One can apply such rules recursively to push max/min references down to column/literal references. That then gives us expressions that can used in tandem with row group statistics.

I think that's much less likely to be important than just picking out all the pieces of an expression that compare a column with a literal though.

wence- added cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Oct 22, 2024

GregoryKimball added this to libcudf Oct 22, 2024

GregoryKimball added this to the Expression evaluation milestone Oct 22, 2024

GregoryKimball moved this to Needs owner in libcudf Oct 23, 2024

wence- mentioned this issue Oct 25, 2024

Add conversion from cudf-polars expressions to libcudf ast for parquet filters #17141

Merged

3 tasks

GregoryKimball added the cudf.polars Issues specific to cudf.polars label Nov 14, 2024

github-project-automation bot added this to cuDF Python Nov 14, 2024

github-project-automation bot moved this to Todo in cuDF Python Nov 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Parquet reader filter improvements #17142

[FEA] Parquet reader filter improvements #17142

wence- commented Oct 22, 2024 •

edited

Loading

GregoryKimball commented Oct 22, 2024

wence- commented Oct 22, 2024

wence- commented Oct 22, 2024

wence- commented Oct 24, 2024 •

edited

Loading

[FEA] Parquet reader filter improvements #17142

[FEA] Parquet reader filter improvements #17142

Comments

wence- commented Oct 22, 2024 • edited Loading

GregoryKimball commented Oct 22, 2024

wence- commented Oct 22, 2024

wence- commented Oct 22, 2024

wence- commented Oct 24, 2024 • edited Loading

wence- commented Oct 22, 2024 •

edited

Loading

wence- commented Oct 24, 2024 •

edited

Loading