Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Parquet reader filter improvements #17142

Open
wence- opened this issue Oct 22, 2024 · 4 comments
Open

[FEA] Parquet reader filter improvements #17142

wence- opened this issue Oct 22, 2024 · 4 comments
Labels
cudf.polars Issues specific to cudf.polars cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@wence-
Copy link
Contributor

wence- commented Oct 22, 2024

Is your feature request related to a problem? Please describe.

In cudf-polars, predicate pushdown can result in arbitrary expressions being part of the parquet read phase. Not all of these expressions make sense for discarding rows at the row group level based on statistics, however, they can still be applied in a post-filtering stage.

If I naively translate the generic expression I get from polars to a libcudf expression and use it in the parquet reader, libcudf might throw at runtime with an unsupported operation. I must therefore encode in my transliteration, exactly which ast expressions the parquet reader does support in its statistics filters and only deliver the filter to the parquet reader if it is one that is understood.

For example, column_name_reference("a") < literal(...) is a supported expression, but literal(...) > column_name_reference("a") is not (this one I translate to something that is supported). But if the parquet reader were extended to handle both types, I'd now be doing unnecessary work.

This is suboptimal in two ways:

  1. Exactly which filter expressions are supported is now encoded in two places, and I might have got it wrong
  2. If parts of the whole filter expression are supported by the parquet reader, they are still applied in a post-filter stage, rather than being applied at the row group level.

Describe the solution you'd like

  • I'd like the parquet reader to accept arbitrary expressions as filters and do the right thing. Much of the facility already exists since for correctness, the filter must always be applied as a post-filter even after row groups have been discarded.
  • Bonus points if expressions where only part of the expression is supported by the statistics filters still uses statistics to discard row groups.

Describe alternatives you've considered

For point one, I can do the thing I'm doing right now and just bail if I hit a feature I've determined as unsupported.

For point two, I can convert to some kind of normal form and pick apart the pieces that are supported and deliver those to the parquet reader. However, I'd love not to have to write another propositional formula -> CNF converter :), and this still suffers from point 1: the final decision to discard things encodes information in two places.

Additional context

What I'm doing now: #17141

Additional feature req: support filtering row groups based on nulls, i.e. support is_null(column_name_reference(...)) in the statistics reader.

@wence- wence- added cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Oct 22, 2024
@GregoryKimball GregoryKimball added this to the Expression evaluation milestone Oct 22, 2024
@GregoryKimball
Copy link
Contributor

@wence- Thank you for mapping out this request.

  • What would be a good unit test of an AST that needs to be modified to run as a row group filter? I checked your PR, does test_evaluation contain such an example? I would like your help to gather a few examples of input AST to output AST that this feature would need.
  • Picking up on the additional feature request, I believe we added support for is_null in Add IS_NULL operator to AST #13145. Is something not working correctly?

@wence-
Copy link
Contributor Author

wence- commented Oct 22, 2024

Picking up on the additional feature request, I believe we added support for is_null in #13145. Is something not working correctly?

That's supported in ASTs, but not in the statistics filter (e.g. discarding a row group where the filter is column.is_not_null() because the row group's null count is equal to the number of rows in the group).

@wence-
Copy link
Contributor Author

wence- commented Oct 22, 2024

What would be a good unit test of an AST that needs to be modified to run as a row group filter? I checked your PR, does test_evaluation contain such an example? I would like your help to gather a few examples of input AST to output AST that this feature would need.

Here are some examples, in pseudo-code, I can write things out in C++ once we converge on something that makes sense:

expr = (10 < a)

This would be supported by the statistics reader if the user had instead written:

expr = (a > 10)

Something more complicated:

expr = (a < 10) or ((a + b < c) and (b < 1))

Here two parts of the expression can discard row groups a < 10 and b < 1, but the middle part of the expression a + b < c is probably only worthwhile applying as a post-filter (we could support some row group filtering by checking if a_max + b_max < c_min), and is currently unsupported by the statistics row group filtering.

That is, I can discard row-groups with a < 10 or b < 1 and then apply the full expression as a post-filter.

expr_stat = (a < 10) or (b < 1)
expr = (a < 10) or ((a + b < c) and (b < 1))
df = read_parquet(..., expr_stat).filter(expr)

For arbitrary expressions, determining which bits can be applied as statistics filters is programmatically achievable by converting the input expression to one of DNF or CNF and then taking those terms which the reader supports.

@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Oct 23, 2024
@wence-
Copy link
Contributor Author

wence- commented Oct 24, 2024

but the middle part of the expression a + b < c is probably only worthwhile applying as a post-filter (we could support some row group filtering by checking if a_max + b_max < c_min)

For inequality comparisons, one could imagine writing an ast visit that can handle a larger number of such expressions. To take < as an example, lexpr < rexpr as a statistics filter needs to turn into lexpr.max() < rexpr.min(). One can write distribution rules for max and min over binary operators (at least some of them), but the bounds get pessimistic quite quickly. For example, for the addition, multiplication, ..., we have:

(a + b).max() <= a.max() + b.max()
(a - b).max() <= a.max() - b.min()
(a * b).max() <= a.max() * b.max()
(a / b).max() <= a.max() / b.min()

One can apply such rules recursively to push max/min references down to column/literal references. That then gives us expressions that can used in tandem with row group statistics.

I think that's much less likely to be important than just picking out all the pieces of an expression that compare a column with a literal though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: Todo
Status: Needs owner
Development

No branches or pull requests

2 participants