[FEA] Parquet reader filter improvements #17142
Comments
@wence- Thank you for mapping out this request.
That's supported in ASTs, but not in the statistics filter (e.g. discarding a row group where the filter is
Here are some examples in pseudo-code; I can write things out in C++ once we converge on something that makes sense:
This would be supported by the statistics reader if the user had instead written:
Something more complicated:
Here two parts of the expression can discard row groups. That is, I can discard row-groups with
For arbitrary expressions, determining which bits can be applied as statistics filters is programmatically achievable by converting the input expression to one of DNF or CNF and then taking those terms which the reader supports.
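A rough sketch of the "take the terms the reader supports" idea, restricted to the simplest case of a top-level conjunction. This is not libcudf or cudf-polars code: the `Col`, `Lit`, and `BinOp` classes and the support rule in `is_stats_friendly` are hypothetical stand-ins, and the rule (column on the left, literal on the right) is an assumption taken from the example in the issue body. A full CNF conversion would also have to deal with clauses that are disjunctions, which this sketch simply leaves for the post-filter.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class BinOp:
    op: str  # e.g. "<", "==", "and", "or", "+"
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, BinOp]

COMPARISONS = {"<", "<=", ">", ">=", "==", "!="}


def conjuncts(expr):
    """Flatten nested top-level ANDs into a flat list of terms."""
    if isinstance(expr, BinOp) and expr.op == "and":
        return conjuncts(expr.left) + conjuncts(expr.right)
    return [expr]


def is_stats_friendly(expr):
    """Assumed support rule: `column <cmp> literal` only."""
    return (
        isinstance(expr, BinOp)
        and expr.op in COMPARISONS
        and isinstance(expr.left, Col)
        and isinstance(expr.right, Lit)
    )


def split_for_stats(expr):
    """Return (terms usable for row-group pruning, remaining terms)."""
    terms = conjuncts(expr)
    usable = [t for t in terms if is_stats_friendly(t)]
    rest = [t for t in terms if not is_stats_friendly(t)]
    return usable, rest
```

Because the terms are joined by AND, a row group that fails any single usable term cannot contain a matching row, so each usable term can prune independently. The full original expression would still be applied as a post-filter.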
For inequality comparisons, one could imagine writing an ast visit that can handle a larger number of such expressions. To take
One can apply such rules recursively to push
I think that's much less likely to be important than just picking out all the pieces of an expression that compare a column with a literal though.
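To make the "ast visit" idea a bit more concrete, here is a minimal sketch of one such rule: rewriting `literal OP column` as `column OP' literal` by flipping the comparison. The tuple encoding (`("col", name)`, `("lit", value)`, `(op, left, right)`) is invented for illustration and is not the libcudf AST; the assumption that the reader only accepts the column-on-the-left form comes from the example in the issue body.

```python
# Mirror image of each comparison when its operands are swapped.
FLIP = {"<": ">", "<=": ">=", ">": "<", ">=": "<=", "==": "==", "!=": "!="}


def is_col(expr):
    return len(expr) == 2 and expr[0] == "col"


def is_lit(expr):
    return len(expr) == 2 and expr[0] == "lit"


def normalize(expr):
    """Recursively rewrite `literal OP column` as `column FLIP[OP] literal`."""
    if len(expr) != 3:
        return expr  # leaf node: column reference or literal
    op, left, right = expr
    left, right = normalize(left), normalize(right)
    if op in FLIP and is_lit(left) and is_col(right):
        return (FLIP[op], right, left)
    return (op, left, right)
```

For example, `normalize((">", ("lit", 5), ("col", "a")))` yields `("<", ("col", "a"), ("lit", 5))`, and the rewrite also reaches comparisons nested under `and`/`or`.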
Is your feature request related to a problem? Please describe.
In cudf-polars, predicate pushdown can result in arbitrary expressions being part of the parquet read phase. Not all of these expressions make sense for discarding rows at the row group level based on statistics; however, they can still be applied in a post-filtering stage.
If I naively translate the generic expression I get from polars to a libcudf expression and use it in the parquet reader, libcudf might throw at runtime with an unsupported operation. I must therefore encode in my translation exactly which AST expressions the parquet reader supports in its statistics filters, and only deliver the filter to the parquet reader if it is one that is understood.
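As a toy illustration of the two stages (statistics-based row group pruning, then row-level post-filtering), here is a minimal pure-Python sketch for the predicate `a < bound`. Nothing here is the libcudf reader: row groups are plain lists of values for a single column `a`, and `min(values)` stands in for the min statistic that a real file would store in its metadata. Row groups are assumed to be non-empty.

```python
def read_with_filter(row_groups, bound):
    """Return (rows with a < bound, number of row groups actually read)."""
    groups_read = 0
    out = []
    for values in row_groups:
        lo = min(values)  # stand-in for the stored min statistic
        if lo >= bound:
            # Every value in this group is >= bound, so skip it entirely.
            continue
        groups_read += 1
        # Post-filter: the group may still contain rows that fail the predicate.
        out.extend(v for v in values if v < bound)
    return out, groups_read
```

The pruning step is conservative (it only drops groups that cannot match), so the post-filter is what makes the result exact.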
For example,
column_name_reference("a") < literal(...)
is a supported expression, but
literal(...) > column_name_reference("a")
is not (this one I translate to something that is supported). But if the parquet reader were extended to handle both types, I'd now be doing unnecessary work.
This is suboptimal in two ways:
Describe the solution you'd like
Describe alternatives you've considered
For point one, I can do the thing I'm doing right now and just bail if I hit a feature I've determined as unsupported.
For point two, I can convert to some kind of normal form and pick apart the pieces that are supported and deliver those to the parquet reader. However, I'd love not to have to write another propositional formula -> CNF converter :), and this still suffers from point 1: the final decision to discard things encodes information in two places.
Additional context
What I'm doing now: #17141
Additional feature req: support filtering row groups based on nulls, i.e. support
is_null(column_name_reference(...))
in the statistics reader.
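A minimal sketch of how a null-based row group filter could work, assuming per-row-group column statistics carry an optional null count (Parquet statistics have such a field, but it may be absent). The `ColumnStats` class and function names are hypothetical and not part of libcudf.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    num_values: int
    null_count: Optional[int]  # None when the writer did not record it


def may_contain_null(stats):
    """Conservative check: a missing null count means 'cannot rule it out'."""
    return stats.null_count is None or stats.null_count > 0


def row_groups_for_is_null(stats_per_group):
    """Indices of row groups that must be read for an `is_null(col)` filter."""
    return [i for i, s in enumerate(stats_per_group) if may_contain_null(s)]
```

A row group with a recorded null count of zero can be discarded for `is_null(...)`; anything else has to be read and post-filtered.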