Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conversion from cudf-polars expressions to libcudf ast for parquet filters #17141

Merged
merged 13 commits into from
Oct 30, 2024

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Oct 22, 2024

Description

Previously, we always applied parquet filters by post-filtering. This negates much of the potential gain from having filters available at read time, namely discarding row groups. To fix this, implement, with the new visitor system of #17016, conversion to pylibcudf expressions.

We must distinguish two types of expressions, ones that we can evaluate via cudf::compute_column, and the more restricted set of expressions that the parquet reader understands, this is handled by having a state that tracks the usage. The former style will be useful when we implement inequality joins.

While here, extend the support in pylibcudf expressions to handle all supported literal types and expose compute_column so we can test the correctness of the broader (non-parquet) implementation.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner October 22, 2024 15:45
@wence- wence- requested review from vyasr and isVoid October 22, 2024 15:45
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars pylibcudf Issues specific to the pylibcudf package labels Oct 22, 2024
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars pylibcudf Issues specific to the pylibcudf package labels Oct 22, 2024
@wence- wence- force-pushed the wence/fea/polars-expr-to-ast branch from 6f3d385 to a98cd7b Compare October 23, 2024 11:21
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars pylibcudf Issues specific to the pylibcudf package labels Oct 23, 2024
We will use this for inequality joins and filter pushdown in the
parquet reader.

The handling is a bit complicated, since the subset of expressions
that the parquet filter accepts is smaller than all possible
expressions. Since much of the logic is similar, however, we just
dispatch on a transformer state variable to determine which case we're
handling.
We attempt to turn the predicate into a filter expression that the
parquet reader understands. If successful then we don't have to apply
the predicate as a post-filter.

We can only do this when a row index is not requested.
@wence- wence- force-pushed the wence/fea/polars-expr-to-ast branch from a98cd7b to 16efcaf Compare October 24, 2024 18:52
Copy link
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not as familiar with expressions in libcudf or cudf::compute_column, but here are a couple of smaller things I noticed.

python/cudf_polars/cudf_polars/dsl/to_ast.py Outdated Show resolved Hide resolved
python/pylibcudf/pylibcudf/transform.pyx Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/dsl/ir.py Show resolved Hide resolved
python/pylibcudf/pylibcudf/transform.pyx Show resolved Hide resolved
python/cudf_polars/cudf_polars/dsl/to_ast.py Show resolved Hide resolved
Comment on lines +181 to +182
if isinstance(haystack, expr.LiteralColumn) and len(haystack.value) < 16:
# 16 is an arbitrary limit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, what is the purpose of this limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to make one scalar for every value and upload it to the device. So I just picked a value as a cutoff

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea here that you think once we need to create more than a certain number of scalars the cost of allocation will be high enough that we will underperform the CPU? The end result here is that we raise and fall back when there are more than 16 scalars, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means that (for example) we will do the parquet filter as a post-filter (still on the GPU) rather than during the read.

Comment on lines +181 to +182
if isinstance(haystack, expr.LiteralColumn) and len(haystack.value) < 16:
# 16 is an arbitrary limit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea here that you think once we need to create more than a certain number of scalars the cost of allocation will be high enough that we will underperform the CPU? The end result here is that we raise and fall back when there are more than 16 scalars, right?

@wence-
Copy link
Contributor Author

wence- commented Oct 30, 2024

/merge

@rapids-bot rapids-bot bot merged commit 7157de7 into rapidsai:branch-24.12 Oct 30, 2024
102 checks passed
@wence- wence- deleted the wence/fea/polars-expr-to-ast branch October 30, 2024 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants