Support reading bloom filters from Parquet files and filter row groups using them #17289

Open
wants to merge 46 commits into
base: branch-25.02


Member

@mhaseeb123 mhaseeb123 commented Nov 9, 2024

Description

This PR adds support for reading bloom filters from Parquet files and using them to filter row groups based on equality predicates of the form col == literal, if provided.
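
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of bloom-filter-based row group pruning for a `col == literal` predicate. It is not the libcudf implementation and does not use the Parquet split-block bloom filter layout or its hash function; `ToyBloomFilter`, `prune_row_groups`, and the SHA-256-derived hashing are hypothetical stand-ins chosen only to keep the example self-contained.

```python
import hashlib


class ToyBloomFilter:
    """Toy bloom filter (NOT the Parquet split-block format)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive `num_hashes` bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False => definitely absent; True => possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def prune_row_groups(filters, literal):
    """Return indices of row groups that may contain `literal`."""
    return [i for i, f in enumerate(filters) if f.might_contain(literal)]


# Three hypothetical row groups of a string column, one filter per row group.
row_groups = [["apple", "banana"], ["cherry", "date"], ["elder", "fig"]]
filters = []
for values in row_groups:
    f = ToyBloomFilter()
    for v in values:
        f.add(v)
    filters.append(f)

# Row groups that survive the predicate `col == "cherry"`.
survivors = prune_row_groups(filters, "cherry")
```

A `False` from `might_contain` proves the value is absent, so that row group can be skipped. A `True` may be a false positive, so the row group still has to be read and filtered on the actual data.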

Related to #17164

Could use some ideas:

  • How can we improve testing of bloom filtering, given that there is currently no way to tell how many row groups were filtered by bloom filters and/or StatsAST?

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 9, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress cuIO cuIO issue cuco cuCollections related issue feature request New feature or request non-breaking Non-breaking change labels Nov 9, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 28, 2024
@mhaseeb123 mhaseeb123 changed the title 🚧 Support for reading bloom filters from Parquet files Support reading and filtering row groups in Parquet reader using bloom filters Nov 28, 2024
@@ -4341,3 +4342,56 @@ def test_parquet_reader_mismatched_nullability_structs(tmpdir):
cudf.read_parquet([buf2, buf1]),
cudf.concat([df2, df1]).reset_index(drop=True),
)


@pytest.mark.parametrize(
Member Author

Filtering with StatsAST only (no bloom filters here) and with bloom filters only (no stats here) yields the same data frames. However, there is currently no way other than manual prints/debugging to tell whether the filtering is yielding correct results. I verified the filtered row groups manually for these cases, but we should implement some way to measure this (perhaps in another PR).

Mixing StatsAST and bloom filters is unpredictable, especially for numeric columns, as stats already filter numeric columns really well.
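
To illustrate why the two mechanisms overlap on numeric columns, here is a small sketch that counts how many row groups each stage prunes. It assumes row groups hold sorted, non-overlapping integer ranges, which is an assumption for illustration and not something the tests guarantee. `stats_may_match`, `membership_may_match`, and `count_pruned` are hypothetical helpers, and a Python set stands in for a bloom filter (so it has no false positives); none of this is cuDF API.

```python
def stats_may_match(values, literal):
    """Min/max statistics check: can `literal` fall inside this row group?"""
    return min(values) <= literal <= max(values)


def membership_may_match(values, literal):
    """Stand-in for a bloom filter probe (a set has no false positives)."""
    return literal in set(values)


def count_pruned(row_groups, literal):
    """Count row groups pruned by stats, then by membership among survivors."""
    after_stats = [rg for rg in row_groups if stats_may_match(rg, literal)]
    after_both = [rg for rg in after_stats if membership_may_match(rg, literal)]
    return {
        "pruned_by_stats": len(row_groups) - len(after_stats),
        "pruned_by_membership": len(after_stats) - len(after_both),
        "remaining": len(after_both),
    }


# Hypothetical numeric column: even numbers split into three row groups.
row_groups = [[0, 2, 4, 6], [8, 10, 12, 14], [16, 18, 20, 22]]

present = count_pruned(row_groups, 10)  # value exists in the second row group
absent = count_pruned(row_groups, 11)   # inside a min/max range, but not stored
```

With this layout, min/max stats alone discard two of the three row groups for either literal, and the membership check only contributes when the literal falls inside a range without being stored. Per-stage counters like these are the kind of measurement the comment above asks for; something similar exposed by the reader might let tests assert on pruned row groups instead of only comparing data frames.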

@mhaseeb123 mhaseeb123 marked this pull request as ready for review November 28, 2024 02:38
@mhaseeb123 mhaseeb123 requested review from a team as code owners November 28, 2024 02:38
@mhaseeb123 mhaseeb123 changed the title Support reading and filtering row groups in Parquet reader using bloom filters Support reading bloom filters and filtering row groups in Parquet reader Nov 28, 2024
@mhaseeb123 mhaseeb123 changed the title Support reading bloom filters and filtering row groups in Parquet reader Support reading bloom filters from Parquet files and filtering row groups with them. Nov 28, 2024
@mhaseeb123 mhaseeb123 changed the title Support reading bloom filters from Parquet files and filtering row groups with them. Support reading bloom filters from Parquet files and use them for row groups filtering. Nov 28, 2024
@mhaseeb123 mhaseeb123 changed the title Support reading bloom filters from Parquet files and use them for row groups filtering. Support reading bloom filters from Parquet files and filter row groups using them. Nov 28, 2024
@mhaseeb123 mhaseeb123 changed the title Support reading bloom filters from Parquet files and filter row groups using them. Support reading bloom filters from Parquet files and filter row groups using them Nov 28, 2024
@mhaseeb123
Member Author

CC: @etseidl

Labels
2 - In Progress Currently a work in progress CMake CMake build issue cuco cuCollections related issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress

2 participants