Support reading bloom filters from Parquet files and filter row groups using them #17289

mhaseeb123 · 2024-11-09T00:14:38Z

Description

This PR adds support for reading bloom filters from Parquet files and using them to filter row groups based on `col == literal`-style predicates, when such predicates are provided.
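To illustrate the idea, here is a minimal pure-Python sketch of pruning row groups with per-row-group bloom filters for a `col == literal` predicate. It is not the code in this PR: `ToyBloomFilter`, `build_filters`, and `surviving_row_groups` are hypothetical names, and the SHA-256-based hashing is a stand-in for the split-block bloom filter format that Parquet actually specifies.

```python
import hashlib


class ToyBloomFilter:
    """Toy bloom filter for illustration; NOT the Parquet split-block format."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # arbitrary-precision int used as a bitset

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filters(row_groups):
    """Build one filter per row group from that row group's column values."""
    filters = []
    for values in row_groups:
        bf = ToyBloomFilter()
        for v in values:
            bf.add(v)
        filters.append(bf)
    return filters


def surviving_row_groups(filters, literal):
    """Indices of row groups that cannot be ruled out for `col == literal`."""
    return [i for i, bf in enumerate(filters) if bf.might_contain(literal)]


# Hypothetical column data, split into three row groups.
row_groups = [["apple", "banana"], ["cherry", "date"], ["apple", "elderberry"]]
filters = build_filters(row_groups)
kept = surviving_row_groups(filters, "apple")
```

A bloom filter never produces false negatives, so `kept` always includes row groups 0 and 2 here. A row group that lacks the literal is usually dropped, but may survive as a false positive, which is why the reader still has to evaluate the predicate on the rows it reads.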

Related to #17164

Could use some ideas:

How can we improve testing of bloom filtering, given that there is currently no method to tell how many row groups were filtered by bloom filters and/or StatsAST?

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

mhaseeb123 · 2024-11-28T02:38:19Z

python/cudf/cudf/tests/test_parquet.py

@@ -4341,3 +4342,56 @@ def test_parquet_reader_mismatched_nullability_structs(tmpdir):
        cudf.read_parquet([buf2, buf1]),
        cudf.concat([df2, df1]).reset_index(drop=True),
    )
+
+
+@pytest.mark.parametrize(


Filtering using StatsAST (no bloom filters here) and BloomFilters (no stats here) yields the same data frames. However, there is currently no way other than manual prints/debugging to tell whether the filtering pruned the expected row groups. I verified the filtered row groups manually for these cases, but we should implement some way to measure this (perhaps in another PR).

Mixing StatsAST and bloom filters gives unpredictable pruning, especially for numeric columns, because stats already filter numeric columns very well.
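As a rough illustration of the point above, the following pure-Python sketch applies min/max stats first and a membership check second, and records which stage pruned each row group. Everything here is hypothetical: `prune` is not a cuDF API, and exact set membership stands in for an idealized bloom filter with no false positives.

```python
def prune(row_groups, literal):
    """Attribute each row group to the stage that pruned it for `col == literal`.

    Assumptions: min/max stats are checked first, then an idealized bloom
    filter (exact set membership, so no false positives).
    """
    report = {"stats": [], "bloom": [], "kept": []}
    for i, values in enumerate(row_groups):
        if not (min(values) <= literal <= max(values)):
            report["stats"].append(i)  # literal lies outside [min, max]
        elif literal not in set(values):
            report["bloom"].append(i)  # in range, but not actually present
        else:
            report["kept"].append(i)  # must be read and filtered row by row
    return report


# Sorted numeric data: row-group ranges are disjoint, so stats do the pruning.
sorted_groups = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
hit = prune(sorted_groups, 20)
miss = prune(sorted_groups, 15)

# Overlapping ranges: stats cannot rule anything out, the bloom stage does.
overlapping_groups = [[1, 100], [2, 99], [50, 51]]
overlap = prune(overlapping_groups, 50)
```

With disjoint ranges the stats stage removes almost everything before the bloom stage runs, so a test that enables both cannot easily tell which one did the work. A per-stage count like `report` is one possible shape for the measurement suggested above.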

mhaseeb123 · 2024-11-28T02:45:14Z

CC: @etseidl

Initial stuff for reading bloom filter from PQ files

95fe8e8

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 9, 2024

github-actions bot assigned mhaseeb123 Nov 9, 2024

mhaseeb123 added 2 - In Progress Currently a work in progress cuIO cuIO issue cuco cuCollections related issue feature request New feature or request non-breaking Non-breaking change labels Nov 9, 2024

mhaseeb123 added 2 commits November 9, 2024 00:22

Minor bug fix

4f0e7ab

Apply style fix

48a50c4

mhaseeb123 mentioned this pull request Nov 9, 2024

[FEA] Use bloom filters in Parquet reader to filter row groups with equality predicates #17164

Open

mhaseeb123 and others added 19 commits November 14, 2024 14:54

Merge branch 'branch-24.12' into fea/extract-pq-bloom-filter-data

9a85d08

Merge branch 'branch-24.12' into fea/extract-pq-bloom-filter-data

b71cf9b

Some updates

68be24f

Move contents to a separate file

f848251

Revert erroneous changes

0b65233

Style and doc fix

cf7d762

Get equality predicate col indices

81efad2

Enable arrow_filter_policy and span types in bloom filter.

088377b

Merge branch 'branch-24.12' into fea/extract-pq-bloom-filter-data

0435bff

Successfully search bloom filter

3dff590

style fix

71e1d33

Code cleanup

aa65a2b

add tests

c52821b

Initial stuff for reading bloom filter from PQ files

3a20a98

Minor bug fix

d67e4b5

Apply style fix

10471d4

Some updates

1e12662

Move contents to a separate file

ee7217c

Revert erroneous changes

f8e6159

mhaseeb123 added 9 commits November 26, 2024 08:16

Minor improvements

dddee6c

Add gtest

0cfeb80

Improvements

9137585

Support int96 in bloom filter

77152b4

Cleanup

3984291

Minor improvements

9a39aa4

Fix minor bug

1def801

MInor bug fixing

6edc248

Add python tests

2925f1e

github-actions bot added the Python Affects Python cuDF API. label Nov 28, 2024

Correct parquet files

efc6ec0

mhaseeb123 changed the title ~~🚧 Support for reading bloom filters from Parquet files~~ Support reading and filtering row groups in Parquet reader using bloom filters Nov 28, 2024

mhaseeb123 requested review from karthikeyann, vuule, lamarrr and PointKernel November 28, 2024 02:31

mhaseeb123 commented Nov 28, 2024

mhaseeb123 marked this pull request as ready for review November 28, 2024 02:38

mhaseeb123 requested review from a team as code owners November 28, 2024 02:38

mhaseeb123 requested review from bdice and brandon-b-miller November 28, 2024 02:38

Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data

df84aca

mhaseeb123 changed the title ~~Support reading and filtering row groups in Parquet reader using bloom filters~~ Support reading bloom filters and filtering row groups in Parquet reader Nov 28, 2024

mhaseeb123 changed the title ~~Support reading bloom filters and filtering row groups in Parquet reader~~ Support reading bloom filters from Parquet files and filtering row groups with them. Nov 28, 2024

mhaseeb123 changed the title ~~Support reading bloom filters from Parquet files and filtering row groups with them.~~ Support reading bloom filters from Parquet files and use them for row groups filtering. Nov 28, 2024

mhaseeb123 changed the title ~~Support reading bloom filters from Parquet files and use them for row groups filtering.~~ Support reading bloom filters from Parquet files and filter row groups using them. Nov 28, 2024

mhaseeb123 changed the title ~~Support reading bloom filters from Parquet files and filter row groups using them.~~ Support reading bloom filters from Parquet files and filter row groups using them Nov 28, 2024
