-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix parquet predicate filtering with column projection #15113
Fix parquet predicate filtering with column projection #15113
Conversation
This is great! Nice work @karthikeyann. Let's please add a step to drop any filter columns that weren't part of the requested column set before returning the table. |
Thanks, like @GregoryKimball it would be nice if the column selection were disjoint from the filtering selection logic in terms of determining what columns are finally returned. So that I can filter on (say) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions
get_columns() need not specify all columns in filter. columns in filter are read, and discard finally at the output after filtering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first pass - small suggestions
/ok to test |
@etseidl Does this change look like it will work for your application? |
Yes, I believe so. |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM
min.set_index(stats_idx, thrust::nullopt, {}); | ||
max.set_index(stats_idx, thrust::nullopt, {}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (non-blocking): Possibly worthwhile migrating set_index
to std::optional
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min_value
and min
from Statistics
struct uses thrust::optional
, which is passed here.
https://github.com/rapidsai/cudf/blob/064dd7b02166cc67e882b708d66621bc3fafd70b/cpp/src/io/parquet/parquet.hpp uses thrust::optional
everywhere (except at 2 places). Not sure why.
@vuule
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the thrift data structures use thrust::optional
over std::optional
because some are used on device. I assume these will migrate to cuda::std::optional
eventually. #15091 (comment)
auto read_opts = cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath}) | ||
.columns({"col_double", "col_uint32"}) | ||
.filter(read_ref_expr); | ||
EXPECT_THROW(cudf::io::read_parquet(read_opts), cudf::logic_error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (non-blocking): should these throw (per #12885) respectively cudf::data_type_error
and std::out_of_range
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice suggestion. These exceptions are thrown from AST. I will create another PR to fix it in AST code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are two reasons to throw, we need two separate tests
/ok to test |
@@ -127,6 +123,7 @@ class aggregate_reader_metadata { | |||
int64_t num_rows; | |||
size_type num_row_groups; | |||
|
|||
std::vector<data_type> _output_types; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to keep _output_types
cached here or can we clear this after the output has been built? Does extract output_dtypes
like inside reader::impl::preprocess_file
incur a lot of repetitive overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the effort. The changes look good. Not a blocker but please confirm if we need to cache output_dtypes
at this time or if just constructing it on the fly would be better.
Updated to construct on the fly. will work with |
/ok to test |
…nn/cudf into fix-pq_filter_col_projection
/ok to test |
/merge |
Description
Fixes #15051
The predicate filtering in parquet did not work while column projection is used. This PR fixes that limitation.
With this PR change, the user will be able to use both column name reference and column index reference in the filter.
This is achieved by extracting column names from filter and add to output buffers, after predicate filtering is done, these filter-only columns are removed and only requested columns are returned.
The change includes reading only output columns' statistics data instead of all root columns.
Summary of changes:
get_column_names_in_expression
extracts column names in filter.cpp/src/io/parquet/reader_impl_helpers.cpp
,cpp/src/io/parquet/reader_impl.cpp
cpp/src/io/parquet/predicate_pushdown.cpp
schema_idx
in row group metadata.named_to_reference_converter
constructor to cpp, and remove used constructor.Checklist