Experimental parquet decoder with first-class selection pushdown support #6921
Implemented some more optimizations and tuning; here are the ClickBench numbers on my machine. TLDR: about 15% total time reduction. We first compare no-pushdown vs our new pushdown implementation. Only Q27 has a meaningful slowdown; other queries are either similar or much faster. Fixing Q27 requires us to actually switch to a boolean mask-based selector implementation, like the one in #6624.
Now we compare our new implementation with the old pushdown implementation -- only Q23 is a bit slower; the others are either faster or similar. We do need some extra work to get optimal performance on Q23. Nonetheless, we are faster than no-pushdown. I believe a fix for Q23 does not require fundamental changes to the existing decoding pipeline.
The implementation of course lacks tons of tests (I tried to manually verify the ClickBench results). If the high-level stuff looks good, I'll try to send broken-down PRs that have tests and documentation. Like most performance-related PRs, some of the code changes can be very non-intuitive; please let me know and I'll try my best to explain why some code has to be implemented that way.
Starting to check it out
Thank you @XiangpengHao -- TLDR I think this POC looks really nice and the overall structure makes sense to me. I am willing to help review this PR as it moves closer to reality
There are obvious ways to break this PR up into pieces, which is a nice bonus -- the core caching logic is fairly localized
cc @thinkharderdev @tustvold @Dandandan @etseidl for your comments / reviews as well
I also think the description on the PR is quite good and easy to follow -- thank you for that
(todo: cite myself)
😆 my favorite part of the description
if we can cache the decompressed pages, then we only need to decode arrow twice, which might be good enough.
We can also consider caching arrow as a follow-on PR / project. If this initial PR effectively avoids decompressing each page twice (though it still decodes each page to arrow twice), that still seems better than the current main branch, which decompresses and decodes twice.
very nice @XiangpengHao. I think this makes a lot of sense.
Now that we have most of the leaf changes merged, it's finally time for the big change here! I have updated the PR description and added a lot of documentation. Please let me know if there is anything I can help clarify!
Thanks @XiangpengHao -- I plan to review this more carefully tomorrow, but I first want to finish up / merge (aka #6668 is before this PR in my queue)
I agree. We found |
@@ -683,7 +685,7 @@ impl ByteViewArrayDecoderDelta {

 /// Check that `val` is a valid UTF-8 sequence
 pub fn check_valid_utf8(val: &[u8]) -> Result<()> {
-    match std::str::from_utf8(val) {
+    match simdutf8::basic::from_utf8(val) {
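For readers following the hunk above, here is a minimal standard-library sketch of what such a validation helper might look like. The `String` error type and the message text are assumptions made for illustration; the real function returns the parquet crate's own `Result`. The diff swaps the `std::str::from_utf8` call for `simdutf8::basic::from_utf8`, which takes the same `&[u8]` input.

```rust
/// Check that `val` is a valid UTF-8 sequence.
/// Sketch only: the error type here is a plain String (an assumption),
/// not the error type used by the parquet crate.
fn check_valid_utf8(val: &[u8]) -> Result<(), String> {
    match std::str::from_utf8(val) {
        // The decoded &str is not needed, only the fact that validation passed.
        Ok(_) => Ok(()),
        Err(e) => Err(format!("encountered non UTF-8 data: {e}")),
    }
}
```

Calling `check_valid_utf8(b"parquet")` succeeds, while a lone `0xff` byte is rejected because it can never appear in valid UTF-8.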
Filed #7014
I think there is something wrong with how github is displaying this diff -- this callsite was changed to use simdutf8 in this PR
The version on main doesn't have any calls to `from_utf8`: https://github.com/apache/arrow-rs/blob/main/parquet/src/arrow/array_reader/byte_view_array.rs#L381
I just checked and this function goes away if main is merged in.
Which issue does this PR close?
Many long-standing issues in DataFusion and Parquet. Note that this PR may or may not close these issues, but (imo) it will be the foundation for further optimizations (e.g., more aggressive selection pushdown as described in this paper).
`parquet::column::reader::GenericColumnReader::skip_records` still decompresses most data #6454
Why selection pushdown?
Selection pushdown (also called late materialization, row-filter, or filter pushdown) is great in concept, but can be tricky to implement efficiently. For example, the current straightforward implementation actually slows down many queries, which prevents query engines like DataFusion from enabling filter pushdown by default. The goal of a super fast row-filter pushdown parquet reader is described by @alamb in #5523 (comment):
Previous discussions have listed many potential optimizations to the current selection pushdown, like the ones in #5523 (comment).
However, it's not clear how we can incorporate those optimizations into the current implementation. After thinking more carefully about the design space and its implications, I believe the only way to reach that goal is to restructure the parquet reading pipeline while reusing as much of the existing implementation as possible.
Current implementation and the problems
We currently implement a two phase decoding:
Phase 1: Build selectors on each predicate
Phase 2: Decode parquet data using the selector
The problem is that we have to decode the predicate column twice. For example, if column 1 is being filtered, we first need to decode column 1 while evaluating the predicate, then decode it again to build the output array.
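The double decode can be illustrated with a toy model. This is a sketch under stated assumptions, not the parquet crate's reader: `EncodedColumn`, `build_selection`, `decode_selected`, and `two_phase` are hypothetical names, and a plain `Vec<bool>` stands in for the selector.

```rust
use std::cell::Cell;

/// Hypothetical "encoded" column that counts how many times it is decoded.
struct EncodedColumn {
    values: Vec<i64>,
    decode_calls: Cell<usize>,
}

impl EncodedColumn {
    fn new(values: Vec<i64>) -> Self {
        Self { values, decode_calls: Cell::new(0) }
    }

    /// Stand-in for decompressing and decoding the column chunk.
    fn decode(&self) -> Vec<i64> {
        self.decode_calls.set(self.decode_calls.get() + 1);
        self.values.clone()
    }
}

/// Phase 1: decode the predicate column and build a selection mask.
fn build_selection(col: &EncodedColumn, predicate: impl Fn(i64) -> bool) -> Vec<bool> {
    col.decode().into_iter().map(predicate).collect()
}

/// Phase 2: decode the same column again and keep only the selected rows.
fn decode_selected(col: &EncodedColumn, selection: &[bool]) -> Vec<i64> {
    col.decode()
        .into_iter()
        .zip(selection.iter().copied())
        .filter(|(_, keep)| *keep)
        .map(|(v, _)| v)
        .collect()
}

/// Runs both phases; returns the output rows and the number of decodes performed.
fn two_phase(values: Vec<i64>, predicate: impl Fn(i64) -> bool) -> (Vec<i64>, usize) {
    let col = EncodedColumn::new(values);
    let selection = build_selection(&col, predicate);
    let rows = decode_selected(&col, &selection);
    (rows, col.decode_calls.get())
}
```

Calling `two_phase(vec![1, 5, 10, 20], |v| v > 4)` yields the three matching rows, and the counter reports two decodes of the same column, which is the redundant work described above.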
Caching is the solution but not that simple
The high-level intuition is that, if the problem is decoding twice, we simply cache the results of the first decoding and reuse them later.
Here are the nuances:
Proposed solutions
The solution consists of two parts:
The pipeline looks like this:
Once we have this pipeline, we can cache the `predicate columns for batch N` and reuse them when we `load & emit batch N`, which avoids double decoding.
Due to the difficulties mentioned above, this PR caches the decompressed pages rather than the decoded arrow arrays. Some research suggests that decompressing pages costs up to twice as much as decoding arrow, so if we can cache the decompressed pages, we only need to decode arrow twice, which might be good enough. Caching decompressed pages is also much simpler to implement, as we can reuse the current array_readers and just implement a new PageReader.
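The idea of caching decompressed pages can be sketched as follows. This is a toy model, not the PageReader added in this PR: `CachingPageSource`, `PageKey`, `decode`, and `filter_then_emit` are hypothetical names, decompression is simulated by copying bytes, and the cache key of (column index, page offset) is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: (column index, page offset within the file).
type PageKey = (usize, u64);

/// Toy page source that counts decompressions.
struct CachingPageSource {
    compressed: HashMap<PageKey, Vec<u8>>,
    cache: HashMap<PageKey, Vec<u8>>,
    decompress_calls: usize,
}

impl CachingPageSource {
    fn new(compressed: HashMap<PageKey, Vec<u8>>) -> Self {
        Self { compressed, cache: HashMap::new(), decompress_calls: 0 }
    }

    /// Stand-in for real decompression (here it only copies the bytes).
    fn decompress(&mut self, key: PageKey) -> Option<Vec<u8>> {
        let bytes = self.compressed.get(&key)?.clone();
        self.decompress_calls += 1;
        Some(bytes)
    }

    /// Returns the decompressed page, decompressing only on a cache miss.
    fn page(&mut self, key: PageKey) -> Option<&[u8]> {
        if !self.cache.contains_key(&key) {
            let page = self.decompress(key)?;
            self.cache.insert(key, page);
        }
        self.cache.get(&key).map(|p| p.as_slice())
    }
}

/// Stand-in for decoding a decompressed page into an array.
fn decode(page: &[u8]) -> Vec<i64> {
    page.iter().map(|&b| i64::from(b)).collect()
}

/// Reads the predicate column twice (filter, then output) through the cache.
fn filter_then_emit(source: &mut CachingPageSource, key: PageKey, min: i64) -> Option<Vec<i64>> {
    // First read: decode to evaluate the predicate.
    let mask: Vec<bool> = decode(source.page(key)?).into_iter().map(|v| v >= min).collect();
    // Second read: decode again for output; the page is already decompressed.
    let values = decode(source.page(key)?);
    Some(values.into_iter().zip(mask).filter(|(_, keep)| *keep).map(|(v, _)| v).collect())
}
```

In this model the page is decoded twice but decompressed once: after `filter_then_emit` runs on a cached key, `decompress_calls` stays at 1. A real cache would also need an eviction policy to bound memory, which this sketch leaves out.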
What changes are included in this PR?
This PR only implements a reader for the async record batch stream. The sync version is left as future work and should be straightforward to build on top of the async version.
Are there any user-facing changes?
No. The same `ParquetRecordBatchStream` will automatically benefit from the changes.