Support hyper log log plus plus(HLL++) #17133
base: branch-25.02
Conversation
Please link to #10652 as appropriate.
cpp/include/cudf/aggregation.hpp
Outdated
MERGE_TDIGEST,    ///< create a tdigest by merging multiple tdigests together
HISTOGRAM,        ///< compute frequency of each element
MERGE_HISTOGRAM,  ///< merge partial values of HISTOGRAM aggregation
HLLPP,            ///< approximating the number of distinct items by using hyper log log plus plus (HLLPP)
Suggested change:
- HLLPP,  ///< approximating the number of distinct items by using hyper log log plus plus (HLLPP)
+ HLLPP,  ///< approximating the number of distinct items using HyperLogLogPlusPlus
done.
using hash_value_type = uint64_t;

template <typename Key>
struct XXHash_64 {
This seems to be moved from the source file, which was added in #13612. However, I'm confused by the other XXHash_64 added in NVIDIA/spark-rapids-jni#1248. What is the difference between them? Is this XXHash_64 enough to address Spark's behavior in HLLPP?
cpp/src/groupby/sort/aggregate.cpp
Outdated
int const precision =
  dynamic_cast<cudf::detail::merge_hyper_log_log_aggregation const&>(agg).precision;
Suggested change:
int const precision =
  dynamic_cast<cudf::detail::merge_hyper_log_log_aggregation const&>(agg).precision;
int const precision =
  dynamic_cast<cudf::detail::merge_hyper_log_log_aggregation const&>(agg).precision;
Sorry, my eyes aren't able to pick up the difference suggested here. What's the change?
I suspect removal of the blank line, but github has a bug that doesn't show it.
Signed-off-by: Chong Gao <[email protected]>
/ok to test
Ready to review except test cases.
closes #10652
HLL++ description
First, compute a 64-bit hash code and generate an integer pair: register index -> register value. The register index is in [0, 512) when the precision is 9 (2^9 = 512).
Second, merge within each group: for each register index, keep the max register value.
So HLL++ is essentially a process of doing a max aggregation 512 times over 512 integer columns.
For example, merging two sketches in the same group takes the element-wise max of their 512 registers.
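The pairing step can be sketched on the CPU as follows. This is a minimal illustration assuming the common HLL bit layout (top `precision` bits select the register; the register value is the leading-zero count of the remaining bits plus one); the function name is hypothetical and Spark's exact bit layout may differ.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical sketch: split a 64-bit hash into (register index, register value).
// Assumes the common HLL convention: the top `precision` bits select the
// register, and the register value is the count of leading zeros in the
// remaining bits plus one. The actual Spark/cudf bit layout may differ.
std::pair<uint32_t, uint32_t> hash_to_register(uint64_t hash, int precision)
{
  uint32_t const index = static_cast<uint32_t>(hash >> (64 - precision));
  uint64_t const rest  = hash << precision;  // drop the index bits
  // Leading zeros of the remaining (64 - precision) bits, capped, plus one.
  uint32_t value = 1;
  uint64_t mask  = uint64_t{1} << 63;
  while (value <= static_cast<uint32_t>(64 - precision) && (rest & mask) == 0) {
    ++value;
    mask >>= 1;
  }
  return {index, value};
}
```

With precision 9, the index takes the top 9 bits (so it falls in [0, 512)), matching the description above.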
reduce_by_key memory issue
thrust::reduce_by_key will allocate n sketch values (n is num_rows), and sketches are large: typically each sketch is 512 integers (512 * 4 = 2 KB). With 1G rows, reduce_by_key would use 2 KB * 1G = 2 TB of intermediate memory (link).
So thrust::reduce_by_key is not suitable for merging HLL sketches. This PR uses a new way to do the aggregation.
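The arithmetic above can be checked directly (constant names here are illustrative, not from the PR):

```cpp
#include <cstdint>

// Back-of-the-envelope arithmetic for the intermediate memory reduce_by_key
// would need if it materialized one full sketch per input row.
constexpr int64_t regs_per_sketch  = 512;               // precision 9 => 2^9 registers
constexpr int64_t bytes_per_reg    = 4;                 // one int per register
constexpr int64_t bytes_per_sketch = regs_per_sketch * bytes_per_reg;  // 2 KB
constexpr int64_t num_rows         = int64_t{1} << 30;  // ~1G rows
constexpr int64_t total_bytes      = bytes_per_sketch * num_rows;      // 2^41 bytes = 2 TiB
```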
aggregation steps in this PR
Step 1: partial merge.
Split the rows into small segments, each containing 256 items.
One thread handles one segment.
Because the group labels are sorted, the labels within each segment are also sorted.
Within each segment, produce a cache recording the max values; this needs only N/256 caches, which is relatively little memory. While scanning the items in a segment, whenever a new group begins, output the result for the previous group.
Step 2: merge the cache and the result to get the final result.
For more details, please refer to the comments in the code.
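The two-step segmented merge can be sketched on the CPU as follows, reduced to a single register column for clarity. The names, the tiny segment size, and the sequential loops are illustrative assumptions; the actual implementation processes all registers of a sketch on the GPU.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical CPU sketch of the segmented max-merge described above, with one
// register value per row instead of a 512-register sketch. Group labels are
// assumed sorted, as in the sort-based groupby.
constexpr int SEGMENT_SIZE = 4;  // the PR uses 256; small here for illustration

std::vector<int> segmented_group_max(std::vector<int> const& labels,
                                     std::vector<int> const& values,
                                     int num_groups)
{
  std::vector<int> result(num_groups, 0);
  int const n            = static_cast<int>(labels.size());
  int const num_segments = (n + SEGMENT_SIZE - 1) / SEGMENT_SIZE;
  // Step 1: each "thread" handles one segment, writing completed groups to
  // `result` and the trailing (possibly incomplete) group to a small cache.
  std::vector<int> cache_label(num_segments), cache_max(num_segments);
  for (int seg = 0; seg < num_segments; ++seg) {
    int const begin = seg * SEGMENT_SIZE;
    int const end   = std::min(begin + SEGMENT_SIZE, n);
    int cur_label = labels[begin], cur_max = values[begin];
    for (int i = begin + 1; i < end; ++i) {
      if (labels[i] != cur_label) {  // new group: flush the previous one
        result[cur_label] = std::max(result[cur_label], cur_max);
        cur_label = labels[i];
        cur_max   = values[i];
      } else {
        cur_max = std::max(cur_max, values[i]);
      }
    }
    cache_label[seg] = cur_label;  // last group of the segment stays cached
    cache_max[seg]   = cur_max;
  }
  // Step 2: merge the per-segment caches into the result.
  for (int seg = 0; seg < num_segments; ++seg) {
    result[cache_label[seg]] = std::max(result[cache_label[seg]], cache_max[seg]);
  }
  return result;
}
```

The cache holds one entry per segment (N/SEGMENT_SIZE entries total), which is the small intermediate footprint the PR relies on, in contrast to the per-row sketches reduce_by_key would allocate.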