
Support hyper log log plus plus (HLL++) #17133

Draft · wants to merge 11 commits into base: branch-25.02
Conversation

@res-life (Contributor) commented Oct 21, 2024

closes #10652

  • Group HLL
  • Test: Group HLL
  • Compact plain sketch to Spark compatible
  • Test: Compact plain sketch to Spark compatible
  • Merge HLL
  • Test Merge HLL
  • Reduction

HLL++ description

First, compute a 64-bit hash code and generate an integer pair: register index -> register value. The register index is in [0, 512) when the precision is 9.
Second, merge within the same group: for each register index, take the maximum register value.
So HLL++ is effectively a max aggregation performed 512 times over 512 integer columns.

e.g., below are 2 sketches in a group:

| register index 0 | register index 1 | ... | register index 511 |
| --- | --- | --- | --- |
| 1 | 22 | ... | 44 |
| 11 | 2 | ... | 4 |

The aggregation result is:

| register index 0 | register index 1 | ... | register index 511 |
| --- | --- | --- | --- |
| 11 | 22 | ... | 44 |

reduce_by_key memory issue

thrust::reduce_by_key allocates n intermediate sketch values (where n is num_rows), and these values are large: each sketch is typically 512 integers (512 * 4 = 2 KB). With 1G rows, reduce_by_key would use 2 KB * 1G = 2 TB of intermediate memory.

link

    // scan the values by flag
    thrust::detail::temporary_array<ValueType,ExecutionPolicy> scanned_values(exec, n);

thrust::reduce_by_key is therefore not suitable for merging HLL sketches.
This PR uses a new approach for the aggregation.

aggregation steps in this PR

Step 1: partial merge.
Split the rows into small segments of 256 items each; one thread handles one segment.
Because the group labels are sorted, the labels within each segment are also sorted.
While scanning the items in a segment, the thread keeps a cache recording the running max values; whenever it meets a new group, it writes out the result for the previous group.
Only N/256 caches are needed, which is relatively little memory.

Step 2:
Merge the caches and the per-segment results to get the final result.

For more details, please refer to the comments in the code.

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Oct 21, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue Java Affects Java cuDF API. labels Oct 21, 2024
@res-life res-life requested a review from ttnghia October 21, 2024 12:42
@vyasr (Contributor) commented Oct 22, 2024

Please link to #10652 as appropriate.

MERGE_TDIGEST, ///< create a tdigest by merging multiple tdigests together
HISTOGRAM, ///< compute frequency of each element
MERGE_HISTOGRAM, ///< merge partial values of HISTOGRAM aggregation
HLLPP, ///< approximating the number of distinct items by using hyper log log plus plus (HLLPP)
Comment (Contributor):

Suggested change
HLLPP, ///< approximating the number of distinct items by using hyper log log plus plus (HLLPP)
HLLPP, ///< approximating the number of distinct items using HyperLogLogPlusPlus

Comment (Contributor Author):

done.

using hash_value_type = uint64_t;

template <typename Key>
struct XXHash_64 {
Comment (Contributor):

This seems to be moved from the source file, which was added in #13612. However, I'm confused with the other XXHash_64 added in NVIDIA/spark-rapids-jni#1248. What is the difference between them? Is this XXHash_64 enough for addressing Spark's behavior in HLLPP?

Comment on lines 816 to 817
int const precision =
dynamic_cast<cudf::detail::merge_hyper_log_log_aggregation const&>(agg).precision;

Comment (Contributor):

Suggested change
int const precision =
dynamic_cast<cudf::detail::merge_hyper_log_log_aggregation const&>(agg).precision;
int const precision =
dynamic_cast<cudf::detail::merge_hyper_log_log_aggregation const&>(agg).precision;

Comment (Contributor):

Sorry, my eyes aren't able to pick up the difference suggested here. What's the change?

Comment (Contributor):

I suspect removal of the blank line, but github has a bug that doesn't show it.

@res-life res-life changed the title [Do not review] Support hyper log log plus plus(HLL++) Support hyper log log plus plus(HLL++) Nov 5, 2024
@res-life res-life changed the base branch from branch-24.12 to branch-25.02 November 26, 2024 07:52
@res-life res-life added non-breaking Non-breaking change feature request New feature or request labels Nov 26, 2024
@res-life (Contributor Author):

/ok to test

@res-life (Contributor Author):

Ready to review except test cases.

Successfully merging this pull request may close these issues.

[FEA] Support for approx_count_distinct