feat: Add classification functions #11792

yuandagits · 2024-12-09T05:08:24Z

Summary:
Add the classification functions from presto into velox: https://prestodb.io/docs/current/functions/aggregate.html#classification-metrics-aggregate-functions

Classification functions all use `FixedDoubleHistogram`, a data structure that represents buckets of weights. The histogram's buckets are evenly distributed between the min and max values.

For all of the classification functions, the only difference is the extraction phase. All other steps will be the same.

At a high level:

- `addRawInput` adds a value to either the true or the false weight bucket. The bucket is chosen by the prediction value, which is linearly mapped to a bucket index from (min, max, bucketCount) by normalizing the prediction between min and max.
- The schema of the intermediate state is [version header][bucket count][min][max][weights]
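
To make the bucket mapping above concrete, here is a minimal sketch. It is not the code from this PR: `bucketIndex`, `Weights`, and `addInput` are hypothetical names, and clamping the top edge (prediction == max) into the last bucket is an assumption about the intended behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: maps a prediction in [min, max] to a bucket index in
// [0, bucketCount) by normalizing it between min and max.
inline int64_t
bucketIndex(double value, double min, double max, int64_t bucketCount) {
  assert(bucketCount > 0 && min < max);
  assert(value >= min && value <= max);
  const double normalized = (value - min) / (max - min);
  const auto index = static_cast<int64_t>(std::floor(normalized * bucketCount));
  // Assumption: value == max would land one past the end, so clamp it.
  return std::min(index, bucketCount - 1);
}

// Hypothetical state: one weight array for true outcomes, one for false.
struct Weights {
  explicit Weights(int64_t bucketCount)
      : trueWeights(bucketCount, 0.0), falseWeights(bucketCount, 0.0) {}
  std::vector<double> trueWeights;
  std::vector<double> falseWeights;
};

// Adds `weight` to the true or false bucket selected by the prediction.
inline void addInput(
    Weights& state,
    bool outcome,
    double prediction,
    double min,
    double max,
    double weight = 1.0) {
  const auto count = static_cast<int64_t>(state.trueWeights.size());
  const auto index = bucketIndex(prediction, min, max, count);
  (outcome ? state.trueWeights : state.falseWeights)[index] += weight;
}
```

With four buckets over [0, 1], a prediction of 0.6 falls into bucket 2 and a prediction of 1.0 is clamped into bucket 3.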

Differential Revision: D66684198

facebook-github-bot · 2024-12-09T05:08:32Z

This pull request was exported from Phabricator. Differential Revision: D66684198

netlify · 2024-12-09T05:08:41Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`3c83e0e`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/675ddaad5bc99d00089cbf2f


Yuhta · 2024-12-09T16:48:41Z

velox/functions/prestosql/aggregates/ClassificationAggregation.cpp

+    in.copyTo(&min, 1);
+    in.copyTo(&max, 1);
+
+    auto ret = FixedDoubleHistogram(bucketCount, min, max, allocator);


Can we avoid memory allocation for the buckets? Just merge with a view on the deserialized bytes (be careful about the alignment though).

Ohhhh good idea
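
One possible reading of the suggestion, as a sketch only: instead of building a temporary `FixedDoubleHistogram` from the serialized state, add the serialized weights straight onto the accumulator's weights. `mergeSerializedWeights` is a hypothetical name, and the sketch assumes `bytes` already points at the weights section and that the bucket count, min, and max were checked to match. Reading each double with `std::memcpy` avoids relying on the buffer being 8-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: merges serialized weights into `weights` in place,
// without allocating a second bucket array.
inline void mergeSerializedWeights(
    const char* bytes,
    std::vector<double>& weights) {
  assert(bytes != nullptr);
  for (std::size_t i = 0; i < weights.size(); ++i) {
    double serialized;
    // memcpy is safe even when `bytes` is not aligned for double.
    std::memcpy(&serialized, bytes + i * sizeof(double), sizeof(double));
    weights[i] += serialized;
  }
}
```

A true zero-copy view (casting the bytes to `const double*`) would additionally require the buffer to be suitably aligned, which is the caveat raised in the comment.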

Yuhta · 2024-12-09T18:01:59Z

velox/functions/prestosql/aggregates/ClassificationAggregation.cpp

+  /// std::vector<double>::max_size(), which may be less than 2^63 depending. To
+  /// account for this, we have two buckets which may be used to store the
+  /// weights with each bucket being at most kMaxBucketCount in size.
+  static constexpr int64_t kMaxBucketCount =


In practice, `max_size` is at least 2^60 on 64-bit systems, and I don't think any system can hand out that much contiguous memory in one go. So you don't need to split the weights into two arrays to cover the discrepancy between 2^60 and 2^64 (you can put a `VELOX_CHECK_LT(bucketCount, weights_.max_size())` in `validateParameters` if you are really concerned about this).

Agreed, I added this because it was something the fuzzer caught
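
For illustration, the suggested guard could look roughly like the sketch below. It uses a standard exception as a stand-in for `VELOX_CHECK_LT` so that it compiles without Velox. `validateBucketCount` is a hypothetical name, and the positivity check is an added assumption, not something stated in this thread.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a VELOX_CHECK_LT inside validateParameters:
// rejects bucket counts that a single std::vector<double> cannot hold.
inline void validateBucketCount(int64_t bucketCount) {
  const std::vector<double> weights;
  // Assumption: a histogram needs at least one bucket.
  if (bucketCount <= 0) {
    throw std::invalid_argument("bucketCount must be positive");
  }
  if (static_cast<uint64_t>(bucketCount) >= weights.max_size()) {
    throw std::invalid_argument("bucketCount exceeds vector max_size");
  }
}
```

With this check in place, a single weights array per outcome is enough, and oversized inputs such as those produced by a fuzzer fail validation instead of reaching the allocation.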


yuandagits requested review from assignUser and majetideepak as code owners December 9, 2024 05:08

facebook-github-bot added the CLA Signed label Dec 9, 2024

facebook-github-bot added the fb-exported label Dec 9, 2024

yuandagits force-pushed the export-D66684198 branch from e939248 to d2b00ed Compare December 9, 2024 05:17

yuandagits changed the title ~~feat: add classification functions~~ feat: Add classification functions Dec 9, 2024

yuandagits requested review from Yuhta and xiaoxmeng December 9, 2024 05:20

Yuhta reviewed Dec 9, 2024


yuandagits force-pushed the export-D66684198 branch from d2b00ed to 6bf72c8 Compare December 11, 2024 15:40

yuandagits force-pushed the export-D66684198 branch from 6bf72c8 to ef481a3 Compare December 11, 2024 19:17

yuandagits force-pushed the export-D66684198 branch from ef481a3 to 1de9154 Compare December 13, 2024 00:15

yuandagits requested a review from Yuhta December 13, 2024 16:11

Yuhta approved these changes Dec 13, 2024


yuandagits force-pushed the export-D66684198 branch from 1de9154 to 986bf72 Compare December 14, 2024 05:44

yuandagits force-pushed the export-D66684198 branch from 986bf72 to 3c83e0e Compare December 14, 2024 19:21
