[exporter/awsemfexporter]Split EMF log with larger than 100 buckets. #36336

zzhlogin · 2024-11-12T19:42:43Z

Description:
In Application Signals, we use Base2 Exponential Bucket Histograms to aggregate and send latency data, with a default maximum of 160 buckets. In the EMF exporter, these buckets are mapped to "Target members" in EMF log entries.
However, CloudWatch EMF logs impose a limit of 100 target members. Beyond that limit, EMF processors mark the record as invalid, which results in missing metrics and customer-facing errors reported via the EMFValidationErrors metric.

In this PR, we split such histograms into multiple EMF logs (two for the default maximum of 160 buckets), with the following changes:

- Add an extra attribute, metricIndex, to groupedMetricMetadata. The EMF exporter currently aggregates incoming metrics into groupedMetrics before converting them to log events, and the groupKey is generated from the groupedMetricMetadata: metric namespace, timestamp, log group name, etc. After splitting, the new metrics would share exactly the same key, so adding this extra metadata field to key generation prevents the second metric from being dropped.
- If the total number of buckets exceeds 100, the exponential histogram metric is split into multiple data points as needed, each containing at most 100 buckets, to comply with CloudWatch EMF log constraints.

For each split data point:

- Min and Max values are recalculated from the bucket boundaries within that specific split.
- Sum is assigned only to the first split, so that the total sum of the data points after aggregation is correct.
- Count is accumulated from the bucket counts within each split.
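
The splitting rules above can be illustrated with a minimal, self-contained sketch. This is not the code in exporter/awsemfexporter/datapoint.go: the types `bucket`, `splitDataPoint`, and `groupKey`, and the helpers `splitBuckets` and `makeBuckets`, are hypothetical names invented for illustration, and the buckets are assumed to be already sorted by boundary. It shows chunks of at most 100 buckets, Min/Max taken from the chunk's boundaries, Sum carried only by the first chunk, Count accumulated per chunk, and a metricIndex field keeping the group keys distinct.

```go
package main

import "fmt"

// maxTargetMembers mirrors the CloudWatch EMF limit described in this PR.
const maxTargetMembers = 100

// bucket is a hypothetical, simplified histogram bucket.
type bucket struct {
	lower, upper float64
	count        uint64
}

// splitDataPoint is a hypothetical container for one EMF-sized piece.
type splitDataPoint struct {
	buckets []bucket
	minVal  float64
	maxVal  float64
	sum     float64
	count   uint64
}

// groupKey is a hypothetical stand-in for the key derived from
// groupedMetricMetadata; metricIndex keeps the splits from colliding.
type groupKey struct {
	namespace   string
	timestampMs int64
	logGroup    string
	metricIndex int
}

// splitBuckets cuts buckets (assumed sorted by boundary) into chunks of at
// most maxTargetMembers and derives min, max, sum, and count per chunk.
func splitBuckets(buckets []bucket, sum float64) []splitDataPoint {
	var splits []splitDataPoint
	for start := 0; start < len(buckets); start += maxTargetMembers {
		end := start + maxTargetMembers
		if end > len(buckets) {
			end = len(buckets)
		}
		chunk := buckets[start:end]
		dp := splitDataPoint{
			buckets: chunk,
			minVal:  chunk[0].lower,            // lowest boundary in this chunk
			maxVal:  chunk[len(chunk)-1].upper, // highest boundary in this chunk
		}
		for _, b := range chunk {
			dp.count += b.count // count is accumulated per chunk
		}
		if start == 0 {
			dp.sum = sum // only the first chunk carries the sum
		}
		splits = append(splits, dp)
	}
	return splits
}

// makeBuckets builds n unit-width buckets with one observation each.
func makeBuckets(n int) []bucket {
	out := make([]bucket, n)
	for i := range out {
		out[i] = bucket{lower: float64(i), upper: float64(i + 1), count: 1}
	}
	return out
}

func main() {
	splits := splitBuckets(makeBuckets(160), 12800)

	// Without metricIndex, both splits would map to the same key and the
	// second would overwrite the first.
	grouped := make(map[groupKey]splitDataPoint)
	for i, dp := range splits {
		key := groupKey{namespace: "ns", timestampMs: 1, logGroup: "lg", metricIndex: i}
		grouped[key] = dp
	}

	fmt.Println("splits:", len(splits), "grouped entries:", len(grouped))
	for i, dp := range splits {
		fmt.Printf("split %d: buckets=%d min=%v max=%v sum=%v count=%d\n",
			i, len(dp.buckets), dp.minVal, dp.maxVal, dp.sum, dp.count)
	}
}
```

Because Sum lives only on the first split and Count is spread across the splits, adding the splits back together yields the original totals, which is the property the description relies on.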

Testing:
The change was tested by generating traffic with more than 100 buckets; EMF logs with more than 100 values are eliminated after the change.

Compare the added Benchmark test before vs after the code change:
Benchmark test with 100 bucket length:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter
cpu: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
                                                  │ before_100.txt │           after_100.txt            │
                                                  │     sec/op     │   sec/op     vs base               │
GetAndCalculateDeltaDataPointsInclude100Buckets-16      19.68µ ± 5%   20.55µ ± 6%  +4.41% (p=0.015 n=10)

                                                  │ before_100.txt │            after_100.txt            │
                                                  │      B/op      │     B/op      vs base               │
GetAndCalculateDeltaDataPointsInclude100Buckets-16     11.50Ki ± 0%   12.02Ki ± 0%  +4.48% (p=0.000 n=10)

                                                  │ before_100.txt │           after_100.txt           │
                                                  │   allocs/op    │ allocs/op   vs base               │
GetAndCalculateDeltaDataPointsInclude100Buckets-16       126.0 ± 0%   132.0 ± 0%  +4.76% (p=0.000 n=10)

Benchmark test with 200 bucket length:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter
cpu: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
                                                  │ before_200.txt │           after_200.txt            │
                                                  │     sec/op     │   sec/op     vs base               │
GetAndCalculateDeltaDataPointsInclude200Buckets-16      26.36µ ± 6%   28.01µ ± 5%  +6.24% (p=0.011 n=10)

                                                  │ before_200.txt │            after_200.txt            │
                                                  │      B/op      │     B/op      vs base               │
GetAndCalculateDeltaDataPointsInclude200Buckets-16     15.50Ki ± 0%   16.59Ki ± 0%  +7.06% (p=0.000 n=10)

                                                  │ before_200.txt │           after_200.txt            │
                                                  │   allocs/op    │ allocs/op   vs base                │
GetAndCalculateDeltaDataPointsInclude200Buckets-16       128.0 ± 0%   152.0 ± 0%  +18.75% (p=0.000 n=10)

Benchmark test with 300 bucket length:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter
cpu: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
                                                  │ before_300.txt │           after_300.txt            │
                                                  │     sec/op     │   sec/op     vs base               │
GetAndCalculateDeltaDataPointsInclude300Buckets-16      37.04µ ± 6%   39.17µ ± 5%  +5.73% (p=0.029 n=10)

                                                  │ before_300.txt │            after_300.txt             │
                                                  │      B/op      │     B/op      vs base                │
GetAndCalculateDeltaDataPointsInclude300Buckets-16     23.50Ki ± 0%   20.98Ki ± 0%  -10.70% (p=0.000 n=10)

                                                  │ before_300.txt │           after_300.txt            │
                                                  │   allocs/op    │ allocs/op   vs base                │
GetAndCalculateDeltaDataPointsInclude300Buckets-16       130.0 ± 0%   171.0 ± 0%  +31.54% (p=0.000 n=10)

Benchmark test with 500 bucket length:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter
cpu: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
                                                  │ before_500.txt │            after_500.txt            │
                                                  │     sec/op     │   sec/op     vs base                │
GetAndCalculateDeltaDataPointsInclude500Buckets-16      52.51µ ± 3%   58.15µ ± 5%  +10.74% (p=0.000 n=10)

                                                  │ before_500.txt │            after_500.txt             │
                                                  │      B/op      │     B/op      vs base                │
GetAndCalculateDeltaDataPointsInclude500Buckets-16     23.50Ki ± 0%   30.14Ki ± 0%  +28.26% (p=0.000 n=10)

                                                  │ before_500.txt │           after_500.txt            │
                                                  │   allocs/op    │ allocs/op   vs base                │
GetAndCalculateDeltaDataPointsInclude500Buckets-16       130.0 ± 0%   210.0 ± 0%  +61.54% (p=0.000 n=10)

mxiamxia

Please fix the PR checks


github-actions · 2024-11-27T05:20:50Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

zzhlogin · 2024-11-28T16:43:48Z

> This PR was marked stale due to lack of activity. It will be closed in 14 days.

Adding @Aneurysm9 for help on review.

exporter/awsemfexporter/datapoint.go

Aneurysm9

Where are the benchmarks? I don't see them in the PR. The benchstat output is also missing the after and comparison data.

exporter/awsemfexporter/datapoint.go

zzhlogin · 2024-12-04T17:43:02Z

> Where are the benchmarks? I don't see them in the PR. The benchstat output is also missing the after and comparison data.

Sorry, I failed to execute the comparison command before; I have now added the after.txt output to the description. The benchmark tests are located in "exporter/awsemfexporter/datapoint_test.go" at lines 2075 and 2076.


songy23

This leads to test failures, see #36727


… buckets." (#36771)

#### Description
This PR fixes the flaky unit test from the previous PR, #36336, and adds back the implementation of the EMF log splitting logic.

#### Link to tracking issue
#36727

#### Testing
Unit test updated and passed with a count of 10:
```
go test -run TestAddToGroupedMetric -count 10 -tags=always
PASS
ok github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter 0.016s
```


Split EMF log with larger than 100 buckets.

5eef5bd

zzhlogin requested a review from a team as a code owner November 12, 2024 19:42

zzhlogin requested a review from ChrsMark November 12, 2024 19:42

github-actions bot assigned bogdandrutu Nov 12, 2024

github-actions bot added the exporter/awsemf awsemf exporter label Nov 12, 2024

Remove un-used code.

a4e0a1a

zzhlogin force-pushed the emf-split branch from 4e871ac to a4e0a1a Compare November 12, 2024 20:02

mxiamxia approved these changes Nov 12, 2024


mxiamxia reviewed Nov 12, 2024


zzhlogin changed the title ~~Split EMF log with larger than 100 buckets.~~ [exporter/awsemfexporter]Split EMF log with larger than 100 buckets. Nov 12, 2024

Merge branch 'main' into emf-split

3e4dcc6

github-actions bot requested review from Aneurysm9 and bryan-aguilar November 12, 2024 23:56

zzhlogin added 5 commits November 13, 2024 00:05

Add changelog.

ed126c7

Merge branch 'emf-split' of https://github.com/zzhlogin/opentelemetry…

83f627e

…-collector-contrib-aws into emf-split

Merge branch 'main' into emf-split

b674e36

Eliminate zero splits.

0eecd3f

Merge branch 'emf-split' of https://github.com/zzhlogin/opentelemetry…

1b6ab28

…-collector-contrib-aws into emf-split

github-actions bot added the Stale label Nov 27, 2024

Add exponential histrogram LongBuckets cases into benchmark tests.

9b942c8

github-actions bot removed the Stale label Nov 29, 2024

Aneurysm9 reviewed Dec 2, 2024


Address comments.

c4c84fa

Aneurysm9 reviewed Dec 3, 2024


zzhlogin added 3 commits December 4, 2024 09:55

refine comments.

186cc46

Update benchemark tests to include 100, 200, 300, 500 buckets.

298ac46

Refine benchmark tests.

baa5f3f

Aneurysm9 approved these changes Dec 4, 2024


atoulme reviewed Dec 5, 2024


zzhlogin added 2 commits December 5, 2024 19:20

Apply gofumpt.

3269d4a

Merge branch 'main' of https://github.com/open-telemetry/opentelemetr…

1065b47

…y-collector-contrib into emf-split

atoulme approved these changes Dec 6, 2024


atoulme added the ready to merge Code review completed; ready to merge by maintainers label Dec 6, 2024

evan-bradley merged commit 5eedf95 into open-telemetry:main Dec 6, 2024
168 checks passed

github-actions bot added this to the next release milestone Dec 6, 2024

ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this pull request Dec 6, 2024

[exporter/awsemfexporter]Split EMF log with larger than 100 buckets. (o…

7b8ed1c

…pen-telemetry#36336)

songy23 reviewed Dec 10, 2024


songy23 mentioned this pull request Dec 10, 2024

Revert "[exporter/awsemfexporter]Split EMF log with larger than 100 buckets." #36763

Merged

mx-psi pushed a commit that referenced this pull request Dec 10, 2024

Revert "[exporter/awsemfexporter]Split EMF log with larger than 100 b…

ffd031a

…uckets." (#36763) Reverts #36336 leads to test failures, see #36727

zzhlogin mentioned this pull request Dec 10, 2024

Add back "[exporter/awsemfexporter]Split EMF log with larger than 100 buckets." #36771

Merged

sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request Dec 17, 2024

[exporter/awsemfexporter]Split EMF log with larger than 100 buckets. (o…

79c5497

…pen-telemetry#36336)

AkhigbeEromo pushed a commit to sematext/opentelemetry-collector-contrib that referenced this pull request Jan 13, 2025

[exporter/awsemfexporter]Split EMF log with larger than 100 buckets. (o…

4444e9d

…pen-telemetry#36336)
