Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporter/awsemfexporter]Split EMF log with larger than 100 buckets. #36336

Merged
merged 15 commits into from
Dec 6, 2024

Conversation

zzhlogin
Copy link
Contributor

@zzhlogin zzhlogin commented Nov 12, 2024

Description:
In Application Signals, we utilize Base2 Exponential Bucket Histogram to aggregate and send latency data, with a default max number of buckets 160. In EMF exporter, these buckets are mapped to "Target members" in EMF log entries.
However, CloudWatch EMF logs impose a limit of 100 target members, beyond which EMF processors will mark the record as invalid, resulting in missing metrics and customer-facing errors reported via the EMFValidationErrors metric.

In this PR, we split histograms to two sub EMF logs with the following change:

  1. Add an extra attribute metricIndex to groupedMetricMetadata : Current EMF exporter aggregate incoming metrics into groupedMetrics before converting to log events, where the groupKey is generated based on the groupedMetricMetadata including: metric namespace, timestamp, log group name, etc. After splitting, the two new metrics will share exactly the same key. Adding an extra metric metadata for key generation can prevent the second metric from dropping.
  2. If the total buckets exceed 100, the exponential histogram metric are split into into multiple data points as needed,
    each containing a maximum of 100 buckets, to comply with CloudWatch EMF log constraints.
    For each split data point:
  • Min and Max values are recalculated based on the bucket boundary within that specific split.
  • Sum is only assigned to the first split to ensure the total sum of the datapoints after aggregation is correct.
  • Count is accumulated based on the bucket counts within each split.

Testing:
The change is tested by generating traffic with more than 100 buckets, and the emf log with larger than 100 values are eliminated after the change:
Screenshot 2024-10-18 at 3 04 45 PM

Compare the added Benchmark test before vs after the code change:
Benchmark test with 100 bucket length:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter
cpu: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
                                                  │ before_100.txt │           after_100.txt            │
                                                  │     sec/op     │   sec/op     vs base               │
GetAndCalculateDeltaDataPointsInclude100Buckets-16      19.68µ ± 5%   20.55µ ± 6%  +4.41% (p=0.015 n=10)

                                                  │ before_100.txt │            after_100.txt            │
                                                  │      B/op      │     B/op      vs base               │
GetAndCalculateDeltaDataPointsInclude100Buckets-16     11.50Ki ± 0%   12.02Ki ± 0%  +4.48% (p=0.000 n=10)

                                                  │ before_100.txt │           after_100.txt           │
                                                  │   allocs/op    │ allocs/op   vs base               │
GetAndCalculateDeltaDataPointsInclude100Buckets-16       126.0 ± 0%   132.0 ± 0%  +4.76% (p=0.000 n=10)

Benchmark test with 200 bucket length:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter
cpu: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
                                                  │ before_200.txt │           after_200.txt            │
                                                  │     sec/op     │   sec/op     vs base               │
GetAndCalculateDeltaDataPointsInclude200Buckets-16      26.36µ ± 6%   28.01µ ± 5%  +6.24% (p=0.011 n=10)

                                                  │ before_200.txt │            after_200.txt            │
                                                  │      B/op      │     B/op      vs base               │
GetAndCalculateDeltaDataPointsInclude200Buckets-16     15.50Ki ± 0%   16.59Ki ± 0%  +7.06% (p=0.000 n=10)

                                                  │ before_200.txt │           after_200.txt            │
                                                  │   allocs/op    │ allocs/op   vs base                │
GetAndCalculateDeltaDataPointsInclude200Buckets-16       128.0 ± 0%   152.0 ± 0%  +18.75% (p=0.000 n=10)

Benchmark test with 300 bucket length:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter
cpu: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
                                                  │ before_300.txt │           after_300.txt            │
                                                  │     sec/op     │   sec/op     vs base               │
GetAndCalculateDeltaDataPointsInclude300Buckets-16      37.04µ ± 6%   39.17µ ± 5%  +5.73% (p=0.029 n=10)

                                                  │ before_300.txt │            after_300.txt             │
                                                  │      B/op      │     B/op      vs base                │
GetAndCalculateDeltaDataPointsInclude300Buckets-16     23.50Ki ± 0%   20.98Ki ± 0%  -10.70% (p=0.000 n=10)

                                                  │ before_300.txt │           after_300.txt            │
                                                  │   allocs/op    │ allocs/op   vs base                │
GetAndCalculateDeltaDataPointsInclude300Buckets-16       130.0 ± 0%   171.0 ± 0%  +31.54% (p=0.000 n=10)

Benchmark test with 500 bucket length:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter
cpu: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
                                                  │ before_500.txt │            after_500.txt            │
                                                  │     sec/op     │   sec/op     vs base                │
GetAndCalculateDeltaDataPointsInclude500Buckets-16      52.51µ ± 3%   58.15µ ± 5%  +10.74% (p=0.000 n=10)

                                                  │ before_500.txt │            after_500.txt             │
                                                  │      B/op      │     B/op      vs base                │
GetAndCalculateDeltaDataPointsInclude500Buckets-16     23.50Ki ± 0%   30.14Ki ± 0%  +28.26% (p=0.000 n=10)

                                                  │ before_500.txt │           after_500.txt            │
                                                  │   allocs/op    │ allocs/op   vs base                │
GetAndCalculateDeltaDataPointsInclude500Buckets-16       130.0 ± 0%   210.0 ± 0%  +61.54% (p=0.000 n=10)

@zzhlogin zzhlogin requested a review from a team as a code owner November 12, 2024 19:42
@zzhlogin zzhlogin requested a review from ChrsMark November 12, 2024 19:42
@github-actions github-actions bot added the exporter/awsemf awsemf exporter label Nov 12, 2024
Copy link
Member

@mxiamxia mxiamxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the PR checks

@zzhlogin zzhlogin changed the title Split EMF log with larger than 100 buckets. [exporter/awsemfexporter]Split EMF log with larger than 100 buckets. Nov 12, 2024
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 27, 2024
@zzhlogin
Copy link
Contributor Author

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Adding @Aneurysm9 for help on review.

@github-actions github-actions bot removed the Stale label Nov 29, 2024
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
Copy link
Member

@Aneurysm9 Aneurysm9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the benchmarks? I don't see them in the PR. The benchstat output is also missing the after and comparison data.

exporter/awsemfexporter/datapoint.go Show resolved Hide resolved
@zzhlogin
Copy link
Contributor Author

zzhlogin commented Dec 4, 2024

Where are the benchmarks? I don't see them in the PR. The benchstat output is also missing the after and comparison data.

Sorry, failed to execute the comparison command before, added the after.txt output in the description now. The benchmark test is located in file "exporter/awsemfexporter/datapoint_test.go" at line 2075, 2076.

@atoulme atoulme added the ready to merge Code review completed; ready to merge by maintainers label Dec 6, 2024
@evan-bradley evan-bradley merged commit 5eedf95 into open-telemetry:main Dec 6, 2024
168 checks passed
@github-actions github-actions bot added this to the next release milestone Dec 6, 2024
ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this pull request Dec 6, 2024
Copy link
Member

@songy23 songy23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leads to test failures, see #36727

mx-psi pushed a commit that referenced this pull request Dec 10, 2024
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request Dec 17, 2024
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request Dec 17, 2024
mx-psi pushed a commit that referenced this pull request Dec 17, 2024
… buckets." (#36771)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
This PR fix the flaky unit test in previous PR:
#36336,
and add back the implementation of splitting the emf log logic.

<!-- Issue number (e.g. #1234) or full URL to issue, if applicable. -->
#### Link to tracking issue

#36727

<!--Describe what testing was performed and which tests were added.-->
#### Testing
Unit test updated and passed with 10 count:
```
go test -run TestAddToGroupedMetric -count 10 -tags=always

PASS
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter	0.016s

```
mterhar pushed a commit to mterhar/opentelemetry-collector-contrib that referenced this pull request Dec 19, 2024
… buckets." (open-telemetry#36771)

<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
#### Description
This PR fix the flaky unit test in previous PR:
open-telemetry#36336,
and add back the implementation of splitting the emf log logic.

<!-- Issue number (e.g. open-telemetry#1234) or full URL to issue, if applicable. -->
#### Link to tracking issue

open-telemetry#36727

<!--Describe what testing was performed and which tests were added.-->
#### Testing
Unit test updated and passed with 10 count:
```
go test -run TestAddToGroupedMetric -count 10 -tags=always

PASS
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter	0.016s

```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exporter/awsemf awsemf exporter ready to merge Code review completed; ready to merge by maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants