Add SDK span telemetry metrics #1631

Open
wants to merge 60 commits into
base: main

@JonasKunz JonasKunz commented Nov 29, 2024

Changes

With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.

We checked the SDK implementations; it seems that only the Java SDK currently has some health metrics implemented.
This PR took some inspiration from those and is intended to improve and therefore supersede them.

I'd like to start with just span-related metrics to keep the PR and the discussion simpler, and would follow up with similar PRs for logs and metrics based on the results of the discussion on this PR.

Prior work

This PR can be seen as a follow-up to the closed OTEP 259:

So we have more or less come full circle: the discussion started with just SDK metrics (only for exporters), moved to an approach that unified the metrics across SDK exporters and the collector, and then ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).

In my opinion, it is a good thing to separate the collector and SDK self-metrics:

  • There have been concerns about using the same metrics for both: How do you distinguish the metrics exposed by collector components from the self-monitoring metrics exposed by an OTel SDK used in the collector, e.g. for tracing the collector itself?
  • Though many concepts in the collector and the SDK share the same name, they are not the same thing (to my knowledge; I'm not a collector expert). For example, processors in the collector are designed to form pipelines, potentially mutating the data as it passes through. In contrast, SDK span processors don't form pipelines (at least none visible to the SDK; those would be hidden in custom implementations). Instead, SDK span processors are merely observers with multiple callbacks for the span lifecycle. So it would feel like "shoehorning" things into the same metric, even though they are not the same concepts.
  • Separating collector and SDK metrics makes evolving them and reaching agreement a lot easier: with separate metrics and namespaces, collector metrics can focus on the collector implementation and SDK metrics can be defined using just the SDK spec. If we combine both in shared metrics, those will always have to be aligned with both the SDK spec and the collector implementation. I think this would make maintenance much harder for little benefit.
  • I have a hard time finding benefits of sharing metrics for SDK and collector: The main benefit I find would of course be easier dashboarding / analysis. However, I do think having to look at two sets of metrics to do so is a fine tradeoff, considering the difficulties with the unification listed above and shown by the history of OTEP 259.
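To illustrate the pipeline-versus-observer distinction from the second bullet, here is a minimal stdlib-only Python sketch. It uses neither the collector nor any SDK; `run_pipeline`, `CountingObserver`, `start_and_end`, and the two stage functions are hypothetical names made up for this example, and the collector side only reflects the description given above.

```python
def run_pipeline(stages, spans):
    """Collector-style: the output of each stage is the input of the next."""
    for stage in stages:
        spans = stage(spans)
    return spans


def drop_health_checks(spans):
    # A stage may filter data before it reaches later stages.
    return [s for s in spans if s != "GET /health"]


def add_prefix(spans):
    # A stage may also mutate the data it passes on.
    return [f"svc:{s}" for s in spans]


class CountingObserver:
    """SDK-style: a processor only receives lifecycle callbacks."""

    def __init__(self):
        self.started = 0
        self.ended = 0

    def on_start(self, span):
        self.started += 1

    def on_end(self, span):
        self.ended += 1


def start_and_end(observers, span):
    # Every observer sees the same span; none feeds its output to another.
    for observer in observers:
        observer.on_start(span)
    for observer in observers:
        observer.on_end(span)


pipeline_output = run_pipeline(
    [drop_health_checks, add_prefix], ["GET /health", "GET /users"]
)

a, b = CountingObserver(), CountingObserver()
start_and_end([a, b], "GET /users")
```

In the pipeline, what the second stage sees depends on the first; the observers each get the identical span regardless of ordering, which is why a shared "processor" metric would describe two different things.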

Existing Metrics in Java SDK

For reference, here is what the existing health metrics currently look like in the Java SDK:

Batch Span Processor metrics

  • Gauge queueSize, value is the current size of the queue
    • Attribute spanProcessorType=BatchSpanProcessor (there was a former ExecutorServiceSpanProcessor which has been removed)
    • This metric currently causes collisions if two BatchSpanProcessor instances are used
  • Counter processedSpans, value is the number of spans submitted to the Processor
    • Attribute spanProcessorType=BatchSpanProcessor
    • Attribute dropped (boolean), true for the number of spans which could not be processed due to a full queue
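To make the shape of these two instruments concrete, here is a minimal stdlib-only Python sketch. It does not use the Java SDK or any OpenTelemetry API; `InMemoryMeter` and `BoundedQueueProcessor` are hypothetical names introduced for illustration, and exporting is not modeled at all. It also shows why a gauge keyed only by `spanProcessorType` collides when two processor instances exist.

```python
from collections import deque


class InMemoryMeter:
    """Hypothetical metrics store: one value per (name, attributes) pair."""

    def __init__(self):
        self.gauges = {}
        self.counters = {}

    @staticmethod
    def _key(name, attributes):
        return (name, tuple(sorted(attributes.items())))

    def set_gauge(self, name, value, attributes):
        self.gauges[self._key(name, attributes)] = value

    def add(self, name, value, attributes):
        key = self._key(name, attributes)
        self.counters[key] = self.counters.get(key, 0) + value


class BoundedQueueProcessor:
    """Hypothetical batch processor that only models queueing."""

    def __init__(self, meter, max_queue_size):
        self.meter = meter
        self.queue = deque()
        self.max_queue_size = max_queue_size
        self.base_attrs = {"spanProcessorType": "BatchSpanProcessor"}

    def on_end(self, span):
        dropped = len(self.queue) >= self.max_queue_size
        if not dropped:
            self.queue.append(span)
        # processedSpans counts every submitted span; `dropped` splits the series.
        self.meter.add("processedSpans", 1, {**self.base_attrs, "dropped": dropped})
        # queueSize reports the current queue length of this instance.
        self.meter.set_gauge("queueSize", len(self.queue), self.base_attrs)


meter = InMemoryMeter()
first = BoundedQueueProcessor(meter, max_queue_size=2)
second = BoundedQueueProcessor(meter, max_queue_size=2)

for i in range(3):
    first.on_end(f"span-{i}")  # the third span is dropped: the queue is full
second.on_end("span-x")        # same attributes, so this overwrites first's gauge
```

After this runs, the store holds a single `queueSize` series even though the two queues have different lengths, which is the collision described above.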

The SDK also implements pretty much the same metrics for the BatchLogRecordProcessor, just with span replaced by log everywhere.

Exporter metrics

Exporter metrics are the same for spans, metrics and logs. They are distinguishable based on a type attribute.
The metric names also depend on a "name" and "transport" defined by the exporter. For OTLP those are:

  • exporterName=otlp
  • transport is one of grpc, http (= protobuf) or http-json

The transport is used just for the instrumentation scope name: io.opentelemetry.exporters.<exporterName>-<transport>
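As a small illustration of that naming pattern, the following Python sketch builds the scope names for the three OTLP transports listed above. The function name `exporter_scope_name` is hypothetical; only the pattern and the transport values come from the description.

```python
def exporter_scope_name(exporter_name: str, transport: str) -> str:
    """Build a scope name following the pattern quoted above."""
    return f"io.opentelemetry.exporters.{exporter_name}-{transport}"


# Transport values as listed for the OTLP exporter.
OTLP_TRANSPORTS = ("grpc", "http", "http-json")

scope_names = [exporter_scope_name("otlp", t) for t in OTLP_TRANSPORTS]
```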

Based on that, the following metrics are exposed:

Merge requirement checklist

@JonasKunz JonasKunz marked this pull request as ready for review November 29, 2024 10:40
@JonasKunz JonasKunz requested review from a team as code owners November 29, 2024 10:40
@lmolkova
Contributor

lmolkova commented Dec 3, 2024

Related #1580

@dashpole
Contributor

I completed the Go prototype of the proposed semantic conventions: open-telemetry/opentelemetry-go#6153

model/otel/metrics.yaml (outdated)
instrument: updowncounter
unit: "{span}"
note: |
For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
Member

Suggested change
- For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
+ For spans with `recording=true`: Implementations MUST record both `otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.

unit: "{span}"
note: |
For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
For spans with `recording=false` implementations SHOULD record this metric, they MUST either record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count` or none.
Member

@joaopgrassi joaopgrassi Feb 5, 2025

"implementations SHOULD record this metric" followed by "they MUST either record both" is a bit confusing, I think. Maybe something like this?

Suggested change
- For spans with `recording=false` implementations SHOULD record this metric, they MUST either record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count` or none.
+ For spans with `recording=false`: If implementations decide to record this metric, they MUST also record `otel.sdk.span.ended.count`, or record neither.
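The both-or-neither rule under discussion can be sketched in a few lines of stdlib-only Python. This assumes, as the snippets under review suggest, that the updowncounter is the live count and the counter is the ended count; `SpanMetrics` and its `record_non_recording` flag are hypothetical names, not part of any SDK.

```python
class SpanMetrics:
    """Hypothetical in-memory recorder for the two instruments discussed here."""

    def __init__(self, record_non_recording=False):
        self.live = 0   # stands in for otel.sdk.span.live.count (updowncounter)
        self.ended = 0  # stands in for otel.sdk.span.ended.count (counter)
        self.record_non_recording = record_non_recording

    def _tracked(self, recording):
        # recording=true spans are always tracked. recording=false spans are
        # tracked in both instruments or in neither, never in only one.
        return recording or self.record_non_recording

    def on_start(self, recording):
        if self._tracked(recording):
            self.live += 1

    def on_end(self, recording):
        if self._tracked(recording):
            self.live -= 1
            self.ended += 1


# Default: non-recording spans touch neither instrument.
strict = SpanMetrics()
strict.on_start(recording=True)
strict.on_start(recording=False)
live_while_open = strict.live
strict.on_end(recording=True)
strict.on_end(recording=False)

# Opt-in: non-recording spans touch both instruments.
inclusive = SpanMetrics(record_non_recording=True)
inclusive.on_start(recording=False)
inclusive.on_end(recording=False)
```

Routing both callbacks through the same `_tracked` check is what keeps the live count from drifting: a span that was never added to it is never subtracted from it.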

instrument: counter
unit: "{span}"
note: |
For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
Member

Suggested change
- For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
+ For spans with `recording=true`: Implementations MUST record both `otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.

unit: "{span}"
note: |
For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
For spans with `recording=false` implementations SHOULD record this metric, they MUST either record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count` or none.
Member

Same as the other comment above

stability: development
brief: "The number of spans for which the processing has finished, either successful or failed"
note: |
For successful processing, `error.type` must be empty. For failed processing, `error.type` must contain the failure cause.
Member

Suggested change
- For successful processing, `error.type` must be empty. For failed processing, `error.type` must contain the failure cause.
+ For successful processing, `error.type` SHOULD NOT be set. For failed processing, `error.type` must contain the failure cause.

I believe we want this to not be set, and not "empty" (empty != not set)

See more here:

If the request fails with an error before response status code was sent or received,

stability: development
brief: "The number of spans which were passed to the exporter, but that have not been exported yet (neither successful, nor failed)"
note: |
For successful exports, `error.type` must be empty. For failed exports, `error.type` must contain the failure cause.
Member

Same as above: `error.type` should be left unset, not set to an empty value, if processing was successful.
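The difference between "not set" and "empty" is easy to see when the attributes are built as a plain dictionary. This is a minimal Python sketch, not SDK code: `export_attributes` is a hypothetical helper, and using the exception class name as the failure cause is an assumption made only for this example.

```python
from typing import Optional


def export_attributes(error: Optional[BaseException]) -> dict:
    """Build metric attributes for one finished export or processing step."""
    attrs = {}
    if error is not None:
        # Assumption: the exception class name serves as the failure cause.
        attrs["error.type"] = type(error).__name__
    # On success the key is absent, rather than present with "" as its value.
    return attrs


success_attrs = export_attributes(None)
failure_attrs = export_attributes(TimeoutError("deadline exceeded"))
```

With the key absent on success, a query for "has `error.type`" separates failures from successes without special-casing an empty string.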

Labels
enhancement New feature or request

7 participants