Add SDK span telemetry metrics #1631
Related #1580
I completed the Go prototype of the proposed semantic conventions: open-telemetry/opentelemetry-go#6153
instrument: updowncounter
unit: "{span}"
note: |
For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
- For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
+ For spans with `recording=true`: Implementations MUST record both `otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
unit: "{span}"
note: |
For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
For spans with `recording=false` implementations SHOULD record this metric, they MUST either record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count` or none.
Having both "implementations SHOULD record this metric" and "they MUST either record both" in the same sentence is a bit confusing, I think. Maybe something like this?
- For spans with `recording=false` implementations SHOULD record this metric, they MUST either record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count` or none.
+ For spans with `recording=false`: If implementations decide to record this metric, they MUST also record `otel.sdk.span.ended.count` or not declare none.
instrument: counter
unit: "{span}"
note: |
For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
- For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
+ For spans with `recording=true`: Implementations MUST record both `otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
unit: "{span}"
note: |
For spans with `recording=true` implementations MUST record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count`.
For spans with `recording=false` implementations SHOULD record this metric, they MUST either record both `metric.otel.sdk.span.live.count` and `otel.sdk.span.ended.count` or none.
Same as the other comment above
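To make the "both or none" rule from the notes above concrete, here is a minimal Python sketch. It does not use the OpenTelemetry API: `SpanMetrics`, `Span` and the `track_non_recording` flag are hypothetical names invented for illustration, and it assumes (based on the metric names) that the live count goes up at span start and down at span end, while the ended count only goes up.

```python
from dataclasses import dataclass


@dataclass
class SpanMetrics:
    """Hypothetical stand-in for the two instruments under discussion."""
    live: int = 0   # otel.sdk.span.live.count (UpDownCounter)
    ended: int = 0  # otel.sdk.span.ended.count (Counter)


class Span:
    def __init__(self, metrics: SpanMetrics, recording: bool,
                 track_non_recording: bool = False):
        self._metrics = metrics
        # Recording spans are always tracked; non-recording spans only on opt-in.
        # The decision is made once, so a span updates both metrics or neither.
        self._tracked = recording or track_non_recording
        self._ended = False
        if self._tracked:
            metrics.live += 1

    def end(self) -> None:
        if self._ended:
            return  # ending twice must not count twice
        self._ended = True
        if self._tracked:
            self._metrics.live -= 1
            self._metrics.ended += 1


metrics = SpanMetrics()
recording_span = Span(metrics, recording=True)
non_recording_span = Span(metrics, recording=False)
recording_span.end()
non_recording_span.end()
print(metrics)
```

Fixing the tracking decision at span start is one way to guarantee that a span never contributes to only one of the two metrics.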
stability: development
brief: "The number of spans for which the processing has finished, either successful or failed"
note: |
For successful processing, `error.type` must be empty. For failed processing, `error.type` must contain the failure cause.
- For successful processing, `error.type` must be empty. For failed processing, `error.type` must contain the failure cause.
+ For successful processing, `error.type` SHOULD NOT be set. For failed processing, `error.type` must contain the failure cause.
I believe we want this to not be set, and not "empty" (empty != not set)
See more here:
If the request fails with an error before response status code was sent or received, |
stability: development
brief: "The number of spans which were passed to the exporter, but that have not been exported yet (neither successful, nor failed)"
note: |
For successful exports, `error.type` must be empty. For failed exports, `error.type` must contain the failure cause.
Same as above: `error.type` should be left unset, rather than set to an empty value, if processing was successful.
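The difference between "unset" and "empty" raised in these comments can be illustrated with a small Python sketch. The helper `result_attributes` is hypothetical, and using the exception class name as the `error.type` value is an assumption made for the example, not something this PR prescribes.

```python
from typing import Optional


def result_attributes(error: Optional[BaseException]) -> dict:
    """Build the attribute set for one finished processing or export attempt."""
    attrs: dict = {}
    if error is not None:
        # Failure: record the cause (class name chosen here as an assumption).
        attrs["error.type"] = type(error).__name__
    # Success: the key is omitted entirely instead of being set to "".
    return attrs


print(result_attributes(None))
print(result_attributes(TimeoutError("export timed out")))
```

With this shape, a backend can tell successes apart by the absence of the attribute and never has to special-case an empty string.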
Changes
With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.
We checked the SDK implementations, and it seems that only the Java SDK currently has some health metrics implemented.
This PR took some inspiration from those and is intended to improve and therefore supersede them.
I'd like to start out with just span related metrics to keep the PR and discussions simpler here, but would follow up with similar PRs for logs and traces based on the discussion results on this PR.
Prior work
This PR can be seen as a follow up to the closed OTEP 259:
So we kind of have gone full circle: The discussion started with just SDK metrics (only for exporters), going to an approach to unify the metrics across SDK-exporters and collector, which then ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).
In my opinion, it is a good thing to separate the collector and SDK self-metrics:
Existing Metrics in Java SDK
For reference, here is what the existing health metrics currently look like in the Java SDK:
Batch Span Processor metrics
- `queueSize`, value is the current size of the queue
  - `spanProcessorType`=`BatchSpanProcessor` (there was a former `ExecutorServiceSpanProcessor` which has been removed)
  - […] `BatchSpanProcessor` instances are used
- `processedSpans`, value is the number of spans submitted to the Processor
  - `spanProcessorType`=`BatchSpanProcessor`
  - `dropped` (`boolean`), `true` for the number of spans which could not be processed due to a full queue

The SDK also implements pretty much the same metrics for the `BatchLogRecordProcessor`, just `span` replaced everywhere with `log`.
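As a rough illustration of how the `queueSize` and `processedSpans` metrics described above relate to each other, here is a minimal Python sketch. It is not the Java SDK's implementation: `BoundedSpanQueue` and its members are hypothetical names, and the sketch only models the counting, not batching or export.

```python
from collections import Counter, deque


class BoundedSpanQueue:
    """Hypothetical queue that mimics the two processor metrics."""

    def __init__(self, max_size: int):
        self._queue = deque()
        self._max_size = max_size
        # processedSpans, keyed by the value of the `dropped` attribute
        self.processed_spans = Counter()

    @property
    def queue_size(self) -> int:
        # queueSize: the current size of the queue
        return len(self._queue)

    def submit(self, span) -> bool:
        """Enqueue a span; return False if it was dropped due to a full queue."""
        dropped = len(self._queue) >= self._max_size
        if not dropped:
            self._queue.append(span)
        # Every submitted span is counted, with dropped=True or dropped=False.
        self.processed_spans[dropped] += 1
        return not dropped


queue = BoundedSpanQueue(max_size=2)
for span_id in range(3):
    queue.submit(span_id)
print(queue.queue_size, dict(queue.processed_spans))
```

In this sketch the sum over both `dropped` values equals the number of submitted spans, which matches the description of `processedSpans` as counting spans submitted to the processor.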
Exporter metrics
Exporter metrics are the same for spans, metrics and logs. They are distinguishable based on a `type` attribute.
Also the metric names are dependent on a "name" and "transport" defined by the exporter. For OTLP those are:
- `exporterName`=`otlp`
- `transport` is one of `grpc`, `http` (= protobuf) or `http-json`

The transport is used just for the instrumentation scope name: `io.opentelemetry.exporters.<exporterName>-<transport>`
Based on that, the following metrics are exposed:
- Counter `<exporterName>.exporter.seen`: The number of records (spans, metrics or logs) submitted to the exporter
  - `type`: one of `span`, `metric` or `log`
- Counter `<exporterName>.exporter.exported`: The number of records (spans, metrics or logs) actually exported (or failed)
  - `type`: one of `span`, `metric` or `log`
  - `success` (boolean): `false` for exporter failures

Merge requirement checklist