[exporter/otlp] Report runtime status #11366
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage Diff (main vs. #11366):
             main      #11366    +/-
  Coverage   91.61%    91.64%    +0.03%
  Files      443       443
  Lines      23770     23828     +58
  Hits       21776     21837     +61
  Misses     1620      1618      -2
  Partials   374       373       -1
Force-pushed from 646d3dc to 84944b7
Before implementation, can we document somewhere (ideally in an RFC, or in https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-status.md) when an exporter is considered to be in a permanent error state? I feel like we have different understandings of this, and I don't want you to make lots of code changes until we agree on the basic definition.
I have these documented here: #9957. Is it ok to discuss on that issue, or would you prefer it to be an RFC?
If you're suggesting that the errors identified as Permanent in this PR (and #9957) are in many cases recoverable, we can identify all errors as recoverable and skip permanent for now.
I am not the biggest expert, but I feel at least some are recoverable, and some are contextual (for example, auth errors when passing auth tokens from the client).
I think that would be safer, correct?
I think going with recoverable for all errors, at least for now, makes sense. We can discuss which errors should be permanent on #9957, but I suspect we'll be able to come up with counterarguments for most of the cases.
Force-pushed from ba773f0 to 66dbf98
I updated this PR to handle all errors as recoverable. If we are happy with this approach, we could consider making this an option for all exporters via the exporter helper, similar to #8684, likely as a follow-up.
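As a rough, hedged sketch of what such an exporter helper option could look like (not part of this PR or the exporterhelper API; the package, sendFunc, and withRuntimeStatus names are invented, and only the componentstatus calls mirror the actual diff):

```go
// Hypothetical sketch only: a generic wrapper, in the spirit of a possible
// exporterhelper option, that reports component status around a send call.
package statushelper

import (
	"context"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/component/componentstatus"
)

// sendFunc stands in for an exporter's underlying export call.
type sendFunc func(ctx context.Context) error

// withRuntimeStatus wraps a send function so that every failure is reported
// as a recoverable error and every success as StatusOK. De-duplication of
// repeated statuses is left to the status reporting system.
func withRuntimeStatus(host component.Host, send sendFunc) sendFunc {
	return func(ctx context.Context) error {
		if err := send(ctx); err != nil {
			componentstatus.ReportStatus(host, componentstatus.NewRecoverableErrorEvent(err))
			return err
		}
		componentstatus.ReportStatus(host, componentstatus.NewEvent(componentstatus.StatusOK))
		return nil
	}
}
```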
exporter/otlpexporter/otlp.go
Outdated
	componentstatus.ReportStatus(e.host, componentstatus.NewRecoverableErrorEvent(err))
	return
}
componentstatus.ReportStatus(e.host, componentstatus.NewEvent(componentstatus.StatusOK))
Are we not concerned about reporting the status on every request? Should we do it periodically, or only when it changes?
The finite state machine handles this for us. It will report status only if there has been a change; otherwise it will no-op.
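For intuition, here is a minimal, illustrative sketch of that de-duplication behavior. It is not the collector's actual state machine; the fsm type, Status constants, and notify callback are invented for the example.

```go
// Illustrative only; not the collector's actual implementation. The essential
// de-duplication behavior: repeated reports of the same status are no-ops,
// and only transitions are forwarded to the notify callback.
package statusfsm

import "sync"

type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

type fsm struct {
	mu       sync.Mutex
	reported bool
	current  Status
	notify   func(Status)
}

// report forwards a status only when it differs from the last reported one.
func (f *fsm) report(s Status) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.reported && s == f.current {
		return // no-op: status unchanged
	}
	f.reported = true
	f.current = s
	f.notify(s)
}
```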
Force-pushed from 94fd12c to 1a6e939
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Would anyone be interested in reviewing this, or at least removing the stale label?
This PR was marked stale due to lack of activity. It will be closed in 14 days.
	componentstatus.ReportStatus(e.host, componentstatus.NewRecoverableErrorEvent(err))
	return err
}
componentstatus.ReportStatus(e.host, componentstatus.NewEvent(componentstatus.StatusOK))
This might create a lot of events reporting OK; is that really what you want? Or is it there to recover from the error status?
Description
This PR adds runtime status reporting for the otlp and otlphttp exporters. It's an updated version of #8788, which was one of several approaches experimented with to add this functionality. That work was paused to allow the consumererror work to evolve, as it appeared there might be a way to uniformly apply the same logic to all exporters (via the exporterhelper), but that effort has since taken a different direction, and it looks like a uniform approach will not be possible.
This PR implements runtime status reporting as discussed in #9957. The choices for which statuses represent permanent errors are up for debate, but the key point is that a permanent error is an error discovered at runtime that will require user intervention to fix.
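To make the distinction concrete, here is a hedged sketch of how an exporter might route errors to the two event constructors. The isUserActionRequired predicate and its sentinel error are hypothetical placeholders (deciding which errors truly qualify as permanent is the open question tracked in #9957), and, as discussed above, the updated PR reports all errors as recoverable; the permanent branch is included only to illustrate the definition.

```go
// Sketch under stated assumptions: the classifier below is invented for this
// example, and the updated PR reports all errors as recoverable.
package statusreport

import (
	"errors"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/component/componentstatus"
)

// errUserActionRequired is a placeholder sentinel used only for this sketch.
var errUserActionRequired = errors.New("user action required")

// isUserActionRequired is a hypothetical classifier; which errors truly
// qualify as permanent is still under discussion in #9957.
func isUserActionRequired(err error) bool {
	return errors.Is(err, errUserActionRequired)
}

// reportExportStatus maps the outcome of an export attempt to a status event.
func reportExportStatus(host component.Host, err error) {
	if err == nil {
		componentstatus.ReportStatus(host, componentstatus.NewEvent(componentstatus.StatusOK))
		return
	}
	if isUserActionRequired(err) {
		// A permanent error: discovered at runtime, requires user intervention.
		componentstatus.ReportStatus(host, componentstatus.NewPermanentErrorEvent(err))
		return
	}
	componentstatus.ReportStatus(host, componentstatus.NewRecoverableErrorEvent(err))
}
```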
This implementation makes use of the finite state machine that underlies the status reporting system. The state machine ensures that a new status event is emitted only when the status actually changes; repeated reports of the same status are no-ops.
This means the exporter does not have to reason about current or previous statuses. It can report status based on its current view of the world, and the status reporting system will handle the rest. Flapping between recoverable and OK is meant to be handled by watchers consuming status events. The healthcheckv2extension handles this with a time-based approach (e.g. a recovery interval). Other watchers may choose to handle this situation differently.
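For illustration only, here is a sketch of such a time-based approach. It is not the healthcheckv2extension's actual code; the watcher type, its fields, and its methods are invented for the example.

```go
// Illustrative only: a watcher that treats a recoverable error as unhealthy
// only after it has persisted past a recovery interval, which smooths over
// brief flapping between recoverable and OK.
package statuswatcher

import (
	"sync"
	"time"
)

type watcher struct {
	mu               sync.Mutex
	recoveryInterval time.Duration
	errSince         time.Time // zero while the component is OK
}

// onStatusOK clears any pending error window.
func (w *watcher) onStatusOK() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.errSince = time.Time{}
}

// onRecoverableError records when the current error state began.
func (w *watcher) onRecoverableError() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.errSince.IsZero() {
		w.errSince = time.Now()
	}
}

// healthy reports false only if a recoverable error has persisted longer
// than the recovery interval.
func (w *watcher) healthy() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.errSince.IsZero() || time.Since(w.errSince) < w.recoveryInterval
}
```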
For more information on status reporting, the state machine, etc., see: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-status.md
While all components report status during start and shutdown (via automation), this is the first component to report runtime status. This will allow the healthcheckv2extension to replace the currently non-functioning check_collector_pipeline capability of the original healthcheckextension and should serve as an example for other components that wish to report runtime status moving forward.
Link to tracking issue
Implements #9957 for the otlp and otlphttp exporters
Testing
Unit tests and manual testing
Documentation
Code comments