[component] Runtime status reporting #9957
This was referenced Jun 13, 2024
codeboten pushed a commit that referenced this issue on Aug 22, 2024:
#### Description
Adds an RFC for component status reporting. The main goal is to define what component status reporting is, our current implementation, and how such a system interacts with a 1.0 component. When merged, the following issues will be unblocked:
- #9823
- #10058
- #9957
- #9324
- #6506
---------
Co-authored-by: Matthew Wear <[email protected]>
Co-authored-by: Pablo Baeyens <[email protected]>
With #10413 merged, I am removing this from the component 1.0 milestone and Collector V1 project.
TylerHelmuth removed this from the go.opentelemetry.io/collector/component 1.0 milestone on Aug 22, 2024
sfc-gh-sili pushed a commit with the same description to sfc-gh-sili/opentelemetry-collector that referenced this issue on Aug 23, 2024
Runtime Status Reporting
As part of the component 1.0 milestone we should implement runtime status reporting for core components and come up with guidelines and best practices for incremental adoption by other components. This issue gives some background information on component status reporting, and outlines how it should work for different component types.
Component Status Events
Component status events can be broken down into two categories: lifecycle events (denoted by blue in the diagram) and runtime events (green and red). Lifecycle events, e.g. StatusStarting and StatusStopping, are reported automatically by the collector; it is the responsibility of components to report their runtime status. During runtime it will be common for a component to transition between StatusOK and StatusRecoverableError. In some situations, a component may detect an unrecoverable state and transition into StatusPermanentError. This is a final state that cannot be transitioned out of and indicates that a human will have to intervene to fix it.
The state transitions are governed by a finite state machine, and the intention is that components should not have to keep track of their internal state when reporting status. Components can report StatusOK when an operation succeeds and StatusRecoverableError when an operation fails (in a recoverable way). Status events will only be emitted when a component's state changes, so repeat reports of the same status will have no effect. Likewise, if a component has transitioned into a final state (e.g. StatusPermanentError), subsequent attempts to report status will no-op.
Consumers of Status Events
There is a PR for a new version of the health check extension that is based on component status reporting. It uses lifecycle events to determine if the collector is ready and running and allows users to opt in to having recoverable or permanent errors factored into collector health. The OpAMP extension will make use of these events for component health at some point in the future. Any extension that implements the optional StatusWatcher interface can be a consumer of component status events.
Adoption and Best Practices
Components should be able to adopt runtime status reporting incrementally, but for the component 1.0 milestone we should establish guidelines for status reporting and implement them for some of the core components, at a minimum, the OTLP exporter, receiver, and the memory limiter processor. The guidelines should establish general rules for determining whether an error is permanent or recoverable for various component types. Below is a very rough and in-progress idea of how this could look. Many of these choices are likely to be controversial and completely open to debate and discussion. By implementing status reporting for core components we should be able to better establish and document best practices for future component adoption.
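To make the reporting rules and the consumer side described earlier in this issue more concrete, here is a minimal, self-contained sketch. It is not the collector's implementation: the `Status` type, the `reporter` helper, the `recordingWatcher`, and the simplified `ComponentStatusChanged(componentID string, status Status)` signature are all assumptions made for illustration. It models only the two rules stated above (repeat reports have no effect, and reports after a final state no-op), not the full transition table.

```go
package main

import "fmt"

// Status mirrors the status names used in this issue. The type is a
// hypothetical stand-in, not the collector's real status type.
type Status int

const (
	StatusStarting Status = iota
	StatusOK
	StatusRecoverableError
	StatusPermanentError
	StatusStopping
)

func (s Status) String() string {
	return [...]string{
		"StatusStarting",
		"StatusOK",
		"StatusRecoverableError",
		"StatusPermanentError",
		"StatusStopping",
	}[s]
}

// StatusWatcher is a simplified, hypothetical version of the optional
// interface an extension can implement to consume status events.
type StatusWatcher interface {
	ComponentStatusChanged(componentID string, status Status)
}

// reporter tracks the current state of one component so that the component
// itself does not have to remember what it last reported.
type reporter struct {
	componentID string
	current     Status
	watchers    []StatusWatcher
}

func newReporter(componentID string, watchers ...StatusWatcher) *reporter {
	return &reporter{componentID: componentID, current: StatusStarting, watchers: watchers}
}

// Report emits an event only when the state actually changes. It returns
// false (and emits nothing) for a repeated status or once the component has
// reached StatusPermanentError, which this sketch treats as final.
func (r *reporter) Report(next Status) bool {
	if r.current == StatusPermanentError || next == r.current {
		return false
	}
	r.current = next
	for _, w := range r.watchers {
		w.ComponentStatusChanged(r.componentID, next)
	}
	return true
}

// recordingWatcher is a toy consumer that simply remembers what it saw.
type recordingWatcher struct {
	events []Status
}

func (w *recordingWatcher) ComponentStatusChanged(_ string, status Status) {
	w.events = append(w.events, status)
}

func main() {
	w := &recordingWatcher{}
	r := newReporter("exporter/otlp", w)

	// A component can report after every operation without checking
	// what it reported previously.
	for _, s := range []Status{
		StatusOK, StatusOK, StatusRecoverableError,
		StatusOK, StatusPermanentError, StatusOK,
	} {
		r.Report(s)
	}
	fmt.Println(w.events)
}
```

Of the six reports in `main`, the watcher records four: the repeated StatusOK and the StatusOK sent after StatusPermanentError are dropped. A component following this pattern can call `Report` on every success or failure and leave de-duplication to the reporter.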
Receivers
Receivers should not report error statuses for bad data sent by clients. The errors should be explicitly related to the receiver itself. The following list identifies some scenarios and their statuses, but is likely very incomplete.
Processors
Processors are largely unique in the functionality they provide. Conventions for runtime status reporting will likely need to be considered on a case-by-case basis. We have an issue to link the memory limiter processor with the (new) health check extension, which will be a good use case for a proof of concept.
Exporters
Exporter permanent errors fall into the following categories: bad or missing credentials, incorrectly configured or incompatible endpoint, requests or headers that are too large. All of these indicate misconfiguration at some level. The following list attempts to identify which response codes correspond to which component statuses for HTTP and gRPC.
There were two PRs that attempted to implement these guidelines for the OTLP exporters using different approaches (#8684 and #8788). There is likely a better option where we should be able to implement consistent handling for these codes via the exporter helper and this work to annotate consumer errors.
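As a rough illustration of the permanent-error categories named for exporters, the sketch below classifies HTTP response codes. The specific codes chosen for each category are assumptions made for this example and are not taken from the issue's own list; likewise, treating every other non-2xx code as recoverable is a simplification, and `statusForHTTP` is a hypothetical helper rather than part of the exporter helper.

```go
package main

import (
	"fmt"
	"net/http"
)

// Status is a hypothetical stand-in for the collector's status type.
type Status string

const (
	StatusOK               Status = "StatusOK"
	StatusRecoverableError Status = "StatusRecoverableError"
	StatusPermanentError   Status = "StatusPermanentError"
)

// statusForHTTP maps an HTTP response code to a component status.
// The code-to-category mapping is an assumption for illustration only.
func statusForHTTP(code int) Status {
	if code >= 200 && code < 300 {
		return StatusOK
	}
	switch code {
	case http.StatusUnauthorized, http.StatusForbidden:
		// Assumed to indicate bad or missing credentials.
		return StatusPermanentError
	case http.StatusNotFound:
		// Assumed to indicate an incorrectly configured endpoint.
		return StatusPermanentError
	case http.StatusRequestEntityTooLarge, http.StatusRequestHeaderFieldsTooLarge:
		// Request or headers that are too large.
		return StatusPermanentError
	default:
		// Everything else is treated as recoverable in this sketch.
		return StatusRecoverableError
	}
}

func main() {
	for _, code := range []int{200, 401, 413, 503} {
		fmt.Printf("%d %s -> %s\n", code, http.StatusText(code), statusForHTTP(code))
	}
}
```

Keeping the mapping in one function like this is in the spirit of the suggestion above that consistent handling could live in a shared place rather than in each exporter.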