Confused about the difference between PermanentError vs FatalError #9823
Comments
cc @mwear
The intent of having a permanent error state is to allow the collector to run in a degraded mode, and communicate issues to operators through the health check extension or something similar. In your example where the collector has a single non-functional receiver, this would not be very useful, but in a collector with multiple pipelines, it is possible for some pipelines to be running normally, and others to have issues. In a scenario where you are exporting to multiple backends, one exporter could be working fine, while another could be failing due to misconfiguration. The misconfiguration isn't apparent until trying to send data. I think we need to establish some guidelines, but I do think allowing a collector to run in a degraded fashion, rather than terminating altogether, is a useful thing to support.
I would suggest that we consider using the following (updated) definitions:
If you are interested, I attempted to enumerate permanent errors for exporters in #8684. I suspect some of the choices are controversial, and would like to find a more universal implementation, but the description captures my general thoughts. If we can reach consensus for exporters, my next step will be to start working on conventions for receivers.
Based on these 2 definitions, I would say that a component should never use Fatal. Also, we can remove FatalError from the status since we shut down anyway.
Are you saying that components should never trigger a fatal error, or that they should not use component status to report it? If a component can trigger a fatal error, I think it should go through the component status API so that it can be surfaced by a StatusWatcher prior to shutdown.
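To make that ordering concrete, here is a small, self-contained Go sketch of the idea being argued for: a fatal error is routed through a status-reporting path so that any registered watcher sees the event before shutdown begins. The Status, StatusWatcher, and reportFatal names are simplified stand-ins invented for this illustration, not the collector's actual component status API.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a simplified stand-in for the collector's component status values.
type Status string

const (
	StatusOK             Status = "StatusOK"
	StatusPermanentError Status = "StatusPermanentError"
	StatusFatalError     Status = "StatusFatalError"
)

// StatusWatcher is a stand-in for anything that wants to observe component
// status changes, e.g. a health check extension.
type StatusWatcher interface {
	ComponentStatusChanged(componentID string, status Status, err error)
}

// loggingWatcher just prints events; a real watcher might expose them over HTTP.
type loggingWatcher struct{}

func (loggingWatcher) ComponentStatusChanged(id string, s Status, err error) {
	fmt.Printf("component=%s status=%s err=%v\n", id, s, err)
}

// reportFatal notifies every watcher before shutdown is triggered: surface
// the fatal error first, then terminate.
func reportFatal(watchers []StatusWatcher, componentID string, err error, shutdown func()) {
	for _, w := range watchers {
		w.ComponentStatusChanged(componentID, StatusFatalError, err)
	}
	shutdown()
}

func main() {
	watchers := []StatusWatcher{loggingWatcher{}}
	reportFatal(watchers, "receiver:otlp", errors.New("listener failed: port already in use"),
		func() { fmt.Println("collector shutting down") })
}
```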
An alternative would be to remove
I've made proposals on #9324 and #9325 that we remove the error return value from
If On that note, what would a
Unfortunately, we cannot use the return value of Start for all

Any PermanentError discovered at runtime is a sign that something is not ok, and that a human needs to intervene to fix it. The severity of the error is situational. It can mean that a pipeline has failed completely, or that it is running in a degraded mode. It can also be the case that other pipelines are completely fine. The key point indicated by a PermanentError is that restarting your collector is unlikely to fix it. The goal is to allow the collector to run, even if it's not 100% healthy, alert someone that there is an issue, and allow them to mitigate it.
I'd like to elaborate just a little on the PermanentError case, as I think a couple of examples might help illustrate the intent of this state. I've defined it as a non-recoverable error that is detected at runtime (i.e. sometime after the collector has started). The idea behind this state is that there is an issue with a component that is unlikely to resolve if you wait it out or restart your collector. It's something that a human should be notified about. That said, the impact to the collector might not necessarily be catastrophic. In a collector with multiple pipelines, some pipelines could be working fine, and it might be better to let the collector run in a degraded fashion until someone can troubleshoot and mitigate the issue.

I did a pass over HTTP status codes where I attempted to catalog permanent and recoverable errors for exporters. They are listed on the runtime status reporting issue, and some of the choices are likely to be controversial. Let's look at two.

I've proposed that an exporter that receives a 401 (unauthorized) response code should report a PermanentError, as this indicates a likely missing or misconfigured credential. It will not improve if you wait it out or restart the collector. The impact is that the pipelines that use this exporter are not sending data. If this is your only exporter, or your primary one, it's a big problem. If it's one of many pipelines, or a lower-priority destination, it's less of a problem.

I've also proposed that a 413 (request entity too large) should result in a PermanentError. My reasoning is that this is a misconfigured batch size. It will not go away if you wait, nor will it change if the collector is restarted. That said, some requests may be under the size limit and accepted by the destination, but ones that are rejected for being too large will continue to be rejected until someone has updated the batch size. In this case, pipelines using this exporter are degraded, but not entirely broken.

The health check extension is a consumer of these events, and is the place for a user to define how they want to respond to them and get a deeper look into the collector to debug them. It offers per-pipeline status, as well as overall collector status. Both of these are computed as "the sum" of the component statuses, which are available in the response body if the
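To illustrate the 401/413 reasoning above, here is a hedged Go sketch of how an exporter could classify HTTP response codes into permanent versus recoverable errors. The permanent entries (401, 413) come straight from the examples in this thread; the recoverable entries (429, 503) are added purely for contrast, and none of this is the exporterhelper's actual mapping — the names are invented for the sketch.

```go
package main

import (
	"fmt"
	"net/http"
)

// ErrorClass is a simplified stand-in for the collector's error taxonomy.
type ErrorClass string

const (
	ClassNone        ErrorClass = "none"
	ClassRecoverable ErrorClass = "recoverable" // waiting or retrying may help
	ClassPermanent   ErrorClass = "permanent"   // a human needs to change configuration
)

// classifyHTTPStatus mirrors the examples discussed above: 401 indicates a
// missing or misconfigured credential and 413 indicates an oversized batch,
// neither of which improves with retries or a collector restart.
func classifyHTTPStatus(code int) ErrorClass {
	switch code {
	case http.StatusUnauthorized, // 401: bad or missing credentials
		http.StatusRequestEntityTooLarge: // 413: batch size too large
		return ClassPermanent
	case http.StatusTooManyRequests, // 429: back off and retry
		http.StatusServiceUnavailable: // 503: transient outage
		return ClassRecoverable
	case http.StatusOK:
		return ClassNone
	default:
		// Unknown codes are treated as recoverable here; a real mapping
		// would be more exhaustive, and is the part most likely to be debated.
		return ClassRecoverable
	}
}

func main() {
	for _, code := range []int{200, 401, 413, 429} {
		fmt.Printf("%d -> %s\n", code, classifyHTTPStatus(code))
	}
}
```

A table like this is where the "guidelines" mentioned earlier would live: the mechanism is simple, and the controversial part is agreeing on which codes belong in which bucket.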
This might be getting a little off topic, but just to complete this thought, I wanted to show sample response bodies from the new health check extension. The first response body is for the collector as a whole. Note that the overall collector status is StatusPermanentError.

{
"start_time": "2024-01-18T17:27:12.570394-08:00",
"healthy": true,
"status": "StatusPermanentError",
"error": "rpc error: code = Unauthenticated desc = authentication failed",
"status_time": "2024-01-18T17:27:32.572301-08:00",
"components": {
"extensions": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.570428-08:00",
"components": {
"extension:healthcheckv2": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.570428-08:00"
}
}
},
"pipeline:metrics/grpc": {
"healthy": true,
"status": "StatusPermanentError",
"error": "rpc error: code = Unauthenticated desc = authentication failed",
"status_time": "2024-01-18T17:27:32.572301-08:00",
"components": {
"exporter:otlp/staging": {
"healthy": true,
"status": "StatusPermanentError",
"error": "rpc error: code = Unauthenticated desc = authentication failed",
"status_time": "2024-01-18T17:27:32.572301-08:00"
},
"processor:batch": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571132-08:00"
},
"receiver:otlp": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571576-08:00"
}
}
},
"pipeline:traces/http": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571625-08:00",
"components": {
"exporter:otlphttp/staging": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571615-08:00"
},
"processor:batch": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571621-08:00"
},
"receiver:otlp": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571625-08:00"
}
}
}
}
}

The new extension allows you to check individual pipelines. If you were to check just the traces/http pipeline, the response would look like:

{
"start_time": "2024-01-18T17:27:12.570394-08:00",
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571625-08:00",
"components": {
"exporter:otlphttp/staging": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571615-08:00"
},
"processor:batch": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571621-08:00"
},
"receiver:otlp": {
"healthy": true,
"status": "StatusOK",
"status_time": "2024-01-18T17:27:12.571625-08:00"
}
}
}

The tl;dr is that the health check extension is a window into collector health, and that the different types of errors can help identify the urgency and possible steps for mitigation. If we allow a collector to run in a degraded mode, we can debug it at a distance with the extension. If we terminate the collector due to error, we don't have this option. There is a balance to strike between failing fast and running in a degraded mode, but degraded doesn't mean silently broken.
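Because the response bodies above have a regular shape, a deployment or monitoring tool can consume them directly. The sketch below is a minimal Go example that decodes a response like the ones shown and flags pipelines reporting StatusPermanentError; the endpoint URL is a placeholder (it depends on how the extension is configured), and the struct covers only a subset of the fields visible above, so treat it as an illustration rather than a real client for the extension.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// componentStatus mirrors a subset of the fields visible in the response
// bodies above. The structure is recursive: the collector contains pipelines,
// and pipelines contain their components.
type componentStatus struct {
	Healthy    bool                        `json:"healthy"`
	Status     string                      `json:"status"`
	Error      string                      `json:"error,omitempty"`
	StatusTime string                      `json:"status_time"`
	Components map[string]*componentStatus `json:"components,omitempty"`
}

func main() {
	// Placeholder address and path: the real values depend on how the
	// health check extension is configured in your collector.
	resp, err := http.Get("http://localhost:13133/status")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var overall componentStatus
	if err := json.NewDecoder(resp.Body).Decode(&overall); err != nil {
		panic(err)
	}

	fmt.Printf("collector: %s\n", overall.Status)
	for name, pipeline := range overall.Components {
		// Permanent errors will not clear on their own, so flag them for
		// a human; recoverable errors may resolve without intervention.
		if pipeline.Status == "StatusPermanentError" {
			fmt.Printf("%s needs attention: %s\n", name, pipeline.Error)
		}
	}
}
```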
@mwear If we remove either of the two, would it be possible to add a new error type after 1.0 in a backwards-compatible way? If that's possible (I don't see why not), I am tempted to say we should remove fatal now and table the discussion until after both 1.0 has happened and we have gotten some feedback on the system.
It would be possible to add FatalError back in later if we removed it now and later decided that we wanted it. Right now it's mainly used as a mechanism to fail fast from async work during component.Start. This functionality predates component status reporting, which is why it still exists. I think removing it greatly simplifies questions around component.Start and status reporting generally. With the new health check extension, we have better ways to get a view into the health of a running collector. I would support removing it for 1.0.
I agree that removing it simplifies the problem and it can always be added back. My general thoughts on the Component Status feature are that we need to make sure that the collector can function with or without it. This makes sure that the existing collector operational expectations stay intact for Collector admins who are not depending on status (or cannot, since the only aggregator is the health check extension and OpAMP is still new), while giving advanced options for users of OpAMP or k8s (or some other orchestration tool that can take advantage of the health check extension or some future Watcher). I plan to write up an RFC soon to help unblock many of the Status issues currently open.
@TylerHelmuth, I'm currently writing up a document about component status reporting. It describes how the system works and suggests best practices. I'm not sure where it should live, but I think it needs to be written. I can let you know when I have a draft together and we can see what its future should be from there.
#### Description

Adds an RFC for component status reporting. The main goal is to define what component status reporting is, our current implementation, and how such a system interacts with a 1.0 component. When merged, the following issues will be unblocked:

- #9823
- #10058
- #9957
- #9324
- #6506

---------

Co-authored-by: Matthew Wear <[email protected]>
Co-authored-by: Pablo Baeyens <[email protected]>
With #10413 merged, I am removing this from the component 1.0 milestone and Collector V1 project.
Per definitions from #9324: when do I use PermanentError vs FatalError?

If a component is permanently failing, does that not mean fatal? Why? If my whole collector doesn't accept any data because that was my only receiver, then I think it is better to fail than to transition into a permanent error.