
Confused about the difference between PermanentError vs FatalError #9823

Open
bogdandrutu opened this issue Mar 22, 2024 · 16 comments
Labels
area:componentstatus bug Something isn't working

Comments

@bogdandrutu
Member

Per definitions from #9324

| Status | Meaning |
| --- | --- |
| PermanentError | The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. |
| FatalError | The collector has experienced a fatal runtime error and will shutdown. |

When do I use PermanentError vs FatalError? If a component is permanently failing, does that not mean it is fatal? Why? If my whole collector doesn't accept any data because that was my only receiver, then I think it is better to fail than to transition into a permanent error.

@mx-psi
Member

mx-psi commented Mar 25, 2024

cc @mwear

@mwear
Member

mwear commented Mar 27, 2024

The intent of having a permanent error state is to allow the collector to run in a degraded mode, and communicate issues to operators through the health check extension or something similar. In your example where the collector has a single non-functional receiver, this would not be very useful, but in a collector with multiple pipelines, it is possible for some pipelines to be running normally, and others to have issues. In a scenario where you are exporting to multiple backends, one exporter could be working fine, while another could be failing due to misconfiguration. The misconfiguration isn't apparent until trying to send data.

I think we need to establish some guidelines, but I do think allowing a collector to run in a degraded fashion, rather than terminating altogether is a useful thing to support.

@mwear
Member

mwear commented Mar 27, 2024

I would suggest that we consider using the following (updated) definitions:

| Status | Meaning |
| --- | --- |
| PermanentError | The component has detected a condition after start that will need human intervention to fix. The collector will continue to run in a degraded mode. |
| FatalError | The collector has experienced an unrecoverable error during startup and will shutdown. |

A FatalError can be used to fail fast during startup when detecting an unrecoverable error condition. An example would be a receiver that is unable to bind to its configured port.
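
A minimal sketch of that fail-fast case, not taken from the collector codebase: the tcpReceiver type and the reportStatus callback below are hypothetical stand-ins for a real receiver and the component status reporting API (whose constructor names vary by collector version).

```go
package main

import (
	"context"
	"fmt"
	"net"
)

// statusEvent is a simplified stand-in for a component status event.
type statusEvent struct {
	kind string // e.g. "FatalError"
	err  error
}

type tcpReceiver struct {
	endpoint     string
	reportStatus func(statusEvent) // hypothetical; stands in for the status API
	ln           net.Listener
}

// Start fails fast: if the receiver cannot bind its configured port, this is
// an unrecoverable startup error, so it is reported as a FatalError.
func (r *tcpReceiver) Start(_ context.Context) error {
	ln, err := net.Listen("tcp", r.endpoint)
	if err != nil {
		r.reportStatus(statusEvent{kind: "FatalError", err: err})
		return err
	}
	r.ln = ln
	return nil
}

func main() {
	r := &tcpReceiver{
		endpoint: "localhost:4317",
		reportStatus: func(e statusEvent) {
			fmt.Printf("status: %s: %v\n", e.kind, e.err)
		},
	}
	_ = r.Start(context.Background())
}
```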

A PermanentError would be an unrecoverable error that happens after start. For example, an exporter attempts to send data and receives an http.StatusUnauthorized (401) or codes.Unauthenticated (16) response.

If you are interested, I attempted to enumerate permanent errors for exporters in #8684. I suspect some of the choices are controversial, and would like to find a more universal implementation, but the description captures my general thoughts. If we can reach consensus for exporters, my next step will be to start working on conventions for receivers.

@bogdandrutu
Member Author

> A FatalError can be used to fail fast during startup when detecting an unrecoverable error condition. An example would be a receiver that is unable to bind to its configured port.
>
> The collector has experienced an unrecoverable error during startup and will shutdown.

Based on these two definitions, I would say that a component should never use FatalError. Also, we can remove FatalError from the statuses, since we shut down anyway.

@mwear
Member

mwear commented Apr 2, 2024

Are you saying that components should never trigger a fatal error, or that they should not use component status to report it? If a component can trigger a fatal error, I think it should go through the component status API so that it can be surfaced by a StatusWatcher prior to shutdown.
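
A minimal sketch of the StatusWatcher idea mentioned above: an extension gets notified of every component status change so a fatal error can be surfaced before the collector shuts down. The types below are simplified stand-ins for the collector's component.InstanceID and component.StatusEvent; the real interface lives in the extension package and its exact signatures differ by collector version.

```go
package main

import "fmt"

// Simplified stand-ins for the collector's instance ID and status event types.
type instanceID struct {
	kind        string // receiver, processor, exporter, extension
	componentID string
}

type componentStatus string

const (
	statusOK             componentStatus = "StatusOK"
	statusPermanentError componentStatus = "StatusPermanentError"
	statusFatalError     componentStatus = "StatusFatalError"
)

type statusEvent struct {
	status componentStatus
	err    error
}

// loggingWatcher mirrors the StatusWatcher concept: react to every status change.
type loggingWatcher struct{}

func (loggingWatcher) ComponentStatusChanged(src instanceID, ev statusEvent) {
	fmt.Printf("%s/%s -> %s (err: %v)\n", src.kind, src.componentID, ev.status, ev.err)
}

func main() {
	w := loggingWatcher{}
	w.ComponentStatusChanged(
		instanceID{kind: "receiver", componentID: "otlp"},
		statusEvent{status: statusFatalError, err: fmt.Errorf("listen tcp :4317: address already in use")},
	)
}
```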

@mwear
Member

mwear commented Apr 4, 2024

An alternative would be to remove FatalErrors altogether, as suggested here.

@mwear
Member

mwear commented May 2, 2024

I've made proposals on #9324 and #9325 that we remove the error return value from component.Start and Shutdown and rely on status reporting for errors instead. In light of the conversation on #9324, I think we should keep FatalError as a fail-fast mechanism during component.Start or async work that begins in Start. During Start, a component can choose to report a PermanentError instead of a FatalError if the author wishes to allow the collector to run in a degraded mode. After component.Start, only PermanentErrors should be used for non-recoverable errors.

@TylerHelmuth
Member

If FatalError is intended only for Start, I think we could use the error return value to mark when a collector would shut down, and then we'd be able to remove FatalError. I am working under the assumption that a running component should never be able to shut down the entire collector unless it causes a panic (which would be bad).

On that note, what would a PermanentError for a processor imply for the collector as a whole? Is the pipeline that the processor is in dead? Should the collector be panicking and stopping instead?

@mwear
Member

mwear commented May 2, 2024

Unfortunately, we cannot use the return value of Start for all FatalErrors. This is because many components will start work in a goroutine, like spawning an HTTP server, which can fail as a result of Start, but not until after Start returns. For this reason, we need a separate API for FatalErrors. We would also like the health check extension to be aware of this, so it makes sense for this API to be the status reporting API.
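
A sketch of why Start's return value alone is not enough: the server runs in a goroutine, so a bind failure surfaces only after Start has already returned nil. The httpReceiver type and reportFatal callback are hypothetical stand-ins for a real component and the status reporting API.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
)

type httpReceiver struct {
	server      *http.Server
	reportFatal func(error) // hypothetical stand-in for the status reporting API
}

// Start returns immediately; any listen error (e.g. port already in use)
// happens asynchronously and has to be reported through a separate API.
func (r *httpReceiver) Start(_ context.Context) error {
	go func() {
		if err := r.server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			r.reportFatal(err)
		}
	}()
	return nil // Start itself succeeds; the failure, if any, is reported later
}

func main() {
	r := &httpReceiver{
		server:      &http.Server{Addr: ":4318"},
		reportFatal: func(err error) { log.Fatalf("fatal component error: %v", err) },
	}
	_ = r.Start(context.Background())
	select {} // keep the process alive so the async failure (if any) is observable
}
```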

Any PermanentError discovered at runtime is a sign that something is not ok and that a human needs to intervene to fix it. The severity of the error is situational. It can mean that a pipeline has failed completely, or that it is running in a degraded mode. It can also be the case that other pipelines are completely fine. The key point indicated by a PermanentError is that restarting your collector is unlikely to fix it. The goal is to allow the collector to run, even if it's not 100% healthy, alert someone that there is an issue, and allow them to mitigate it.

@mwear
Member

mwear commented May 3, 2024

I'd like to elaborate just a little on the PermanentError case, as I think a couple of examples might help illustrate the intent of this state. I've defined it as a non-recoverable error that is detected at runtime (e.g. sometime after the collector has started). The idea behind this state is that there is an issue with a component that is unlikely to resolve if you wait it out or restart your collector. It's something that a human should be notified about. That said, the impact to the collector might not necessarily be catastrophic. In a collector with multiple pipelines, some pipelines could be working fine, and it might be better to let the collector run in a degraded fashion until someone can troubleshoot and mitigate the issue.

I did a pass over HTTP status codes where I attempted to catalog permanent and recoverable errors for exporters. They are listed on the runtime status reporting issue, and some of the choices are likely to be controversial. Let's look at two.

I've proposed that an exporter that receives a 401 (unauthorized) response code should report a PermanentError, as this indicates a likely missing or misconfigured credential. It will not improve if you wait it out or restart the collector. The impact is that the pipelines that use this exporter are not sending data. If this is your only exporter, or your primary one, it's a big problem. If it's one of many pipelines, or a lower-priority destination, it's less of a problem.

I've also proposed that a 413 (request entity too large) should result in a PermanentError. My reasoning is that this indicates a misconfigured batch size. It will not go away if you wait, nor will it change if the collector is restarted. That said, some requests may be under the size limit and accepted by the destination, but ones that are rejected for being too large will continue to be rejected until someone has updated the batch size. In this case, pipelines using this exporter are degraded, but not entirely broken.
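
A rough sketch of that classification, not the actual exporter implementation: 401 and 413 are treated as permanent because waiting or restarting will not help, while throttling and server errors remain retryable. It assumes the consumererror.NewPermanent and consumererror.IsPermanent helpers from the collector's consumer/consumererror package.

```go
package main

import (
	"fmt"
	"net/http"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// classifyHTTPError marks non-recoverable response codes as permanent errors.
func classifyHTTPError(statusCode int, err error) error {
	switch statusCode {
	case http.StatusUnauthorized, // 401: missing or misconfigured credential
		http.StatusRequestEntityTooLarge: // 413: batch size exceeds the destination's limit
		return consumererror.NewPermanent(err)
	case http.StatusTooManyRequests, // 429: back off and retry
		http.StatusServiceUnavailable: // 503: back off and retry
		return err
	default:
		return err
	}
}

func main() {
	err := classifyHTTPError(http.StatusUnauthorized, fmt.Errorf("authentication failed"))
	fmt.Println(consumererror.IsPermanent(err)) // true
}
```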

The health check extension is a consumer of these events, and is the place for a user to define how they want to respond to them and get a deeper look into the collector to debug them. It offers per pipeline status, as well as overall collector status. Both of these are computed as "the sum" of the component statuses, which are available in the response body if the verbose flag has been set on a request to the health service.
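
A sketch of the "sum of component statuses" idea: the pipeline (or overall collector) status is effectively the most severe status among its components. The severity ordering below is an assumption for illustration, not the extension's actual aggregation code.

```go
package main

import "fmt"

type status int

// Ordered from least to most severe (assumed ordering for illustration).
const (
	statusOK status = iota
	statusRecoverableError
	statusPermanentError
	statusFatalError
)

func (s status) String() string {
	return [...]string{"StatusOK", "StatusRecoverableError", "StatusPermanentError", "StatusFatalError"}[s]
}

// aggregate returns the most severe status among a group of components, which
// is how a pipeline ends up reporting StatusPermanentError when a single
// exporter is permanently failing while everything else is healthy.
func aggregate(statuses ...status) status {
	worst := statusOK
	for _, s := range statuses {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	// receiver OK, processor OK, exporter permanently failing (e.g. a 401):
	fmt.Println(aggregate(statusOK, statusOK, statusPermanentError)) // StatusPermanentError
}
```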

@mwear
Member

mwear commented May 3, 2024

This might be getting a little off topic, but just to complete this thought, I wanted to show sample response bodies from the new health check extension. The first response body is for the collector as a whole. Note that the overall collector status is StatusPermanentError, as is pipeline:metrics/grpc, but pipeline:traces/http is StatusOK.

{
    "start_time": "2024-01-18T17:27:12.570394-08:00",
    "healthy": true,
    "status": "StatusPermanentError",
    "error": "rpc error: code = Unauthenticated desc = authentication failed",
    "status_time": "2024-01-18T17:27:32.572301-08:00",
    "components": {
        "extensions": {
            "healthy": true,
            "status": "StatusOK",
            "status_time": "2024-01-18T17:27:12.570428-08:00",
            "components": {
                "extension:healthcheckv2": {
                    "healthy": true,
                    "status": "StatusOK",
                    "status_time": "2024-01-18T17:27:12.570428-08:00"
                }
            }
        },
        "pipeline:metrics/grpc": {
            "healthy": true,
            "status": "StatusPermanentError",
            "error": "rpc error: code = Unauthenticated desc = authentication failed",
            "status_time": "2024-01-18T17:27:32.572301-08:00",
            "components": {
                "exporter:otlp/staging": {
                    "healthy": true,
                    "status": "StatusPermanentError",
                    "error": "rpc error: code = Unauthenticated desc = authentication failed",
                    "status_time": "2024-01-18T17:27:32.572301-08:00"
                },
                "processor:batch": {
                    "healthy": true,
                    "status": "StatusOK",
                    "status_time": "2024-01-18T17:27:12.571132-08:00"
                },
                "receiver:otlp": {
                    "healthy": true,
                    "status": "StatusOK",
                    "status_time": "2024-01-18T17:27:12.571576-08:00"
                }
            }
        },
        "pipeline:traces/http": {
            "healthy": true,
            "status": "StatusOK",
            "status_time": "2024-01-18T17:27:12.571625-08:00",
            "components": {
                "exporter:otlphttp/staging": {
                    "healthy": true,
                    "status": "StatusOK",
                    "status_time": "2024-01-18T17:27:12.571615-08:00"
                },
                "processor:batch": {
                    "healthy": true,
                    "status": "StatusOK",
                    "status_time": "2024-01-18T17:27:12.571621-08:00"
                },
                "receiver:otlp": {
                    "healthy": true,
                    "status": "StatusOK",
                    "status_time": "2024-01-18T17:27:12.571625-08:00"
                }
            }
        }
    }
}

The new extension allows you to check individual pipelines. If you were to check just traces/http you'd see:

{
    "start_time": "2024-01-18T17:27:12.570394-08:00",
    "healthy": true,
    "status": "StatusOK",
    "status_time": "2024-01-18T17:27:12.571625-08:00",
    "components": {
        "exporter:otlphttp/staging": {
            "healthy": true,
            "status": "StatusOK",
            "status_time": "2024-01-18T17:27:12.571615-08:00"
        },
        "processor:batch": {
            "healthy": true,
            "status": "StatusOK",
            "status_time": "2024-01-18T17:27:12.571621-08:00"
        },
        "receiver:otlp": {
            "healthy": true,
            "status": "StatusOK",
            "status_time": "2024-01-18T17:27:12.571625-08:00"
        }
    }
}

The tl;dr is that the health check extension is a window into collector health, and that the different types of errors can help identify the urgency and possible steps for mitigation. If we allow a collector to run in a degraded mode, we can debug it at a distance with the extension. If we terminate the collector due to error, we don't have this option. There is a balance to strike between failing fast and running in a degraded mode, but degraded doesn't mean silently broken.

@mx-psi
Member

mx-psi commented May 20, 2024

@mwear If we remove either of the two, would it be possible to add a new error type after 1.0 in a backwards-compatible way?

If that's possible (I don't see why not), I am tempted to say we should remove FatalError now and table the discussion until both 1.0 has happened and we have gotten some feedback on the system.

@mwear
Member

mwear commented May 22, 2024

It would be possible to add FatalError back later if we removed it now and then decided that we wanted it. Right now it's mainly used as a mechanism to fail fast from async work during component.Start. This functionality predates component status reporting, which is why it still exists. I think removing it greatly simplifies questions around component.Start and status reporting generally. With the new health check extension, we have better ways to get a view into the health of a running collector. I would support removing it for 1.0.

@TylerHelmuth
Member

I agree that removing it simplifies the problem and it can always be added back.

My general thoughts on the Component Status feature are that we need to make sure the collector can function with or without it. This makes sure that the existing collector operational expectations stay intact for Collector admins who are not depending on status (or cannot, since the only aggregator is the health check extension and OpAMP is still new), while giving advanced options for users of OpAMP or k8s (or some other orchestration tool that can take advantage of the health check extension or some future Watcher).

I plan to write up an RFC soon to help unblock many of the Status issues currently open.

@mwear
Member

mwear commented May 22, 2024

> I plan to write up an RFC soon to help unblock many of the Status issues currently open.

@TylerHelmuth, I'm currently writing up a document about component status reporting. It describes how the system works and suggests best practices. I'm not sure where it should live, but I think it needs to be written. I can let you know when I have a draft together and we can see what its future should be from there.

codeboten pushed a commit that referenced this issue Aug 22, 2024
#### Description
Adds an RFC for component status reporting. The main goal is to define
what component status reporting is, our current implementation, and how
such a system interacts with a 1.0 component. When merged, the following
issues will be unblocked:
- #9823
- #10058
- #9957
- #9324
- #6506

---------

Co-authored-by: Matthew Wear <[email protected]>
Co-authored-by: Pablo Baeyens <[email protected]>
@TylerHelmuth
Member

With #10413 merged, I am removing this from the component 1.0 milestone and Collector V1 project.

sfc-gh-sili pushed a commit to sfc-gh-sili/opentelemetry-collector that referenced this issue Aug 23, 2024