Scrape based receiver startup behavior #8816
Comments
What errors would be included? I can only think of actually bad configuration (one that totally prevents the receiver from ever being started).
Offhand, configuration errors are really the only thing I can think of as well.
I think connections should be attempted at
@jpkrohling I wanted to clarify a couple of things about your suggested approach before diving in. I can introduce handling for permanent errors in StartAll:

```go
func (rcvs Receivers) StartAll(ctx context.Context, host component.Host) error {
	for _, rcv := range rcvs {
		rcv.logger.Info("Receiver is starting...")
		if err := rcv.Start(ctx, host); err != nil {
			return err
		}
		rcv.logger.Info("Receiver started.")
	}
	return nil
}
```
Is this approach what you had in mind? Are there any alternatives or improvements to consider? I also wanted to clarify what, if any, explicit code needs to be written for the health check extension to be aware of issues with receiver start. I'm not very familiar with that extension, so if there's an example of what I need to do to interact with it, feel free to point me to it.
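To make the question concrete, here is a minimal sketch of what that handling could look like, building on the StartAll snippet above. The permanentError type and isPermanent helper are hypothetical illustrations, not existing collector APIs; assume "context", "errors", go.uber.org/zap, and go.opentelemetry.io/collector/component are imported.

```go
// permanentError is a hypothetical marker: a receiver would wrap an error in
// it only for problems that can never recover (e.g. invalid configuration).
type permanentError struct{ err error }

func (p permanentError) Error() string { return p.err.Error() }
func (p permanentError) Unwrap() error { return p.err }

// isPermanent reports whether any error in the chain is a permanentError.
func isPermanent(err error) bool {
	var pe permanentError
	return errors.As(err, &pe)
}

// StartAll aborts collector startup only on permanent errors; anything else is
// logged and the receiver is left running so it can retry on its own.
func (rcvs Receivers) StartAll(ctx context.Context, host component.Host) error {
	for _, rcv := range rcvs {
		rcv.logger.Info("Receiver is starting...")
		if err := rcv.Start(ctx, host); err != nil {
			if isPermanent(err) {
				return err
			}
			rcv.logger.Warn("Receiver failed to start and will keep retrying", zap.Error(err))
			continue
		}
		rcv.logger.Info("Receiver started.")
	}
	return nil
}
```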
That does sound good! It's a change in the current behavior, so we would probably need a feature flag to hide this for a couple of releases. Or we could introduce another error that unambiguously states that the component will continue trying after it got started. The health check changes are missing from your plan, though. I'm a bit outdated regarding that component, but I assume it should be possible for a component to report its state to it?
I suppose introducing a recoverable error (instead of using permanent), as you hinted at earlier, might be a little more straightforward and less disruptive to add. I might go that route first if there aren't any objections.
We had a discussion about this today during the SIG call. The error type we have is in the consumer package, which isn't suitable for this. The proposal is then to change component.Host to allow components to report their state (and continue reporting it), adding another function that lets components register for callbacks, to be called whenever there's a report. That way, components such as the load-balancing exporter and the health check extension can register for callbacks, while scrapers and components like the OTLP exporter can report their connection failures in real time. The call to Start would then not return an error in those cases, so that components would only return errors for permanent, unrecoverable problems. I can't think of any examples right now, which might be a sign that there shouldn't be errors returned from the Start function at all?
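As a rough illustration only, the host-side surface being proposed could look something like the sketch below. None of these names exist in the collector; only component.Host and component.ID (from go.opentelemetry.io/collector/component) are real.

```go
// Hypothetical sketch of the proposed status-reporting surface on the host.
type ComponentStatus int

const (
	StatusOK ComponentStatus = iota
	StatusRecoverableError
	StatusPermanentError
)

type StatusEvent struct {
	ComponentID component.ID
	Status      ComponentStatus
	Err         error
}

// StatusReportingHost extends component.Host so that scrapers or the OTLP
// exporter can report connection failures as they happen, while components
// like the health check extension or the load-balancing exporter register a
// callback to observe every report.
type StatusReportingHost interface {
	component.Host
	ReportStatus(event StatusEvent)
	RegisterStatusListener(listener func(event StatusEvent))
}
```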
Although open-telemetry/opentelemetry-collector#5304 was closed due to inactivity, I think this enhancement is still needed, right?
Yes, I believe so.
@mwear, do you remember what the current state of this is? I have a feeling that (parts of?) it would be possible with the new health check extension, but I'm not sure whether all aspects of this discussion are captured there.
I don't think this has been standardized yet. Component status and the new health check extension provide the necessary mechanisms to surface recoverable errors during startup without shutting the collector down. Ideally we would be able to standardize the behavior and abstract it into scraperhelper. This is somewhat related to the work here: open-telemetry/opentelemetry-collector#9957 and likely here: open-telemetry/opentelemetry-collector#9041.
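For reference, here is a sketch of how a receiver could lean on the component status API that later landed in the collector. It assumes the componentstatus package's ReportStatus, NewEvent, and NewRecoverableErrorEvent helpers (names and signatures may differ between collector versions); myReceiver and connect are illustrative placeholders.

```go
import (
	"context"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/component/componentstatus"
)

// Start reports a recoverable error instead of returning it, so the collector
// keeps running and the health check extension can surface the degraded state.
func (r *myReceiver) Start(ctx context.Context, host component.Host) error {
	if err := r.connect(ctx); err != nil {
		componentstatus.ReportStatus(host, componentstatus.NewRecoverableErrorEvent(err))
		return nil // the connection will be retried on the next scrape
	}
	componentstatus.ReportStatus(host, componentstatus.NewEvent(componentstatus.StatusOK))
	return nil
}
```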
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Background
I started looking into this issue when I noticed that the kafkametricsreceiver crashes the collector if Kafka is not available during startup. I originally started a discussion on #8349, but have found that this issue is more widespread and that many receivers exhibit the same behavior if the service they monitor is not up at start time.
As a brief recap of the discussion on #8349, I found that the kafkametricsreceiver and roughly a dozen other receivers make use of shared code via the scraperhelper. The list of receivers using the scraperhelper includes: apachereceiver, couchdbreceiver, dockerstatsreceiver, elasticsearchreceiver, googlecloudspannereceiver, hostmetricsreceiver, kafkametricsreceiver, kubeletstatsreceiver, memcachedreceiver, mongodbatlasreceiver, mongodbreceiver, mysqlreceiver, nginxreceiver, podmanreceiver, postgresqlreceiver, rabbitmqreceiver, redisreceiver, windowsperfcountersreceiver, zookeeperreceiver.
The Start methods of many of these receivers try to establish a connection to the services they monitor and return an error if the service is not up, which in turn crashes the collector during startup. Ideally, a receiver would not crash at startup if the monitored service is down; instead it would periodically try to reconnect and start scraping once the connection succeeds.
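The pattern that causes the crash looks roughly like this (a sketch; myReceiver and newClient stand in for receiver-specific code):

```go
// Start dials the monitored service immediately. If the service is down at
// collector startup, the error propagates up and the collector exits.
func (r *myReceiver) Start(ctx context.Context, _ component.Host) error {
	client, err := newClient(ctx, r.config)
	if err != nil {
		return err // aborts collector startup
	}
	r.client = client
	return nil
}
```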
Fix in scraperhelper
I began looking at fixing this in the individual receivers and also looked at fixing it in the shared scraperhelper package, and I prototyped fixes for both approaches. The fix in the scraperhelper started to turn into a can of worms. Without going into all the details, the attempted fix made Start asynchronous, which was a pretty big change to collector startup, and made it impossible to return an error from Start, which might still be desirable in certain scenarios.
Fix in receiver
My attention turned to fixing this in the receiver itself. Moving client creation to the Scrape method was enough to fix the issue and introduce retries for the kafkametricsreceiver. I'd like to propose that receivers only return fatal errors from Start. If the monitored service is unreachable, the receiver can return an error from Scrape without causing a failure.
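A minimal sketch of that fix, with client creation deferred to Scrape so an unreachable service produces a scrape error instead of a startup failure. myReceiver, newClient, and collect are illustrative placeholders; the Scrape shape follows the scraperhelper scrape function, and go.opentelemetry.io/collector/pdata/pmetric is assumed to be imported.

```go
// Start no longer touches the network; only unrecoverable setup belongs here.
func (r *myReceiver) Start(_ context.Context, _ component.Host) error {
	return nil
}

// Scrape lazily creates the client and retries on every scrape interval until
// the monitored service becomes reachable.
func (r *myReceiver) Scrape(ctx context.Context) (pmetric.Metrics, error) {
	if r.client == nil {
		client, err := newClient(ctx, r.config)
		if err != nil {
			// Returned as a scrape error: the collector keeps running and the
			// next interval tries again.
			return pmetric.NewMetrics(), err
		}
		r.client = client
	}
	return r.collect(ctx)
}
```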
PR
I opened a PR (#8817) for this work as a starting point for discussion. Ideally we'd be able to reach consensus on an approach and apply similar fixes to other receivers with the same problem. Once we are happy with the solution, we can establish best practices for receivers going forward.