
Scrape based receiver startup behavior #8816

Closed
mwear opened this issue Mar 23, 2022 · 22 comments

@mwear
Member

mwear commented Mar 23, 2022

Background
I started looking into this issue when I noticed that the kafkametricsreceiver crashes the collector if Kafka is not available during startup. I originally started a discussion on #8349, but have since found that the issue is more widespread: many receivers exhibit the same behavior if the service they monitor is not up at start time.

As a brief recap of the discussion on #8349, I found that the kafkametricsreceiver and many other receivers make use of shared code via the scraperhelper. The list of receivers using the scraperhelper includes: apachereceiver, couchdbreceiver, dockerstatsreceiver, elasticsearchreceiver, googlecloudspannereceiver, hostmetricsreceiver, kafkametricsreceiver, kubeletstatsreceiver, memcachedreceiver, mongodbatlasreceiver, mongodbreceiver, mysqlreceiver, nginxreceiver, podmanreceiver, postgresqlreceiver, rabbitmqreceiver, redisreceiver, windowsperfcountersreceiver, zookeeperreceiver.

The Start methods of many of these receivers try to establish a connection to the services they monitor, and return an error if the service is not up, which in turn crashes the collector during startup.

Ideally, a receiver would not crash at startup if the monitored service is down, and instead would periodically try to reconnect, and start scraping when the connection is successful.

Fix in scraperhelper
I looked at fixing this both in the individual receivers and in the shared scraperhelper package, and prototyped fixes for both approaches. The fix in the scraperhelper started to turn into a can of worms. Without going into all the details, the attempted fix made Start asynchronous, which was a pretty big change to collector startup, and made it impossible to return an error from Start, which might still be desirable in certain scenarios.

Fix in receiver
My attention turned to fixing this in the receiver itself. Moving client creation to the Scrape method was enough to fix the issue and introduce retries for the kafkametricsreceiver. I'd like to propose that receivers only return fatal errors from Start. If the monitored service is unreachable, the receiver can return an error from Scrape without bringing the collector down.
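
As a sketch of the pattern (the client interface and newClient factory below are illustrative, not the actual kafkametricsreceiver types, and the Scrape signature is simplified):

package scraperexample

import (
	"context"
	"errors"
	"fmt"
)

// client stands in for whatever the receiver talks to (a Kafka admin client,
// a database handle, etc.).
type client interface {
	Ping(ctx context.Context) error
}

type scraper struct {
	newClient func(ctx context.Context) (client, error)
	client    client
}

// Start does no network I/O; it only fails on errors that can never recover,
// such as an unusable configuration.
func (s *scraper) Start(ctx context.Context) error {
	if s.newClient == nil {
		return errors.New("no client factory configured")
	}
	return nil
}

// Scrape creates the client lazily. If the service is down, the error is
// returned to the scraper framework, which can log it and try again on the
// next scrape interval instead of crashing the collector.
func (s *scraper) Scrape(ctx context.Context) error {
	if s.client == nil {
		c, err := s.newClient(ctx)
		if err != nil {
			return fmt.Errorf("service unreachable, will retry: %w", err)
		}
		s.client = c
	}
	return s.client.Ping(ctx)
}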

PR
I opened a PR (#8817) for this work as a starting point for discussion. Ideally we'd be able to reach consensus on an approach and apply similar fixes to other receivers with the same problem. Once we are happy with the solution, we can establish best practices for receivers going forward.

@carlosalberto
Contributor

I'd like to propose that receivers only return fatal errors from Start.

What errors would be included? I can only think of genuinely bad configuration (one that prevents the receiver from ever being started).

@mwear
Member Author

mwear commented Mar 24, 2022

Offhand, configuration errors are really the only thing I can think of as well.

@jpkrohling
Member

I think connections should be attempted at Start. If all components can be started without errors, we can mark the collector as healthy (in the health check extension). If anything fails, we can wrap the error as either permanent (we have such an error in the code base) or recoverable (not sure we have this). We can then either fail (for permanent errors) or assume that the component will keep retrying periodically, which would put the health status at "degraded".
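
Roughly something like this (recoverableError below is hypothetical; today only the permanent wrapper in consumererror exists):

package startsketch

import (
	"context"
	"errors"
	"fmt"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// recoverableError is a hypothetical marker for failures the component will
// keep retrying on its own; there is no such type in the code base today.
type recoverableError struct{ err error }

func (e recoverableError) Error() string { return e.err.Error() }
func (e recoverableError) Unwrap() error { return e.err }

type receiverSketch struct {
	endpoint string
	connect  func(ctx context.Context) error
}

// Start attempts the connection up front and classifies any failure as
// permanent (bad config, fail collector startup) or recoverable (service not
// up yet, keep running and report a degraded health status).
func (r *receiverSketch) Start(ctx context.Context) error {
	if r.endpoint == "" {
		return consumererror.NewPermanent(errors.New("endpoint must be configured"))
	}
	if err := r.connect(ctx); err != nil {
		return recoverableError{err: fmt.Errorf("connection failed, will retry: %w", err)}
	}
	return nil
}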

@mwear
Member Author

mwear commented Mar 28, 2022

@jpkrohling I wanted to clarify a couple of things about your suggested approach before diving in.

I can introduce handling for permanent errors in the StartAll method in receivers_builder:

func (rcvs Receivers) StartAll(ctx context.Context, host component.Host) error {
	for _, rcv := range rcvs {
		rcv.logger.Info("Receiver is starting...")

		if err := rcv.Start(ctx, host); err != nil {
			return err
		}
		rcv.logger.Info("Receiver started.")
	}
	return nil
}
  • I can update StartAll to check the type of error returned from Start (a rough sketch follows below).
    • If it's a permanent error, return the error and the collector will fail to start.
    • For any other type of error, log and continue.
  • In order to preserve current behavior, I would have to take a pass through all of the current receivers and, wherever an error is returned from Start, wrap it as a permanent error.
  • We could then incrementally go back through each receiver, refactor it to periodically reconnect, and hopefully reduce or remove the cases where a permanent error is returned from Start.
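
Roughly, the check could look like this (building on the snippet above; consumererror.IsPermanent and the zap logging are just for illustration):

func (rcvs Receivers) StartAll(ctx context.Context, host component.Host) error {
	for _, rcv := range rcvs {
		rcv.logger.Info("Receiver is starting...")

		if err := rcv.Start(ctx, host); err != nil {
			if consumererror.IsPermanent(err) {
				// Unrecoverable (e.g. invalid config): abort collector startup.
				return err
			}
			// Recoverable (e.g. monitored service not up yet): log and move on;
			// the receiver is expected to keep retrying on its own.
			rcv.logger.Warn("Receiver failed to start, will keep retrying", zap.Error(err))
			continue
		}
		rcv.logger.Info("Receiver started.")
	}
	return nil
}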

Is this approach what you had in mind? Are there any alternatives or improvements to consider?

Also, I wanted to clarify what, if any, explicit code needs to be written for the health check extension to be aware of issues during receiver start. I'm not very familiar with the extension. If there's an example of what I need to do to interact with it, feel free to point me towards it.

@jpkrohling
Member

jpkrohling commented Mar 30, 2022

That does sound good! It's a change in the current behavior, so we would probably need a feature flag to hide it for a couple of releases. Or we can introduce another error that unambiguously states that the component will continue trying after it has started.

The health check changes are missing from your plan, though. I'm a bit out of date on this component, but I assume it should be possible for a component to report its state to it?

https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension

@mwear
Member Author

mwear commented Mar 30, 2022

I suppose introducing a recoverable error (instead of using permanent) as you hinted at earlier might be a little more straightforward and less disruptive to add. I might go that route first if there aren't any objections.

@jpkrohling
Member

We had a discussion about it today during the SIG call. The error we have is in the consumer package, which isn't suitable for this. The proposal is then to change component.Host to allow components to report their state (and continue reporting it), and to add another function that lets components register for callbacks, to be called whenever there's a report. This way, components such as the load balancing exporter and the health check extension can register for callbacks, whereas scrapers and components like the OTLP exporter can report their connection failures in real time.

The call to Start would then not return an error in those cases, so components would only return errors for permanent, unrecoverable problems. I can't think of any examples right now, which might be a sign that there shouldn't be errors returned from the Start function at all?
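
Something along these lines (all of the names below are illustrative, not an existing API):

package hostsketch

import "go.opentelemetry.io/collector/component"

// Status is a hypothetical component state.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

// StatusEvent describes one report from a component.
type StatusEvent struct {
	ComponentID string // e.g. "kafkametrics" (illustrative)
	Status      Status
	Err         error
}

// StatusHost sketches the proposed additions to component.Host: components
// such as scrapers or the OTLP exporter call ReportStatus as their connection
// state changes, while observers such as the health check extension or the
// load balancing exporter register for callbacks.
type StatusHost interface {
	component.Host
	ReportStatus(ev StatusEvent)
	RegisterStatusListener(listener func(ev StatusEvent))
}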

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Nov 16, 2022
@fatsheep9146 fatsheep9146 added the comp: receiver and enhancement labels and removed the Stale and needs triage labels Nov 16, 2022
@fatsheep9146
Contributor

fatsheep9146 commented Nov 16, 2022

Although open-telemetry/opentelemetry-collector#5304 was closed due to inactivity, I think this enhancement is still needed, right?
@jpkrohling

@jpkrohling
Member

Yes, I believe so.

@github-actions github-actions bot added the Stale label Apr 17, 2024
@jpkrohling
Member

@mwear, do you remember what the current state of this is? I have a feeling that (parts of?) it would be possible with the new health check extension, but I'm not sure all aspects of this discussion are captured there.

@jpkrohling jpkrohling removed the Stale label Apr 30, 2024
@mwear
Member Author

mwear commented Jun 13, 2024

I don't think this has been standardized yet. Component status and the new health check extension provide the necessary mechanisms to surface recoverable errors during startup without shutting the collector down. Ideally we would be able to standardize the behavior and abstract it into scraperhelper. This is somewhat related to the work here: open-telemetry/opentelemetry-collector#9957 and likely here: open-telemetry/opentelemetry-collector#9041.

@github-actions github-actions bot added the Stale label Aug 13, 2024
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Oct 12, 2024