
Scrape based receiver startup behavior #8816

Closed
mwear opened this issue Mar 23, 2022 · 22 comments

@mwear
Member

mwear commented Mar 23, 2022

Background
I started looking into this issue when I noticed that the kafkametricsreceiver crashes the collector if Kafka is not available during startup. I originally started a discussion on #8349, but have since found that the issue is more widespread: many receivers exhibit the same behavior if the service they monitor is not up at start time.

As a brief recap of the discussion on #8349, I found that the kafkametricsreceiver and many other receivers make use of shared code via the scraperhelper. The list of receivers using the scraperhelper includes: apachereceiver, couchdbreceiver, dockerstatsreceiver, elasticsearchreceiver, googlecloudspannereceiver, hostmetricsreceiver, kafkametricsreceiver, kubeletstatsreceiver, memcachedreceiver, mongodbatlasreceiver, mongodbreceiver, mysqlreceiver, nginxreceiver, podmanreceiver, postgresqlreceiver, rabbitmqreceiver, redisreceiver, windowsperfcountersreceiver, zookeeperreceiver.

The Start methods of many of these receivers try to establish a connection to the services they monitor, and return an error if the service is not up, which in turn crashes the collector during startup.

Ideally, a receiver would not crash at startup if the monitored service is down, and instead would periodically try to reconnect, and start scraping when the connection is successful.

Fix in scraperhelper
I looked at fixing this both in the individual receivers and in the shared scraperhelper package, and prototyped fixes for both approaches. The fix in the scraperhelper started to turn into a can of worms. Without going into all the details, the attempted fix made Start asynchronous, which was a pretty big change to collector startup, and made it impossible to return an error from Start, which might still be desirable in certain scenarios.

Fix in receiver
My attention turned to fixing this in the receiver itself. Moving client creation to the Scrape method was enough to fix the issue and introduce retries for the kafkametricsreceiver. I'd like to propose that receivers only return fatal errors from Start. If the monitored service is unreachable, the receiver can return an error from Scrape without bringing the collector down.
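
As a sketch of the pattern (the client interface and newClient factory below are illustrative, not the actual kafkametricsreceiver types, and the Scrape signature is simplified):

package scraperexample

import (
	"context"
	"errors"
	"fmt"
)

// client stands in for whatever the receiver talks to (a Kafka admin client,
// a database handle, etc.).
type client interface {
	Ping(ctx context.Context) error
}

type scraper struct {
	newClient func(ctx context.Context) (client, error)
	client    client
}

// Start does no network I/O; it only fails on errors that can never recover,
// such as an unusable configuration.
func (s *scraper) Start(ctx context.Context) error {
	if s.newClient == nil {
		return errors.New("no client factory configured")
	}
	return nil
}

// Scrape creates the client lazily. If the service is down, the error is
// returned to the scraper framework, which can log it and try again on the
// next scrape interval instead of crashing the collector.
func (s *scraper) Scrape(ctx context.Context) error {
	if s.client == nil {
		c, err := s.newClient(ctx)
		if err != nil {
			return fmt.Errorf("service unreachable, will retry: %w", err)
		}
		s.client = c
	}
	return s.client.Ping(ctx)
}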

PR
I opened a PR (#8817) for this work as a starting point for discussion. Ideally we'd be able to reach consensus on an approach and apply similar fixes to other receivers with the same problem. Once we are happy with the solution, we can establish best practices for receivers going forward.

@carlosalberto
Contributor

I'd like to propose that receivers only return fatal errors from Start.

What errors would be included? I can only think of genuinely bad configuration (one that prevents the receiver from ever being started).

@mwear
Member Author

mwear commented Mar 24, 2022

Offhand, configuration errors are really the only thing I can think of as well.

@jpkrohling
Member

I think connections should be attempted at Start. If all components can be started without errors, we can mark the collector as healthy (in the health check extension). If anything fails, we can wrap the error as either permanent (we have such an error in the code base) or recoverable (not sure we have this). We can then either fail (for permanent errors) or assume that the component will keep retrying periodically, which would put the health status at "degraded".
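
Roughly something like this (recoverableError below is hypothetical; today only the permanent wrapper in consumererror exists):

package startsketch

import (
	"context"
	"errors"
	"fmt"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// recoverableError is a hypothetical marker for failures the component will
// keep retrying on its own; there is no such type in the code base today.
type recoverableError struct{ err error }

func (e recoverableError) Error() string { return e.err.Error() }
func (e recoverableError) Unwrap() error { return e.err }

type receiverSketch struct {
	endpoint string
	connect  func(ctx context.Context) error
}

// Start attempts the connection up front and classifies any failure as
// permanent (bad config, fail collector startup) or recoverable (service not
// up yet, keep running and report a degraded health status).
func (r *receiverSketch) Start(ctx context.Context) error {
	if r.endpoint == "" {
		return consumererror.NewPermanent(errors.New("endpoint must be configured"))
	}
	if err := r.connect(ctx); err != nil {
		return recoverableError{err: fmt.Errorf("connection failed, will retry: %w", err)}
	}
	return nil
}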

@mwear
Member Author

mwear commented Mar 28, 2022

@jpkrohling I wanted to clarify a couple of things about your suggested approach before diving in.

I can introduce handling for permanent errors in the StartAll method in receivers_builder:

func (rcvs Receivers) StartAll(ctx context.Context, host component.Host) error {
	for _, rcv := range rcvs {
		rcv.logger.Info("Receiver is starting...")

		if err := rcv.Start(ctx, host); err != nil {
			return err
		}
		rcv.logger.Info("Receiver started.")
	}
	return nil
}
  • I can update StartAll to check the type of error returned from Start (a rough sketch follows below).
    • If it's a permanent error, return the error and the collector will fail to start.
    • For any other type of error, log and continue.
  • In order to preserve current behavior, I would have to take a pass through all of the current receivers and, wherever an error is returned from Start, wrap it as a permanent error.
  • We could then incrementally go back through each receiver, refactor it to periodically reconnect, and hopefully reduce or remove the cases where a permanent error is returned from Start.
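
Roughly, the check could look like this (building on the snippet above; consumererror.IsPermanent and the zap logging are just for illustration):

func (rcvs Receivers) StartAll(ctx context.Context, host component.Host) error {
	for _, rcv := range rcvs {
		rcv.logger.Info("Receiver is starting...")

		if err := rcv.Start(ctx, host); err != nil {
			if consumererror.IsPermanent(err) {
				// Unrecoverable (e.g. invalid config): abort collector startup.
				return err
			}
			// Recoverable (e.g. monitored service not up yet): log and move on;
			// the receiver is expected to keep retrying on its own.
			rcv.logger.Warn("Receiver failed to start, will keep retrying", zap.Error(err))
			continue
		}
		rcv.logger.Info("Receiver started.")
	}
	return nil
}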

Is this approach what you had in mind? Are there any alternatives or improvements to consider?

Also, I wanted to clarify what, if any, explicit code needs to be written for the health check extension to be aware of issues during receiver start. I'm not very familiar with the extension. If there's an example of what I need to do to interact with it, feel free to point me towards it.

@jpkrohling
Member

jpkrohling commented Mar 30, 2022

That does sound good! It's a change in the current behavior, so we would probably need a feature flag to hide it for a couple of releases. Or we can introduce another error that unambiguously states that the component will continue trying after it has started.

The health check changes are missing from your plan, though. I'm a bit out of date on this component, but I assume it should be possible for a component to report its state to it?

https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension

@mwear
Member Author

mwear commented Mar 30, 2022

I suppose introducing a recoverable error (instead of using permanent) as you hinted at earlier might be a little more straightforward and less disruptive to add. I might go that route first if there aren't any objections.

@jpkrohling
Member

We had a discussion about it today during the SIG call. The error we have is in the consumer package, which isn't suitable for this. The proposal is then to change component.Host to allow components to report their state (and continue reporting it), and to add another function that lets components register for callbacks, to be called whenever there's a report. This way, components such as the load balancing exporter and the health check extension can register for callbacks, whereas scrapers and components like the OTLP exporter can report their connection failures in real time.

The call to Start would then not return an error in those cases, so components would only return errors for permanent, unrecoverable problems. I can't think of any examples right now, which might be a sign that there shouldn't be errors returned from the Start function at all?
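
Something along these lines (all of the names below are illustrative, not an existing API):

package hostsketch

import "go.opentelemetry.io/collector/component"

// Status is a hypothetical component state.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

// StatusEvent describes one report from a component.
type StatusEvent struct {
	ComponentID string // e.g. "kafkametrics" (illustrative)
	Status      Status
	Err         error
}

// StatusHost sketches the proposed additions to component.Host: components
// such as scrapers or the OTLP exporter call ReportStatus as their connection
// state changes, while observers such as the health check extension or the
// load balancing exporter register for callbacks.
type StatusHost interface {
	component.Host
	ReportStatus(ev StatusEvent)
	RegisterStatusListener(listener func(ev StatusEvent))
}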

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Nov 16, 2022
@fatsheep9146 fatsheep9146 added the comp: receiver and enhancement labels and removed the Stale and needs triage labels Nov 16, 2022
@fatsheep9146
Contributor

fatsheep9146 commented Nov 16, 2022

Although open-telemetry/opentelemetry-collector#5304 was closed due to inactivity, I think this enhancement is still needed, right?
@jpkrohling

@jpkrohling
Member

Yes, I believe so.

@github-actions github-actions bot added the Stale label Apr 17, 2024
@jpkrohling
Member

@mwear, do you remember what the current state of this is? I have a feeling that (parts of?) it would be possible with the new health check extension, but I'm not sure all aspects of this discussion are captured there.

@jpkrohling jpkrohling removed the Stale label Apr 30, 2024
@mwear
Member Author

mwear commented Jun 13, 2024

I don't think this has been standardized yet. Component status and the new health check extension provide the necessary mechanisms to surface recoverable errors during startup without shutting the collector down. Ideally we would be able to standardize the behavior and abstract it into scraperhelper. This is somewhat related to the work here: open-telemetry/opentelemetry-collector#9957 and likely here: open-telemetry/opentelemetry-collector#9041.

@github-actions github-actions bot added the Stale label Aug 13, 2024
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Oct 12, 2024