Requests failing with 503 errors in OTel Load Balancer, no traces in logs #35512
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
It works when we remove the LB, and the overall traffic flow also generally works with the OTel LB layer in place. The mystery is why a small percentage (around 3-5%) of requests are lost at the OTel LB even though most traffic gets through. The logs give no clue.
This issue occurs with the routing key.
I'm trying to wrap my head around this issue: are you saying that the connection between the LB and the Collector is failing with 503, or that only the Demo to LB is failing? Can you provide me with metrics from the LB (localhost:8888/metrics)? In any case, I'm adding a few debug statements to the load balancer to help diagnose this kind of issue.
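(For reference: internal telemetry and debug logging on the LB can be enabled through the collector's `service.telemetry` section. Below is a minimal sketch for a v0.109.0-style config; the exact settings used in this deployment are not shown in the thread, so treat the values as assumptions.)

```yaml
service:
  telemetry:
    logs:
      level: debug          # surfaces debug statements from the loadbalancing exporter
    metrics:
      level: detailed       # exposes internal metrics such as http_server_request_size
      address: 0.0.0.0:8888 # then check http://localhost:8888/metrics
```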
We've had almost the same error message appear. In our case we've replaced the API gateway with nginx, and I can see the intermittent 503s appear in nginx but not bubble up to Loki, so I haven't been able to track down the cause. Just thought I'd throw that in there, given you mentioned using nginx.
Can you please provide me with the metrics? This message on its own does not indicate a problem. As it states there, the request will be retried.
I don't have any metrics to share (the customer has been told to turn them on; understandably, this is very hard to debug without them). I'll ask again and report back here if it's still an issue.
…ort operation (#36575): This adds some debug logging to the load balancing exporter to help identify causes of 503s, reported as part of issues like #35512. The statements are only logged when the logging level is set to debug, so there should be no difference from the current behavior of production setups. Signed-off-by: Juraci Paixão Kröhling <[email protected]>
Component(s)
exporter/loadbalancing
What happened?
Description
Problem:
We are observing intermittent 503 errors at the OTel Load Balancer pod. No logs are generated for these failures, but the internal telemetry metric http_server_request_size records requests with HTTP status code 503 at the OTel Load Balancer. The error is also visible in the OTel Demo app's collector.
Error Message (from OTel Demo app's otelcol pod):
Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/withauth", "error": "Throttle (0s), error: rpc error: code = Unavailable desc = error exporting items, request to http://otel-gateway.<IP>.nip.io:80/v1/metrics responded with HTTP Status Code 503, Message=unable to get service name, Details=[]", "interval": "27.51009548s"}
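(For context, this is the kind of otlphttp exporter configuration on the Demo collector side that would emit a retry message like the one above. The exporter name and endpoint are taken from the error message; everything else is an illustrative assumption.)

```yaml
exporters:
  otlphttp/withauth:
    # endpoint as it appears in the error message; <IP> left as-is
    endpoint: http://otel-gateway.<IP>.nip.io:80
    retry_on_failure:
      enabled: true   # produces "Will retry the request after interval" on 503 responses
```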
Steps to Reproduce
OTel Demo > API Gateway (Internal) > OTel Load Balancer > OTel Collector
We are using a single replica of both the OTel Load Balancer and OTel Collector for testing purposes.
Note: Nginx Ingress can be used in place of the internal API Gateway.
OTel Load Balancer Configuration:
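(The attached Load Balancer configuration is not reproduced in this thread. As a reference point, here is a minimal sketch of a loadbalancing exporter setup of the kind described, assuming a static resolver and a hypothetical backend hostname; adjust to the actual deployment.)

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  loadbalancing:
    routing_key: service        # assumed, given the "unable to get service name" message
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - otel-collector:4317 # hypothetical backend address

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing]
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```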
OTel Collector Configuration:
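(Likewise, the downstream Collector attachment is not shown here. A minimal sketch of a backend collector that accepts OTLP from the LB; the debug exporter is purely illustrative.)

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug:
    verbosity: basic

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]
    traces:
      receivers: [otlp]
      exporters: [debug]
```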
Configuration Details:
The OTel Load Balancer and OTel Collector configurations are attached in the configuration section.
Note: Requests sent manually via Postman work fine.
Expected Result
All requests should be processed without errors, since the downstream OTel Collector is available, up and running. The OTel Demo app should report its telemetry properly.
Actual Result
A few requests fail with 503, without any error being logged by the loadbalancing exporter.
Collector version
v0.109.0
Environment information
Environment
OS: CentOS v8
GoLang: 1.23
OTel Collector: v0.109.0
OpenTelemetry Collector configuration
Log output
Additional context
No response