Routing by origin - service is not good enough #35320
Theirs is a similar problem; although the solution proposed here is a bit different, it could work as well, yes.
Component(s)
exporter/loadbalancing
Is your feature request related to a problem? Please describe.
There exist two keys for routing spans today: `service` and `traceID`. `traceID` is great for tail sampling, while `service` is the only one producing reliable spanmetrics without insane cardinality on big volumes.

Routing spans by `service`, however, brings a big problem to the table if the different `service`s generating spans produce a very uneven volume. In an environment where a small percentage of services produce >99% of the spans, instances of the otelcol running the spanmetrics connector (or any other trace pipeline receiving spans routed by service) sustain a very asymmetric load, overloading a few instances while others stay idle.
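For reference, a minimal sketch of how one of the two existing routing keys is selected on the loadbalancing exporter today; the resolver target and hostname below are placeholders, not taken from this issue:

```yaml
exporters:
  loadbalancing:
    # the only trace routing keys available today are "traceID" (default) and "service"
    routing_key: "service"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # placeholder hostname for the second-layer collectors
        hostname: otelcol-backends.observability.svc.cluster.local
        port: 4317
```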
Describe the solution you'd like
`service` does a good job for spanmetrics because it produces metric series issued by a single collector of the layer behind the load balancing collector, easing up aggregation and range functions requiring multiple datapoints. But for this reason so would `pod_name` in a k8s environment, or any other proxy for task instance (where task = application that generates spans).

In the context of Kubernetes, the IP of the incoming connection uniquely identifies a task instance. My proposal is to add `source_ip` as a routing key for signals, which would considerably dilute the problem we have with `service` as routing key by segregating the load of these big services across as many workers as instances that service has. Since big services are also typically the ones with the most instances in our ecosystem (and others in the industry), while it's not as good as `traceID` in terms of randomness, this is a much more efficient routing key and an effective solution for our problem.

Note 1: this solution would also solve this issue, and would apply to any other environment where the IP uniquely identifies a task instance.

Note 2: this routing key would be applicable for `spans`, but also for `metrics` and `logs`.
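A sketch of what the proposed configuration could look like if this feature were added; `source_ip` is the routing key this issue asks for and is not a supported value in the exporter today, and the k8s resolver target is a placeholder:

```yaml
exporters:
  loadbalancing:
    # hypothetical: "source_ip" is the routing key proposed in this issue,
    # it does NOT exist in the current loadbalancing exporter
    routing_key: "source_ip"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        # placeholder headless service for the spanmetrics collectors
        service: otelcol-spanmetrics.observability
```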
Describe alternatives you've considered
Ad-hoc pipeline for offenders
Again within a Kubernetes environment, we've considered omitting the span traffic of these big services on our central spanmetrics pipeline and deploying otelcols as sidecars with such a spanmetrics pipeline. But this really just hacks around the problem, increases our architecture complexity, and does not solve the real issue of `service` being an inherently bad key for symmetric load.
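A minimal sketch of how the "omit the offenders from the central pipeline" half of this alternative could be expressed with the filter processor; the service names are hypothetical stand-ins for the high-volume services, not taken from this issue:

```yaml
processors:
  filter/drop-big-services:
    error_mode: ignore
    traces:
      span:
        # hypothetical offender list; these spans would instead be handled by
        # sidecar collectors running their own spanmetrics pipeline
        - resource.attributes["service.name"] == "big-service-a"
        - resource.attributes["service.name"] == "big-service-b"
```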
Hack the routing key
Since we have the ability to modify `service` before it's used as routing key, we introduced a transformer that changes the value of `service` to `{serviceName}%{ip}`, to force the loadbalancer exporter to route by this composed key. Then, on the second layer of collectors, we recover the original value of `service` by removing the artificial suffix and proceed with the processing. This has mitigated the problem for us for now, and we see much more symmetric load in our spanmetrics collectors, but again it's just hacking around a limitation of the exporter.

We preferred this solution over having a custom adaptation of the `loadbalancingexporter` on our side, so that we keep the process of maintaining our collectors simple.
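A sketch of roughly how such a workaround could be written with the transform processor; the `source.ip` resource attribute, the `%` suffix handling, and the processor names are assumptions for illustration, not the exact statements used by the author:

```yaml
processors:
  # first layer: append the instance IP to service.name before the
  # loadbalancing exporter uses it as the routing key
  transform/append-ip:
    trace_statements:
      - context: resource
        statements:
          # assumes an earlier step stored the incoming connection IP in a
          # resource attribute, here called "source.ip" (illustrative name)
          - set(attributes["service.name"], Concat([attributes["service.name"], attributes["source.ip"]], "%"))

  # second layer: strip the artificial "%{ip}" suffix so spanmetrics
  # sees the original service name again
  transform/strip-ip:
    trace_statements:
      - context: resource
        statements:
          - replace_pattern(attributes["service.name"], "%[^%]*$", "")
```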
Additional context
No response