-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
W3C trace_id backwards compatibility may break multiple components. #33814
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
This is an interesting problem: basically, a trace might have two trace IDs, one with the whole 128 bits, and one with 64 zero bits + 64 value bits, both representing the same trace ID. Are you facing this problem right now, or is this a hypothetical problem at the moment? |
Yes, indeed, exactly. We faced it with our SDKs and backend, since both were built for 64-bits alone originally, and thus there would naturally be a migration period to 128-bit support w/ a need for "hybrid" deployments to work in the interim for all of our users. We also got a report recently about this being an issue in agents/collectors when doing consistent probabilistic sampling (where we have analogous logic). We fixed it in our Agent (which happened to be what the user was using) but I wanted to open this here to have the fix generally available, and also have the fix apply for other components or be generally provided by the core collector in some fashion. TL;DR: yes there were reports from users but so far we've been able to handle it on our end since they were using our implementation of the code path -- which is just luck and won't work for all cases. |
I wonder if a config option common to all three component would make sense, to use only the lower/higher 64 bits when trace IDs are needed for a decision. Users would still only figure this out too late in the process (when they are facing something "weird" already), but I think it's the safest route for current and future users. |
This occurred to me as well, but there may be other components making identity-based decisions w/ the trace id. Some of the components may be custom/provided by the end user. It seems likely that they will stumble into the same issue (I only tagged the three components you mentioned since I was aware of them/they are very popular). We could handle this at the core collector level, perhaps? Provide a new config option for "64-bit compatibility mode" that automatically truncates all of the higher order bits? I'm a bit unhappy with all of these solutions, though, even the core collector proposal above, since it requires the user to be aware that something wrong is happening to begin with, as you mentioned. At that point they may as well fix the issue as close to the source as possible rather than scattering the fix out to downstream processors/exporters/etc. |
e.g. one scenario we couldn't handle in the collector is if this happens at SDK level: customer wishes to sample independently but consistently with tracing systems A, B, C, each configured with different sampling rates. This is fine if there is no truncation, but in a call stack of tracing systems like A > B > C > B > A, if C truncates the trace id then A, B won't have consistent traces even unto themselves. |
In that sense it seems like the best solution would be clear documentation on this as a general ecosystem-level concern (it applies to any W3C-based tracing system after all), and perhaps a design pattern or pre-rolled processor that users can use to downgrade trace ids to 64-bits. Then a collector user can protect their pipelines like so:
|
I agree. We could provide a connector that could produce some metric that would allow people to detect that this is happening. Perhaps by tracking the high and low bits in different caches, and reporting their sizes? They should match all the time, and if they don't, there's this scenario happening. |
Even tracking TIDs with 64 leading zero bits and ignoring everything else seems sufficient? There's a (1/2)^64 chance per trace of such an id being generated if ids are not being truncated, and a 100% chance per trace if the ids are being truncated, so it seems like a reasonable signal (assuming a static call stack here). Would it then just be If it makes sense I will discuss this with my team and consider where we may be able to schedule in a docs contribution to that effect. |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
I think I'd keep this issue here as a record of this problem, but I'm not sure we should handle this on our side: this seems too much of an edge case for us to handle here. Perhaps we/you could have a processor that could be used to normalize the trace IDs during a migration period? |
Component(s)
exporter/loadbalancing, processor/probabilisticsampler, processor/tailsampling
Describe the issue you're reporting
W3C trace context spec requires support for legacy 64-bit tracers, which themselves may participate in a trace and truncate the leading 64-bits of a W3C-based trace to 64 0 bits.
When spans from such traces are processed by an OpenTelemetry collector that makes trace identity decisions based on the entire 128-bit trace id, a logically incorrect difference in identification will occur due to the discrepancy on the higher 64-bits:
This issue may also affect SDKs that are making decisions based on trace id-based identity.
We should fix this somehow.
The text was updated successfully, but these errors were encountered: