Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

W3C trace_id backwards compatibility may break multiple components. #33814

Open
ancostas opened this issue Jun 28, 2024 · 11 comments
Open

W3C trace_id backwards compatibility may break multiple components. #33814

ancostas opened this issue Jun 28, 2024 · 11 comments
Labels
exporter/loadbalancing processor/probabilisticsampler Probabilistic Sampler processor processor/tailsampling Tail sampling processor

Comments

@ancostas
Copy link

Component(s)

exporter/loadbalancing, processor/probabilisticsampler, processor/tailsampling

Describe the issue you're reporting

W3C trace context spec requires support for legacy 64-bit tracers, which themselves may participate in a trace and truncate the leading 64-bits of a W3C-based trace to 64 0 bits.

When spans from such traces are processed by an OpenTelemetry collector that makes trace identity decisions based on the entire 128-bit trace id, a logically incorrect difference in identification will occur due to the discrepancy on the higher 64-bits:

  • Probabilistic sampler may sample one span and not the other.
  • Loadbalancing exporter configured to load balance by trace id may route one span to a different collector than the other.
  • Tail sampling processor may not group two spans with a discrepancy in the upper 64 bits into the same trace for tail sampling, which further compounds the issue mentioned for load balancing exporter (these are typically used in conjunction with one another -- gateway deployments).

This issue may also affect SDKs that are making decisions based on trace id-based identity.

We should fix this somehow.

@ancostas ancostas added the needs triage New item requiring triage label Jun 28, 2024
@github-actions github-actions bot added exporter/loadbalancing processor/probabilisticsampler Probabilistic Sampler processor processor/tailsampling Tail sampling processor labels Jun 28, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Copy link
Member

This is an interesting problem: basically, a trace might have two trace IDs, one with the whole 128 bits, and one with 64 zero bits + 64 value bits, both representing the same trace ID.

Are you facing this problem right now, or is this a hypothetical problem at the moment?

@jpkrohling jpkrohling removed the needs triage New item requiring triage label Jul 8, 2024
@ancostas
Copy link
Author

ancostas commented Jul 8, 2024

Yes, indeed, exactly. We faced it with our SDKs and backend, since both were built for 64-bits alone originally, and thus there would naturally be a migration period to 128-bit support w/ a need for "hybrid" deployments to work in the interim for all of our users.

We also got a report recently about this being an issue in agents/collectors when doing consistent probabilistic sampling (where we have analogous logic). We fixed it in our Agent (which happened to be what the user was using) but I wanted to open this here to have the fix generally available, and also have the fix apply for other components or be generally provided by the core collector in some fashion.

TL;DR: yes there were reports from users but so far we've been able to handle it on our end since they were using our implementation of the code path -- which is just luck and won't work for all cases.

@jpkrohling
Copy link
Member

I wonder if a config option common to all three component would make sense, to use only the lower/higher 64 bits when trace IDs are needed for a decision. Users would still only figure this out too late in the process (when they are facing something "weird" already), but I think it's the safest route for current and future users.

@ancostas
Copy link
Author

ancostas commented Jul 8, 2024

This occurred to me as well, but there may be other components making identity-based decisions w/ the trace id. Some of the components may be custom/provided by the end user. It seems likely that they will stumble into the same issue (I only tagged the three components you mentioned since I was aware of them/they are very popular).

We could handle this at the core collector level, perhaps? Provide a new config option for "64-bit compatibility mode" that automatically truncates all of the higher order bits?

I'm a bit unhappy with all of these solutions, though, even the core collector proposal above, since it requires the user to be aware that something wrong is happening to begin with, as you mentioned. At that point they may as well fix the issue as close to the source as possible rather than scattering the fix out to downstream processors/exporters/etc.

@ancostas
Copy link
Author

ancostas commented Jul 8, 2024

e.g. one scenario we couldn't handle in the collector is if this happens at SDK level: customer wishes to sample independently but consistently with tracing systems A, B, C, each configured with different sampling rates.

This is fine if there is no truncation, but in a call stack of tracing systems like A > B > C > B > A, if C truncates the trace id then A, B won't have consistent traces even unto themselves.

@ancostas
Copy link
Author

ancostas commented Jul 8, 2024

In that sense it seems like the best solution would be clear documentation on this as a general ecosystem-level concern (it applies to any W3C-based tracing system after all), and perhaps a design pattern or pre-rolled processor that users can use to downgrade trace ids to 64-bits.

Then a collector user can protect their pipelines like so:

<rest of config>
service:
    pipelines:
        traces:
            processors: 
                - downgradeto64 // first stage in every pipeline
                <other processors>
             <rest of trace pipeline>

@jpkrohling
Copy link
Member

In that sense it seems like the best solution would be clear documentation on this as a general ecosystem-level concern

I agree. We could provide a connector that could produce some metric that would allow people to detect that this is happening. Perhaps by tracking the high and low bits in different caches, and reporting their sizes? They should match all the time, and if they don't, there's this scenario happening.

@ancostas
Copy link
Author

ancostas commented Jul 30, 2024

Even tracking TIDs with 64 leading zero bits and ignoring everything else seems sufficient? There's a (1/2)^64 chance per trace of such an id being generated if ids are not being truncated, and a 100% chance per trace if the ids are being truncated, so it seems like a reasonable signal (assuming a static call stack here).

Would it then just be countconnector with the appropriate config? And transformprocessor to downgrade to 64-bits as a mitigation, I suppose.

If it makes sense I will discuss this with my team and consider where we may be able to schedule in a docs contribution to that effect.

Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 30, 2024
@jpkrohling
Copy link
Member

I think I'd keep this issue here as a record of this problem, but I'm not sure we should handle this on our side: this seems too much of an edge case for us to handle here. Perhaps we/you could have a processor that could be used to normalize the trace IDs during a migration period?

@github-actions github-actions bot removed the Stale label Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exporter/loadbalancing processor/probabilisticsampler Probabilistic Sampler processor processor/tailsampling Tail sampling processor
Projects
None yet
Development

No branches or pull requests

2 participants