Implement tail sampling on the client #4160
Conversation
@philipp-spiess at least in the Go SDK, it's possible to set the trace … This is also how we implement …
The native sampling flag behaviour sounds like pretty much what you want here, but I may be missing some nuance - feel free to ping me on Slack to discuss if that might be more effective!
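(For context, the "native sampling flag" presumably refers to the W3C sampled trace flag; a minimal sketch of how the JS API exposes it, assuming that reading:)

```ts
import { trace, TraceFlags } from '@opentelemetry/api'

// Check whether the active span carries the W3C "sampled" trace flag.
const span = trace.getActiveSpan()
const isSampled =
    span !== undefined &&
    (span.spanContext().traceFlags & TraceFlags.SAMPLED) === TraceFlags.SAMPLED
```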
@bobheadxi I'll look into it, but one problem I remember is that for autocomplete we only know towards the end of the trace whether we will sample it, so all sub-spans that are already flushed need to wait for that, hence the tail-sampling approach. If we can do this with …
@bobheadxi I tried (via …)
It should be a read-only attribute, so the should-trace flag must be set by a custom Sampler (Node version: https://open-telemetry.github.io/opentelemetry-js/interfaces/_opentelemetry_sdk_trace_base.Sampler.html)
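A minimal sketch of such a custom Sampler against the linked interface (the `cody.sampled` attribute name is an assumption for illustration, not from this PR):

```ts
import {
    Sampler,
    SamplingDecision,
    SamplingResult,
} from '@opentelemetry/sdk-trace-base'
import { Attributes, Context, Link, SpanKind } from '@opentelemetry/api'

// Sample only when the caller set a marker attribute at span creation time.
// This illustrates the limitation discussed above: the decision is made
// up front, so it cannot react to what happens later in the trace.
class AttributeSampler implements Sampler {
    shouldSample(
        _context: Context,
        _traceId: string,
        _spanName: string,
        _spanKind: SpanKind,
        attributes: Attributes,
        _links: Link[]
    ): SamplingResult {
        return {
            decision:
                attributes['cody.sampled'] === true
                    ? SamplingDecision.RECORD_AND_SAMPLED
                    : SamplingDecision.NOT_RECORD,
        }
    }

    toString(): string {
        return 'AttributeSampler'
    }
}
```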
Curious how this works, as your custom sampler in this PR interrogates the root span 🤔 do we set the attribute only at the end of the root span?
Left some suggestions on the native mechanism: #4160 (comment) - if this is not critical, I recommend investigating whether we can use some semblance of the native mechanisms instead, as the bookkeeping done in this exporter seems a bit risky 😁
Approving to unblock regardless, as this is just telemetry, and if we need this to improve the state of trace-export traffic, then feel free to ship this as-is :)
@bobheadxi Yeah. I only wanted to sample autocomplete requests that also end up being shown to users. We do a similar thing for telemetry, because we don't really care about those that aren't shown (they won't be noticed by a user anyhow, and we only care about their cost, which we can capture from the gateway traces). So what we do here is start a trace, do all the fancy shmancy to build the autocomplete result, and only if the request was not interrupted in the meantime (e.g. by another keystroke) will it be marked as "suggested" and also sampled. That's the inherent issue I had with a custom Sampler: the decision on whether something is sampled needs to be made ahead of time. I'll try this out for now; my main motivation behind this PR is to reduce the load of traces being sent from the Cody clients to the OTel collectors. If we run into issues we can still move the sampling decision into the OTel collector, but it would be nice if the client doesn't log every single I/O operation going on 😬
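A minimal sketch of the flow described above (the `CompletionRequest` type, `buildCompletion` helper, and `sampled` attribute name are illustrative stand-ins, not from this PR):

```ts
import { trace } from '@opentelemetry/api'

// Hypothetical types standing in for the real autocomplete plumbing.
interface CompletionRequest {
    aborted: boolean
    prefix: string
}

declare function buildCompletion(request: CompletionRequest): Promise<string>

const tracer = trace.getTracer('cody-autocomplete')

async function provideAutocomplete(request: CompletionRequest): Promise<string | null> {
    return tracer.startActiveSpan('autocomplete', async span => {
        try {
            const result = await buildCompletion(request)
            if (request.aborted) {
                // Interrupted (e.g. by another keystroke): the root span is
                // never marked, so a tail-sampling exporter drops this trace.
                return null
            }
            // Marked as "suggested": the root span opts the whole trace in.
            span.setAttribute('sampled', true)
            return result
        } finally {
            span.end()
        }
    })
}
```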
Sounds good!
Closes CODY-1367
We currently implement OpenTelemetry tail sampling only on the OTel Collector. This causes a number of issues, though:
This PR attempts to implement tail sampling on the client, mirroring what we do on the OTel Collector. Since there is no tail sampler in the SDK, this is done by extending the HTTP SpanExporter: for every span that is exported, we now look at the root span and only record it if it was manually marked as sampled.
Since spans can be flushed out of order (sub-spans can be flushed before the parent span is sent), there's a bit of bookkeeping that queues spans until their parent is received (or for up to 1 minute in total).
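A minimal sketch of what that bookkeeping could look like, built on the OTLP HTTP exporter (the `sampled` attribute name, field names, and eviction details are assumptions; the actual `CodyTraceExport` may differ):

```ts
import { ReadableSpan, SpanExporter } from '@opentelemetry/sdk-trace-base'
import { ExportResult, ExportResultCode } from '@opentelemetry/core'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'

const MAX_QUEUE_AGE_MS = 60_000 // drop orphaned sub-spans after 1 minute

class TailSamplingExporter implements SpanExporter {
    private delegate = new OTLPTraceExporter()
    // Sub-spans whose root has not been exported yet, keyed by trace ID.
    private queued = new Map<string, { spans: ReadableSpan[]; queuedAt: number }>()
    // Trace IDs whose root span arrived marked as sampled.
    // (A real implementation would also evict entries from this set.)
    private sampledTraces = new Set<string>()

    export(spans: ReadableSpan[], done: (result: ExportResult) => void): void {
        const toSend: ReadableSpan[] = []
        const now = Date.now()
        for (const span of spans) {
            const traceId = span.spanContext().traceId
            // `parentSpanId` is the field name in sdk-trace-base 1.x.
            const isRoot = span.parentSpanId === undefined
            if (this.sampledTraces.has(traceId)) {
                toSend.push(span)
            } else if (isRoot) {
                if (span.attributes['sampled'] === true) {
                    this.sampledTraces.add(traceId)
                    toSend.push(span, ...(this.queued.get(traceId)?.spans ?? []))
                }
                // Root seen: queued children were either sent above or are dropped.
                this.queued.delete(traceId)
            } else {
                // Parent not exported yet: queue until the root arrives.
                const entry = this.queued.get(traceId) ?? { spans: [], queuedAt: now }
                entry.spans.push(span)
                this.queued.set(traceId, entry)
            }
        }
        // Drop traces whose root never showed up within the window.
        for (const [traceId, entry] of this.queued) {
            if (now - entry.queuedAt > MAX_QUEUE_AGE_MS) {
                this.queued.delete(traceId)
            }
        }
        if (toSend.length === 0) {
            done({ code: ExportResultCode.SUCCESS })
            return
        }
        this.delegate.export(toSend, done)
    }

    shutdown(): Promise<void> {
        return this.delegate.shutdown()
    }
}
```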
Test plan
- `CodyTraceExport`, line 17: `this.isTracingEnabled`
- `CodyTraceExport`, line 58