
Support distributed trace sampling #142

Open
wants to merge 34 commits into base: master

Conversation

@reachfh commented Aug 20, 2022

This PR makes changes to priority to support distributed tracing and sampling.

In distributed tracing, an upstream process may set an HTTP header indicating that the current request should be sampled. We need to determine if that header was present, and optionally make a decision about whether to sample the request if it is not already set. The application code may also need to increase priority for the trace later, e.g. if there is an error in the request.

This PR sets the default priority to nil instead of 1, allowing code to identify whether it has already been set.
The application can make a sampling decision if it is the first application in the trace; otherwise the existing value is preserved. This PR adds functions to get and set the current trace priority.
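
A rough sketch of the intended usage (the get/set function names below are hypothetical placeholders, not necessarily the names added by this PR; see the diff for the actual API):

```elixir
# Sketch only: `current_priority/0` and `update_priority/1` are placeholder
# names for the new get/set functions; check the PR diff for the real names.
defmodule MyApp.Tracer do
  use Spandex.Tracer, otp_app: :my_app
end

# Somewhere in request-handling code:
case MyApp.Tracer.current_priority() do
  nil ->
    # No upstream header set a priority, so this service is first in the
    # trace and may make its own sampling decision.
    MyApp.Tracer.update_priority(1)

  _upstream_decision ->
    # An upstream service already decided; keep its value.
    :ok
end
```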

Most of the details are specific to the back-end library and are handled by spandex_datadog in this PR:
spandex-project/spandex_datadog#52

Features:

  • Add functions to get and set trace priority
  • Allow Spandex.Tracer functions to be overridden
  • Add GitHub Action to run tests
  • Add :file and :line to test logs

Breaking Changes:

  • Don't set priority by default

Potentially Breaking Changes:

  • Use the new `Config` module instead of `Mix.Config` in config files (see the snippet after this list)
  • Update libraries
  • Require Elixir 1.11 or higher (was 1.7)
  • Use Elixir 1.11 image to run tests in CircleCI
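
For reference, the `Config` change just swaps the deprecated `Mix.Config` call at the top of each config file for the `Config` module introduced in Elixir 1.9:

```elixir
# config/config.exs

# Old style (Mix.Config, deprecated):
# use Mix.Config

# New style (Config, Elixir >= 1.9):
import Config

config :logger, level: :info
```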

Bug Fixes:

@GregMefford (Member) left a comment


Thanks so much for taking the time to put together such a thorough PR! ❤️
I took a quick pass through it and had some questions, so hopefully you have some time to discuss. If not, I will do some more research on my own as well and confirm one way or the other so we can get some/most of this merged. 🚀

### Breaking Changes:

* Don't set default priority to support distributed traces and sampling
* Use new `Config` instead of `Mix.Config` in `config` files
Member:

I don't think this is a breaking change at all in terms of people using Spandex, right? I just want to be sure I'm not missing something there.

Author:

The effect of not defaulting the priority to 1 is that the sampling metric (`_sampling_priority_v1`) won't be set when sending traces to Datadog. If it's not set, then Datadog will do whatever its default is, instead of considering the trace to be selected for sampling. The `x-datadog-sampling-priority` header won't be set for downstream requests, and downstream processes won't consider the request selected for sampling.

Not setting a sampling priority by default is unlikely to cause problems. On the other hand, setting it by default makes it difficult to actually use distributed sampling, resulting in higher Datadog bills.

@@ -5,6 +5,18 @@ See [Conventional Commits](Https://conventionalcommits.org) for commit guideline

<!-- changelog -->

### Breaking Changes:

* Don't set default priority to support distributed traces and sampling
Member:

I'm curious if you have any documentation references from Datadog about not deciding a value / sending out a sampling decision downstream unless one is received upstream or decided in the application. Shouldn't we always default to choose something, as opposed to leaving it nil? I like the idea of adding a probabilistic sampler instead of having it always hard-coded, but I'm curious if there's some reason that it makes sense for it to ever be nil, when there are already provisions for auto/manual keep/reject.

Author:

The Datadog docs are here: https://docs.datadoghq.com/tracing/trace_pipeline/ingestion_controls/
That is not super clear, however. I gleaned most of my knowledge by reading the Datadog client library for Ruby and doing things the same way.

The theory is this:

  • When we are doing "head sampling", the first process in the distributed trace makes a decision about whether this specific trace should be sampled. It could, for example, decide that only 10% of traces should be sampled: 10% of the time it would set priority to 1, and the rest of the time it would set it to 0. That means it has made a decision. It passes that decision to Datadog by setting `_sampling_priority_v1` when uploading the trace, and it communicates the decision to downstream processes by setting the `x-datadog-sampling-priority` HTTP header. Downstream processes will then consider the sampling decision already made and preserve it (see the sketch after this list).
  • If an error occurs in a trace that was otherwise not sampled (priority = 0), then the app can select it for tracing after the fact, by setting priority = 2. The Datadog model is that it receives all of the traces, then makes a decision about which ones to index or retain long term. So it can still include the complete trace across all apps, avoiding broken traces.
  • This model is different from throwing away traces when sampling, i.e. deciding not to do tracing at all or not to send the trace to Datadog. That reduces the overhead of tracing, but it results in broken traces in a distributed environment, and metrics based on counting spans become incorrect.
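
As a rough sketch of how that plays out in application code (the priority constants below follow Datadog's documented conventions; `update_priority/1` is again a placeholder name for this PR's setter):

```elixir
# Datadog's conventional sampling priority values:
# -1 = user reject, 0 = auto reject, 1 = auto keep, 2 = user keep.
defmodule MyApp.Sampling do
  @auto_reject 0
  @auto_keep 1
  @user_keep 2

  # Head sampling: keep roughly `rate` of traces when no upstream decision
  # exists (e.g. rate = 0.10 keeps about 10%).
  def head_decision(rate) do
    if :rand.uniform() <= rate, do: @auto_keep, else: @auto_reject
  end

  # Escalation: force a trace to be kept after the fact, e.g. on error.
  def keep!, do: @user_keep
end

# First process in the trace makes the decision:
MyApp.Tracer.update_priority(MyApp.Sampling.head_decision(0.10))

# Later, if the request errors, upgrade the priority so Datadog retains it:
MyApp.Tracer.update_priority(MyApp.Sampling.keep!())
```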

To answer your question, if we don't set a priority, the trace is still sent to Datadog, and Datadog will ingest it. Setting it to 1 by default means we are saying a decision has been made to sample this trace when no decision has actually been made. That can prevent a downstream process from making the sampling decision itself. For example, it's pretty common for the trace to be started in a load balancer that doesn't make the sampling decision. By leaving the decision unmade, something else can handle it.

Member:

Thanks for all the details here! I did read through those Datadog docs, but as you said, it's not really written for implementers, but just for general APM users to understand how it works and not all the details of how to implement it.

I can definitely see your point about not wanting to have a load-balancer make the sampling decision, but in the context of Spandex, wouldn't an Elixir application always want to make a decision one way or the other? If I understand correctly, there are a few possible scenarios:

  1. (Current situation) Spandex defaults to sampling priority 1 for all traces. It would be pretty easy for someone to build a mechanism that randomly chooses priority 0 or 1 based on a configurable probability, but that's not currently built into Spandex. Since Spandex has made a decision, that decision will be propagated to downstream services and also to the Datadog Agent. Because the decision is passed to the Datadog Agent, it's no longer possible to configure the sample rate in the Datadog Agent for all services, since the service is configuring it itself.
  2. (Proposed change in this PR) Spandex doesn't make any decision by default, leaving it up to the Datadog Agent configuration. It doesn't propagate a decision to downstream services or to the Datadog Agent unless a decision is made somehow in application code, or a decision is passed in from upstream. It's still possible to insert a mechanism to make the decision, as described in scenario 1 above, e.g. via a Plug. The downside of this scenario is that if someone upgrades their Spandex deps without adding a decision-making mechanism anywhere, I believe they would lose access to the trace analytics feature in Datadog altogether, and they would probably not know why. I haven't confirmed this, but I believe that's the reason we started to include the _sampling_priority_v1 and related metadata in the first place. I'll try to do some more experimentation to see how it behaves if you never send any spans with this metadata, but I currently don't have access to a Datadog testing environment because I'm waiting for them to renew the trial account that I use for development.
  3. (Alternate solution) Spandex always makes a sampling decision and defaults to sampling priority 1 for all traces for backward compatibility, but offers a new RateSampler like the one you've already proposed adding to spandex_datadog (a rough sketch follows this list). It might make sense to add this directly to spandex instead... I'd have to think about that. This would allow configurable probabilistic sampling per service (probably at the level of a Spandex.Tracer instance). The benefit here is that it's an additive feature on top of what exists today, and it could be as simple as always using the RateSampler but having it default to 100%, allowing the user to configure whatever smaller rate they want.
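
A minimal sketch of scenario 3's RateSampler idea (the module name and API are assumptions, not what this PR or spandex_datadog actually proposes):

```elixir
# Sketch only: defaulting the rate to 1.0 keeps today's "sample everything"
# behavior, while letting each service dial it down.
defmodule MyApp.RateSampler do
  @doc "Returns priority 1 (keep) with probability `rate`, otherwise 0 (reject)."
  def decide(rate \\ 1.0) when rate >= 0 and rate <= 1 do
    if :rand.uniform() <= rate, do: 1, else: 0
  end
end

MyApp.RateSampler.decide()     #=> 1 (backward-compatible default)
MyApp.RateSampler.decide(0.1)  #=> 1 about 10% of the time, otherwise 0
```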

Author:

The background is that we are spending crazy amounts on Datadog and need to implement sampling.
We need all decisions about sampling and span analytics to be explicit. The current defaults force everything on.

Using nil by default for priority allows us to distinguish when priority has been set by an upstream header.
A plug can then decide whether to set the value: if it has been set upstream by a header, use it; otherwise make a sampling decision (see the sketch below). If it defaults to 1, we have no way of knowing what happened.

I believe Datadog will behave reasonably if no priority is set when sending traces. The Ruby Datadog client library does not set `_sampling_priority_v1` if priority is not defined, and I think we should do the same.
The same applies to other philosophical questions: just do what the official Datadog clients do :-)
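
A hedged sketch of that kind of plug, assuming the upstream `x-datadog-sampling-priority` header has already been turned into the current trace priority by the tracing integration, and using placeholder `current_priority/0` / `update_priority/1` names for this PR's get/set API:

```elixir
# Sketch only: function names are placeholders, not the PR's actual API.
defmodule MyApp.Plug.SamplingDecision do
  @behaviour Plug

  @impl true
  def init(opts), do: Keyword.put_new(opts, :rate, 0.1)

  @impl true
  def call(conn, opts) do
    case MyApp.Tracer.current_priority() do
      nil ->
        # No x-datadog-sampling-priority came in upstream: we are first in
        # the trace, so make the sampling decision here.
        priority = if :rand.uniform() <= opts[:rate], do: 1, else: 0
        MyApp.Tracer.update_priority(priority)

      _upstream_decision ->
        # An upstream service already decided; preserve its value.
        :ok
    end

    conn
  end
end
```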

@zachdaniel (Member):
It's been a while, but @reachfh, have you tried this in production? Does it have the intended effect?
