
Support distributed trace sampling #142

Open
wants to merge 34 commits into base: master

Conversation

@reachfh commented Aug 20, 2022

This PR makes changes to priority to support distributed tracing and sampling.

In distributed tracing, an upstream process may set an HTTP header indicating that the current request should be sampled. We need to determine if that header was present, and optionally make a decision about whether to sample the request if it is not already set. The application code may also need to increase priority for the trace later, e.g. if there is an error in the request.

This PR sets the default priority to nil instead of 1, allowing code to identify whether it has already been set.
The application can make a sampling decision if it is the first application in the trace; otherwise the existing value is preserved. This PR adds functions to get and set the current trace priority.
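
A rough sketch of the intended usage (the get/set function names below are hypothetical placeholders, not necessarily the names added by this PR; see the diff for the actual API):

```elixir
# Sketch only: `current_priority/0` and `update_priority/1` are placeholder
# names for the new get/set functions; check the PR diff for the real names.
defmodule MyApp.Tracer do
  use Spandex.Tracer, otp_app: :my_app
end

# Somewhere in request-handling code:
case MyApp.Tracer.current_priority() do
  nil ->
    # No upstream header set a priority, so this service is first in the
    # trace and may make its own sampling decision.
    MyApp.Tracer.update_priority(1)

  _upstream_decision ->
    # An upstream service already decided; keep its value.
    :ok
end
```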

Most of the details are specific to the back-end library and are handled by spandex_datadog in this PR:
spandex-project/spandex_datadog#52

Features:

  • Add functions to get and set trace priority
  • Allow Spandex.Tracer functions to be overridden
  • Add GitHub Action to run tests
  • Add :file and :line to test logs

Breaking Changes:

  • Don't set priority by default

Potentially Breaking Changes:

  • Use the new `Config` module instead of `Mix.Config` in config files (see the snippet after this list)
  • Update libraries
  • Require Elixir 1.11 or higher (was 1.7)
  • Use Elixir 1.11 image to run tests in CircleCI
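
For reference, the `Config` change just swaps the deprecated `Mix.Config` call at the top of each config file for the `Config` module introduced in Elixir 1.9:

```elixir
# config/config.exs

# Old style (Mix.Config, deprecated):
# use Mix.Config

# New style (Config, Elixir >= 1.9):
import Config

config :logger, level: :info
```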

Bug Fixes:

@GregMefford (Member) left a comment


Thanks so much for taking the time to put together such a thorough PR! ❤️
I took a quick pass through it and had some questions, so hopefully you have some time to discuss. If not, I will do some more research on my own as well and confirm one way or the other so we can get some/most of this merged. 🚀

### Breaking Changes:

* Don't set default priority to support distributed traces and sampling
* Use new `Config` instead of `Mix.Config` in `config` files
Member:

I don't think this is a breaking change at all in terms of people using Spandex, right? I just want to be sure I'm not missing something there.

Author:

The effect of not defaulting the priority to 1 is that the sampling metric (`_sampling_priority_v1`) won't be set when sending traces to Datadog. If it's not set, then Datadog will do whatever its default is, instead of considering the trace to be selected for sampling. The `x-datadog-sampling-priority` header won't be set for downstream requests, and downstream processes won't consider the request selected for sampling.

Not setting a sampling priority by default is unlikely to cause problems. On the other hand, setting it by default makes it difficult to actually use distributed sampling, resulting in higher Datadog bills.

@@ -5,6 +5,18 @@ See [Conventional Commits](Https://conventionalcommits.org) for commit guideline

<!-- changelog -->

### Breaking Changes:

* Don't set default priority to support distributed traces and sampling
Member:

I'm curious if you have any documentation references from Datadog about not deciding a value / sending out a sampling decision downstream unless one is received upstream or decided in the application. Shouldn't we always default to choose something, as opposed to leaving it nil? I like the idea of adding a probabilistic sampler instead of having it always hard-coded, but I'm curious if there's some reason that it makes sense for it to ever be nil, when there are already provisions for auto/manual keep/reject.

Author:

The Datadog docs are here: https://docs.datadoghq.com/tracing/trace_pipeline/ingestion_controls/
That is not super clear, however. I gleaned most of my knowledge by reading the Datadog client library for Ruby and doing things the same way.

The theory is this:

  • When we are doing "head sampling", the first process in the distributed trace makes a decision about whether this specific trace should be sampled. It could, for example, decide that only 10% of traces should be sampled: 10% of the time it would set priority to 1, and the rest of the time it would set it to 0. That means it has made a decision. It passes that decision to Datadog by setting `_sampling_priority_v1` when uploading the trace, and it communicates the decision to downstream processes by setting the `x-datadog-sampling-priority` HTTP header. Downstream processes will then consider the sampling decision already made and preserve it (see the sketch after this list).
  • If an error occurs in a trace that was otherwise not sampled (priority = 0), then the app can select it for tracing after the fact, by setting priority = 2. The Datadog model is that it receives all of the traces, then makes a decision about which ones to index or retain long term. So it can still include the complete trace across all apps, avoiding broken traces.
  • This model is different from throwing away traces when sampling, i.e. deciding not to do tracing at all or not to send the trace to Datadog. That reduces the overhead of tracing, but it results in broken traces in a distributed environment, and metrics based on counting spans become incorrect.
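
As a rough sketch of how that plays out in application code (the priority constants below follow Datadog's documented conventions; `update_priority/1` is again a placeholder name for this PR's setter):

```elixir
# Datadog's conventional sampling priority values:
# -1 = user reject, 0 = auto reject, 1 = auto keep, 2 = user keep.
defmodule MyApp.Sampling do
  @auto_reject 0
  @auto_keep 1
  @user_keep 2

  # Head sampling: keep roughly `rate` of traces when no upstream decision
  # exists (e.g. rate = 0.10 keeps about 10%).
  def head_decision(rate) do
    if :rand.uniform() <= rate, do: @auto_keep, else: @auto_reject
  end

  # Escalation: force a trace to be kept after the fact, e.g. on error.
  def keep!, do: @user_keep
end

# First process in the trace makes the decision:
MyApp.Tracer.update_priority(MyApp.Sampling.head_decision(0.10))

# Later, if the request errors, upgrade the priority so Datadog retains it:
MyApp.Tracer.update_priority(MyApp.Sampling.keep!())
```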

To answer your question, if we don't set a priority, the trace is still sent to Datadog, and Datadog will ingest it. Setting it to 1 by default means we are saying a decision has been made to sample this trace when no decision has actually been made. That can prevent a downstream process from making the sampling decision itself. For example, it's pretty common for the trace to be started in a load balancer that doesn't make the sampling decision. By leaving the decision unmade, something else can handle it.

Member:

Thanks for all the details here! I did read through those Datadog docs, but as you said, it's not really written for implementers, but just for general APM users to understand how it works and not all the details of how to implement it.

I can definitely see your point about not wanting to have a load-balancer make the sampling decision, but in the context of Spandex, wouldn't an Elixir application always want to make a decision one way or the other? If I understand correctly, there are a few possible scenarios:

  1. (Current situation) Spandex defaults to sampling priority 1 for all traces. It would be pretty easy for someone to build a mechanism that randomly chooses priority 0 or 1 based on a configurable probability, but that's not currently built into Spandex. Since Spandex has made a decision, that decision will be propagated to downstream services and also to the Datadog Agent. Because the decision is passed to the Datadog Agent, it's no longer possible to configure the sample rate in the Datadog Agent for all services, since the service is configuring it itself.
  2. (Proposed change in this PR) Spandex doesn't make any decision by default, leaving it up to the Datadog Agent configuration. It doesn't propagate a decision to downstream services or to the Datadog Agent unless a decision is made somehow in application code, or a decision is passed in from upstream. It's still possible to insert a mechanism to make the decision, as described in scenario 1 above, e.g. via a Plug. The downside of this scenario is that if someone upgrades their Spandex deps without adding a decision-making mechanism anywhere, I believe they would lose access to the trace analytics feature in Datadog altogether, and they would probably not know why. I haven't confirmed this, but I believe that's the reason we started to include the _sampling_priority_v1 and related metadata in the first place. I'll try to do some more experimentation to see how it behaves if you never send any spans with this metadata, but I currently don't have access to a Datadog testing environment because I'm waiting for them to renew the trial account that I use for development.
  3. (Alternate solution) Spandex always makes a sampling decision and defaults to sampling priority 1 for all traces for backward compatibility, but offers a new RateSampler like the one you've already proposed adding to spandex_datadog (a rough sketch follows this list). It might make sense to add this directly to spandex instead... I'd have to think about that. This would allow configurable probabilistic sampling per service (probably at the level of a Spandex.Tracer instance). The benefit here is that it's an additive feature on top of what exists today, and it could be as simple as always using the RateSampler but having it default to 100%, allowing the user to configure whatever smaller rate they want.
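
A minimal sketch of scenario 3's RateSampler idea (the module name and API are assumptions, not what this PR or spandex_datadog actually proposes):

```elixir
# Sketch only: defaulting the rate to 1.0 keeps today's "sample everything"
# behavior, while letting each service dial it down.
defmodule MyApp.RateSampler do
  @doc "Returns priority 1 (keep) with probability `rate`, otherwise 0 (reject)."
  def decide(rate \\ 1.0) when rate >= 0 and rate <= 1 do
    if :rand.uniform() <= rate, do: 1, else: 0
  end
end

MyApp.RateSampler.decide()     #=> 1 (backward-compatible default)
MyApp.RateSampler.decide(0.1)  #=> 1 about 10% of the time, otherwise 0
```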

Author:

The background is that we are spending crazy amounts on Datadog and need to implement sampling.
We need all decisions about sampling and span analytics to be explicit. The current defaults force everything on.

Using nil by default for priority allows us to distinguish when priority has been set by an upstream header.
A plug can then decide whether to set the value: if it has been set upstream by a header, use it; otherwise make a sampling decision (see the sketch below). If it defaults to 1, we have no way of knowing what happened.

I believe Datadog will behave reasonably if no priority is set when sending traces. The Ruby Datadog client library does not set `_sampling_priority_v1` if priority is not defined, and I think we should do the same.
The same applies to other philosophical questions: just do what the official Datadog clients do :-)
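
A hedged sketch of that kind of plug, assuming the upstream `x-datadog-sampling-priority` header has already been turned into the current trace priority by the tracing integration, and using placeholder `current_priority/0` / `update_priority/1` names for this PR's get/set API:

```elixir
# Sketch only: function names are placeholders, not the PR's actual API.
defmodule MyApp.Plug.SamplingDecision do
  @behaviour Plug

  @impl true
  def init(opts), do: Keyword.put_new(opts, :rate, 0.1)

  @impl true
  def call(conn, opts) do
    case MyApp.Tracer.current_priority() do
      nil ->
        # No x-datadog-sampling-priority came in upstream: we are first in
        # the trace, so make the sampling decision here.
        priority = if :rand.uniform() <= opts[:rate], do: 1, else: 0
        MyApp.Tracer.update_priority(priority)

      _upstream_decision ->
        # An upstream service already decided; preserve its value.
        :ok
    end

    conn
  end
end
```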

@zachdaniel (Member):
It's been a while, but @reachfh, have you tried this in production? Does it have the intended effect?
