
[NO-TICKET] Fix breaking application boot due to concurrency issue in tracing #4303

Merged 1 commit into master from ivoanjo/fix-concurrent-instrumented-integrations on Jan 22, 2025

Conversation


@ivoanjo ivoanjo commented Jan 20, 2025

What does this PR do?

This PR fixes #2851. As explained in the original bug report, there's a concurrency issue that can be triggered by the environment logger running concurrently with activating an instrumentation.

Motivation:

Fix a long-standing concurrency issue.

Change log entry

Yes. Fix breaking application boot due to concurrency issue in tracing

Additional Notes:

To fix this issue, I've made two changes:

  1. Introduced a mutex to protect `@instrumented_integrations`, ensuring no two threads can touch it at the same time

  2. Took advantage of the fact that `#instrumented_integrations` is marked private and only used by the environment logger and telemetry (i.e. read-only usage) to return a copy of the data

This way, we can safely iterate over the data while reconfiguration happens concurrently.

I believe the overhead of this lock is negligible, since we don't need to read this information very often.
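A minimal sketch of the two changes combined, assuming a simplified registry class (the class name and `instrument` writer here are illustrative, not the actual library code; only `instrumented_integrations` and the mutex/dup/freeze pattern come from the PR description):

```ruby
# Illustrative stand-in for the component that tracks active integrations
class IntegrationsRegistry
  def initialize
    @lock = Mutex.new
    @instrumented_integrations = {}
  end

  # Writers take the lock, so no two threads mutate the hash at once
  def instrument(name, integration)
    @lock.synchronize { @instrumented_integrations[name] = integration }
  end

  # Readers get a frozen snapshot: safe to iterate while writers run,
  # and accidental mutation of the copy raises instead of silently diverging
  def instrumented_integrations
    @lock.synchronize { @instrumented_integrations.dup.freeze }
  end
end
```

With this shape, a reader iterating its snapshot never observes a concurrent write, because writes only ever touch the original hash under the lock.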

How to test the change?

I wrote a reproducer to make the issue easy to observe.

The reproducer is in two parts -- `concurrent_bug_repo.rb`:

```ruby
require 'datadog'

Datadog.configure do |c|
  c.tracing.instrument(:http, split_by_domain: true)
end

$sync_with_logger = Queue.new

Thread.new do
  Datadog::Tracing::Diagnostics::EnvironmentLogger.collect_and_log!
  puts "Background thread finished!"
end

$sync_with_logger.pop

Datadog.configure do |c|
  c.tracing.instrument(:rake, split_by_domain: true)
end
```

and the following change to the environment logger:

```diff
diff --git a/lib/datadog/tracing/diagnostics/environment_logger.rb b/lib/datadog/tracing/diagnostics/environment_logger.rb
index d96cfa14c6..427861bff7 100644
--- a/lib/datadog/tracing/diagnostics/environment_logger.rb
+++ b/lib/datadog/tracing/diagnostics/environment_logger.rb
@@ -128,7 +128,15 @@ module Datadog
           end

           def collect_integrations_settings!
+            sync_once = $sync_with_logger
+
             instrumented_integrations.each_with_object({}) do |(name, integration), result|
+              if sync_once
+                sync_once << true
+                sync_once = nil
+                sleep 5
+              end
+
               integration.configuration.to_h.each do |setting, value|
                 next if setting == :tracer # Skip internal objects
```
That is, I injected a `sleep` in the environment logger and ran it on a background thread, triggering the exact same issue that had been plaguing us for years.

Running the reproducer without my PR triggers the issue; with my PR it no longer does, since the environment logger now gets its own copy of the instrumented integrations.
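The underlying hazard is Ruby's rule against structurally mutating a Hash while it is being iterated. A single-threaded sketch of the same failure mode (the reproducer triggers it across threads; the exact error seen in #2851 may differ):

```ruby
# Ruby raises if a new key is added to a Hash while it is being iterated --
# analogous to `Datadog.configure` adding an integration while the
# environment logger walks the integrations hash on another thread.
integrations = { http: :active }

error = nil
begin
  integrations.each do |_name, _integration|
    integrations[:rake] = :active # simulates the concurrent reconfiguration
  end
rescue RuntimeError => e
  error = e
end

puts error.message
```

Handing the iterating thread a frozen copy sidesteps this entirely: the writer mutates the original while the reader walks a private snapshot.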

@ivoanjo ivoanjo requested a review from a team as a code owner January 20, 2025 16:04
@github-actions github-actions bot added integrations Involves tracing integrations tracing labels Jan 20, 2025
@Strech Strech left a comment


LGTM, but I have a question: if we dup the return of `instrumented_integrations`, do we freeze it to show the intention that this set should never be modified (even though you can't modify the dup to affect the original)?


ivoanjo commented Jan 21, 2025

> do we freeze it to show an intention that this set should never be modified (even tho you can't modify the dup to affect original)

Yes, that's exactly the intention -- this method is documented as private, and as far as I could find, it's only used in our codebase for logging.

But... if somehow I missed a spot, or anyone out there has a weird patch set on top of the original method, this will flag the issue when they attempt to mutate it. (Without the freeze, they'd change the copy, which wouldn't affect the original and could introduce very subtle bugs.)
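A hypothetical minimal demo of that intent -- `dup` isolates the caller from the original, and `freeze` makes any stray mutation fail loudly rather than silently diverge:

```ruby
# Hypothetical data standing in for the instrumented integrations hash
original = { http: true }
copy = original.dup.freeze

# Mutating the frozen copy raises immediately instead of silently diverging
begin
  copy[:rake] = true
rescue FrozenError => e
  puts "mutation blocked: #{e.class}"
end

original[:rake] = true # the original stays mutable for reconfiguration
```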

@ivoanjo ivoanjo merged commit 97d8583 into master Jan 22, 2025
383 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/fix-concurrent-instrumented-integrations branch January 22, 2025 08:43
@github-actions github-actions bot added this to the 2.10.0 milestone Jan 22, 2025
Labels: integrations (Involves tracing integrations), tracing

Successfully merging this pull request may close these issues: ddtrace crashes application on load

4 participants