-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NO-TICKET] Fix breaking application boot due to concurrency issue in tracing #4303
[NO-TICKET] Fix breaking application boot due to concurrency issue in tracing #4303
Conversation
…cing **What does this PR do?** This PR fixes #2851 . As explained in the original bug report, there's a concurrency issue that can be triggered by the environment logger running concurrently with activating an instrumentation. **Motivation:** Fix a long-standing concurrency issue. **Additional Notes:** To fix this issue, I've made two changes: 1. Introduced a mutex to protect the `@instrumented_integrations`, thus making sure no two threads can be touching it at the same time 2. Took advantage of the fact that `#instrumented_integrations` was marked as private, and only being used by the environment logger and telemetry (e.g. read-only usage) to return a copy of the data. This way, we can safely iterate on the data while reconfiguration is happening concurrently. I believe the overhead of this lock is negligible, since we don't need to read this information very often. **How to test the change?** I wrote a reproducer to be able to see this issue easily. The reproducer is in two parts -- `concurrent_bug_repo.rb`: ```ruby require 'datadog' Datadog.configure do |c| c.tracing.instrument(:http, split_by_domain: true) end $sync_with_logger = Queue.new Thread.new do Datadog::Tracing::Diagnostics::EnvironmentLogger.collect_and_log! puts "Background thread finished!" end $sync_with_logger.pop Datadog.configure do |c| c.tracing.instrument(:rake, split_by_domain: true) end ``` and the following change to the environment logger: ```diff diff --git a/lib/datadog/tracing/diagnostics/environment_logger.rb b/lib/datadog/tracing/diagnostics/environment_logger.rb index d96cfa14c6..427861bff7 100644 --- a/lib/datadog/tracing/diagnostics/environment_logger.rb +++ b/lib/datadog/tracing/diagnostics/environment_logger.rb @@ -128,7 +128,15 @@ module Datadog end def collect_integrations_settings! + sync_once = $sync_with_logger + instrumented_integrations.each_with_object({}) do |(name, integration), result| + if sync_once + sync_once << true + sync_once = nil + sleep 5 + end + integration.configuration.to_h.each do |setting, value| next if setting == :tracer # Skip internal objects ``` That is, I specifically injected a `sleep` in the environment logger, and made it run on a background thread, triggering the exact same issue that had been plaguing us for years. Running the above change without my PR will trigger the issue, and with my PR it no longer does -- since the environment logger now gets its own copy of the instrumented integrations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I have a question, if we dup the return of instrumented_integrations
, do we freeze it to show an intention that this set should never be modified (even tho you can't modify the dup to affect original)
Yes, that's exactly the intention -- this method is documented as private and as far as I could find it it's only used in our codebase for logging. But... if somehow I missed a spot or anyone out there has a weird patch set on top of the original method, this will flag the issue when they attempt to mutate. (Without the freeze, they'd change the copy, which wouldn't affect the original and perhaps introduce very subtle bugs) |
What does this PR do?
This PR fixes #2851 . As explained in the original bug report, there's a concurrency issue that can be triggered by the environment logger running concurrently with activating an instrumentation.
Motivation:
Fix a long-standing concurrency issue.
Change log entry
Yes. Fix breaking application boot due to concurrency issue in tracing
Additional Notes:
To fix this issue, I've made two changes:
Introduced a mutex to protect the
@instrumented_integrations
, thus making sure no two threads can be touching it at the same timeTook advantage of the fact that
#instrumented_integrations
was marked as private, and only being used by the environment logger and telemetry (e.g. read-only usage) to return a copy of the data.This way, we can safely iterate on the data while reconfiguration is happening concurrently.
I believe the overhead of this lock is negligible, since we don't need to read this information very often.
How to test the change?
I wrote a reproducer to be able to see this issue easily.
The reproducer is in two parts --
concurrent_bug_repo.rb
:and the following change to the environment logger:
That is, I specifically injected a
sleep
in the environment logger, and made it run on a background thread, triggering the exact same issue that had been plaguing us for years.Running the above change without my PR will trigger the issue, and with my PR it no longer does -- since the environment logger now gets its own copy of the instrumented integrations.