How do we emit internal telemetry that works with existing Datadog Agent operational tooling? #118

Closed
tobz opened this issue Jul 17, 2024 · 4 comments
Labels
area/observability Internal observability of ADP and Saluki. effort/intermediate Involves changes that can be worked on by non-experts but might require guidance. type/meta Things that can't be neatly categorized and/or aren't yet fully-formed ideas/thoughts.

Comments

@tobz
Member

tobz commented Jul 17, 2024

At a high level, both the Datadog Agent and Agent Data Plane/Saluki emit internal telemetry used for debugging performance issues and understanding their operational state. However, metric names differ substantially between the two, even for metrics that are functionally identical. This makes it challenging to use ADP, as it currently exists, as a drop-in replacement for DogStatsD (DSD) support in the core Agent.

The metric prefix we use when emitting internal metrics is configurable at the tippity top when initializing the metrics subsystem via saluki_app::metrics::initialize_metrics, so that's fine... but how do we line up individual metrics with their spiritual equivalent in the Datadog Agent?

This is a problem we need to solve if we hope to have ADP replace DSD in the core Agent.

@tobz
Member Author

tobz commented Jul 17, 2024

One idea: metric remapping.

Conceptually, specific components in Saluki map to specific components in the core Agent. For example, the DogStatsD source in ADP is the dogstatsd component in the Datadog Agent, and the Datadog Metrics destination in ADP is the defaultforwarder component in the Datadog Agent. If we included the component type in internal metrics (e.g., metrics from the Datadog Metrics destination have a component_type tag with a value of datadog_metrics), we could conceivably use that to remap metrics to their Datadog Agent equivalent.

For example, datadog.agent.transactions.errors in the Datadog Agent is used to track "transaction errors", which occur when the default forwarder fails to send a request to the Datadog intake. The error_type tag indicates the specific type of error. Similarly, on the Saluki side, the Datadog Metrics destination emits a component_errors_total metric, with an error_type tag that has a value of http_send, when we fail to send a request.

Since we should expect to only have one Datadog Metrics destination running in ADP, we could conceivably map all instances of component_errors_total, where component_type is equal to datadog_metrics, to agent.transactions.errors, and potentially map the error_type tag as well.

We could likely do this pretty simply with a dedicated transform that remaps metric names, perhaps even one designed solely for remapping to Datadog Agent-equivalent metric names. The biggest downside, I think, is that we would have to maintain this mapping ourselves rather than emitting the right names by default.
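A minimal sketch of what such a remapping step could look like, in Rust. The Metric and RemapRule types and the remap function are hypothetical stand-ins for illustration and are not Saluki's actual types or transform API. The rule matches on the metric name plus the component_type tag and renames the metric in place, leaving other tags such as error_type untouched.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified metric: a name, string tags, and a value.
/// This is not Saluki's real metric type.
#[derive(Clone, Debug, PartialEq)]
struct Metric {
    name: String,
    tags: BTreeMap<String, String>,
    value: f64,
}

impl Metric {
    fn new(name: &str, tags: &[(&str, &str)], value: f64) -> Self {
        Metric {
            name: name.to_string(),
            tags: tags
                .iter()
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect(),
            value,
        }
    }
}

/// Hypothetical rule: match on metric name and `component_type`, then rename.
struct RemapRule {
    source_name: &'static str,
    component_type: &'static str,
    target_name: &'static str,
}

impl RemapRule {
    fn matches(&self, metric: &Metric) -> bool {
        metric.name == self.source_name
            && metric.tags.get("component_type").map(String::as_str)
                == Some(self.component_type)
    }
}

/// Renames the metric in place using the first matching rule.
/// Returns true if a rule matched, false if the metric was left unchanged.
fn remap(metric: &mut Metric, rules: &[RemapRule]) -> bool {
    for rule in rules {
        if rule.matches(metric) {
            metric.name = rule.target_name.to_string();
            return true;
        }
    }
    false
}
```

With a single rule mapping component_errors_total (component_type = datadog_metrics) to datadog.agent.transactions.errors, errors from the Datadog Metrics destination would be renamed, while component_errors_total from any other component would pass through unchanged.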

@tobz
Member Author

tobz commented Jul 17, 2024

Another idea: change all points where we register metrics to also register Datadog Agent-specific versions.

Essentially, we would emit duplicate metrics -- a generically-named one for "pure" Saluki usage, and a Datadog Agent-specific one -- and that way anything using Saluki that wasn't ADP could have the more generic/flexible metric names, and ADP could still emit the Datadog Agent-specific metric names to meet our goal of being drop-in compatible.

This, obviously, means emitting more telemetry than absolutely necessary. If we really didn't want to do that, we could also have a transform for filtering out the generically-named metrics, leaving only the Datadog Agent-specific ones. We could also, perhaps, try and do something where we have a toggle for emitting the Saluki or Datadog Agent version... but threading that state all through Saluki would be very ugly.
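To make the trade-off concrete, here is a rough Rust sketch of dual registration plus the optional filter. Recorder, DualCounter, and keep_agent_only are hypothetical names made up for this example; a real implementation would go through whatever metrics backend Saluki actually uses.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory recorder standing in for a real metrics backend.
#[derive(Default)]
struct Recorder {
    counters: BTreeMap<String, u64>,
}

impl Recorder {
    fn increment(&mut self, name: &str, by: u64) {
        *self.counters.entry(name.to_string()).or_insert(0) += by;
    }
}

/// A counter registered under two names: the generic Saluki name and the
/// Datadog Agent-specific name. Every update is written to both.
struct DualCounter {
    generic_name: &'static str,
    agent_name: &'static str,
}

impl DualCounter {
    fn increment(&self, recorder: &mut Recorder, by: u64) {
        recorder.increment(self.generic_name, by);
        recorder.increment(self.agent_name, by);
    }
}

/// Optional filtering step: keep only the Agent-specific counters and drop
/// the generically-named duplicates.
fn keep_agent_only(recorder: &Recorder, agent_names: &[&str]) -> BTreeMap<String, u64> {
    recorder
        .counters
        .iter()
        .filter(|(name, _)| agent_names.contains(&name.as_str()))
        .map(|(name, value)| (name.clone(), *value))
        .collect()
}
```

The duplication cost shows up directly: each increment performs two writes, and the filter needs its own list of Agent-specific names, which is effectively the same mapping that the remapping idea has to maintain.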

@tobz tobz changed the title How do we emit internal telemetry that works with existing Datadog Agent operational tooling? meta: how do we emit internal telemetry that works with existing Datadog Agent operational tooling? Jul 17, 2024
@tobz tobz changed the title meta: how do we emit internal telemetry that works with existing Datadog Agent operational tooling? How do we emit internal telemetry that works with existing Datadog Agent operational tooling? Jul 17, 2024
@tobz tobz added type/meta Things that can't be neatly categorized and/or aren't yet fully-formed ideas/thoughts. area/observability Internal observability of ADP and Saluki. effort/intermediate Involves changes that can be worked on by non-experts but might require guidance. labels Jul 17, 2024
@tobz tobz added this to the ADP v0.2: Staging Ahoy milestone Aug 13, 2024
@tobz
Member Author

tobz commented Sep 26, 2024

We've ended up taking an approach (in #240) which blends the two ideas previously mentioned: a dedicated component that "remaps" specific configured metrics, but does so by emitting additional metrics.

This was chosen in order to allow us to keep the more intentionally abstract/generic Saluki telemetry, such that we could potentially design a more complete "Agent Data Plane Health" dashboard, while also meeting our goals around operational continuity and emitting equivalent telemetry to ensure existing Agent dashboards, focused on DSD, still work for ADP-based Agent deployments.
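For readers skimming the thread, the shape of that blended approach can be sketched roughly as follows. This is not the code from #240: Metric, Rule, and remap_additive are hypothetical names, and the sketch only illustrates "remap by emitting an additional metric while keeping the original".

```rust
/// Hypothetical, simplified metric; not Saluki's real metric type.
#[derive(Clone, Debug, PartialEq)]
struct Metric {
    name: String,
    tags: Vec<(String, String)>,
    value: f64,
}

/// Hypothetical configured rule: which metric to match and what to call the copy.
struct Rule {
    source_name: &'static str,
    component_type: &'static str,
    target_name: &'static str,
}

/// Passes every input metric through unchanged and, for each matching rule,
/// also emits a renamed copy directly after the original.
fn remap_additive(input: &[Metric], rules: &[Rule]) -> Vec<Metric> {
    let mut output = Vec::with_capacity(input.len());
    for metric in input {
        // The generic Saluki metric is always kept.
        output.push(metric.clone());
        for rule in rules {
            let component_matches = metric.tags.iter().any(|(k, v)| {
                k.as_str() == "component_type" && v.as_str() == rule.component_type
            });
            if metric.name == rule.source_name && component_matches {
                // The Agent-equivalent metric is emitted in addition.
                let mut extra = metric.clone();
                extra.name = rule.target_name.to_string();
                output.push(extra);
            }
        }
    }
    output
}
```

Compared with renaming in place, the generic metric survives for Saluki-focused dashboards, and only the configured metrics are duplicated rather than everything.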

We're keeping this issue open for now, however, to signify that we don't yet remap all of the internal telemetry that is emitted by the DogStatsD or aggregator components in the Agent.

@tobz
Member Author

tobz commented Dec 13, 2024

At this point, we remap the majority of DogStatsD-specific telemetry that is relevant to day-to-day debug and operations, so we'll close this issue for now and open additional issues for follow-ups around remapping additional bits of telemetry.

@tobz tobz closed this as completed Dec 13, 2024