How do we emit internal telemetry that works with existing Datadog Agent operational tooling? #118

tobz · 2024-07-17T18:32:23Z

At a high level, both the Datadog Agent and Agent Data Plane/Saluki emit internal telemetry used for debugging performance issues and understanding their operational state. However, the naming differs significantly between the two, even for metrics that are functionally identical. This makes it challenging to use ADP, as it currently exists, as a drop-in replacement for DSD support in the core Agent.

The metric prefix we use when emitting internal metrics is configurable at the tippity top when initializing the metrics subsystem via saluki_app::metrics::initialize_metrics, so that's fine... but how do we line up individual metrics with their spiritual equivalent in the Datadog Agent?

This is a problem we need to solve if we hope to have ADP replace DSD in the core Agent.

tobz · 2024-07-17T18:41:56Z

One idea: metric remapping.

Conceptually, specific components in Saluki map to specific components in the core Agent. For example, the DogStatsD source in ADP is the dogstatsd component in the Datadog Agent, and the Datadog Metrics destination in ADP is the defaultforwarder component in the Datadog Agent. If we included the component type in internal metrics (e.g., metrics from the Datadog Metrics destination have a component_type tag with a value of datadog_metrics), we could conceivably use that to remap metrics to their Datadog Agent equivalent.

For example, datadog.agent.transactions.errors in the Datadog Agent is used to track "transaction errors", which occur when the default forwarder fails to send a request to the Datadog intake. The error_type tag indicates the specific type of error. Similarly, on the Saluki side, the Datadog Metrics destination emits a component_errors_total metric, with an error_type tag that has a value of http_send, when we fail to send a request.

Since we should expect to only have one Datadog Metrics destination running in ADP, we could conceivably map all instances of component_errors_total, where component_type is equal to datadog_metrics, to datadog.agent.transactions.errors, and potentially map the error_type tag as well.

We could likely do this pretty simply with a dedicated transform that remaps metric names, perhaps one even designed solely for remapping to Datadog Agent-equivalent metric names. Biggest downside, I think, is just the general aspect of us having to maintain this mapping in the first place rather than doing it by default.
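To make the remapping idea concrete, here is a minimal sketch of what a rename-style rule could look like. Everything in it (the `Metric` struct, `RemapRule`, `remap_in_place`) is hypothetical and invented for illustration; none of it is Saluki's actual API, and a real transform would operate on Saluki's own metric types.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for an internal metric (hypothetical, not a Saluki type).
#[derive(Clone, Debug, PartialEq)]
struct Metric {
    name: String,
    tags: BTreeMap<String, String>,
    value: f64,
}

/// Hypothetical rule: match on metric name plus `component_type`, then rename.
struct RemapRule {
    source_name: &'static str,
    component_type: &'static str,
    target_name: &'static str,
}

impl RemapRule {
    /// True when both the metric name and its `component_type` tag match.
    fn matches(&self, metric: &Metric) -> bool {
        metric.name == self.source_name
            && metric.tags.get("component_type").map(String::as_str)
                == Some(self.component_type)
    }
}

/// Renames the metric using the first matching rule.
/// Returns whether a rename happened; tags are left untouched.
fn remap_in_place(rules: &[RemapRule], metric: &mut Metric) -> bool {
    for rule in rules {
        if rule.matches(metric) {
            metric.name = rule.target_name.to_string();
            return true;
        }
    }
    false
}
```

With a single rule mapping component_errors_total (where component_type is datadog_metrics) to datadog.agent.transactions.errors, a matching metric is renamed and keeps its error_type tag, while the same metric name from any other component type passes through unchanged. The table of rules is exactly the mapping we would have to maintain by hand.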

tobz · 2024-07-17T18:46:07Z

Another idea: change all points where we register metrics to also register Datadog Agent-specific versions.

Essentially, we would emit duplicate metrics -- a generically-named one for "pure" Saluki usage, and a Datadog Agent-specific one -- and that way anything using Saluki that wasn't ADP could have the more generic/flexible metric names, and ADP could still emit the Datadog Agent-specific metric names to meet our goal of being drop-in compatible.

This, obviously, means emitting more telemetry than absolutely necessary. If we really didn't want to do that, we could also have a transform for filtering out the generically-named metrics, leaving only the Datadog Agent-specific ones. We could also, perhaps, try and do something where we have a toggle for emitting the Saluki or Datadog Agent version... but threading that state all through Saluki would be very ugly.
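As a rough sketch of the duplicate-registration idea, assuming a toy in-memory counter store rather than the real metrics subsystem: `Counters`, `DualCounter`, and `retain_agent_metrics` are all hypothetical names made up for this example.

```rust
use std::collections::BTreeMap;

/// Toy in-memory counter store standing in for a real metrics recorder (hypothetical).
#[derive(Default)]
struct Counters {
    values: BTreeMap<String, u64>,
}

impl Counters {
    /// Adds `by` to the named counter, creating it at zero if needed.
    fn increment(&mut self, name: &str, by: u64) {
        *self.values.entry(name.to_string()).or_insert(0) += by;
    }
}

/// A counter registered under both a generic name and a Datadog Agent-specific name.
struct DualCounter {
    generic_name: &'static str,
    agent_name: &'static str,
}

impl DualCounter {
    /// Every update is written twice: once per name.
    fn increment(&self, counters: &mut Counters, by: u64) {
        counters.increment(self.generic_name, by);
        counters.increment(self.agent_name, by);
    }
}

/// Optional filtering step: keep only metrics whose name starts with the Agent prefix.
fn retain_agent_metrics(counters: &mut Counters, agent_prefix: &str) {
    counters.values.retain(|name, _| name.starts_with(agent_prefix));
}
```

Every registration site has to know both names up front, which is the main cost of this approach compared to a central remapping table. The filter shows how the generically-named copies could be dropped afterwards if the extra volume matters.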

tobz · 2024-09-26T16:28:35Z

We've ended up taking an approach (in #240) which blends the two ideas previously mentioned: a dedicated component that "remaps" specific configured metrics, but does so by emitting additional metrics.

This was chosen in order to allow us to keep the more intentionally abstract/generic Saluki telemetry, such that we could potentially design a more complete "Agent Data Plane Health" dashboard, while also meeting our goals around operational continuity and emitting equivalent telemetry to ensure existing Agent dashboards, focused on DSD, still work for ADP-based Agent deployments.
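For illustration only, the "remap by emitting additional metrics" behavior could be sketched roughly as below. This is not the code from #240; the `Metric`, `RemapRule`, and `remap_additively` names are hypothetical, and the real component works with Saluki's own types and configuration.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for an internal metric (hypothetical, not a Saluki type).
#[derive(Clone, Debug, PartialEq)]
struct Metric {
    name: String,
    tags: BTreeMap<String, String>,
    value: f64,
}

/// Hypothetical rule: if name and all required tags match, emit a renamed copy.
struct RemapRule {
    source_name: &'static str,
    required_tags: &'static [(&'static str, &'static str)],
    target_name: &'static str,
}

impl RemapRule {
    /// Returns the Agent-named copy when the rule matches, otherwise `None`.
    fn apply(&self, metric: &Metric) -> Option<Metric> {
        if metric.name != self.source_name {
            return None;
        }
        let tags_match = self
            .required_tags
            .iter()
            .all(|(key, value)| metric.tags.get(*key).map(String::as_str) == Some(*value));
        if !tags_match {
            return None;
        }
        let mut remapped = metric.clone();
        remapped.name = self.target_name.to_string();
        Some(remapped)
    }
}

/// Passes every input metric through unchanged and appends any remapped copies after it.
fn remap_additively(rules: &[RemapRule], input: Vec<Metric>) -> Vec<Metric> {
    let mut output = Vec::with_capacity(input.len());
    for metric in input {
        let extras: Vec<Metric> = rules.iter().filter_map(|rule| rule.apply(&metric)).collect();
        output.push(metric);
        output.extend(extras);
    }
    output
}
```

The key difference from a pure rename is that the original, generically-named metric always survives, so both the Saluki-oriented and the Agent-oriented dashboards can be fed from the same stream.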

We're keeping this issue open for now, however, to signify that we don't yet remap all of the internal telemetry that is emitted by the DogStatsD or aggregator components in the Agent.

tobz · 2024-12-13T16:50:29Z

At this point, we remap the majority of DogStatsD-specific telemetry that is relevant to day-to-day debug and operations, so we'll close this issue for now and open additional issues for follow-ups around remapping additional bits of telemetry.

tobz changed the title ~~How do we emit internal telemetry that works with existing Datadog Agent operational tooling?~~ meta: how do we emit internal telemetry that works with existing Datadog Agent operational tooling? Jul 17, 2024

tobz changed the title ~~meta: how do we emit internal telemetry that works with existing Datadog Agent operational tooling?~~ How do we emit internal telemetry that works with existing Datadog Agent operational tooling? Jul 17, 2024

tobz added type/meta Things that can't be neatly categorized and/or aren't yet fully-formed ideas/thoughts. area/observability Internal observability of ADP and Saluki. effort/intermediate Involves changes that can be worked on by non-experts but might require guidance. labels Jul 17, 2024

tobz added this to the ADP v0.2: Staging Ahoy milestone Aug 13, 2024

tobz mentioned this issue Sep 11, 2024

[APR-207] chore: experimental approach to doing internal telemetry remapping #240

Merged

tobz closed this as completed Dec 13, 2024
