Node graph displays incorrect values #4319

Open
joli-sys opened this issue Nov 13, 2024 · 3 comments
Comments

@joli-sys

joli-sys commented Nov 13, 2024

Description

The node graph in the Grafana Tempo plugin is showing incorrect values:

  • Response time values are incorrect

Steps to Reproduce

  1. Open Tempo node graph
  2. Observe response time values
  3. Compare with actual values

Expected Behavior

  • Node graph should display accurate values matching actual traffic
  • Average response time should match actual values

Current Behavior

  • Node graph shows dramatically higher ms/req values for response time

System Information

  • Grafana version: 10.1.10
    • Helm deployment (Helm chart version 8.6.0)
  • Tempo version (Tempo distributed): 2.6.0
    • Helm deployment (Helm chart version 1.21.1)
  • Browser: Arc Browser/Chrome

Additional Context

  • Screenshot of node graph showing incorrect values (image attached)

  • Trace metrics are correct in Tempo (image attached)

Possible Related Issues

@joe-elliott
Member

Can you check the underlying histograms to see if they agree with the service graph or not?

traces_service_graph_request_client_seconds

traces_service_graph_request_server_seconds
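
For reference (a sketch, not an exact query for your setup), the average latency per edge can be read straight from those histograms; this assumes the default client/server labels and the standard Prometheus _sum/_count pair:

```promql
# Average server-side latency per edge over the last 5 minutes, in seconds
sum by (client, server) (rate(traces_service_graph_request_server_seconds_sum[5m]))
  /
sum by (client, server) (rate(traces_service_graph_request_server_seconds_count[5m]))
```

If this already shows the same inflated numbers as the node graph, the problem is upstream of the Grafana visualization.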

@joli-sys
Author

> Can you check the underlying histograms to see if they agree with the service graph or not?
>
> traces_service_graph_request_client_seconds
>
> traces_service_graph_request_server_seconds

Hey @joe-elliott, thanks for replying.
It seems like these metrics do correlate with the service graph values.
Here is a graph with the average latency per request for the last 5 minutes (image attached):

It seems to me like traces_service_graph_request_client_seconds and traces_service_graph_request_server_seconds are actually in ms instead of seconds. We have other Prometheus metrics that track our endpoints' latency, and it never goes that high. I will try to investigate our setup further, but if you have any clue, I would really appreciate any tip.
Thanks a lot!

@joe-elliott
Member

> It seems to me like traces_service_graph_request_client_seconds and traces_service_graph_request_server_seconds are actually in ms instead of seconds.

I hope it's seconds. Your PromQL query is multiplying by 1000; is that causing the discrepancy? It would be interesting to see the p50 as well.
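
A p50 along these lines should work (a sketch; assumes the default bucket labels on the server-side histogram):

```promql
# p50 server-side latency per edge over the last 5 minutes, in seconds
histogram_quantile(0.5,
  sum by (le, client, server) (rate(traces_service_graph_request_server_seconds_bucket[5m]))
)
```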

> It seems like these metrics do correlate with the service graph values.

So this means that if there's an issue, it's related to the service graph processor/Tempo and not the Grafana visualization.

When we measure server and client "latency" we use the span duration. Here is where we record the information:

https://github.com/grafana/tempo/blob/main/modules/generator/processor/servicegraphs/servicegraphs.go#L377-L378

and here is where we set it:

https://github.com/grafana/tempo/blob/main/modules/generator/processor/servicegraphs/servicegraphs.go#L199
https://github.com/grafana/tempo/blob/main/modules/generator/processor/servicegraphs/servicegraphs.go#L218

Perhaps this definition of "latency" is unexpected or is causing the discrepancy?

I'd also look at the rate of these two counters:

tempo_metrics_generator_processor_service_graphs_expired_edges
tempo_metrics_generator_processor_service_graphs_edges

This will give you a sense of how many of your discovered edges are being expired without finding a suitable pair. Perhaps the issue is that only a small percentage is getting paired, which is skewing the results?
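
A quick way to check that ratio (a sketch, using the counters named above):

```promql
# Fraction of discovered edges that expired without finding a matching pair
sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges[5m]))
  /
sum(rate(tempo_metrics_generator_processor_service_graphs_edges[5m]))
```

A value close to 1 would mean most edges never get paired, so the latency histograms would be built from a small, possibly unrepresentative subset of requests.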

The service graphs config block has an option called wait that controls how long Tempo will wait for an edge's pair before giving up. Perhaps try increasing this value?

https://grafana.com/docs/tempo/latest/configuration/#metrics-generator
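
Roughly, the relevant block looks something like this (a sketch only; the 30s is just an illustrative value, check the linked docs and your Helm values layout for the exact structure):

```yaml
metrics_generator:
  processor:
    service_graphs:
      # How long to keep an unpaired edge around waiting for its matching
      # client/server span before expiring it.
      wait: 30s
```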
