Node graph displays incorrect values #4319

Open
joli-sys opened this issue Nov 13, 2024 · 3 comments
Comments

@joli-sys

joli-sys commented Nov 13, 2024

Description

The node graph in the Grafana Tempo plugin is showing incorrect values:

  • Response time values are incorrect

Steps to Reproduce

  1. Open Tempo node graph
  2. Observe response time values
  3. Compare with actual values

Expected Behavior

  • Node graph should display accurate values matching actual traffic
  • Average response time should match actual values

Current Behavior

  • Node graph shows dramatically higher ms/req values for response time

System Information

  • Grafana version: 10.1.10
    • Helm deployment (Helm chart version 8.6.0)
  • Tempo version (Tempo distributed): 2.6.0
    • Helm deployment (Helm chart version 1.21.1)
  • Browser: Arc Browser/Chrome

Additional Context

  • Screenshot of node graph showing incorrect values (image attached)

  • Trace metrics are correct in Tempo (image attached)

Possible Related Issues

@joe-elliott
Member

Can you check the underlying histograms to see if they agree with the service graph or not?

traces_service_graph_request_client_seconds

traces_service_graph_request_server_seconds
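
For reference (a sketch, not an exact query for your setup), the average latency per edge can be read straight from those histograms; this assumes the default client/server labels and the standard Prometheus _sum/_count pair:

```promql
# Average server-side latency per edge over the last 5 minutes, in seconds
sum by (client, server) (rate(traces_service_graph_request_server_seconds_sum[5m]))
  /
sum by (client, server) (rate(traces_service_graph_request_server_seconds_count[5m]))
```

If this already shows the same inflated numbers as the node graph, the problem is upstream of the Grafana visualization.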

@joli-sys
Author

> Can you check the underlying histograms to see if they agree with the service graph or not?
>
> traces_service_graph_request_client_seconds
>
> traces_service_graph_request_server_seconds

Hey @joe-elliott, thanks for replying.
It seems like these metrics do correlate with the service graph values.
Here is a graph with the average latency per request for the last 5 minutes (image attached):

It seems to me like traces_service_graph_request_client_seconds and traces_service_graph_request_server_seconds are actually in ms instead of seconds. We have other Prometheus metrics that track our endpoints' latency, and it never goes that high. I will try to investigate our setup further, but if you have any clue, I would really appreciate any tip.
Thanks a lot!

@joe-elliott
Member

> It seems to me like traces_service_graph_request_client_seconds and traces_service_graph_request_server_seconds are actually in ms instead of seconds.

I hope it's seconds. Your PromQL query is multiplying by 1000; is that causing the discrepancy? It would be interesting to see the p50 as well.
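
A p50 along these lines should work (a sketch; assumes the default bucket labels on the server-side histogram):

```promql
# p50 server-side latency per edge over the last 5 minutes, in seconds
histogram_quantile(0.5,
  sum by (le, client, server) (rate(traces_service_graph_request_server_seconds_bucket[5m]))
)
```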

> It seems like these metrics do correlate with the service graph values.

So this means that if there's an issue, it's related to the service graph processor/Tempo and not the Grafana visualization.

When we measure server and client "latency" we use the span duration. Here is where we record the information:

https://github.com/grafana/tempo/blob/main/modules/generator/processor/servicegraphs/servicegraphs.go#L377-L378

and here is where we set it:

https://github.com/grafana/tempo/blob/main/modules/generator/processor/servicegraphs/servicegraphs.go#L199
https://github.com/grafana/tempo/blob/main/modules/generator/processor/servicegraphs/servicegraphs.go#L218

Perhaps this definition of "latency" is unexpected or is causing the discrepancy?

I'd also look at the rate of these two counters:

tempo_metrics_generator_processor_service_graphs_expired_edges
tempo_metrics_generator_processor_service_graphs_edges

This will give you a sense of how many of your discovered edges are being expired without finding a suitable pair. Perhaps the issue is that only a small percentage is getting paired, which is skewing the results?
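
A quick way to check that ratio (a sketch, using the counters named above):

```promql
# Fraction of discovered edges that expired without finding a matching pair
sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges[5m]))
  /
sum(rate(tempo_metrics_generator_processor_service_graphs_edges[5m]))
```

A value close to 1 would mean most edges never get paired, so the latency histograms would be built from a small, possibly unrepresentative subset of requests.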

The service graphs config block has an option called wait that controls how long Tempo will wait for an edge's pair before giving up. Perhaps try increasing this value?

https://grafana.com/docs/tempo/latest/configuration/#metrics-generator
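
Roughly, the relevant block looks something like this (a sketch only; the 30s is just an illustrative value, check the linked docs and your Helm values layout for the exact structure):

```yaml
metrics_generator:
  processor:
    service_graphs:
      # How long to keep an unpaired edge around waiting for its matching
      # client/server span before expiring it.
      wait: 30s
```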
