max query expr #3

Closed
wants to merge 6 commits into from
2 changes: 2 additions & 0 deletions .github/workflows/dependabot_serverless_gomod.yml
@@ -44,6 +44,8 @@ jobs:
USER: x-access-token
TOKEN: ${{ steps.get-github-app-token.outputs.token }}
run: |
git config --global user.email "[email protected]"
git config --global user.name "tempo-gh-bot[bot]"
git config --global url."https://${USER}:${TOKEN}@github.com/grafana/tempo".insteadOf "https://github.com/grafana/tempo"
git add cmd/tempo-serverless/lambda/go.mod
git add cmd/tempo-serverless/lambda/go.sum
2 changes: 2 additions & 0 deletions .github/workflows/drone-signature-check.yml
@@ -12,6 +12,8 @@ on:

jobs:
drone-signature-check:
# only run in grafana/tempo.
if: github.repository == 'grafana/tempo'
uses: grafana/shared-workflows/.github/workflows/check-drone-signature.yaml@main
with:
drone_config_path: .drone/drone.yml
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,6 @@
## main / unreleased
* [FEATURE] tempo-cli: support dropping multiple traces in a single operation [#4266](https://github.com/grafana/tempo/pull/4266) (@ndk)
* [CHANGE] **BREAKING CHANGE** Add maximum spans per span set. Users can set `max_spans_per_span_set` to 0 to obtain the old behavior. [#4275](https://github.com/grafana/tempo/pull/4383) (@carles-grafana)
* [CHANGE] slo: include request cancellations within SLO [#4355](https://github.com/grafana/tempo/pull/4355) (@electron0zero)
  Request cancellations are exposed under the `result` label in `tempo_query_frontend_queries_total` and `tempo_query_frontend_queries_within_slo_total`, with `completed` or `canceled` values to differentiate between completed and canceled requests.
* [CHANGE] update default config values to better align with production workloads [#4340](https://github.com/grafana/tempo/pull/4340) (@electron0zero)
@@ -68,6 +69,7 @@
* [ENHANCEMENT] Reuse generator code to better refuse "too large" traces. [#4365](https://github.com/grafana/tempo/pull/4365) (@joe-elliott)
This will cause the ingester to more aggressively and correctly refuse traces. Also added two metrics to better track bytes consumed per tenant in the ingester:
`tempo_metrics_generator_live_trace_bytes` and `tempo_ingester_live_trace_bytes`.
* [BUGFIX] Handle invalid TraceQL query filter in tag values v2 disk cache [#4392](https://github.com/grafana/tempo/pull/4392) (@electron0zero)
* [BUGFIX] Replace hedged requests roundtrips total with a counter. [#4063](https://github.com/grafana/tempo/pull/4063) [#4078](https://github.com/grafana/tempo/pull/4078) (@galalen)
* [BUGFIX] Metrics generators: Correctly drop from the ring before stopping ingestion to reduce drops during a rollout. [#4101](https://github.com/grafana/tempo/pull/4101) (@joe-elliott)
* [BUGFIX] Correctly handle 400 Bad Request and 404 Not Found in gRPC streaming [#4144](https://github.com/grafana/tempo/pull/4144) (@mapno)
9 changes: 8 additions & 1 deletion docs/sources/tempo/configuration/_index.md
@@ -593,6 +593,10 @@ query_frontend:
# A list of regular expressions for refusing matching requests; these apply to every request regardless of the endpoint.
[url_deny_list: <list of strings> | default = <empty list>]

# Maximum allowed TraceQL expression size, in bytes. Queries larger than this size are rejected.
# (default: 128 KiB)
[max_query_expression_size_bytes: <int> | default = 131072]

search:

# The number of concurrent jobs to execute when searching the backend.
@@ -640,7 +644,10 @@ query_frontend:

# The number of shards to break ingester queries into.
[ingester_shards: <int> | default = 3]


# The maximum allowed number of spans per span set. 0 disables this limit.
[max_spans_per_span_set: <int> | default = 100]

# SLO configuration for Metadata (tags and tag values) endpoints.
metadata_slo:
# If set to a non-zero value, its value is used to decide whether a metadata query is within the SLO or not.
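To see how the two new limits sit in a running configuration, here is a minimal sketch of a `query_frontend` block. It only restates the documented defaults; the exact nesting of `max_spans_per_span_set` under `search` is inferred from the diff context above, so check the full reference before copying it.

```yaml
# Minimal query-frontend sketch using the options documented above.
query_frontend:
  # Reject TraceQL expressions larger than 128 KiB (the documented default).
  max_query_expression_size_bytes: 131072
  search:
    # Cap spans returned per span set; set to 0 to restore the old, unlimited behavior.
    max_spans_per_span_set: 100
```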
@@ -7,6 +7,17 @@ keywords:
title: Identify bottlenecks and establish SLOs
menuTitle: Identify bottlenecks and establish SLOs
weight: 320
refs:
metrics-generator:
- pattern: /docs/tempo/
destination: https://grafana.com/docs/tempo/<TEMPO_VERSION>/metrics-generator/
- pattern: /docs/enterprise-traces/
destination: https://grafana.com/docs/enterprise-traces/<ENTERPRISE_TRACES_VERSION>/metrics-generator/
span-metrics:
- pattern: /docs/tempo/
destination: https://grafana.com/docs/tempo/<TEMPO_VERSION>/metrics-generator/span_metrics/
- pattern: /docs/enterprise-traces/
destination: https://grafana.com/docs/enterprise-traces/<ENTERPRISE_TRACES_VERSION>/metrics-generator/span_metrics/
---

# Identify bottlenecks and establish SLOs
@@ -19,34 +30,34 @@ Handy Site Corp, a fake website company, runs an ecommerce application that incl

### Define realistic SLOs

Handy Site’s engineers start by establishing service level objectives (SLOs) around latency to ensure that customers have a good experience when trying to complete the checkout process.
To do this, they use metrics generated from their span data.

Their service level objective should be a realistic target based on historical performance during times of normal operation.
Once they've agreed upon their service level objective, they will set up alerts to warn them when they are at risk of failing to meet that objective.

### Utilize span metrics to define your SLO and SLI

After evaluating options, they decide to use [span metrics](https://grafana.com/docs/tempo/latest/metrics-generator/span_metrics/) as a service level indicator (SLI) to measure SLO compliance.
After evaluating options, they decide to use [span metrics](ref:span-metrics) as a service level indicator (SLI) to measure SLO compliance.

![Metrics generator and exemplars](/media/docs/tempo/intro/traces-metrics-gen-exemplars.png)

Tempo can generate metrics using the [metrics-generator component](https://grafana.com/docs/tempo/latest/metrics-generator/).
Tempo can generate metrics using the [metrics-generator component](ref:metrics-generator).
These metrics are created based on spans from incoming traces and demonstrate immediate usefulness with respect to application flow and overview.
This includes rate, error, and duration (RED) metrics.


Span metrics also make it easy to use exemplars.
An [exemplar](https://grafana.com/docs/grafana/latest/basics/exemplars/) serves as a detailed example of one of the observations aggregated into a metric. An exemplar contains the observed value together with an optional timestamp and arbitrary trace IDs, which are typically used to reference a trace.
An [exemplar](https://grafana.com/docs/grafana/<GRAFANA_VERSION>/basics/exemplars/) serves as a detailed example of one of the observations aggregated into a metric. An exemplar contains the observed value together with an optional timestamp and arbitrary trace IDs, which are typically used to reference a trace.
Since traces and metrics co-exist in the metrics-generator, exemplars can be automatically added to those metrics, allowing you to quickly jump from a metric showing aggregate latency over time into an individual trace that represents a low, medium, or high latency request. Similarly, you can quickly jump from a metric showing error rate over time into an individual erroring trace.
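To make the exemplar flow concrete, the following is a hedged sketch of a metrics-generator storage block that remote-writes the generated metrics, with exemplars, to a Prometheus-compatible backend. The URL and WAL path are placeholders, and the `remote_write` entries are assumed to follow the Prometheus remote-write schema.

```yaml
# Sketch: metrics-generator writing span metrics (and their exemplars)
# to a Prometheus-compatible store. URL and path are placeholders.
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true   # forward exemplars alongside the metric samples
```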

### Monitor latency

Handy Site decides they're most interested in monitoring the latency of requests processed by their checkout service and want to set an objective that 99.5% of requests in a given month should complete within 2 seconds.
To define a service level indicator (SLI) that they can use to track their progress against their objective, they use the `traces_spanmetrics_latency` metric with the proper label selectors, such as `service name = checkoutservice`.
The metrics-generator adds a default set of labels to the metrics it generates, including `span_kind` and `status_code`. However, if they were interested in calculating checkout service latency per endpoint or per version of the software, they could change the configuration of the Tempo metrics-generator to add these custom dimensions as labels to their spanmetrics.
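For example, extra span or resource attributes can be promoted to metric labels in the metrics-generator's span-metrics processor settings. This is only a sketch: the attribute names (`http.target`, `service.version`) are illustrative, not part of Handy Site's documented setup.

```yaml
# Sketch: promote additional span attributes to labels on the generated span metrics.
# Every extra dimension multiplies the number of active series, so add them sparingly.
metrics_generator:
  processor:
    span_metrics:
      dimensions:
        - http.target      # per-endpoint latency breakdown (illustrative)
        - service.version  # per-version latency breakdown (illustrative)
```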

With all of this in place, Handy Site now opens the [Grafana SLO](https://grafana.com/docs/grafana-cloud/alerting-and-irm/slo/) application and follows the setup flow to establish an [SLI](https://grafana.com/docs/grafana-cloud/alerting-and-irm/slo/create/) for their checkout service around the `traces_spanmetrics_latency` metric..
With all of this in place, Handy Site now opens the [Grafana SLO](https://grafana.com/docs/grafana-cloud/alerting-and-irm/slo/) application and follows the setup flow to establish an [SLI](https://grafana.com/docs/grafana-cloud/alerting-and-irm/slo/create/) for their checkout service around the `traces_spanmetrics_latency` metric.
They can now be alerted to degradations in service quality that directly impact their end user experience. SLO-based alerting also ensures that they don't suffer from noisy alerts. Alerts are only triggered when the value of the SLI is such that the team is in danger of missing their SLO.
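Before handing this to the Grafana SLO app, the SLI itself can be sketched as a Prometheus recording rule over the generated histogram. Treat this as an illustration only: the `le="2"` bucket, the `service="checkoutservice"` selector, and the rule name are assumptions about Handy Site's setup rather than anything prescribed by Tempo.

```yaml
# Sketch: ratio of checkout requests completing within 2 s (the SLI) over 5-minute windows.
groups:
  - name: checkout-latency-sli
    rules:
      - record: sli:checkout_latency_within_2s:ratio_rate5m
        expr: |
          sum(rate(traces_spanmetrics_latency_bucket{service="checkoutservice", le="2"}[5m]))
          /
          sum(rate(traces_spanmetrics_latency_count{service="checkoutservice"}[5m]))
```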

![Latency SLO dashboard](/media/docs/tempo/intro/traces-metrics-gen-SLO.png)
@@ -7,6 +7,12 @@ keywords:
title: Diagnose errors with traces
menuTitle: Diagnose errors with traces
weight: 400
refs:
traceql:
- pattern: /docs/tempo/
destination: https://grafana.com/docs/tempo/<TEMPO_VERSION>/traceql/
- pattern: /docs/enterprise-traces/
destination: https://grafana.com/docs/enterprise-traces/<ENTERPRISE_TRACES_VERSION>/traceql/
---

# Diagnose errors with traces
@@ -27,7 +33,7 @@ It’s imperative for the operations team at Handy Site to quickly troubleshoot

## Use TraceQL to query data

Tempo has a traces-first query language, [TraceQL](https://grafana.com/docs/tempo/latest/traceql/), that provides a unique toolset for selecting and searching tracing data. TraceQL can match traces based on span and resource attributes, duration, and ancestor<>descendant relationships. It also can compute aggregate statistics (e.g., `rate`) over a set of spans.
Tempo has a traces-first query language, [TraceQL](ref:traceql), that provides a unique toolset for selecting and searching tracing data. TraceQL can match traces based on span and resource attributes, duration, and ancestor<>descendant relationships. It also can compute aggregate statistics (e.g., `rate`) over a set of spans.

Handy Site’s services and applications are instrumented for tracing, so they can use TraceQL as a debugging tool. Using three TraceQL queries, the team identifies and validates the root cause of the issue.

@@ -50,7 +56,7 @@ Looking at the set of returned spans, the most concerning ones are those with th

The team decides to use structural operators to follow an error chain from the top-level `mythical-requester` service to any descendant spans that also have an error status.
Descendant spans are any spans descended from the parent span, such as a child or a further child at any depth.
Using this query, the team can pinpoint the downstream service that might be causing the issue. The query below says "Find me spans where `status = error` that are descendants of spans from the `mythical-requester` service that have status code `500`."

```traceql
{ resource.service.name = "mythical-requester" && span.http.status_code = 500 } >> { status = error }
@@ -68,14 +74,14 @@ Specifically, the service is passing a `null` value for a column in a database t
After identifying the specific cause of this internal server error,
the team wants to know if there are errors in any database operations other than the `null` `INSERT` error found above.
Their updated query uses a negated regular expression to find any spans where the database statement either doesn’t exist, or doesn’t start with an `INSERT` clause.
This should expose any other issues causing an internal server error and filter out the class of issues that they already diagnosed.

```traceql
{ resource.service.name = "mythical-requester" && span.http.status_code = 500 } >> { status = error && span.db.statement !~ "INSERT.*" }
```

This query yields no results, suggesting that the root cause of the issues the operations team are seeing is exclusively due to the failing database `INSERT` statement.
At this point, they can roll back to a known working version of the service, or deploy a fix to ensure that `null` data being passed to the service is rejected appropriately.
Once that is complete, the issue can be marked resolved and the Handy team's error rate SLI should return to acceptable levels.

![Empty query results](/media/docs/tempo/intro/traceql-no-results-handy-site.png)
4 changes: 2 additions & 2 deletions docs/sources/tempo/metrics-generator/active-series.md
@@ -16,9 +16,9 @@ These capabilities rely on a set of generated span metrics and service metrics.

Any spans that are ingested by Tempo can create many metric series. However, this doesn't mean that a new active series is created every time a span is ingested.

The number of active series generated depends on the label pairs generated from span data that are associated with the metrics, similar to other Prometheus-formated data.
The number of active series generated depends on the label pairs generated from span data that are associated with the metrics, similar to other Prometheus-formatted data.

For additional information, refer to the [Active series and DPM documentation](/docs/grafana-cloud/billing-and-usage/active-series-and-dpm/#active-series).
For additional information, refer to the [Active series and DPM documentation](https://grafana.com/docs/grafana-cloud/billing-and-usage/active-series-and-dpm/).

## Active series calculation

21 changes: 16 additions & 5 deletions docs/sources/tempo/metrics-generator/service-graph-view.md
@@ -5,6 +5,17 @@ description: Grafana's service graph view utilizes metrics generated by the metr
aliases:
- ./app-performance-mgmt # /docs/tempo/<TEMPO_VERSION>/metrics-generator/app-performance-mgmt
weight: 400
refs:
enable-service-graphs:
- pattern: /docs/tempo/
destination: https://grafana.com/docs/tempo/<TEMPO_VERSION>/metrics-generator/service_graphs/enable-service-graphs/
- pattern: /docs/enterprise-traces/
destination: https://grafana.com/docs/enterprise-traces/<ENTERPRISE_TRACES_VERSION>/metrics-generator/service_graphs/enable-service-graphs/
span-metrics:
- pattern: /docs/tempo/
destination: https://grafana.com/docs/tempo/<TEMPO_VERSION>/metrics-generator/span_metrics/
- pattern: /docs/enterprise-traces/
destination: https://grafana.com/docs/enterprise-traces/<ENTERPRISE_TRACES_VERSION>/metrics-generator/span_metrics/
---

# Service graph view
@@ -27,13 +38,13 @@ You have to enable span metrics and service graph generation on the Grafana back

To use the service graph view, you need:

* Tempo or Grafana Cloud Traces with either 1) the metrics generator enabled and configured or 2) Grafana Agent or Grafana Alloy enabled and configured to send data to a Prometheus-compatible metrics store
* [Services graphs]({{< relref "../metrics-generator/service_graphs/enable-service-graphs" >}}), which are enabled by default in Grafana
* [Span metrics]({{< relref "../metrics-generator/span_metrics#how-to-run" >}}) enabled in your Tempo data source configuration
* Tempo or Grafana Cloud Traces with either the metrics generator enabled and configured or Grafana Agent or Grafana Alloy enabled and configured to send data to a Prometheus-compatible metrics store
* [Service graphs](ref:enable-service-graphs), which are enabled by default in Grafana
* [Span metrics](ref:span-metrics) enabled in your Tempo data source configuration

The service graph view can be derived from metrics generated by either the metrics-generator or by Grafana Agent or Grafana Alloy.

For information on how to configure these features, refer to the [Grafana Tempo data sources documentation](/docs/grafana/latest/datasources/tempo/).
For information on how to configure these features, refer to the [Tempo data sources documentation](/docs/grafana/<GRAFANA_VERSION>/datasources/tempo/).
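As a rough sketch of what that data-source configuration can look like when provisioned from a file, the example below points the Tempo data source's service graph at a Prometheus-compatible data source. The names, URL, and UID are placeholders, not values from this repository.

```yaml
# Sketch: provisioning a Tempo data source whose service graph view reads
# the generated metrics from an existing Prometheus-compatible data source.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo:3200        # placeholder Tempo query endpoint
    jsonData:
      serviceMap:
        datasourceUid: prometheus # UID of the data source holding the span metrics
```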

## What does the service graph view show?

@@ -46,7 +57,7 @@ The service graph view provides a span metrics visualization (table) and service
You can select any information in the table that has an underline to show more detailed information.
You can also select any node in the service graph to display additional information.

![Service graph with extended informaiton](/media/docs/grafana/data-sources/tempo/query-editor/tempo-ds-query-service-graph-prom.png)
![Service graph with extended information](/media/docs/grafana/data-sources/tempo/query-editor/tempo-ds-query-service-graph-prom.png)

### Error rate example
