From ff8584c0187246087c2086a03be30e242c458bb1 Mon Sep 17 00:00:00 2001 From: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com> Date: Wed, 30 Oct 2024 10:30:47 -0400 Subject: [PATCH] [DOC] Update metrics query docs with examples, more details (#4248) * Update metrics query doc with more examples * Update content from doc session * Apply suggestions from code review Co-authored-by: Jennifer Villa Co-authored-by: Joe Elliott * Restructure and update metrics queries * Updates to meet review comments * fix paragraph * Updates for conflict --------- Co-authored-by: Jennifer Villa Co-authored-by: Joe Elliott (cherry picked from commit 532f83f9338542c4f63052b104e6c1205e8b6f36) --- docs/sources/tempo/api_docs/_index.md | 23 +- .../sources/tempo/metrics-generator/_index.md | 2 +- .../tempo/operations/traceql-metrics.md | 40 ++++ docs/sources/tempo/traceql/metrics-queries.md | 160 ------------- .../tempo/traceql/metrics-queries/_index.md | 87 +++++++ .../traceql/metrics-queries/functions.md | 219 ++++++++++++++++++ .../solve-problems-metrics-queries.md | 67 ++++++ 7 files changed, 428 insertions(+), 170 deletions(-) delete mode 100644 docs/sources/tempo/traceql/metrics-queries.md create mode 100644 docs/sources/tempo/traceql/metrics-queries/_index.md create mode 100644 docs/sources/tempo/traceql/metrics-queries/functions.md create mode 100644 docs/sources/tempo/traceql/metrics-queries/solve-problems-metrics-queries.md diff --git a/docs/sources/tempo/api_docs/_index.md b/docs/sources/tempo/api_docs/_index.md index af70906bbce..4913ca2f718 100644 --- a/docs/sources/tempo/api_docs/_index.md +++ b/docs/sources/tempo/api_docs/_index.md @@ -31,7 +31,7 @@ For externally supported GRPC API, [see below](#tempo-grpc-api). | [Search tag names V2](#search-tags-v2) | Query-frontend | HTTP | `GET /api/v2/search/tags` | | [Search tag values](#search-tag-values) | Query-frontend | HTTP | `GET /api/search/tag//values` | | [Search tag values V2](#search-tag-values-v2) | Query-frontend | HTTP | `GET /api/v2/search/tag//values` | -| [TraceQL Metrics](#traceql-metrics) | Query-frontend | HTTP | `GET /api/metrics/query_range` | +| [TraceQL Metrics](#traceql-metrics) | Query-frontend | HTTP | `GET /api/metrics/query_range` | | [TraceQL Metrics (instant)](#instant) | Query-frontend | HTTP | `GET /api/metrics/query` | | [Query Echo Endpoint](#query-echo-endpoint) | Query-frontend | HTTP | `GET /api/echo` | | [Overrides API](#overrides-api) | Query-frontend | HTTP | `GET,POST,PATCH,DELETE /api/overrides` | @@ -311,8 +311,9 @@ $ curl -G -s http://localhost:3200/api/search --data-urlencode 'tags=service.nam Ingester configuration `complete_block_timeout` affects how long tags are available for search. -This endpoint retrieves all discovered tag names that can be used in search. The endpoint is available in the query frontend service in -a microservices deployment, or the Tempo endpoint in a monolithic mode deployment. The tags endpoint takes a scope that controls the kinds +This endpoint retrieves all discovered tag names that can be used in search. +The endpoint is available in the query frontend service in a microservices deployment, or the Tempo endpoint in a monolithic mode deployment. +The tags endpoint takes a scope that controls the kinds of tags or attributes returned. If nothing is provided, the endpoint returns all resource and span tags. 
```

@@ -518,7 +519,9 @@ If a particular service name (for example, `shopping-cart`) is only present on s

### TraceQL Metrics

-The TraceQL Metrics API returns Prometheus-like time-series for a given metrics query. Metrics queries are those using metrics functions like `rate()` and `quantile_over_time()`. See the [documentation]({{< relref "../traceql/metrics-queries" >}}) for the complete list.
+The TraceQL Metrics API returns Prometheus-like time-series for a given metrics query.
+Metrics queries are those using metrics functions like `rate()` and `quantile_over_time()`.
+Refer to the [TraceQL metrics documentation](https://grafana.com/docs/tempo//traceql/metrics-queries/) for more information.

Parameters:

@@ -529,18 +532,20 @@ Parameters:
- `end = (unix epoch seconds | unix epoch nanoseconds | RFC3339 string)`
  Optional. Along with `start` define the time range. Providing both `start` and `end` includes blocks for the specified time range only.
- `since = (duration string)`
-  Optional. Can be used instead of `start` and `end` to define the time range in relative values. For example `since=15m` will query the last 15 minutes. Default is last 1 hour.
+  Optional. Can be used instead of `start` and `end` to define the time range in relative values. For example, `since=15m` queries the last 15 minutes. Default is the last 1 hour.
- `step = (duration string)`
-  Optional. Defines the granularity of the returned time-series. For example `step=15s` will return a data point every 15s within the time range. If not specified then the default behavior will choose a dynamic step based on the time range.
+  Optional. Defines the granularity of the returned time-series. For example, `step=15s` returns a data point every 15s within the time range. If not specified, then the default behavior chooses a dynamic step based on the time range.
+- `exemplars = (integer)`
+  Optional. Defines the maximum number of exemplars for the query. If the value exceeds the configured `max_exemplars`, it's trimmed to that limit.

The API is available in the query frontend service in a microservices deployment, or the Tempo endpoint in a monolithic mode deployment.

-For example the following request computes the rate of spans received for `myservice` over the last three hours, at 1 minute intervals.
+For example, the following request computes the rate of spans received for `myservice` over the last three hours, at 1 minute intervals.

{{< admonition type="note" >}}
Actual API parameters must be url-encoded. This example is left unencoded for readability.
-{{% /admonition %}}
+{{< /admonition >}}

```
GET /api/metrics/query_range?q={resource.service.name="myservice"}|rate()&since=3h&step=1m
@@ -763,6 +768,6 @@ service StreamingQuerier {
  rpc SearchTagsV2(SearchTagsRequest) returns (stream SearchTagsV2Response) {}
  rpc SearchTagValues(SearchTagValuesRequest) returns (stream SearchTagValuesResponse) {}
  rpc SearchTagValuesV2(SearchTagValuesRequest) returns (stream SearchTagValuesV2Response) {}
-  rpc MetricsQueryRange(QueryRangeRequest) returns (stream QueryRangeResponse) {}
+  rpc MetricsQueryRange(QueryRangeRequest) returns (stream QueryRangeResponse) {}
}
```
diff --git a/docs/sources/tempo/metrics-generator/_index.md b/docs/sources/tempo/metrics-generator/_index.md
index 919eeab0c99..9c8a7436cc2 100644
--- a/docs/sources/tempo/metrics-generator/_index.md
+++ b/docs/sources/tempo/metrics-generator/_index.md
@@ -10,7 +10,7 @@ weight: 500
# Metrics-generator

Metrics-generator is an optional Tempo component that derives metrics from ingested traces.
-If present, the distributors write received spans to both the ingester and the metrics-generator.
+If present, the distributor writes received spans to both the ingester and the metrics-generator.
The metrics-generator processes spans and writes metrics to a Prometheus data source using the Prometheus remote write protocol.

## Architecture
diff --git a/docs/sources/tempo/operations/traceql-metrics.md b/docs/sources/tempo/operations/traceql-metrics.md
index d8e156cbefd..c0eb42ea557 100644
--- a/docs/sources/tempo/operations/traceql-metrics.md
+++ b/docs/sources/tempo/operations/traceql-metrics.md
@@ -84,6 +84,46 @@ Setting `flush_to_storage` to `true` ensures that metrics blocks are flushed to

For more information about overrides, refer to [Standard overrides](https://grafana.com/docs/tempo//configuration/#standard-overrides).

+   ```yaml
+   overrides:
+     'tenantID':
+       metrics_generator_processors:
+         - local-blocks
+   ```
+
+To enable the processor for all tenants by default, set it in the main configuration:
+
+   ```yaml
+   overrides:
+     defaults:
+       metrics_generator:
+         processors: [local-blocks]
+   ```
+
+Add this configuration to run TraceQL metrics queries against all spans (and not just server spans):
+
+```yaml
+metrics_generator:
+  processor:
+    local_blocks:
+      filter_server_spans: false
+```
+
+If you configured Tempo using the `tempo-distributed` Helm chart, you can also set `traces_storage` using your `values.yaml` file.
+Refer to the [Helm chart for an example](https://github.com/grafana/helm-charts/blob/559ecf4a9c9eefac4521454e7a8066778e4eeff7/charts/tempo-distributed/values.yaml#L362).
+
+```yaml
+metrics_generator:
+  processor:
+    local_blocks:
+      flush_to_storage: true
+```
+
+Setting `flush_to_storage` to `true` ensures that metrics blocks are flushed to storage so that TraceQL metrics queries can run against historical data.
+
+For more information about overrides, refer to [Standard overrides](https://grafana.com/docs/tempo//configuration/#standard-overrides).
+
## Evaluate query timeouts

Because of their expensive nature, these queries can take a long time to run.
diff --git a/docs/sources/tempo/traceql/metrics-queries.md b/docs/sources/tempo/traceql/metrics-queries.md
deleted file mode 100644
index 65f1d7fabf8..00000000000
--- a/docs/sources/tempo/traceql/metrics-queries.md
+++ /dev/null
@@ -1,160 +0,0 @@
----
-title: TraceQL metrics queries
-menuTitle: TraceQL metrics queries
-description: Learn about TraceQL metrics queries
-weight: 600
-keywords:
-  - metrics query
-  - TraceQL metrics
----
-
-# TraceQL metrics queries
-
-{{< docs/experimental product="TraceQL metrics" >}}
-
-TraceQL metrics is an experimental feature in Grafana Tempo that creates metrics from traces.
-
-Metric queries extend trace queries by applying a function to trace query results.
-This powerful feature allows for ad hoc aggregation of any existing TraceQL query by any dimension available in your traces, much in the same way that LogQL metric queries create metrics from logs.
-
-Traces are a unique observability signal that contain causal relationships between the components in your system.
-Do you want to know how many database calls across all systems are downstream of your application?
-What services beneath a given endpoint are currently failing?
-What services beneath an endpoint are currently slow? TraceQL metrics can answer all these questions by parsing your traces in aggregate.
- -![Metrics visualization in Grafana](/media/docs/tempo/metrics-explore-sample-2.4.png) - -## Enable and use TraceQL metrics - -You can use the TraceQL metrics in Grafana with any existing or new Tempo data source. -This capability is available in Grafana Cloud and Grafana (10.4 and newer). - -To enable TraceQL metrics, refer to [Configure TraceQL metrics](https://grafana.com/docs/tempo/latest/operations/traceql-metrics/) for more information. - -## Exemplars - -Exemplars are a powerful feature of TraceQL metrics. -They allow you to see an exact trace that contributed to a given metric value. -This is particularly useful when you want to understand why a given metric is high or low. - -Exemplars are available in TraceQL metrics for all functions. -To get exemplars, you need to configure it in the query-frontend with the parameter `query_frontend.metrics.exemplars`, -or pass a query hint in your query. - -``` -{ name = "GET /:endpoint" } | quantile_over_time(duration, .99) by (span.http.target) with (exemplars=true) -``` - -## Functions - -TraceQL supports include `rate`, `count_over_time`, `quantile_over_time`, and `histogram_over_time` functions. -These functions can be added as an operator at the end of any TraceQL query. - -`rate` -: Calculates the number of matching spans per second - -`count_over_time` -: Counts the number of matching spans per time interval (see the `step` API parameter) - -`quantile_over_time` -: The quantile of the values in the specified interval - -`histogram_over_time` -: Evaluate frequency distribution over time. Example: `histogram_over_time(duration) by (span.foo)` - -`compare` -: Used to split the stream of spans into two groups: a selection and a baseline. The function returns time-series for all attributes found on the spans to highlight the differences between the two groups. - -### The `rate` function - -The following query shows the rate of errors by service and span name. - -``` -{ status = error } | rate() by (resource.service.name, name) -``` - -This example calculates the rate of the erroring spans coming from the service `foo`. -Rate is a `spans/sec` quantity. - -``` -{ resource.service.name = "foo" && status = error } | rate() -``` - -Combined with the `by()` operator, this can be even more powerful. - -``` -{ resource.service.name = "foo" && status = error } | rate() by (span.http.route) -``` - -This example still rates the erroring spans in the service `foo` but the metrics have been broken -down by HTTP route. -This might let you determine that `/api/sad` had a higher rate of erroring -spans than `/api/happy`, for example. - -### The `quantile_over_time` and `histogram_over_time` functions - -The `quantile_over_time()` and `histogram_over_time()` functions let you aggregate numerical values, such as the all important span duration. -You can specify multiple quantiles in the same query. - -``` -{ name = "GET /:endpoint" } | quantile_over_time(duration, .99, .9, .5) -``` - -You can group by any span or resource attribute. - -``` -{ name = "GET /:endpoint" } | quantile_over_time(duration, .99) by (span.http.target) -``` - -Quantiles aren't limited to span duration. -Any numerical attribute on the span is fair game. 
-To demonstrate this flexibility, consider this nonsensical quantile on `span.http.status_code`: - -``` -{ name = "GET /:endpoint" } | quantile_over_time(span.http.status_code, .99, .9, .5) -``` - -### The `compare` function - -This adds a new metrics function `compare` which is used to split the stream of spans into two groups: a selection and a baseline. -It returns time-series for all attributes found on the spans to highlight the differences between the two groups. -This is a powerful function that is best understood by looking at example outputs below: - -The function is used like other metrics functions: when it's placed after any search query, and converts it into a metrics query: -`...any spanset pipeline... | compare({subset filters}, , , )` - -Example: -``` -{ resource.service.name="a" && span.http.path="/myapi" } | compare({status=error}) -``` -This function is generally run as an instant query. It may return may exceed gRPC payloads when run as a query range. -#### Parameters - -The `compare` function has four parameters: - -1. Required. The first parameter is a spanset filter for choosing the subset of spans. This filter is executed against the incoming spans. If it matches, then the span is considered to be part of the selection. Otherwise, it is part of the baseline. Common filters are expected to be things like `{status=error}` (what is different about errors?) or `{duration>1s}` (what is different about slow spans?) - -1. Optional. The second parameter is the top `N` values to return per attribute. If an attribute exceeds this limit in either the selection group or baseline group, then only the top `N` values (based on frequency) are returned, and an error indicator for the attribute is included output (see below). Defaults to `10`. - -1. Optional. Start and End timestamps in Unix nanoseconds, which can be used to constrain the selection window by time, in addition to the filter. For example, the overall query could cover the past hour, and the selection window only a 5 minute time period in which there was an anomaly. These timestamps must both be given, or neither. - -#### Output - -The outputs are flat time-series for each attribute/value found in the spans. - -Each series has a label `__meta_type` which denotes which group it is in, either `selection` or `baseline`. - -Example output series: -``` -{ __meta_type="baseline", resource.cluster="prod" } 123 -{ __meta_type="baseline", resource.cluster="qa" } 124 -{ __meta_type="selection", resource.cluster="prod" } 456 <--- significant difference detected -{ __meta_type="selection", resource.cluster="qa" } 125 -{ __meta_type="selection", resource.cluster="dev"} 126 <--- cluster=dev was found in the highlighted spans but not in the baseline -``` - -When an attribute reaches the topN limit, there will also be present an error indicator. -This example means the attribute `resource.cluster` had too many values. 
-``` -{ __meta_error="__too_many_values__", resource.cluster= } -``` diff --git a/docs/sources/tempo/traceql/metrics-queries/_index.md b/docs/sources/tempo/traceql/metrics-queries/_index.md new file mode 100644 index 00000000000..9f4c2a2fe77 --- /dev/null +++ b/docs/sources/tempo/traceql/metrics-queries/_index.md @@ -0,0 +1,87 @@ +--- +title: TraceQL metrics queries +menuTitle: TraceQL metrics queries +description: Learn about TraceQL metrics queries +weight: 600 +keywords: + - metrics query + - TraceQL metrics +--- + +# TraceQL metrics queries + +{{< docs/experimental product="TraceQL metrics" >}} + +TraceQL metrics is an experimental feature in Grafana Tempo that creates metrics from traces. + +Metric queries extend trace queries by applying a function to trace query results. +This powerful feature allows for ad hoc aggregation of any existing TraceQL query by any dimension available in your traces, much in the same way that LogQL metric queries create metrics from logs. + +Traces are a unique observability signal that contain causal relationships between the components in your system. + +TraceQL metrics can help answer questions like this: + +* How many database calls across all systems are downstream of your application? +* What services beneath a given endpoint are currently failing? +* What services beneath an endpoint are currently slow? + +TraceQL metrics can help you answer these questions by parsing your traces in aggregate. + +TraceQL metrics are powered by the [TraceQL metrics API](https://grafana.com/docs/tempo//api_docs/#traceql-metrics). + +![Metrics visualization in Grafana](/media/docs/tempo/metrics-explore-sample-2.4.png) + +## RED metrics, TraceQL, and PromQL + +RED is an acronym for three types of metrics: + +- Rate, the number of requests per second +- Errors, the number of those requests that are failing +- Duration, the amount of time those requests take + +For more information about the RED method, refer to [The RED Method: how to instrument your services](/blog/2018/08/02/the-red-method-how-to-instrument-your-services/). + +You can write TraceQL metrics queries to compute rate, errors, and durations over different groups of spans. + +For more information on how to use TraceQL metrics to investigate issues, refer to [Solve problems with metrics queries](./solve-problems-metrics-queries). + +## Enable and use TraceQL metrics + +To use TraceQL metrics, you need to enable them on your Tempo database. +Refer to [Configure TraceQL metrics](https://grafana.com/docs/tempo//operations/traceql-metrics/) for more information. + +From there, you can either query the TraceQL metrics API directly (for example, with `curl`) or using Grafana +(recommended). +To run TraceQL metrics queries in Grafana, you need Grafana Cloud or Grafana 10.4 or later. +No extra configuration is needed. +Use a Tempo data source that points to a Tempo database with TraceQL metrics enabled. + +Refer to [Solve problems using metrics queries](./solve-problems-metrics-queries/) for some real-world examples. + +### Functions + +TraceQL metrics queries currently include the following functions for aggregating over groups of spans: `rate`, `count_over_time`, `quantile_over_time`, `histogram_over_time`, and `compare`. +These functions can be added as an operator at the end of any TraceQL query. + +For detailed information and example queries for each function, refer to [TraceQL metrics functions](./functions). + +### Exemplars + +Exemplars are a powerful feature of TraceQL metrics. 
+
+### Exemplars
+
+Exemplars are a powerful feature of TraceQL metrics.
+They allow you to see an exact trace that contributed to a given metric value.
+This is particularly useful when you want to understand why a given metric is high or low.
+
+Exemplars are available in TraceQL metrics for all range queries.
+To get exemplars, you need to configure them in the query-frontend with the parameter `query_frontend.metrics.max_exemplars`,
+or pass a query hint in your query.
+
+Example:
+
+```
+{ span:name = "GET /:endpoint" } | quantile_over_time(duration, .99) by (span.http.target) with (exemplars=true)
+```
+
+{{< admonition type="note" >}}
+TraceQL metric queries with exemplars aren't fully supported in Grafana Explore.
+They will be supported in a future Grafana release.
+{{< /admonition >}}
diff --git a/docs/sources/tempo/traceql/metrics-queries/functions.md b/docs/sources/tempo/traceql/metrics-queries/functions.md
new file mode 100644
index 00000000000..24e3e070f1d
--- /dev/null
+++ b/docs/sources/tempo/traceql/metrics-queries/functions.md
@@ -0,0 +1,219 @@
+---
+title: TraceQL metrics functions
+menuTitle: TraceQL metrics functions
+description: Learn about functions used in TraceQL metrics queries
+weight: 600
+keywords:
+  - metrics query
+  - TraceQL metrics
+---
+
+# TraceQL metrics functions
+
+TraceQL supports the `rate`, `count_over_time`, `min_over_time`, `max_over_time`, `quantile_over_time`, `histogram_over_time`, and `compare` functions.
+
+## Available functions
+
+These functions can be added as an operator at the end of any TraceQL query.
+
+`rate`
+: Calculates the number of matching spans per second.
+
+`count_over_time`
+: Counts the number of matching spans per time interval (refer to the [`step` API parameter](https://grafana.com/docs/tempo//api_docs/#traceql-metrics)).
+
+`min_over_time`
+: Returns the minimum value for the specified attribute across all matching spans per time interval (refer to the [`step` API parameter](https://grafana.com/docs/tempo//api_docs/#traceql-metrics)).
+
+`max_over_time`
+: Returns the maximum value for the specified attribute across all matching spans per time interval (refer to the [`step` API parameter](https://grafana.com/docs/tempo//api_docs/#traceql-metrics)).
+
+`quantile_over_time`
+: The quantile of the values in the specified interval.
+
+`histogram_over_time`
+: Evaluates the frequency distribution over time. Example: `histogram_over_time(duration) by (span.foo)`
+
+`compare`
+: Used to split the stream of spans into two groups: a selection and a baseline. The function returns time-series for all attributes found on the spans to highlight the differences between the two groups.
+
+## The `rate` function
+
+The `rate` function calculates the number of spans per second that match the given span selectors.
+
+### Parameters
+
+None.
+
+### Examples
+
+The following query shows the rate of errors by service and span name.
+This is a TraceQL-specific way of gathering rate metrics that would otherwise be generated by the span metrics processor.
+
+For example, this query:
+
+```
+{ status = error } | rate() by (resource.service.name, name)
+```
+
+is equivalent to using span-generated metrics and running a corresponding PromQL query against them.
+
+This example calculates the rate of erroring spans coming from the service `foo`.
+Rate is a `spans/sec` quantity.
+
+```
+{ resource.service.name = "foo" && status = error } | rate()
+```
+
+Combined with the `by()` operator, this can be even more powerful.
+
+```
+{ resource.service.name = "foo" && status = error } | rate() by (span.http.route)
+```
+
+This example still rates the erroring spans in the service `foo` but the metrics are broken down by HTTP route.
+This might let you determine that `/api/sad` had a higher rate of erroring spans than `/api/happy`, for example.
+
+## The `count_over_time` function
+
+The `count_over_time()` function counts the number of matching spans per time interval.
+The time interval that the count is computed over is set by the `step` parameter.
+For more information, refer to the [`step` API parameter](https://grafana.com/docs/tempo//api_docs/#traceql-metrics).
+
+### Example
+
+This example counts the number of spans with name `"GET /:endpoint"` broken down by status code.
+You might see that there are 10 `"GET /:endpoint"` spans with status code 200 and 15 `"GET /:endpoint"` spans with status code 400.
+
+```
+{ name = "GET /:endpoint" } | count_over_time() by (span.http.status_code)
+```
+
+## The `min_over_time` and `max_over_time` functions
+
+The `min_over_time()` function lets you aggregate numerical attributes by calculating their minimum value.
+For example, you could choose to calculate the minimum duration of a group of spans, or you could choose to calculate the minimum value of a custom attribute you've attached to your spans, like `span.shopping.cart.entries`.
+The time interval that the minimum is computed over is set by the `step` parameter.
+
+The `max_over_time()` function lets you aggregate numerical values by computing their maximum value, such as the all-important span duration.
+The time interval that the maximum is computed over is set by the `step` parameter.
+
+For more information, refer to the [`step` API parameter](https://grafana.com/docs/tempo//api_docs/#traceql-metrics).
+
+### Parameters
+
+The numerical field that you want to calculate the minimum or maximum of.
+
+### Examples
+
+This example computes the minimum duration for each `http.target` of all spans named `"GET /:endpoint"`.
+Any numerical attribute on the span is fair game.
+
+```
+{ name = "GET /:endpoint" } | min_over_time(duration) by (span.http.target)
+```
+
+This example computes the minimum status code value of all spans named `"GET /:endpoint"`.
+
+```
+{ name = "GET /:endpoint" } | min_over_time(span.http.status_code)
+```
+
+This example computes the maximum duration for each `http.target` of all spans named `"GET /:endpoint"`.
+
+```
+{ name = "GET /:endpoint" } | max_over_time(duration) by (span.http.target)
+```
+
+This example computes the maximum value of the `http.response.size` attribute of all spans named `"GET /:endpoint"`.
+
+```
+{ name = "GET /:endpoint" } | max_over_time(span.http.response.size)
+```
+
+## The `quantile_over_time` and `histogram_over_time` functions
+
+The `quantile_over_time()` and `histogram_over_time()` functions let you aggregate numerical values, such as the all-important span duration.
+You can specify multiple quantiles in the same query.
+
+The example below computes the 99th, 90th, and 50th percentile of the duration attribute on all spans with name `GET /:endpoint`.
+
+```
+{ name = "GET /:endpoint" } | quantile_over_time(duration, .99, .9, .5)
+```
+
+You can group by any span or resource attribute.
+
+```
+{ name = "GET /:endpoint" } | quantile_over_time(duration, .99) by (span.http.target)
+```
+
+Quantiles aren't limited to span duration.
+Any numerical attribute on the span is fair game.
+To demonstrate this flexibility, consider this nonsensical quantile on `span.http.status_code`:
+
+```
+{ name = "GET /:endpoint" } | quantile_over_time(span.http.status_code, .99, .9, .5)
+```
+
+This computes the 99th, 90th, and 50th percentile of the values of the `status_code` attribute for all spans named `GET /:endpoint`.
+This is unlikely to tell you anything useful (what does a median status code of `347` mean?), but it works.
+
+As a further example, imagine a custom attribute like `span.temperature`.
+You could use a similar query to know what the 50th percentile and 95th percentile temperatures were across all your spans.
+
+## The `compare` function
+
+The `compare` function is used to split a set of spans into two groups: a selection and a baseline.
+It returns time-series for all attributes found on the spans to highlight the differences between the two groups.
+
+This is a powerful function that's best understood by using the [**Comparison** tab in Explore Traces](https://grafana.com/docs/grafana//explore/simplified-exploration/traces/investigate/#comparison).
+You can also understand this function by looking at the example outputs below.
+
+The function is used like other metrics functions: when it's placed after any trace query, it converts the query into a metrics query:
+`...any spanset pipeline... | compare({subset filters}, <topN>, <start timestamp>, <end timestamp>)`
+
+Example:
+
+```
+{ resource.service.name="a" && span.http.path="/myapi" } | compare({status=error})
+```
+
+This function is generally run as an instant query.
+An instant query gives a single value at the end of the selected time range.
+[Instant queries](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries) are quicker to execute, and it's often easier to understand their results.
+When run as a range query, the returned data may exceed gRPC payload limits.
+
+### Parameters
+
+The `compare` function has four parameters:
+
+1. Required. The first parameter is a spanset filter for choosing the subset of spans. This filter is executed against the incoming spans. If it matches, then the span is considered to be part of the selection. Otherwise, it is part of the baseline. Common filters are expected to be things like `{status=error}` (what is different about errors?) or `{duration>1s}` (what is different about slow spans?).
+
+1. Optional. The second parameter is the top `N` values to return per attribute. If an attribute exceeds this limit in either the selection group or baseline group, then only the top `N` values (based on frequency) are returned, and an error indicator for the attribute is included in the output (see below). Defaults to `10`.
+
+1. Optional. Start and End timestamps in Unix nanoseconds, which can be used to constrain the selection window by time, in addition to the filter. For example, the overall query could cover the past hour, and the selection window only a 5 minute time period in which there was an anomaly. These timestamps must both be given, or neither.
+
+### Output
+
+The outputs are flat time-series for each attribute/value found in the spans.
+
+Each series has a label `__meta_type` which denotes which group it is in, either `selection` or `baseline`.
+
+Example output series:
+
+```
+{ __meta_type="baseline", resource.cluster="prod" } 123
+{ __meta_type="baseline", resource.cluster="qa" } 124
+{ __meta_type="selection", resource.cluster="prod" } 456 <--- significant difference detected
+{ __meta_type="selection", resource.cluster="qa" } 125
+{ __meta_type="selection", resource.cluster="dev"} 126 <--- cluster=dev was found in the highlighted spans but not in the baseline
+```
+
+When an attribute reaches the topN limit, an error indicator is also present in the output.
+This example means the attribute `resource.cluster` had too many values.
+
+```
+{ __meta_error="__too_many_values__", resource.cluster= }
+```
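+
+As a sketch based on the parameter list above, the following query passes the optional second parameter so that only the top five values per attribute are returned when comparing erroring spans against the baseline:
+
+```
+{ resource.service.name="a" && span.http.path="/myapi" } | compare({status=error}, 5)
+```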
diff --git a/docs/sources/tempo/traceql/metrics-queries/solve-problems-metrics-queries.md b/docs/sources/tempo/traceql/metrics-queries/solve-problems-metrics-queries.md
new file mode 100644
index 00000000000..7f2c65730c8
--- /dev/null
+++ b/docs/sources/tempo/traceql/metrics-queries/solve-problems-metrics-queries.md
@@ -0,0 +1,67 @@
+---
+title: Solve problems with trace metrics queries
+menuTitle: Use cases
+description: Solve problems with trace metrics queries
+weight: 600
+keywords:
+  - metrics query
+  - TraceQL metrics
+---
+
+# Solve problems with trace metrics queries
+
+You can query data generated by TraceQL metrics in a similar way to how you would query results stored in Prometheus, Grafana Mimir, or another Prometheus-compatible time-series database (TSDB).
+TraceQL metrics queries let you calculate metrics on trace span data on the fly with Tempo (your tracing database), without requiring a time-series database like Prometheus.
+
+This page provides an example of how you can investigate the rate of incoming requests using both PromQL and TraceQL.
+
+## RED metrics and queries
+
+The Tempo metrics-generator emits metrics with pre-configured labels for Rate, Error, and Duration (RED) metrics and service graph edges.
+Generated metric labels vary, but always include the service name (in service graph metrics, as a client and/or a server type).
+For more information, refer to the [metrics-generator documentation](../../metrics-generator/).
+
+You can use these metrics to get an overview of application performance.
+The metrics can be directly correlated to the trace spans that are available for querying.
+
+TraceQL metrics let you query metrics from traces directly from Tempo instead of requiring the metrics-generator component and an accompanying TSDB.
+
+{{< admonition type="note" >}}
+TraceQL metrics are constrained to a 24-hour range window, and aren't available as a Grafana Managed Alerts source. For any metrics that you want to query over longer time ranges, use for alerting, or retain for more than 30 days, use the metrics-generator to store these metrics in Prometheus, Mimir, or another Prometheus-compatible TSDB and continue to use PromQL for querying.
+{{< /admonition >}}
+
+## Investigate the rate of incoming requests
+
+Let's say that you want to know how many requests are being serviced, both by your application as a whole and by each service that comprises it.
+This allows you to ensure that your application scales appropriately, can help with capacity planning, and can show you which services may be having problems and are taking up load in fail-over scenarios.
+In PromQL, these values are calculated over counters that increase each time a service is called. These metrics provide the Rate (R) in RED.
+
+If you are familiar with PromQL, then you're used to constructing these queries.
+You can create equivalent queries in TraceQL.
+Here are the two queries for the different data sources (PromQL for Mimir and TraceQL for Tempo), shown side by side over a 6-hour time range.
+
+![Equivalent PromQL and TraceQL queries](/media/docs/tempo/traceql/TraceQL-metrics-query-example-1.png)
+
+### How the query looks in PromQL
+
+The Tempo metrics-generator outputs a metric, `traces_spanmetrics_calls_total`, a counter that increases each time a named span in a service is called.
+RED data generated by the metrics-generator includes the service name and span kind.
+You can use this to show call counts only when a service was called externally, by filtering on the `SERVER` span kind, which gives the total number of times the service has been called.
+
+You can use the PromQL `rate()` and `sum()` functions to examine the counter and determine the per-second rate of calls occurring, summing them by each service.
+In addition to only looking at spans of `kind=server`, you can also focus on spans coming from a particular Kubernetes namespace (`ditl-demo-prod`).
+
+```
+sum by (service_name)(rate(traces_spanmetrics_calls_total{service_namespace="ditl-demo-prod", span_kind="SPAN_KIND_SERVER"}[2m]))
+```
+
+### How the query looks in TraceQL
+
+TraceQL metrics queries let you similarly examine a particular subset of your spans.
+As in the example above, you can start by filtering down to spans that occur in a particular Kubernetes namespace (`ditl-demo-prod`) and are of kind `SERVER`.
+The resulting set of spans is piped to the TraceQL `rate` function, which then calculates the rate (in spans/sec) at which spans matching your filters are received.
+By adding the `by (resource.service.name)` term, the query returns spans-per-second rates per service, rather than an aggregate across all services.
+
+```
+{ resource.service.namespace="ditl-demo-prod" && kind=server } | rate() by (resource.service.name)
+```
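+
+As a follow-on sketch using the same example namespace (not shown in the screenshot above), you can narrow the same filter to erroring spans to get the per-service error rate, which covers the Errors part of RED:
+
+```
+{ resource.service.namespace="ditl-demo-prod" && kind=server && status=error } | rate() by (resource.service.name)
+```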