Merge branch 'main' into initialize-hist-buckets
javiermolinar authored Nov 22, 2024
2 parents 08a7d58 + 1a21818 commit 64a3532
Showing 185 changed files with 27,465 additions and 5,536 deletions.
18 changes: 15 additions & 3 deletions CHANGELOG.md
@@ -1,4 +1,8 @@
## main / unreleased
* [FEATURE] tempo-cli: support dropping multiple traces in a single operation [#4266](https://github.com/grafana/tempo/pull/4266) (@ndk)
* [CHANGE] update default config values to better align with production workloads [#4340](https://github.com/grafana/tempo/pull/4340) (@electron0zero)
* [CHANGE] fix deprecation warning by switching to DoBatchWithOptions [#4343](https://github.com/grafana/tempo/pull/4343) (@dastrobu)
* [CHANGE] **BREAKING CHANGE** Tempo serverless is now deprecated and will be removed in an upcoming release [#4017](https://github.com/grafana/tempo/pull/4017/) (@electron0zero)
* [CHANGE] **BREAKING CHANGE** Change the AWS Lambda serverless build tooling output from "main" to "bootstrap". Refer to https://aws.amazon.com/blogs/compute/migrating-aws-lambda-functions-from-the-go1-x-runtime-to-the-custom-runtime-on-amazon-linux-2/ for migration steps [#3852](https://github.com/grafana/tempo/pull/3852) (@zatlodan)
* [CHANGE] Add throughput and SLO metrics in the tags and tag values endpoints [#4148](https://github.com/grafana/tempo/pull/4148) (@electron0zero)
* [CHANGE] tempo-cli: add support for /api/v2/traces endpoint [#4127](https://github.com/grafana/tempo/pull/4127) (@electron0zero)
@@ -18,6 +22,7 @@
* [FEATURE] TraceQL support for instrumentation scope [#3967](https://github.com/grafana/tempo/pull/3967) (@ie-pham)
* [FEATURE] Export cost attribution usage metrics from distributor [#4162](https://github.com/grafana/tempo/pull/4162) (@mdisibio)
* [FEATURE] TraceQL metrics: avg_over_time [#4073](https://github.com/grafana/tempo/pull/4073) (@javiermolinar)
* [ENHANCEMENT] Update to the latest dskit [#4341](https://github.com/grafana/tempo/pull/4341) (@dastrobu)
* [ENHANCEMENT] Changed log level from INFO to DEBUG for the TempoDB Find operation using traceId to reduce excessive/unwanted logs in log search. [#4179](https://github.com/grafana/tempo/pull/4179) (@Aki0x137)
* [ENHANCEMENT] Pushdown collection of results from generators in the querier [#4119](https://github.com/grafana/tempo/pull/4119) (@electron0zero)
* [ENHANCEMENT] The span multiplier now also sources its value from the resource attributes. [#4210](https://github.com/grafana/tempo/pull/4210)
@@ -49,19 +54,26 @@
* [ENHANCEMENT] Add disk caching in ingester SearchTagValuesV2 for completed blocks [#4069](https://github.com/grafana/tempo/pull/4069) (@electron0zero)
* [ENHANCEMENT] chore: remove gofakeit dependency [#4274](https://github.com/grafana/tempo/pull/4274) (@javiermolinar)
* [ENHANCEMENT] Add a max flush attempts and metric to the metrics generator [#4254](https://github.com/grafana/tempo/pull/4254) (@joe-elliott)
* [ENHANCEMENT] Collection of query-frontend changes to reduce allocs. [#4242]https://github.com/grafana/tempo/pull/4242 (@joe-elliott)
* [ENHANCEMENT] Added `insecure-skip-verify` option in tempo-cli to skip SSL certificate validation when connecting to the S3 backend. [#44236](https://github.com/grafana/tempo/pull/4259) (@faridtmammadov)
* [ENHANCEMENT] Collection of query-frontend changes to reduce allocs. [#4242](https://github.com/grafana/tempo/pull/4242) (@joe-elliott)
* [ENHANCEMENT] Added `insecure-skip-verify` option in tempo-cli to skip SSL certificate validation when connecting to the S3 backend. [#4259](https://github.com/grafana/tempo/pull/4259) (@faridtmammadov)
* [ENHANCEMENT] Chore: delete spanlogger. [#4312](https://github.com/grafana/tempo/pull/4312) (@javiermolinar)
* [ENHANCEMENT] Add `invalid_utf8` to reasons spanmetrics will discard spans. [#4293](https://github.com/grafana/tempo/pull/4293) (@zalegrala)
* [ENHANCEMENT] Reduce frontend and querier allocations by dropping HTTP headers early in the pipeline. [#4298](https://github.com/grafana/tempo/pull/4298) (@joe-elliott)
* [ENHANCEMENT] Reduce ingester working set by improving prealloc behavior. [#4344](https://github.com/grafana/tempo/pull/4344) (@joe-elliott)
* [ENHANCEMENT] Use Prometheus fast regexp for TraceQL regular expression matchers. [#4329](https://github.com/grafana/tempo/pull/4329) (@joe-elliott)
**BREAKING CHANGE** All regular expression matchers will now be fully anchored. `span.foo =~ "bar"` will now be evaluated as `span.foo =~ "^bar$"` (see the sketch after this list).
* [BUGFIX] Replace hedged requests roundtrips total with a counter. [#4063](https://github.com/grafana/tempo/pull/4063) [#4078](https://github.com/grafana/tempo/pull/4078) (@galalen)
* [BUGFIX] Metrics generators: Correctly drop from the ring before stopping ingestion to reduce drops during a rollout. [#4101](https://github.com/grafana/tempo/pull/4101) (@joe-elliott)
* [BUGFIX] Correctly handle 400 Bad Request and 404 Not Found in gRPC streaming [#4144](https://github.com/grafana/tempo/pull/4144) (@mapno)
* [BUGFIX] Pushes a 0 to classic histogram's counter when the series is new to allow Prometheus to start from a non-null value. [#4140](https://github.com/grafana/tempo/pull/4140) (@mapno)
* [BUGFIX] Fix counter samples being downsampled by backdate to the previous minute the initial sample when the series is new [#44236](https://github.com/grafana/tempo/pull/4236) (@javiermolinar)
* [BUGFIX] Fix counter samples being downsampled by backdating the initial sample to the previous minute when the series is new [#4236](https://github.com/grafana/tempo/pull/4236) (@javiermolinar)
* [BUGFIX] Fix traceql metrics time range handling at the cutoff between recent and backend data [#4257](https://github.com/grafana/tempo/issues/4257) (@mdisibio)
* [BUGFIX] Skip computing exemplars for instant queries. [#4204](https://github.com/grafana/tempo/pull/4204) (@javiermolinar)
* [BUGFIX] Gave context to orphaned spans related to various maintenance processes. [#4260](https://github.com/grafana/tempo/pull/4260) (@joe-elliott)
* [BUGFIX] Utilize S3Pass and S3User parameters in tempo-cli options, which were previously unused in the code. [#4259](https://github.com/grafana/tempo/pull/4259) (@faridtmammadov)
* [BUGFIX] Initialize histogram buckets to 0 to avoid downsampling. [#4366](https://github.com/grafana/tempo/pull/4366) (@javiermolinar)
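
For readers unfamiliar with the anchoring change called out above, here is a minimal, illustrative Go sketch using the standard `regexp` package (not Tempo's actual matcher code) showing how full anchoring changes which attribute values `span.foo =~ "bar"` matches.

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Before the change: an unanchored pattern matches any value containing "bar".
	unanchored := regexp.MustCompile(`bar`)

	// After the change: the matcher is evaluated as if written "^bar$",
	// so only the exact value "bar" matches.
	anchored := regexp.MustCompile(`^bar$`)

	for _, v := range []string{"bar", "foobar", "barbell"} {
		fmt.Printf("%q: unanchored=%v anchored=%v\n",
			v, unanchored.MatchString(v), anchored.MatchString(v))
	}
	// Output:
	// "bar": unanchored=true anchored=true
	// "foobar": unanchored=true anchored=false
	// "barbell": unanchored=true anchored=false
}
```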


# v2.6.1

* [CHANGE] **BREAKING CHANGE** tempo-query is no longer a Jaeger instance with grpcPlugin. It's now a standalone server. Serving a gRPC API for Jaeger on `0.0.0.0:7777` by default. [#3840](https://github.com/grafana/tempo/issues/3840) (@frzifus)
38 changes: 33 additions & 5 deletions README.md
@@ -8,15 +8,43 @@
<a href="https://goreportcard.com/report/github.com/grafana/tempo"><img src="https://goreportcard.com/badge/github.com/grafana/tempo" alt="Go Report Card" /></a>
</p>

Grafana Tempo is an open source, easy-to-use and high-scale distributed tracing backend. Tempo is cost-efficient, requiring only object storage to operate, and is deeply integrated with Grafana, Prometheus, and Loki.
Grafana Tempo is an open source, easy-to-use, and high-scale distributed tracing backend. Tempo is cost-efficient, requiring only object storage to operate, and is deeply integrated with Grafana, Prometheus, and Loki.

Tempo is Jaeger, Zipkin, Kafka, OpenCensus and OpenTelemetry compatible. It ingests batches in any of the mentioned formats, buffers them and then writes them to Azure, GCS, S3 or local disk. As such, it is robust, cheap and easy to operate!

Tempo implements [TraceQL](https://grafana.com/docs/tempo/latest/traceql/), a traces-first query language inspired by LogQL and PromQL. This query language allows users to very precisely and easily select spans and jump directly to the spans fulfilling the specified conditions:
## Business value of distributed tracing

![Tempo data source query editor](https://grafana.com/media/docs/grafana/data-sources/tempo/query-editor/tempo-ds-query-ed-example-v11-a.png)
Distributed tracing helps teams quickly pinpoint performance issues and understand the flow of requests across services. The Explore Traces UI simplifies this process by offering a user-friendly interface to view and analyze trace data, making it easier to identify and resolve issues without needing to write complex queries.

## Getting started
Refer to [Use traces to find solutions](https://grafana.com/docs/tempo/latest/introduction/solutions-with-traces/) to learn more about how you can use distributed tracing to investigate and solve issues.

## Explore Traces UI: A better way to get value from your tracing data
We are excited to introduce the [Explore Traces app](https://github.com/grafana/explore-traces) as part of the Grafana Explore suite. This app provides a queryless and intuitive experience for analyzing tracing data, allowing teams to quickly identify performance issues, latency bottlenecks, and errors without needing to write complex queries or use TraceQL.

Key Features:
- **Intuitive Trace Analysis**: Spot slow or error-prone traces with easy, point-and-click interactions.
- **RED Metrics Overview**: Use Rate, Errors, and Duration metrics to highlight performance issues.
- **Automated Comparison**: Identify problematic attributes with automatic trace comparison.
- **Simplified Visualizations**: Access rich visual data without needing to construct TraceQL queries.

![image](https://github.com/user-attachments/assets/991205df-1b27-489f-8ef0-1a05ee158996)

To learn more, see the following links:
- [Explore Traces repo](https://github.com/grafana/explore-traces)
- [Explore Traces documentation](https://grafana.com/docs/grafana/latest/explore/simplified-exploration/traces/)
- [Demo video](https://github.com/user-attachments/assets/8103e173-6dcf-4659-b938-7614c8a5b52d)

## TraceQL

Tempo implements [TraceQL](https://grafana.com/docs/tempo/latest/traceql/), a traces-first query language inspired by LogQL and PromQL, which enables targeted queries or rich UI-driven analyses.
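
As a rough illustration of running a TraceQL query programmatically, the Go sketch below calls Tempo's search endpoint over HTTP. The base address `http://localhost:3200`, the `/api/search` path, and the `q` and `limit` parameters reflect a default local setup and should be treated as assumptions to verify against your deployment.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Assumed local Tempo instance; change to match your deployment.
	base := "http://localhost:3200"

	// A TraceQL query: spans from service "checkout" slower than 500ms.
	query := `{ resource.service.name = "checkout" && duration > 500ms }`

	u := fmt.Sprintf("%s/api/search?q=%s&limit=20", base, url.QueryEscape(query))
	resp, err := http.Get(u)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The response is JSON containing matching trace summaries.
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```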

### TraceQL metrics

[TraceQL metrics](https://grafana.com/docs/tempo/latest/traceql/metrics-queries/) is an experimental feature in Grafana Tempo that creates metrics from traces. Metric queries extend trace queries by applying a function to trace query results. This powerful feature allows for ad hoc aggregation of any existing TraceQL query by any dimension available in your traces, much in the same way that LogQL metric queries create metrics from logs.
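
For example, a metrics query such as `{ span.http.status_code >= 500 } | rate() by (resource.service.name)` turns a span filter into a per-service error rate. The sketch below only builds the request URL for the query-range metrics endpoint; the `/api/metrics/query_range` path and its parameters are assumptions based on current Tempo documentation and may differ in your version.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

func main() {
	// Ad hoc aggregation: error-span rate per service over the last hour.
	q := `{ span.http.status_code >= 500 } | rate() by (resource.service.name)`

	end := time.Now().Unix()
	start := end - 3600

	// Assumed endpoint and parameters; verify against your Tempo version's API docs.
	u := fmt.Sprintf(
		"http://localhost:3200/api/metrics/query_range?q=%s&start=%d&end=%d&step=60s",
		url.QueryEscape(q), start, end,
	)
	fmt.Println("GET", u)
}
```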

Tempo is Jaeger, Zipkin, Kafka, OpenCensus, and OpenTelemetry compatible. It ingests batches in any of the mentioned formats, buffers them, and then writes them to Azure, GCS, S3, or local disk. As such, it is robust, cheap, and easy to operate!

## Getting started with Tempo

- [Get started documentation](https://grafana.com/docs/tempo/latest/getting-started/)
- [Deployment Examples](./example)
118 changes: 76 additions & 42 deletions cmd/tempo-cli/cmd-rewrite-blocks.go
@@ -7,6 +7,7 @@ import (
"fmt"
"os"
"strconv"
"strings"

"github.com/go-kit/log"
"github.com/google/uuid"
@@ -19,16 +20,16 @@ import (
"github.com/grafana/tempo/tempodb/encoding/common"
)

type dropTraceCmd struct {
type dropTracesCmd struct {
backendOptions

TraceID string `arg:"" help:"trace ID to retrieve"`
TenantID string `arg:"" help:"tenant ID to search"`
TraceIDs string `arg:"" help:"Trace IDs to drop"`
DropTrace bool `name:"drop-trace" help:"actually attempt to drop the trace" default:"false"`
}

func (cmd *dropTraceCmd) Run(ctx *globalOptions) error {
fmt.Printf("beginning process to drop trace %v from tenant %v\n", cmd.TraceID, cmd.TenantID)
func (cmd *dropTracesCmd) Run(opts *globalOptions) error {
fmt.Printf("beginning process to drop traces %v from tenant %v\n", cmd.TraceIDs, cmd.TenantID)
fmt.Println("**warning**: compaction must be disabled or a compactor may duplicate a block as this process is rewriting it")
fmt.Println("")
if cmd.DropTrace {
@@ -38,62 +39,86 @@ func (cmd *dropTraceCmd) Run(ctx *globalOptions) error {
fmt.Println("")
}

r, w, c, err := loadBackend(&cmd.backendOptions, ctx)
if err != nil {
return err
}
ctx := context.Background()

id, err := util.HexStringToTraceID(cmd.TraceID)
r, w, c, err := loadBackend(&cmd.backendOptions, opts)
if err != nil {
return err
}

blocks, err := blocksWithTraceID(context.Background(), r, cmd.TenantID, id)
if err != nil {
return err
type pair struct {
traceIDs []common.ID
blockMeta *backend.BlockMeta
}
tracesByBlock := map[backend.UUID]pair{}

if len(blocks) == 0 {
fmt.Println("\ntrace not found in any block. aborting")
return nil
}
// Group trace IDs by blocks
ids := strings.Split(cmd.TraceIDs, ",")
for _, id := range ids {
traceID, err := util.HexStringToTraceID(id)
if err != nil {
return err
}

// print out blocks that have the trace id
fmt.Println("\n\ntrace found in:")
for _, block := range blocks {
fmt.Printf(" %v sz: %d traces: %d\n", block.BlockID, block.Size_, block.TotalObjects)
}
// It might be significantly improved if common.BackendBlock supported bulk searches.
blocks, err := blocksWithTraceID(ctx, r, cmd.TenantID, traceID)
if err != nil {
return err
}

if !cmd.DropTrace {
fmt.Println("**not dropping trace, use --drop-trace to actually drop**")
return nil
if len(blocks) == 0 {
fmt.Printf("\ntrace %s not found in any block. skipping\n", util.TraceIDToHexString(traceID))
}
for _, block := range blocks {
p, ok := tracesByBlock[block.BlockID]
if !ok {
p = pair{blockMeta: block}
}
p.traceIDs = append(p.traceIDs, traceID)
tracesByBlock[block.BlockID] = p
}
}

fmt.Println("rewriting blocks:")
for _, block := range blocks {
fmt.Printf(" rewriting %v\n", block.BlockID)
newBlock, err := rewriteBlock(context.Background(), r, w, block, id)
// Remove traces from blocks
for _, p := range tracesByBlock {
// print out trace IDs to be removed in the block
strTraceIDs := make([]string, len(p.traceIDs))
for i, tid := range p.traceIDs {
strTraceIDs[i] = util.TraceIDToHexString(tid)
}
fmt.Printf("\nFound %d traces: %v in block: %v\n", len(strTraceIDs), strTraceIDs, p.blockMeta.BlockID)
fmt.Printf("blockInfo: ID: %v, Size: %d Total Traces: %d\n", p.blockMeta.BlockID, p.blockMeta.Size_, p.blockMeta.TotalObjects)

if !cmd.DropTrace {
fmt.Println("**not dropping trace, use --drop-trace to actually drop**")
continue
}

fmt.Printf(" rewriting %v\n", p.blockMeta.BlockID)
newMeta, err := rewriteBlock(ctx, r, w, p.blockMeta, p.traceIDs)
if err != nil {
return err
}
fmt.Printf(" rewrote to new block: %v\n", newBlock.BlockID)
}
if newMeta == nil {
fmt.Printf(" block %v was removed\n", p.blockMeta.BlockID)
} else {
fmt.Printf(" rewrote to new block: %v\n", newMeta.BlockID)
}

fmt.Println("marking old blocks compacted")
for _, block := range blocks {
fmt.Printf(" marking %v\n", block.BlockID)
err = c.MarkBlockCompacted((uuid.UUID)(block.BlockID), block.TenantID)
fmt.Printf(" marking %v compacted\n", p.blockMeta.BlockID)
err = c.MarkBlockCompacted((uuid.UUID)(p.blockMeta.BlockID), p.blockMeta.TenantID)
if err != nil {
return err
}
}

fmt.Println("successfully rewrote blocks dropping requested trace")
if cmd.DropTrace {
fmt.Printf("successfully rewrote blocks dropping requested traces: %v from tenant: %v\n", cmd.TraceIDs, cmd.TenantID)
}

return nil
}

func rewriteBlock(ctx context.Context, r backend.Reader, w backend.Writer, meta *backend.BlockMeta, traceID common.ID) (*backend.BlockMeta, error) {
func rewriteBlock(ctx context.Context, r backend.Reader, w backend.Writer, meta *backend.BlockMeta, traceIDs []common.ID) (*backend.BlockMeta, error) {
enc, err := encoding.FromVersion(meta.Version)
if err != nil {
return nil, fmt.Errorf("error getting encoder: %w", err)
@@ -131,7 +156,12 @@ func rewriteBlock(ctx context.Context, r backend.Reader, w backend.Writer, meta

// hook to drop the trace
DropObject: func(id common.ID) bool {
return bytes.Equal(id, traceID)
for _, tid := range traceIDs {
if bytes.Equal(id, tid) {
return true
}
}
return false
},

// setting to prevent panics. should we track and report these?
@@ -153,20 +183,24 @@
}

if len(out) != 1 {
if meta.TotalObjects == int64(len(traceIDs)) {
// we removed all traces from the block
return nil, nil
}
return nil, fmt.Errorf("expected 1 block, got %d", len(out))
}

newMeta := out[0]

if newMeta.TotalObjects != meta.TotalObjects-1 {
if newMeta.TotalObjects != meta.TotalObjects-int64(len(traceIDs)) {
return nil, fmt.Errorf("expected output to have one less object then in. out: %d in: %d", newMeta.TotalObjects, meta.TotalObjects)
}

return newMeta, nil
}

func blocksWithTraceID(ctx context.Context, r backend.Reader, tenantID string, traceID common.ID) ([]*backend.BlockMeta, error) {
blockIDs, _, err := r.Blocks(context.Background(), tenantID)
blockIDs, _, err := r.Blocks(ctx, tenantID)
if err != nil {
return nil, err
}
@@ -184,7 +218,7 @@ func blocksWithTraceID(ctx context.Context, r backend.Reader, tenantID string, t
// search here
meta, err := isInBlock(ctx, r, blockNum2, id2, tenantID, traceID)
if err != nil {
fmt.Println("Error querying block:", err)
fmt.Println("\nError querying block:", err)
return
}

@@ -211,7 +245,7 @@ func isInBlock(ctx context.Context, r backend.Reader, blockNum int, id uuid.UUID
fmt.Print(strconv.Itoa(blockNum))
}

meta, err := r.BlockMeta(context.Background(), id, tenantID)
meta, err := r.BlockMeta(ctx, id, tenantID)
if err != nil && !errors.Is(err, backend.ErrDoesNotExist) {
return nil, err
}
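
A side note on the `DropObject` hook in the diff above: it scans the requested trace IDs linearly for every object in the block. The sketch below contrasts that with a set-based lookup. It is purely illustrative, not part of this commit, and uses a plain `[]byte` alias in place of Tempo's `common.ID`.

```go
package main

import (
	"bytes"
	"fmt"
)

// ID stands in for Tempo's common.ID ([]byte) so the sketch compiles on its own.
type ID = []byte

// linearDrop mirrors the hook in this commit: compare against every requested ID.
func linearDrop(traceIDs []ID) func(ID) bool {
	return func(id ID) bool {
		for _, tid := range traceIDs {
			if bytes.Equal(id, tid) {
				return true
			}
		}
		return false
	}
}

// setDrop trades a little memory for O(1) lookups, which could help when
// dropping a long list of traces from a block with many objects.
func setDrop(traceIDs []ID) func(ID) bool {
	set := make(map[string]struct{}, len(traceIDs))
	for _, tid := range traceIDs {
		set[string(tid)] = struct{}{}
	}
	return func(id ID) bool {
		_, ok := set[string(id)]
		return ok
	}
}

func main() {
	ids := []ID{{0x01, 0x02}, {0x0a, 0x0b}}
	linear, fast := linearDrop(ids), setDrop(ids)
	fmt.Println(linear(ID{0x01, 0x02}), fast(ID{0x01, 0x02})) // true true
	fmt.Println(linear(ID{0xff}), fast(ID{0xff}))             // false false
}
```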