Imbalanced resource usage of compactor instances #4067

Closed
chenlujjj opened this issue Sep 11, 2024 · 6 comments
Labels
stale (Used for stale issues / PRs)

Comments

chenlujjj commented Sep 11, 2024

Describe the bug

We set up 8 compactor replicas and observed that their resource usage is imbalanced; some instances' CPU and memory usage suddenly dropped to a very low level:
[screenshot]
It seems this instance stopped sending requests to the backend after a certain point in time:
[screenshot]

A normal instance, for comparison:
[screenshot]
[screenshot]

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo (SHA or version): 2.6.0
  2. Perform Operations (Read/Write/Others): Produce traces at a steady rate and send them to Tempo

Expected behavior

All compactor instances should have balanced resource usage

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Additional Context

The TempoCompactorsTooManyOutstandingBlocks alert is triggered, and tempodb_compaction_outstanding_blocks keeps increasing:
[screenshot]
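
For reference, this is roughly what such an alert looks like in Prometheus rule form. This is only an illustrative sketch: the rule group name, grouping labels, threshold, and duration below are placeholders, not the actual tempo-mixin definition.

groups:
  - name: tempo-compactor-alerts  # illustrative group name, not the real tempo-mixin rules
    rules:
      - alert: TempoCompactorsTooManyOutstandingBlocks
        # fires when blocks pile up faster than the compactors can consume them
        expr: sum by (tenant) (tempodb_compaction_outstanding_blocks) > 100
        for: 6h
        labels:
          severity: warning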

@joe-elliott (Member)

As the rate of block creation increases, I would recommend lowering the following setting to allow more compactors to participate in reducing the length of the blocklist:

compactor:
  compaction:
    compaction_window: 1h  # default

Lowering this value too much will prevent compactors from finding blocks to compact, so perhaps try 30m and see the impact?
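
For example, a minimal override for the suggested 30m window might look like this (a sketch only; 30m is just a starting point to experiment with):

compactor:
  compaction:
    # a shorter window lets more compactors share a fast-growing blocklist,
    # but too short a window leaves each compactor with too little to compact
    compaction_window: 30m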

@github-actions (bot)

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply the keepalive label to exempt this issue.

github-actions bot added the stale label on Nov 12, 2024
github-actions bot closed this as not planned on Nov 27, 2024
@markustoivonen (Contributor)

@joe-elliott are there any other measures one could try? Changing compaction_window to 30m did not change the situation.

For us the situation is the same: most of the time there is an idle pod, even though the active pods may exceed their resources and get OOMKilled.

[screenshot]
[screenshot]

@joe-elliott (Member)

What's the value of tempodb_compaction_outstanding_blocks and how long is your blocklist? It's possible you can just scale down the compactors; it's not necessary to perfectly compact the blocklist.
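
If it helps to track those two values over time, a rough sketch of Prometheus recording rules could look like the following (the rule and group names are made up for illustration; tempodb_blocklist_length is, as far as I know, the per-tenant blocklist gauge Tempo exposes):

groups:
  - name: tempo-compaction-visibility  # illustrative group name
    rules:
      # blocks still waiting to be compacted, summed across tenants
      - record: tempo:compaction_outstanding_blocks:sum
        expr: sum(tempodb_compaction_outstanding_blocks)
      # total blocklist length across tenants
      - record: tempo:blocklist_length:sum
        expr: sum(tempodb_blocklist_length)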

etiennep commented Dec 15, 2024

We've also experienced the same issue and I've noticed some log activity:

2024-12-15 06:00:26.459	
caller=compactor.go:160 err="does not exist" host=ip-10-51-14-209.us-west-2.compute.internal level=warn msg="unable to find meta during compaction.  trying again on this block list" pod=tempo-compactor-6958d9dcd6-wzt4c time=2024-12-15T14:00:26.459937865Z ts=2024-12-15T14:00:26.459864184Z
2024-12-15 06:00:26.445	
caller=compactor.go:160 err="does not exist" host=ip-10-51-14-209.us-west-2.compute.internal level=warn msg="unable to find meta during compaction.  trying again on this block list" pod=tempo-compactor-6958d9dcd6-wzt4c time=2024-12-15T14:00:26.445155043Z ts=2024-12-15T14:00:26.445079252Z
2024-12-15 06:00:26.430	
caller=compactor.go:160 err="does not exist" host=ip-10-51-14-209.us-west-2.compute.internal level=warn msg="unable to find meta during compaction.  trying again on this block list" pod=tempo-compactor-6958d9dcd6-wzt4c time=2024-12-15T14:00:26.430962995Z ts=2024-12-15T14:00:26.430871084Z
2024-12-15 06:00:26.421	
caller=compactor.go:162 err="unable to mark 2 blocks compacted" host=ip-10-51-14-209.us-west-2.compute.internal level=error msg="error during compaction cycle" pod=tempo-compactor-6958d9dcd6-wzt4c time=2024-12-15T14:00:26.421975681Z ts=2024-12-15T14:00:26.421888458Z
2024-12-15 06:00:26.413	
blockID=fe4c1bb7-5b08-4915-88ed-2d5ea5a9dbfc caller=compactor.go:286 err="error copying obj meta to compacted obj meta: The specified key does not exist." host=ip-10-51-14-209.us-west-2.compute.internal level=error msg="unable to mark block compacted" pod=tempo-compactor-6958d9dcd6-wzt4c tenantID=default time=2024-12-15T14:00:26.413388724Z ts=2024-12-15T14:00:26.413262341Z
2024-12-15 06:00:26.400	
blockID=906429df-a136-4382-be1e-26b0bfe05b4f caller=compactor.go:286 err="error copying obj meta to compacted obj meta: The specified key does not exist." host=ip-10-51-14-209.us-west-2.compute.internal level=error msg="unable to mark block compacted" pod=tempo-compactor-6958d9dcd6-wzt4c tenantID=default time=2024-12-15T14:00:26.400824270Z ts=2024-12-15T14:00:26.400695568Z

And then suddenly this instance's activity goes down to zero and never recovers.

Our blocklist length is ~75K and outstanding blocks are steady around ~60K across 23 pods.

@joe-elliott (Member)

@mdisibio just put up a PR for what we believe may be causing this issue:

#4446

The internal blocklist is not being updated correctly, and the same block is constantly being rediscovered.
