Imbalanced resource usage of compactor instances #4067

Closed
chenlujjj opened this issue Sep 11, 2024 · 6 comments
Labels
stale (Used for stale issues / PRs)

Comments

chenlujjj commented Sep 11, 2024

Describe the bug

We set up 8 compactor replicas and observed that their resource usage is imbalanced; some instances' CPU and memory usage suddenly dropped to a very low level:
[screenshot]
It seems this instance stopped sending requests to the backend after a certain point in time:
[screenshot]

A normal instance, for comparison:
[screenshot]
[screenshot]

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo (SHA or version): 2.6.0
  2. Perform Operations (Read/Write/Others): Produce traces at a steady rate and send them to Tempo

Expected behavior

All compactor instances should have balanced resource usage

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Additional Context

The TempoCompactorsTooManyOutstandingBlocks alert is triggered, and tempodb_compaction_outstanding_blocks keeps increasing:
[screenshot]
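
For reference, this is roughly what such an alert looks like in Prometheus rule form. This is only an illustrative sketch: the rule group name, grouping labels, threshold, and duration below are placeholders, not the actual tempo-mixin definition.

groups:
  - name: tempo-compactor-alerts  # illustrative group name, not the real tempo-mixin rules
    rules:
      - alert: TempoCompactorsTooManyOutstandingBlocks
        # fires when blocks pile up faster than the compactors can consume them
        expr: sum by (tenant) (tempodb_compaction_outstanding_blocks) > 100
        for: 6h
        labels:
          severity: warning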

@joe-elliott (Member)

As the rate of block creation increases, I would recommend lowering the following setting to allow more compactors to participate in reducing the length of the blocklist:

compactor:
  compaction:
    compaction_window: 1h  # default

Lowering this value too much will prevent compactors from finding blocks to compact, so perhaps try 30m and see the impact?
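
For example, a minimal override for the suggested 30m window might look like this (a sketch only; 30m is just a starting point to experiment with):

compactor:
  compaction:
    # a shorter window lets more compactors share a fast-growing blocklist,
    # but too short a window leaves each compactor with too little to compact
    compaction_window: 30m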

@github-actions (bot)

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply the keepalive label to exempt this issue.

github-actions bot added the stale label on Nov 12, 2024
github-actions bot closed this as not planned on Nov 27, 2024
@markustoivonen (Contributor)

@joe-elliott are there any other measures one could try? Changing compaction_window to 30m did not change the situation.

For us the situation is the same: most of the time there is an idle pod, even though the active pods may exceed their resources and get OOMKilled.

[screenshot]
[screenshot]

@joe-elliott (Member)

What's the value of tempodb_compaction_outstanding_blocks and how long is your blocklist? It's possible you can just scale down the compactors; it's not necessary to perfectly compact the blocklist.
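
If it helps to track those two values over time, a rough sketch of Prometheus recording rules could look like the following (the rule and group names are made up for illustration; tempodb_blocklist_length is, as far as I know, the per-tenant blocklist gauge Tempo exposes):

groups:
  - name: tempo-compaction-visibility  # illustrative group name
    rules:
      # blocks still waiting to be compacted, summed across tenants
      - record: tempo:compaction_outstanding_blocks:sum
        expr: sum(tempodb_compaction_outstanding_blocks)
      # total blocklist length across tenants
      - record: tempo:blocklist_length:sum
        expr: sum(tempodb_blocklist_length)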

etiennep commented Dec 15, 2024

We've also experienced the same issue and I've noticed some log activity:

2024-12-15 06:00:26.459	
caller=compactor.go:160 err="does not exist" host=ip-10-51-14-209.us-west-2.compute.internal level=warn msg="unable to find meta during compaction.  trying again on this block list" pod=tempo-compactor-6958d9dcd6-wzt4c time=2024-12-15T14:00:26.459937865Z ts=2024-12-15T14:00:26.459864184Z
2024-12-15 06:00:26.445	
caller=compactor.go:160 err="does not exist" host=ip-10-51-14-209.us-west-2.compute.internal level=warn msg="unable to find meta during compaction.  trying again on this block list" pod=tempo-compactor-6958d9dcd6-wzt4c time=2024-12-15T14:00:26.445155043Z ts=2024-12-15T14:00:26.445079252Z
2024-12-15 06:00:26.430	
caller=compactor.go:160 err="does not exist" host=ip-10-51-14-209.us-west-2.compute.internal level=warn msg="unable to find meta during compaction.  trying again on this block list" pod=tempo-compactor-6958d9dcd6-wzt4c time=2024-12-15T14:00:26.430962995Z ts=2024-12-15T14:00:26.430871084Z
2024-12-15 06:00:26.421	
caller=compactor.go:162 err="unable to mark 2 blocks compacted" host=ip-10-51-14-209.us-west-2.compute.internal level=error msg="error during compaction cycle" pod=tempo-compactor-6958d9dcd6-wzt4c time=2024-12-15T14:00:26.421975681Z ts=2024-12-15T14:00:26.421888458Z
2024-12-15 06:00:26.413	
blockID=fe4c1bb7-5b08-4915-88ed-2d5ea5a9dbfc caller=compactor.go:286 err="error copying obj meta to compacted obj meta: The specified key does not exist." host=ip-10-51-14-209.us-west-2.compute.internal level=error msg="unable to mark block compacted" pod=tempo-compactor-6958d9dcd6-wzt4c tenantID=default time=2024-12-15T14:00:26.413388724Z ts=2024-12-15T14:00:26.413262341Z
2024-12-15 06:00:26.400	
blockID=906429df-a136-4382-be1e-26b0bfe05b4f caller=compactor.go:286 err="error copying obj meta to compacted obj meta: The specified key does not exist." host=ip-10-51-14-209.us-west-2.compute.internal level=error msg="unable to mark block compacted" pod=tempo-compactor-6958d9dcd6-wzt4c tenantID=default time=2024-12-15T14:00:26.400824270Z ts=2024-12-15T14:00:26.400695568Z

And then suddenly this instance's activity goes down to zero and never recovers.

Our blocklist length is ~75K and outstanding blocks are steady around ~60K across 23 pods.

@joe-elliott (Member)

@mdisibio just put up a PR for what we believe may be causing this issue:

#4446

The internal blocklist is not being updated correctly, and the same block is constantly being rediscovered.
