feat(v2): background compaction cleanup #3694

Merged: 38 commits into main from feat/compaction-background-cleanup on Nov 25, 2024

Conversation

@kolesnikovae (Collaborator) commented Nov 15, 2024

The change moves storage cleanup to the compaction worker service and alters the compaction orchestration: now we explicitly replicate state updates. The change is still being tested; however, the PR is ready for review.

Please refer to the README for details.

I'm going to improve the way the compaction strategy is configured and add compaction metrics (to the compaction planner, scheduler, and worker) before merging the PR.

@kolesnikovae kolesnikovae force-pushed the feat/compaction-background-cleanup branch from 49ff50d to 000525c Compare November 18, 2024 08:56
# Conflicts:
#	api/gen/proto/go/metastore/v1/compactor.pb.go
#	api/gen/proto/go/metastore/v1/index.pb.go
#	api/gen/proto/go/metastore/v1/index_vtproto.pb.go
#	api/gen/proto/go/metastore/v1/metastorev1connect/index.connect.go
#	api/gen/proto/go/metastore/v1/types.pb.go
#	api/gen/proto/go/metastore/v1/types_vtproto.pb.go
#	api/metastore/v1/index.proto
#	api/metastore/v1/types.proto
#	pkg/experiment/metastore/cleaner_raft_handler.go
#	pkg/experiment/metastore/cleaner_service.go
#	pkg/experiment/metastore/client/methods.go
#	pkg/experiment/metastore/compaction_planner.go
#	pkg/experiment/metastore/compaction_raft_handler.go
#	pkg/experiment/metastore/compaction_service.go
#	pkg/experiment/metastore/fsm/fsm.go
#	pkg/experiment/metastore/index/index.go
#	pkg/experiment/metastore/index/store.go
#	pkg/experiment/metastore/index_service.go
#	pkg/experiment/metastore/markers/deletion_markers.go
#	pkg/experiment/metastore/metastore.go
#	pkg/experiment/metastore/raftnode/node.go
@kolesnikovae kolesnikovae marked this pull request as ready for review November 19, 2024 11:13
@kolesnikovae kolesnikovae requested a review from a team as a code owner November 19, 2024 11:13
@kolesnikovae kolesnikovae requested a review from a team as a code owner November 20, 2024 10:28
@kolesnikovae (Collaborator, Author):

NB: in the latest optimization I broke block cleanup – fixing it now

@aleks-p (Contributor) left a comment:

I like the idea of replicating the state of the compaction plan explicitly; it should dramatically reduce the likelihood of the state being inconsistent between replicas. It also solves the state issues when rolling out code changes to replicas.

I am not entirely sold on the implementation itself. I think we can proceed with merging this, but I would try to see if we can simplify a few things in a future iteration. My main concern is that this will be harder to maintain. We introduce many new concepts (a scheduler, planner, etc.), multiple layers of queues and many types representing different representations of compaction jobs. I fear that some parts (e.g., the compaction queue) will be hard to reason about when the knowledge about the internal workings is not as fresh as now.

@aleks-p (Contributor):

All of these types seem to be used internally; can they be moved out of the api folder?

Comment on lines 40 to 41
workers int
free atomic.Int32
@aleks-p (Contributor):

The type is called Worker, and it is not immediately obvious what these two fields are for. Are there other names that could communicate that more clearly?

@kolesnikovae (Collaborator, Author):

Renamed to threads and capacity, respectively.
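
For illustration, a minimal sketch of the renamed fields (the names follow this thread; the struct shape and comments are assumptions rather than the actual code in the PR):

import "sync/atomic"

// Worker executes compaction jobs.
type Worker struct {
  threads  int          // configured number of concurrent compaction threads
  capacity atomic.Int32 // currently free job slots, updated as jobs are taken and completed
}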

ctx context.Context
cancel context.CancelFunc
*metastorev1.CompactionJob
source []*metastorev1.BlockMeta
@aleks-p (Contributor):

We already have sourceBlocks; maybe we can call this resolvedBlocks, sourceMetas, or similar?

@kolesnikovae (Collaborator, Author):

It's an interesting read on the topic. I agree with the author and strive to follow the principles outlined in the article.

The version I decided to go with:

compacted, err := block.Compact(ctx, job.blocks)

@aleks-p (Contributor):

Not sure how this is relevant; I was referring to the "source" field in the struct, which is now called blocks.

My point is that source or blocks are named too similarly to sourceBlocks in *metastorev1.CompactionJob. If someone is working with a compactionJob object, they have "sourceBlocks" and "blocks" to decide between. Both represent source blocks; one of them (blocks) is initialized later than the other (resolved via a metastore client call), which is why I suggested using a name with a qualifier.

@kolesnikovae (Collaborator, Author) commented Nov 22, 2024:

I strongly recommend this semi-official style guide and this piece of wisdom for further reading, in addition to the link I already shared.

Yes, I understand your point, but I can't agree.

You are right that source might not be the best name, but for a slightly different reason. It's not possible to confuse or misuse job.SourceBlocks with job.source (now job.blocks) since they have different types and visibility levels. I also believe the purpose of the members is very clear and specific in the context.

I just think blocks is the clearest and most concise option here – perfect qualities for a variable name.

block.Compact(ctx, job.blocks)         // I settled on this version.
block.Compact(ctx, job.source)         // Works well but might be ambiguous for a reader who has no idea what a compaction job is and what the job source is.
block.Compact(ctx, job.sourceBlocks)   // Quite good, but matches the input job.SourceBlocks and can be shortened without any clarity loss: any block we give at input is the source.
block.Compact(ctx, job.resolvedBlocks) // Irrelevant details. Besides, this is the only place in the whole codebase where we mention "resolved block".
block.Compact(ctx, job.sourceMetas)    // Looks like we're compacting metas. There's no reason to include the type name in the variable name.

go func() {
defer w.wg.Done()
w.jobsLoop(ctx)
level.Info(w.logger).Log("msg", "compaction worker thead started")
@aleks-p (Contributor):

nit, typo in "thread"

Comment on lines +107 to +108
for created := 0; created < capacity; created++ {
plan, err := planner.CreateJob()
@aleks-p (Contributor):

Not sure I understand why we try to create this number (capacity) of jobs. Is it just a relatively low number to avoid having a large response here?

@kolesnikovae (Collaborator, Author):

There are a number of reasons:

  • We do want to have small raft messages (where the planned job will end up).
  • We don't want to create jobs ahead of time. This allows altering the planner, scheduler, and workers' configs at any time, with almost instant effect.

The rate I want our system to maintain is at least 1GB/s, which is roughly 256 segments (500ms) per second, or around 1M metadata entries and 50K compaction jobs per hour.

Consider the case when no workers are available (e.g., due to infrastructure issues, misconfiguration, or bugs), or when workers do not have enough capacity to handle all the jobs.

Consider also the case when the metastore is unavailable for a period (e.g., due to infrastructure issues, misconfiguration, or bugs), and our DLQ accumulates a substantial number of entries to process (e.g., 1M, which is a reasonable number). While we will pace the processing, many new blocks (and thus jobs) are to be added over a short period of time.

If there is no limit on the job queue size (which is TODO, BTW), and no limit on how many jobs we can create at once, ingestion could be blocked for a long period of time when the metastore or workers come back online (or added). However, if we produce no more jobs than we can actually handle, this will not happen: the service will adapt to the capacity of the worker fleet, and all jobs will eventually be created and scheduled.

In addition, the underlying implementation of the block queue can handle millions of entries without major issues. While the job queue is simpler, it may cause performance issues if not handled carefully – something we may want to address in the future.
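
To make the adaptive behavior concrete, here is a rough sketch of the capacity-bounded planning step (the planner interface and job type are simplified stand-ins; only the loop shape follows the snippet under review):

// jobPlan and jobPlanner are simplified stand-ins for the real types.
type jobPlan struct{ name string }

type jobPlanner interface {
  // CreateJob returns the next job plan, or nil when nothing is ready.
  CreateJob() (*jobPlan, error)
}

// planNewJobs creates at most capacity new jobs per state update, so the
// plan never grows faster than the worker fleet can consume it; unused
// capacity is simply left for the next request.
func planNewJobs(planner jobPlanner, capacity int) ([]*jobPlan, error) {
  plans := make([]*jobPlan, 0, capacity)
  for created := 0; created < capacity; created++ {
    plan, err := planner.CreateJob()
    if err != nil {
      return plans, err
    }
    if plan == nil {
      break // nothing ready to compact right now
    }
    plans = append(plans, plan)
  }
  return plans, nil
}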

@aleks-p (Contributor):

I was referring to the choice of capacity as the limit, not saying that we should not have a limit. We already assign jobs above so the capacity might be "spent" already.

@kolesnikovae (Collaborator, Author) commented Nov 22, 2024:

The answer is here:

However, if we produce no more jobs than we can actually handle, this will not happen: the service will adapt to the capacity of the worker fleet, and all jobs will eventually be created and scheduled.

I think I need to expand on this and add it to the documentation. The challenge is that we don't know the capacity of our worker fleet in advance, and we have no control over them – they can appear and disappear at any time. Therefore, we need an adaptive approach if we want to keep our queue short and workers busy.

I considered a few options:

  1. We produce as many new jobs as were assigned in this update. This deadlocks: when the scheduler queue is empty, no workers are assigned any jobs at all, so no new jobs are produced, so there is still nothing to assign – even though there might be jobs ready for scheduling. Loop.

  2. We produce capacity minus assigned new jobs: we never utilize capacity fully. I'd call this strategy "greedy worker" – the intuition here is that everyone only cares about itself, but in the end, everyone starves.

  3. We produce capacity new jobs: we effectively create jobs for another compaction worker instance. Essentially, we have evidence that this number of jobs will eventually be handled.

  4. We ensure that the queue holds at least max_worker_capacity assignable jobs. That works, but it is trickier to implement and requires a parameter, which would have to be dynamic.

There's an unaddressed risk, however: if all the workers just abandon the jobs they take, the queue will inflate. This is why we need a hard cap and a displacement policy, regardless of the scheduling principles.
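
Purely as an illustration of the hard cap and displacement policy mentioned above (not part of this PR; the type and the oldest-first eviction rule are assumptions):

// boundedJobQueue keeps at most capMax jobs; when full, pushing a new job
// displaces the oldest one so that abandoned jobs cannot inflate the queue
// without bound.
type boundedJobQueue struct {
  capMax int
  jobs   []string // job names, oldest first
}

// push appends a job and reports which job, if any, was displaced.
func (q *boundedJobQueue) push(job string) (displaced string, evicted bool) {
  if q.capMax > 0 && len(q.jobs) >= q.capMax {
    displaced, evicted = q.jobs[0], true
    q.jobs = q.jobs[1:]
  }
  q.jobs = append(q.jobs, job)
  return displaced, evicted
}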

@kolesnikovae (Collaborator, Author):

Thank you for the review, Aleks!

I am not entirely sold on the implementation itself. I think we can proceed with merging this, but I would try to see if we can simplify a few things in a future iteration.

I'd like to request more specific, actionable feedback. If you have tangible suggestions for simplification or areas where you see unnecessary complexity, I'm happy to discuss and explore adjustments. I suspect that one of us may be missing some nuances of the system's operation and failure modes. Let's discuss it next time we meet.

My main concern is that this will be harder to maintain. We introduce many new concepts (a scheduler, planner, etc.),

The current design introduces the following components:

  • The compactor accepts blocks for compaction and owns the compaction queue.
  • The planner is responsible for creating job plans (defining the compaction jobs).
  • The scheduler oversees job priorities, assignments, and status transitions.

Each of the components has a well-defined set of responsibilities. This separation of concerns is intentional to ensure the system remains maintainable as it evolves. If you believe the earlier version (and a couple of pieces here and there) was simpler and preferable, I'm open to a discussion.
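
As a rough sketch, the split could be expressed as interfaces along these lines (the method signatures are assumptions and do not match the PR exactly; the metastorev1 types are the generated protobuf types referenced elsewhere in this conversation, with the import path assumed from the file list above):

import (
  metastorev1 "github.com/grafana/pyroscope/api/gen/proto/go/metastore/v1"
)

// Compactor accepts blocks for compaction and owns the compaction queue.
type Compactor interface {
  AddBlock(md *metastorev1.BlockMeta)
}

// Planner creates job plans, i.e. defines the compaction jobs.
type Planner interface {
  CreateJob() (*metastorev1.CompactionJob, error)
}

// Scheduler oversees job priorities, assignments, and status transitions.
type Scheduler interface {
  AssignJob() (*metastorev1.CompactionJob, error)
  UpdateJob(job *metastorev1.CompactionJob) error
}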

I fear that some parts (e.g., the compaction queue) will be hard to reason about when the knowledge about the internal workings is not as fresh as now.

The data structures involved – priority queues and linked lists – are standard and should be familiar to most developers. However, if specific parts of the codebase seem unclear, I'd be happy to add documentation or comments.

and many types representing different representations of compaction jobs

I assume you're referring to:

message CompactionPlanUpdate {
  repeated NewCompactionJob new_jobs = 1;
  repeated AssignedCompactionJob assigned_jobs = 2;
  repeated UpdatedCompactionJob updated_jobs = 3;
  repeated CompletedCompactionJob completed_jobs = 4;
}

A job has different attributes depending on its status (e.g., UpdatedCompactionJob never includes the job plan unlike AssignedCompactionJob, while CompletedCompactionJob includes the job results). Strict typing will protect against mistakes. If you have concerns about this approach, let's discuss alternatives that still maintain clarity and safety.

@aleks-p (Contributor) commented Nov 21, 2024:

I'd like to request more specific, actionable feedback. If you have tangible suggestions for simplification or areas where you see unnecessary complexity, I'm happy to discuss and explore adjustments. I suspect that one of us may be missing some nuances of the system's operation and failure modes. Let's discuss it next time we meet.

Sorry I can't provide more specific feedback right away; I was sharing my initial impression of the changes. Things are a bit clearer after a second look. We are making a large shift from creating immutable jobs when we consume blocks to creating jobs on demand. We are also opening up the possibility of partial completion of "batches", which then requires extra care. As I said in my previous comment and as discussed offline, let's move forward with this and act later if needed.

@kolesnikovae kolesnikovae force-pushed the feat/compaction-background-cleanup branch from 08d9387 to 65562d1 Compare November 24, 2024 13:49
@kolesnikovae kolesnikovae merged commit 00f81b9 into main Nov 25, 2024
29 checks passed
@kolesnikovae kolesnikovae deleted the feat/compaction-background-cleanup branch November 25, 2024 13:46