verifier: tolerate data loss #52

Merged: 10 commits merged into main from nv/write-caching on Mar 29, 2024

Conversation

@nvartolomei (Contributor) commented on Mar 18, 2024

  • Introduce a -continuous flag, which instructs kgo-verifier to continue polling for new data rather than exiting or looping.
  • Introduce a -tolerate-data-loss flag, which facilitates testing the write caching feature, where data loss is tolerable in specific circumstances (a minimal flag sketch follows below).
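
A minimal sketch of how such flags might be declared with Go's standard flag package. The flag names come from this PR; the variable names, defaults, and wiring are illustrative, not the actual kgo-verifier code:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names match the ones introduced in this PR; everything else is illustrative.
	continuous := flag.Bool("continuous", false,
		"keep polling for new data instead of exiting or looping on EOF")
	tolerateDataLoss := flag.Bool("tolerate-data-loss", false,
		"treat offsets moving backwards as tolerated data loss instead of a fatal error")
	flag.Parse()

	fmt.Printf("continuous=%v tolerate-data-loss=%v\n", *continuous, *tolerateDataLoss)
}
```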

@nvartolomei force-pushed the nv/write-caching branch 5 times, most recently from 38a5c12 to 1cb796e on March 21, 2024 17:03
This fails `go test`, which is introduced in a subsequent commit.
@nvartolomei force-pushed the nv/write-caching branch 2 times, most recently from fdb8113 to 45958fd on March 26, 2024 15:38
@graphcareful self-requested a review on March 27, 2024 14:42
@andrwng left a comment

LGTM! Just nits for me.

Also, just want to clarify the behavior when we see data loss:

  • validator is running with -continuous and -tolerate-data-loss
  • validator consumes up to 100, and is continuing to poll for 101+
  • data is "rolled back" to 90
  • PollFetches is doing a periodic poll of the highest offset, and under the hood will detect it moving backwards...? Is that right?
  • validator detects this, adjusts the ranges, and continues with respective workloads

@nvartolomei (Contributor, Author) replied, quoting the review:

LGTM! Just nits for me.

Also, just want to clarify the behavior when we see data loss:

  • validator is running with -continuous and -tolerate-data-loss
  • validator consumes up to 100, and is continuing to poll for 101+
  • data is "rolled back" to 90
  • PollFetches is doing a periodic poll of the highest offset, and under the hood will detect it moving backwards...? Is that right?
  • validator detects this, adjusts the ranges, and continues with respective workloads

Correct. When offsets are "rolled back", we necessarily increment the raft term / leader epoch. When Redpanda receives a fetch request with an "outdated" epoch, it responds with fenced_leader_epoch (https://github.com/redpanda-data/redpanda/blob/c2e22debf59dbc4ad91dfa8d1a9520e352b382fe/src/v/kafka/server/handlers/details/leader_epoch.h#L34). Note: this happens pre-write-caching too.

With this, we (franz-go) fetch the last offset for that epoch using an offset_for_leader_epoch request (https://github.com/redpanda-data/redpanda/blob/40f4ad32381fec89043281805f2877bf4d84b3fc/src/v/kafka/server/handlers/offset_for_leader_epoch.cc#L122), surfaced as kerr.DataLoss on the kgo side, which tells us where to roll back.
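
A rough consumer-side sketch of where such a rollback would surface with franz-go. The broker address and topic are placeholders, and the exact error type checked (kgo.ErrDataLoss) and whether it surfaces through EachError are assumptions based on franz-go's documented data-loss handling, not kgo-verifier's actual worker code:

```go
package main

import (
	"context"
	"errors"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Broker address and topic are placeholders.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumeTopics("verifier-topic"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	for {
		fetches := cl.PollFetches(context.Background())
		fetches.EachError(func(topic string, partition int32, err error) {
			// When the broker fences an outdated leader epoch and the follow-up
			// offset_for_leader_epoch lookup shows truncation, franz-go reports
			// a data-loss error and resumes from the rolled-back offset.
			var loss *kgo.ErrDataLoss // assumed error type, see lead-in above
			if errors.As(err, &loss) {
				log.Printf("data loss on %s/%d: %v", topic, partition, err)
				return
			}
			log.Printf("fetch error on %s/%d: %v", topic, partition, err)
		})
		fetches.EachRecord(func(r *kgo.Record) {
			// A verifier would validate offsets/keys here; on detected loss it
			// adjusts its expected ranges instead of failing.
			_ = r
		})
	}
}
```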

This allows restarting the consumer and having it resume from the last
committed offset.
Will make state management easier in a subsequent PR.
Will be used for refactoring in a subsequent PR.
Never hang /last_pass, /shutdown requests in case of retries.

Quality of life change.
In this mode the verifier waits for new data to be produced instead of
exiting on EOF.

Stops after a /last_pass request.
This mode allows verifying Redpanda when write caching is enabled. In
addition to tolerating data loss, we also record and export to the
/status endpoint the number of offsets/records that are considered lost
from the verifier's point of view.
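
For context, a test harness might read those counters by polling the verifier's /status endpoint. A minimal sketch, where only the endpoint path comes from this PR; the port and the shape of the response body are assumptions:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Port is a placeholder; kgo-verifier's remote-control port is configured at startup.
	resp, err := http.Get("http://localhost:8080/status")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// The response includes the count of offsets/records the verifier considers lost.
	fmt.Println(string(body))
}
```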
@dotnwat requested a review from @andrwng on March 28, 2024 21:42
@nvartolomei merged commit 8f4fdb7 into main on Mar 29, 2024
2 checks passed