verifier: tolerate data loss #52

Merged: 10 commits merged into main from nv/write-caching on Mar 29, 2024

Conversation

@nvartolomei (Contributor) commented on Mar 18, 2024

  • Introduce a -continuous flag, which instructs kgo-verifier to continue polling for new data rather than exiting or looping.
  • Introduce a -tolerate-data-loss flag, which facilitates testing the write caching feature, where data loss is tolerable in specific circumstances (a minimal flag sketch follows below).
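
A minimal sketch of how such flags might be declared with Go's standard flag package. The flag names come from this PR; the variable names, defaults, and wiring are illustrative, not the actual kgo-verifier code:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names match the ones introduced in this PR; everything else is illustrative.
	continuous := flag.Bool("continuous", false,
		"keep polling for new data instead of exiting or looping on EOF")
	tolerateDataLoss := flag.Bool("tolerate-data-loss", false,
		"treat offsets moving backwards as tolerated data loss instead of a fatal error")
	flag.Parse()

	fmt.Printf("continuous=%v tolerate-data-loss=%v\n", *continuous, *tolerateDataLoss)
}
```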

@nvartolomei force-pushed the nv/write-caching branch 5 times, most recently from 38a5c12 to 1cb796e on March 21, 2024 17:03
This fails `go test`, which is introduced in a subsequent commit.
@nvartolomei force-pushed the nv/write-caching branch 2 times, most recently from fdb8113 to 45958fd on March 26, 2024 15:38
@graphcareful self-requested a review on March 27, 2024 14:42
@andrwng left a comment

LGTM! Just nits for me.

Also, just want to clarify the behavior when we see data loss:

  • validator is running with -continuous and -tolerate-data-loss
  • validator consumes up to 100, and is continuing to poll for 101+
  • data is "rolled back" to 90
  • PollFetches is doing a periodic poll of the highest offset, and under the hood will detect it moving backwards...? Is that right?
  • validator detects this, adjusts the ranges, and continues with respective workloads

@nvartolomei (Contributor, Author) replied, quoting the review:

LGTM! Just nits for me.

Also, just want to clarify the behavior when we see data loss:

  • validator is running with -continuous and -tolerate-data-loss
  • validator consumes up to 100, and is continuing to poll for 101+
  • data is "rolled back" to 90
  • PollFetches is doing a periodic poll of the highest offset, and under the hood will detect it moving backwards...? Is that right?
  • validator detects this, adjusts the ranges, and continues with respective workloads

Correct. When offsets are "rolled back", we necessarily increment the raft term / leader epoch. When Redpanda receives a fetch request with an "outdated" epoch, it responds with fenced_leader_epoch (https://github.com/redpanda-data/redpanda/blob/c2e22debf59dbc4ad91dfa8d1a9520e352b382fe/src/v/kafka/server/handlers/details/leader_epoch.h#L34). Note: this happens pre-write-caching too.

With this, we (franz-go) fetch the last offset for that epoch using an offset_for_leader_epoch request (https://github.com/redpanda-data/redpanda/blob/40f4ad32381fec89043281805f2877bf4d84b3fc/src/v/kafka/server/handlers/offset_for_leader_epoch.cc#L122), surfaced as kerr.DataLoss on the kgo side, which tells us where to roll back.
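
A rough consumer-side sketch of where such a rollback would surface with franz-go. The broker address and topic are placeholders, and the exact error type checked (kgo.ErrDataLoss) and whether it surfaces through EachError are assumptions based on franz-go's documented data-loss handling, not kgo-verifier's actual worker code:

```go
package main

import (
	"context"
	"errors"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Broker address and topic are placeholders.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumeTopics("verifier-topic"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	for {
		fetches := cl.PollFetches(context.Background())
		fetches.EachError(func(topic string, partition int32, err error) {
			// When the broker fences an outdated leader epoch and the follow-up
			// offset_for_leader_epoch lookup shows truncation, franz-go reports
			// a data-loss error and resumes from the rolled-back offset.
			var loss *kgo.ErrDataLoss // assumed error type, see lead-in above
			if errors.As(err, &loss) {
				log.Printf("data loss on %s/%d: %v", topic, partition, err)
				return
			}
			log.Printf("fetch error on %s/%d: %v", topic, partition, err)
		})
		fetches.EachRecord(func(r *kgo.Record) {
			// A verifier would validate offsets/keys here; on detected loss it
			// adjusts its expected ranges instead of failing.
			_ = r
		})
	}
}
```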

This allows restarting the consumer and having it resume from the last
committed offset.
Will make state management easier in a subsequent PR.
Will be used for refactoring in a subsequent PR.
Never hang /last_pass, /shutdown requests in case of retries.

Quality of life change.
In this mode the verifier waits for new data to be produced instead of
exiting on EOF.

Stops after a /last_pass request.
This mode allows verifying Redpanda when write caching is enabled. In
addition to tolerating data loss, we also record and export to the
/status endpoint the number of offsets/records that are considered lost
from the verifier's point of view.
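
For context, a test harness might read those counters by polling the verifier's /status endpoint. A minimal sketch, where only the endpoint path comes from this PR; the port and the shape of the response body are assumptions:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Port is a placeholder; kgo-verifier's remote-control port is configured at startup.
	resp, err := http.Get("http://localhost:8080/status")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// The response includes the count of offsets/records the verifier considers lost.
	fmt.Println(string(body))
}
```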
@dotnwat requested a review from @andrwng on March 28, 2024 21:42
@nvartolomei merged commit 8f4fdb7 into main on Mar 29, 2024
2 checks passed