-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support for journal hibernation #411
Conversation
59e72fc
to
01ddcb3
Compare
01ddcb3
to
eb7c3a7
Compare
41c3489
to
557ca28
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really nice. I left a few inline comments, though IDK whether any of them should block merging. I think the docs in overview-build-and-test.rst
could use updated before merging, though.
@@ -15,7 +15,8 @@ import ( | |||
|
|||
// DiagnosticsConfig configures pull-based application metrics, debugging and diagnostics. | |||
type DiagnosticsConfig struct { | |||
// Nothing to see here (yet). | |||
Port string `long:"port" env:"PORT" description:"Port for diagnostics"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if you pass --private
but no --port
argument? Would that effectively disable the diagnostics endpoints? Should we validate that port is present if private is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would effectively disable diagnostics. I'll add a logged warning.
// | ||
// The Suspend field is managed by Gazette, and is updated when an Append RPC | ||
// suspends a journal or if auto-suspension is enabled. Operators should | ||
// not set it directly. However when utilizing suspension, operators MUST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you must take care not to accidentally clobber the current value of suspend
when running, for example, gazctl journals apply
. What would happen if you changed the value of suspend anyway? Would the broker actually change the suspension state of the journal, or does it just reject the request?
I'm not seeing where it'd be rejecting the request, so I'm guessing it'd actually update the value. This seems a little surprising to me. I don't really see a use case for allowing clients to directly change these fields, so it seems like maybe we ought to reject Apply requests that attempt to change it. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if you changed the value of suspend anyway?
-
If the journal is partially suspended (one read-only replica), then the overwrite un-suspends it and there's no harm done. The prior replica already had the correct resume offset which is synchronizes with new topology peers.
-
If the journal is partially suspended AND the single scaled-down replica crashes at the same time, you may need to run
reset head
in order to resume appends. The precondition was that all data is already in cloud storage, but it cannot now use the suspend offset to automatically resume the next append offset. -
If the journal is fully suspended (no data exists in cloud storage) and its spec is clobbered, then it will become active and begin writing from offset zero.
I'm not seeing where it'd be rejecting the request, so I'm guessing it'd actually update the value. This seems a little surprising to me.
That's correct, it will overwrite it. From a technical angle, the control flow for applying a spec update verifies the applied spec is valid, but doesn't even consider the prior spec state (only the client-supplied ModRevision).
We could do more validation here, but i didn't feel it was worth it and i can also imagine scenarios where it's desirable to manually change these fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the prior value was unsuspended and you set it to either partially or fully suspended, would the brokers handle that by either partially or fully suspending the journal?
I'm also wondering how this would work in the scenario where an older client version is running against a newer broker. The old client would presumably leave the suspend field blank, which seems like it'd effectively clobber whatever is there. We might just say not to use journal suspension unless all your clients are updated, though that does make me a little nervous because it'd be easy for someone to accidentally use an outdated gazctl
.
I'm not really sure how important those concerns are, but I'm mainly just trying to understand what we'll need to do as part of the upgrade and what the risks are with the Apply RPC.
When `Port` is set, serve diagnostics (on http.DefaultServeMux) over the indicated port. When `Private` is set, don't serve diagnostics over the main service port.
Truncating slices but holding their memory means that the garbage collector is unable to free the backing files of persisted spools which happen to be resident in the queue.
0b9a9d8
to
d1ff715
Compare
Made some lightweight updates to I backed out upgrades of gRPC, as changes there have altered memory semantics compared to its prior behavior (it now assumes it owns buffers of messages sent into SendMsg, where it hadn't before and we relied on that behavior). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Introduce a journal suspension mechanism which is incorporated into the Append RPC state machine, and stored as a `suspend` field within each JournalSpec. Suspended journals consumer fewer or no assignment slots, reducing broker and Etcd resources required for journals which are not actively serving appends. When running with the `--auto-suspend` flag, brokers will automatically suspend journals which are fully flushed to the fragment index.
Previously, a `stopCh` signal could hang the "pump" goroutine, sending into a channel which is never read. Instead, ensure we fully consume the pump channel and use a context cancellation to ensure it exits on `stopCh`.
Junk the docker-based builder strategy and instead use the GitHub Actions CI environment for builds. Introduce an ARM64 broker variant using cross-compilation and the Github docker build action with QEMU. Switch to using the librocksdb8.9 library provided by Ubuntu 24.04 within CI and within the "examples" image, rather than pinning RocksDB and building it ourselves. Switch to Github container registry for release builds.
Remove --disable-stores behavior, which causes fragments to never persist to their configured remote store. This flag has historically been used in testing contexts. Replace it with --file-only, which achieves a similar testing outcome by re-interpreting a s3://bucket/foobar store to instead use file:///foobar This preserves the desired behavior of keeping changes local and not writing to remote fragment stores, while allowing for test brokers to be restarted without data loss, as well as allowing for testing of features that require persistence (like journal suspension).
afe1503
to
95692a2
Compare
Implement journal suspension and auto-suspension, under which journals use fewer or no runtime resources if they're not being actively appended to or if they have no content.
Overhaul the CI build system and introduce ARM64 builds of the Gazette broker.
Various other small updates; see commits.
This change is