Add readyness API to net.server #5770
@grafana/grafana-agent-signals-maintainers what do you think of this? Even though the procedure is manual, this will allow one to effectively drain the WAL from agent pods deployed in a statefulset, instead of just shutting them down without knowing if there's still data in the WAL.
@@ -58,6 +63,28 @@ func NewTargetServer(logger log.Logger, metricsNamespace string, reg prometheus.
	return ts, nil
}

func (ts *TargetServer) registerManagementAPI(router *mux.Router) {
Since we have some form of config management this may need a different name to be easily differentiated.
This feels like a global solution to a very specific problem. My intuition is that this state change should be pushed to components so they can additionally take individual action. For instance, any component that gets logs by polling, or any metrics scraper, will continue running while the readiness probe is false. This should also hook into clustering, since ideally this removes the pod from the cluster. @rfratto and @tpaschalis for second opinions.
I also have some reservations about this, since I don't fully understand the use case yet and whether we need a new concept to solve that use case. Why isn't the normal lifecycle of the pod sufficient, where a pod is terminated and it finishes any work that it needs to do on shutdown before fully exiting?
@rfratto so the use case is the following. Let's say you host a bunch of agents in a
At some point we notice the agents are receiving a lot of traffic, so we scale up to N replicas. Later, when the surge has finished and traffic is back to normal, one would like to scale down the statefulset. Scaling down the statefulset means one of the agents inside will be evicted and its PVC lost. Since the PVC contains the WAL, we need some mechanism to perform a graceful shutdown, which looks as follows:
This PR implements an endpoint so that step 1) can be done manually: the agent is marked as not ready, and the k8s cluster stops routing more traffic to it. I think ideally we'd want this procedure to be automatic, with the configured timeouts, upon a signal that stops the agent. Any ideas/recommendations?
If I understand your use case correctly, that behavior can already be achieved today without introducing new probes: have the Run method of your component flush data before returning after the context passed to Run is canceled. When the agent shuts down today, the Flow controller will terminate all running components. The Flow controller will wait for all components to finish shutting down before finally exiting the process. Kubernetes has a grace period where it waits for a pod to exit gracefully before force killing it; you can tune that setting on the pod if you need more time to flush data.
Mmm interesting, gotta explore that option. The only thing is, say in the example above I have the
Currently, the controller will terminate all the components at the same time: agent/pkg/flow/internal/controller/scheduler.go Lines 106 to 112 in 461a4b2
Closing in favour of #5804
PR Description
This PR adds an opt-in readiness API to common/net/server.go. The idea is that if one hosts a set of agents in something like Kubernetes, where readiness probes can be defined, this allows toggling the readiness state. This is useful for manually draining a pod, for example when using loki.source components that expose a network server. The draining procedure, assuming agents are hosted in a statefulset, would be:
… PUT /server/toggle_ready endpoint
… loki.source and loki.write)
Which issue(s) this PR fixes
Part of https://github.com/grafana/cloud-onboarding/issues/5407
Notes to the Reviewer
PR Checklist