Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add readyness API to net.server #5770

Closed
wants to merge 1 commit into from
Closed

Add readyness API to net.server #5770

wants to merge 1 commit into from

Conversation

thepalbi
Copy link
Contributor

PR Description

This PR adds an opt-in readyness API for the common/net/server.go. The idea is that if one hosts a set of agents in something like kubernetes, where readyness probes can be defined, this will allow the toggling of the readyness state. This comes useful for manually draining a pod, for example, if using loki.source components that expose a network server. The draining procude would be, assumming agents are hosted in a statefulset.

  1. find the highest numbered pod
  2. toggle the readyness state with the PUT /server/toggle_ready endpoint
  3. wait, using metrics, for the desired draining state (for example, using WAL metrics for loki.source and loki.write)
  4. Downscale statefulset so that this pod can be evicted

Which issue(s) this PR fixes

Part of https://github.com/grafana/cloud-onboarding/issues/5407

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@thepalbi
Copy link
Contributor Author

@grafana/grafana-agent-signals-maintainers what do you think of this? Even though the procedure is manual, this will allow one to effectively drain the WAL from agent pods deployed in a statefulset, instead of just shutting them down without knowing if there's still data in the WAL.

@thepalbi thepalbi requested a review from mattdurham November 14, 2023 19:01
@@ -58,6 +63,28 @@ func NewTargetServer(logger log.Logger, metricsNamespace string, reg prometheus.
return ts, nil
}

func (ts *TargetServer) registerManagementAPI(router *mux.Router) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have some form of config management this may need a different name to be easily differentiated.

@mattdurham
Copy link
Collaborator

This feels like global solution to a very specific fix. My intuition is this state change should be pushed to components so they can take individual action additionally. For instance any component that is getting logs by polling or metrics scrapers will continue while the readiness probe is false. This should also hook into clustering since ideally this removes the pod from the cluster. @rfratto and @tpaschalis for second opinions.

@rfratto
Copy link
Member

rfratto commented Nov 15, 2023

I also have some reservations about this, since I don't fully understand the use case yet and whether we need a new concept to solve that use case.

Why isn't the normal lifecycle of the pod sufficient, where a pod is terminated and it finishes any work that it needs to do on shutdown before fully exiting?

@thepalbi
Copy link
Contributor Author

@rfratto so the use case is the following. Let's say you host a bunch of agents in a StatefulSet with a PVC for WAL. All of them have a config like the following:

loki.source.awsfirehose "receiver" {
  http {
    listen_address = "0.0.0.0"
    listen_port = 8080
  }
  forward_to = [loki.write.cell.receiver]
  use_incoming_timestamp = false
}
loki.write "cell" {
  endpoint {
    max_backoff_period = "7s"
    url = "loki.com"
  }
  wal {
    enabled = true
  }
}

At some point we notice the agents are receiving a bunch of traffic, so we scale up to N replicas. Later, when the surge finished and traffic is back to normal, one would like to scale down the statefulset. When scaling down the statefulset, that means one of the agent inside will be evicted, and the PVC lost. Since the PVC contains the WAL, we need some mechanism to performa graceful shutdown, which looks as follows:

  1. user decides to scale down the sfset
  2. user decreases the number of replicas in the statefulset
  3. "shutdown with drain signal is sent to the agent being evicted" (in a statefulset that is the replica with the highest number)
  4. agent shuts down first the loki.source..+ components, so no more data is received
  5. agent tells WAL watcher to burn through WAL
  6. when 2) is done with some timeout, agent exits
  7. statefulset now has one less replica, and data in the WAL has been evicted.

This PR implements an endpoint so step 1) can be done manually, marking as un-ready the agent, and the k8s cluster avoiding routing more traffic. I think ideally, we'd want this procedure to be automattic, with the configured timeouts, upon a SIG that stops the agent. Any ideas/recommendations?

@rfratto
Copy link
Member

rfratto commented Nov 15, 2023

If I understand your use case correctly, that behavior can already be achieved today without introducing new probes: have the Run method of your component flush data before returning after the context passed to Run is canceled.

When the agent shuts down today, the Flow controller will terminate all running components. The Flow controller will wait for all components to finish shutting down before finally exiting the process. Kubernetes has a grace period where it waits for a pod to exit gracefully before force killing it; you can tune that setting on the pod if you need more time to flush data.

@thepalbi
Copy link
Contributor Author

If I understand your use case correctly, that behavior can already be achieved today without introducing new probes: have the Run method of your component flush data before returning after the context passed to Run is canceled.

When the agent shuts down today, the Flow controller will terminate all running components. The Flow controller will wait for all components to finish shutting down before finally exiting the process. Kubernetes has a grace period where it waits for a pod to exit gracefully before force killing it; you can tune that setting on the pod if you need more time to flush data.

Mmm interesting, gotta explore that option. The only thing, is say in the example above I have the loki.source.awsfirehose -> loki.write.cell. How does the controller determine the components shutdown order? Is it all at once, or it follows the DAG?

@rfratto
Copy link
Member

rfratto commented Nov 16, 2023

Mmm interesting, gotta explore that option. The only thing, is say in the example above I have the loki.source.awsfirehose -> loki.write.cell. How does the controller determine the components shutdown order? Is it all at once, or it follows the DAG?

Currently, the controller will terminate all the components at the same time:

// Close stops the Scheduler and returns after all running goroutines have
// exited.
func (s *Scheduler) Close() error {
s.cancel()
s.running.Wait()
return nil
}

@thepalbi
Copy link
Contributor Author

Closing in favour of #5804

@thepalbi thepalbi closed this Nov 17, 2023
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 21, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants