Add workers autoscaling through KEDA #33
Comments
This sounds like a feature request to the SDK?
Just casting some light on this: we use KEDA with Postgres queries right now to scale our workers when there are more than a predefined number of tasks that our base node pool can handle. At the moment we run a base of 2 workers, and when there are more than 10 concurrent Workflows we spin up more servers, at 5 concurrent Workflows per node. It's working well, but we're always a little worried that the schema will change and break everything. It would be great to abstract this out into a native integration with KEDA.
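The ad-hoc rule described above (a base of 2 workers, plus one extra node per 5 concurrent Workflows beyond 10) can be sketched as a pure function. This is a hedged illustration of the scaling math only, not the actual Postgres-query-based KEDA configuration:

```python
import math

# Constants taken from the comment above; tune for your own workload.
BASE_NODES = 2          # baseline worker pool
BASE_CAPACITY = 10      # concurrent Workflows the base pool can absorb
WORKFLOWS_PER_NODE = 5  # extra Workflows each additional node handles

def desired_nodes(concurrent_workflows: int) -> int:
    """Return the node count for a given number of concurrent Workflows."""
    if concurrent_workflows <= BASE_CAPACITY:
        return BASE_NODES
    overflow = concurrent_workflows - BASE_CAPACITY
    return BASE_NODES + math.ceil(overflow / WORKFLOWS_PER_NODE)
```

In a KEDA setup, the concurrent-Workflow count would come from the trigger's query, and KEDA's own threshold arithmetic would replace this function.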
I have this working within KEDA using the temporalClient's ListOpenWorkflow method, and would love to chat with someone from the Temporal maintainers about the best way to get this into the various repos and whether things make sense.
PR for the same: https://github.com/kedacore/keda/pull/4863/files
+1 would be a great feature to have. It's common to have specific workers (services) listening on specific task queues. Exposing the size of a given task queue would be a very precise autoscaling metric for those workers.
My organization would like to see this, too.
+1 This would be great to have.
My studio has just started testing Temporal and it would be great to have this feature.
There were two previous attempts to implement a Temporal scaler for Keda, but both got closed. Ref. kedacore/keda#4721 and kedacore/keda#4863. @cretz, since you were directly involved in kedacore/keda#4863, do you think the new Task Queue Statistics added in v1.25 would be the right way to implement a Keda scaler for Temporal? If so, any thoughts on whether the Temporal team might consider implementing this, or whether support would have to come from the community?
To provide some more context: in particular, we are interested in using Keda's ability to scale Temporal workers down to zero if there are no pending tasks on the worker's task queue(s) for some period of time. This is not possible using the (previously?) recommended way of scaling Temporal workers based on the schedule-to-start latency metric, since that metric is exported by the workers themselves.
Absolutely, and this is on our roadmap to build. We demo'd this at our Replay conference. Stay tuned for more info.
@cretz Any further information you can share on this, i.e. possibly a rough timeline? This would help us decide whether we can wait for an official version or whether we need to build something in-house for our own use first.
You could autoscale using https://keda.sh/docs/2.15/scalers/prometheus/ @jhecking
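For reference, a Prometheus-based ScaledObject for this kind of 1->n worker scaling might look roughly like the following. All names, the metric, the query, and the threshold are illustrative assumptions, not a tested configuration; the exact SDK-exported metric name depends on your SDK and metrics setup:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: temporal-worker-scaler       # hypothetical name
spec:
  scaleTargetRef:
    name: temporal-worker            # assumed worker Deployment
  minReplicaCount: 1                 # cannot be 0: the metric is exported by the workers themselves
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        threshold: "0.5"
        # Assumed name for the SDK-exported schedule-to-start latency metric
        query: avg(temporal_workflow_task_schedule_to_start_latency_seconds)
```

Because the metric comes from the workers, `minReplicaCount` has to stay at 1, which is exactly the limitation that motivates a server-side metric.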
Thanks, @febinct. That's what we currently do for 1->n scaling. What we are looking for is a solution that can handle 0->1 / 1->0 scaling as well.
@jhecking Scaling from zero to one (or vice versa) isn't currently feasible with Temporal because it relies on workers continuously polling task queues: if at least one worker isn't running, we're unable to submit jobs and execute them, since the metrics are exported from the SDK. This setup doesn't support Lambda- or KEDA-scaled-job-style use cases at the moment. I discussed this with Maxim (CTO of Temporal) during the last Temporal meetup, and he mentioned that it's on the roadmap, though I didn't get an ETA :) Our team raised this PR, and we're actively exploring Task Queue Statistics-based autoscaling; we plan to raise a PR in KEDA within the next 2-4 weeks. As a hack to avoid running larger machines and to save cost, we ran a very small pod; from that Temporal workflow we triggered an SQS event, used SQS to trigger https://keda.sh/docs/1.4/concepts/scaling-jobs/, and then waited for a signal to come back from the SQS processor executing the expensive job.
@febinct I don't think this is correct with regard to the new task queue statistics. I was able to spin up a v1.25 dev server using the Temporal CLI. Then I used the hello-world examples to start several workflows on task queues that had no active workers, as well as some additional workflows that ran activities on task queues that had no active workers. When I use the Temporal CLI to query the DescribeTaskQueue API, I get the expected stats, i.e.:
So I think it should be possible to implement a Keda scaler that queries the DescribeTaskQueue API and uses the ApproximateBacklogCount metric to make 0->1 and 1->0 scaling decisions. @cretz please correct me if I got any of this wrong.
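A scaler along those lines could map ApproximateBacklogCount to a replica count with logic like the following. This is a sketch only; the target-backlog-per-worker and replica cap are arbitrary assumptions, and a real KEDA external scaler would report the metric and let KEDA's HPA math derive the replica count:

```python
import math

def desired_replicas(backlog: int,
                     target_per_worker: int = 5,
                     max_replicas: int = 20) -> int:
    """Map a task queue's ApproximateBacklogCount to a worker replica count.

    An empty backlog scales to zero; any non-empty backlog guarantees at
    least one worker, capped at max_replicas.
    """
    if backlog <= 0:
        return 0
    return min(max_replicas, max(1, math.ceil(backlog / target_per_worker)))
```

The key property for 0->1 / 1->0 scaling is that the backlog is reported by the server, so it is available even when no worker is polling the queue.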
I am afraid there is no specific timeline at this time.
Yes, unlike schedule-to-start latency (which is worker-side, so it requires a running worker), the backlog count can be used for scale-to-zero use cases and was one of the primary motivators behind this API. Feel free to come discuss scaling in our community Slack or our community forums.
Great! Thanks for the confirmation.
Hi all 👋🏽 I'm Nikitha, a PM here at Temporal, and I wanted to acknowledge all the great feedback and discussion in this thread. I'm excited to share that we do have imminent plans to build and contribute a KEDA scaler upstream (yes, scale to zero will work, as @cretz confirmed). I don't have an ETA for you just yet, but it's actively in the works and we will share more soon!
PR for the same: https://github.com/kedacore/keda/pull/6191/files. Please review, @cretz @jhecking.
Thank you! Will take a look.
@febinct - per @atihkin above, "we do have imminent plans to build and contribute a KEDA scaler upstream", but merging an externally created scaler first would seem to preempt us from building this ourselves. Our algorithm may differ slightly from the one in the PR (for instance, combined task queue stats are probably not the way to go unless opted in; build-ID-specific stats may be better). I will get with the engineers on the scaling project and review the submission. We should hold off on merging this PR until Temporal takes a look and/or submits a similar alternative.
If it makes sense, please leave review comments. Happy to collaborate as an extended team on the growth of the Temporal community.
I, for one, am very grateful to @febinct and team for having put their own implementation out there. 🙏 I have reviewed the PR and I think it will meet our needs. We are planning to go ahead and run some tests with it to get a feel for how well the 0->1 / 1->0 scaling works for our workloads, though we would probably wait for the Temporal team's official implementation before using it in prod.
To update folks on this thread: the Temporal team has taken a look and we've decided to go ahead with @febinct's proposal (thank you for your contribution, and also @jhecking for your review!). @robholland has left a few comments in https://github.com/kedacore/keda/pull/6191/files but we do hope to be able to merge this PR soon.
All the credit goes to https://github.com/Prajithp from our team. We are actively working on resolving the review comments and getting the PR merged. Will close soon. Thanks @atihkin
Thank you @Prajithp and @febinct for pushing this forward! 🙏 But I do want to point out that from our perspective the new Keda Temporal scaler is not yet production ready, as we are still faced with the issue of Keda using up 100% of the allocated CPU as soon as we enable the new scaler. I'm continuing to debug the issue but have yet to find a solution.
We are also checking the same, @jhecking. As of now we suspect the scaler is creating new gRPC connections too frequently; the MinConnectTimeout of 5 seconds might be causing rapid reconnections if the connection does not succeed within that time frame, which could be another potential cause. Can you also try bypassing Consul temporarily to see if the CPU load decreases? We don't have a Consul setup.
In our case, the Temporal workers are often running in a different cluster from the Temporal server, and Consul is required for the workers and Keda to connect to the Temporal server. So far, none of our Temporal workers (using the TypeScript, Java and Python SDKs) have shown any similar issues. But I'll try to replicate this in a different cluster where Consul is not required.
Any update?
@raanand-dig please follow progress in https://github.com/kedacore/keda/pull/6191/files