Proposal to integrate AKS's Carbon-Aware-Scaler into KEDA #4463
Worth noting that I work for Microsoft, which could come across as biased, but I'm wearing my KEDA hat here.

For reference, here is an overview from AKS' operator and how it works:

I'm very excited to see this new operator that productizes the POC we did in collaboration with AKS & TAG Environmental Sustainability along with @husky-parul, @rootfs, @yelghali & @zroubalik!

One of the main take-aways was that fetching the data is the main problem and we need to find a unified way of gathering it. A second learning was that this actually cannot be a new scaler, but needs to be something that influences the min/max replicas that a workload is allowed to scale to, while the scalers still define when scaling actions are required.

So I think the next step for KEDA as a runtime is to see how we can formalize the second learning, where cluster operators or platform teams can use a new CRD that basically defines how far a ScaledObject/ScaledJob can scale out. This is not only valuable for this use case; maybe an app dev wants to scale to 1000 replicas while the cluster operator says "no!" and wants to overrule that.

With this new model, however, we need to have a flexible way of influencing this. While the AKS operator has this as a fixed CRD that fits its needs, we might want to be more generic and open to future new "providers" or scenarios. For example, there was another reported scenario where people want to override it.

With that I don't mean that we need to shove everything into one generic CRD, because end-user experience is essential here. We might need to align with a model similar to the Ingress/Gateway API, where providers can register themselves with KEDA and every provider has separate CRDs to define the criteria for them. (It depends on how we want to approach things.) As an example, this is tailored to the needs of carbon scaling, and SMEs know exactly what to provide:

```yaml
apiVersion: carbonaware.kubernetes.azure.com/v1alpha1
kind: CarbonAwareKedaScaler
metadata:
  name: carbon-aware-word-processor-scaler
spec:
  kedaTarget: scaledobjects.keda.sh         # can be used for ScaledObjects & ScaledJobs
  kedaTargetRef:
    name: word-processor-scaler
    namespace: default
  carbonIntensityForecastDataSource:        # carbon intensity forecast data source
    mockCarbonForecast: false               # [OPTIONAL] use mock carbon forecast data
    localConfigMap:                         # [OPTIONAL] use configmap for carbon forecast data
      name: carbon-intensity
      namespace: kube-system
      key: data
  maxReplicasByCarbonIntensity:             # array of carbon intensity values in ascending order; each threshold value represents the upper limit and the previous entry represents the lower limit
    - carbonIntensityThreshold: 437         # when carbon intensity is 437 or below
      maxReplicas: 110                      # do more
    - carbonIntensityThreshold: 504         # when carbon intensity is >437 and <=504
      maxReplicas: 60
    - carbonIntensityThreshold: 571         # when carbon intensity is >504 and <=571 (and beyond)
      maxReplicas: 10                       # do less
  ecoModeOff:                               # [OPTIONAL] settings to override carbon awareness; can override based on high intensity duration or schedules
    maxReplicas: 100                        # when carbon awareness is disabled, use this value
    carbonIntensityDuration:                # [OPTIONAL] disable carbon awareness when carbon intensity is high for this length of time
      carbonIntensityThreshold: 555         # when carbon intensity is equal to or above this value, consider it high
      overrideEcoAfterDurationInMins: 45    # if carbon intensity is high for this many minutes, disable eco mode
    customSchedule:                         # [OPTIONAL] disable carbon awareness during specified time periods
      - startTime: "2023-04-28T16:45:00Z"   # start time in UTC
        endTime: "2023-04-28T17:00:59Z"     # end time in UTC
    recurringSchedule:                      # [OPTIONAL] disable carbon awareness during specified recurring time periods
      - "* 23 * * 1-5"                      # disable every weekday from 11pm to 12am UTC
```

If we have to move this to a generic CRD we might lose that focus, or that CRD might become gigantic. Hence why a provider approach might be ideal, where the top-level CRD has a pointer to a more specific CRD that defines the details as above.

Another aspect to keep in mind is scoping. Let's say we introduce a new CRD for this; what would the scope of it be? A single ScaledObject, based on label filtering, based on a namespace, ...?
Be wary of the quality of service, as mentioned here: #3467 (comment).
Thanks @clemlesne. Indeed, reducing the actual replicas would have more impact. However, we found that having the operator update replicas directly would be a bit dangerous, as it would conflict with KEDA/HPA, so updating maxReplicas is less intrusive and, in practice, prevents bursting or using more compute during high carbon intensity times. It's also important to note that this operator is meant to be used with low priority and time flexible workloads that support interruptions: https://github.com/Azure/carbon-aware-keda-operator/blob/main/README.md
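For context, the knob being turned here is the standard maxReplicaCount field on the ScaledObject spec. A minimal sketch, where the Deployment name and the Kafka trigger are purely illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: word-processor-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: word-processor       # illustrative Deployment name
  minReplicaCount: 0
  maxReplicaCount: 110         # the operator raises/lowers this ceiling based on
                               # the carbon forecast; KEDA/HPA still decide the
                               # actual replica count within it
  triggers:
    - type: kafka              # any KEDA scaler works here; Kafka is just an example
      metadata:
        bootstrapServers: kafka.svc:9092
        consumerGroup: word-processor
        topic: words
        lagThreshold: "10"
```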
WDYT @kedacore/keda-maintainers?
Hard-limiting the number of replicas adds a burden to SRE teams and goes against serverless principles.
I didn't understand everything, as you mentioned multiple topics. Here are my thoughts:

```mermaid
---
title: KEDA with carbon limiter and relative-capping
---
flowchart LR
    trigger1["Trigger #1"]
    trigger2["Trigger #2"]
    carbonLimiter["Carbon limiter"]
    scalingRule["Scaling rule"]
    kubeHpa["k8s HPA"]
    trigger1 --> scalingRule
    trigger2 --> scalingRule
    carbonLimiter --> scalingRule
    scalingRule --> kubeHpa
```
In that case,
I understand. It could solve the hard-capping problem. This complexity, I think, is not necessary with relative-capping.
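To make the relative-capping idea concrete, here is a hypothetical variant of the maxReplicasByCarbonIntensity block using percentages instead of absolute replica counts; the field names below are invented and do not exist in the operator today:

```yaml
# Hypothetical: cap scaling *relative* to the ScaledObject's own maxReplicaCount
# instead of hard-coding absolute replica counts per threshold.
maxReplicasPercentageByCarbonIntensity:   # invented field, for illustration
  - carbonIntensityThreshold: 437
    maxReplicasPercent: 100               # low intensity: allow the full ceiling
  - carbonIntensityThreshold: 504
    maxReplicasPercent: 50                # medium intensity: halve the ceiling
  - carbonIntensityThreshold: 571
    maxReplicasPercent: 10                # high intensity: 10% of the ceiling
```

With a relative cap, the team that owns the ScaledObject keeps control of the absolute ceiling, and the carbon limiter only shrinks it proportionally.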
Kind reminder @kedacore/keda-maintainers
I'm not sure I like the approach mentioned here though:
I still believe that adding it to ScaledObject/ScaledJob is not the right place, assuming that is what "Scaling rule" represents. Otherwise, I think this should be a first-class feature in KEDA :)
A few things that I like from this discussion:
I also agree that this should not be baked into the scaling rule logic itself, but rather raise or lower the ceiling on how far a workload can scale (with min/max replicas)
Totally understand the position here and the scenarios mentioned. It is important to re-affirm that carbon-aware scaling should not be applied to time- and/or demand-sensitive workloads. So if the workload needs to accommodate high demand within a reasonable amount of time, IMO it should not be a "carbon-aware" workload.
I like the idea of a relative reduction in replica count using percentages, but that may require an additional lookup on the actual ScaledObject/ScaledJob to see what the max value is, in order to know how far a workload would scale out or down. Do you think supporting both options (% and actual replica counts) is warranted here?
Splitting into multiple CRDs sounds good to me, but what is your vision of a provider here? Could this be a provider of carbon-intensity data?
All of the above :-) Just kidding; my thought is to keep it narrowly scoped to a single ScaledObject in a single namespace, to reduce the chances that it reduces the scalability of a workload that was not intended to be "carbon aware".
Exactly. I think we can support both %-wise and number-wise; it's up to the end-user to choose which model is best for them.
Yes, this is what I am thinking indeed.
I think this is another case of providing options, where you can make the argument that you want to scope to a single workload, or use labels for large-scale scenarios.
I am very sorry for the delay on this. I would like to restart the conversation. I am 100% for integrating this into the KEDA project; it shouldn't be part of the core though, it's a nice extension. The only thing that we need to solve, from my POV, is maintainership.
+1
Awesome! I've been meaning to jump back into this myself. As for maintainership... with a little help, I'd be happy to do that!
@tomkerkhove do you want to drive this from the KEDA perspective?
Will check with Paul in a couple of weeks
I have been discussing this a bit with @zroubalik and @JorTurFer and summarizing things here:
I'll circle back with @qpetraroia and discuss next steps
Introduction
Today, Microsoft announced an open-source way to scale your workloads based on carbon intensity with KEDA and the Green Software Foundation's SDK. This was built on top of earlier learnings and POCs with the KEDA team and other open-source contributors. Below you can find the open-source repository:

https://github.com/Azure/carbon-aware-keda-operator
The above repository provides a Kubernetes operator that aims to reduce carbon emissions by helping KEDA scale Kubernetes workloads based on carbon intensity. Carbon intensity is a measure of how much carbon dioxide is emitted per unit of energy consumed. By scaling workloads according to the carbon intensity of the region or grid where they run, we can optimize the carbon efficiency and environmental impact of our applications.
This operator can use carbon intensity data from third-party sources such as WattTime, Electricity Maps, or any other provider, to dynamically adjust the scaling behavior of KEDA. The operator does not require any application or workload code changes, and it works with any KEDA scaler.
With the sustainability conversation now started in the Kubernetes space, we are looking forward to partnering with the KEDA team to officially bring our code into KEDA and build out the official carbon-aware scaler.
Proposal
Our proposal is to work with the KEDA team to build out an official KEDA carbon-aware scaler and bring it into the open-source KEDA project. This could be done either by donating our existing repository to KEDA and building on top of it, or by starting a new scaler.
Use Cases
Use cases for the operator include low-priority and time-flexible workloads that support interruptions in dev/test environments. Some examples of these are non-critical data backups, batch processing jobs, data analytics processing, and ML training jobs.
Scaler Source
Carbon intensity data via the GSF SDK or a cloud provider.
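For reference, the existing operator can read the forecast from a local ConfigMap (the carbon-intensity ConfigMap referenced in the CRD example above). A minimal sketch of what that could look like; the JSON shape below is an assumed illustration, and the operator's actual serialization may differ:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: carbon-intensity
  namespace: kube-system
data:
  data: |    # key referenced by localConfigMap.key in the CarbonAwareKedaScaler
    [
      {"timestamp": "2023-04-28T16:00:00Z", "value": 412.5},
      {"timestamp": "2023-04-28T16:05:00Z", "value": 441.0},
      {"timestamp": "2023-04-28T16:10:00Z", "value": 498.2}
    ]
```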
Scaling Mechanics
Scale based on carbon intensity, using the GSF SDK or a cloud provider that supplies carbon intensity data. Microsoft has provided an open-source example of this here.
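As a worked example using the thresholds from the CRD above: if the forecast carbon intensity is 480, it falls in the >437 and <=504 band, so the operator would set the ScaledObject's maxReplicaCount to 60; KEDA's triggers still decide how many replicas (up to that 60) are actually running.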
Authentication Source
Through the GSF or a cloud provider.
And special thanks to @yelghali, @pauldotyu, @tomkerkhove, @helayoty and @Fei-Guo! Appreciate all your hard work :)