Allowing rule backup for rules API HA #5782
Conversation
pkg/ruler/manager.go
Outdated
@@ -85,6 +88,7 @@ func NewDefaultMultiTenantManager(cfg Config, managerFactory ManagerFactory, eva
		mapper:             newMapper(cfg.RulePath, logger),
		userManagers:       map[string]RulesManager{},
		userManagerMetrics: userManagerMetrics,
		rulesBackupManager: newRulesBackupManager(cfg, logger, reg),
Can we skip initializing it if it is not enabled? That would also let us skip initializing the metrics.
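A minimal sketch of what that could look like (the RulesBackupEnabled flag and the surrounding types are stand-ins, not the actual Cortex code): only construct the backup manager, and therefore register its metrics, when the feature is turned on.

```go
package main

import "fmt"

// Stand-in types; the real ones live in pkg/ruler.
type Config struct {
	RulesBackupEnabled bool // assumed flag name
}

type rulesBackupManager struct{}

func newRulesBackupManager(cfg Config) *rulesBackupManager {
	// In the real code this is where the backup metrics would be registered.
	return &rulesBackupManager{}
}

func newMultiTenantManager(cfg Config) *rulesBackupManager {
	// Only construct (and register metrics for) the backup manager when enabled.
	var backup *rulesBackupManager
	if cfg.RulesBackupEnabled {
		backup = newRulesBackupManager(cfg)
	}
	return backup
}

func main() {
	fmt.Println(newMultiTenantManager(Config{RulesBackupEnabled: false}) == nil) // true
	fmt.Println(newMultiTenantManager(Config{RulesBackupEnabled: true}) == nil)  // false
}
```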
Namespace: "cortex",
Name:      "ruler_backup_rule_group_rules",
Help:      "The number of backed up rules",
}, []string{"user", "rule_group"}),
Why do we need to have rule_group as a label? What's the use case for knowing the count per rule group?
I understand the backup either succeeds or fails for all rules of a tenant. Isn't a boolean value good enough?
I planned for this metric to be the same as cortex_prometheus_rule_group_rules. It would give us an understanding of how much data each ruler holds as backup. It can also be used to detect discrepancies between what was evaluated and what is in the backup, which can happen since rule syncs happen at different times for each ruler. I don't know when this information would be super useful, but I see some potential. If you think the cost of using an int over a bool is not worth it, I will change it :D
> It can also be used to detect discrepancies between what was evaluated and what is in the backup, which can happen since rule syncs happen at different times for each ruler. I don't know when this information would be super useful, but I see some potential.

Yeah, I understand we have such a discrepancy, but I have no idea how we would really use this count. Does it matter if we know the count is different? The sync delay is by design.
I prefer to keep it simple and use a boolean, or count the total number of rules for a user, by removing the rule_group label.
What is the conclusion? It seems that the rule_group label is still here.
I think I misunderstood it. I kept the rule_group label but used a boolean. I will update it to remove the rule_group label and use the total count of rules instead.
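For reference, a rough sketch of the agreed-on shape, keeping the metric name from the diff above but treating the exact wiring as an assumption: a gauge keyed only by tenant, holding the total number of backed-up rules.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Gauge keyed only by tenant ("user"); the rule_group label is dropped.
	backedUpRules := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "cortex",
		Name:      "ruler_backup_rule_group_rules",
		Help:      "The number of backed up rules",
	}, []string{"user"})
	reg.MustRegister(backedUpRules)

	// For each tenant, report the total rule count across all backed-up groups.
	ruleCounts := map[string]int{"tenant-1": 42, "tenant-2": 7}
	for user, count := range ruleCounts {
		backedUpRules.WithLabelValues(user).Set(float64(count))
	}
}
```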
pkg/ruler/ruler.go
Outdated
@@ -476,6 +491,29 @@ func instanceOwnsRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRu
	return ownsRuleGroup, nil
}

func instanceBacksUpRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string) (bool, error) {
This function seems to do the same as instanceOwnsRuleGroup except for using a different condition.
One is ownsRuleGroup := rlrs.Instances[0].Addr == instanceAddr and this one is

var backupRuleGroup bool
// Only the second up to the last replica is used as a backup
for i := 1; i < len(rlrs.Instances); i++ {
	if rlrs.Instances[i].Addr == instanceAddr {
		backupRuleGroup = true
		break
	}
}

Can we just make this condition logic a parameter and remove the duplicated code?
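One possible shape for that refactor (the types below are simplified stand-ins for the ring replication set, not the real Cortex API): keep a single lookup path and pass in the predicate that decides whether this instance owns the group or merely backs it up.

```go
package main

import "fmt"

// instanceDesc is a simplified stand-in for ring.InstanceDesc.
type instanceDesc struct{ Addr string }

// ownsRuleGroup: only the first replica in the set evaluates the group.
func ownsRuleGroup(instances []instanceDesc, addr string) bool {
	return len(instances) > 0 && instances[0].Addr == addr
}

// backsUpRuleGroup: every replica after the first only keeps a backup copy.
func backsUpRuleGroup(instances []instanceDesc, addr string) bool {
	for i := 1; i < len(instances); i++ {
		if instances[i].Addr == addr {
			return true
		}
	}
	return false
}

// instanceMatchesRuleGroup would wrap the shared shard/ring lookup and then
// apply whichever predicate the caller passes in, so instanceOwnsRuleGroup
// and instanceBacksUpRuleGroup collapse into one code path.
func instanceMatchesRuleGroup(instances []instanceDesc, addr string,
	match func([]instanceDesc, string) bool) bool {
	return match(instances, addr)
}

func main() {
	replicas := []instanceDesc{{Addr: "ruler-1"}, {Addr: "ruler-2"}, {Addr: "ruler-3"}}
	fmt.Println(instanceMatchesRuleGroup(replicas, "ruler-1", ownsRuleGroup))    // true
	fmt.Println(instanceMatchesRuleGroup(replicas, "ruler-2", backsUpRuleGroup)) // true
	fmt.Println(instanceMatchesRuleGroup(replicas, "ruler-1", backsUpRuleGroup)) // false
}
```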
//
// Reason why this function is not a method on Ruler is to make sure we don't accidentally use r.ring,
// but only ring passed as parameter.
func filterBackupRuleGroups(userID string, ruleGroups []*rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, ring ring.ReadRing, instanceAddr string, log log.Logger, ringCheckErrors prometheus.Counter) []*rulespb.RuleGroupDesc {
This function seems duplicated, too. Can we generalize the two functions?
Is the log the main concern for not reusing the previous function?
My reasoning for this was that when we implement Rule Evaluation HA, that would also give us Rules API HA, so we wouldn't need filterBackupRuleGroups anymore and could delete it. Deleting an entire unused method is simpler than adjusting existing methods and removing the unused blocks. But thinking about it now, the API HA code and the Evaluation HA code can co-exist if we allow Cortex to be operated with API HA enabled and Evaluation HA disabled once Evaluation HA is implemented and available. I will make the adjustment.
Thanks. It is not a big issue. I am fine with keeping it as it is, since generalizing the log is probably not easy. But I hope we can generalize instanceBacksUpRuleGroup.
pkg/ruler/ruler.go
Outdated
merged []*GroupStateDesc
mtx    sync.Mutex
merged []*GroupStateDesc
errs   []error
Do we need errs to be a slice of errors? It looks like we only use len(errs) and never the errors themselves.
In this case, is keeping track of the error count good enough?
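A tiny illustration of the suggestion (not the PR code; a plain int guarded by the existing mutex would work just as well as the atomic counter used here): when only the number of failures is ever read, a counter is enough and avoids retaining the error values.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var failed atomic.Int64 // replaces errs []error when only len(errs) matters
	var wg sync.WaitGroup

	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i%2 == 0 { // pretend the even-numbered ruler requests failed
				failed.Add(1)
			}
		}(i)
	}
	wg.Wait()

	fmt.Println("failed ruler requests:", failed.Load())
}
```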
pkg/ruler/ruler.go
Outdated
)
failedZones := make(map[string]interface{})
I would recommend using make(map[string]struct{}), which is more commonly used in Go to represent a set.
IIUC this is what we need.
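For reference, the idiom being suggested: struct{} occupies zero bytes, so map[string]struct{} is the conventional Go representation of a set where only key membership matters.

```go
package main

import "fmt"

func main() {
	// A set of zone names: values carry no data, only key membership matters.
	failedZones := make(map[string]struct{})
	failedZones["zone-a"] = struct{}{}

	if _, ok := failedZones["zone-a"]; ok {
		fmt.Println("zone-a had failures")
	}
	fmt.Println("number of failed zones:", len(failedZones))
}
```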
pkg/ruler/ruler.go
Outdated
Name: "cortex_ruler_get_rules_failure_total",
Help: "The total number of failed rules request sent to rulers.",
}, []string{"ruler"}),
Should we mention it is for failed requests of getShardedRules only?
# with default state (state before any evaluation) and send this copy in list
# API requests as backup in case the ruler who owns the rule fails to send its
# rules. This allows the rules API to handle ruler outage by returning rules
# with default state. Ring replication-factor needs to be set to 3 or more for
Why does it need to be 3? I understand that for Ruler API HA, 2 should be good enough. You don't have to require a quorum (2 > 1) since you are merging the responses anyway.
pkg/ruler/rule_backup_manager.go
Outdated
func (r *rulesBackupManager) setRuleGroups(_ context.Context, ruleGroups map[string]rulespb.RuleGroupList) {
	backupRuleGroups := make(map[string][]*promRules.Group)
	for user, groups := range ruleGroups {
		promGroups, err := r.ruleGroupListToPromGroups(user, groups)
I am trying to understand why we need to convert rulespb.RuleGroupList to the Prometheus rules type and store them in memory. Why can't we store rulespb.RuleGroupList directly?
At the end of the day, when we call getShardedRules and getLocalRules, the final response is the rulespb type defined by us. If we store the Prometheus type here, we are basically converting twice, both at read and at write time.
And I think promManager.LoadGroups is very expensive since it needs to run m.opts.RuleDependencyController.AnalyseRules(rules) to do the topology sort every time.
https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L329
I thought we have to convert the rulespb.RuleGroupList to promRules.Group because:
- I am hoping to reuse the Prometheus code that sets the default values of fields like the group's Interval field. We could instead make the code convert rulespb.RuleGroupList to GroupStateDesc, which is what getLocalRules returns, and just make sure to copy the behavior from https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L280-L343
- By converting it to groups and appending it to the list returned by the manager, we can reuse the existing code for supporting filters: https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L778-L798 . We could write a different code path that filters GroupStateDesc if we opt to store GroupStateDesc or rulespb.RuleGroupList.

I did not realize that m.opts.RuleDependencyController.AnalyseRules(rules) can be expensive. Since this is only relevant for evaluation, we could get around it by passing in a noop RuleDependencyController.
I think converting to the Prometheus rules type is the simpler approach, but I can change it if you think otherwise.
> By converting it to groups and appending it to the list returned by the manager, we can reuse the existing code for supporting filters: https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L778-L798

I don't see how this code is for Prometheus rules only. It looks very general, no?

> I am hoping to reuse the Prometheus code that sets the default values of fields like the group's Interval field. We could make the code convert rulespb.RuleGroupList to GroupStateDesc, which is what getLocalRules returns, and just make sure to copy the behavior from https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L280-L343

The reason why I don't want to reuse this code is that I don't want to initialize a new rules manager every time this code path is called. It is expensive since it initializes the rules metrics every time without ever using them.
Since all we need is to use the rules we cached in memory, can we just use the loader directly without initializing the manager?
The way I can think of is to copy the code out and remove the manager dependency. However, if we do this, why don't we just store rulespb.RuleGroupList? That way we can even get rid of the loader.

> I did not realize that m.opts.RuleDependencyController.AnalyseRules(rules) can be expensive. Since this is only relevant for evaluation, we could get around it by passing in a noop RuleDependencyController.

We can pass a noop controller.
> The reason why I don't want to reuse this code is that I don't want to initialize a new rules manager every time this code path is called. It is expensive since it initializes the rules metrics every time without ever using them.

I am OK if we pre-initialize the manager metrics and pass them in every time, but tbh it is still not ideal. NewGroup and LoadGroups do more things we don't need at all. We just need to convert the rules type, right?
I see, I agree. It is more optimal to store rulespb.RuleGroupList and do the conversion to GroupStateDesc upon request. I will make the change. Thank you 🙇
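A minimal sketch of the agreed direction (the proto types are stubbed here; the real rulespb.RuleGroupList and GroupStateDesc live in pkg/ruler): keep the raw rule group protos in memory at write time and leave the conversion to the API read path.

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-ins for rulespb.RuleGroupDesc / rulespb.RuleGroupList.
type RuleGroupDesc struct {
	Namespace string
	Name      string
}
type RuleGroupList []*RuleGroupDesc

// rulesBackupManager keeps the protos as-is; no conversion to Prometheus
// rule groups happens at write time.
type rulesBackupManager struct {
	mtx              sync.RWMutex
	backupRuleGroups map[string]RuleGroupList // keyed by tenant ID
}

func (r *rulesBackupManager) setRuleGroups(ruleGroups map[string]RuleGroupList) {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	r.backupRuleGroups = ruleGroups
}

// getRuleGroups returns the cached protos; converting them to GroupStateDesc
// (with default rule state) is left to the API read path, per request.
func (r *rulesBackupManager) getRuleGroups(userID string) RuleGroupList {
	r.mtx.RLock()
	defer r.mtx.RUnlock()
	return r.backupRuleGroups[userID]
}

func main() {
	m := &rulesBackupManager{backupRuleGroups: map[string]RuleGroupList{}}
	m.setRuleGroups(map[string]RuleGroupList{
		"tenant-1": {{Namespace: "ns", Name: "group-1"}},
	})
	fmt.Println(len(m.getRuleGroups("tenant-1"))) // 1
}
```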
Thanks for addressing my feedback, @emanlodovice. Overall I think it is much better and close to being merged.
I added some more comments. PTAL.
pkg/ring/ring.go
Outdated
// does not require quorum so only 1 replica is needed to complete the operation. For MaxUnavailableZones, it is
// not automatically reduced when there are unhealthy instances in a zone because healthy instances in the zone
// are still returned, but the information about zones with unhealthy instances is returned.
GetReplicationSetForOperationWithNoQuorum(op Operation) (ReplicationSet, map[string]struct{}, error)
Nit about the method name: should we call it WithoutQuorum or WithNoQuorum?

> For MaxUnavailableZones, it is not automatically reduced when there are unhealthy instances in a zone because healthy instances in the zone are still returned, but the information about zones with unhealthy instances is returned.

It feels like this method differs from GetReplicationSetForOperation by more than just not requiring quorum; it also relaxes the constraint that all instances in a zone must be healthy.
Why do we make this change? I am afraid that in the future we will have a use case where we don't need quorum but still want to exclude zones with unhealthy instances. Then we would have to change the interface again.
I want to make sure this is not only for the Ruler API backup and can be used in the future.
To me, it makes sense for the Ruler API backup use case to return all instances, even those in an unhealthy zone. However, I don't see another use case for it right now.
Do we think we can remove this method from the interface and move it into the Ruler ring code only, since it might only make sense for the Ruler?
Yes, I agree. My problem was that there is no ring operation that allows me to get both healthy and unhealthy instances, so I could write code in the ruler to decide how many errors we can tolerate. There is GetInstanceDescsForOperation, but it only returns healthy instances. Even if we set the op to consider all statuses as healthy, it still doesn't guarantee that we get all instances in the ring, because storageLastUpdate := r.KVClient.LastUpdateTime(r.key) is also considered when checking whether an instance is healthy. Do you think we can add a new method to the ring that could look like GetAllInstanceDescs(op Operation) ([]InstanceDesc, []InstanceDesc, err) and returns both the healthy and unhealthy instances in the ring?
> Do you think we can add a new method to the ring that could look like GetAllInstanceDescs(op Operation) ([]InstanceDesc, []InstanceDesc, err) and returns both the healthy and unhealthy instances in the ring?

I think we can do that, and I prefer this solution. WDYT?
We can change GetInstanceDescsForOperation as well. WDYT @alanprot
Seems like GetInstanceDescsForOperation is only used right now in the store-gateway sync. This doesn't run very often, so I guess adding a new map to the method would be fine.
Hmm... I think I prefer a new method as well (GetAllInstanceDescs(op Operation) ([]InstanceDesc, []InstanceDesc, err)).
Updated @yeya24 , PTAL, thank you
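For illustration, a sketch of what such a split could look like (InstanceDesc and the health check are simplified stand-ins; the real implementation in pkg/ring also factors in the KV store's last update time): return every ring member, partitioned into healthy and unhealthy, and let the ruler decide how many failures it tolerates.

```go
package main

import (
	"fmt"
	"time"
)

// InstanceDesc is a simplified stand-in for ring.InstanceDesc.
type InstanceDesc struct {
	Addr      string
	Heartbeat time.Time
}

// getAllInstanceDescs partitions all ring members into healthy and unhealthy
// based on heartbeat age; the caller gets both lists instead of only the
// healthy subset that GetInstanceDescsForOperation returns.
func getAllInstanceDescs(instances []InstanceDesc, heartbeatTimeout time.Duration) (healthy, unhealthy []InstanceDesc) {
	now := time.Now()
	for _, d := range instances {
		if now.Sub(d.Heartbeat) <= heartbeatTimeout {
			healthy = append(healthy, d)
		} else {
			unhealthy = append(unhealthy, d)
		}
	}
	return healthy, unhealthy
}

func main() {
	instances := []InstanceDesc{
		{Addr: "ruler-1", Heartbeat: time.Now()},
		{Addr: "ruler-2", Heartbeat: time.Now().Add(-5 * time.Minute)},
	}
	healthy, unhealthy := getAllInstanceDescs(instances, time.Minute)
	fmt.Printf("%d healthy, %d unhealthy\n", len(healthy), len(unhealthy))
}
```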
}

backupGroups := r.manager.GetBackupRules(userID)
for _, group := range backupGroups {
Can we move some code to a separate function? getLocalRules is like 200+ lines of code.
- Remove duplicate code and use better data structures
- Make backup rule_group label match the prometheus rule_group label
- Skip initialization when feature is not enabled
Signed-off-by: Emmanuel Lodovice <[email protected]>
…in getShardedRules Signed-off-by: Emmanuel Lodovice <[email protected]>
in ruler to get Replicaset without requiring quorum Signed-off-by: Emmanuel Lodovice <[email protected]>
I understood the general idea and LGTM. I did not review every single line in detail, but since it already has 3 approvals I am good with it.
What this PR does:
This PR introduces support for rule group replication in the ruler for API HA. The idea is that the ruler can be configured with a replication factor greater than 1, but only one instance of the replica set is assigned to evaluate a rule group. The rest of the rulers returned by the ring operation only hold a copy of the rule group and send it when listing rule groups. In the event that a ruler goes down, its rule groups are still loaded by other rulers, so the PrometheusRules API will still return a 2xx, but some of the rule groups will have the default state (state before evaluation).
Because a rule group can now be loaded by multiple rulers (though only evaluated by one), the PrometheusRules API was updated to deduplicate the rule groups by keeping the GroupStateDesc with the latest evaluation timestamp. This guarantees that in the happy path, where all rulers are healthy, the GroupStateDesc coming from the ruler evaluating the rule group is always kept and sent in the response.
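As an illustration of the deduplication described above (GroupStateDesc is stubbed down to the fields that matter here; the real merge also handles filters and per-ruler errors): for each rule group, keep the copy with the most recent evaluation timestamp, so the evaluating ruler's state wins over the default-state copies from backup rulers.

```go
package main

import (
	"fmt"
	"time"
)

// groupState is a simplified stand-in for GroupStateDesc.
type groupState struct {
	Key            string // e.g. "<namespace>/<group name>"
	EvaluationTime time.Time
}

// mergeGroupStates keeps, per rule group, the state with the latest
// evaluation timestamp.
func mergeGroupStates(states []groupState) map[string]groupState {
	merged := make(map[string]groupState)
	for _, s := range states {
		if cur, ok := merged[s.Key]; !ok || s.EvaluationTime.After(cur.EvaluationTime) {
			merged[s.Key] = s
		}
	}
	return merged
}

func main() {
	now := time.Now()
	states := []groupState{
		{Key: "ns/group-1", EvaluationTime: now},         // from the evaluating ruler
		{Key: "ns/group-1", EvaluationTime: time.Time{}}, // default state from a backup ruler
	}
	kept := mergeGroupStates(states)["ns/group-1"]
	fmt.Println(kept.EvaluationTime.Equal(now)) // true: the evaluated copy wins
}
```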
Which issue(s) this PR fixes:
Fixes #5773
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]