
Allowing rule backup for rules API HA #5782

Merged — 12 commits merged into cortexproject:master on Apr 19, 2024

Conversation

@emanlodovice (Contributor) commented Feb 22, 2024

What this PR does:
This PR introduces support for rule group replication in the ruler for API HA. The idea is that the ruler can be configured with a replication factor greater than 1, but only one ruler in the replica set is assigned to evaluate a given rule group. The other rulers returned by the ring operation only hold a copy of the rule group and return it when listing rule groups. If a ruler goes down, its rule groups are still loaded by other rulers, so the PrometheusRules API still returns a 2xx, but some of the rule groups will have the default state (the state before evaluation).

Because a rule group can now be loaded by multiple rulers (though it is only evaluated by one), the PrometheusRules API was updated to deduplicate rule groups by keeping the GroupStateDesc with the latest evaluation timestamp. This guarantees that in the happy path, where all rulers are healthy, the GroupStateDesc coming from the ruler evaluating the rule group is always the one kept and sent in the response.
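
For illustration only, here is a minimal Go sketch of the deduplication idea described above; the type layout and the `mergeGroupStateDesc` helper are assumptions made for the example, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-ins for the Cortex types; the field set here is an
// assumption for illustration, not the real GroupStateDesc definition.
type GroupStateDesc struct {
	Namespace           string
	Name                string
	EvaluationTimestamp time.Time
}

// mergeGroupStateDesc (hypothetical name) keeps, for each (namespace, name)
// pair, the state with the most recent evaluation timestamp. In the happy
// path that is the copy from the ruler actually evaluating the group;
// backup copies carry the zero timestamp (default state).
func mergeGroupStateDesc(in []*GroupStateDesc) []*GroupStateDesc {
	type key struct{ namespace, name string }
	latest := make(map[key]*GroupStateDesc)
	for _, g := range in {
		k := key{g.Namespace, g.Name}
		if cur, ok := latest[k]; !ok || g.EvaluationTimestamp.After(cur.EvaluationTimestamp) {
			latest[k] = g
		}
	}
	out := make([]*GroupStateDesc, 0, len(latest))
	for _, g := range latest {
		out = append(out, g)
	}
	return out
}

func main() {
	evaluated := &GroupStateDesc{Namespace: "ns", Name: "g1", EvaluationTimestamp: time.Now()}
	backup := &GroupStateDesc{Namespace: "ns", Name: "g1"} // never evaluated, default state
	fmt.Println(len(mergeGroupStateDesc([]*GroupStateDesc{backup, evaluated}))) // 1
}
```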

Which issue(s) this PR fixes:
Fixes #5773

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch from e9eebc6 to 3624924 on February 22, 2024 02:13
@pull-request-size pull-request-size bot added size/M and removed size/S labels Feb 22, 2024
@emanlodovice emanlodovice changed the title from "Allowing ruler replication to be configurable" to "Allowing ruler replication to be configurable for rules API HA" Feb 22, 2024
@pull-request-size pull-request-size bot added size/L and removed size/M labels Feb 27, 2024
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 4 times, most recently from 8331652 to 014235e on February 29, 2024 09:02
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 3 times, most recently from 0375b45 to 2f8f631 on March 5, 2024 01:29
@emanlodovice emanlodovice changed the title from "Allowing ruler replication to be configurable for rules API HA" to "Allowing rule backup for rules API HA" Mar 5, 2024
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 2 times, most recently from 7aa272b to 5a1bc69 on March 5, 2024 08:07
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 9 times, most recently from 769796e to a37e449 on March 11, 2024 03:40
@@ -85,6 +88,7 @@ func NewDefaultMultiTenantManager(cfg Config, managerFactory ManagerFactory, eva
mapper: newMapper(cfg.RulePath, logger),
userManagers: map[string]RulesManager{},
userManagerMetrics: userManagerMetrics,
rulesBackupManager: newRulesBackupManager(cfg, logger, reg),
Contributor:

Can we skip initializing it if the feature is not enabled? That would also skip initializing the metrics.

Namespace: "cortex",
Name: "ruler_backup_rule_group_rules",
Help: "The number of backed up rules",
}, []string{"user", "rule_group"}),
Contributor:

Why do we need rule_group as a label? What's the use case for knowing the count per rule group?

I understand the backup is either a success or a failure for all rules of a tenant. Isn't a boolean value good enough?

Contributor Author:

I planned for this metric to be the same as cortex_prometheus_rule_group_rules. It would give us an understanding of how much data each ruler holds as backup. It can also be used to detect discrepancies between what was evaluated and what is in the backup, which can happen since rule syncs happen at different times for each ruler. I don't know when this information would be super useful, but I see some potential. If you think the cost of using an int over a bool is not worth it, I will change it :D

@yeya24 (Contributor) commented Mar 28, 2024:

> It can also be used to detect discrepancies between what was evaluated and what is in the backup, which can happen since rule syncs happen at different times for each ruler. I don't know when this information would be super useful, but I see some potential.

Yeah, I understand we have such a discrepancy, but I have no idea how we are really going to use this count. Does it matter if we know the count is different? The sync delay is by design.

I prefer to keep it simple and use a boolean, or count the total number of rules for a user, removing the rule_group label.

Member:

What is the conclusion? It seems that rule_group is still here.

Contributor Author:

I think I misunderstood it. I kept the rule_group label but used a boolean. I will update it to remove the rule_group label and use the total count of rules instead.
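
For reference, a hedged sketch of what that per-user metric shape could look like with client_golang; the metric name, help text, and the `recordBackup` helper are illustrative assumptions, not necessarily what the PR ends up shipping:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: a per-user gauge for the total number of backed-up rules,
// with the rule_group label dropped as discussed above.
var backedUpRules = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "cortex",
	Name:      "ruler_backup_rules",
	Help:      "Total number of rules held as backup per user.",
}, []string{"user"})

// recordBackup sets the gauge for one tenant to the total rule count.
func recordBackup(user string, totalRules int) {
	backedUpRules.WithLabelValues(user).Set(float64(totalRules))
}

func main() {
	prometheus.MustRegister(backedUpRules)
	recordBackup("tenant-1", 42)
}
```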

@@ -476,6 +491,29 @@ func instanceOwnsRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRu
return ownsRuleGroup, nil
}

func instanceBacksUpRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string) (bool, error) {
Contributor:

This function seems to do the same as instanceOwnsRuleGroup except that it uses a different condition.
One is ownsRuleGroup := rlrs.Instances[0].Addr == instanceAddr and this one is

	var backupRuleGroup bool
	// Only the second up to the last replica is used as a backup
	for i := 1; i < len(rlrs.Instances); i++ {
		if rlrs.Instances[i].Addr == instanceAddr {
			backupRuleGroup = true
			break
		}
	}

Can we just make this condition a parameter and remove the duplicated code?
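
One possible shape for that generalization, sketched with stand-in ring types; the names `instanceMatchesRuleGroup`, `ownsRuleGroup`, and `backsUpRuleGroup` are hypothetical, and the actual refactor in the PR may differ:

```go
package main

import "fmt"

// Stand-in ring types, just enough to show the shape of the suggested refactor.
type InstanceDesc struct{ Addr string }
type ReplicationSet struct{ Instances []InstanceDesc }

// instanceMatchesRuleGroup keeps the shared replica-set logic in one place and
// lets the caller pass the condition that decides whether this instance owns
// or merely backs up the rule group.
func instanceMatchesRuleGroup(rlrs ReplicationSet, instanceAddr string, match func([]InstanceDesc, string) bool) bool {
	return match(rlrs.Instances, instanceAddr)
}

// Owner condition: the first instance in the replica set evaluates the group.
func ownsRuleGroup(instances []InstanceDesc, addr string) bool {
	return len(instances) > 0 && instances[0].Addr == addr
}

// Backup condition: the second up to the last replica holds a backup copy.
func backsUpRuleGroup(instances []InstanceDesc, addr string) bool {
	for i := 1; i < len(instances); i++ {
		if instances[i].Addr == addr {
			return true
		}
	}
	return false
}

func main() {
	rs := ReplicationSet{Instances: []InstanceDesc{{Addr: "a"}, {Addr: "b"}, {Addr: "c"}}}
	fmt.Println(instanceMatchesRuleGroup(rs, "a", ownsRuleGroup))    // true
	fmt.Println(instanceMatchesRuleGroup(rs, "b", backsUpRuleGroup)) // true
}
```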

//
// Reason why this function is not a method on Ruler is to make sure we don't accidentally use r.ring,
// but only ring passed as parameter.
func filterBackupRuleGroups(userID string, ruleGroups []*rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, ring ring.ReadRing, instanceAddr string, log log.Logger, ringCheckErrors prometheus.Counter) []*rulespb.RuleGroupDesc {
Contributor:

This function seems duplicated, too. Can we generalize the two functions?
Is log the main concern for not reusing the previous function?

@emanlodovice (Contributor Author) commented Mar 28, 2024:

My reasoning for this was that when we implement rule evaluation HA, that would also cover Rules API HA, so we wouldn't need filterBackupRuleGroups anymore and could delete it. Deleting an entire unused method is simpler than adjusting existing methods and removing the unused blocks. But thinking about it now, the API HA code and the evaluation HA code can co-exist if we allow Cortex to be operated with API HA enabled and evaluation HA disabled once evaluation HA is implemented and available. I will make the adjustment.

Contributor:

Thanks. It is not a big issue. I am fine with keeping it as is since generalizing the log is probably not easy. But I hope we can generalize instanceBacksUpRuleGroup.

merged []*GroupStateDesc
mtx sync.Mutex
merged []*GroupStateDesc
errs []error
@yeya24 (Contributor) commented Mar 28, 2024:

Do we need errs to be a slice of errors? It looks like we only use len(errs) and never the errors themselves.
In that case, is keeping track of the error count good enough?

)
failedZones := make(map[string]interface{})
@yeya24 (Contributor) commented Mar 28, 2024:

I would recommend using make(map[string]struct{}), which is the more common way in Go to represent a set.
IIUC this is what we need.
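
For context, a trivial sketch of the idiomatic set pattern being suggested (not the PR's code):

```go
package main

import "fmt"

func main() {
	// struct{} occupies no memory, so map[string]struct{} is the idiomatic
	// way to represent a set of zone names in Go.
	failedZones := make(map[string]struct{})
	failedZones["us-east-1"] = struct{}{}

	if _, ok := failedZones["us-east-1"]; ok {
		fmt.Println("zones with failures:", len(failedZones))
	}
}
```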

Comment on lines 327 to 329
Name: "cortex_ruler_get_rules_failure_total",
Help: "The total number of failed rules request sent to rulers.",
}, []string{"ruler"}),
Contributor:

Should we mention it is for failed requests of getShardedRules only?

# with default state (state before any evaluation) and send this copy in list
# API requests as backup in case the ruler who owns the rule fails to send its
# rules. This allows the rules API to handle ruler outage by returning rules
# with default state. Ring replication-factor needs to be set to 3 or more for
Contributor:

Why does it need to be 3? I understand that for Ruler API HA, 2 should be good enough. You don't need quorum (2 > 1) since you are merging the responses anyway.

func (r *rulesBackupManager) setRuleGroups(_ context.Context, ruleGroups map[string]rulespb.RuleGroupList) {
backupRuleGroups := make(map[string][]*promRules.Group)
for user, groups := range ruleGroups {
promGroups, err := r.ruleGroupListToPromGroups(user, groups)
@yeya24 (Contributor) commented Mar 28, 2024:

I am trying to understand why we need to convert rulespb.RuleGroupList to the Prometheus rules type and store it in memory. Why can't we store rulespb.RuleGroupList directly?

At the end of the day, when we call getShardedRules and getLocalRules, the final response is the rulespb type defined by us.
If we store the Prometheus type here, we are basically converting twice, at both read and write time.

And I think promManager.LoadGroups is very expensive since it needs to run m.opts.RuleDependencyController.AnalyseRules(rules) to do the topological sort every time.
https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L329

@emanlodovice (Contributor Author) commented Apr 1, 2024:

I thought we had to convert the rulespb.RuleGroupList to promRules.Group because

  1. I am hoping to reuse the Prometheus code that sets the default values of fields like the group's Interval field. We could make the code convert `rulespb.RuleGroupList` to `GroupStateDesc`, which is what `getLocalRules` returns, and just make sure to copy the behavior from https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L280-L343
  2. By converting it to groups and appending them to the list returned by the manager, we can reuse the existing code for supporting filters https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L778-L798 . We could write a different code path that filters GroupStateDesc if we opt to store GroupStateDesc or rulespb.RuleGroupList instead.

I did not realize that m.opts.RuleDependencyController.AnalyseRules(rules) can be expensive. Since this is only relevant for evaluation, we could get around it by passing in a no-op RuleDependencyController.

I think converting to the Prometheus rules type is the simpler approach. But I can change it if you think otherwise.


@yeya24 (Contributor) commented Apr 1, 2024:

> By converting it to groups and appending them to the list returned by the manager, we can reuse the existing code for supporting filters https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L778-L798

I don't see how this code is for Prometheus rules only. It looks very general, no?

> I am hoping to reuse the Prometheus code that sets the default values of fields like the group's Interval field. We could make the code convert `rulespb.RuleGroupList` to `GroupStateDesc`, which is what `getLocalRules` returns, and just make sure to copy the behavior from https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L280-L343

The reason I don't want to reuse this code is that I don't want to initialize a new rules manager every time this code path is called. It is expensive since it initializes rules metrics every time but never uses them.
Since what we need is just the rules we cached in memory, can we use the loader directly without initializing the manager?

The only way I can think of is to copy-paste the code out and remove the manager dependency. However, if we do that, why don't we just store rulespb.RuleGroupList? That way we can even get rid of the loader.

> I did not realize that m.opts.RuleDependencyController.AnalyseRules(rules) can be expensive. Since this is only relevant for evaluation, we could get around it by passing in a no-op RuleDependencyController.

We can pass a no-op controller.

@yeya24 (Contributor) commented Apr 1, 2024:

> The reason I don't want to reuse this code is that I don't want to initialize a new rules manager every time this code path is called. It is expensive since it initializes rules metrics every time but never uses them.

I am OK if we pre-initialize the manager metrics and pass them in every time. But tbh it is still not ideal. NewGroup and LoadGroups do more things than we need. We just need to convert the rules type, right?

Contributor Author:

I see, I agree. It is more optimal to store rulespb.RuleGroupList and do the conversion to GroupStateDesc upon request. I will make the change. Thank you 🙇
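
To illustrate the agreed direction, a minimal sketch of converting stored backup groups into the response type at read time; the stand-in types and the `backupGroupsToStateDesc` name are assumptions for the example, not the PR's implementation:

```go
package main

import "fmt"

// Minimal stand-ins for rulespb.RuleGroupDesc / GroupStateDesc; the real
// types live in pkg/ruler and carry more fields (interval, health, ...).
type RuleGroupDesc struct {
	Namespace string
	Name      string
	Rules     []string
}

type RuleStateDesc struct{ Rule string }

type GroupStateDesc struct {
	Group       *RuleGroupDesc
	ActiveRules []*RuleStateDesc
}

// backupGroupsToStateDesc (hypothetical name) converts the stored backup
// rule groups into the API response type only when a rules request arrives,
// instead of converting to Prometheus rule groups at write time.
func backupGroupsToStateDesc(groups []*RuleGroupDesc) []*GroupStateDesc {
	out := make([]*GroupStateDesc, 0, len(groups))
	for _, g := range groups {
		state := &GroupStateDesc{Group: g}
		for _, r := range g.Rules {
			// Backup rules are reported with default state: no evaluation data.
			state.ActiveRules = append(state.ActiveRules, &RuleStateDesc{Rule: r})
		}
		out = append(out, state)
	}
	return out
}

func main() {
	groups := []*RuleGroupDesc{{Namespace: "ns", Name: "g1", Rules: []string{"up == 0"}}}
	fmt.Println(len(backupGroupsToStateDesc(groups)[0].ActiveRules)) // 1
}
```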

@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 5 times, most recently from 4796a04 to cf3b288 on April 8, 2024 08:10
@yeya24 (Contributor) left a review:

Thanks for addressing my feedback, @emanlodovice. Overall I think it is much better and close to being merged.

I added some more comments. PTAL.

Resolved review threads: pkg/ruler/ruler.go, pkg/ring/ring.go

pkg/ring/ring.go (outdated):
// does not require quorum so only 1 replica is needed to complete the operation. For MaxUnavailableZones, it is
// not automatically reduced when there are unhealthy instances in a zone because healthy instances in the zone
// are still returned, but the information about zones with unhealthy instances is returned.
GetReplicationSetForOperationWithNoQuorum(op Operation) (ReplicationSet, map[string]struct{}, error)
@yeya24 (Contributor) commented Apr 10, 2024:

Nit about the method name: should we call it WithoutQuorum or WithNoQuorum?

> For MaxUnavailableZones, it is not automatically reduced when there are unhealthy instances in a zone because healthy instances in the zone are still returned, but the information about zones with unhealthy instances is returned.

It feels like this method differs from GetReplicationSetForOperation in more than just not requiring quorum: it also relaxes the constraint that all instances in a zone must be healthy.

Why do we make this change? I am afraid that in the future we will have a use case where we don't need quorum but we do want to exclude zones with unhealthy instances. Then we would have to change the interface again.

I want to make sure that this is not only for the Ruler API backup and can be used in the future.

@yeya24 (Contributor) commented Apr 10, 2024:

To me, it makes sense for the Ruler API backup use case to return all instances even in an unhealthy zone. However, I don't see another use case for it right now.
Do you think we can remove this method from the interface and move it to the Ruler ring code only, since it might only make sense for the Ruler?

Contributor Author:

Yes, I agree. My problem was that there is no ring operation that allows me to get both healthy and unhealthy instances so that I can write code in the ruler to decide how much error we can tolerate. There is GetInstanceDescsForOperation, but it only returns healthy instances. Even if we set the op to consider all statuses as healthy, it still doesn't guarantee that we get all instances in the ring, because storageLastUpdate := r.KVClient.LastUpdateTime(r.key) is also considered when checking if an instance is healthy. Do you think we can add a new method to the ring that could look like GetAllInstanceDescs(op Operation) ([]InstanceDesc, []InstanceDesc, err) and returns the healthy and unhealthy instances in the ring?

Contributor:

> Do you think we can add a new method to the ring that could look like GetAllInstanceDescs(op Operation) ([]InstanceDesc, []InstanceDesc, err) and returns the healthy and unhealthy instances in the ring?

I think we can do that, and I prefer this solution. WDYT?

Contributor:

We can change GetInstanceDescsForOperation as well. WDYT @alanprot

@emanlodovice (Contributor Author) commented Apr 10, 2024:

It seems GetInstanceDescsForOperation is only used in store-gateway sync right now. That doesn't run very often, so I guess adding a new map in the method would be fine.

Member:

Hmm...

I think I prefer a new method as well (GetAllInstanceDescs(op Operation) ([]InstanceDesc, []InstanceDesc, err)).

Contributor Author:

Updated, @yeya24. PTAL, thank you.
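
A rough sketch of the shape of the new method discussed above, using stand-in types; the real `GetAllInstanceDescs` takes an `Operation` and derives health from heartbeats and KV state, which this example simplifies away:

```go
package main

import "fmt"

// Stand-ins only: the real ring tracks tokens, heartbeats, and KV state.
type InstanceDesc struct {
	Addr    string
	Healthy bool
}

type Ring struct{ instances []InstanceDesc }

// GetAllInstanceDescs walks the whole ring and returns healthy and unhealthy
// instances separately, so the caller (the ruler) can decide how many
// failures it is willing to tolerate.
func (r *Ring) GetAllInstanceDescs() (healthy, unhealthy []InstanceDesc, err error) {
	for _, ins := range r.instances {
		if ins.Healthy {
			healthy = append(healthy, ins)
		} else {
			unhealthy = append(unhealthy, ins)
		}
	}
	return healthy, unhealthy, nil
}

func main() {
	r := &Ring{instances: []InstanceDesc{{Addr: "a", Healthy: true}, {Addr: "b"}}}
	h, u, _ := r.GetAllInstanceDescs()
	fmt.Println(len(h), len(u)) // 1 1
}
```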

}

backupGroups := r.manager.GetBackupRules(userID)
for _, group := range backupGroups {
Contributor:

Can we move some code to a separate function? getLocalRules is like 200+ lines of code.

@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 4 times, most recently from b916f50 to 81eef26 on April 12, 2024 06:02
- Remove duplicate code and use better data structures
- Make backup rule_group label match the prometheus rule_group label
- Skip initialization when feature is not enabled

Signed-off-by: Emmanuel Lodovice <[email protected]>
in ruler to get Replicaset without requiring quorum

Signed-off-by: Emmanuel Lodovice <[email protected]>
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 2 times, most recently from 0f5fd54 to d8955ab on April 12, 2024 06:40
Signed-off-by: Emmanuel Lodovice <[email protected]>
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch from d8955ab to 2681f23 on April 12, 2024 07:17
@alanprot (Member) commented:

I understood the general idea and it LGTM.

I did not review every single line in detail, but since it already has 3 approvals I'm good with it.

@yeya24 yeya24 merged commit 8f6da89 into cortexproject:master Apr 19, 2024
16 checks passed
Linked issue: Ruler API HA (#5773)
7 participants