
Allowing rule backup for rules API HA #5782

Merged — 12 commits merged into cortexproject:master on Apr 19, 2024

Conversation

@emanlodovice (Contributor) commented Feb 22, 2024

What this PR does:
This PR introduces support for rule group replication in the ruler for API HA. The idea is that the ruler can be configured with a replication factor greater than 1, but only one ruler in the replica set is assigned to evaluate a given rule group. The other rulers returned by the ring operation only hold a copy of the rule group and return it when listing rule groups. If a ruler goes down, its rule groups are still loaded by other rulers, so the PrometheusRules API still returns a 2xx, but some of the rule groups will have the default state (the state before evaluation).

Because a rule group can now be loaded by multiple rulers (though it is only evaluated by one), the PrometheusRules API was updated to deduplicate rule groups by keeping the GroupStateDesc with the latest evaluation timestamp. This guarantees that in the happy path, where all rulers are healthy, the GroupStateDesc coming from the ruler evaluating the rule group is always the one kept and sent in the response.
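
For illustration only, here is a minimal Go sketch of the deduplication idea described above; the type layout and the `mergeGroupStateDesc` helper are assumptions made for the example, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-ins for the Cortex types; the field set here is an
// assumption for illustration, not the real GroupStateDesc definition.
type GroupStateDesc struct {
	Namespace           string
	Name                string
	EvaluationTimestamp time.Time
}

// mergeGroupStateDesc (hypothetical name) keeps, for each (namespace, name)
// pair, the state with the most recent evaluation timestamp. In the happy
// path that is the copy from the ruler actually evaluating the group;
// backup copies carry the zero timestamp (default state).
func mergeGroupStateDesc(in []*GroupStateDesc) []*GroupStateDesc {
	type key struct{ namespace, name string }
	latest := make(map[key]*GroupStateDesc)
	for _, g := range in {
		k := key{g.Namespace, g.Name}
		if cur, ok := latest[k]; !ok || g.EvaluationTimestamp.After(cur.EvaluationTimestamp) {
			latest[k] = g
		}
	}
	out := make([]*GroupStateDesc, 0, len(latest))
	for _, g := range latest {
		out = append(out, g)
	}
	return out
}

func main() {
	evaluated := &GroupStateDesc{Namespace: "ns", Name: "g1", EvaluationTimestamp: time.Now()}
	backup := &GroupStateDesc{Namespace: "ns", Name: "g1"} // never evaluated, default state
	fmt.Println(len(mergeGroupStateDesc([]*GroupStateDesc{backup, evaluated}))) // 1
}
```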

Which issue(s) this PR fixes:
Fixes #5773

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch from e9eebc6 to 3624924 on February 22, 2024 02:13
@pull-request-size pull-request-size bot added size/M and removed size/S labels Feb 22, 2024
@emanlodovice emanlodovice changed the title from "Allowing ruler replication to be configurable" to "Allowing ruler replication to be configurable for rules API HA" Feb 22, 2024
@pull-request-size pull-request-size bot added size/L and removed size/M labels Feb 27, 2024
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 4 times, most recently from 8331652 to 014235e on February 29, 2024 09:02
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 3 times, most recently from 0375b45 to 2f8f631 on March 5, 2024 01:29
@emanlodovice emanlodovice changed the title from "Allowing ruler replication to be configurable for rules API HA" to "Allowing rule backup for rules API HA" Mar 5, 2024
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 2 times, most recently from 7aa272b to 5a1bc69 on March 5, 2024 08:07
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 9 times, most recently from 769796e to a37e449 on March 11, 2024 03:40
@@ -85,6 +88,7 @@ func NewDefaultMultiTenantManager(cfg Config, managerFactory ManagerFactory, eva
mapper: newMapper(cfg.RulePath, logger),
userManagers: map[string]RulesManager{},
userManagerMetrics: userManagerMetrics,
rulesBackupManager: newRulesBackupManager(cfg, logger, reg),
Contributor:

Can we skip initializing it if the feature is not enabled? That would also skip initializing the metrics.

Namespace: "cortex",
Name: "ruler_backup_rule_group_rules",
Help: "The number of backed up rules",
}, []string{"user", "rule_group"}),
Contributor:

Why do we need rule_group as a label? What's the use case for knowing the count per rule group?

I understand the backup is either a success or a failure for all rules of a tenant. Isn't a boolean value good enough?

Contributor Author:

I planned for this metric to be the same as cortex_prometheus_rule_group_rules. It would give us an understanding of how much data each ruler holds as backup. It can also be used to detect discrepancies between what was evaluated and what is in the backup, which can happen since rule syncs happen at different times for each ruler. I don't know when this information would be super useful, but I see some potential. If you think the cost of using an int over a bool is not worth it, I will change it :D

@yeya24 (Contributor) commented Mar 28, 2024:

> It can also be used to detect discrepancies between what was evaluated and what is in the backup, which can happen since rule syncs happen at different times for each ruler. I don't know when this information would be super useful, but I see some potential.

Yeah, I understand we have such a discrepancy, but I have no idea how we are really going to use this count. Does it matter if we know the count is different? The sync delay is by design.

I prefer to keep it simple and use a boolean, or count the total number of rules for a user, removing the rule_group label.

Member:

What is the conclusion? It seems that rule_group is still here.

Contributor Author:

I think I misunderstood it. I kept the rule_group label but used a boolean. I will update it to remove the rule_group label and use the total count of rules instead.
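
For reference, a hedged sketch of what that per-user metric shape could look like with client_golang; the metric name, help text, and the `recordBackup` helper are illustrative assumptions, not necessarily what the PR ends up shipping:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: a per-user gauge for the total number of backed-up rules,
// with the rule_group label dropped as discussed above.
var backedUpRules = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "cortex",
	Name:      "ruler_backup_rules",
	Help:      "Total number of rules held as backup per user.",
}, []string{"user"})

// recordBackup sets the gauge for one tenant to the total rule count.
func recordBackup(user string, totalRules int) {
	backedUpRules.WithLabelValues(user).Set(float64(totalRules))
}

func main() {
	prometheus.MustRegister(backedUpRules)
	recordBackup("tenant-1", 42)
}
```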

@@ -476,6 +491,29 @@ func instanceOwnsRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRu
return ownsRuleGroup, nil
}

func instanceBacksUpRuleGroup(r ring.ReadRing, g *rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, instanceAddr string) (bool, error) {
Contributor:

This function seems to do the same as instanceOwnsRuleGroup except that it uses a different condition.
One is ownsRuleGroup := rlrs.Instances[0].Addr == instanceAddr and this one is

	var backupRuleGroup bool
	// Only the second up to the last replica is used as a backup
	for i := 1; i < len(rlrs.Instances); i++ {
		if rlrs.Instances[i].Addr == instanceAddr {
			backupRuleGroup = true
			break
		}
	}

Can we just make this condition a parameter and remove the duplicated code?
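
One possible shape for that generalization, sketched with stand-in ring types; the names `instanceMatchesRuleGroup`, `ownsRuleGroup`, and `backsUpRuleGroup` are hypothetical, and the actual refactor in the PR may differ:

```go
package main

import "fmt"

// Stand-in ring types, just enough to show the shape of the suggested refactor.
type InstanceDesc struct{ Addr string }
type ReplicationSet struct{ Instances []InstanceDesc }

// instanceMatchesRuleGroup keeps the shared replica-set logic in one place and
// lets the caller pass the condition that decides whether this instance owns
// or merely backs up the rule group.
func instanceMatchesRuleGroup(rlrs ReplicationSet, instanceAddr string, match func([]InstanceDesc, string) bool) bool {
	return match(rlrs.Instances, instanceAddr)
}

// Owner condition: the first instance in the replica set evaluates the group.
func ownsRuleGroup(instances []InstanceDesc, addr string) bool {
	return len(instances) > 0 && instances[0].Addr == addr
}

// Backup condition: the second up to the last replica holds a backup copy.
func backsUpRuleGroup(instances []InstanceDesc, addr string) bool {
	for i := 1; i < len(instances); i++ {
		if instances[i].Addr == addr {
			return true
		}
	}
	return false
}

func main() {
	rs := ReplicationSet{Instances: []InstanceDesc{{Addr: "a"}, {Addr: "b"}, {Addr: "c"}}}
	fmt.Println(instanceMatchesRuleGroup(rs, "a", ownsRuleGroup))    // true
	fmt.Println(instanceMatchesRuleGroup(rs, "b", backsUpRuleGroup)) // true
}
```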

//
// Reason why this function is not a method on Ruler is to make sure we don't accidentally use r.ring,
// but only ring passed as parameter.
func filterBackupRuleGroups(userID string, ruleGroups []*rulespb.RuleGroupDesc, disabledRuleGroups validation.DisabledRuleGroups, ring ring.ReadRing, instanceAddr string, log log.Logger, ringCheckErrors prometheus.Counter) []*rulespb.RuleGroupDesc {
Contributor:

This function seems duplicated, too. Can we generalize the two functions?
Is log the main concern for not reusing the previous function?

@emanlodovice (Contributor Author) commented Mar 28, 2024:

My reasoning for this was that when we implement rule evaluation HA, that would also cover Rules API HA, so we wouldn't need filterBackupRuleGroups anymore and could delete it. Deleting an entire unused method is simpler than adjusting existing methods and removing the unused blocks. But thinking about it now, the API HA code and the evaluation HA code can co-exist if we allow Cortex to be operated with API HA enabled and evaluation HA disabled once evaluation HA is implemented and available. I will make the adjustment.

Contributor:

Thanks. It is not a big issue. I am fine with keeping it as is since generalizing the log is probably not easy. But I hope we can generalize instanceBacksUpRuleGroup.

merged []*GroupStateDesc
mtx sync.Mutex
merged []*GroupStateDesc
errs []error
@yeya24 (Contributor) commented Mar 28, 2024:

Do we need errs to be a slice of errors? It looks like we only use len(errs) and never the errors themselves.
In that case, is keeping track of the error count good enough?

)
failedZones := make(map[string]interface{})
@yeya24 (Contributor) commented Mar 28, 2024:

I would recommend using make(map[string]struct{}), which is the more common way in Go to represent a set.
IIUC this is what we need.
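
For context, a trivial sketch of the idiomatic set pattern being suggested (not the PR's code):

```go
package main

import "fmt"

func main() {
	// struct{} occupies no memory, so map[string]struct{} is the idiomatic
	// way to represent a set of zone names in Go.
	failedZones := make(map[string]struct{})
	failedZones["us-east-1"] = struct{}{}

	if _, ok := failedZones["us-east-1"]; ok {
		fmt.Println("zones with failures:", len(failedZones))
	}
}
```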

Comment on lines 327 to 329
Name: "cortex_ruler_get_rules_failure_total",
Help: "The total number of failed rules request sent to rulers.",
}, []string{"ruler"}),
Contributor:

Should we mention it is for failed requests of getShardedRules only?

# with default state (state before any evaluation) and send this copy in list
# API requests as backup in case the ruler who owns the rule fails to send its
# rules. This allows the rules API to handle ruler outage by returning rules
# with default state. Ring replication-factor needs to be set to 3 or more for
Contributor:

Why does it need to be 3? I understand that for Ruler API HA, 2 should be good enough. You don't need quorum (2 > 1) since you are merging the responses anyway.

func (r *rulesBackupManager) setRuleGroups(_ context.Context, ruleGroups map[string]rulespb.RuleGroupList) {
backupRuleGroups := make(map[string][]*promRules.Group)
for user, groups := range ruleGroups {
promGroups, err := r.ruleGroupListToPromGroups(user, groups)
@yeya24 (Contributor) commented Mar 28, 2024:

I am trying to understand why we need to convert rulespb.RuleGroupList to the Prometheus rules type and store it in memory. Why can't we store rulespb.RuleGroupList directly?

At the end of the day, when we call getShardedRules and getLocalRules, the final response is the rulespb type defined by us.
If we store the Prometheus type here, we are basically converting twice, at both read and write time.

And I think promManager.LoadGroups is very expensive since it needs to run m.opts.RuleDependencyController.AnalyseRules(rules) to do the topological sort every time.
https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L329

@emanlodovice (Contributor Author) commented Apr 1, 2024:

I thought we had to convert the rulespb.RuleGroupList to promRules.Group because

  1. I am hoping to reuse the Prometheus code that sets the default values of fields like the group's Interval field. We could make the code convert `rulespb.RuleGroupList` to `GroupStateDesc`, which is what `getLocalRules` returns, and just make sure to copy the behavior from https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L280-L343
  2. By converting it to groups and appending them to the list returned by the manager, we can reuse the existing code for supporting filters https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L778-L798 . We could write a different code path that filters GroupStateDesc if we opt to store GroupStateDesc or rulespb.RuleGroupList instead.

I did not realize that m.opts.RuleDependencyController.AnalyseRules(rules) can be expensive. Since this is only relevant for evaluation, we could get around it by passing in a no-op RuleDependencyController.

I think converting to the Prometheus rules type is the simpler approach. But I can change it if you think otherwise.


@yeya24 (Contributor) commented Apr 1, 2024:

> By converting it to groups and appending them to the list returned by the manager, we can reuse the existing code for supporting filters https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L778-L798

I don't see how this code is for Prometheus rules only. It looks very general, no?

> I am hoping to reuse the Prometheus code that sets the default values of fields like the group's Interval field. We could make the code convert `rulespb.RuleGroupList` to `GroupStateDesc`, which is what `getLocalRules` returns, and just make sure to copy the behavior from https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L280-L343

The reason I don't want to reuse this code is that I don't want to initialize a new rules manager every time this code path is called. It is expensive since it initializes rules metrics every time but never uses them.
Since what we need is just the rules we cached in memory, can we use the loader directly without initializing the manager?

The only way I can think of is to copy-paste the code out and remove the manager dependency. However, if we do that, why don't we just store rulespb.RuleGroupList? That way we can even get rid of the loader.

> I did not realize that m.opts.RuleDependencyController.AnalyseRules(rules) can be expensive. Since this is only relevant for evaluation, we could get around it by passing in a no-op RuleDependencyController.

We can pass a no-op controller.

@yeya24 (Contributor) commented Apr 1, 2024:

> The reason I don't want to reuse this code is that I don't want to initialize a new rules manager every time this code path is called. It is expensive since it initializes rules metrics every time but never uses them.

I am OK if we pre-initialize the manager metrics and pass them in every time. But tbh it is still not ideal. NewGroup and LoadGroups do more things than we need. We just need to convert the rules type, right?

Contributor Author:

I see, I agree. It is more optimal to store rulespb.RuleGroupList and do the conversion to GroupStateDesc upon request. I will make the change. Thank you 🙇
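
To illustrate the agreed direction, a minimal sketch of converting stored backup groups into the response type at read time; the stand-in types and the `backupGroupsToStateDesc` name are assumptions for the example, not the PR's implementation:

```go
package main

import "fmt"

// Minimal stand-ins for rulespb.RuleGroupDesc / GroupStateDesc; the real
// types live in pkg/ruler and carry more fields (interval, health, ...).
type RuleGroupDesc struct {
	Namespace string
	Name      string
	Rules     []string
}

type RuleStateDesc struct{ Rule string }

type GroupStateDesc struct {
	Group       *RuleGroupDesc
	ActiveRules []*RuleStateDesc
}

// backupGroupsToStateDesc (hypothetical name) converts the stored backup
// rule groups into the API response type only when a rules request arrives,
// instead of converting to Prometheus rule groups at write time.
func backupGroupsToStateDesc(groups []*RuleGroupDesc) []*GroupStateDesc {
	out := make([]*GroupStateDesc, 0, len(groups))
	for _, g := range groups {
		state := &GroupStateDesc{Group: g}
		for _, r := range g.Rules {
			// Backup rules are reported with default state: no evaluation data.
			state.ActiveRules = append(state.ActiveRules, &RuleStateDesc{Rule: r})
		}
		out = append(out, state)
	}
	return out
}

func main() {
	groups := []*RuleGroupDesc{{Namespace: "ns", Name: "g1", Rules: []string{"up == 0"}}}
	fmt.Println(len(backupGroupsToStateDesc(groups)[0].ActiveRules)) // 1
}
```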

@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 5 times, most recently from 4796a04 to cf3b288 on April 8, 2024 08:10
@yeya24 (Contributor) left a review:

Thanks for addressing my feedback, @emanlodovice. Overall I think it is much better and close to being merged.

I added some more comments. PTAL.

Resolved review threads: pkg/ruler/ruler.go, pkg/ring/ring.go

pkg/ring/ring.go (outdated):
// does not require quorum so only 1 replica is needed to complete the operation. For MaxUnavailableZones, it is
// not automatically reduced when there are unhealthy instances in a zone because healthy instances in the zone
// are still returned, but the information about zones with unhealthy instances is returned.
GetReplicationSetForOperationWithNoQuorum(op Operation) (ReplicationSet, map[string]struct{}, error)
@yeya24 (Contributor) commented Apr 10, 2024:

Nit about the method name: should we call it WithoutQuorum or WithNoQuorum?

> For MaxUnavailableZones, it is not automatically reduced when there are unhealthy instances in a zone because healthy instances in the zone are still returned, but the information about zones with unhealthy instances is returned.

It feels like this method differs from GetReplicationSetForOperation in more than just not requiring quorum: it also relaxes the constraint that all instances in a zone must be healthy.

Why do we make this change? I am afraid that in the future we will have a use case where we don't need quorum but we do want to exclude zones with unhealthy instances. Then we would have to change the interface again.

I want to make sure that this is not only for the Ruler API backup and can be used in the future.

@yeya24 (Contributor) commented Apr 10, 2024:

To me, it makes sense for the Ruler API backup use case to return all instances even in an unhealthy zone. However, I don't see another use case for it right now.
Do you think we can remove this method from the interface and move it to the Ruler ring code only, since it might only make sense for the Ruler?

Contributor Author:

Yes, I agree. My problem was that there is no ring operation that allows me to get both healthy and unhealthy instances so that I can write code in the ruler to decide how much error we can tolerate. There is GetInstanceDescsForOperation, but it only returns healthy instances. Even if we set the op to consider all statuses as healthy, it still doesn't guarantee that we get all instances in the ring, because storageLastUpdate := r.KVClient.LastUpdateTime(r.key) is also considered when checking if an instance is healthy. Do you think we can add a new method to the ring that could look like GetAllInstanceDescs(op Operation) ([]InstanceDesc, []InstanceDesc, err) and returns the healthy and unhealthy instances in the ring?

Contributor:

> Do you think we can add a new method to the ring that could look like GetAllInstanceDescs(op Operation) ([]InstanceDesc, []InstanceDesc, err) and returns the healthy and unhealthy instances in the ring?

I think we can do that, and I prefer this solution. WDYT?

Contributor:

We can change GetInstanceDescsForOperation as well. WDYT @alanprot

@emanlodovice (Contributor Author) commented Apr 10, 2024:

It seems GetInstanceDescsForOperation is only used in store-gateway sync right now. That doesn't run very often, so I guess adding a new map in the method would be fine.

Member:

Hmm...

I think I prefer a new method as well (GetAllInstanceDescs(op Operation) ([]InstanceDesc, []InstanceDesc, err)).

Contributor Author:

Updated, @yeya24. PTAL, thank you.
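
A rough sketch of the shape of the new method discussed above, using stand-in types; the real `GetAllInstanceDescs` takes an `Operation` and derives health from heartbeats and KV state, which this example simplifies away:

```go
package main

import "fmt"

// Stand-ins only: the real ring tracks tokens, heartbeats, and KV state.
type InstanceDesc struct {
	Addr    string
	Healthy bool
}

type Ring struct{ instances []InstanceDesc }

// GetAllInstanceDescs walks the whole ring and returns healthy and unhealthy
// instances separately, so the caller (the ruler) can decide how many
// failures it is willing to tolerate.
func (r *Ring) GetAllInstanceDescs() (healthy, unhealthy []InstanceDesc, err error) {
	for _, ins := range r.instances {
		if ins.Healthy {
			healthy = append(healthy, ins)
		} else {
			unhealthy = append(unhealthy, ins)
		}
	}
	return healthy, unhealthy, nil
}

func main() {
	r := &Ring{instances: []InstanceDesc{{Addr: "a", Healthy: true}, {Addr: "b"}}}
	h, u, _ := r.GetAllInstanceDescs()
	fmt.Println(len(h), len(u)) // 1 1
}
```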

}

backupGroups := r.manager.GetBackupRules(userID)
for _, group := range backupGroups {
Contributor:

Can we move some code to a separate function? getLocalRules is like 200+ lines of code.

@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 4 times, most recently from b916f50 to 81eef26 on April 12, 2024 06:02
- Remove duplicate code and use better data structures
- Make backup rule_group label match the prometheus rule_group label
- Skip initialization when feature is not enabled

Signed-off-by: Emmanuel Lodovice <[email protected]>
in ruler to get Replicaset without requiring quorum

Signed-off-by: Emmanuel Lodovice <[email protected]>
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch 2 times, most recently from 0f5fd54 to d8955ab on April 12, 2024 06:40
Signed-off-by: Emmanuel Lodovice <[email protected]>
@emanlodovice emanlodovice force-pushed the ruler-az-replication-config branch from d8955ab to 2681f23 on April 12, 2024 07:17
@alanprot (Member) commented:

I understood the general idea and it LGTM.

I did not review every single line in detail, but since it already has 3 approvals I'm good with it.

@yeya24 yeya24 merged commit 8f6da89 into cortexproject:master Apr 19, 2024
16 checks passed
Linked issue: Ruler API HA (#5773)
7 participants