feat: Allow to restrict the CRs watched according to their labels #1832

Draft
wants to merge 2 commits into master
Conversation


@wilfriedroset wilfriedroset commented Jan 20, 2025

The context could be two operators and two Grafana CRs, the first CR with the label shard: 1 and the second with shard: 2. I could then configure operator 1 to watch only the CRs with the label shard: 1 and operator 2 to watch only shard: 2.
This example might seem simple, but I'm aiming for a bigger scale: tens of shards, each hosting tens of thousands of CRs. I'm anticipating the heavy workload a single operator would have to handle given the number of CRs I will create, on the order of tens of thousands. Handling that magnitude with a single operator might be hard, especially when the operator restarts.

Sharding the operators would help lower the number of CRs tracked by a single one.

edit: this is a follow-up PR after a discussion on Slack --> https://kubernetes.slack.com/archives/C019A1KTYKC/p1737370451358769

@theSuess
Member

Thanks for the PR! The approach looks good to me. @Baarsgaard since you took a look at the caching logic not too long ago, does this also seem good to you?

The only thing I might change is to use comma-separated values for the label selector instead of JSON, as JSON complicates quoting quite a bit.

Do you think something like cluster=prod,shard=1 would work as well? IMHO this is cleaner than {"cluster":"prod","shard":"1"} as it also reduces the ambiguity in the distinction between integers and strings

@wilfriedroset
Author

Thank you for your review @theSuess. I've simplified the implementation; it should be more generic now that I'm using labels.Parse, so it supports label selectors such as the one you are pointing to, cluster=prod,shard=1, or even more complex ones like partition in (customerA, customerB),environment!=qa, as per the documentation here: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#set-based-requirement

WDYT?
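For illustration only (not part of this PR), a minimal sketch of how labels.Parse from k8s.io/apimachinery accepts both equality-based and set-based selectors; the example label values are made up:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// Example CR labels, made up for illustration.
	crLabels := labels.Set{"cluster": "prod", "shard": "1", "partition": "customerA", "environment": "prod"}

	for _, expr := range []string{
		"cluster=prod,shard=1",                                // equality-based
		"partition in (customerA, customerB),environment!=qa", // set-based
	} {
		sel, err := labels.Parse(expr)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q matches the example labels: %v\n", expr, sel.Matches(crLabels))
	}
}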

@wilfriedroset wilfriedroset force-pushed the labelSelectors branch 2 times, most recently from 43ae851 to 70c948d on January 21, 2025 at 13:36
@theSuess
Member

This is great! I'll test this locally and see if I can come up with an E2E test to make sure this keeps working when refactoring the caching logic.

@@ -120,6 +126,18 @@ func main() {
PprofBindAddress: pprofAddr,
}

var labelSelectors labels.Selector
Author


This block triggers an error from the linter

main.go:89:1: cyclomatic complexity 26 of func main is high (> 25) (gocyclo)

I could move this block of code into a dedicated func outside of main.
WDYT?

Member


To be honest, I think it's more readable when contained in the main function. If the refactor is easy enough then go for it, otherwise I'm good with ignoring the linter here

Author


Let's keep it like that; I would have moved the code outside only to please the linter, and I find the code readable enough.

Contributor

@Baarsgaard Baarsgaard left a comment


I actually noticed this yesterday and started to compile some notes @theSuess.
For the gocyclo cyclomatic complexity lint issue, I opened #1833 to get rid of it, as I had the same issue in #1818

var labelSelectors labels.Selector
var err error
if watchLabelSelectors != "" {
	labelSelectors, err = labels.Parse(watchLabelSelectors)
Contributor

@Baarsgaard Baarsgaard Jan 21, 2025


Note: I like this quite a lot, and would prefer to swap the existing namespace selector parsing to this or to ValidatedSelectorFromSet to maintain the current behaviour.
To properly support sharding at scale, changing to labels.Parse might be better for the namespace selector.
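Purely for illustration (the label keys and values are made up), a sketch of the difference between the two options mentioned here:

import "k8s.io/apimachinery/pkg/labels"

// namespaceSelectors sketches both approaches: ValidatedSelectorFromSet keeps
// the current equality-only behaviour, while labels.Parse also understands
// set-based requirements.
func namespaceSelectors() (labels.Selector, labels.Selector, error) {
	// Equality-only selector validated from a plain label map.
	fromSet, err := labels.ValidatedSelectorFromSet(labels.Set{"team": "observability"})
	if err != nil {
		return nil, nil, err
	}
	// Full selector syntax, including set-based requirements.
	parsed, err := labels.Parse("team in (observability, platform)")
	if err != nil {
		return nil, nil, err
	}
	return fromSet, parsed, nil
}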

@@ -180,6 +198,7 @@ func main() {

case watchNamespace == "" && watchNamespaceSelector == "":
// cluster scoped
controllerOptions.Cache.DefaultLabelSelector = labelSelectors
Contributor

@Baarsgaard Baarsgaard Jan 21, 2025


This will unfortunately break the operator in several ways:

  1. Setting a default LabelSelector would break valuesFrom referencing unlabelled ConfigMaps and Secrets; the workaround is the same cache ignore rule that is necessary in Fix: Do not cache native resources created without CommonLabels #1818
  2. Any resource created by the Operator itself, specifically those related to internal Grafana CRs, would be created and might stay in the cache for a while, but after any update or a restart they would be orphaned, as labels are not inherited; see fix: add common labels to created resources #1661
  3. I'm not 100% sure on this, but even if you fix 2.: if you create an internal Grafana and migrate it to another shard, it would take a full delete and re-apply to ensure the backing Deployment, PVC, Route, ConfigMap and such are not orphaned, losing all alert history and more if no remote database is configured.
    This could potentially be fixed by cascading label updates, but would require passing down the old and new labels to all the supporting reconcilers.
  4. I think Cache is nil at this point 😅

Member


With #1818 most of this should be fine if we either

  • only apply the watch label selector to grafana CRDs
  • or somehow merge it with "app.kubernetes.io/managed-by": "grafana-operator"

Contributor

@Baarsgaard Baarsgaard Jan 22, 2025


Only applying the watchLabels on Grafana CRDs will hide them due to the defaultLabelSelector unless overwritten with byObject, and you will end up with multiple reconciles of the same resources from different shards when omitted.
Merging is an option, but does not solve the shard migration/cascading label updates to ensure nothing is orphaned.

@Baarsgaard
Contributor

Baarsgaard commented Jan 21, 2025

With the above said, I have a ton of questions related to the Slack comment ("I would prefer to keep all my CRs in the same namespace") and other unknowns:

If startup/performance is a concern:

  • I would look into allowing configuration of MaxConcurrentReconciles, letting the Operator reconcile multiple resources of the same kind concurrently and potentially speeding up startups if enough CPUs are allocated (see the sketch after this list).
  • I think the plan is to deprecate the status lists on Grafana CRs, which would remove the startup sync @theSuess?
    On startup, we currently fetch all defined Grafanas and loop through them.
  • Do you plan on applying Grafana CRs, or using Helm/Kustomize/other for creating the Deployments?
  • What is the expected resyncPeriod you're targeting at this scale? Depending on the use case it could be set as low as 8h.
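A rough sketch of what exposing MaxConcurrentReconciles could look like with controller-runtime's builder API; the function, flag wiring, and value are illustrative and not the operator's actual code:

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	v1beta1 "github.com/grafana/grafana-operator/v5/api/v1beta1" // assumed module path
)

// setupDashboardController registers a dashboard controller with a configurable
// number of concurrent reconciles; maxConcurrent would come from a hypothetical
// CLI flag, not an existing one.
func setupDashboardController(mgr ctrl.Manager, r reconcile.Reconciler, maxConcurrent int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1beta1.GrafanaDashboard{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrent}).
		Complete(r)
}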

Sharding:

  • Do you plan to migrate resources between shards? If so, what would your mechanism for controlling this be? Manual or automatic?
  • In this context, namespaces and labels essentially accomplish the same thing, except one is more limiting than the other when listing resources. Is there a specific reason for preferring a single namespace?
    Is the plan to make use of Kubernetes to prevent name collisions?
  • I don't know if there's a good way to drop part of the cache in order to migrate resources/namespaces between shards

Potential blockers/worries

  • Have you factored in the "eventually consistent" nature of applying new Grafana CRs?
    Applying a new resource CR will instantly show up in all matching instances, but when applying a Grafana CR you have to either wait for resyncPeriods to elapse, restart the operator, or create a new generation/update on all resources that need to be reconciled.
  • If you do end up with 500k CRs in a single namespace, what protections do you have in place to prevent kubectl get grafana-operator --all or similar large requests? Would they interfere with the operators?
  • From a usability standpoint, I worry there's a significant overhead when working with individually labelled resources compared to namespaced unless it's entirely abstracted away.

I do not have a great understanding of compute at that scale, but you could spin up a lot of resources quickly with the below and stress the operator:

  1. Check out Fix: Do not cache native resources created without CommonLabels #1818 as I've been doing some cache tuning there lately.
  2. Create N Grafana CRs + N*7 resource CRs:
    for i in {0..N}; do kubectl create ns test-$i && kubectl apply -n test-$i -f tests/example-resources.yml; done
  3. Get an understanding of resource usage with pprof:
kubectl port-forward -n grafana-operator-system deploy/grafana-operator-controller-manager-v5 8888 &
go tool pprof -pdf http://localhost:8888/debug/pprof/heap > heap.pdf
go tool pprof -pdf http://localhost:8888/debug/pprof/goroutine > goroutine.pdf

If a lot of these resources are provided by a central team and not by consumers, are there features you'd like to see that could reduce the total number of CRs?
I have been playing with an idea to support replacing/prefixing/suffixing values of resources with values from the matching Grafana CRs,
essentially using the name/namespace of a Grafana CR in dashboard/folder names or datasource URLs.

PS: It's not every day you get to work on stuff at this scale, so I am interested/slightly invested in this already.

@wilfriedroset
Author

Thank you @Baarsgaard for taking the time to thoroughly review my PR. Here is more context:

I would see to allowing configuration of MaxConcurrentReconcile allowing the Operator to reconcile multiple resources of the same kind concurrently, potentially speeding up startups if enough CPUs are allocated.

Indeed, I will need to adjust this setting as the number of CRs grows.

do you plan on applying Grafana CRs or use Helm/Kustomize/Other for creating the Deployments?

I have a proof-of-concept that allows CRUD operations on the Grafana CRs via API calls; the underlying API creates the CRs directly in k8s.
Basically you can do something like this:

# Create a new Grafana
curl  -v -XPOST  -H "Content-Type: application/json" localhost:8080/v1/grafana -d '{"service_id": "bbb", "version": "11.0.0"}'
 
# Update a Grafana, for example the version
curl  -v -XPUT  -H "Content-Type: application/json" localhost:8080/v1/grafana/bbb -d '{"service_id": "bbb", "version": "11.1.0"}'
 
# Delete a Grafana
curl  -v -XDELETE  -H "Content-Type: application/json" localhost:8080/v1/grafana/bbb
 
# Add datasources in a given Grafana
curl -v -XPOST  -H "Content-Type: application/json" localhost:8085/v1/grafana/bbb/datasources  -d '{"datasources": [{"url": "http://my-prom-not-hosted-on-k8s:8080/", "name": "prometheus", "type": "prometheus"}]}'

What is the expected resyncPeriod you're targeting at this scale? Depending on the usecase it could be set as low as 8h.

I will have to work more on my proof-of-concept to find the correct value. I'm also investigating double sharding:

  • multiple operators, each watching a subset of the CRs
  • multiple k8s clusters

My control plane will be responsible for deploying the CRs evenly where there is room (on the least crowded cluster && the least crowded operator shard)

Do you plan to migrate resources between shards? If so, what would your mechanism for controlling this be? Manual or automatic?

I do not plan to migrate resources between shards.

In this context, namespaces and labels essentially accomplish the same thing, except one is more limiting than the other when listing resources. Is there a specific reason to preferring a single namespace?

It is simpler to have everything in the same namespace from a provisioning and operation point of view.

Is the plan to make use kuberenetes to prevent name collisions?

I have full control over the name of each CR; my API is responsible for crafting the correct name (e.g. a compact UUID).

Have you factored in the "eventually consistent" nature of applying new Grafana CRs?

This is OK on my end. Moreover, the double sharding should ease the amount of work done by a single operator.

If you do end up with 500k CRs in a single namespace, what protections do you have in place to prevent kubectl get grafana-operator --all or similar large requests, would it interfere with operators?

See my comment about the double sharding. The scale I'm aiming for is 500k CRs, but I'm considering splitting the workload across tens of k8s clusters. For example, with 50 clusters each hosting 10k CRs and 10 shards per cluster, each operator shard will only be responsible for 1k CRs. However, it is true that kubectl get grafana-operator --all or similar large requests will be heavy on the operators. I don't expect to do that often, and I can always go to 20 or 100 shards.

I do not have a great understanding of compute at that scale, but you could spin up a lot of resources quickly with the below and stress the operator:

I'm progressing on my proof-of-concept with a quick-and-dirty Python script like so:

from kubernetes import client, config, utils
import yaml
import logging
import argparse


def cmdline_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "-d",
        "--debug",
        help="Print lots of debugging statements",
        action="store_const",
        dest="loglevel",
        const=logging.DEBUG,
        default=logging.WARNING,
    )  # mind the default value

    parser.add_argument(
        "-v",
        "--verbose",
        help="Be verbose",
        action="store_const",
        dest="loglevel",
        const=logging.INFO,
    )

    parser.add_argument(
        "-q",
        "--quiet",
        help="Be quiet",
        action="store_const",
        dest="loglevel",
        const=logging.CRITICAL,
    )

    parser.add_argument(
        "-n",
        "--namespace",
        help="namespace where to deploy the CR",
        default="test-grafana-operator",
    )

    parser.add_argument(
        "-c", "--count", type=int, help="number of CR to deploy", default=1
    )

    args = parser.parse_args()
    logging.basicConfig(level=args.loglevel)
    return args


def main():
    args = cmdline_parser()
    config.load_kube_config()
    api = client.CustomObjectsApi()

    body = None
    with open("./grafana.yaml") as fd:
        body = yaml.safe_load(fd)
    for i in range(args.count):
        body["metadata"]["name"] = f"grafana-{i}"
        logging.debug(f"creating grafana {i}")
        try:
            api.create_namespaced_custom_object(
                group="grafana.integreatly.org",
                version="v1beta1",
                namespace=args.namespace,
                plural="grafanas",
                body=body,
            )
        except client.ApiException as e:
            if e.reason == "Conflict":
                api.patch_namespaced_custom_object(
                    name=body["metadata"]["name"],
                    group="grafana.integreatly.org",
                    version="v1beta1",
                    namespace=args.namespace,
                    plural="grafanas",
                    body=body,
                )
            else:
                raise


if __name__ == "__main__":
    main()
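Assuming the script is saved as create_grafanas.py (a file name I made up) next to the grafana.yaml it reads, it can be run like:

python3 create_grafanas.py --namespace test-grafana-operator --count 1000 -v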

With the following CR

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana-1
  namespace: test-grafana-operator
spec:
  deployment:
    spec:
      template:
        spec:
          containers:
            - image: docker.io/grafana/grafana:11.0.0
              name: grafana
              resources:
                limits:
                  cpu: "1"
                  memory: 512Mi
                requests:
                  cpu: 20m
                  memory: 100Mi

I will report what I find out with pprof when I get to a comfortable spot. I could spawn a Pyroscope to ease the work 😇

If a lot of these resources are provided by a central team and not by consumers, is there features you'd like to see that could reduce the total number or CRs?

The current state of the operator seems to suit my use case. Upon a request for a Grafana, I create a new CR in the right place (right cluster, right shard). It was more or less straightforward to implement my API and test it.
