feat: Allow to restrict the CRs watched according to their labels #1832

Draft
wants to merge 2 commits into master
Conversation


@wilfriedroset wilfriedroset commented Jan 20, 2025

The context could be two operators and two Grafana CRs, the first CR with the label shard: 1 and the second with shard: 2. I could then configure operator 1 to watch only the CRs with the label shard: 1 and operator 2 to watch only shard: 2.
This example might seem simple, but I'm aiming for a bigger scale: tens of shards, each hosting tens of thousands of CRs. I'm anticipating the heavy workload a single operator would have to handle given the number of CRs I will create, on the order of tens of thousands. Handling that magnitude with a single operator might be hard, especially when the operator restarts.

Sharding the operators would help lower the number of CRs tracked by a single one.

edit: this is a follow-up PR after a discussion on Slack --> https://kubernetes.slack.com/archives/C019A1KTYKC/p1737370451358769

@theSuess
Member

Thanks for the PR! The approach looks good to me. @Baarsgaard since you took a look at the caching logic not too long ago, does this also seem good to you?

The only thing I might change is to use comma-separated values for the label selector instead of JSON, as JSON complicates quoting quite a bit.

Do you think something like cluster=prod,shard=1 would work as well? IMHO this is cleaner than {"cluster":"prod","shard":"1"} as it also reduces the ambiguity in the distinction between integers and strings

@wilfriedroset
Author

Thank you for your review @theSuess. I've simplified the implementation; it should be more generic now that I'm using labels.Parse, so it supports label selectors such as the one you are pointing to, cluster=prod,shard=1, or even more complex ones like partition in (customerA, customerB),environment!=qa, as per the documentation here: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#set-based-requirement

WDYT?
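For illustration only (not part of this PR), a minimal sketch of how labels.Parse from k8s.io/apimachinery accepts both equality-based and set-based selectors; the example label values are made up:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// Example CR labels, made up for illustration.
	crLabels := labels.Set{"cluster": "prod", "shard": "1", "partition": "customerA", "environment": "prod"}

	for _, expr := range []string{
		"cluster=prod,shard=1",                                // equality-based
		"partition in (customerA, customerB),environment!=qa", // set-based
	} {
		sel, err := labels.Parse(expr)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q matches the example labels: %v\n", expr, sel.Matches(crLabels))
	}
}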

@wilfriedroset wilfriedroset force-pushed the labelSelectors branch 2 times, most recently from 43ae851 to 70c948d on January 21, 2025 at 13:36
@theSuess
Member

This is great! I'll test this locally and see if I can come up with an E2E test to make sure this keeps working when refactoring the caching logic.

@@ -120,6 +126,18 @@ func main() {
PprofBindAddress: pprofAddr,
}

var labelSelectors labels.Selector
Author


This block triggers an error from the linter

main.go:89:1: cyclomatic complexity 26 of func main is high (> 25) (gocyclo)

I could move this block of code into a dedicated func outside of main.
WDYT?

Member


To be honest, I think it's more readable when contained in the main function. If the refactor is easy enough then go for it, otherwise I'm good with ignoring the linter here

Author


Let's keep it like that; I would have moved the code outside only to please the linter, and I find the code readable enough.

Contributor

@Baarsgaard Baarsgaard left a comment


I actually noticed this yesterday and started to compile some notes @theSuess.
For the gocyclo cyclomatic complexity lint issue, I opened #1833 to get rid of it, as I had the same issue in #1818

var labelSelectors labels.Selector
var err error
if watchLabelSelectors != "" {
	labelSelectors, err = labels.Parse(watchLabelSelectors)
Contributor

@Baarsgaard Baarsgaard Jan 21, 2025


Note: I like this quite a lot, and would prefer to swap the existing namespace selector parsing to this or to ValidatedSelectorFromSet to maintain the current behaviour.
To properly support sharding at scale, changing to labels.Parse might be better for the namespace selector.
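Purely for illustration (the label keys and values are made up), a sketch of the difference between the two options mentioned here:

import "k8s.io/apimachinery/pkg/labels"

// namespaceSelectors sketches both approaches: ValidatedSelectorFromSet keeps
// the current equality-only behaviour, while labels.Parse also understands
// set-based requirements.
func namespaceSelectors() (labels.Selector, labels.Selector, error) {
	// Equality-only selector validated from a plain label map.
	fromSet, err := labels.ValidatedSelectorFromSet(labels.Set{"team": "observability"})
	if err != nil {
		return nil, nil, err
	}
	// Full selector syntax, including set-based requirements.
	parsed, err := labels.Parse("team in (observability, platform)")
	if err != nil {
		return nil, nil, err
	}
	return fromSet, parsed, nil
}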

@@ -180,6 +198,7 @@ func main() {

case watchNamespace == "" && watchNamespaceSelector == "":
// cluster scoped
controllerOptions.Cache.DefaultLabelSelector = labelSelectors
Contributor

@Baarsgaard Baarsgaard Jan 21, 2025


This will unfortunately break the operator in several ways:

  1. Setting a default LabelSelector would break valuesFrom referencing unlabelled ConfigMaps and Secrets; the workaround is the same cache ignore rule that is necessary in Fix: Do not cache native resources created without CommonLabels #1818
  2. Any resource created by the Operator itself, specifically those related to internal Grafana CRs, would be created and might stay in the cache for a while, but after any update or a restart they would be orphaned, as labels are not inherited; see fix: add common labels to created resources #1661
  3. I'm not 100% sure on this, but even if you fix 2.: if you create an internal Grafana and migrate it to another shard, it would take a full delete and re-apply to ensure the backing Deployment, PVC, Route, ConfigMap and such are not orphaned, losing all alert history and more if no remote database is configured.
    This could potentially be fixed by cascading label updates, but would require passing down the old and new labels to all the supporting reconcilers.
  4. I think Cache is nil at this point 😅

Member


With #1818 most of this should be fine if we either

  • only apply the watch label selector to grafana CRDs
  • or somehow merge it with "app.kubernetes.io/managed-by": "grafana-operator"

Contributor

@Baarsgaard Baarsgaard Jan 22, 2025


Only applying the watchLabels on Grafana CRDs will hide them due to the defaultLabelSelector unless overwritten with byObject, and you will end up with multiple reconciles of the same resources from different shards when omitted.
Merging is an option, but does not solve the shard migration/cascading label updates to ensure nothing is orphaned.

@Baarsgaard
Contributor

Baarsgaard commented Jan 21, 2025

With the above said, I have a ton of questions related to the Slack comment ("I would prefer to keep all my CRs in the same namespace") and other unknowns:

If startup/performance is a concern:

  • I would look into allowing configuration of MaxConcurrentReconciles, letting the Operator reconcile multiple resources of the same kind concurrently and potentially speeding up startups if enough CPUs are allocated (see the sketch after this list).
  • I think the plan is to deprecate the status lists on Grafana CRs, which would remove the startup sync @theSuess?
    On startup, we currently fetch all defined Grafanas and loop through them.
  • Do you plan on applying Grafana CRs, or using Helm/Kustomize/other for creating the Deployments?
  • What is the expected resyncPeriod you're targeting at this scale? Depending on the use case it could be set as low as 8h.
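A rough sketch of what exposing MaxConcurrentReconciles could look like with controller-runtime's builder API; the function, flag wiring, and value are illustrative and not the operator's actual code:

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	v1beta1 "github.com/grafana/grafana-operator/v5/api/v1beta1" // assumed module path
)

// setupDashboardController registers a dashboard controller with a configurable
// number of concurrent reconciles; maxConcurrent would come from a hypothetical
// CLI flag, not an existing one.
func setupDashboardController(mgr ctrl.Manager, r reconcile.Reconciler, maxConcurrent int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1beta1.GrafanaDashboard{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrent}).
		Complete(r)
}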

Sharding:

  • Do you plan to migrate resources between shards? If so, what would your mechanism for controlling this be? Manual or automatic?
  • In this context, namespaces and labels essentially accomplish the same thing, except one is more limiting than the other when listing resources. Is there a specific reason for preferring a single namespace?
    Is the plan to make use of Kubernetes to prevent name collisions?
  • I don't know if there's a good way to drop part of the cache in order to migrate resources/namespaces between shards

Potential blockers/worries

  • Have you factored in the "eventually consistent" nature of applying new Grafana CRs?
    Applying a new resource CR will instantly show up in all matching instances, but when applying a Grafana CR you have to either wait for resyncPeriods to elapse, restart the operator, or create a new generation/update on all resources that need to be reconciled.
  • If you do end up with 500k CRs in a single namespace, what protections do you have in place to prevent kubectl get grafana-operator --all or similar large requests? Would they interfere with the operators?
  • From a usability standpoint, I worry there's a significant overhead when working with individually labelled resources compared to namespaced unless it's entirely abstracted away.

I do not have a great understanding of compute at that scale, but you could spin up a lot of resources quickly with the below and stress the operator:

  1. Check out Fix: Do not cache native resources created without CommonLabels #1818 as I've been doing some cache tuning there lately.
  2. Create N Grafana CRs + N*7 resource CRs:
    for i in {0..N}; do kubectl create ns test-$i && kubectl apply -n test-$i -f tests/example-resources.yml; done
  3. Get an understanding of resource usage with pprof:
kubectl port-forward -n grafana-operator-system deploy/grafana-operator-controller-manager-v5 8888 &
go tool pprof -pdf http://localhost:8888/debug/pprof/heap > heap.pdf
go tool pprof -pdf http://localhost:8888/debug/pprof/goroutine > goroutine.pdf

If a lot of these resources are provided by a central team and not by consumers, are there features you'd like to see that could reduce the total number of CRs?
I have been playing with an idea to support replacing/prefixing/suffixing values of resources with values from the matching Grafana CRs,
essentially using the name/namespace of a Grafana CR in dashboard/folder names or datasource URLs.

PS: It's not every day you get to work on stuff at this scale, so I am interested/slightly invested in this already.

@wilfriedroset
Author

Thank you @Baarsgaard for taking the time to thoroughly review my PR. Here is more context:

I would see to allowing configuration of MaxConcurrentReconcile allowing the Operator to reconcile multiple resources of the same kind concurrently, potentially speeding up startups if enough CPUs are allocated.

Indeed, I will need to adjust this setting as the number of CRs grows.

do you plan on applying Grafana CRs or use Helm/Kustomize/Other for creating the Deployments?

I have a proof-of-concept that allows CRUD operations on the Grafana CRs via API calls; the underlying API creates the CRs directly in k8s.
Basically you can do something like this:

# Create a new Grafana
curl  -v -XPOST  -H "Content-Type: application/json" localhost:8080/v1/grafana -d '{"service_id": "bbb", "version": "11.0.0"}'
 
# Update a Grafana, for example the version
curl  -v -XPUT  -H "Content-Type: application/json" localhost:8080/v1/grafana/bbb -d '{"service_id": "bbb", "version": "11.1.0"}'
 
# Delete a Grafana
curl  -v -XDELETE  -H "Content-Type: application/json" localhost:8080/v1/grafana/bbb
 
# Add datasources in a given Grafana
curl -v -XPOST  -H "Content-Type: application/json" localhost:8085/v1/grafana/bbb/datasources  -d '{"datasources": [{"url": "http://my-prom-not-hosted-on-k8s:8080/", "name": "prometheus", "type": "prometheus"}]}'

What is the expected resyncPeriod you're targeting at this scale? Depending on the usecase it could be set as low as 8h.

I will have to work more on my proof-of-concept to find the correct value. I'm also investigating double sharding:

  • multiple operators, each watching a subset of the CRs
  • multiple k8s clusters

My control plane will be responsible for deploying the CRs evenly where there is room (on the least crowded cluster && the least crowded operator shard)

Do you plan to migrate resources between shards? If so, what would your mechanism for controlling this be? Manual or automatic?

I do not plan to migrate resources between shards.

In this context, namespaces and labels essentially accomplish the same thing, except one is more limiting than the other when listing resources. Is there a specific reason to preferring a single namespace?

It is simpler to have everything in the same namespace from a provisioning and operation point of view.

Is the plan to make use kuberenetes to prevent name collisions?

I have full control over the name of each CR; my API is responsible for crafting the correct name (e.g. a compact UUID).

Have you factored in the "eventually consistent" nature of applying new Grafana CRs?

This is OK on my end. Moreover, the double sharding should ease the amount of work done by a single operator.

If you do end up with 500k CRs in a single namespace, what protections do you have in place to prevent kubectl get grafana-operator --all or similar large requests, would it interfere with operators?

See my comment about the double sharding. The scale I'm aiming for is 500k CRs, but I'm considering splitting the workload across tens of k8s clusters. For example, with 50 clusters each hosting 10k CRs and 10 shards per cluster, each operator shard will only be responsible for 1k CRs. However, it is true that kubectl get grafana-operator --all or similar large requests will be heavy on the operators. I don't expect to do that often, and I can always go to 20 or 100 shards.

I do not have a great understanding of compute at that scale, but you could spin up a lot of resources quickly with the below and stress the operator:

I'm progressing on my proof-of-concept with a quick-and-dirty Python script like so:

from kubernetes import client, config, utils
import yaml
import logging
import argparse


def cmdline_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "-d",
        "--debug",
        help="Print lots of debugging statements",
        action="store_const",
        dest="loglevel",
        const=logging.DEBUG,
        default=logging.WARNING,
    )  # mind the default value

    parser.add_argument(
        "-v",
        "--verbose",
        help="Be verbose",
        action="store_const",
        dest="loglevel",
        const=logging.INFO,
    )

    parser.add_argument(
        "-q",
        "--quiet",
        help="Be quiet",
        action="store_const",
        dest="loglevel",
        const=logging.CRITICAL,
    )

    parser.add_argument(
        "-n",
        "--namespace",
        help="namespace where to deploy the CR",
        default="test-grafana-operator",
    )

    parser.add_argument(
        "-c", "--count", type=int, help="number of CR to deploy", default=1
    )

    args = parser.parse_args()
    logging.basicConfig(level=args.loglevel)
    return args


def main():
    args = cmdline_parser()
    config.load_kube_config()
    api = client.CustomObjectsApi()

    body = None
    with open("./grafana.yaml") as fd:
        body = yaml.safe_load(fd)
    for i in range(args.count):
        body["metadata"]["name"] = f"grafana-{i}"
        logging.debug(f"creating grafana {i}")
        try:
            api.create_namespaced_custom_object(
                group="grafana.integreatly.org",
                version="v1beta1",
                namespace=args.namespace,
                plural="grafanas",
                body=body,
            )
        except client.ApiException as e:
            if e.reason == "Conflict":
                api.patch_namespaced_custom_object(
                    name=body["metadata"]["name"],
                    group="grafana.integreatly.org",
                    version="v1beta1",
                    namespace=args.namespace,
                    plural="grafanas",
                    body=body,
                )
            else:
                raise


if __name__ == "__main__":
    main()
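Assuming the script is saved as create_grafanas.py (a file name I made up) next to the grafana.yaml it reads, it can be run like:

python3 create_grafanas.py --namespace test-grafana-operator --count 1000 -v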

With the following CR

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana-1
  namespace: test-grafana-operator
spec:
  deployment:
    spec:
      template:
        spec:
          containers:
            - image: docker.io/grafana/grafana:11.0.0
              name: grafana
              resources:
                limits:
                  cpu: "1"
                  memory: 512Mi
                requests:
                  cpu: 20m
                  memory: 100Mi

I will report what I find out with pprof when I get to a comfortable spot. I could spawn a Pyroscope to ease the work 😇

If a lot of these resources are provided by a central team and not by consumers, is there features you'd like to see that could reduce the total number or CRs?

The current state of the operator seems to suit my use case. Upon a request for a Grafana, I create a new CR in the right place (right cluster, right shard). It was more or less straightforward to implement my API and test it.
