RKE2/Containerd Not Applying Rewrite Rules in /etc/rancher/rke2/registries.yaml #6889

Closed
dustin-groh-dev opened this issue Sep 27, 2024 · 10 comments

Comments

@dustin-groh-dev

Environmental Info:

RKE2 v1.28.8
Rancher v2.8.3

Describe the bug:

When creating a new cluster via Rancher, RKE2/containerd does not apply the rewrite rules from /etc/rancher/rke2/registries.yaml, specifically when the file has entries of the form mirrors.[*].registry..... The docs and example don't use that wildcard, so this is potentially a formatting change that got missed.

Steps To Reproduce:

For a customer this was reproducible for any cluster they were attempting to create via pipeline as they use the same registries.yaml.
The fix was to edit the /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl to add in the mirror registry/rewrite rules and rerun the pipeline to create the cluster.

Expected behavior:

For RKE2 / containerd to apply the rewrite rules specified in the registries.yaml file even when they include a wildcard.

Additional context / logs:
Potentially related to #3227

@brandond
Member

brandond commented Sep 28, 2024

I am not aware of any issues with rewrites. Can you provide a specific example of registries.yaml content that does not apply rewrites?

Note that rewrite rules ONLY apply when pulling images from a mirror endpoint; rewrites are NOT intended to apply when pulling an image directly from the registry itself (ie, when using the registry's default endpoint). If you add a wildcard entry with rewrites, but no endpoints, this is not expected to do anything.
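To illustrate the distinction in the comment above, here is a minimal Python sketch of the described behavior. It is not RKE2 or containerd code: the helper names `apply_rewrites` and `candidate_manifest_urls`, the `example.com` host names, and the shape of the `mirrors` dictionary are all assumptions made up for illustration. The sketch also assumes that the first matching rule wins and that the registry's own default endpoint is tried last.

```python
import re


def apply_rewrites(repository, rewrites):
    """Apply the first matching rewrite rule (assumption: first match wins)."""
    for pattern, replacement in rewrites.items():
        # registries.yaml writes capture groups as $1; Python's re expects \g<1>.
        py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
        rewritten, count = re.subn(pattern, py_replacement, repository)
        if count:
            return rewritten
    return repository


def candidate_manifest_urls(registry, repository, tag, mirrors):
    """List the manifest URLs that would be tried for one image, in order."""
    urls = []
    mirror = mirrors.get(registry, {})
    for endpoint in mirror.get("endpoint", []):
        # Mirror endpoint: the rewrite rules configured for this registry apply.
        rewritten = apply_rewrites(repository, mirror.get("rewrite", {}))
        urls.append(f"{endpoint}/v2/{rewritten}/manifests/{tag}")
    # Default endpoint of the registry itself: no rewrite is applied.
    urls.append(f"https://{registry}/v2/{repository}/manifests/{tag}")
    return urls


# Hypothetical configuration: one entry with an endpoint, one with rewrites only.
mirrors = {
    "registry.example.com": {
        "endpoint": ["https://mirror.example.com"],
        "rewrite": {"^rancher/(.*)": "internal/rancher/$1"},
    },
    "other.example.com": {
        "rewrite": {"^rancher/(.*)": "internal/rancher/$1"},
    },
}

for name in ("registry.example.com", "other.example.com"):
    for url in candidate_manifest_urls(name, "rancher/shell", "v0.1.21", mirrors):
        print(url)
```

In this model, the entry for `other.example.com` has rewrites but no endpoints, so the only URL produced is the unrewritten one on the registry itself, which is the "not expected to do anything" case.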

@srikanth2071

srikanth2071 commented Oct 28, 2024

I have a similar issue. I am preparing to upgrade from Rancher 2.7.10 / RKE2 1.24.9 to Rancher 2.9.3 / RKE2 1.30.5.

In my test environment, I have bootstrapped rke2 with 1.30.5

k get nodes
NAME               STATUS   ROLES                              AGE   VERSION
node001-29ed1474   Ready    control-plane,etcd,master,worker   12d   v1.30.5+rke2r1
node002-29ed1474   Ready    control-plane,etcd,master,worker   7d    v1.30.5+rke2r1
node003-29ed1474   Ready    control-plane,etcd,master,worker   7d    v1.30.5+rke2r1
node-001-fe6af43f  Ready    worker                             12d   v1.30.5+rke2r1
node-002-fe6af43f  Ready    worker                             7d    v1.30.5+rke2r1
node-003-fe6af43f  Ready    worker                             7d    v1.30.5+rke2r1

with /etc/rancher/rke2/registries.yaml

mirrors:
  dockerhub.internal.com:
    endpoint:
      - "https://dockerhub.internal.com"
    rewrite:
      "^rancher/(.*)": "docker-internal/rancher/$1"
  dockerhub-master.internal.com:
    endpoint:
      - "https://dockerhub-master.internal.com"
    rewrite:
      "^rancher/(.*)": "docker-internal/rancher/$1"
  oci.internal.com:
    endpoint:
      - "https://oci.internal.com"
    rewrite:
      "^rancher/(.*)": "docker-internal/rancher/$1"
configs:
  dockerhub.internal.com:
    auth:
      password: redacted
      username: username
  dockerhub-master.internal.com:
    auth:
      password: redacted
      username: username
  qa-oci.internal.com:
    auth:
      password: redacted
      username: username

With this configuration, image pulls failed with an image not found error.
RKE2 generated /var/lib/rancher/rke2/agent/etc/containerd/config.toml (content in the attached config.toml.txt file) without mirrors and rewrites.

Since config.toml is managed by RKE2, I cannot edit it manually and have the changes persist. I used config.toml.tmpl instead to add the rewrites manually on each node, after which image pulls from the private registry worked as expected.

After fixing the rewrites, I was able to deploy the Rancher Helm chart version 2.9.3. Creating a downstream cluster then hit the same problem (the downstream cluster's VMs and other resources are provisioned with Terraform).

Oct 28 22:02:31 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:31Z" level=info msg="Rancher System Agent version v0.3.10 (7ad21ff) is starting"
Oct 28 22:02:31 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:31Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Oct 28 22:02:31 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:31Z" level=info msg="Starting remote watch of plans"
Oct 28 22:02:31 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:31Z" level=info msg="Starting /v1, Kind=Secret controller"
Oct 28 22:02:31 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:31Z" level=info msg="Detected first start, force-applying one-time instruction set"
.....

Oct 28 22:02:51 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:51Z" level=info msg="[Applyinator] Applying one-time instructions for plan with checksum fd5e40f76bdb17a2c54e01742cb28311567a5fe66cb9aea935108e0a5f25b95e"
Oct 28 22:02:51 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:51Z" level=info msg="[Applyinator] Extracting image dockerhub-master.internal.com/rancher/system-agent-installer-rke2:v1.24.9-rke2r2 to directory /var/lib/rancher/agent/work/20241028-220251/fd5e40f76bdb17a2c54e01742cb28311567a5fe66cb9aea935108e0a5f25b95e_0"
Oct 28 22:02:51 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:51Z" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
Oct 28 22:02:51 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:51Z" level=info msg="Pulling image dockerhub-master.internal.com/rancher/system-agent-installer-rke2:v1.24.9-rke2r2"
Oct 28 22:02:52 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:52Z" level=warning msg="Failed to get image from endpoint: GET https://dockerhub-master.internal.com/v2/rancher/system-agent-installer-rke2/manifests/v1.24.9-rke2r2: : Repository 'rancher' not found"
Oct 28 22:02:52 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:52Z" level=warning msg="Failed to get image from endpoint: GET https://dockerhub-master.internal.com/v2/rancher/system-agent-installer-rke2/manifests/v1.24.9-rke2r2: : Repository 'rancher' not found"
Oct 28 22:02:52 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:52Z" level=error msg="error while staging: all endpoints failed: GET https://dockerhub-master.internal.com/v2/rancher/system-agent-installer-rke2/manifests/v1.24.9-rke2r2: : Repository 'rancher' not found; GET https://dockerhub-master.internal.com/v2/rancher/system-agent-installer-rke2/manifests/v1.24.9-rke2r2: : Repository 'rancher' not found: failed to get image dockerhub-master.internal.com/rancher/system-agent-installer-rke2:v1.24.9-rke2r2"
Oct 28 22:02:52 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:52Z" level=error msg="error executing instruction 0: all endpoints failed: GET https://dockerhub-master.internal.com/v2/rancher/system-agent-installer-rke2/manifests/v1.24.9-rke2r2: : Repository 'rancher' not found; GET https://dockerhub-master.internal.com/v2/rancher/system-agent-installer-rke2/manifests/v1.24.9-rke2r2: : Repository 'rancher' not found: failed to get image dockerhub-master.internal.com/rancher/system-agent-installer-rke2:v1.24.9-rke2r2"
Oct 28 22:02:52 node-kbc-001-360eb52d rancher-system-agent[18951]: time="2024-10-28T22:02:52Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20241028-220251/fd5e40f76bdb17a2c54e01742cb28311567a5fe66cb9aea935108e0a5f25b95e_0"

In my current live setup I have nearly 250 clusters of different sizes registered with Rancher 2.7.10. On all nodes the private registry credentials are rotated often. They are stored in HashiCorp Vault and flow through this chain: Vault > ExternalSecretsOperator > Rancher fleet-default/ExternalSecret > Rancher fleet-default/Secret. Each cluster's registry config uses this secret to update the credentials on each node automatically.

Updating config.toml via config.toml.tmpl on every node of all 250 clusters is going to be very complex.

Is there anything wrong with my registries.yaml? I'm not sure why the rewrite is not added to config.toml from /etc/rancher/rke2/registries.yaml.

@brandond
Member

brandond commented Oct 29, 2024

I'm really confused by what you're doing here. Why are you trying to apply rewrites when pulling images directly from these registries? Why are you trying to override the desired behavior by providing your own containerd config template?

As I said above:

Note that rewrite rules ONLY apply when pulling images from a mirror endpoint; rewrites are NOT intended to apply when pulling an image directly from the registry itself

It looks like you're trying to use these private registries as a mirror for docker.io, and apply rewrites when pulling the Rancher images from these registries. In that case, you should actually set these up as mirrors for docker.io, as shown in the RKE2 docs:

mirrors:
  docker.io:
    endpoint:
      - "https://dockerhub.internal.com"
      - "https://dockerhub-master.internal.com"
      - "https://qa-oci.internal.com"
    rewrite:
      "^rancher/(.*)": "docker-internal/rancher/$1"
configs:
  dockerhub.internal.com:
    auth:
      password: redacted
      username: username
  dockerhub-master.internal.com:
    auth:
      password: redacted
      username: username
  qa-oci.internal.com:
    auth:
      password: redacted
      username: username

RKE2 configured the /var/lib/rancher/rke2/agent/etc/containerd/config.toml with config.toml.txt file content without mirrors and rewrites.

Specifying mirrors and rewrites in containerd's config.toml has LONG been deprecated. Recent releases of RKE2 now put this configuration where it belongs, in files under /var/lib/rancher/rke2/agent/etc/containerd/certs.d. You will find a directory for each registry, containing a hosts.toml file with the mirrors and rewrites. Only credentials (auth) still go in config.toml.

@srikanth2071

srikanth2071 commented Nov 4, 2024

Thanks @brandond for the reply.

I am using rewrites in registries.yaml when pulling images from dockerhub-master.internal.com, dockerhub.internal.com, and qa-oci.internal.com because my images are stored in the private registry at the path <registry endpoint>/docker-internal/rancher/<all rancher images> (not at <registry endpoint>/rancher/<all images>).

Yes, I read in the containerd documentation that using mirrors and rewrites in containerd's config.toml is deprecated. I don't plan to manage containerd's config.toml through config.toml.tmpl. I only tried it as a test to see whether it works.

In my /etc/rancher/rke2/config.yaml I am using system-default-registry: dockerhub-master.internal.com. When rancher-system-agent pulls images, it tries dockerhub-master.internal.com/rancher/system-agent-installer-rke2:v1.24.9-rke2r2, which fails because the images are uploaded internally to dockerhub-master.internal.com/docker-internal/rancher/system-agent-installer-rke2:v1.24.9-rke2r2. I used the rewrite to get the path docker-internal/rancher.

as you said

Note that rewrite rules ONLY apply when pulling images from a mirror endpoint; rewrites are NOT intended to apply when pulling an image directly from the registry itself

So I don't need to use mirrors? I can pull images directly from my private registry, provided I upload the images to the path <private registry endpoint>/rancher/*, with the config below?

/etc/rancher/rke2/registries.yaml

configs:
  dockerhub.internal.com:
    auth:
      password: redacted
      username: username
  dockerhub-master.internal.com:
    auth:
      password: redacted
      username: username
  qa-oci.internal.com:
    auth:
      password: redacted
      username: username

system-default-registry: dockerhub-master.internal.com in /etc/rancher/rke2/config.yaml

@brandond
Member

brandond commented Nov 4, 2024

Yep - if they're in your private registry under the same name, then you can just set system-default-registry in the config.yaml, and provide creds in registries.yaml.

@srikanth2071

Ok. I will need to check with the internal team that maintains the private Artifactory to see if I can get a project named rancher, so that I can use the path /rancher/*.

In case I have to use a custom path, what is the correct configuration for it, for example /docker-internal/rancher/*?

@brandond
Member

brandond commented Nov 4, 2024

Leave system-default-registry unset, and do as I said above to use your Artifactory as a mirror for docker.io, with rewrites.
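The reasoning behind this advice can be sketched in Python. This is a simplified model, not RKE2 code: `system_image_ref` and `mirror_entry_for` are hypothetical helpers, and the sketch assumes that `system-default-registry` simply replaces the default `docker.io` prefix on system images, which matches the pull attempts shown in the logs earlier in this thread.

```python
def system_image_ref(image, system_default_registry=""):
    """Prefix a system image such as 'rancher/shell:v0.1.21' with its registry."""
    registry = system_default_registry or "docker.io"
    return f"{registry}/{image}"


def mirror_entry_for(image_ref, mirrors):
    """Return the mirrors entry keyed by the image's registry, or None."""
    registry = image_ref.split("/", 1)[0]
    return mirrors.get(registry)


# Mirror configuration in the shape suggested earlier in the thread (one endpoint shown).
mirrors = {
    "docker.io": {
        "endpoint": ["https://dockerhub-master.internal.com"],
        "rewrite": {"^rancher/(.*)": "docker-internal/rancher/$1"},
    },
}

image = "rancher/system-agent-installer-rke2:v1.24.9-rke2r2"

# With system-default-registry set, the image belongs to the private registry,
# so the docker.io mirror entry (and its rewrite) is never consulted.
with_default = system_image_ref(image, "dockerhub-master.internal.com")
print(with_default, mirror_entry_for(with_default, mirrors) is not None)

# With it unset, the image stays a docker.io image and the mirror entry applies.
without_default = system_image_ref(image)
print(without_default, mirror_entry_for(without_default, mirrors) is not None)
```

Under these assumptions the first line ends in `False` and the second in `True`: only the unset case reaches the `docker.io` mirror entry where the rewrite lives.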

@srikanth2071

Ok. I will try and test it

Thanks @brandond

@srikanth2071

srikanth2071 commented Nov 8, 2024

With the configuration below, the RKE2 cluster bootstrapped successfully.
system-default-registry is unset.

/etc/rancher/agent/registries.yaml
/etc/rancher/rke2/registries.yaml

mirrors:
  docker.io:
    endpoint:
    - https://dockerhub-master.internal.com
    - https://dockerhub.internal.com
    - https://oci.internal.com
    rewrite:
      ^rancher/(.*): docker-internal/3rdparty/rancher/$1
configs:
  dockerhub-master.internal.com:
    auth:
      username: username
      password: password
  dockerhub.internal.com:
    auth:
      username: username
      password: password
  oci.internal.com:
    auth:
      username: username
      password: password

I checked the images used in pods:

k get pods --all-namespaces -o jsonpath="{..image}" | tr -s '[[:space:]]' '\n' | sort | uniq
docker.io/rancher/fleet-agent:v0.8.1
docker.io/rancher/hardened-calico:v3.27.2-build20240308
docker.io/rancher/hardened-cluster-autoscaler:v1.8.10-build20240124
docker.io/rancher/hardened-coredns:v1.11.1-build20240305
docker.io/rancher/hardened-etcd:v3.5.9-k3s1-build20230802
docker.io/rancher/hardened-flannel:v0.24.3-build20240307
docker.io/rancher/hardened-k8s-metrics-server:v0.6.3-build20231009
docker.io/rancher/hardened-kubernetes:v1.26.15-rke2r1-build20240314
docker.io/rancher/klipper-helm:v0.8.3-build20240228
docker.io/rancher/kube-api-auth:v0.2.0
docker.io/rancher/mirrored-sig-storage-snapshot-controller:v6.2.1
docker.io/rancher/mirrored-sig-storage-snapshot-validation-webhook:v6.2.2
docker.io/rancher/nginx-ingress-controller:nginx-1.9.3-hardened1
docker.io/rancher/rancher-agent:v2.7.10
docker.io/rancher/rancher-webhook:v0.3.6
docker.io/rancher/rke2-cloud-provider:v1.26.3-build20230406
docker.io/rancher/shell:v0.1.21
docker.io/rancher/system-agent:v0.3.3-suc
docker.io/rancher/system-upgrade-controller:v0.11.0
index.docker.io/rancher/hardened-etcd:v3.5.9-k3s1-build20230802
index.docker.io/rancher/hardened-kubernetes:v1.26.15-rke2r1-build20240314
index.docker.io/rancher/rke2-cloud-provider:v1.26.3-build20230406
rancher/fleet-agent:v0.8.1
rancher/hardened-calico:v3.27.2-build20240308
rancher/hardened-cluster-autoscaler:v1.8.10-build20240124
rancher/hardened-coredns:v1.11.1-build20240305
rancher/hardened-flannel:v0.24.3-build20240307
rancher/hardened-k8s-metrics-server:v0.6.3-build20231009
rancher/klipper-helm:v0.8.3-build20240228
rancher/kube-api-auth:v0.2.0
rancher/mirrored-sig-storage-snapshot-controller:v6.2.1
rancher/mirrored-sig-storage-snapshot-validation-webhook:v6.2.2
rancher/nginx-ingress-controller:nginx-1.9.3-hardened1
rancher/rancher-agent:v2.7.10
rancher/rancher-webhook:v0.3.6
rancher/shell:v0.1.21
rancher/system-agent:v0.3.3-suc
rancher/system-upgrade-controller:v0.11.0

I also verified the containerd certs.d directory:

ls -l /var/lib/rancher/rke2/agent/etc/containerd/certs.d
total 0
drwx------. 2 root root 24 Nov  8 19:05 docker.io
drwx------. 2 root root 24 Nov  8 19:05 dockerhub-master.internal.com
drwx------. 2 root root 24 Nov  8 19:05 dockerhub.internal.com
drwx------. 2 root root 24 Nov  8 19:05 oci.iotcc.internal.com

and docker.io/hosts.toml has the proper rewrite directives:

cat /var/lib/rancher/rke2/agent/etc/containerd/certs.d/docker.io/hosts.toml
# File generated by rke2. DO NOT EDIT.

server = "https://registry-1.docker.io/v2"
capabilities = ["pull", "resolve", "push"]



[host."https://dockerhub-master.internal.com/v2"]
  capabilities = ["pull", "resolve"]
  [host."https://dockerhub-master.internal.com/v2".rewrite]
    "^rancher/(.*)" = "docker-internal/3rdparty/rancher/$1"

[host."https://dockerhub.internal.com/v2"]
  capabilities = ["pull", "resolve"]
  [host."https://dockerhub.internal.com/v2".rewrite]
    "^rancher/(.*)" = "docker-internal/3rdparty/rancher/$1"

[host."https://qa-oci.iotcc.internal.com/v2"]
  capabilities = ["pull", "resolve"]
  [host."https://qa-oci.iotcc.internal.com/v2".rewrite]
    "^rancher/(.*)" = "docker-internal/3rdparty/rancher/$1"

From the pod images, my understanding is that the images were downloaded from the public internet and not from the private registry. All the Rancher images are uploaded to the internal private registry under dockerhub-master.internal.com/docker-internal/3rdparty/rancher. I expected the image to show up as dockerhub-master.internal.com/docker-internal/3rdparty/rancher/rancher-agent:v2.7.10, but it shows as docker.io. Is my understanding wrong? Is it pulling the image from the internal private registry?

For testing, I completely removed the rewrite directive to make it fail, and then got these warning messages:

Nov 08 18:01:21 kbc-001-226c65db.novalocal rancher-system-agent[1079]: time="2024-11-08T18:01:21Z" level=warning msg="Failed to get image from endpoint: GET https://dockerhub-master.internal.com/v2/rancher/system-agent-installer-rke2/manifests/v1.26.15-rke2r1: : Repository 'rancher' not found"
Nov 08 18:01:21 kbc-001-226c65db.novalocal rancher-system-agent[1079]: time="2024-11-08T18:01:21Z" level=warning msg="Failed to get image from endpoint: GET https://dockerhub.internal.com/v2/rancher/system-agent-installer-rke2/manifests/v1.26.15-rke2r1: : Repository 'rancher' not found"

However, it continued to download from docker.io and the cluster bootstrapped.

My use case is that I need to download images only internally, from the custom location `dockerhub-master.internal.com/docker-internal/3rdparty/rancher/*`

@brandond
Member

brandond commented Nov 8, 2024

I expected the image showing like this dockerhub-master.internal.com/docker-internal/3rdparty/rancher/rancher-agent:v2.7.10 but showing as docker.io. Is my understanding wrong? is it pulling image from internal private registry?

The K3s docs are a bit more comprehensive; not all of the content has been migrated over to the RKE2 docs yet:
https://docs.k3s.io/installation/private-registry#mirrors

Note that when using mirrors and rewrites, images will still be stored under the original name. For example, crictl image ls will show docker.io/rancher/mirrored-pause:3.6 as available on the node, even if the image was pulled from a mirror with a different name.

The image is still from docker.io. The fact that it was actually pulled from an internal mirror, instead of directly from upstream, does not change that.
