
rke2 cis profile 1.23, api server can't contact gatekeeper (TLS ERROR) #4910

Closed
zodd3131 opened this issue Oct 18, 2023 · 7 comments

zodd3131 commented Oct 18, 2023

Environmental Info:
RKE2 Version:

rke2 version v1.26.9+rke2r1
go version go1.20.8 X: boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux 5.15.0-86-generic #96-Ubuntu SMP x86_64

Cluster Configuration:

7 nodes: 3 masters (servers), 4 workers (agents)
Describe the bug:

I'm running RKE2 with profile: cis-1.23. The kube-apiserver can't contact the Gatekeeper controller due to a TLS handshake error.

Steps To Reproduce:

  • Installed RKE2:
    I installed RKE2 from the tarball.
    Here is the config file:
token: df8fbd2af7bdc04b4532 
cni:
- cilium
cloud-provider-name: rancher-vsphere
disable-kube-proxy: true
node-taint: "node-role.kubernetes.io/control-plane:NoSchedule"
profile: cis-1.23
write-kubeconfig: /root/.kube/config
write-kubeconfig-mode: 600

Expected behavior:

The kube-apiserver should contact the Gatekeeper webhook without error.

Actual behavior:

The API server fails calling the webhook (failing open) for validation.gatekeeper.sh.
I tried writing ingress and egress NetworkPolicies to allow all traffic into cattle-gatekeeper-system and all traffic out of kube-system.

Additional context / logs:
If I curl manually from the kube-m01 node, there is no TLS error.

kube-m01:/ # curl --resolve gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443:10.43.195.131 --cacert /tmp/ca.crt  https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc
404 page not found

logs from api server

Trace[1315858870]: ["Call mutating webhook" configuration:gatekeeper-mutating-webhook-configuration,webhook:mutation.gatekeeper.sh,resource:coordination.k8s.io/v1, Resource=leases,subresource:,operation:UPDATE,UID:4383ee1c-ccfa-43d0-b3b0-9832d316b3e0 4003ms (20:26:51.738)]
Trace[1315858870]: ["Call validating webhook" configuration:gatekeeper-validating-webhook-configuration,webhook:validation.gatekeeper.sh,resource:coordination.k8s.io/v1, Resource=leases,subresource:,operation:UPDATE,UID:f3b0bc36-0b3b-496d-88f8-6586765a609d 3001ms (20:26:52.740)]
Trace[1315858870]: [4.003745965s] [4.003745965s] END
W1018 20:26:55.988976       1 dispatcher.go:184] Failed calling webhook, failing open validation.gatekeeper.sh: failed calling webhook "validation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443/v1/admit?timeout=3s": context deadline exceeded
E1018 20:26:55.989015       1 dispatcher.go:192] failed calling webhook "validation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443/v1/admit?timeout=3s": context deadline exceeded
Trace[999763810]: ["Call mutating webhook" configuration:gatekeeper-mutating-webhook-configuration,webhook:mutation.gatekeeper.sh,resource:coordination.k8s.io/v1, Resource=leases,subresource:,operation:UPDATE,UID:562c7e1f-50f9-402a-bea6-e12065889cc5 4002ms (20:26:51.987)]
Trace[999763810]: ["Call validating webhook" configuration:gatekeeper-validating-webhook-configuration,webhook:validation.gatekeeper.sh,resource:coordination.k8s.io/v1, Resource=leases,subresource:,operation:UPDATE,UID:9cd032a9-adc4-492d-ba03-6d0c4e4c9e39 3002ms (20:26:52.988)]
Trace[999763810]: [4.003395529s] [4.003395529s] END
W1018 20:26:55.990775       1 dispatcher.go:184] Failed calling webhook, failing open validation.gatekeeper.sh: failed calling webhook "validation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443/v1/admit?timeout=3s": context deadline exceeded
E1018 20:26:55.990926       1 dispatcher.go:192] failed calling webhook "validation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443/v1/admit?timeout=3s": context deadline exceeded
Trace[444200807]: ["Call mutating webhook" configuration:gatekeeper-mutating-webhook-configuration,webhook:mutation.gatekeeper.sh,resource:coordination.k8s.io/v1, Resource=leases,subresource:,operation:UPDATE,UID:c19f1c7f-0001-4659-bd72-0d69ee604898 4004ms (20:26:51.987)]
Trace[444200807]: ["Call validating webhook" configuration:gatekeeper-validating-webhook-configuration,webhook:validation.gatekeeper.sh,resource:coordination.k8s.io/v1, Resource=leases,subresource:,operation:UPDATE,UID:50aa25b0-9c1a-4f20-8c4b-55b83537d474 3001ms (20:26:52.990)]
Trace[444200807]: [4.004576594s] [4.004576594s] END
I1018 20:26:57.045776       1 trace.go:219] Trace[10936608]: "Update" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:9da0077e-ec8d-4810-8d3b-57bb6ce09bcf,client:127.0.0.1,protocol:HTTP/2.0,resource:leases,scope:resource,url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/kube-m01,user-agent:kubelet/v1.26.9+rke2r1 (linux/amd64) kubernetes/d1483fd,verb:PUT (18-Oct-2023 20:26:54.046) (total time: 2999ms):
Trace[10936608]: ["GuaranteedUpdate etcd3" audit-id:9da0077e-ec8d-4810-8d3b-57bb6ce09bcf,key:/leases/kube-node-lease/kube-m01,type:*coordination.Lease,resource:leases.coordination.k8s.io 2998ms (20:26:54.046)
Trace[10936608]:  ---"About to Encode" 2997ms (20:26:57.044)]
Trace[10936608]: ["Call mutating webhook" configuration:gatekeeper-mutating-webhook-configuration,webhook:mutation.gatekeeper.sh,resource:coordination.k8s.io/v1, Resource=leases,subresource:,operation:UPDATE,UID:c86948d6-c40b-4c96-a486-5e15255fe400 2998ms (20:26:54.047)]
Trace[10936608]: ["Call validating webhook" configuration:gatekeeper-validating-webhook-configuration,webhook:validation.gatekeeper.sh,resource:coordination.k8s.io/v1, Resource=leases,subresource:,operation:UPDATE,UID:668dd983-ab98-471c-a3fe-2d66ed1912f0 1997ms (20:26:55.048)
Trace[10936608]:  ---"Request completed" 1995ms (20:26:57.044)]
Trace[10936608]: [2.999040871s] [2.999040871s] END

logs from gatekeeper controller

2023/10/18 20:28:24 http: TLS handshake error from 10.1.1.11:54702: EOF
2023/10/18 20:28:26 http: TLS handshake error from 10.1.1.11:54704: EOF
2023/10/18 20:28:28 http: TLS handshake error from 10.1.1.11:54706: EOF
2023/10/18 20:28:32 http: TLS handshake error from 10.1.1.11:49702: EOF
2023/10/18 20:28:34 http: TLS handshake error from 10.1.1.11:49714: EOF
2023/10/18 20:28:34 http: TLS handshake error from 10.1.1.11:49720: EOF
2023/10/18 20:28:34 http: TLS handshake error from 10.1.1.11:49734: EOF
2023/10/18 20:28:36 http: TLS handshake error from 10.1.1.11:49738: EOF
2023/10/18 20:28:36 http: TLS handshake error from 10.1.1.12:51786: EOF
2023/10/18 20:28:36 http: TLS handshake error from 10.1.1.12:51790: EOF
2023/10/18 20:28:36 http: TLS handshake error from 10.1.1.11:49752: EOF
2023/10/18 20:28:38 http: TLS handshake error from 10.1.1.11:49766: EOF
2023/10/18 20:28:42 http: TLS handshake error from 10.1.1.11:33352: EOF
brandond (Member) commented Oct 18, 2023

W1018 20:26:55.990775       1 dispatcher.go:184] Failed calling webhook, failing open validation.gatekeeper.sh: failed calling webhook "validation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443/v1/admit?timeout=3s": context deadline exceeded
E1018 20:26:55.990926       1 dispatcher.go:192] failed calling webhook "validation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443/v1/admit?timeout=3s": context deadline exceeded

So the apiserver is calling the webhook with a 3 second timeout, which is being exceeded.

2023/10/18 20:28:36 http: TLS handshake error from 10.1.1.12:51786: EOF
2023/10/18 20:28:36 http: TLS handshake error from 10.1.1.12:51790: EOF
2023/10/18 20:28:36 http: TLS handshake error from 10.1.1.11:49752: EOF
2023/10/18 20:28:38 http: TLS handshake error from 10.1.1.11:49766: EOF

This pod sees incoming connections being closed; the handshake errors would appear to be because the apiserver is terminating the request when it does not complete after 3 seconds.

You said this is the "gatekeeper controller", is this the pod that backs the gatekeeper-webhook-service.cattle-gatekeeper-system.svc service? You'd probably need to turn up the gatekeeper pod log level to figure out why it's not able to service requests within the allocated time.

zodd3131 (Author) commented
Hello brandond,
Thank you for your reply.
I increased the log level to DEBUG for the controller that backs the gatekeeper-webhook-service.cattle-gatekeeper-system.svc service.
Since the error occurs at the session/TLS layer, the logs are no more verbose; I think the error is logged directly from the socket.

We are behind a proxy. The rke2-server and rke2-agent env files look the same:

CONTAINERD_HTTP_PROXY=http://10.1.1.254:3128
CONTAINERD_HTTPS_PROXY=http://10.1.1.254:3128
CONTAINERD_NO_PROXY=.my-domain.fr,127.0.0.0/8,localhost

There is no mention of .svc or .cluster.local in CONTAINERD_NO_PROXY.
Can the proxy environment variables affect the apiserver's behaviour? I saw the issues opened by andreas previously.
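For reference, a sketch of what the env file could look like with the un-prefixed variables instead (hypothetical values based on the settings above; per the RKE2 proxy docs, without the CONTAINERD_ prefix the settings are also passed to the RKE2-managed processes, and RKE2 appends cluster-internal entries such as the service domain and pod/service CIDRs to NO_PROXY):

```shell
# /etc/default/rke2-server -- sketch; proxy address and domains taken from this cluster
# Un-prefixed variables apply to RKE2's own processes, not just containerd;
# RKE2 then appends cluster-internal ranges to NO_PROXY automatically.
HTTP_PROXY=http://10.1.1.254:3128
HTTPS_PROXY=http://10.1.1.254:3128
NO_PROXY=.my-domain.fr,127.0.0.0/8,localhost
```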

logs for controller with DEBUG verbosity:

2023-10-19T16:04:51.856Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"capabilities"}}
2023-10-19T16:04:51.860Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"non-root"}}
2023-10-19T16:04:51.863Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"capabilities"}}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	resolved pre-observations	{"gvk": "mutations.gatekeeper.sh/v1, Kind=Assign", "count": 0}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	readiness state	{"gvk": "mutations.gatekeeper.sh/v1, Kind=Assign", "satisfied": "2/2"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"gvk": "mutations.gatekeeper.sh/v1, Kind=Assign"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	resolved pre-observations	{"gvk": "mutations.gatekeeper.sh/v1, Kind=ModifySet", "count": 0}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	readiness state	{"gvk": "mutations.gatekeeper.sh/v1, Kind=ModifySet", "satisfied": "0/0"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"gvk": "mutations.gatekeeper.sh/v1, Kind=ModifySet"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	resolved pre-observations	{"gvk": "mutations.gatekeeper.sh/v1alpha1, Kind=AssignImage", "count": 0}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	readiness state	{"gvk": "mutations.gatekeeper.sh/v1alpha1, Kind=AssignImage", "satisfied": "0/0"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"gvk": "mutations.gatekeeper.sh/v1alpha1, Kind=AssignImage"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"tracker": "assignMetadata"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"tracker": "assign"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"tracker": "modifySet"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"tracker": "assignImage"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	resolved pre-observations	{"gvk": "externaldata.gatekeeper.sh/v1beta1, Kind=Provider", "count": 0}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	readiness state	{"gvk": "externaldata.gatekeeper.sh/v1beta1, Kind=Provider", "satisfied": "0/0"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"gvk": "externaldata.gatekeeper.sh/v1beta1, Kind=Provider"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"tracker": "provider"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	resolved pre-observations	{"gvk": "expansion.gatekeeper.sh/v1alpha1, Kind=ExpansionTemplate", "count": 0}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	readiness state	{"gvk": "expansion.gatekeeper.sh/v1alpha1, Kind=ExpansionTemplate", "satisfied": "0/0"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"gvk": "expansion.gatekeeper.sh/v1alpha1, Kind=ExpansionTemplate"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"tracker": "expansiontemplates"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	resolved pre-observations	{"gvk": "templates.gatekeeper.sh/v1beta1, Kind=ConstraintTemplate", "count": 0}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	readiness state	{"gvk": "templates.gatekeeper.sh/v1beta1, Kind=ConstraintTemplate", "satisfied": "0/0"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"gvk": "templates.gatekeeper.sh/v1beta1, Kind=ConstraintTemplate"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"tracker": "constraints"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	resolved pre-observations	{"gvk": "config.gatekeeper.sh/v1alpha1, Kind=Config", "count": 0}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	readiness state	{"gvk": "config.gatekeeper.sh/v1alpha1, Kind=Config", "satisfied": "0/0"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"gvk": "config.gatekeeper.sh/v1alpha1, Kind=Config"}
2023-10-19T16:04:52.651Z	debug	readiness-tracker	all expectations satisfied	{"tracker": "data"}
2023-10-19T16:04:52.651Z	info	readiness-tracker	readiness satisfied, no further collection
2023-10-19T16:05:00.347Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"non-root"}}
2023-10-19T16:05:00.352Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"capabilities"}}
2023/10/19 16:05:01 http: TLS handshake error from 10.1.1.11:36982: EOF
2023-10-19T16:05:02.700Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"non-root"}}
2023-10-19T16:05:02.713Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"capabilities"}}
2023/10/19 16:05:03 http: TLS handshake error from 10.1.1.13:56428: EOF
2023/10/19 16:05:05 http: TLS handshake error from 10.1.1.11:42808: EOF
2023-10-19T16:05:10.951Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"capabilities"}}
2023-10-19T16:05:10.956Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"non-root"}}
2023-10-19T16:05:12.295Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"non-root"}}
2023-10-19T16:05:12.301Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"capabilities"}}
2023-10-19T16:05:12.954Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"non-root"}}
2023-10-19T16:05:12.959Z	info	controller	Reconcile	{"process": "assign_controller", "request": {"name":"capabilities"}}
2023/10/19 16:05:18 http: TLS handshake error from 10.1.1.11:40190: EOF
2023/10/19 16:05:28 http: TLS handshake error from 10.1.1.11:60902: EOF
2023/10/19 16:05:30 http: TLS handshake error from 10.1.1.12:43520: EOF
2023/10/19 16:05:32 http: TLS handshake error from 10.1.1.11:60912: EOF
2023/10/19 16:05:34 http: TLS handshake error from 10.1.1.12:43524: EOF
2023/10/19 16:05:36 http: TLS handshake error from 10.1.1.11:38082: EOF
2023/10/19 16:05:40 http: TLS handshake error from 10.1.1.11:38084: EOF
2023/10/19 16:05:42 http: TLS handshake error from 10.1.1.11:38098: EOF
2023/10/19 16:05:42 http: TLS handshake error from 10.1.1.12:54982: EOF
2023/10/19 16:05:48 http: TLS handshake error from 10.1.1.12:55862: EOF
2023/10/19 16:05:58 http: TLS handshake error from 10.1.1.11:39182: EOF

brandond (Member) commented
I see that you have disable-kube-proxy: true in your config, and are using Cilium. Can you confirm that you have the Cilium kube-proxy replacement enabled, and that it is working? Can you show the proxy settings in use for the apiserver pod, if any?

zodd3131 (Author) commented
I disabled kube-proxy to benefit from Cilium DSR (direct server return). We have been using that feature for a while; Cilium and DSR have worked great so far.

Here is the HelmChartConfig:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    tunnel: disabled
    ipv4NativeRoutingCIDR: "10.42.0.0/16"
    masquerade: true
    ipam.mode: kubernetes 
    autoDirectNodeRoutes: true
    kubeProxyReplacement: strict
    loadBalancer.mode: hybrid  
    k8sServiceHost: api.rke2.my-domain.fr
    k8sServicePort: "6443"

ENV from api server:

Environment:
      FILE_HASH:               be827140e78a18a07178ee03b0829fcce0febc4649c9b5f97ad8b49b350dcb54
      CONTAINERD_HTTP_PROXY:   http://10.1.1.254:3128
      CONTAINERD_HTTPS_PROXY:  http://10.1.1.254:3128
      CONTAINERD_NO_PROXY:     .my-domain.fr,127.0.0.0/8,localhost
      NO_PROXY:                .svc,.dev.rke,10.42.0.0/16,10.43.0.0/16

brandond (Member) commented Oct 20, 2023

I tried to write ingress and egress netpol to open everything coming to cattle-gatekeeper-system and allow everything to go out from kube-system.

Can you confirm that you can curl the endpoint from within the apiserver pod?

kubectl exec -it -n kube-system kube-apiserver-NODENAME -- curl -vks --resolve gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443:10.43.195.131 https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc

If you can't then I suspect your problems are still related to network policy.

zodd3131 (Author) commented
I'm coming back to you several days later, having figured out what was going wrong.
First of all, I created NetworkPolicies to allow ingress and egress for all the namespaces concerned, to make sure network policy was not the issue.

I was still having the same issues between the apiserver and the Gatekeeper controller.

Instead of trying to make a spec mutation work, I tried a metadata mutation based on an example from the Gatekeeper docs, with profile cis-1.23 disabled: IT WORKED.
Then I re-enabled the CIS profile, and it was still OK.
I tested both with CONTAINERD_HTTP_PROXY and with HTTP_PROXY; it was OK again (.svc is automatically appended to the NO_PROXY string).
I looked at the logs from the apiserver and the controller, and the errors were still present:
apiserver: Failed calling webhook, failing open validation.gatekeeper.sh: failed calling webhook "validation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.cattle-gatekeeper-system.svc:443/v1/admit?timeout=3s": context deadline exceeded
controller: TLS handshake error from 10.1.1.11:38082: EOF

The apiserver error came from a MutatingWebhookConfiguration that was still present after I removed OPA Gatekeeper; I figured out later that the error disappeared once the controller was reachable.
The controller error is still present. It seems the apiserver performs some sort of health check, but it does not affect the mutating webhook's operation.

What made me lose a lot of time was that I had written a bad Assign resource.

apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: no-privileged-container
spec:
  applyTo:
    - groups: ["apps"]
      kinds:
        - Deployment
        - StatefulSet
        - DaemonSet
      versions: ["v1"]
  match:
    scope: Cluster
    excludedNamespaces:
      - kube-system
      - cattle-system
      - cis-operator-system
      - cattle-gatekeeper-system
      - metallb-system
  location: "spec.template.spec.containers[name:*].securityContext.allowPrivilegeEscalation"
  parameters:
    assign:
      value: false

Changing the scope to Namespaced fixed it:

match:
    scope: Namespaced
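Put together, the corrected resource (identical to the one above, with only the match scope changed, since Deployment/StatefulSet/DaemonSet are namespaced resources):

```yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: no-privileged-container
spec:
  applyTo:
    - groups: ["apps"]
      kinds:
        - Deployment
        - StatefulSet
        - DaemonSet
      versions: ["v1"]
  match:
    scope: Namespaced   # was Cluster, which never matched these namespaced kinds
    excludedNamespaces:
      - kube-system
      - cattle-system
      - cis-operator-system
      - cattle-gatekeeper-system
      - metallb-system
  location: "spec.template.spec.containers[name:*].securityContext.allowPrivilegeEscalation"
  parameters:
    assign:
      value: false
```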

If I had written the Assign resource correctly in the first place, I wouldn't have questioned the CIS profile, the proxy env, or the NetworkPolicies.
Here is a link to the related Gatekeeper issue:
open-policy-agent/gatekeeper#2142

brandond (Member) commented
Thanks for the analysis! Since there doesn't appear to be anything wrong with RKE2, I'll close this out.
