Adding readinessProbe and livenessProbe probes to porch #92
looks good.
Hi Team, @liamfallon @efiacor can you review these changes?
/approve
Hi Team,
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Catalin-Stratulat-Ericsson, kushnaidu, liamfallon, mansoor17syed
I wanted @efiacor to take a quick look.
Hi @mansoor17syed, thank you for adding some best-practice elements to the deployments.
The changes for the controllers should be fine as they are served by these - https://github.com/nephio-project/porch/blob/d7839690ca85fbb64669ae4158d5679c9ee29713/controllers/main.go#L174
but I am not 100% sure on the probes for the porch api server.
@@ -80,6 +80,28 @@ spec:
    - --repo-sync-frequency=3m
    - --disable-validating-admissions-policy=true
    - --max-request-body-size=6291456 # Keep this in sync with function-runner's corresponding argument

    #adding livenessProbes and readinessProbes for porch server
    livenessProbe:
These are useful to meet best practices but may need some more investigation.
As the porch api server is a k8s aggregated api server, I am not 100% sure if we need to implement our own probe endpoints.
Also, the docs mention that /healthz is now deprecated and to use /livez instead.
https://kubernetes.io/docs/reference/using-api/health-checks/#individual-health-checks
Finally, unfortunately, the changes here are not "tested" on each PR commit, so any breaking change only gets picked up when it gets deployed to the sandbox by the test-infra project. This is something we plan to address.
The configs here would need to be tailored to that env, and potentially monitored and updated if necessary.
We may need to add something to the server config to handle the checks, such as - https://github.com/kubernetes/apiserver/blob/master/pkg/server/healthz.go#L91
Somewhere here maybe - https://github.com/nephio-project/porch/blob/main/pkg/apiserver/apiserver.go#L275
Hi @efiacor
Thanks for the review and the detailed insights!
Since Porch is a Kubernetes aggregated API server, I’m not entirely sure whether we need to implement our own probe endpoints. Do we have any precedent for this in similar projects?
I see that /healthz is deprecated in favor of /livez. Should we update it in this PR, or would that require additional considerations?
Regarding testing, since changes here aren’t validated on each PR commit and are only caught when deployed to the sandbox environment, how do you suggest we proceed? Should we align these configs with the test-infra setup, and if so, how is that managed?
Would appreciate your thoughts on the best approach here!
I think to get us started, we need to verify if we do need to implement some form of endpoints on the server to serve the requests. @kispaljr any idea on this one?
To test the new probes and their endpoints, we can use the "local" deployment of porch on a kind cluster to verify that the probes are actually working.
./scripts/setup-dev-env.sh
make run-in-kind
Check the probes with a "describe" or check the server logs for incoming http requests.
The liveness probe in particular could be useful to monitor the health of the api, as we are seeing some behaviour where the service becomes unavailable. The frequency of the probe would need to be tailored, though, to avoid spamming the api.
Hi @efiacor, sorry for the late reply. In my local environment
kubectl port-forward -n porch-system service/api 8443:api &
curl -v -k https://127.0.0.1:8443/healthz
shows that the probed URLs exist in the latest porch server and return 200 OK when the server is actually running. So from my perspective these probes seem to be correct.
Have you encountered any problems while trying them out?
curl -v -k https://127.0.0.1:8443/livez
also works, so we can switch the liveness probe to it.
Querying these paths doesn't seem to produce any entries in the apiserver logs either, so it seems to be working.
I can see the readiness probe failing once in my env:
Type     Reason     Age    From               Message
----     ------     ----   ----               -------
Normal   Scheduled  8m48s  default-scheduler  Successfully assigned porch-system/porch-server-85bd69c887-4hnqm to porch-test-control-plane
Normal   Pulled     8m48s  kubelet            Container image "porch-kind/porch-server:test" already present on machine
Normal   Created    8m48s  kubelet            Created container porch-server
Normal   Started    8m48s  kubelet            Started container porch-server
Warning  Unhealthy  8m42s  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500
but not after that, so this also seems fine
Ok cool. If you are happy that it actually queries the server, then go for it. I wasn't sure. I thought maybe the k8s api was handling the default responses.