Adding readinessProbe and livenessProbe probes to porch #92

Open
wants to merge 1 commit into
base: main

Conversation

mansoor17syed

No description provided.

Contributor

looks good.

@mansoor17syed
Author

Hi Team,

@liamfallon @efiacor, could you review these changes and share your feedback?

Member

@liamfallon liamfallon left a comment

/approve

@nephio-prow nephio-prow bot added the approved label Jan 15, 2025
@mansoor17syed
Author

Hi Team,
Could you please share your feedback or advise on the next steps to move forward?

@kushnaidu
Contributor

/approve

Contributor

nephio-prow bot commented Jan 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Catalin-Stratulat-Ericsson, kushnaidu, liamfallon, mansoor17syed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liamfallon
Member

I wanted @efiacor to take a quick look.

Collaborator

Hi @mansoor17syed, thank you for adding some best-practice elements to the deployments.
The changes for the controllers should be fine as they are served by these - https://github.com/nephio-project/porch/blob/d7839690ca85fbb64669ae4158d5679c9ee29713/controllers/main.go#L174
but I am not 100% sure on the probes for the porch api server.

@@ -80,6 +80,28 @@ spec:
- --repo-sync-frequency=3m
- --disable-validating-admissions-policy=true
- --max-request-body-size=6291456 # Keep this in sync with function-runner's corresponding argument

#adding livenessProbes and readinessProbes for porch server
livenessProbe:
Collaborator

These are useful to meet best practices but may need some more investigation.
As the porch api server is a k8s aggregated api server, I am not 100% sure if we need to implement our own probe endpoints.
Also, the docs mention that /healthz is now deprecated and to use /livez instead.
https://kubernetes.io/docs/reference/using-api/health-checks/#individual-health-checks

Finally, unfortunately, the changes here are not "tested" on each PR commit, so any breaking change only gets picked up when it is deployed to the sandbox by the test-infra project. This is something we plan to address.
The configs here would need to be tailored to that env, and potentially monitored and updated if necessary.

Collaborator

We may need to add something to the server config to handle the checks, such as - https://github.com/kubernetes/apiserver/blob/master/pkg/server/healthz.go#L91

Somewhere here maybe - https://github.com/nephio-project/porch/blob/main/pkg/apiserver/apiserver.go#L275

Author

Hi @efiacor
Thanks for the review and the detailed insights!

Since Porch is a Kubernetes aggregated API server, I’m not entirely sure whether we need to implement our own probe endpoints. Do we have any precedent for this in similar projects?
I see that /healthz is deprecated in favor of /livez—should we update it in this PR, or would that require additional considerations?
Regarding testing, since changes here aren’t validated on each PR commit and are only caught when deployed to the sandbox environment, how do you suggest we proceed? Should we align these configs with the test-infra setup, and if so, how is that managed?

Would appreciate your thoughts on the best approach here!

Collaborator

I think to get us started, we need to verify if we do need to implement some form of endpoints on the server to serve the requests. @kispaljr any idea on this one?
For testing the new probes -> endpoints out, we can use the "local" deployment in porch on a kind cluster to verify the probes are actually working.

nephio-project/porch#166

./scripts/setup-dev-env.sh

make run-in-kind

Check the probes with a "describe" or check the server logs for incoming http requests.

The liveness probe in particular could be useful for monitoring the health of the API, as we are seeing some behaviour where the service becomes unavailable. The frequency of the probe would need to be tailored, though, to avoid spamming the API.
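
As a rough aid for choosing a probe frequency, the sketch below estimates the request load and the approximate detection delay from two Kubernetes probe fields, `periodSeconds` and `failureThreshold`. The `ProbeTiming` type and its methods are hypothetical helpers, the values are illustrative rather than taken from this PR, and the delay figure ignores `timeoutSeconds` and `initialDelaySeconds`.

```go
package main

import "fmt"

// ProbeTiming mirrors two fields of a Kubernetes probe spec
// (hypothetical helper type, not part of any Kubernetes library).
type ProbeTiming struct {
	PeriodSeconds    int // how often the kubelet runs the probe
	FailureThreshold int // consecutive failures before the kubelet acts
}

// RequestsPerHour is the number of probe requests one probe sends per hour.
func (p ProbeTiming) RequestsPerHour() int {
	return 3600 / p.PeriodSeconds
}

// SecondsToAct approximates how long the endpoint must keep failing
// before the failure threshold is reached.
func (p ProbeTiming) SecondsToAct() int {
	return p.PeriodSeconds * p.FailureThreshold
}

func main() {
	candidates := []ProbeTiming{
		{PeriodSeconds: 10, FailureThreshold: 3}, // Kubernetes defaults
		{PeriodSeconds: 30, FailureThreshold: 3}, // less frequent probing
	}
	for _, p := range candidates {
		fmt.Printf("period=%ds threshold=%d: %d requests/hour, ~%ds to act\n",
			p.PeriodSeconds, p.FailureThreshold, p.RequestsPerHour(), p.SecondsToAct())
	}
}
```

A longer period reduces the load on the API server but lengthens the time an unavailable server goes unnoticed, so the two numbers have to be weighed against each other for the target environment.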


@kispaljr kispaljr Feb 18, 2025

Hi @efiacor, sorry for the late reply. In my local environment

kubectl port-forward -n porch-system service/api 8443:api &
curl -v -k https://127.0.0.1:8443/healthz

shows that the checked URLs exist in the latest porch server and return 200 OK while the server is running. So from my perspective these probes seem to be correct.
Have you encountered any problems while trying them out?
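
The same manual check can be scripted. The sketch below is a rough Go equivalent of `curl -k`: it requests a path over HTTPS without verifying the certificate and reports the status code. `checkPath` and `newFakeServer` are hypothetical names, and the fake TLS server only stands in for the port-forwarded porch server so the example runs without a cluster. Against a real port-forward the base URL would be `https://127.0.0.1:8443`.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checkPath GETs baseURL+path and returns the HTTP status code.
// Like `curl -k`, it skips TLS certificate verification, which is
// acceptable only for local debugging.
func checkPath(baseURL, path string) (int, error) {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(baseURL + path)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// newFakeServer starts a local TLS server that answers /healthz and /livez.
// It is a stand-in for the real server (assumption for this example).
func newFakeServer() *httptest.Server {
	mux := http.NewServeMux()
	for _, p := range []string{"/healthz", "/livez"} {
		mux.HandleFunc(p, func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprint(w, "ok")
		})
	}
	return httptest.NewTLSServer(mux)
}

func main() {
	srv := newFakeServer()
	defer srv.Close()

	for _, p := range []string{"/healthz", "/livez"} {
		code, err := checkPath(srv.URL, p)
		if err != nil {
			fmt.Println(p, "error:", err)
			continue
		}
		fmt.Println(p, "status:", code)
	}
}
```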

curl -v -k https://127.0.0.1:8443/livez also works, we can change the liveness probe to it

Querying these paths doesn't seem to generate any logs in the apiserver logs either, so it seems to be working

I can see the readiness probe failing once in my env:

  Type     Reason     Age    From               Message
  ----     ------     ----   ----               -------
  Normal   Scheduled  8m48s  default-scheduler  Successfully assigned porch-system/porch-server-85bd69c887-4hnqm to porch-test-control-plane
  Normal   Pulled     8m48s  kubelet            Container image "porch-kind/porch-server:test" already present on machine
  Normal   Created    8m48s  kubelet            Created container porch-server
  Normal   Started    8m48s  kubelet            Started container porch-server
  Warning  Unhealthy  8m42s  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

but not after that, so this also seems fine

Collaborator

Ok cool. If you are happy that it actually queries the server, then go for it. I wasn't sure. I thought maybe the k8s API was handling the default responses.

6 participants