Proposal: Support Readiness Healthchecks #1094

ameowlia · 2023-06-09T21:13:44Z

✨A big thank you to @mariash who has done great work to map out this problem and solution ✨

Problem

With the current implementation of application healthchecks, when the application healthcheck detects that an app instance (AI) is unhealthy, then Diego will stop the AI, delete the AI, and reschedule a new AI.

This is too aggressive for some users. There could be many reasons why a single request fails while the next passes just fine, so killing an instance after a single failed healthcheck is seen as too aggressive. Additionally, many applications have a warm-up period during which they are not ready to receive requests until they populate their cache. Cloud Foundry currently cannot receive a signal that the app is running correctly but not yet ready to serve requests.

Proposed Solution Summary

We intend to support readiness healthchecks. (This was requested previously in this issue.) This would be an additional healthcheck that app developers could configure. When the readiness healthcheck passes, the app is marked "ready" and the app will be routable. When the readiness healthcheck fails, the app is marked as "not ready" and its route will be removed from gorouter's route table. This new readiness healthcheck will give users a healthcheck option that is less drastic than the current option.

Architecture overview

This feature will require changes in the following releases:

CF CLI
Cloud Controller
Diego
Routing

The Cloud Controller will store this new data before passing it on to the BBS as part of the desired LRP.
The Diego executor will see these new readiness healthchecks on the desired LRP and will run the healthchecker binary in the app container with the configuration provided.
When the readiness healthcheck succeeds, the container will be marked as "ready". When the readiness healthcheck fails, the container will be marked as "not ready".
When the route emitter gets route information, it will check whether the AI is ready or not ready. It will emit registration or unregistration messages as appropriate for the gorouter to consume.
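The flow above can be sketched as a small decision function. This is only an illustration of the described behavior, not the route emitter's actual code: `InstanceState` and `routing_action` are hypothetical names, and the sketch reacts only to state transitions, which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceState:
    """Hypothetical view of one app instance (AI) as seen by the route emitter."""
    instance_id: str
    running: bool  # the container is up (existing healthcheck)
    ready: bool    # the latest readiness healthcheck passed (new)


def routing_action(previous: Optional[InstanceState],
                   current: InstanceState) -> Optional[str]:
    """Return "register", "unregister", or None when nothing changes."""
    was_routable = previous is not None and previous.running and previous.ready
    is_routable = current.running and current.ready
    if is_routable and not was_routable:
        return "register"
    if was_routable and not is_routable:
        return "unregister"
    return None
```

The point of the sketch is that a "not ready" instance keeps running: only its route is withdrawn, and it is registered again once the readiness healthcheck passes.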

CC Design

Users will be able to set the healthcheck via the app manifest.

applications:
- name: test-app
  processes:
  - type: web
    health-check-http-endpoint: /health
    health-check-invocation-timeout: 2
    health-check-type: http
    timeout: 80
    readiness-health-check-http-endpoint: /health      # 👈 new property
    readiness-health-check-invocation-timeout: 2       # 👈 new property
    readiness-health-check-type: http                  # 👈 new property
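The manifest above points both checks at /health. An app with a warm-up period might instead expose two endpoints, one for each check. The following is a minimal sketch of that idea using only the Python standard library. The `/ready` path, `cache_warmed`, `status_for`, and `HealthHandler` are hypothetical names chosen for illustration, and the sketch assumes a 200 response counts as a passing HTTP check.

```python
import threading
from http.server import BaseHTTPRequestHandler

# Set by the app once its (hypothetical) cache warm-up has finished.
cache_warmed = threading.Event()


def status_for(path: str, warmed: bool) -> int:
    """Map a request path to the HTTP status code the app returns."""
    if path == "/health":
        return 200                     # liveness: the process is up
    if path == "/ready":
        return 200 if warmed else 503  # readiness: only after warm-up
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    """Answers healthcheck requests; a real app would mount this on its web server."""

    def do_GET(self):
        self.send_response(status_for(self.path, cache_warmed.is_set()))
        self.end_headers()
```

With such an app, the manifest could keep `health-check-http-endpoint: /health` and set `readiness-health-check-http-endpoint: /ready`, so a slow warm-up delays routing without causing a restart.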

LRP Design

The readiness healthcheck data will be part of the desired LRP object:

"check_definition": {
    "checks": [
      {
        "http_check": {
          "port": 8080,
          "path": "/health",
          "request_timeout_ms": 10000
        }
      }
    ],
    "readiness_checks": [                    # 👈 new property
      {
        "tcp_check": {
          "port": 8080,
          "connect_timeout_ms": 10000
        }
      }
    ],
    "log_source": ""
  },
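To make the relationship between the manifest properties and the LRP fields concrete, here is a rough sketch of one possible translation. It is not Cloud Controller code. The function name `readiness_check`, the default port of 8080, the fallback timeout of 1 second, the `port` check type, and the seconds-to-milliseconds conversion are all assumptions made for illustration.

```python
def readiness_check(process: dict, port: int = 8080) -> dict:
    """Build one "readiness_checks" entry from manifest-style process properties."""
    kind = process["readiness-health-check-type"]
    # Assumption: the manifest timeout is in seconds and the LRP field is in ms.
    timeout_s = process.get("readiness-health-check-invocation-timeout", 1)
    timeout_ms = int(timeout_s) * 1000
    if kind == "http":
        return {"http_check": {
            "port": port,
            "path": process.get("readiness-health-check-http-endpoint", "/"),
            "request_timeout_ms": timeout_ms,
        }}
    if kind == "port":  # assumed to mirror the existing "port" healthcheck type
        return {"tcp_check": {"port": port, "connect_timeout_ms": timeout_ms}}
    raise ValueError(f"unsupported readiness check type: {kind}")
```

Passing the `web` process from the manifest example would yield an `http_check` entry for `/health` under these assumptions.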

👉 This work is ongoing. All comments and concerns from the community are welcome. Either add a comment here or reach out on Slack in #wg-app-runtime-platform.

ameowlia · 2023-06-09T21:14:39Z

❓ Why did I open this in cf-deployment when likely nothing will have to happen in this repo?
Because this work will cross many repo boundaries and this seems like the best place to document a large change 😄

domdom82 · 2023-06-12T08:35:47Z

@ameowlia Thanks for this, I like it a lot!
So, I wondered what will happen with the original health check then. Will it still stop and reschedule the AI after one failed check? Assuming yes, this probably means that users need to map their original health checks to a readiness check now ("I can't accept requests just yet.") and add separate logic for the regular health check to bind to ("I am technically alive.").

Otherwise the old logic will stay in place and there won't be any real benefit as the app is still deemed unhealthy and is removed / rescheduled as before.

beyhan · 2023-06-12T13:07:31Z

@ameowlia @mariash thanks for the great proposal. It definitely makes sense, but I haven't had time to look into it for feedback yet. I will definitely provide some. @ameowlia I think that this should be an RFC like cloudfoundry/community#591 because, as you mentioned, it will have an impact on multiple WGs and we need to have technical discussions and decisions as described in the RFC process.

ameowlia · 2023-06-12T13:14:11Z

Thanks for this comment @domdom82,

Your understanding of the situation is correct. With this proposal the original health check will stay the same; it will still stop and reschedule the AI after one failed check[1]. We will definitely add docs to make sure that users understand how to use this feature to best take advantage of its capabilities. And we can also send the docs to you for review (once they exist).

[1] We have considered adding some more configuration here. For example, "if my app fails 2 out of 5 times, then it should fail, but not if it fails 1 out of 5 times". However, we decided that was a different track of work and we are not committing to it at this time.
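For illustration only, the "2 out of 5" idea in the footnote could look like the sliding-window sketch below. This is not part of the proposal and does not describe existing Diego behavior; `FailureWindow` and its parameters are hypothetical names.

```python
from collections import deque


class FailureWindow:
    """Track the most recent check results and flag too many failures."""

    def __init__(self, max_failures: int = 2, window: int = 5):
        self.max_failures = max_failures
        self.results = deque(maxlen=window)  # oldest result drops off automatically

    def record(self, passed: bool) -> bool:
        """Record one result; return True if the instance should count as failed."""
        self.results.append(passed)
        failures = sum(1 for ok in self.results if not ok)
        return failures >= self.max_failures
```

With the defaults, a single failed check among the last five results is tolerated, while a second failure within the same window marks the instance as failed.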

PlamenDoychev · 2023-06-13T17:10:14Z

Hi folks, I really like the idea. We even have a direct stakeholder for this feature. Several years ago we tried to implement something on our own, but now I am really happy to see that there is a community effort in place.

philippthun · 2023-06-14T15:02:25Z

Would this have an effect on the state of the process stats object?

ameowlia · 2023-06-14T15:03:58Z

@philippthun - We plan on making a new property on that object (name TBD). So it will not affect the state property. If you have any concerns around this, we would love to hear them!

beyhan · 2023-06-15T11:38:57Z

To my understanding, this will also help applications whose instances execute CPU-intensive tasks from time to time, during which request processing on those instances could be blocked. Those types of applications configure non-HTTP healthcheck types so that they do not get killed because of failed healthchecks. Now, with the readiness healthchecks, those instances can be taken out of request dispatching by the gorouter when they fail the readiness healthcheck because of the CPU-intensive task, which will improve the overall end-user experience. Is my understanding here correct, or am I missing something?

Otherwise the proposed change is backwards compatible from CF users' perspective and doesn't look to be an invasive change in CF itself which I really like.

P.S. My previous comment to make this an RFC is still valid :-) and if you decide to go for an RFC I can help if needed.

JuergenSu · 2023-06-21T10:14:53Z

Just an idea: how about emitting a standard metric with the number of ready instances for an app? This metric could be used for autoscaling or manual scaling decisions, and also for monitoring and alerting, as it might happen that you have 10 instances but all of them are overloaded and not ready…

marcohelmerich · 2023-06-22T08:38:22Z

Basically, we have been missing this feature for as long as we have been using Cloud Foundry :-) It would help us in many use cases, from "not-so-cloud-native" applications that need time for cache warming or other data-transformation/warming tasks to avoiding consecutive app crashes in overload situations.

ameowlia · 2023-06-26T14:30:12Z

Thank you all for your valuable feedback!

Closing this issue in favor of this RFC, per @beyhan's suggestion.

ameowlia closed this as completed Jun 26, 2023

beyhan mentioned this issue Jun 27, 2023

[RFC 630] add readiness healthchecks for apps cloudfoundry/community#630

Merged
