[RFC 630] add readiness healthchecks for apps #630
In general I really like the proposal. I added some minor comments and questions I have.
Currently diego replaces the envoy certificate when the application instance is going down. As a result, gorouter fails to connect to envoy because of the wrong certificate during the window in which the app is already known to be unavailable but the de-registration has not yet been propagated to gorouter. The same mechanism could be used for readiness checks to prevent requests from being sent to the application in the small timeframe between the readiness check failing and the de-registration being propagated to gorouter.
Co-authored-by: Maximilian Moehl <[email protected]>
Co-authored-by: Alexander Lais <[email protected]>
We just had a discussion about a customer scenario that might be interesting for this RFC. The readiness probe is a scheduled poll, like the health check (e.g. every n seconds). When the readiness state changes, the cascade of route updates, etc. takes place. What we would like to add is immediate control from the app over the readiness state checker, so the app can say: "My app is not ready, right now!". The other direction works too, of course, and could save some time, but from our point of view the case where an app becomes not ready is much more important. It would allow app developers to protect their app from an overload scenario. Consequences or requests for the "immediate" part of the state change:
What do you think about this, and do you think it could still fit within the scope of this RFC?
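To make the overload scenario concrete, here is a minimal sketch of the polled side only: an app that answers the readiness check with 503 once it has too many requests in flight. The `ReadinessGate` class, its method names, and the in-flight limit are hypothetical and not part of the RFC; the "immediate" signal asked for above has no defined API, so it is not shown.

```python
import threading


class ReadinessGate:
    """Hypothetical helper: reports not-ready while the app is overloaded."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def begin_request(self):
        # Call when the app starts handling a request.
        with self._lock:
            self._in_flight += 1

    def end_request(self):
        # Call when the app finishes handling a request.
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)

    def status_code(self):
        # What a readiness endpoint would return on the next poll:
        # 200 while below the limit, 503 once the limit is reached.
        with self._lock:
            return 200 if self._in_flight < self.max_in_flight else 503
```

With polling alone, the platform only sees the 503 at the next scheduled check, which is exactly the delay this comment would like to avoid.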
@beyhan Can this be resolved by lowering the poll interval?
@mariash at the moment the interval is 30s, which is also documented here. What options do we have to lower the interval? It could help but won't be perfect, because there is still a gap, as long as the poll interval, in which the app instance could receive requests. @ameowlia for completeness, could you please add the concerns you expressed during the TOC meeting yesterday regarding the suggestion to replace the certificates?
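As a rough back-of-the-envelope sketch of that gap: an instance that becomes not ready just after a successful poll is only detected at the next poll, and then the route change still has to propagate. The function name and the 1s propagation delay below are assumptions for illustration, and the check timeout is ignored.

```python
def worst_case_exposure_s(poll_interval_s, propagation_s):
    """Upper bound on how long a not-ready instance may still receive requests."""
    if poll_interval_s < 0 or propagation_s < 0:
        raise ValueError("durations must be non-negative")
    # Detection can lag by up to one full poll interval; the route
    # de-registration then takes propagation_s to reach the router.
    return poll_interval_s + propagation_s


# Compare the current 30s interval with shorter, hypothetical intervals.
for interval in (30, 10, 2):
    exposure = worst_case_exposure_s(interval, propagation_s=1)
    print(f"interval={interval}s: up to {exposure}s of exposure")
```

Lowering the interval shrinks the window but never removes it, which matches the point above.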
@peanball, I really like the idea of the immediate state change; however, I have some concerns about doing it via the envoy cert because of the c2c implications.

Failure scenario

Here is the scenario I am imagining:
Technical Details

Internal routes are resolved via bosh dns + service discovery controller. They are not proxied through anything like gorouter that can add automatic retries. Any retry logic would have to be added to the app.

Future regarding this RFC

I suggest this RFC continue on without the requirement for an immediate state change. I think we should release what we have currently proposed as a v1 and get feedback on it. In the meantime, we should continue to think about how to get this immediate state change without negative consequences for c2c with tls.
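For illustration, the retry logic that a c2c client app would have to carry itself (as noted under Technical Details) might look roughly like the sketch below. `call_with_retries` and its parameters are hypothetical; `request_fn` stands for whatever the app uses to call the internal route, ideally re-resolving the route on each attempt so a different instance can be picked.

```python
import time


def call_with_retries(request_fn, attempts=3, delay_s=0.1, sleep=time.sleep):
    """Call request_fn, retrying on connection-level failures.

    OSError covers ConnectionError and ssl.SSLError, e.g. a TLS handshake
    that fails because the destination presents an unexpected certificate.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return request_fn()
        except OSError as err:
            last_error = err
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay_s)
    raise last_error
```

Every app making c2c calls would need something like this, whereas requests through gorouter get retries without app changes.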
@mariash - can you add something to the RFC about exposing the interval? It won't be a perfect fix for this issue, but it could improve it. Edited to add: ✅ done
No objections to continuing without the immediate state change for this RFC and considering it in a larger context. We are currently not using c2c networking because of the missing load balancing across instances and other bits (e.g. retries) that we get via gorouter.
* add interval property
* add process type for backwards compatibility
Starting Final Comment Period. FCP should end on Aug 22, 2023.
The link at the top of this PR does not work for me. https://github.com/cloudfoundry/community/blob/main/toc/rfc/rfc-0020-readiness-healthchecks.md however does.
For easier viewing: https://github.com/cloudfoundry/community/blob/rfc-readiness-healthchecks/toc/rfc/rfc-draft-readiness-healthchecks.md