Reset the pods if jenkins fails to wake up within certain duration #289

We have noticed that at times, Jenkins pods fail to wake up and the only solution seems to be killing the pods and starting them again. Let's implement this as part of jenkins-proxy's un-idle step.

Comments
This will help openshiftio/openshift.io#3517
Let's also reduce the polling frequency in the init container, so that it fails fast.
@sthaha So, let's say the init container fails to connect to content-repository after trying 15 times; it then reports this to the proxy, and the proxy deletes the pod and creates a new one?
@containscafeine The proxy can keep track of the attempts to reach Jenkins and decide to reset it. I think it may be better to do this on the idler side, as it would need to react to build/deploy events and wake Jenkins up.
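A minimal sketch of what that attempt-tracking could look like on the idler side, assuming a recent Kubernetes client-go; the threshold, the `name=jenkins` label selector, and the type names are hypothetical, not existing idler code:

```go
package idler

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// maxUnidleAttempts is a hypothetical threshold; the "15 tries"
// mentioned above would map to a value like this.
const maxUnidleAttempts = 15

// resetTracker counts failed un-idle attempts per namespace and
// resets the Jenkins pods once the threshold is crossed. Not
// goroutine-safe; real code would guard the map with a mutex.
type resetTracker struct {
	client   kubernetes.Interface
	attempts map[string]int // namespace -> consecutive failures
}

// recordFailure bumps the counter for a namespace and, past the
// threshold, deletes the Jenkins pods so that the deployment
// config brings fresh ones back up.
func (t *resetTracker) recordFailure(ctx context.Context, ns string) error {
	t.attempts[ns]++
	if t.attempts[ns] < maxUnidleAttempts {
		return nil
	}
	err := t.client.CoreV1().Pods(ns).DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "name=jenkins"},
	)
	if err != nil {
		return fmt.Errorf("resetting jenkins pods in %s: %w", ns, err)
	}
	t.attempts[ns] = 0
	return nil
}

// recordSuccess clears the counter once Jenkins answers again.
func (t *resetTracker) recordSuccess(ns string) {
	delete(t.attempts, ns)
}
```

Deleting the pods, rather than scaling the DC down and up, lets the replication controller recreate them immediately, which matches the "kill the pod and start it again" workaround described in the issue.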
I've been trying to reproduce the underlying network issue for the past couple of weeks without success, even by repeatedly idling/unidling my jenkins namespace through the jenkins-proxy API. The openshift networking team believes there is a problem in the way we idle/unidle in some specific situations. Some namespaces often get stuck, and I was unable to reproduce the issue on them using the idler API.

One theory is that jenkins-proxy may be asking openshift to scale up content-repository at the same time the jenkins init container is trying to wake it up by connecting to its service, which would cause a weird situation at the openshift idler/SDN layer.

If we are confident that jenkins-idler is capable of handling the unidling of content-repository, we may want to try turning off the init container completely and see how it goes. If we do that, we would need jenkins-proxy to be aware of all jenkins DC changes; if the jenkins DC is scaled up from outside of jenkins-idler/proxy (manually, or openshift itself unidles it), it would then have to react by also unidling/scaling up content-repository.
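If the proxy had to react to Jenkins DC changes by also waking content-repository, one hedged way to do the scale-up is a merge patch against the DeploymentConfig scale subresource. This is a sketch under assumptions: the helper name and parameters are invented, and the `apps.openshift.io/v1` path should be checked against the cluster version:

```go
package idler

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// scaleUpContentRepository asks OpenShift to scale the
// content-repository DeploymentConfig to one replica. apiServer and
// token are assumed to come from the idler's existing OpenShift
// client configuration.
func scaleUpContentRepository(ctx context.Context, client *http.Client, apiServer, token, ns string) error {
	// The scale subresource accepts a merge patch that only
	// touches spec.replicas.
	url := fmt.Sprintf(
		"%s/apis/apps.openshift.io/v1/namespaces/%s/deploymentconfigs/content-repository/scale",
		apiServer, ns,
	)
	body := bytes.NewBufferString(`{"spec":{"replicas":1}}`)

	req, err := http.NewRequestWithContext(ctx, http.MethodPatch, url, body)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/merge-patch+json")

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("scale request failed: %s", resp.Status)
	}
	return nil
}
```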
I think this will have to be done on the UI side, as the UI knows how many times we have tried to unidle Jenkins, or how long it has been.
@kishansagathiya No, not in the UI. The way I think of it, this must be done in the idler, which must keep a tab on the Jenkins it tried to unidle and see whether it got unidled properly. Like we already discussed, user-idler also unidles Jenkins, and the same problem can occur there, can't it? The caller asks the service to unidle; it shouldn't have to keep asking the service whether it really got unidled and then tell it to do its job properly. cc @chmouel WDYT?
Let me try to make sure I understand this discussion: when you say "jenkins pods fail to wake up", you are talking about that networking problem @jfchevrette has been trying to debug.

Basically @sthaha you are suggesting we work around this openshift-sdn bug in the idler by keeping track of how many times we have tried to unidle, tracking whether it was successful, and resetting.

What would be the definition of successful though? We have seen cases where network connectivity works at init time, but when spawning a slave, the slave fails to communicate (over JNLP) with the master.

If we are working around the networking connectivity problem here, maybe that should be done inside jenkins instead: get rid of the init-container wait-for-content-repository dependency, and have jenkins detect, when spinning a new job, whether it can communicate properly and fail otherwise. What do you think?
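On the "definition of successful" question, one cheap idler-side check is a plain TCP probe against the service. A minimal sketch, assuming the host and port are resolved from the Jenkins (or JNLP) service; note that even this would not catch the JNLP failure mode described above, where the port answers but slaves still cannot talk to the master:

```go
package idler

import (
	"context"
	"fmt"
	"net"
	"time"
)

// isReachable is one possible definition of "successfully unidled":
// the service accepts a TCP connection within the deadline. A
// stricter check could instead request the Jenkins HTTP endpoint
// or exercise the JNLP port used by slaves.
func isReachable(ctx context.Context, host string, port int) bool {
	d := net.Dialer{Timeout: 5 * time.Second}
	conn, err := d.DialContext(ctx, "tcp", fmt.Sprintf("%s:%d", host, port))
	if err != nil {
		return false
	}
	conn.Close()
	return true
}
```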
@sthaha Any thoughts on ^?
Update: we are going to remove content-repository from the init container and see how things work. If the issue still persists, we will reset the pod from the idler.
@hrishin where is the removal of content-repository tracked?
@ppitonak Ideally it could be a separate issue, but it's tracked in openshiftio/openshift.io#3895
@ppitonak Created an issue to track this: openshiftio/openshift.io#4083
I am closing this, as we seem to have solved the Jenkins not waking up issue by removing content-repository from the init container.