
Reset the pods if Jenkins fails to wake up within a certain duration #289

Closed
sthaha opened this issue Jun 5, 2018 · 14 comments

@sthaha
Contributor

sthaha commented Jun 5, 2018

We have noticed that at times, Jenkins pods fail to wake up, and the only solution seems to be to kill the pod and start it again. Let's implement this as part of jenkins-proxy's un-idle step.
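A minimal sketch of what that reset could look like in Go, assuming a 2018-era client-go clientset (pre-context API) is already configured; `resetJenkinsPods` and the `deploymentconfig=jenkins` label selector are illustrative assumptions, not the proxy's actual code:

```go
package idler

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resetJenkinsPods deletes the Jenkins master pods in a user's namespace so
// that the deployment config recreates them from scratch. Hypothetical
// helper; the label selector is an assumption.
func resetJenkinsPods(client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(metav1.ListOptions{
		LabelSelector: "deploymentconfig=jenkins",
	})
	if err != nil {
		return fmt.Errorf("listing jenkins pods: %v", err)
	}
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(namespace).Delete(pod.Name, &metav1.DeleteOptions{}); err != nil {
			return fmt.Errorf("deleting pod %s: %v", pod.Name, err)
		}
	}
	return nil
}
```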

@piyush-garg
Contributor

This will help openshiftio/openshift.io#3517

@hrishin
Member

hrishin commented Jun 20, 2018

Let's also reduce the polling frequency in the init container, so it fails fast?

@concaf

concaf commented Jun 20, 2018

@sthaha so, let's say, after the init container fails to connect to content-repository after trying 15 times, it reports this to the proxy, and the proxy deletes the pod and creates a new one?
Since this is a networking issue underneath, it might also be the case that the init container cannot talk to the proxy either.
This entire thing was done to achieve serialized execution, i.e. content-repository comes up first, and then Jenkins comes up.
Maybe this behavior can be achieved by the proxy independently, without using init containers.
Ignore if I don't make sense :P

@sthaha
Contributor Author

sthaha commented Jun 21, 2018

@containscafeine the proxy can keep track of the attempts to reach Jenkins and decide to reset it. I think it may be better to do this on the idler side, since the idler already has to react to build/deploy events and wake Jenkins up.

@jfchevrette

I've been trying to reproduce the underlying network issue for the past couple of weeks without success, even by repeatedly idling/unidling my Jenkins namespace through the jenkins-proxy API. The OpenShift networking team believes there is a problem in the way we idle/unidle in some specific situations. Some namespaces often get stuck, and I was unable to reproduce the issue on them using the idler API.

One theory is that jenkins-proxy may be asking OpenShift to scale up content-repository at the same time the Jenkins init container is trying to wake it up by connecting to its service, which would cause a weird situation at the OpenShift idler/SDN layer.

If we are confident that jenkins-idler is capable of handling the unidling of content-repository, we may want to try turning off the init container completely and see how it goes. If we do that, jenkins-proxy would need to be aware of all Jenkins DC changes: if the Jenkins DC is scaled up from outside of jenkins-idler/proxy (manually, or because OpenShift itself unidles it), it would have to react by also unidling/scaling up content-repository.
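A rough sketch of the reaction loop that would imply, with only the decision logic shown; `replicaEvent`, `reconcile`, and `scaleUp` are assumed names standing in for the actual OpenShift API plumbing:

```go
// replicaEvent is an assumed event type: a DC name, its new replica count,
// and whether the change was initiated by jenkins-idler/proxy itself.
type replicaEvent struct {
	dc       string
	replicas int
	byIdler  bool
}

// reconcile reacts to jenkins DC changes coming from outside the
// idler/proxy (manual scale-up, OpenShift's own unidler) by bringing
// content-repository up alongside it.
func reconcile(events <-chan replicaEvent) {
	for ev := range events {
		if ev.dc == "jenkins" && ev.replicas > 0 && !ev.byIdler {
			scaleUp("content-repository")
		}
	}
}

// scaleUp is a stub; in practice this would call the OpenShift scale
// subresource for the named deployment config.
func scaleUp(dc string) {}
```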

@kishansagathiya
Member

I think this will have to be done on the UI side, as the UI knows how many times we have tried to unidle Jenkins and how long it has been.
@sthaha WDYT?

@sthaha
Contributor Author

sthaha commented Jun 29, 2018

@kishansagathiya no, not in UI

The way I think of it, this must be done in the idler, which must keep a tab on each Jenkins it tried to unidle and check whether it actually got unidled. Like we already discussed, the user-idler also unidles Jenkins, and the same problem can occur there, can't it? So the caller asks the service to unidle; it shouldn't have to keep asking the service whether it really got unidled and then tell it to do its job properly.
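A sketch of that bookkeeping on the idler side; `unidleTracker` and the `maxUnidleRetries` threshold are illustrative assumptions, not fabric8-jenkins-idler code:

```go
// Assumed threshold, mirroring the init container's retry count.
const maxUnidleRetries = 15

// unidleTracker counts failed un-idle attempts per namespace.
type unidleTracker struct {
	attempts map[string]int
}

func newUnidleTracker() *unidleTracker {
	return &unidleTracker{attempts: map[string]int{}}
}

// recordAttempt notes the outcome of one un-idle attempt and reports
// whether the caller should reset (delete) the Jenkins pod.
func (t *unidleTracker) recordAttempt(namespace string, wokeUp bool) bool {
	if wokeUp {
		delete(t.attempts, namespace) // healthy again, clear the counter
		return false
	}
	t.attempts[namespace]++
	return t.attempts[namespace] >= maxUnidleRetries
}
```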

cc @chmouel WDYT?

@chmouel
Contributor

chmouel commented Jun 29, 2018

Let me try to make sure I understand this discussion

When you say "jenkins pods fail to wake up", you are talking about the networking problem @jfchevrette has been trying to debug.

Basically, @sthaha, you are suggesting we work around this openshift-sdn bug in the idler by keeping track of how many times we have tried to unidle, checking whether each attempt was successful, and resetting after a certain number of tries?

What would the definition of successful be, though? Because we have seen cases where network connectivity works at init time, but when spawning a slave, the slave fails to communicate (over JNLP) with the master.

If we are working around the network connectivity problem here, maybe that should be done inside Jenkins instead: get rid of the init-container wait-for-content-repository dependency, and have Jenkins, when spinning up a new job, detect whether it can communicate properly and fail otherwise.

What do you think ?

@kishansagathiya
Member

@sthaha Any thoughts on ^?

@hrishin
Member

hrishin commented Jul 24, 2018

Update:

We are going to remove content-repository from the init container and see how things work. If the issue still persists, we will reset the pod from the idler.
WIP for resetting the pod: fabric8-services/fabric8-jenkins-idler#261

@ppitonak

@hrishin where is the removal of content-repository tracked?

@hrishin
Member

hrishin commented Jul 26, 2018

@ppitonak ideally it would be a separate issue, but it's tracked in openshiftio/openshift.io#3895

@kishansagathiya
Member

@ppitonak Created an issue to track this: openshiftio/openshift.io#4083

@sthaha
Contributor Author

sthaha commented Aug 16, 2018

I am closing this, as we seem to have solved the Jenkins not-waking-up issue by:

  1. removing the init-container hack
  2. removing dependency on content-repository
  3. tweaking the jvm params
  4. allocating more resources to Jenkins

sthaha closed this as completed Aug 16, 2018