
Reset the pods if Jenkins fails to wake up within a certain duration #289

Closed
sthaha opened this issue Jun 5, 2018 · 14 comments

@sthaha
Contributor

sthaha commented Jun 5, 2018

We have noticed that at times, Jenkins pods fail to wake up, and the only solution seems to be to kill the pod and start it again. Let's implement this as part of jenkins-proxy's un-idle step.
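A minimal sketch of what that reset could look like in Go, assuming a 2018-era client-go clientset (pre-context API) is already configured; `resetJenkinsPods` and the `deploymentconfig=jenkins` label selector are illustrative assumptions, not the proxy's actual code:

```go
package idler

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resetJenkinsPods deletes the Jenkins master pods in a user's namespace so
// that the deployment config recreates them from scratch. Hypothetical
// helper; the label selector is an assumption.
func resetJenkinsPods(client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(metav1.ListOptions{
		LabelSelector: "deploymentconfig=jenkins",
	})
	if err != nil {
		return fmt.Errorf("listing jenkins pods: %v", err)
	}
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(namespace).Delete(pod.Name, &metav1.DeleteOptions{}); err != nil {
			return fmt.Errorf("deleting pod %s: %v", pod.Name, err)
		}
	}
	return nil
}
```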

@piyush-garg
Contributor

This will help openshiftio/openshift.io#3517

@hrishin
Member

hrishin commented Jun 20, 2018

Let's also reduce the polling frequency in the init container, so it fails fast?

@concaf

concaf commented Jun 20, 2018

@sthaha so, let's say, after the init container fails to connect to content-repository after trying 15 times, it reports this to the proxy, and the proxy deletes the pod and creates a new one?
Since this is a networking issue underneath, it might also be the case that the init container cannot talk to the proxy either.
This entire thing was done to achieve serialized execution, i.e. content-repository comes up first, and then Jenkins comes up.
Maybe this behavior can be achieved by the proxy independently, without using init containers.
Ignore if I don't make sense :P

@sthaha
Contributor Author

sthaha commented Jun 21, 2018

@containscafeine the proxy can keep track of the attempts to reach Jenkins and decide to reset it. I think it may be better to do this on the idler side, since the idler already has to react to build/deploy events and wake Jenkins up.

@jfchevrette

I've been trying to reproduce the underlying network issue for the past couple of weeks without success, even by repeatedly idling/unidling my Jenkins namespace through the jenkins-proxy API. The OpenShift networking team believes there is a problem in the way we idle/unidle in some specific situations. Some namespaces often get stuck, and I was unable to reproduce the issue on them using the idler API.

One theory is that jenkins-proxy may be asking OpenShift to scale up content-repository at the same time the Jenkins init container is trying to wake it up by connecting to its service, which would cause a weird situation at the OpenShift idler/SDN layer.

If we are confident that jenkins-idler is capable of handling the unidling of content-repository, we may want to try turning off the init container completely and see how it goes. If we do that, jenkins-proxy would need to be aware of all Jenkins DC changes: if the Jenkins DC is scaled up from outside of jenkins-idler/proxy (manually, or because OpenShift itself unidles it), it would have to react by also unidling/scaling up content-repository.
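A rough sketch of the reaction loop that would imply, with only the decision logic shown; `replicaEvent`, `reconcile`, and `scaleUp` are assumed names standing in for the actual OpenShift API plumbing:

```go
// replicaEvent is an assumed event type: a DC name, its new replica count,
// and whether the change was initiated by jenkins-idler/proxy itself.
type replicaEvent struct {
	dc       string
	replicas int
	byIdler  bool
}

// reconcile reacts to jenkins DC changes coming from outside the
// idler/proxy (manual scale-up, OpenShift's own unidler) by bringing
// content-repository up alongside it.
func reconcile(events <-chan replicaEvent) {
	for ev := range events {
		if ev.dc == "jenkins" && ev.replicas > 0 && !ev.byIdler {
			scaleUp("content-repository")
		}
	}
}

// scaleUp is a stub; in practice this would call the OpenShift scale
// subresource for the named deployment config.
func scaleUp(dc string) {}
```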

@kishansagathiya
Member

I think this will have to be done on the UI side, as the UI knows how many times we have tried to unidle Jenkins and how long it has been.
@sthaha WDYT?

@sthaha
Contributor Author

sthaha commented Jun 29, 2018

@kishansagathiya no, not in UI

The way I think of it, this must be done in the idler, which must keep a tab on each Jenkins it tried to unidle and check whether it actually got unidled. Like we already discussed, the user-idler also unidles Jenkins, and the same problem can occur there, can't it? So the caller asks the service to unidle; it shouldn't have to keep asking the service whether it really got unidled and then tell it to do its job properly.
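A sketch of that bookkeeping on the idler side; `unidleTracker` and the `maxUnidleRetries` threshold are illustrative assumptions, not fabric8-jenkins-idler code:

```go
// Assumed threshold, mirroring the init container's retry count.
const maxUnidleRetries = 15

// unidleTracker counts failed un-idle attempts per namespace.
type unidleTracker struct {
	attempts map[string]int
}

func newUnidleTracker() *unidleTracker {
	return &unidleTracker{attempts: map[string]int{}}
}

// recordAttempt notes the outcome of one un-idle attempt and reports
// whether the caller should reset (delete) the Jenkins pod.
func (t *unidleTracker) recordAttempt(namespace string, wokeUp bool) bool {
	if wokeUp {
		delete(t.attempts, namespace) // healthy again, clear the counter
		return false
	}
	t.attempts[namespace]++
	return t.attempts[namespace] >= maxUnidleRetries
}
```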

cc @chmouel WDYT?

@chmouel
Contributor

chmouel commented Jun 29, 2018

Let me try to make sure I understand this discussion

When you say "jenkins pods fail to wake up", you are talking about the networking problem @jfchevrette has been trying to debug.

Basically, @sthaha, you are suggesting we work around this openshift-sdn bug in the idler by keeping track of how many times we have tried to unidle, checking whether each attempt was successful, and resetting after a certain number of tries?

What would the definition of successful be, though? Because we have seen cases where network connectivity works at init time, but when spawning a slave, the slave fails to communicate (over JNLP) with the master.

If we are working around the network connectivity problem here, maybe that should be done inside Jenkins instead: get rid of the init-container wait-for-content-repository dependency, and have Jenkins, when spinning up a new job, detect whether it can communicate properly and fail otherwise.

What do you think ?

@kishansagathiya
Member

@sthaha Any thoughts on ^?

@hrishin
Member

hrishin commented Jul 24, 2018

Update:

We are going to remove content-repository from the init container and see how things work. If the issue still persists, we will reset the pod from the idler.
WIP for resetting the pod: fabric8-services/fabric8-jenkins-idler#261

@ppitonak

@hrishin where is the removal of content-repository tracked?

@hrishin
Member

hrishin commented Jul 26, 2018

@ppitonak ideally it would be a separate issue, but it's tracked in openshiftio/openshift.io#3895

@kishansagathiya
Member

@ppitonak Created an issue to track this: openshiftio/openshift.io#4083

@sthaha
Contributor Author

sthaha commented Aug 16, 2018

I am closing this, as we seem to have solved the Jenkins not-waking-up issue by:

  1. removing the init-container hack
  2. removing dependency on content-repository
  3. tweaking the jvm params
  4. allocating more resources to Jenkins

sthaha closed this as completed Aug 16, 2018