interleave Apache clones to minimise disruption (bsc#965886) #220

jsuchome · 2016-03-24T10:15:03Z

(There is a corresponding commit to crowbar-openstack, in which
there are many more clones than in this repository.)

By default, Pacemaker clones aren't interleaved. This means that if
Pacemaker wants to restart a dead clone instance, and there is an order
constraint on that clone, it will also restart the instances of that
clone on all other nodes, even if those instances are healthy.

More details on interleaving are here:

https://www.hastexo.com/resources/hints-and-kinks/interleaving-pacemaker-clones/

This behaviour is far more disruptive than we want. For example, in

https://bugzilla.suse.com/show_bug.cgi?id=965886

we saw that when a network node dies and Pacemaker wants to stop the
instance of cl-g-neutron-agents on that node, it also stops and restarts
the same clone instances on the healthy nodes. This means there is a
small window in which there are no neutron agents running anywhere. If
neutron-ha-tool attempts a router migration during this window, it will
fail, at which point things start to go badly wrong.
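
To make the window concrete, here is a toy model of the behaviour described
above. It is only a sketch of the rule as stated in this description, not
Pacemaker's actual scheduling logic. The `Clone` class, the helper functions
and the node names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clone:
    """Hypothetical stand-in for a Pacemaker clone resource."""
    name: str
    interleave: bool = False            # clones are not interleaved by default
    has_order_constraint: bool = False  # is the clone part of an order constraint?


def instances_to_restart(clone, nodes, failed_node):
    """Nodes whose instance of `clone` gets stopped when the instance
    on `failed_node` has to be stopped (simplified rule)."""
    if clone.has_order_constraint and not clone.interleave:
        # Without interleaving, the stop ripples to every node.
        return set(nodes)
    # With interleaving, only the affected node is touched.
    return {failed_node}


def still_running(clone, nodes, failed_node):
    """Nodes on which the clone keeps running during the recovery."""
    return set(nodes) - instances_to_restart(clone, nodes, failed_node)


nodes = ["net1", "net2", "net3"]  # hypothetical network nodes

default_clone = Clone("cl-g-neutron-agents", has_order_constraint=True)
interleaved = Clone("cl-g-neutron-agents", interleave=True,
                    has_order_constraint=True)

print(sorted(still_running(default_clone, nodes, "net1")))
print(sorted(still_running(interleaved, nodes, "net1")))
```

In this model the non-interleaved clone has no instance left running while
"net1" is being handled, which is the window in which a router migration
has nowhere to go. The interleaved clone keeps running on the healthy nodes.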

In general, the cloned (i.e. active/active) services on our controller
and compute nodes should all behave like independent vertical stacks, so
that a failure on one node does not cause ripple effects on other
nodes. Apache is one example of that: even though there aren't
currently any ordering constraints which would require interleaving,
we may add them in the future.
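
One way to check that every clone follows this rule is to scan the CIB XML
(for example the output of `cibadmin --query`) for clones which do not set
the `interleave` meta attribute. The following is a minimal sketch: the
sample CIB, the resource ids in it and the function name are made up for
illustration, and it only looks at meta attributes set directly on the
clone, not at resource defaults.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB excerpt; the ids are illustrative only.
SAMPLE_CIB = """
<cib>
  <configuration>
    <resources>
      <clone id="cl-apache">
        <meta_attributes id="cl-apache-meta_attributes">
          <nvpair id="cl-apache-meta_attributes-interleave"
                  name="interleave" value="true"/>
        </meta_attributes>
        <primitive id="apache" class="ocf" provider="heartbeat" type="apache"/>
      </clone>
      <clone id="cl-g-neutron-agents">
        <group id="g-neutron-agents">
          <primitive id="example-agent" class="ocf" provider="heartbeat"
                     type="Dummy"/>
        </group>
      </clone>
    </resources>
  </configuration>
</cib>
"""

# Spellings treated as "true" here; an assumption of this sketch.
TRUTHY = {"true", "yes", "on", "1", "y"}


def non_interleaved_clones(cib_xml):
    """Return the ids of clones that do not set interleave to a true value."""
    root = ET.fromstring(cib_xml)
    missing = []
    for clone in root.iter("clone"):
        # Only consider nvpairs in meta_attributes directly under the clone.
        nvpairs = clone.findall("./meta_attributes/nvpair[@name='interleave']")
        values = [nv.get("value", "").lower() for nv in nvpairs]
        if not any(value in TRUTHY for value in values):
            missing.append(clone.get("id"))
    return missing


print(non_interleaved_clones(SAMPLE_CIB))
```

Against the sample above this reports only the clone without the meta
attribute, so a clone such as Apache which has no ordering constraints yet
would still be covered by the check.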

(cherry picked from commit 9589cfcdf78de37492dc6d2a03f7b7a16671a4ff)

Backport of crowbar/crowbar-ha#98


vuntz · 2016-03-25T09:24:28Z

+1

AbelNavarro · 2016-03-29T10:31:09Z

+1

aspiers · 2016-04-05T17:01:25Z

Retriggered mkcloud rebuild to try to clear CI gate.

aspiers · 2016-04-07T20:08:19Z

Retriggered mkcloud rebuild CI again now that we have a fix.

aspiers merged commit 9afe665 into crowbar:release/tex/master Apr 8, 2016
