This repository has been archived by the owner on Dec 4, 2018. It is now read-only.

interleave Pacemaker clones to minimise disruption (bsc#965886) #208

Merged (1 commit, Apr 7, 2016)

Conversation

jsuchome
Member

By default, Pacemaker clones aren't interleaved. This means that if
Pacemaker wants to restart a dead clone instance, and there is an order
constraint on that clone, it will do the same restart on all other
nodes, even if all the others are healthy.

More details on interleaving are here:

https://www.hastexo.com/resources/hints-and-kinks/interleaving-pacemaker-clones/

This behaviour is far more disruptive than we want. For example, in

https://bugzilla.suse.com/show_bug.cgi?id=965886

we saw that when a network node dies and Pacemaker wants to stop the
instance of cl-g-neutron-agents on that node, it also stops and restarts
the same clone instances on the healthy nodes. This means there is a
small window in which there are no neutron agents running anywhere. If
neutron-ha-tool attempts a router migration during this window, it will
fail, at which point things start to go badly wrong.

In general, the cloned (i.e. active/active) services on our controller
and compute nodes should all behave like independent vertical stacks,
so that a failure on one node should not cause ripple effects on other
nodes. So we interleave all our clones.
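To make this concrete, here is a hedged sketch of how a clone is marked as interleaved at the Pacemaker level using the crm shell. The clone name cl-g-neutron-agents comes from the bug report above; the underlying group name g-neutron-agents is an assumption inferred from the clone's name, and the exact syntax may vary across crmsh versions:

```shell
# Hypothetical example: define the neutron agents clone with interleave
# enabled, so ordering constraints apply per node rather than cluster-wide.
# "g-neutron-agents" is an assumed name for the cloned group.
crm configure clone cl-g-neutron-agents g-neutron-agents \
    meta interleave=true

# Inspect the resulting clone definition and its meta attributes.
crm configure show cl-g-neutron-agents
```

With interleave=true, a restart of the clone instance on one node no longer forces restarts of the healthy instances on the other nodes.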

(There is a corresponding commit to crowbar-ha for the Apache clone.)

(cherry picked from commit bdde4b4dc2534e91bf1f2869a66491463134f8c1)

@vuntz
Member

vuntz commented Mar 25, 2016

+1

@vuntz
Member

vuntz commented Mar 25, 2016

Just realized: we usually put the meta before the action (but it doesn't matter, really).

@AbelNavarro

+1 (CI failed due to an unrelated issue)

@@ -52,6 +52,9 @@
     agent node[:ceilometer][:ha][:mongodb][:agent]
     op node[:ceilometer][:ha][:mongodb][:op]
     action :create
+    meta ({
+      "interleave" => "true",


shouldn't interleave go in the pacemaker_clone section?

Member Author

Ha, that looks correct!
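For illustration, a sketch of what the reviewer's suggestion would look like: the interleave meta attribute is set on the pacemaker_clone resource rather than on the primitive shown in the diff above. The resource and variable names here (service_name, the cl- prefix) are assumptions based on the diff's naming conventions, not the final patch:

```ruby
# Hypothetical Chef recipe fragment: set interleave on the clone itself.
# "service_name" is an assumed variable holding the cloned resource's name.
pacemaker_clone "cl-#{service_name}" do
  rsc service_name
  meta ({
    "interleave" => "true",
  })
  action :create
end
```

This keeps clone-level behaviour (interleaving) attached to the clone definition, while the primitive keeps only its own agent, op, and action settings.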

@jsuchome jsuchome force-pushed the interleave-clones branch from 5739441 to bec9e6f Compare March 31, 2016 11:04
@aspiers
Member

aspiers commented Apr 7, 2016

Retriggered the mkcloud rebuild CI now that we have a fix.

@aspiers aspiers merged commit e4b231d into crowbar:release/tex/master Apr 7, 2016