---
title: Rolling Upgrades in VMware RabbitMQ for Tanzu Application Service
owner: London Services
---
This topic describes the rolling upgrade strategy and how it can result in less downtime than other
upgrade methods.
It includes the steps of a rolling upgrade and a description of an experiment that illustrates the
benefits of rolling upgrades in detail.
## <a id='overview'></a> Overview
A rolling upgrade is a strategy for updating a distributed system.
In a rolling upgrade, each VM is updated in turn.
After the update completes, the VM is started, and once the specified processes are running, the update
procedure begins for the next VM in the sequence.
In a <%= vars.product_short %> cluster, each node runs on a separate VM.
Rolling upgrades help to ensure availability by keeping at least one node up throughout the upgrade process.
Before v1.17.3, some upgrades required the whole cluster to be shut down, for example, when a major or
minor version of <%= vars.product_short %> was updated or when a major version of the Erlang
distribution was updated.
As of v1.17.3, upgrades are performed using a rolling upgrade strategy.
The only case where a cluster must fully shut down as part of an upgrade is when the Erlang cookie
for the cluster is changed.
Due to ongoing development, VMware cannot guarantee that rolling upgrades will always be possible in the future.
VMware recommends always checking the release notes for each version before upgrading.
## <a id='rolling'></a> Running a Rolling Upgrade for <%= vars.product_full %>
On a single canary node in the cluster, the following steps are carried out:
* `rabbitmqctl stop` runs, stopping the <%= vars.product_short %> server process and the Erlang VM.
* The persistent disk is detached.
* The VM for the node is torn down.
* After 0–5 seconds the BOSH DNS service detects a failing health check for
the node that has just gone down. It stops routing service traffic to the node.
* BOSH requests a new VM from the underlying IaaS. It attaches the persistent
disk from the old VM.
* `rabbitmq-server` starts on the new VM. The node rejoins the cluster.
* The new node registers with the BOSH DNS service and begins receiving traffic.
These steps are then carried out on the remaining nodes in the cluster, one by one.
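During each node replacement, clients connected to that node are disconnected and must reconnect, through BOSH DNS, to a node that is still up.
The following is a minimal sketch, using the RabbitMQ Java client, of a connection with automatic recovery enabled.
The host name and credentials are placeholders rather than values from a real service binding.

```java
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class RecoveringConnection {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        // Placeholder values: in Tanzu Application Service these would come
        // from the app's service binding for the RabbitMQ service instance.
        factory.setHost("rabbitmq.example.internal");
        factory.setUsername("app-user");
        factory.setPassword("app-password");

        // Reconnect automatically if the node serving this connection is
        // replaced during a rolling upgrade.
        factory.setAutomaticRecoveryEnabled(true);
        // Wait five seconds between recovery attempts, roughly matching the
        // DNS record TTL described later in this topic.
        factory.setNetworkRecoveryInterval(5000);

        try (Connection connection = factory.newConnection()) {
            System.out.println("Connected to " + connection.getAddress());
        }
    }
}
```

Because each recovery attempt resolves the service host name again, a reconnection made after the DNS record updates reaches a node that is up.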
## <a id='example'></a> Example Rolling Upgrade Scenario
The experiment described below is an example of a rolling upgrade scenario.
In the experiment, an operator upgrades their platform to use a new version of <%= vars.product_full %>.
They upgrade from v1.17.4 to v1.18.1.
This experiment is designed to show a system performing a rolling upgrade under heavy load: there is
substantial disk I/O, and both the underlying <%= vars.product_short %> and Erlang software are upgraded
to newer versions.
Without a rolling upgrade, the whole cluster must shut down, resulting in a service outage because
publishers and consumers cannot connect to the cluster.
This experiment shows the extent of downtime associated with a rolling upgrade.
<p class='note'>
<strong>Note:</strong> The following is provided for example purposes only and is not intended to represent
all upgrade situations. Your platform setup might have different results.
</p>
### <a id='setup'></a> Configuration and Setup
The configuration and setup of the experiment are described in the following sections.
#### <a id='cluster-config'></a> Cluster Configuration
The IaaS used for this experiment is Google Cloud Platform (GCP).
The <%= vars.product_short %> node VMs are configured with:
* CPUs: 2
* RAM: 2 GB
* Disk: 8 GB
* Persistent disk: 5 GB
Initially, <%= vars.product_old %> v1.17.4 was installed with a plan configured to build a three-node
cluster with mirrored queues.
This environment was then upgraded to use <%= vars.product_short %> v1.18.1.
#### <a id='app-config'></a> App Configuration
The RabbitMQ Performance Tool for Cloud Foundry simulates the workload on the cluster.
This is a Java app that tests throughput of <%= vars.product_short %>.
This tool uses a resilient client with reconnection and retry logic.
When the performance test is run, it creates a direct exchange and a queue.
In addition, it creates the necessary consumers and producers and binds them to the newly created queue.
In this experiment, the performance test is configured to use durable and mirrored queues and persistent
messages, which ensure that messages are persisted to disk.
A protocol extension called Publisher Confirms is enabled to ensure that there is no data loss.
This setup ensures that there is a backlog of messages to be read from disk and consumed at any point during
the upgrade.
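For illustration only, the following is a minimal sketch of the same pattern using the RabbitMQ Java client: a durable queue, a persistent message, and publisher confirms.
The queue name and host are placeholders, and whether the queue is mirrored is determined by a cluster policy rather than by client code.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

import java.nio.charset.StandardCharsets;

public class ConfirmedPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq.example.internal"); // placeholder host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // Enable publisher confirms on this channel so the broker
            // acknowledges each message after it has been persisted.
            channel.confirmSelect();

            // Declare a durable queue. Mirroring is controlled by a cluster
            // policy, not by the client.
            channel.queueDeclare("upgrade-test", true, false, false, null);

            byte[] body = "{\"payload\":\"example\"}".getBytes(StandardCharsets.UTF_8);

            // PERSISTENT_BASIC marks the message as persistent (delivery mode 2),
            // so it is written to disk on the queue's node.
            channel.basicPublish("", "upgrade-test",
                    MessageProperties.PERSISTENT_BASIC, body);

            // Block until the broker confirms the message, or fail after 30 seconds.
            channel.waitForConfirmsOrDie(30_000);
        }
    }
}
```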
The publishers are configured to constantly produce messages in three different bursts:
+ 500 messages per second for 30 seconds
+ 750 messages per second for 15 seconds
+ 250 messages per second for 15 seconds
The consumers are expected to consume a total of 500 messages per second, which matches the average
publish rate over the 60-second burst cycle.
Each message is a 50,000-byte JSON blob.
The equivalent app manifest for this test is as follows:
```yaml
---
applications:
- name: rabbitmq-perf-test
path: ./target/pcf-perf-test-1.0-SNAPSHOT.jar
buildpacks:
- https://github.com/cloudfoundry/java-buildpack.git
memory: 2G
health-check-type: process
services: [rmq]
env:
VARIABLE_RATE: "500:30,750:15,250:15"
CONSUMER_RATE: 500
JSON_BODY: true
SIZE: 50000
SLOW_START: true
METRICS_PROMETHEUS: true
FLAG: persistent
CONFIRM: 30000
```
For more information about concepts mentioned above, see:
<table class="nice">
<col width="50%">
<col width="50%">
<tr>
<th>Concept</th>
<th>More information in…</th>
</tr>
<tr>
<td>The RabbitMQ Performance Tool for Cloud Foundry</td>
<td>the <a href="https://github.com/rabbitmq/rabbitmq-perf-test-for-cf">RabbitMQ
PerfTest for Cloud Foundry</a> repository in GitHub</td>
</tr>
<tr>
<td>Direct exchange</td>
<td>the <a href="https://www.rabbitmq.com/tutorials/amqp-concepts.html#exchange-direct">RabbitMQ documentation</a></td>
</tr>
<tr>
<td>Durable queues</td>
<td>the <a href="https://www.rabbitmq.com/queues.html#durability">RabbitMQ documentation</a></td>
</tr>
<tr>
<td>Mirrored queues</td>
<td>the <a href="https://www.rabbitmq.com/ha.html">RabbitMQ documentation</a></td>
</tr>
<tr>
<td>Publisher Confirms protocol extension</td>
<td>the <a href="https://www.rabbitmq.com/confirms.html#publisher-confirms">RabbitMQ documentation</a></td>
</tr>
</table>
### <a id='observe'></a> Observations
Tests show that downtime experienced during this rolling upgrade is significantly reduced compared to a
similar upgrade where the cluster is fully shut down.
The metrics indicate that the downtime, in this case a publisher being unable to publish a message to a
queue, is at most 5 seconds.
This is because the internal BOSH DNS record used to round-robin messages across the nodes in the cluster
has a 5-second time to live (TTL). Until the record expires, messages can still be routed to the node that
has just been replaced.
Because the tested app has retry logic, no service outage is observed.
For more information about creating resilient apps, see the
[resiliency-workloads](https://github.com/rabbitmq/resiliency-workloads) repository in GitHub.
In most cases, downtime is longer for a cluster under greater load.
When a node comes back up and rejoins the cluster, messages from the other nodes are synchronized to the newly joined node.
Queues on the newly joined node reject publishers and consumers until this synchronization completes.
### <a id='consider'></a> Cluster Configuration Considerations
There is downtime for a cluster without mirrored queues.
This is because while the hosting node is down, the queue is unavailable and any messages published to it are dropped,
unless the publisher sets the `mandatory` flag or the exchange is configured with an alternate exchange.
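For illustration only, the following is a minimal sketch, using the RabbitMQ Java client, of a publisher that sets the `mandatory` flag and registers a return listener so that unroutable messages are reported back rather than silently dropped.
The exchange name, routing key, and host are placeholders.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

import java.nio.charset.StandardCharsets;

public class MandatoryPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq.example.internal"); // placeholder host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // Unroutable mandatory messages are returned to this listener by
            // the broker instead of being dropped.
            channel.addReturnListener((replyCode, replyText, exchange, routingKey,
                                       properties, body) ->
                    System.err.println("Message returned: " + replyText));

            // A durable direct exchange with no bound queues, so the message
            // below cannot be routed anywhere.
            channel.exchangeDeclare("example-exchange", "direct", true);

            byte[] payload = "{\"payload\":\"example\"}".getBytes(StandardCharsets.UTF_8);

            // The third argument sets the mandatory flag.
            channel.basicPublish("example-exchange", "example-key", true,
                    MessageProperties.PERSISTENT_BASIC, payload);

            // Give the asynchronous basic.return a moment to arrive before the
            // connection is closed.
            Thread.sleep(2000);
        }
    }
}
```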