Merge pull request #407 from klgill/BetaDocs-RestructureCephMigrationRBD
Beta docs restructure ceph migration rbd
klgill authored Apr 17, 2024
2 parents 6209f6b + f7d1b4a commit 186764b
Showing 4 changed files with 99 additions and 118 deletions.
17 changes: 17 additions & 0 deletions docs_user/assemblies/assembly_migrating-ceph-rbd.adoc
@@ -0,0 +1,17 @@
[id="migrating-ceph-rbd_{context}"]

:context: migrating-ceph-rbd

= Migrating Red Hat Ceph Storage RBD

For hyperconverged infrastructure (HCI) or dedicated Storage nodes that are running Red Hat Ceph Storage version 6 or later, you must migrate the daemons that are included in the {rhos_prev_long} control plane into the existing external RHEL nodes. The external RHEL nodes typically include the Compute nodes for an HCI environment or dedicated storage nodes.

To migrate Red Hat Ceph Storage Rados Block Device (RBD), your environment must meet the following requirements:

* Red Hat Ceph Storage is running version 6 or later and is managed by cephadm/orchestrator.
* NFS (ganesha) is migrated from a {OpenStackPreviousInstaller}-based deployment to cephadm. For more information, see xref:creating-a-ceph-nfs-cluster_migrating-databases[Creating an NFS Ganesha cluster].
* Both the Red Hat Ceph Storage public and cluster networks are propagated, with {OpenStackPreviousInstaller}, to the target nodes.
* The Ceph Monitor daemons must keep their IP addresses to avoid cold migration.
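The last requirement can be checked up front by recording the current monitor IP addresses, for example from the `mon_host` line in `/etc/ceph/ceph.conf` or from `ceph mon dump`. The following is a minimal, self-contained sketch that parses a sample `mon_host` line; the addresses are illustrative values, not output from a real cluster:

```shell
# Sample mon_host line as it appears in /etc/ceph/ceph.conf (illustrative values);
# on a live cluster you would obtain it with: ceph mon dump
mon_host='[v2:172.16.11.54:3300/0,v1:172.16.11.54:6789/0] [v2:172.16.11.121:3300/0,v1:172.16.11.121:6789/0]'

# Extract the unique monitor IP addresses; these are the addresses that
# must be preserved when the mon daemons move to the target nodes.
mon_ips=$(printf '%s\n' "$mon_host" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort -u)
echo "$mon_ips"
```

Keeping this list on hand makes it easy to verify, after each mon is redeployed, that no address changed.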

include::../modules/proc_migrating-mon-and-mgr-from-controller-nodes.adoc[leveloffset=+1]

4 changes: 1 addition & 3 deletions docs_user/main.adoc
@@ -18,8 +18,6 @@ include::assemblies/assembly_migrating-databases-to-the-control-plane.adoc[level

include::assemblies/assembly_adopting-openstack-control-plane-services.adoc[leveloffset=+1]

include::assemblies/openstack_adoption.adoc[leveloffset=+1]

include::assemblies/ceph_migration.adoc[leveloffset=+1]
include::assemblies/assembly_migrating-ceph-rbd.adoc[leveloffset=+1]

include::assemblies/swift_migration.adoc[leveloffset=+1]
2 changes: 1 addition & 1 deletion docs_user/modules/proc_creating-a-ceph-nfs-cluster.adoc
@@ -1,6 +1,6 @@
[id="creating-a-ceph-nfs-cluster_{context}"]

= Creating a Ceph NFS cluster
= Creating an NFS Ganesha cluster

If you use the Ceph via NFS backend with {rhos_component_storage_file_first_ref}, prior to adoption, you must create a new clustered NFS service on the Ceph cluster. This service will replace the standalone, pacemaker-controlled `ceph-nfs` service that was used on {rhos_prev_long} {rhos_prev_ver}.

@@ -1,52 +1,24 @@
[id="migrating-ceph-rbd_{context}"]
[id="migrating-mon-and-mgr-from-controller-nodes_{context}"]

//:context: migrating-ceph-rbd
//kgilliga: This module might be converted to an assembly.
= Migrating Ceph Monitor and Ceph Manager daemons to Red Hat Ceph Storage nodes
//kgilliga: I'm trying to understand the purpose of this procedure. Is this procedure a prescriptive way for customers to migrate Ceph Monitor and Ceph manager daemons from controller nodes to Red Hat Ceph Storage nodes? Or are we recommending that customers create a proof of concept before doing the actual migration? And are oc0-controller-1 and oc0-ceph-0 just examples of the names of nodes for the purposes of this procedure? Note: The SME addressed these questions in the PR. This procedure needs more work. It should not be a POC.
Migrate your Ceph Monitor daemons, Ceph Manager daemons, and object storage daemons (OSDs) from your {rhos_prev_long} Controller nodes to existing Red Hat Ceph Storage nodes. During the migration, ensure that you can do the following actions:

= Migrating Ceph RBD
* Keep the Ceph Monitor IP addresses by moving them to the Red Hat Ceph Storage nodes.
* Drain the existing Controller nodes and shut them down.
* Deploy additional monitors to the existing nodes, and promote them as `_admin` nodes that administrators can use to manage the Red Hat Ceph Storage cluster and perform day 2 operations against it.
* Keep the cluster operational during the migration.

In this scenario, assuming Ceph is already >= 5, either for HCI or dedicated
Storage nodes, the daemons living in the OpenStack control plane should be
moved/migrated into the existing external RHEL nodes (typically the compute
nodes for an HCI environment or dedicated storage nodes in all the remaining
use cases).
The following procedure shows an example migration from a Controller node (`oc0-controller-1`) to a Red Hat Ceph Storage node (`oc0-ceph-0`). Use the names of the nodes in your environment.

== Requirements
.Prerequisites

* Ceph is >= 5 and managed by cephadm/orchestrator.
* Ceph NFS (ganesha) migrated from a https://bugzilla.redhat.com/show_bug.cgi?id=2044910[TripleO based deployment to cephadm].
* Both the Ceph public and cluster networks are propagated, via TripleO, to the target nodes.
* Ceph Mons need to keep their IPs (to avoid cold migration).

== Scenario: Migrate mon and mgr from controller nodes

The goal of the first POC is to prove that you are able to successfully drain a
controller node, in terms of ceph daemons, and move them to a different node.
The initial target of the POC is RBD only, which means you are going to move only
mon and mgr daemons. For the purposes of this POC, you will deploy a ceph cluster
with only mon, mgrs, and osds to simulate the environment a customer will be in
before starting the migration.
The goal of the first POC is to ensure that:

* You can keep the mon IP addresses moving them to the Ceph Storage nodes.
* You can drain the existing controller nodes and shut them down.
* You can deploy additional monitors to the existing nodes, promoting them as
_admin nodes that can be used by administrators to manage the Ceph cluster
and perform day2 operations against it.
* You can keep the cluster operational during the migration.

=== Prerequisites

The Storage Nodes should be configured to have both *storage* and *storage_mgmt*
network to make sure that you can use both Ceph public and cluster networks.

This step is the only one where the interaction with TripleO is required. From
17+ you do not have to run any stack update. However, there are commands that you
should perform to run os-net-config on the bare-metal node and configure
additional networks.

Make sure the network is defined in metalsmith.yaml for the CephStorageNodes:
* Configure the Storage nodes to have both the storage and storage_mgmt networks to ensure that you can use both the Red Hat Ceph Storage public and cluster networks. This step requires you to interact with {OpenStackPreviousInstaller}. From {rhos_prev_long} {rhos_prev_ver} and later, you do not have to run a stack update. However, you must run certain commands to execute `os-net-config` on the bare metal node and configure additional networks.

.. Ensure that the network is defined in the `metalsmith.yaml` for the CephStorageNodes:
+
[source,yaml]
----
- name: CephStorage
@@ -68,16 +40,16 @@ Make sure the network is defined in metalsmith.yaml for the CephStorageNodes:
template: templates/single_nic_vlans/single_nic_vlans_storage.j2
----

Then run:

.. Run the following command:
+
----
openstack overcloud node provision \
-o overcloud-baremetal-deployed-0.yaml --stack overcloud-0 \
--network-config -y --concurrency 2 /home/stack/metalsmith-0.yam
----

Verify that the storage network is running on the node:

.. Verify that the storage network is running on the node:
+
----
(undercloud) [CentOS-9 - stack@undercloud ~]$ ssh [email protected] ip -o -4 a
Warning: Permanently added '192.168.24.14' (ED25519) to the list of known hosts.
@@ -88,33 +60,30 @@ Warning: Permanently added '192.168.24.14' (ED25519) to the list of known hosts.
8: vlan12 inet 172.16.12.46/24 brd 172.16.12.255 scope global vlan12\ valid_lft forever preferred_lft forever
----
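This verification can be scripted by searching the `ip -o -4 a` output for an address on each of the two storage subnets. A self-contained sketch against a sample of the output above; the subnets `172.16.11.0/24` (storage) and `172.16.12.0/24` (storage_mgmt) are the example values from this guide:

```shell
# Sample of `ip -o -4 a` output, abridged from the example above;
# on the node itself you would capture it with: ip_out=$(ip -o -4 a)
ip_out='7: vlan11 inet 172.16.11.46/24 brd 172.16.11.255 scope global vlan11
8: vlan12 inet 172.16.12.46/24 brd 172.16.12.255 scope global vlan12'

# Check that an address exists on both the storage and storage_mgmt subnets
for net in 172.16.11 172.16.12; do
  if printf '%s\n' "$ip_out" | grep -q "inet $net\."; then
    echo "network $net present"
  else
    echo "network $net MISSING"
  fi
done
```

If either subnet is reported missing, revisit the `os-net-config` step before continuing with the migration.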

=== Migrate mon(s) and mgr(s) on the two existing CephStorage nodes

Create a ceph spec based on the default roles with the mon/mgr on the
controller nodes.
.Procedure

. To migrate the Ceph Monitor and Ceph Manager daemons to the two existing Red Hat Ceph Storage nodes, create a Red Hat Ceph Storage spec that is based on the default roles, with the mon/mgr daemons on the Controller nodes:
+
----
openstack overcloud ceph spec -o ceph_spec.yaml -y \
--stack overcloud-0 overcloud-baremetal-deployed-0.yaml
----

Deploy the Ceph cluster:

. Deploy the Red Hat Ceph Storage cluster:
+
----
openstack overcloud ceph deploy overcloud-baremetal-deployed-0.yaml \
--stack overcloud-0 -o deployed_ceph.yaml \
--network-data ~/oc0-network-data.yaml \
--ceph-spec ~/ceph_spec.yaml
----
+
[NOTE]
The `ceph_spec.yaml` file, which is the OSP-generated description of the Red Hat Ceph Storage cluster, is used later in the process as the basic template that cephadm requires to update the status and information of the daemons.

*Note*:

The ceph_spec.yaml, which is the OSP-generated description of the ceph cluster,
will be used, later in the process, as the basic template required by cephadm
to update the status/info of the daemons.
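For orientation, the following is a minimal sketch of what such a generated spec can contain. The hostnames and layout are illustrative only; the real file is produced by the `openstack overcloud ceph spec` command and includes additional service definitions:

```yaml
# Illustrative fragment of an OSP-generated ceph_spec.yaml (hostnames are examples)
service_type: mon
placement:
  hosts:
    - oc0-controller-0
    - oc0-controller-1
    - oc0-controller-2
---
service_type: mgr
placement:
  hosts:
    - oc0-controller-0
    - oc0-controller-1
    - oc0-controller-2
```

Later in this procedure, the placement lists in this file are edited so that the migrated daemons are scheduled on the Red Hat Ceph Storage nodes instead of the Controller nodes.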

Check the status of the cluster:

. Check the status of the cluster:
+
----
[ceph: root@oc0-controller-0 /]# ceph -s
cluster:
@@ -132,7 +101,7 @@ Check the status of the cluster:
usage: 43 MiB used, 400 GiB / 400 GiB avail
pgs: 1 active+clean
----

+
----
[ceph: root@oc0-controller-0 /]# ceph orch host ls
HOST ADDR LABELS STATUS
@@ -143,40 +112,36 @@ oc0-controller-1 192.168.24.23 _admin mgr mon
oc0-controller-2 192.168.24.13 _admin mgr mon
----

The goal of the next section is to migrate the oc0-controller-{1,2} daemons
into oc0-ceph-{0,1} as the very basic scenario that demonstrates that you can
actually make this kind of migration using cephadm.

=== Migrate oc0-controller-1 into oc0-ceph-0

ssh into controller-0, then

. Log in to the `controller-0` node and open a cephadm shell, mounting the local directory that contains the Red Hat Ceph Storage specs into the container:
+
----
cephadm shell -v /home/ceph-admin/specs:/specs
----

ssh into ceph-0, then

. Log in to the `ceph-0` node and watch the container list so that you can see the new mon/mgr daemons when they are deployed:
+
----
sudo watch podman ps  # watch the new mon/mgr being deployed here
----

(optional) if mgr is active in the source node, then:

. Optional: If the mgr daemon is active on the source node, fail it over:
+
----
ceph mgr fail <mgr instance>
----
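To find the active mgr instance to pass to `ceph mgr fail`, you can query `ceph mgr stat`, which reports the active instance name. A minimal sketch that parses a sample of its JSON output; the instance name and the exact output shape here are illustrative assumptions, not captured from a real cluster:

```shell
# Sample `ceph mgr stat` output (illustrative); on a live cluster:
#   mgr_stat=$(ceph mgr stat)
mgr_stat='{"active_name":"oc0-controller-1.mtxohd","num_standby":2}'

# Pull out the active mgr instance name to use with `ceph mgr fail`
active=$(printf '%s' "$mgr_stat" | sed -n 's/.*"active_name":"\([^"]*\)".*/\1/p')
echo "$active"
```

You would then run `ceph mgr fail "$active"` only if the active instance is on the node being drained.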

From the cephadm shell, remove the labels on oc0-controller-1

. From the cephadm shell, remove the labels on `oc0-controller-1`:
+
----
for label in mon mgr _admin; do
    ceph orch host label rm oc0-controller-1 $label;
done
----

Add the missing labels to oc0-ceph-0

. Add the missing labels to `oc0-ceph-0`:
+
----
[ceph: root@oc0-controller-0 /]#
> for label in mon mgr _admin; do ceph orch host label add oc0-ceph-0 $label; done
@@ -185,8 +150,8 @@ Added label mgr to host oc0-ceph-0
Added label _admin to host oc0-ceph-0
----

Drain and force-remove the oc0-controller-1 node

. Drain and force-remove the `oc0-controller-1` node:
+
----
[ceph: root@oc0-controller-0 /]# ceph orch host drain oc0-controller-1
Scheduled to remove the following daemons from host 'oc0-controller-1'
@@ -196,7 +161,7 @@ mon oc0-controller-1
mgr oc0-controller-1.mtxohd
crash oc0-controller-1
----

+
----
[ceph: root@oc0-controller-0 /]# ceph orch host rm oc0-controller-1 --force
Removed host 'oc0-controller-1'
@@ -209,10 +174,10 @@ oc0-controller-0 192.168.24.15 mgr mon _admin
oc0-controller-2 192.168.24.13 _admin mgr mon
----
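You can verify the result from the `ceph orch host ls` output. The following self-contained sketch runs the check against a sample of that output; the hostnames and addresses are the examples used in this procedure:

```shell
# Sample `ceph orch host ls` output after the removal (abridged, illustrative)
host_ls='HOST              ADDR           LABELS          STATUS
oc0-ceph-0        192.168.24.14  mon mgr _admin
oc0-controller-0  192.168.24.15  mgr mon _admin
oc0-controller-2  192.168.24.13  _admin mgr mon'

# The drained host must no longer appear in the host list
if printf '%s\n' "$host_ls" | grep -q '^oc0-controller-1'; then
  echo "oc0-controller-1 still present"
else
  echo "oc0-controller-1 removed"
fi
```

On a live cluster, substitute `host_ls=$(ceph orch host ls)` for the sample string.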

If you have only 3 mon nodes, and the drain of the node doesn't work as
expected (the containers are still there), then SSH to controller-1 and
. If you have only 3 Ceph Monitor nodes, and the drain of the node does not work as
expected (the containers are still running), log in to `oc0-controller-1` and
force-purge the containers on the node:

+
----
[root@oc0-controller-1 ~]# sudo podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
@@ -230,13 +195,14 @@ endif::[]
[root@oc0-controller-1 ~]# sudo podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
----

NOTE: Cephadm rm-cluster on a node that is not part of the cluster anymore has the
+
[NOTE]
Running `cephadm rm-cluster` on a node that is no longer part of the cluster removes all of the containers and performs some cleanup on the file system.

Before shutting the oc0-controller-1 down, move the IP address (on the same
. Before shutting down `oc0-controller-1`, move the IP address (on the same
network) to the `oc0-ceph-0` node:

+
----
mon_host = [v2:172.16.11.54:3300/0,v1:172.16.11.54:6789/0] [v2:172.16.11.121:3300/0,v1:172.16.11.121:6789/0] [v2:172.16.11.205:3300/0,v1:172.16.11.205:6789/0]
@@ -252,8 +218,14 @@ mon_host = [v2:172.16.11.54:3300/0,v1:172.16.11.54:6789/0] [v2:172.16.11.121:330
12: vlan14 inet 172.16.14.223/24 brd 172.16.14.255 scope global vlan14\ valid_lft forever preferred_lft forever
----

On the oc0-ceph-0:

. On the `oc0-ceph-0` node, add the IP address of the mon that was removed from `oc0-controller-1`, and verify that the IP address has been assigned:
+
----
$ sudo ip a add 172.16.11.121 dev vlan11
$ ip -o -4 a
----
+
----
[heat-admin@oc0-ceph-0 ~]$ ip -o -4 a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
@@ -271,17 +243,18 @@ On the oc0-ceph-0:
8: vlan12 inet 172.16.12.46/24 brd 172.16.12.255 scope global vlan12\ valid_lft forever preferred_lft forever
----

Poweroff oc0-controller-1.

Add the new mon on oc0-ceph-0 using the old IP address:
. Optional: Power off oc0-controller-1.
//kgilliga: What is the reason for powering off the controller (or not)?

. Add the new mon on oc0-ceph-0 using the old IP address:
+
----
[ceph: root@oc0-controller-0 /]# ceph orch daemon add mon oc0-ceph-0:172.16.11.121
Deployed mon.oc0-ceph-0 on host 'oc0-ceph-0'
----

Check the new container in the oc0-ceph-0 node:

. Check the new container in the oc0-ceph-0 node:
+
----
ifeval::["{build}" != "downstream"]
b581dc8bbb78 quay.io/ceph/daemon@sha256:320c364dcc8fc8120e2a42f54eb39ecdba12401a2546763b7bef15b02ce93bc4 -n mon.oc0-ceph-0... 24 seconds ago Up 24 seconds ago ceph-f6ec3ebe-26f7-56c8-985d-eb974e8e08e3-mon-oc0-ceph-0
@@ -291,9 +264,9 @@ b581dc8bbb78 registry.redhat.io/ceph/rhceph@sha256:320c364dcc8fc8120e2a42f54eb3
endif::[]
----

On the cephadm shell, backup the existing ceph_spec.yaml, edit the spec
. In the cephadm shell, back up the existing `ceph_spec.yaml` file, and edit the spec
to remove any `oc0-controller-1` entry and replace it with `oc0-ceph-0`:

+
----
cp ceph_spec.yaml ceph_spec.yaml.bkp # backup the ceph_spec.yaml file
@@ -337,8 +310,8 @@ cp ceph_spec.yaml ceph_spec.yaml.bkp # backup the ceph_spec.yaml file
service_type: mgr
----
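Because the change is a plain hostname substitution, the edit can also be scripted. A minimal sketch, assuming the hostname appears in the spec only where it should be replaced; it operates on a stand-in sample file, not the real generated spec:

```shell
# Stand-in sample of a spec fragment; in practice this is the generated ceph_spec.yaml
spec_copy=$(mktemp)
cat > "$spec_copy" <<'EOF'
service_type: mon
placement:
  hosts:
    - oc0-controller-0
    - oc0-controller-1
    - oc0-controller-2
EOF

cp "$spec_copy" "$spec_copy.bkp"                        # keep a backup, as in the manual step
sed -i 's/oc0-controller-1/oc0-ceph-0/g' "$spec_copy"   # swap the drained host for the new one
grep 'oc0-' "$spec_copy"
```

Review the result before applying it: if the drained hostname also occurs in contexts that must not change (for example, comments or unrelated service entries), edit those occurrences manually instead.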

Apply the resulting spec:

. Apply the resulting spec:
+
----
ceph orch apply -i ceph_spec.yaml
@@ -369,14 +342,14 @@ osd.default_drive_group 8 2m ago 69s oc0-ceph-0;oc0-ceph-1
pgs: 1 active+clean
----

Fix the warning by refreshing the mgr:

. Fix the warning by refreshing the mgr:
+
----
ceph mgr fail oc0-controller-0.xzgtvo
----

And at this point the cluster is clean:

+
At this point the cluster is clean:
+
----
[ceph: root@oc0-controller-0 specs]# ceph -s
cluster:
@@ -394,17 +367,10 @@ And at this point the cluster is clean:
usage: 43 MiB used, 400 GiB / 400 GiB avail
pgs: 1 active+clean
----
+
The `oc0-controller-1` is removed and powered off without leaving traces on the Red Hat Ceph Storage cluster.

oc0-controller-1 has been removed and powered off without leaving traces on the ceph cluster.

The same approach and the same steps can be applied to migrate oc0-controller-2 to oc0-ceph-1.

=== Screen Recording:

* https://asciinema.org/a/508174[Externalize a TripleO deployed Ceph cluster]
. Repeat this procedure for additional Controller nodes in your environment until you have migrated all of the Ceph Monitor and Ceph Manager daemons to the target nodes.

//== What's next

== Useful resources

* https://docs.ceph.com/en/pacific/cephadm/services/mon/#deploy-additional-monitors[cephadm - deploy additional mon(s)]
