Update generic node removal documentation to schedule a drain when removing worker nodes #324

Merged: 2 commits, May 7, 2024
79 changes: 44 additions & 35 deletions docs/modules/ROOT/pages/how-tos/cloudscale/remove_node.adoc
@@ -18,6 +18,12 @@ Steps to remove a worker node of an OpenShift 4 cluster on https://cloudscale.ch
* You have admin-level access to the cluster
* You want to remove an existing worker node in the cluster

== High-level overview

* First we identify the correct node to remove and drain it.
* Then we remove it from Kubernetes.
* Finally we remove the associated VMs.

== Prerequisites

include::partial$cloudscale/prerequisites.adoc[]
@@ -26,6 +32,42 @@ include::partial$cloudscale/prerequisites.adoc[]

include::partial$cloudscale/setup-local-env.adoc[]

== Prepare Terraform environment

include::partial$cloudscale/configure-terraform-secrets.adoc[]

include::partial$setup_terraform.adoc[]

== Drain and Remove Node

* Find the node you want to remove.
It must be the one with the highest Terraform index.
+
[source,bash]
----
# Grab JSON copy of current Terraform state
terraform state pull > .tfstate.json

# Determine the current number of worker nodes in the Terraform state
node_count=$(jq -r \
'.resources[] |
select(.module=="module.cluster.module.worker" and .type=="cloudscale_server") |
.instances | length' \
.tfstate.json)

export NODE_TO_REMOVE=$(jq --arg index "$node_count" -r \
'.resources[] |
select(.module=="module.cluster.module.worker" and .type=="cloudscale_server") |
.instances[$index|tonumber-1] |
.attributes.name | split(".") | first' \
.tfstate.json)
echo $NODE_TO_REMOVE
----
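
* Optionally, as a sanity check, verify that the extracted name corresponds to an actual node in the cluster before draining anything.
The check below assumes that the Kubernetes node name matches the VM name taken from the Terraform state; adjust it if your cluster uses a different naming scheme.
+
[source,bash]
----
# Should print exactly one node; stop and re-check the index if it doesn't
kubectl --as=cluster-admin get node "${NODE_TO_REMOVE}"
----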

* If you are working on a production cluster, you need to *schedule the node drain for the next maintenance.*
* If you are working on a non-production cluster, you may *drain and remove the node immediately.*
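
* If the drain only happens in a later maintenance window, you may want to cordon the node right away so that no new workloads are scheduled onto it in the meantime.
This is an optional extra and independent of the drain steps below; a minimal sketch:
+
[source,bash]
----
# Existing pods keep running; the node just stops accepting new pods
kubectl --as=cluster-admin cordon "${NODE_TO_REMOVE}"
----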

=== Schedule node drain (production clusters)

include::partial$drain-node-scheduled.adoc[]

=== Drain and remove node immediately

include::partial$drain-node-immediately.adoc[]

== Update Cluster Config

. Update cluster config.
@@ -58,39 +100,6 @@ popd
commodore catalog compile ${CLUSTER_ID} --push -i
----

== Prepare Terraform environment

include::partial$cloudscale/configure-terraform-secrets.adoc[]

include::partial$setup_terraform.adoc[]

== Remove Node

* Find the node you want to remove.
It has to be the one with the highest terraform index.
+
[source,bash]
----
# Grab JSON copy of current Terraform state
terraform state pull > .tfstate.json

node_count=$(jq -r \
'.resources[] |
select(.module=="module.cluster.module.worker" and .type=="cloudscale_server") |
.instances | length' \
.tfstate.json)
# Verify that the number of nodes is one more than we configured earlier.
echo $node_count

export NODE_TO_REMOVE=$(jq --arg index "$node_count" -r \
'.resources[] |
select(.module=="module.cluster.module.worker" and .type=="cloudscale_server") |
.instances[$index|tonumber-1] |
.attributes.name | split(".") | first' \
.tfstate.json)
echo $NODE_TO_REMOVE
----

=== Remove VM
== Remove VM

include::partial$delete-node.adoc[]
include::partial$delete-node-vm.adoc[]
@@ -70,7 +70,9 @@ include::partial$storage-ceph-remove-mon.adoc[]

=== Clean up the old node

include::partial$delete-node.adoc[]
include::partial$drain-node-immediately.adoc[]

include::partial$delete-node-vm.adoc[]

== Finish up

@@ -197,7 +197,9 @@ include::partial$storage-ceph-remove-mon.adoc[]

=== Clean up the old nodes

include::partial$delete-node.adoc[]
include::partial$drain-node-immediately.adoc[]

include::partial$delete-node-vm.adoc[]

== Finish up

79 changes: 44 additions & 35 deletions docs/modules/ROOT/pages/how-tos/exoscale/remove_node.adoc
@@ -19,6 +19,12 @@ Steps to remove a worker node of an OpenShift 4 cluster on https://www.exoscale.
* You have admin-level access to the cluster
* You want to remove an existing worker node in the cluster

== High-level overview

* First we identify the correct node to remove and drain it.
* Then we remove it from Kubernetes.
* Finally we remove the associated VMs.

== Prerequisites

include::partial$exoscale/prerequisites.adoc[]
@@ -27,6 +33,42 @@ include::partial$exoscale/prerequisites.adoc[]

include::partial$exoscale/setup-local-env.adoc[]

== Prepare Terraform environment

include::partial$exoscale/configure-terraform-secrets.adoc[]

include::partial$setup_terraform.adoc[]

== Drain and Remove Node

* Find the node you want to remove.
It must be the one with the highest Terraform index.
+
[source,bash]
----
# Grab JSON copy of current Terraform state
terraform state pull > .tfstate.json

# Determine the current number of worker nodes in the Terraform state
node_count=$(jq -r \
'.resources[] |
select(.module=="module.cluster.module.worker" and .type=="exoscale_compute") |
.instances | length' \
.tfstate.json)

export NODE_TO_REMOVE=$(jq --arg index "$node_count" -r \
'.resources[] |
select(.module=="module.cluster.module.worker" and .type=="exoscale_compute") |
.instances[$index|tonumber-1] |
.attributes.hostname' \
.tfstate.json)
echo $NODE_TO_REMOVE
----
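
* Optionally, verify that the extracted name corresponds to an actual node in the cluster before draining anything.
This assumes the Kubernetes node name matches the hostname from the Terraform state; adjust the check if your naming scheme differs.
+
[source,bash]
----
# Should print exactly one node; stop and re-check the index if it doesn't
kubectl --as=cluster-admin get node "${NODE_TO_REMOVE}"
----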

* If you are working on a production cluster, you need to *schedule the node drain for the next maintenance.*
* If you are working on a non-production cluster, you may *drain and remove the node immediately.*
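
* If the drain only happens in a later maintenance window, you may want to cordon the node right away so that no new workloads are scheduled onto it in the meantime.
This is optional and independent of the drain steps below; a minimal sketch:
+
[source,bash]
----
# Existing pods keep running; the node just stops accepting new pods
kubectl --as=cluster-admin cordon "${NODE_TO_REMOVE}"
----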

=== Schedule node drain (production clusters)

include::partial$drain-node-scheduled.adoc[]

=== Drain and remove node immediately

include::partial$drain-node-immediately.adoc[]

== Update Cluster Config

. Update cluster config.
@@ -59,39 +101,6 @@ popd
commodore catalog compile ${CLUSTER_ID} --push -i
----

== Prepare Terraform environment

include::partial$exoscale/configure-terraform-secrets.adoc[]

include::partial$setup_terraform.adoc[]

== Remove Node

* Find the node you want to remove.
It has to be the one with the highest terraform index.
+
[source,bash]
----
# Grab JSON copy of current Terraform state
terraform state pull > .tfstate.json

node_count=$(jq -r \
'.resources[] |
select(.module=="module.cluster.module.worker" and .type=="exoscale_compute") |
.instances | length' \
.tfstate.json)
# Verify that the number of nodes is one more than we configured earlier.
echo $node_count

export NODE_TO_REMOVE=$(jq --arg index "$node_count" -r \
'.resources[] |
select(.module=="module.cluster.module.worker" and .type=="exoscale_compute") |
.instances[$index|tonumber-1] |
.attributes.hostname' \
.tfstate.json)
echo $NODE_TO_REMOVE
----

=== Remove VM
== Remove VM

include::partial$delete-node.adoc[]
include::partial$delete-node-vm.adoc[]
@@ -151,7 +151,9 @@ include::partial$storage-ceph-remove-mon.adoc[]

=== Remove VM

include::partial$delete-node.adoc[]
include::partial$drain-node-immediately.adoc[]

include::partial$delete-node-vm.adoc[]

== Finish up

@@ -102,7 +102,9 @@ include::partial$storage-ceph-remove-mon.adoc[]

=== Clean up the old node

include::partial$delete-node.adoc[]
include::partial$drain-node-immediately.adoc[]

include::partial$delete-node-vm.adoc[]

== Finish up

@@ -1,34 +1,3 @@
. Drain the node(s)
+
[source,bash,subs="attributes+"]
----
for node in $(echo -n {node-delete-list}); do
kubectl --as=cluster-admin drain "${node}" \
--delete-emptydir-data --ignore-daemonsets
done
----
+
ifeval::["{cloud_provider}" == "cloudscale"]
ifeval::["{delete-node-type}" == "storage"]
[TIP]
====
On cloudscale.ch, we configure Rook Ceph to setup the OSDs in "portable" mode.
This configuration enables OSDs to be scheduled on any storage node.

With this configuration, we don't have to migrate OSDs hosted on the old node(s) manually.
Instead, draining a node will cause any OSDs hosted on that node to be rescheduled on other storage nodes.
====
endif::[]
endif::[]

. Delete the node(s) from the cluster
+
[source,bash,subs="attributes+"]
----
for node in $(echo -n {node-delete-list}); do
kubectl --as=cluster-admin delete node "${node}"
done
----

ifeval::["{delete-node-type}" == "storage"]
ifeval::["{delete-nodes-manually}" == "yes"]
31 changes: 31 additions & 0 deletions docs/modules/ROOT/partials/drain-node-immediately.adoc
@@ -0,0 +1,31 @@
. Drain the node(s)
+
[source,bash,subs="attributes+"]
----
for node in $(echo -n {node-delete-list}); do
kubectl --as=cluster-admin drain "${node}" \
--delete-emptydir-data --ignore-daemonsets
done
----
+
ifeval::["{cloud_provider}" == "cloudscale"]
ifeval::["{delete-node-type}" == "storage"]
[TIP]
====
On cloudscale.ch, we configure Rook Ceph to set up the OSDs in "portable" mode.
This configuration enables OSDs to be scheduled on any storage node.

With this configuration, we don't have to migrate OSDs hosted on the old node(s) manually.
Instead, draining a node will cause any OSDs hosted on that node to be rescheduled on other storage nodes.
====
endif::[]
endif::[]

. Delete the node(s) from the cluster
+
[source,bash,subs="attributes+"]
----
for node in $(echo -n {node-delete-list}); do
kubectl --as=cluster-admin delete node "${node}"
done
----
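
. Optionally, verify that the node(s) are gone and that the evicted workloads have been rescheduled.
The commands below are a quick, generic sanity check; adapt them to your own verification process.
+
[source,bash]
----
# The deleted node(s) should no longer appear in this list
kubectl --as=cluster-admin get nodes

# Pods that aren't Running or Completed yet may still be rescheduling
kubectl --as=cluster-admin get pods -A --no-headers | grep -vE 'Running|Completed'
----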