Add how-to for migrating to Cilium
simu committed Oct 16, 2023
1 parent 7c4ecf8 commit fb7898c
Showing 4 changed files with 259 additions and 1 deletion.
218 changes: 218 additions & 0 deletions docs/modules/ROOT/pages/how-tos/network/migrate-to-cilium.adoc
@@ -0,0 +1,218 @@
= Migrate to Cilium CNI

== Prerequisites

* `cluster-admin` privileges
* `kubectl`
* `jq`
* Working `commodore`

// TODO: kube-proxy replacement?

== Prepare for migration

include::partial$create-alertmanager-silence-all-projectsyn.adoc[]

. Select cluster
+
[source,bash]
----
export CLUSTER_ID=c-cluster-id-1234 <1>
export COMMODORE_API_URL=https://api.syn.vshn.net <2>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" \
"${COMMODORE_API_URL}/clusters/${CLUSTER_ID}" | jq -r '.tenant')
export KUBECONFIG=/path/to/cluster/kubeconfig <3>
----
<1> Replace with the Project Syn cluster ID of the cluster to migrate.
<2> Replace with the URL of the Lieutenant API on which the cluster is registered.
<3> Ensure that `kubectl` commands are executed against the cluster you're migrating.

. Disable ArgoCD auto sync for component `openshift4-nodes`
+
:argo_app: openshift4-nodes
+
include::partial$disable-argocd-autosync.adoc[]

. Disable the cluster-network-operator.
This is necessary to ensure that we can migrate to Cilium without the cluster-network-operator trying to interfere.
+
TODO: Figure out if we need to scale down the upgrade operator
+
[source,bash]
----
kubectl --as=cluster-admin patch clusterversion version \
  --type=merge \
  -p '
  {"spec":{"overrides":[
    {
      "kind": "Deployment",
      "group": "apps",
      "name": "network-operator",
      "namespace": "openshift-network-operator",
      "unmanaged": true
    }
  ]}}'
----
+
[source,bash]
----
kubectl --as=cluster-admin -n openshift-network-operator \
scale deploy network-operator --replicas=0
----

. Remove network operator state
+
[source,bash]
----
kubectl --as=cluster-admin -n openshift-network-operator \
delete configmap applied-cluster
----

. Pause all machine config pools
+
[source,bash]
----
for mcp in $(kubectl get mcp -o name); do
  kubectl --as=cluster-admin patch "$mcp" --type=merge -p '{"spec": {"paused": true}}'
done
----
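
Before continuing, it can help to confirm that every machine config pool really is paused. The following is a minimal sketch, not part of the original procedure: the helper name `all_pools_paused` is hypothetical, and it only inspects text piped into it, so it makes no assumptions about cluster access.

```bash
# Hypothetical helper: reads "<pool-name> <paused>" pairs from stdin, one per
# line, and succeeds only if at least one pool is listed and every pool
# reports paused=true. Pools that aren't paused are printed to stderr.
all_pools_paused() {
  local name paused seen=0 rc=0
  while read -r name paused || [ -n "$name" ]; do
    [ -n "$name" ] || continue          # skip blank lines
    seen=1
    if [ "$paused" != "true" ]; then    # an unset .spec.paused arrives as an empty string
      echo "not paused: $name" >&2
      rc=1
    fi
  done
  [ "$seen" -eq 1 ] || rc=1             # no input at all counts as a failure
  return "$rc"
}

# Usage against the cluster (commented out because it needs kubectl access):
# kubectl get mcp \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.paused}{"\n"}{end}' \
#   | all_pools_paused && echo "all MCPs paused"
```

Treating empty input as a failure is a deliberate choice in this sketch: a failing `kubectl` call then can't be mistaken for "all pools paused".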

== Migrate to Cilium

. Get local cluster working directory
+
[source,bash]
----
commodore catalog compile "$CLUSTER_ID" <1>
----
<1> We recommend switching to an empty directory to run this command.
Alternatively, switch to your existing directory for the cluster.

. Enable component `cilium`
+
[source,bash]
----
pushd inventory/classes/"${TENANT_ID}"
yq -i '.applications += ["cilium"]' "${CLUSTER_ID}.yml"
----

. Update `upstreamRules` for monitoring
+
[source,bash]
----
yq -i ".parameters.openshift4_monitoring.upstreamRules.networkPlugin = \"Cilium\"" \
"${CLUSTER_ID}.yml"
----

. Update component `networkpolicy` config
+
[source,bash]
----
yq eval -i '.parameters.networkpolicy.networkPlugin = "cilium"' \
"${CLUSTER_ID}.yml"
yq eval -i '.parameters.networkpolicy.ignoredNamespaces = ["openshift-oauth-apiserver"]' \
"${CLUSTER_ID}.yml"
----

. Configure component `cilium`
+
.Configure required parameters for strict kube-proxy replacement
[source,bash]
----
yq -i '.parameters.cilium.cilium_helm_values.kubeProxyReplacement = "strict"' \
"${CLUSTER_ID}.yml"
yq -i '.parameters.cilium.cilium_helm_values.k8sServiceHost = "api-int.${openshift:baseDomain}"' \
"${CLUSTER_ID}.yml"
yq -i '.parameters.cilium.cilium_helm_values.k8sServicePort = "6443"' \
"${CLUSTER_ID}.yml"
----
+
[source,bash]
----
POD_CIDR=$(kubectl get network.config cluster \
  -o jsonpath='{.spec.clusterNetwork[0].cidr}')
HOST_PREFIX=$(kubectl get network.config cluster \
  -o jsonpath='{.spec.clusterNetwork[0].hostPrefix}')
# Only write overrides when the cluster deviates from /23 and 10.128.0.0/14
if [ "${HOST_PREFIX}" != "23" ]; then
  yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4MaskSize = "'"${HOST_PREFIX}"'"' \
    "${CLUSTER_ID}.yml"
fi
if [ "${POD_CIDR}" != "10.128.0.0/14" ]; then
  yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4PodCIDR = "'"${POD_CIDR}"'"' \
    "${CLUSTER_ID}.yml"
fi
----

. Commit changes
+
[source,bash]
----
git commit -am "Migrate ${CLUSTER_ID} to Cilium"
git push origin master
popd
----

. Compile catalog
+
[source,bash]
----
commodore catalog compile "${CLUSTER_ID}"
----

. Patch cluster network config
+
TODO: Should we manage this through component `openshift4-networking`? If so, only as patches, or should we try to manage the objects? How do we ensure that managing the objects doesn't break existing clusters? IMPORTANT: If we manage it, this step must move above `Compile catalog`.
+
[source,bash]
----
kubectl --as=cluster-admin patch network.config cluster \
--type=merge -p '{"spec":{"networkType":"Cilium"},"status":null}'
kubectl --as=cluster-admin patch network.operator cluster \
--type=merge -p '{"spec":{"defaultNetwork":{"type":"Cilium"},"deployKubeProxy":false},"status":null}'
----

. TODO: Scale down/delete existing CNI?

. Apply Cilium manifests
+
[source,bash]
----
kubectl --as=cluster-admin apply -Rf catalog/manifests/cilium/
----

. Wait until Cilium CNI is up and running
+
[source,bash]
----
kubectl -n cilium get pods -w
----
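
As a non-interactive alternative to watching the pod list, the wait can be scripted. The following is a minimal sketch under stated assumptions: the helper name `pods_ready` is hypothetical, it parses the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...), and it treats anything other than `Running` with all containers ready as not ready, which would also flag completed Job pods if the namespace contained any.

```bash
# Hypothetical helper: reads `kubectl get pods --no-headers` output from stdin
# and succeeds only if at least one pod is listed and every pod is Running
# with all of its containers ready. Other pods are printed to stderr.
pods_ready() {
  awk '
    {
      seen = 1
      split($2, ready, "/")             # READY column, for example "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) {
        print "not ready: " $1 > "/dev/stderr"
        bad = 1
      }
    }
    END { if (!seen || bad) exit 1 }
  '
}

# Usage against the cluster (commented out because it needs kubectl access):
# until kubectl -n cilium get pods --no-headers | pods_ready; do
#   sleep 10
# done
```

Because empty input fails the check, the loop keeps waiting while the namespace has no pods yet or while `kubectl` itself is failing.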

== Finalize migration

. Re-enable cluster network operator
+
[source,bash]
----
kubectl --as=cluster-admin -n openshift-network-operator \
scale deployment network-operator --replicas=1
kubectl --as=cluster-admin patch clusterversion version \
--type=merge -p '{"spec":{"overrides":null}}'
----

. Unpause MCPs
+
[source,bash]
----
for mcp in $(kubectl get mcp -o name); do
  kubectl --as=cluster-admin patch "$mcp" --type=merge -p '{"spec":{"paused":false}}'
done
----

include::partial$enable-argocd-autosync.adoc[]

== Clean up alert silence

:argo_app:

include::partial$remove-alertmanager-silence-all-projectsyn.adoc[]
@@ -0,0 +1,21 @@
// NOTE: this snippet only works correctly at the beginning of a numbered
// list. I was unable to figure out how to define the page attributes in a way
// that works for the alertmanager-silence-job.adoc partial without breaking
// the list flow.
:silence-target: all
:duration: +60 minutes
:http-method: POST
:alertmanager-endpoint: /api/v2/silences

. Silence all Project Syn alerts
+
include::partial$alertmanager-silence-job.adoc[]

. Extract Alertmanager silence ID from job logs
+
[source,bash]
----
silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
jq -r '.silenceID')
----
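+
If the job log doesn't contain the expected JSON, `jq -r` prints `null` and `silence_id` silently ends up unusable for the later removal step. The following is a minimal sketch of a guard, assuming that Alertmanager silence IDs have the shape of a UUID; the helper name `is_silence_id` is hypothetical.
+
```bash
# Hypothetical helper: succeeds only if the argument looks like a UUID.
# The UUID shape is an assumption about Alertmanager silence IDs.
is_silence_id() {
  printf '%s\n' "$1" \
    | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

# Usage after the extraction step:
# is_silence_id "${silence_id}" \
#   || echo "unexpected silence ID: '${silence_id}'" >&2
```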

3 changes: 2 additions & 1 deletion docs/modules/ROOT/partials/nav.adoc
@@ -100,7 +100,8 @@
** xref:oc4:ROOT:how-tos/authentication/disable-self-provisioning.adoc[Disable project self-provisioning]
** xref:oc4:ROOT:explanations/sudo.adoc[]
// Networking
* Networking
** xref:oc4:ROOT:how-tos/network/migrate-to-cilium.adoc[]
* Ingress
** xref:oc4:ROOT:how-tos/ingress/self-signed-ingress-cert.adoc[]
@@ -0,0 +1,18 @@
// NOTE: this snippet only works correctly at the beginning of a numbered
// list. I was unable to figure out how to define the page attributes in a way
// that works for the alertmanager-silence-job.adoc partial without breaking
// the list flow.
:alertmanager-endpoint: /api/v2/silence/${silence_id}
:silence-target: all
:http-method: DELETE

. Remove silence in Alertmanager
+
include::partial$alertmanager-silence-job.adoc[]

. Clean up Alertmanager silence jobs
+
[source,bash,subs="attributes+"]
----
kubectl --as=cluster-admin -n openshift-monitoring delete jobs -l app=silence-{silence-target}-alerts
----
