Merge pull request #197 from danskernesdigitalebibliotek/feature/update-tools

Lagoon + AKS upgrades
achton authored Nov 13, 2023
2 parents a3b7cfb + d3b6c29 commit c7c12b6
Showing 13 changed files with 82 additions and 64 deletions.
44 changes: 44 additions & 0 deletions docs/runbooks/rabbitmq-broker.md
@@ -0,0 +1,44 @@
# RabbitMQ broker force start

## When to use

Use this runbook when PR environments are no longer being created, the
`lagoon-core-broker-<n>` pods are missing or not running, and the container logs
contain errors like `Error while waiting for Mnesia tables:
{timeout_waiting_for_tables`.

This situation is caused by the RabbitMQ broker not starting correctly.
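A quick way to confirm these symptoms, as a sketch (pod names may differ in
your environment):

```shell
# List the broker pods and scan recent logs for Mnesia table timeouts
kubectl -n lagoon-core get pods | grep broker
kubectl -n lagoon-core logs pod/lagoon-core-broker-0 --tail=100 | grep -i mnesia
```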

## Prerequisites

* A [dplsh session](using-dplsh.md) with `DPLPLAT_ENV` exported.

## Procedure

You are going to exec into the pod, stop the RabbitMQ application, and then
start it with [the `force_boot`
feature](https://www.rabbitmq.com/rabbitmqctl.8.html#force_boot) so that it can
perform its Mnesia sync correctly.

Exec into the pod:

```shell
dplsh:~/host_mount$ kubectl -n lagoon-core exec -ti pod/lagoon-core-broker-0 -- sh
```

Stop RabbitMQ:

```shell
/ $ rabbitmqctl stop_app
Stopping rabbit application on node
rabbit@lagoon-core-broker-0.lagoon-core-broker-headless.lagoon-core.svc.cluster.local ...
```

Immediately afterwards, start it using the `force_boot` flag:

```shell
/ $ rabbitmqctl force_boot
```

Then exit the shell and check the container logs for one of the broker pods. It
should start without errors.
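For example, a minimal check (assuming the same pod name as above):

```shell
# Follow the broker logs and verify a clean startup
kubectl -n lagoon-core logs -f pod/lagoon-core-broker-0
```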
30 changes: 11 additions & 19 deletions docs/runbooks/upgrading-aks.md
@@ -62,32 +62,24 @@ task infra:provision

### Upgrade the cluster

-Upgrade the control-plane:
+Initiate a cluster upgrade. This will upgrade the control plane and node pools
+together. See the [AKS documentation](https://learn.microsoft.com/en-us/azure/aks/upgrade-aks-cluster?tabs=azure-cli#upgrade-an-aks-cluster)
+for background info on this operation.

 1. Update the `control_plane_version` reference in `infrastructure/environments/<environment>/infrastructure/main.tf`
    and run `task infra:provision` to apply. You can skip patch-versions, but you
    can only do [one minor-version at a time](https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster?tabs=azure-cli#check-for-available-aks-cluster-upgrades)

-2. Monitor the upgrade as it progresses. A control-plane upgrade is usually performed
-   in under 5 minutes.
+2. Monitor the upgrade as it progresses. The control-plane upgrade is usually
+   performed in under 5 minutes. Monitor via eg. `watch -n 5 kubectl version`.

-   Monitor via eg.
-
-   ```shell
-   watch -n 5 kubectl version
-   ```
-
-Then upgrade the system, admin and application node-pools in that order one by
-one.
-
-1. Update the `pool_[name]_version` reference in
-   `infrastructure/environments/<environment>/infrastructure/main.tf`.
-   The same rules applies for the version as with `control_plane_version`.
+3. AKS will then automatically upgrade the system, admin and application
+   node-pools.

-2. Monitor the upgrade as it progresses. Expect the provisioning of and workload
-   scheduling to a single node to take about 5-10 minutes. In particular be aware
-   that the admin node-pool where harbor runs has a tendency to take a long time
-   as the harbor pvcs are slow to migrate to the new node.
+4. Monitor the upgrade as it progresses. Expect the provisioning of and workload
+   scheduling to a single node to take about 5-10 minutes. In particular be
+   aware that the admin node-pool where harbor runs has a tendency to take a
+   long time as the harbor pvcs are slow to migrate to the new node.

 Monitor via eg.

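For illustration, one way to monitor the node-pool rollout described above (a
sketch assuming kubectl access; not necessarily the command in the truncated
context):

```shell
# Each node reports its kubelet version; watch nodes roll onto the new version
watch -n 10 kubectl get nodes
```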
2 changes: 1 addition & 1 deletion docs/runbooks/upgrading-lagoon.md
@@ -48,7 +48,7 @@ curl -s https://uselagoon.github.io/lagoon-charts/index.yaml \
    * `task lagoon:provision:core`
 2. Upgrade Lagoon remote
    1. Bump the chart version `VERSION_LAGOON_REMOTE` in
-      `infrastructure/environments/dplplat01/lagoon/lagoon-versions.env`
+      `infrastructure/environments/<env>/lagoon/lagoon-versions.env`
    2. Perform a helm diff
       * `DIFF=1 task lagoon:provision:remote`
    3. Perform the actual upgrade
4 changes: 2 additions & 2 deletions infrastructure/Taskfile.yml
@@ -51,8 +51,8 @@ tasks:
   infra:terraform:init-upgrade:
     desc: Init and upgrade local Terraform state
     summary: |
-      Use this task the local state needs to be updated. This happens eg. when
-      terraform is upgraded.
+      Use this task when the local state needs to be updated. This happens eg.
+      when terraform is upgraded.
     deps:
       - _req_env
     dir: "{{.dir_infra}}"
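For reference, the task is invoked like any other (a sketch, assuming a dplsh
session with the environment variables required by `_req_env`):

```shell
task infra:terraform:init-upgrade
```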
6 changes: 2 additions & 4 deletions infrastructure/environments/dplplat01/infrastructure/main.tf
@@ -16,10 +16,8 @@ module "environment" {
   # When copying this value, consider leaving it out and falling back to the
   # default of 102400.
   sql_storage_mb          = 409600
-  control_plane_version   = "1.24"
-  pool_system_version     = "1.24"
-  pool_admin_version      = "1.24"
-  pool_appdefault_version = "1.24"
+  control_plane_version   = "1.25.11"
+  cluster_upgrade_channel = "patch"
 }

 # Outputs, for values that come straight from the dpl-platform-environment
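After bumping the versions like this, the change is applied with the
provisioning task referenced in the upgrading-aks runbook above:

```shell
# Re-provision to apply the new control-plane version and upgrade channel
task infra:provision
```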
15 changes: 5 additions & 10 deletions infrastructure/environments/dplplat01/lagoon/lagoon-versions.env
@@ -1,13 +1,8 @@
 # Get Lagoon versions from the Chart.yml - inspect appVersion to determine which
 # version of lagoon is installed
-
-# Current appVersion for remote and core is v2.15.0
-# https://github.com/uselagoon/lagoon-charts/blob/main/charts/lagoon-core/Chart.yaml#L20
-VERSION_LAGOON_CORE=1.29.0
-# https://github.com/uselagoon/lagoon-charts/blob/main/charts/lagoon-remote/Chart.yaml#L21
-VERSION_LAGOON_REMOTE=0.77.0
-
-# This should match the currently installed version of Lagoon Remote. It
-# actually maps to an image tag here: https://hub.docker.com/r/uselagoon/kubectl-build-deploy-dind/tags
-# See https://github.com/uselagoon/lagoon-charts/releases/tag/lagoon-core-1.2.0
+#
+# Current appVersion for remote and core is v2.16.0
+# https://github.com/uselagoon/lagoon-charts/blob/main/charts/lagoon-core/Chart.yaml#L24
+VERSION_LAGOON_CORE=1.39.0
+# https://github.com/uselagoon/lagoon-charts/blob/main/charts/lagoon-remote/Chart.yaml#L22
+VERSION_LAGOON_REMOTE=0.86.0
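To double-check which Lagoon appVersion a chart version maps to, a sketch along
these lines can be used (assumes curl and grep are available; the chart index
URL appears in the upgrading-lagoon runbook above):

```shell
# Print the index entry surrounding the pinned lagoon-core chart version
curl -s https://uselagoon.github.io/lagoon-charts/index.yaml \
  | grep -B 12 "version: 1.39.0"
```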
@@ -5,6 +5,7 @@ resource "azurerm_kubernetes_cluster" "cluster" {
   resource_group_name = azurerm_resource_group.rg.name
   dns_prefix          = var.environment_name
   kubernetes_version  = var.control_plane_version
+  automatic_channel_upgrade = var.cluster_upgrade_channel

   # We use a single manually scaled node pool in a single availability zone.
   default_node_pool {
@@ -21,8 +22,6 @@ resource "azurerm_kubernetes_cluster" "cluster" {
     # Attach the cluster to our private network.
     vnet_subnet_id = azurerm_subnet.aks.id

-    orchestrator_version = var.pool_system_version
-
     # High Availability is not a high enough priority to warrant the extra
     # complexity and cost of having a multi-zonal cluster.
     zones = ["1"]
@@ -67,7 +66,6 @@ resource "azurerm_kubernetes_cluster_node_pool" "admin" {
   name                  = "admin"
   kubernetes_cluster_id = azurerm_kubernetes_cluster.cluster.id
   vnet_subnet_id        = azurerm_subnet.aks.id
-  orchestrator_version  = var.pool_admin_version
   node_labels = {
     "noderole.dplplatform" : "admin"
   }
@@ -97,7 +95,6 @@ resource "azurerm_kubernetes_cluster_node_pool" "app_default" {
   name                  = "appdefault"
   kubernetes_cluster_id = azurerm_kubernetes_cluster.cluster.id
   vnet_subnet_id        = azurerm_subnet.aks.id
-  orchestrator_version  = var.pool_appdefault_version
   node_labels = {
     "noderole.dplplatform" : "application"
   }
@@ -45,6 +45,7 @@ resource "azurerm_mariadb_server" "sql" {

   # Lagoon does not yet support TLS.
   ssl_enforcement_enabled          = false
+  ssl_minimal_tls_version_enforced = "TLSEnforcementDisabled"
 }

 # Allow any inbound connections
@@ -1,7 +1,7 @@
 # Provision a wildcard record for the setup.
 # This will be replaced or removed before the platform goes live.
-resource "dnsimple_record" "aks_ingress" {
-  domain    = var.base_domain
+resource "dnsimple_zone_record" "aks_ingress" {
+  zone_name = var.base_domain
   name  = "*.${var.environment_name}.dpl"
   value = azurerm_public_ip.aks_ingress.ip_address
   type  = "A"
@@ -51,7 +51,7 @@ output "ingress_ip" {

output "ingress_hostname" {
description = "DNS wildcard domain that points at the ingress ip"
value = dnsimple_record.aks_ingress.hostname
value = dnsimple_zone_record.aks_ingress.qualified_name
}

output "keycloak_admin_pass_key_name" {
@@ -2,7 +2,7 @@ terraform {
   required_providers {
     dnsimple = {
       source  = "dnsimple/dnsimple"
-      version = ">=0.6.0"
+      version = ">=1.3.1"
     }

     azuread = {
@@ -9,6 +9,12 @@ variable "control_plane_version" {
   type = string
 }

+variable "cluster_upgrade_channel" {
+  description = "Which channel to use for automatic cluster upgrades. Valid values are 'stable', 'rapid', 'patch' and 'node-image'."
+  type        = string
+  default     = "patch"
+}
+
 variable "domain_ttl" {
   description = "The Time To Live for the provisioned domains."
   type        = number
@@ -79,21 +85,6 @@ variable "node_pool_system_vm_sku" {
   type = string
 }

-variable "pool_admin_version" {
-  description = "Which version Kubernetes to use for the admin node-pool. Must be compatible with control_plane_version, that is, cannot be higher and should at most trail one minor version."
-  type        = string
-}
-
-variable "pool_appdefault_version" {
-  description = "Which version Kubernetes to use for the default app node-pool. Must be compatible with control_plane_version, that is, cannot be higher and should at most trail one minor version."
-  type        = string
-}
-
-variable "pool_system_version" {
-  description = "Which version Kubernetes to use for the system node-pool. Must be compatible with control_plane_version, that is, cannot be higher and should at most trail one minor version."
-  type        = string
-}
-
 variable "random_seed" {
   description = "Any random value used to seed the random parts of the provisioned resources"
   type        = string
10 changes: 5 additions & 5 deletions tools/dplsh/Dockerfile
@@ -13,19 +13,19 @@ FROM hashicorp/terraform:1.5.3 as terraform
 FROM mcr.microsoft.com/azure-cli:2.47.0

 # See https://github.com/go-task/task/releases
-ARG TASK_VERSION=v3.15.2
+ARG TASK_VERSION=v3.31.0
 # See https://github.com/stern/stern/releases
-ARG STERN_RELEASE=1.21.0
+ARG STERN_RELEASE=1.26.0
 # See https://github.com/uselagoon/lagoon-cli/releases
-ARG LAGOON_CLI_RELEASE=v0.14.0
+ARG LAGOON_CLI_RELEASE=v0.19.0
 # See https://github.com/kubernetes-sigs/krew/releases
-ARG KREW_VERSION=v0.4.3
+ARG KREW_VERSION=v0.4.4
 # https://github.com/alenkacz/cert-manager-verifier/releases
 # Exclude the "v" from the version.
 ARG CERT_MANAGER_VERIFIER_VERSION=0.3.0

 # This version can be bumped as we upgrade the cluster minor version.
-ARG KUBECTL_VERSION=v1.20.9
+ARG KUBECTL_VERSION=v1.25.11

 LABEL org.opencontainers.image.source https://github.com/danskernesdigitalebibliotek/dpl-platform
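When bumping `KUBECTL_VERSION`, the cluster's current server version can be
checked first (a sketch assuming kubectl and jq are available):

```shell
# The bundled kubectl should track the cluster's minor version
kubectl version -o json | jq -r .serverVersion.gitVersion
```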
