
Ghaf-infra: Terraform

This directory contains the root terraform module describing the ghaf CI setup in Azure.

For an architectural description, see README-azure.md, originally from PR#35.

The setup uses Nix to build disk images, uploads them to Azure, and then boots virtual machines off of them.

Images are considered "appliance images", meaning the Nix code that describes their configuration also describes the exact purpose of the machine (there is no two-stage deployment process; the machine does what it is supposed to do right after bootup). This makes it possible to remove the need for e.g. ssh access as much as possible.

Machines are considered ephemeral: every change in the appliance image / nixos configuration causes a new image to be built and a new VM to be booted with that new image.
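
In practice, this means a configuration change is rolled out by re-running the normal Terraform workflow. A minimal sketch, assuming the image changes together with the nixos configuration:

# After editing the nixos configuration, re-apply; terraform rebuilds the
# image, uploads it, and replaces the VM that boots from it:
$ terraform plan    # shows the image and the dependent VM being replaced
$ terraform apply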

Getting Started

This document assumes you have the Nix package manager installed on your local host.

Clone this repository:

$ git clone https://github.com/tiiuae/ghaf-infra.git
$ cd ghaf-infra

Bootstrap nix-shell with the required dependencies:

# Start a nix-shell with required dependencies:
$ nix-shell

# Authenticate with az login:
$ az login

# Terraform commands are executed under the terraform directory:
$ cd terraform/

All commands in this document are executed from nix-shell inside the terraform directory.

Directory Structure

terraform
├── azarm
├── persistent
│   ├── binary-cache-sigkey
│   ├── binary-cache-storage
│   ├── builder-ssh-key
│   └── workspace-specific
├── state-storage
│   └── tfstate-storage.tf
├── modules
│   ├── azurerm-linux-vm
│   └── azurerm-nix-vm-image
├── binary-cache.tf
├── builder.tf
├── jenkins-controller.tf
└── main.tf
  • The terraform directory contains the root terraform deployment files, with the VM configurations binary-cache.tf, builder.tf, and jenkins-controller.tf matching the components described in the components section of README-azure.md.
  • The terraform/azarm directory contains the terraform configuration for the Azure aarch64 builder, which is used from the ghaf github-actions build.yml workflow to build aarch64 targets for authorized PRs pre-merge. azarm is disconnected from the root terraform module: it is a separate configuration with its own state.
  • The terraform/persistent directory contains the terraform configuration for the parts of the infrastructure that are considered persistent: resources defined under terraform/persistent are not removed even if the ghaf-infra instance is otherwise removed. Examples of such persistent resources are the binary cache storage and the binary cache signing key. There may be many 'persistent' infrastructure instances: currently, the dev and prod deployments have their own instances of the persistent resources. Section Multiple Environments with Terraform Workspaces discusses this topic in more detail.
  • The terraform/state-storage directory contains the terraform configuration for the ghaf-infra remote backend state storage, using an Azure storage blob. See section Initializing Azure State and Persistent Data for more details.
  • The terraform/modules directory contains terraform modules used from the ghaf-infra VM configurations to build, upload, and spin up VMs from Nix images in Azure.

Initializing Azure State and Persistent Data

This project stores the terraform state in remote storage, in an Azure storage blob, as configured in tfstate-storage.tf. The benefits of such a remote storage setup are well outlined in storing state in azure storage and terraform backend configuration.

To initialize the backend storage, use the terraform-init.sh script:

# Inside the terraform directory
$ ./terraform-init.sh
[+] Initializing state storage
[+] Initializing persistent data
...
[+] Running terraform init

terraform-init.sh does nothing if the initialization has already been done. In other words, it is safe to run the script multiple times; it will not destroy or re-initialize anything if the init was already executed.
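
Once initialized, you can optionally inspect the state storage with the az CLI. A minimal sketch; the placeholder names below are illustrative, the actual storage account and container are configured in tfstate-storage.tf:

# List the state blobs (replace the placeholders with the values
# configured in tfstate-storage.tf):
$ az storage blob list \
    --account-name <state-storage-account> \
    --container-name <state-container> \
    -o table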

In addition to the shared terraform state, some of the infrastructure resources are shared between the ghaf-infra instances. terraform-init.sh initializes the persistent configuration defined under terraform/persistent. There may be many 'persistent' infrastructure instances: currently, the dev and prod deployments have their own instances of the persistent resources. Section Multiple Environments with Terraform Workspaces discusses this topic in more detail.

Multiple Environments with Terraform Workspaces

To support infrastructure development in isolated environments, this project uses terraform workspaces. The main reasons for using terraform workspaces include:

  • Different workspaces allow deploying different instances of ghaf-infra. Each instance has completely separate state data, making it possible to deploy dev, prod, or even private development instances of ghaf-infra. This makes it possible to develop and test infrastructure changes in a private development environment before proposing changes to the shared (e.g. dev or prod) environments. The configuration codebase is the same for all environments, with the differentiating options defined in main.tf.
  • Parts of the ghaf-infra infrastructure are persistent and shared between different environments. As an example, private dev environments share the binary cache storage. This arrangement makes it possible to treat, for instance, dev and private ghaf-infra instances as dispensable: ghaf-infra instances can be temporary and short-lived, as it is easy to spin up new environments without losing any valuable data. The persistent data is configured outside the root ghaf-infra terraform deployment, in the terraform/persistent directory. There may be many 'persistent' infrastructure instances: currently, the dev and prod deployments have their own instances of the persistent resources, so the dev and prod instances of ghaf-infra do not share any persistent data. As an example, the dev and prod deployments of ghaf-infra have separate binary cache storage. The binding to persistent resources from ghaf-infra is done in main.tf, based on the terraform workspace name and resource location. Persistent data initialization is done automatically by the terraform-init.sh script.
  • Currently, the resources defined under terraform/persistent (the binary cache storage, the binary cache signing key, and the builder ssh key) are 'persistent', meaning the dev and prod instances each have their own copies of these resources.

To help set up distinct copies of ghaf-infra, you can use terraform workspaces directly from the command line or use the helper script terraform-playground.sh. Below, as an example, we use terraform-playground.sh to set up a private deployment instance of ghaf-infra:

# Activate private development environment
$ ./terraform-playground.sh activate
# ...
[+] Done, use terraform [validate|plan|apply] to work with your dev infra

This sets up a terraform workspace for your private development environment:

# List the terraform workspaces:
$ terraform workspace list
  default
  dev
* henrirosten       # <-- indicates active workspace
  prod
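
If you prefer plain terraform commands over the helper script, a rough equivalent of activate is to create or select a workspace named after your user. This is an assumption about what the script does; check terraform-playground.sh for the details:

# Rough plain-terraform equivalent of 'activate' (assumption: the script
# mainly creates or selects a per-user workspace):
$ terraform workspace new "$USER" || terraform workspace select "$USER"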

Terraform Workflow

The following describes the intended workflow, with commands executed from the nix-shell.

Once you are ready to deploy your terraform or nix configuration changes, the following sequence of commands typically takes place:

# Inside the terraform directory

# Format the terraform code files:
$ terraform fmt -recursive

# Validate the terraform changes:
$ terraform validate

# Make sure you deploy to the correct ghaf-infra instance.
# Use terraform workspace select <workspace_name> to switch workspaces
$ terraform workspace list
  default
  dev
* henrirosten      # <== This example deploys to private dev environment
  prod

# Show what actions terraform would take on apply:
$ terraform plan

# Apply your configuration changes:
$ terraform apply

Once terraform apply completes, the private development infrastructure is deployed. You can now play around in your isolated copy of the infrastructure, testing and updating your changes and making sure they work as expected before merging them.

Destroying Test Environment

Once the configuration changes have been tested, the private development environment can be destroyed:

# Destroy the private terraform workspace using the helper script
$ ./terraform-playground.sh destroy

# Alternatively, you can use terraform command directly
$ terraform workspace select <workspace_name>
$ terraform apply -destroy

The above command(s) remove all the resources that were created for the given environment.

Changing Azure Deploy Location

By default, ghaf-infra is deployed to Azure location northeurope (North Europe). However, ghaf-infra resources can be deployed to other Azure locations too, with the following caveats:

  • Ghaf-infra has been tested in a limited set of locations. terraform-init.sh exits with an error if you try to initialize ghaf-infra in a non-supported (non-tested) location. When deploying to a new, previously unsupported location, you need to modify terraform-init.sh.
  • For a full list of available Azure location names, run az account list-locations -o table in the ghaf-infra devshell.
  • Not all Azure VM sizes or other resources are available in all Azure locations. You can check the availability of specific resources through the Azure products-by-region page, e.g.: https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/?regions=europe-north&products=virtual-machines. Alternatively, you can list the VM sizes per location with the az vm list-sizes command from the ghaf-infra devshell, for instance: az vm list-sizes --location 'northeurope' -o table.
  • Your Azure subscription quota limits impact the ability to deploy ghaf-infra; you might need to increase the vCPU quotas for your subscription via the Azure web portal. See more information at https://learn.microsoft.com/en-us/azure/quotas/quotas-overview. You can check your quota usage from the Azure web portal or with az vm list-usage, for instance: az vm list-usage --location "northeurope" -o table. A combined example follows this list.
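
For instance, the checks mentioned above can be combined before initializing a new location; a short sketch using swedencentral:

# Check that the location exists, which VM sizes it offers,
# and your current quota usage there:
$ az account list-locations -o table
$ az vm list-sizes --location 'swedencentral' -o table
$ az vm list-usage --location 'swedencentral' -o table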

The following shows an example of deploying ghaf-infra to the Azure location SWE Central (swedencentral):

# Initialize terraform state and persistent data, using SWE Central as an example location:
$ ./terraform-init.sh -l swedencentral

# Create (or switch to) the workspace 'devswec':
$ terraform workspace new devswec || terraform workspace select devswec

# Optionally, run Terraform plan:
# (Variable 'envtype' overrides the default environment type)
$ terraform plan -var="envtype=dev"

# Deploy with Terraform apply:
$ terraform apply -var="envtype=dev" -auto-approve

Common Terraform Errors

Below are some common Terraform errors with tips on how to resolve each.

Error: A resource with the ID already exists

$ terraform apply
...
azurerm_virtual_machine_extension.deploy_ubuntu_builder: Creating...
╷
│ Error: A resource with the ID "/subscriptions/<SUBID>/resourceGroups/rg-name-here/providers/Microsoft.Compute/virtualMachines/azarm/extensions/azarm-vmext" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_virtual_machine_extension" for more information.

Example fix:

$ terraform import azurerm_virtual_machine_extension.deploy_ubuntu_builder /subscriptions/<SUBID>/resourceGroups/rg-name-here/providers/Microsoft.Compute/virtualMachines/azarm/extensions/azarm-vmext

# Ref: https://stackoverflow.com/questions/61418168/terraform-resource-with-the-id-already-exists

Error: creating/updating Image

$ terraform apply
...
│ Error: creating/updating Image (Subscription: "<SUBID>"
│ Resource Group Name: "ghaf-infra-dev"
│ Image Name: "<NAME>"): performing CreateOrUpdate: unexpected status 400 with error: InvalidParameter: The source blob https://<INSTANCE>.blob.core.windows.net/ghaf-infra-vm-images/<IMAGE>.vhd is not accessible.
│
│   with module.builder_image.azurerm_image.default,
│   on modules/azurerm-nix-vm-image/main.tf line 22, in resource "azurerm_image" "default":
│   22: resource "azurerm_image" "default" {

Try running terraform apply again if you get an error similar to the one shown above. It is unclear why this error occasionally occurs; this issue should be analyzed in more detail.

Error: Disk

$ terraform apply
...
│ Error: Disk (Subscription: "<SUBID>"
│ Resource Group Name: "ghaf-infra-persistent-eun"
│ Disk Name: "binary-cache-vm-caddy-state-dev") was not found
│
│   with data.azurerm_managed_disk.binary_cache_caddy_state,
│   on main.tf line 207, in data "azurerm_managed_disk" "binary_cache_caddy_state":
│  207: data "azurerm_managed_disk" "binary_cache_caddy_state" {

The above error (or similar) is likely caused by missing initialization of some persistent resources. Fix the persistent initialization by running terraform-init.sh, then run terraform apply again.
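
That is, a typical fix looks like:

# Re-run the initialization to create the missing persistent resources,
# then re-apply:
$ ./terraform-init.sh
$ terraform apply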