Kata Containers CI

Warning

While this project's CI has several areas for improvement, it is constantly evolving. This document attempts to describe its current state, but due to ongoing changes, you may notice some outdated information here. Feel free to modify/improve this document as you use the CI and notice anything odd. The community appreciates it!

Introduction

The Kata Containers CI relies on GitHub Actions. The workflows themselves can be found in the .github/workflows directory, and they may call helper scripts, located under the tests directory, to actually perform the tasks required for each test case.
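To see which helper scripts the workflows call, something like the following sketch may help. It assumes that helper scripts are referenced by a path that starts with tests/ and ends in .sh; the function name is made up for illustration, and other invocation styles will not be detected.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list every script under tests/ that is mentioned
# in the workflow files of a checkout.
# $1: workflows directory (defaults to .github/workflows).
# Assumption: scripts are referenced as "tests/<path>.sh".
list_helper_scripts() {
    local workflows_dir="${1:-.github/workflows}"
    # -r: recurse, -h: no file names, -o: print only the match, -E: extended regex
    grep -rhoE 'tests/[A-Za-z0-9_./-]+\.sh' "${workflows_dir}" | sort -u
}
```

Run from the root of a kata-containers checkout, list_helper_scripts prints one script path per line, which is a reasonable starting point when looking for the code behind a failing job.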

The different workflows

There are a few different sets of workflows running as part of our CI, and here we cover the ones that are least likely to become outdated. That said, if you find something that has become outdated, opening an issue that points to the problem is a nice way to help, and providing a fix is an even better one.

Jobs that run automatically when a PR is raised

These tests run automatically as soon as a PR is opened. They mostly run on "cost free" runners, and they perform some pre-checks to evaluate whether your PR is ready to be reviewed.

Mind, though, that the community expects contributors to at least build their code before submitting a PR, which the community sees as a very fair request.

Without getting into the weeds, those jobs are the ones responsible for ensuring that:

  • The commit message is in the expected format
  • There's no missing Developer's Certificate of Origin
  • Static checks are passing
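Some of these pre-checks can be approximated locally before pushing. The sketch below is not the project's actual check: it only assumes a "subsystem: summary" subject line (an assumption about the expected format) and a Signed-off-by trailer, which is how a DCO sign-off is normally recorded. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical local pre-check, NOT the check that runs in the CI.
# $1: full commit message. Returns 0 when it looks acceptable, 1 otherwise.
check_commit_message() {
    local message="$1"
    local subject
    subject="$(printf '%s\n' "${message}" | head -n 1)"

    # Assumed subject convention: "<subsystem>: <summary>".
    if ! printf '%s\n' "${subject}" | grep -qE '^[^[:space:]:]+: .+'; then
        echo "subject should look like 'subsystem: summary'" >&2
        return 1
    fi

    # The sign-off is a "Signed-off-by:" trailer ("git commit -s" adds it).
    if ! printf '%s\n' "${message}" | grep -qE '^Signed-off-by: .+ <.+>$'; then
        echo "missing Signed-off-by trailer" >&2
        return 1
    fi
}
```

For the most recent commit this could be used as check_commit_message "$(git log -1 --format=%B)". The CI remains the authority on what is accepted.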

Jobs that require a maintainer's approval to run

These are the required tests, and our so-called "CI". These require a maintainer's approval to run as parts of those jobs will be running on "paid runners", which are currently using Azure infrastructure.

Once a maintainer of the project gives "the green light" (currently by adding an ok-to-test label to the PR, soon to be changed to commenting "/test" as part of a PR review), the following tests will be executed:

  • Build all the components (runs on cost free runners, or bare-metal depending on the architecture)
  • Create a tarball with all the components (runs on cost free runners, or bare-metal depending on the architecture)
  • Create a kata-deploy payload with the tarball generated in the previous step (runs on cost free runners, or bare-metal depending on the architecture)
  • Run the following tests:
    • Tests depending on the generated tarball
      • Metrics (runs on bare-metal)
      • docker (runs on Azure small instances)
      • nerdctl (runs on Azure small instances)
      • kata-monitor (runs on Azure small instances)
      • cri-containerd (runs on Azure small instances)
      • nydus (runs on Azure small instances)
      • vfio (runs on Azure normal instances)
    • Tests depending on the generated kata-deploy payload
      • kata-deploy (runs on Azure small instances)
        • Tests are performed using different "Kubernetes flavors", such as k0s, k3s, rke2, and Azure Kubernetes Service (AKS).
      • Kubernetes (runs on Azure small and medium instances depending on what's required by each test, and on TEE bare-metal machines)
        • Tests are performed with different runtime engines, such as CRI-O and containerd.
        • Tests are performed with different snapshotters for containerd, namely OverlayFS and devmapper.
        • Tests are performed with all the supported hypervisors, which are Cloud Hypervisor, Dragonball, Firecracker, and QEMU.

For all the tests relying on Azure instances, real money is being spent, so the community asks for the maintainers to be mindful about those, and avoid abusing them to merely debug issues.

The different runners

In the previous section we mentioned using different runners. In this section we'll go through each type of runner used.

  • Cost free runners: Those are the runners provided by GitHub itself, and those are fairly small machines with no virtualization capabilities enabled.
  • Azure small instances: Those are runners which have virtualization capabilities enabled, 2 CPUs, and 8GB of RAM. These runners have a "-smaller" suffix to their name.
  • Azure normal instances: Those are runners which have virtualization capabilities enabled, 4 CPUs, and 16GB of RAM. These runners are usually garm ones with no "-smaller" suffix.
  • Bare-metal runners: Those are runners provided by community contributors, and they may vary in architecture, size and virtualization capabilities. Builder runners don't actually require any virtualization capabilities, while runners that actually perform the tests must have virtualization capabilities and a reasonable amount of CPU and RAM available (at least matching the Azure normal instances).
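When preparing a machine that is meant to run tests, it can be useful to check whether it has virtualization capabilities at all. The following is a minimal sketch for x86 Linux hosts only: it looks for the vmx (Intel) or svm (AMD) CPU flags, which is an assumption that does not hold for other architectures. The function name is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical check, x86 only: succeeds when the CPU flags advertise
# hardware virtualization (vmx for Intel, svm for AMD).
# $1: path to a cpuinfo-style file (defaults to /proc/cpuinfo).
has_virt_flags() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches whole words only, so similar flag names are not counted.
    grep -qwE 'vmx|svm' "${cpuinfo}"
}
```

On a candidate machine, has_virt_flags && [ -w /dev/kvm ] succeeding suggests that KVM is usable by the current user. The CPU and RAM requirements listed above still have to be checked separately.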

Adding new tests

Before adding a new test, we strongly recommend going through the GitHub Actions documentation, which provides a very sensible background on how to read and understand the tests we currently have, and on how to write a new one.

In Kata Containers land, there are basically two sets of tests: "standalone" and "part of something bigger".

The "standalone" tests, for example the commit message check, won't be covered here as they're better covered by the GitHub Actions documentation linked above.

The "part of something bigger" tests are more complicated and less straightforward to add, so we'll focus our efforts on describing the addition of those.

Note

TODO: Currently, this document refers to "tests" when it actually means the jobs (or workflows) of GitHub. In an ideal world, except in some specific cases, new tests should be added without the need to add new workflows. In the not-too-distant future (hopefully), we will improve the workflows to support this.

Adding a new test that's "part of something bigger"

The first important thing here is to align expectations, and we must say that the community strongly prefers receiving tests that already come with:

  • Instructions on how to run them
  • Proof of a run where they pass

There are several ways to achieve those two requirements, and an example of that can be seen in PR #8115.

With the expectations aligned, adding a test consists in:

Following those examples, the community advice during the review, and even asking the community directly on Slack are the best ways to get your test accepted.

Running tests

Running the tests as part of the CI

If you're a maintainer of the project, you'll be able to kick off the tests yourself. With the current approach, you just need to add the ok-to-test label and the tests will start automatically. We're moving, though, to a /test command issued as part of a GitHub review comment, which will simplify this process.

If you're not a maintainer, please send a message on Slack or wait until one of the maintainers reviews your PR. Maintainers will then kick off the tests on your behalf.

If a test fails and you suspect the failure is due to flakiness in the test itself, please create an issue for us, and then re-run (or ask a maintainer to re-run) the tests following these steps:

  • Locate which test is failing
  • Click on "details"
  • In the top right corner, click on "Re-run jobs"
  • And then on "Re-run failed jobs"
  • And finally click on the green "Re-run jobs" button
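The steps above use the web UI. As an alternative sketch, the same re-run can be requested from a terminal, assuming the GitHub CLI (gh) is installed, authenticated, and run by someone with permission to re-run jobs in the repository. The wrapper function and its DRY_RUN variable are hypothetical; gh run rerun with --failed is the actual CLI call.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: re-run only the failed jobs of one workflow run.
# $1: numeric run id (the number in the ".../actions/runs/<id>" URL).
# Set DRY_RUN=1 to print the command instead of executing it.
rerun_failed_jobs() {
    local run_id="$1"
    case "${run_id}" in
        ''|*[!0-9]*) echo "run id must be numeric" >&2; return 1 ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "gh run rerun ${run_id} --failed"
    else
        gh run rerun "${run_id}" --failed
    fi
}
```

Calling it with DRY_RUN=1 first shows the command that would be sent, which is handy when you are unsure you picked the right run.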

Note

TODO: We need figures here

Running the tests locally

Aligning expectations is also very important in this section: you will not be able to run the tests exactly the same way they run in the CI, as you most likely won't have access to an Azure subscription. However, we're trying our best to provide instructions on how to run the tests in an environment that's "close enough", which will help you debug issues you find with the current tests, or even provide a proof of concept for the new test you're trying to add.

The basic steps, which we will cover in detail below, are:

  1. Create a VM matching the configuration of the target runner
  2. Generate the artifacts you'll need for the test, or download them from a current failed run
  3. Follow the steps provided in the action itself to run the tests.

Although the general overview looks easy, we know that some tricks need to be shared, and we'll go through the general process of debugging one non-Kubernetes and one Kubernetes specific test for educational purposes.

One important thing to note is that "Create a VM" can be done in many different ways, using the tools of your choice. For the sake of simplicity, this guide uses kcli, which we strongly recommend if you're not an experienced user and happen to be developing on a Linux box.

For both the non-Kubernetes and Kubernetes cases, we'll be using PR #8070 as an example. At the time of writing it serves the purpose very well, as it has both nerdctl and Kubernetes tests failing.

Debugging tests

Debugging a non Kubernetes test

In PR #8070, the nerdctl test is failing.

As a developer you can go to the details of the job, and expand the step that's failing in order to gather more information.

But when that doesn't help, we need to set up our own environment to debug what's going on.

Taking a look at the nerdctl test, which is located here, you can easily see that it runs-on a garm-ubuntu-2304-smaller virtual machine.

The important parts to understand are ubuntu-2304, which is the OS the test runs on; and "smaller", which means we're running it on a machine with 2 CPUs and 8GB of RAM.

With this information, we can go ahead and create a similar VM locally using kcli.

$ sudo kcli create vm -i ubuntu2304 -P 'disks=[60]' -P numcpus=2 -P memory=8192 -P cpumodel=host-passthrough debug-nerdctl-pr8070
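The mapping from a runner label to VM size can be sketched as a small helper. It relies only on the sizes given in the runner descriptions of this document, and it assumes that any label without the "-smaller" suffix is an Azure normal instance, which is not true for bare-metal or GitHub-provided runners. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the kcli size parameters for a runner label.
# Sizes are taken from the runner descriptions in this document:
#   "-smaller" suffix  -> 2 CPUs, 8 GB of RAM
#   anything else      -> assumed to be an Azure normal instance (4 CPUs, 16 GB)
runner_to_kcli_params() {
    local label="$1"
    case "${label}" in
        *-smaller) echo "-P numcpus=2 -P memory=8192" ;;
        *)         echo "-P numcpus=4 -P memory=16384" ;;
    esac
}
```

The output is meant to be spliced, unquoted, into a kcli command line, for example: sudo kcli create vm -i ubuntu2304 -P 'disks=[60]' $(runner_to_kcli_params garm-ubuntu-2304-smaller) -P cpumodel=host-passthrough debug-nerdctl-pr8070.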

In order to run the tests, you'll need the "kata-tarball" artifacts, which you can build yourself using "make kata-tarball" (see below), or simply get from the PR where the tests failed. To download them, click on the "Summary" button in the top left corner, and then scroll down until you see the artifacts.

Unfortunately GitHub doesn't give us a link we can use to download those from inside the VM, but we can download them on our local box, and then scp the tarball to the newly created VM that will be used for debugging purposes.

Note

Those artifacts only become available once all jobs have finished, and they are kept for 15 days.

Once you have the kata-static.tar.xz in your VM, you can log in to the VM with kcli ssh debug-nerdctl-pr8070, and then clone your development branch

$ git clone --branch feat_add-fc-runtime-rs https://github.com/nubificus/kata-containers

Enter the cloned repository, add the upstream as a remote, set up your git, and rebase your branch atop the upstream main one

$ cd kata-containers
$ git remote add upstream https://github.com/kata-containers/kata-containers
$ git remote update
$ git config --global user.email "you@example.com"
$ git config --global user.name "Your Name"
$ git rebase upstream/main

Now copy the kata-static.tar.xz into your kata-containers/kata-artifacts directory

$ mkdir kata-artifacts
$ cp ../kata-static.tar.xz kata-artifacts/
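If you repeat this step often, the two commands above can be wrapped so that they are safe to run more than once and fail early when the tarball is missing. This is only a convenience sketch; the function name is hypothetical, and the kata-artifacts directory name is the one used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy a kata-static.tar.xz into <repo>/kata-artifacts.
# $1: path to the tarball; $2: repository root (defaults to the current dir).
stage_kata_tarball() {
    local tarball="$1"
    local repo_dir="${2:-.}"
    if [ ! -f "${tarball}" ]; then
        echo "no such file: ${tarball}" >&2
        return 1
    fi
    # -p makes the call idempotent: no error if the directory already exists.
    mkdir -p "${repo_dir}/kata-artifacts"
    cp "${tarball}" "${repo_dir}/kata-artifacts/kata-static.tar.xz"
}
```

From inside the repository, stage_kata_tarball ../kata-static.tar.xz has the same effect as the commands above.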

Note

If you downloaded the .zip from GitHub, you need to extract it first to get kata-static.tar.xz.

And finally run the tests following what's in the yaml file for the test you're debugging.

In our case, the run-nerdctl-tests-on-garm.yaml.

When looking at the file you'll notice that some environment variables are set, such as KATA_HYPERVISOR. For this particular example, the important steps to follow are:

  • Install the dependencies
  • Install kata
  • Run the tests

Let's now run the steps mentioned above exporting the expected environment variables

$ export KATA_HYPERVISOR=dragonball
$ bash ./tests/integration/nerdctl/gha-run.sh install-dependencies
$ bash ./tests/integration/nerdctl/gha-run.sh install-kata
$ bash ./tests/integration/nerdctl/gha-run.sh run
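When running the steps repeatedly while debugging, it can help to stop at the first failing step instead of continuing with a broken setup. The following is a minimal sketch of such a wrapper; the function name is hypothetical, and it only assumes that the helper script takes the step name as its first argument, as in the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run the given steps of a gha-run.sh style script
# in order, stopping at the first one that fails.
# $1: path to the helper script; remaining arguments: step names.
run_gha_steps() {
    local script="$1"
    shift
    local step
    for step in "$@"; do
        echo "==> ${step}"
        if ! bash "${script}" "${step}"; then
            echo "step '${step}' failed" >&2
            return 1
        fi
    done
}
```

With KATA_HYPERVISOR exported as shown above, the three steps become a single call: run_gha_steps ./tests/integration/nerdctl/gha-run.sh install-dependencies install-kata run.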

And with this you should be able to reproduce exactly the same issue found in the CI. From now on you can build your own code, use your own binaries, and have fun debugging and hacking!

Debugging a Kubernetes test

Steps for debugging the Kubernetes tests are very similar to the ones for debugging non-Kubernetes tests, with the caveat that what you'll need, this time, is not the kata-static.tar.xz tarball, but rather a payload to be used with kata-deploy.

In order to generate your own kata-deploy image you can generate your own kata-static.tar.xz and then take advantage of the following script. Be aware that the image generated and uploaded must be accessible by the VM where you'll be performing your tests.

In case you want to take advantage of the payload that was already generated when you faced the CI failure, which is considerably easier, take a look at the failed job, then click on "Deploy Kata" and expand the "Final kata-deploy.yaml that is used in the test" section. From there you can see exactly what you'll have to use when deploying kata-deploy in your local cluster.
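If you save that final kata-deploy.yaml to a local file, the payload image it points to can be pulled out with a small helper, for example to check that the image is reachable from your VM. This sketch assumes the images appear on lines of the form "image: <reference>"; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the container image reference(s) found in a
# kata-deploy YAML file saved locally.
# Assumption: images appear on lines of the form "image: <reference>",
# optionally quoted and optionally starting a list item ("- image: ...").
kata_deploy_images() {
    local yaml="$1"
    sed -nE 's/^[[:space:]]*(-[[:space:]]+)?image:[[:space:]]*"?([^"[:space:]]+)"?[[:space:]]*$/\2/p' "${yaml}" | sort -u
}
```

Calling kata_deploy_images kata-deploy.yaml prints one image reference per line. Pulling that reference with the container tooling available in your VM is a quick way to confirm that the payload is accessible before deploying it.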

Note

TODO: WAINER TO FINISH THIS PART BASED ON HIS PR TO RUN A LOCAL CI

Adding new runners

Any admin of the project is able to add or remove GitHub runners, and those are the folks you should rely on.

If you need a new runner added, please tag @ac in the Kata Containers Slack, and someone from that group will be able to help you.

If you're part of that group and you're looking for information on how to help someone, this is simple, and must be done in private. Basically what you have to do is:

  • Go to the kata-containers/kata-containers repo
  • Click on the Settings button, located in the top right corner
  • On the left panel, under "Code and automation", click on "Actions"
  • Click on "Runners"

If you want to add a new self-hosted runner:

  • In the top right corner there's a green button called "New self-hosted runner"

If you want to remove a current self-hosted runner:

  • For each runner there's a "..." menu, where you can just click and the "Remove runner" option will show up

Known limitations

As the GitHub actions are structured right now, we cannot:

  • Test the addition of a GitHub action that's not triggered by a pull_request event as part of the PR.