- What is it?
- Background
- Out of scope
- Design
- API
- Networking
- Storage
- Devices
- Developers
- Persistent storage plugin support
- Experimental features
virtcontainers is a Go library that can be used to build hardware-virtualized container
runtimes.
The few existing VM-based container runtimes (Clear Containers, runV, rkt's
KVM stage 1) all share the same hardware virtualization semantics but use different
code bases to implement them. The goal of virtcontainers is to factor this code into
a common Go library.
Ideally, VM-based container runtime implementations would become translation
layers from the runtime specification they implement (e.g. the OCI runtime-spec
or the Kubernetes CRI) to the virtcontainers API.
virtcontainers was used as a foundational package for the Clear Containers runtime implementation.
Implementing a container runtime is out of scope for this project. Any tools or executables in this repository are only provided for demonstration or testing purposes.
The virtcontainers API is loosely inspired by the Kubernetes CRI because
we believe it provides the right level of abstractions for containerized sandboxes.
However, despite the API similarities between the two projects, the goal of
virtcontainers is not to build a CRI implementation, but instead to provide a
generic, runtime-specification agnostic, hardware-virtualized containers
library that other projects could leverage to implement CRI themselves.
The virtcontainers execution unit is a sandbox, i.e. virtcontainers users start sandboxes where
containers will be running.
virtcontainers creates a sandbox by starting a virtual machine and setting the sandbox
up within that environment. Starting a sandbox means launching all containers within
the VM sandbox runtime environment.
The virtcontainers package relies on hypervisors to start and stop the virtual machines where
sandboxes will be running. A hypervisor is defined by a Hypervisor interface implementation,
and the default implementation is the QEMU one.
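To illustrate why an interface is a convenient boundary here, the following is a minimal sketch, not the actual virtcontainers code. The `hypervisor` interface, its two methods, `fakeHypervisor`, and `runSandboxVM` are all hypothetical names chosen for this example; the real Hypervisor interface has a larger method set.

```go
package main

import (
	"errors"
	"fmt"
)

// hypervisor is a hypothetical, heavily reduced stand-in for the
// Hypervisor interface described above.
type hypervisor interface {
	startVM() error
	stopVM() error
}

// fakeHypervisor is an illustrative backend that only tracks state.
type fakeHypervisor struct {
	running bool
}

func (h *fakeHypervisor) startVM() error {
	if h.running {
		return errors.New("VM already running")
	}
	h.running = true
	return nil
}

func (h *fakeHypervisor) stopVM() error {
	if !h.running {
		return errors.New("VM not running")
	}
	h.running = false
	return nil
}

// runSandboxVM depends only on the interface, so any backend that
// implements it can be plugged in without changing this function.
func runSandboxVM(h hypervisor) error {
	if err := h.startVM(); err != nil {
		return err
	}
	return h.stopVM()
}

func main() {
	h := &fakeHypervisor{}
	if err := runSandboxVM(h); err != nil {
		fmt.Println("lifecycle failed:", err)
		return
	}
	fmt.Println("VM started and stopped through the interface")
}
```

Code written against the interface stays unchanged when a different backend is selected, which is the property the paragraph above relies on.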
See docs
During the lifecycle of a container, the runtime running on the host needs to interact with
the virtual machine guest OS in order to start new commands to be executed as part of a given
container workload, set new networking routes or interfaces, fetch a container's standard or
error output, and so on.
There are many existing and potential solutions to resolve that problem and virtcontainers abstracts
this through the Agent interface.
In some cases the runtime will need a translation shim between the higher level container stack (e.g. Docker) and the virtual machine holding the container workload. This is needed for container stacks that make strong assumptions on the nature of the container they're monitoring. In cases where they assume containers are simply regular host processes, a shim layer is needed to translate host specific semantics into e.g. agent controlled virtual machine ones.
The high-level virtcontainers API is as follows:
- CreateSandbox(sandboxConfig SandboxConfig) creates a Sandbox. The virtual machine is started and the Sandbox is prepared.
- DeleteSandbox(sandboxID string) deletes a Sandbox. The virtual machine is shut down and all information related to the Sandbox is removed. The function will fail if the Sandbox is running. In that case StopSandbox() has to be called first.
- StartSandbox(sandboxID string) starts an already created Sandbox. The Sandbox and all its containers are started.
- RunSandbox(sandboxConfig SandboxConfig) creates and starts a Sandbox. This performs CreateSandbox() + StartSandbox().
- StopSandbox(sandboxID string) stops an already running Sandbox. The Sandbox and all its containers are stopped.
- PauseSandbox(sandboxID string) pauses an existing Sandbox.
- ResumeSandbox(sandboxID string) resumes a paused Sandbox.
- StatusSandbox(sandboxID string) returns a detailed Sandbox status.
- ListSandbox() lists all Sandboxes on the host. It returns a detailed status for every Sandbox.
- CreateContainer(sandboxID string, containerConfig ContainerConfig) creates a Container on an existing Sandbox.
- DeleteContainer(sandboxID, containerID string) deletes a Container from a Sandbox. If the Container is running it has to be stopped first.
- StartContainer(sandboxID, containerID string) starts an already created Container. The Sandbox has to be running.
- StopContainer(sandboxID, containerID string) stops an already running Container.
- EnterContainer(sandboxID, containerID string, cmd Cmd) enters an already running Container and runs a given command.
- StatusContainer(sandboxID, containerID string) returns a detailed Container status.
- KillContainer(sandboxID, containerID string, signal syscall.Signal, all bool) sends a signal to all or one container inside a Sandbox.
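The ordering rules in the Sandbox calls above (a running Sandbox must be stopped before it is deleted, and RunSandbox is create followed by start) can be sketched with a small in-memory model. This is an illustration only: `store`, `state`, and the lower-case method names are hypothetical and do not call into virtcontainers, and letting a stopped Sandbox be started again is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

type state int

const (
	stateReady   state = iota // created but not started
	stateRunning              // started
	stateStopped              // stopped after running
)

// store is a hypothetical in-memory stand-in for sandbox bookkeeping.
type store struct {
	sandboxes map[string]state
}

func newStore() *store { return &store{sandboxes: map[string]state{}} }

func (s *store) createSandbox(id string) error {
	if _, ok := s.sandboxes[id]; ok {
		return errors.New("sandbox already exists")
	}
	s.sandboxes[id] = stateReady
	return nil
}

func (s *store) startSandbox(id string) error {
	st, ok := s.sandboxes[id]
	if !ok {
		return errors.New("no such sandbox")
	}
	if st == stateRunning {
		return errors.New("sandbox already running")
	}
	s.sandboxes[id] = stateRunning
	return nil
}

func (s *store) stopSandbox(id string) error {
	if st, ok := s.sandboxes[id]; !ok || st != stateRunning {
		return errors.New("sandbox is not running")
	}
	s.sandboxes[id] = stateStopped
	return nil
}

// deleteSandbox refuses to remove a running sandbox, mirroring the
// "stop first" rule described for DeleteSandbox.
func (s *store) deleteSandbox(id string) error {
	st, ok := s.sandboxes[id]
	if !ok {
		return errors.New("no such sandbox")
	}
	if st == stateRunning {
		return errors.New("sandbox is running; stop it first")
	}
	delete(s.sandboxes, id)
	return nil
}

// runSandbox is create followed by start.
func (s *store) runSandbox(id string) error {
	if err := s.createSandbox(id); err != nil {
		return err
	}
	return s.startSandbox(id)
}

func main() {
	s := newStore()
	fmt.Println("run:", s.runSandbox("sb1"))
	fmt.Println("delete while running:", s.deleteSandbox("sb1"))
	fmt.Println("stop:", s.stopSandbox("sb1"))
	fmt.Println("delete after stop:", s.deleteSandbox("sb1"))
}
```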
An example tool using the virtcontainers API is provided in the hack/virtc package.
For further details, see the API documentation.
virtcontainers supports the two major container networking models: the Container Network Model (CNM) and the Container Network Interface (CNI).
Typically the former is the Docker default networking model while the latter is used on Kubernetes deployments.
CNM lifecycle
- RequestPool
- CreateNetwork
- RequestAddress
- CreateEndPoint
- CreateContainer
- Create config.json
- Create PID and network namespace
- ProcessExternalKey
- JoinEndPoint
- LaunchContainer
- Launch
- Run container
Runtime network setup with CNM
- Read config.json
- Create the network namespace
- Call the prestart hook (from inside the netns)
- Scan network interfaces inside netns and get the name of the interface created by prestart hook
- Create bridge, TAP, and link all together with network interface previously created
- Start VM inside the netns and start the container
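The interface scan in the steps above can be sketched with the Go standard library. The approach shown here, comparing the interface names seen before and after the hook runs, is an assumption made for illustration and is not necessarily how virtcontainers does it; `interfaceNames` and `addedInterfaces` are hypothetical helpers. `net.Interfaces` reports what is visible from the network namespace the program is running in.

```go
package main

import (
	"fmt"
	"net"
)

// interfaceNames returns the names of the network interfaces visible
// from the current network namespace.
func interfaceNames() ([]string, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(ifaces))
	for _, iface := range ifaces {
		names = append(names, iface.Name)
	}
	return names, nil
}

// addedInterfaces returns the names present in after but not in before.
func addedInterfaces(before, after []string) []string {
	seen := make(map[string]bool, len(before))
	for _, name := range before {
		seen[name] = true
	}
	var added []string
	for _, name := range after {
		if !seen[name] {
			added = append(added, name)
		}
	}
	return added
}

func main() {
	before, err := interfaceNames()
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	// The prestart hook would run here; this sketch does not call it,
	// so the two scans are expected to match.
	after, err := interfaceNames()
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Println("interfaces added by the hook:", addedInterfaces(before, after))
}
```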
Drawbacks of CNM
There are three drawbacks to using CNM instead of CNI:
- The way we call into it is not very explicit: we have to re-exec the dockerd binary so that it can accept parameters and execute the prestart hook related to network setup.
- The network namespace is designated implicitly: instead of explicitly giving the netns to dockerd, we give it the PID of our runtime so that it can find the netns from this PID. This means we have to make sure we are in the right netns while calling the hook, otherwise the VETH pair will be created in the wrong netns.
- No results come back from the hook: we have to scan the network interfaces to discover which one has been created inside the netns. This introduces more latency because it forces us to scan the network in the CreateSandbox path, which is critical for starting the VM as quickly as possible.
Container workloads are shared with the virtualized environment through 9pfs. The devicemapper storage driver is a special case. The driver uses dedicated block devices rather than formatted filesystems, and operates at the block level rather than the file level. This knowledge has been used to directly use the underlying block device instead of the overlay file system for the container root file system. The block device maps to the top read-write layer for the overlay. This approach gives much better I/O performance compared to using 9pfs to share the container file system.
The approach above does introduce a limitation in terms of dynamic file copy in/out of the container via docker cp operations.
The copy operation from host to container accesses the mounted file system on the host side. This is not expected to work and may lead to inconsistencies as the block device will be simultaneously written to, from two different mounts.
The copy operation from container to host will work, provided the user calls sync(1) from within the container prior to the copy to make sure any outstanding cached data is written to the block device.
docker cp [OPTIONS] CONTAINER:SRC_PATH HOST:DEST_PATH
docker cp [OPTIONS] HOST:SRC_PATH CONTAINER:DEST_PATH
Ability to hotplug block devices has been added, which makes it possible to use block devices for containers started after the VM has been launched.
Start a container. Call mount(8) within the container. You should see / mounted on the /dev/vda device.
Support has been added to pass VFIO assigned devices on the docker command line with --device. Support for passing other devices including block devices with --device has not been added yet. PCI and AP (IBM Z Crypto Express cards) devices can be passed.
- Requirements
An IOMMU group represents the smallest set of devices for which the IOMMU has visibility and which is isolated from other groups. VFIO uses this information to enforce safe ownership of devices for userspace.
You will need Intel VT-d capable hardware. Check if IOMMU is enabled in your host
kernel by verifying CONFIG_VFIO_NOIOMMU is not in the kernel configuration. If it is set,
you will need to rebuild your kernel.
The following kernel configuration options need to be enabled:
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO=m
CONFIG_VFIO_PCI=m
In addition, you need to pass intel_iommu=on on the kernel command line.
- Identify the BDF (Bus-Device-Function) of the PCI device to be assigned.
$ lspci -D | grep -e Ethernet -e Network
0000:01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
$ BDF=0000:01:00.0
- Find vendor and device id.
$ lspci -n -s $BDF
01:00.0 0200: 8086:1528 (rev 01)
- Find IOMMU group.
$ readlink /sys/bus/pci/devices/$BDF/iommu_group
../../../../kernel/iommu_groups/16
- Unbind the device from host driver.
$ echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
- Bind the device to vfio-pci.
$ sudo modprobe vfio-pci
$ echo 8086 1528 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
$ echo $BDF | sudo tee --append /sys/bus/pci/drivers/vfio-pci/bind
- Check /dev/vfio
$ ls /dev/vfio
16 vfio
- Start a Clear Containers container passing the VFIO group on the docker command line.
docker run -it --device=/dev/vfio/16 centos/tools bash
- Running lspci within the container should show the device among the PCI devices. The driver for the device needs to be present within the Clear Containers kernel. If the driver is missing, you can add it to your custom container kernel using the osbuilder tooling.
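The link between the IOMMU group step and the docker command line in the steps above is just the last path component of the iommu_group symlink. The following sketch shows that mapping; `vfioGroupPath` is a hypothetical helper written for this example, and it only processes a link target string (as printed by readlink) rather than touching sysfs.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// vfioGroupPath converts the target of an iommu_group symlink, such as
// "../../../../kernel/iommu_groups/16", into the matching /dev/vfio path.
func vfioGroupPath(linkTarget string) (string, error) {
	group := path.Base(linkTarget)
	// IOMMU groups are numbered, so reject anything that is not a number.
	if _, err := strconv.ParseUint(group, 10, 32); err != nil {
		return "", fmt.Errorf("unexpected IOMMU group %q: %w", group, err)
	}
	return path.Join("/dev/vfio", group), nil
}

func main() {
	dev, err := vfioGroupPath("../../../../kernel/iommu_groups/16")
	if err != nil {
		fmt.Println("cannot resolve VFIO group:", err)
		return
	}
	// Print the command from the final step with the resolved device path.
	fmt.Println("docker run -it --device=" + dev + " centos/tools bash")
}
```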
IBM Z mainframes (s390x) use the AP (Adjunct Processor) bus for their Crypto Express hardware security modules. Such devices can be passed over VFIO, which is also supported in Kata. Pass-through happens separated by adapter and domain, i.e. a passable VFIO device has one or multiple adapter-domain combinations.
- You must follow the kernel documentation for preparing VFIO-AP passthrough.
In short, your host kernel should have the following enabled or available as
module (in case of modules, load the modules accordingly, e.g. through
modprobe). If one is missing, you will have to update your kernel accordingly, e.g. through recompiling.
CONFIG_VFIO_AP
CONFIG_VFIO_IOMMU_TYPE1
CONFIG_VFIO
CONFIG_VFIO_MDEV
CONFIG_VFIO_MDEV_DEVICE
CONFIG_S390_AP_IOMMU
- Set the AP adapter(s) and domain(s) you want to pass in /sys/bus/ap/apmask and /sys/bus/ap/aqmask
by writing their negative numbers. Assuming you want to pass 06.0032, you'd run
$ echo -0x6 | sudo tee /sys/bus/ap/apmask > /dev/null
$ echo -0x32 | sudo tee /sys/bus/ap/aqmask > /dev/null
- Create one or multiple mediated devices -- one per container you want to
pass to. You must write a UUID for the device to
/sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create.
You can use uuidgen for generating the UUID, e.g.
$ uuidgen | sudo tee /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create
a297db4a-f4c2-11e6-90f6-d3b88d6c9525
- Set the AP adapter(s) and domain(s) you want to pass per device by writing
their numbers to /sys/devices/vfio_ap/matrix/${UUID}/assign_adapter
and assign_domain in the same directory. For the UUID from step 3, that would be
$ echo 0x6 | sudo tee /sys/devices/vfio_ap/matrix/a297db4a-f4c2-11e6-90f6-d3b88d6c9525/assign_adapter > /dev/null
$ echo 0x32 | sudo tee /sys/devices/vfio_ap/matrix/a297db4a-f4c2-11e6-90f6-d3b88d6c9525/assign_domain > /dev/null
- Find the IOMMU group of the mediated device by following the link from
/sys/devices/vfio_ap/matrix/${UUID}/iommu_group. There should be a corresponding VFIO device in /dev/vfio.
$ readlink /sys/devices/vfio_ap/matrix/a297db4a-f4c2-11e6-90f6-d3b88d6c9525/iommu_group
../../../../kernel/iommu_groups/0
$ ls /dev/vfio
0 vfio
- This device can now be passed. To verify the cards are there, you can use
lszcrypt from s390-tools (s390-tools in Alpine, Debian, and Ubuntu, s390utils in Fedora).
With lszcrypt, you can see the cards after the configuration time has passed.
$ sudo docker run -it --device /dev/vfio/0 ubuntu
$ lszcrypt
CARD.DOMAIN TYPE MODE STATUS REQUESTS
----------------------------------------------
06 CEX7C CCA-Coproc online 1
06.0032 CEX7C CCA-Coproc online 1
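The steps above write the same adapter and domain numbers in two forms: negated for apmask and aqmask, and plain for assign_adapter and assign_domain. As a small sketch of how a queue identifier such as 06.0032 maps to those values, the hypothetical helper `parseAPQueue` below splits the identifier and treats both fields as hexadecimal, matching the 0x6 and 0x32 used in the example commands. It only formats strings and does not write to sysfs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseAPQueue splits an AP queue identifier such as "06.0032" into its
// adapter and domain numbers. Both fields are read as hexadecimal.
func parseAPQueue(id string) (adapter, domain uint64, err error) {
	parts := strings.Split(id, ".")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("expected ADAPTER.DOMAIN, got %q", id)
	}
	if adapter, err = strconv.ParseUint(parts[0], 16, 8); err != nil {
		return 0, 0, err
	}
	if domain, err = strconv.ParseUint(parts[1], 16, 16); err != nil {
		return 0, 0, err
	}
	return adapter, domain, nil
}

func main() {
	adapter, domain, err := parseAPQueue("06.0032")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	// Negated values go to apmask and aqmask; plain values go to
	// assign_adapter and assign_domain of the mediated device.
	fmt.Printf("apmask: -0x%x, aqmask: -0x%x\n", adapter, domain)
	fmt.Printf("assign_adapter: 0x%x, assign_domain: 0x%x\n", adapter, domain)
}
```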
For information on how to build, develop and test virtcontainers, see the
developer documentation.
See the persistent storage plugin documentation.
See the experimental features documentation.