Firecracker Snapshots Support #19

plamenmpetrov · 2020-09-22T14:12:34Z

Hello, we have been working on supporting microVM snapshotting in firecracker-containerd, following its introduction in Firecracker. This PR adds new functions to the firecracker-containerd API that together comprise a complete working prototype for Firecracker snapshots. The prototype, however, contains workarounds for calls that are missing from firecracker-go-sdk. We also highlight a couple of issues on which we would like to hear your feedback.

We are open to feedback from the community and would be glad to engage in discussions to finalize and contribute this code to upstream.

Authored by @plamenmpetrov and @ustiugov

Summary

We implement functionality for:
- Pausing a microVM - PauseVM
- Creating a snapshot of a microVM - CreateSnapshot
- Resuming a microVM - ResumeVM
- Loading a snapshot of a microVM - LoadSnapshot
- “Offloading” a microVM, which frees up the resources occupied by the microVM - Offload
We refer to these collectively as microVM snapshotting requests.

The firecracker-go-sdk does not currently support microVM snapshotting. As a result, we issue the microVM snapshotting requests from the runtime as raw HTTP requests. We use our own fork of firecracker-go-sdk v0.21.0, in which we provide basic support for the new logging and metrics of the Firecracker version that we use (see below). Without these changes to firecracker-go-sdk, we observe an error in the containerd logs concerning the Firecracker logging, which prevents us from seeing the Firecracker logs and makes debugging difficult.
We use the following firecracker version in our tests: firecracker.

API Extensions Description

We create an HTTP client when a microVM is created or a microVM snapshot is loaded. The client sends HTTP requests directly to the firecracker process of the respective microVM, rather than going through the firecracker-go-sdk.

ResumeVM, PauseVM and CreateSnapshot

ResumeVM, PauseVM and CreateSnapshot use the HTTP client to send the respective request to the firecracker process. The return code from firecracker is checked to verify that the operation was successful.

Note that CreateSnapshot does not pause the microVM, but assumes that it is paused. This is in line with the prerequisites for creating a microVM snapshot in firecracker.

Offload

Offload kills the firecracker process for the microVM with the respective ID (using SIGKILL) and deletes the firecracker process’ sock file and vsock file so the microVM can later be loaded. This functionality is implemented in the runtime.

In addition, Offload kills the shim using SIGKILL, so that its resources stay free until the microVM is loaded again, if ever. We no longer delete the shim directory for the microVM when the shim terminates, because in our use case we store the guest memory file and the state file in the shim directory. We also remove the shim socket file and the firecracker shim socket file and recreate the sockets upon LoadSnapshot (see below). This functionality is implemented in the control plugin.

LoadSnapshot

Before doing anything else, the shim needs to be started for the microVM. We recreate the shim socket and the fccontrol shim socket, and start the shim binary. This functionality is implemented in the control plugin.

LoadSnapshot starts a firecracker process listening on the same API socket that the microVM was using before it was offloaded. The HTTP client is recreated and a load snapshot request is sent to the firecracker process. The return code from firecracker is checked to verify the success of the operation. This functionality is implemented in the runtime.
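For reference, the body of the load snapshot request could be built as in the sketch below. The endpoint `PUT /snapshot/load` and the fields `snapshot_path` and `mem_file_path` are assumptions based on the Firecracker snapshot API at the time of this PR and may differ in other Firecracker versions. The Go names and file paths are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshotLoad is the assumed body of PUT /snapshot/load.
type snapshotLoad struct {
	SnapshotPath string `json:"snapshot_path"` // microVM state file
	MemFilePath  string `json:"mem_file_path"` // guest memory file
}

// loadSnapshotBody serializes the request body for loading a snapshot.
func loadSnapshotBody(snapPath, memPath string) ([]byte, error) {
	return json.Marshal(snapshotLoad{SnapshotPath: snapPath, MemFilePath: memPath})
}

func main() {
	// Hypothetical paths; in this PR both files live in the shim directory.
	body, err := loadSnapshotBody("/path/to/shim-dir/snap_file", "/path/to/shim-dir/mem_file")
	if err != nil {
		panic(err)
	}
	fmt.Printf("PUT /snapshot/load %s\n", body)
}
```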

Note that LoadSnapshot assumes that a tap device with exactly the same name, IP, and MAC as before the VM was offloaded exists. Currently, we recommend removing the tap after calling Offload and re-creating it before calling LoadSnapshot, because if the two calls are made back to back (as may happen in tests), they cause a “Tap is busy” error.
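A caller could verify this precondition before calling LoadSnapshot. The sketch below checks only the device name and the host-side hardware address, under the assumption that this is the MAC meant above; the IP check is omitted. The name `checkTap` and the example values are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// checkTap reports an error unless a network interface called name exists
// and its hardware address equals wantMAC.
func checkTap(name, wantMAC string) error {
	want, err := net.ParseMAC(wantMAC)
	if err != nil {
		return fmt.Errorf("parse MAC %q: %w", wantMAC, err)
	}
	iface, err := net.InterfaceByName(name)
	if err != nil {
		return fmt.Errorf("tap %q not found: %w", name, err)
	}
	if !bytes.Equal(iface.HardwareAddr, want) {
		return fmt.Errorf("tap %q has MAC %s, want %s", name, iface.HardwareAddr, want)
	}
	return nil
}

func main() {
	// Hypothetical tap name and MAC; substitute the values used at CreateVM.
	if err := checkTap("fc-tap0", "02:00:00:00:00:01"); err != nil {
		fmt.Println("precondition not met:", err)
		return
	}
	fmt.Println("tap is present")
}
```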

Limitations

When calling LoadSnapshot immediately after Offload, we encounter an error that the shimSocket address is in use when trying to load the shim on LoadSnapshot. A workaround is to introduce a sleep of 10-100ms after Offload, depending on the system. This does not happen for the fccontrol shim socket.

ERROR: VM with ID "3" already exists (socket: "/containerd-shim/53d9435747fdf335f1601ccebf98aa71b29f871fcdc68c595c22ca8b0a64d53d.sock")

Calling StopVM on a microVM which has been loaded from a snapshot results in an error, because we lose connection to the agent running inside the microVM.
Performance: in our experiments, re-creating a shim process takes about 30ms before the snapshot can be loaded in Firecracker. We haven’t yet investigated this issue. Our intuition is that shim start-up should not exceed 5-10ms, as is the case for starting a Firecracker process.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Plamen Petrov <[email protected]>

plamenmpetrov added 9 commits September 22, 2020 17:50

Implemented resume and pause call chain.

4bb428d

Signed-off-by: Plamen Petrov <[email protected]>

Removed deprecated logging

6856f1b

Signed-off-by: Plamen Petrov <[email protected]>

Added support for creating and loading snapshots of VM.

58dcb44

Notes: 1. Uses logging-only branch from ustiugov/firecracker-go-sdk 2. Firecracker logs path is hard-coded. Signed-off-by: Plamen Petrov <[email protected]>

Altered buildVMConfiguration tests

c810db3

Signed-off-by: Plamen Petrov <[email protected]>

Added dialer for firecracker socket

e0bf2b7

Signed-off-by: Plamen Petrov <[email protected]>

kill shim functionality

c2c9057

firecracker update Signed-off-by: Plamen Petrov <[email protected]>

Remove unnecessary mkdir

407e26a

* Check that shim dir exists when loading shim * No longer try to create shim dir when loading shim, as it must exist Signed-off-by: Plamen Petrov <[email protected]>

Use SIGKILL, remove patchDrive artifacts

b78efd2

Signed-off-by: Plamen Petrov <[email protected]>

store firecracker logs in shimDir

e2520f3

Signed-off-by: Plamen Petrov <[email protected]>

plamenmpetrov force-pushed the snapshots branch from 9f20578 to e2520f3 Compare September 22, 2020 15:22

plamenmpetrov requested a review from ustiugov September 22, 2020 16:38

plamenmpetrov changed the title ~~MicroVM Snapshotting Support~~ Firecracker Snapshots Support Sep 22, 2020
