Firecracker Snapshots Support #19

Draft: wants to merge 9 commits into base: master

Collaborator

@plamenmpetrov commented Sep 22, 2020

Hello, we have been working on supporting microVM snapshotting in firecracker-containerd, following its introduction in Firecracker. This PR adds new functions to the firecracker-containerd API that together form a complete working prototype for Firecracker snapshots. The prototype, however, contains workarounds for calls that are missing in the go-sdk. We also highlight a couple of issues on which we would like to hear your feedback.

We are open to feedback from the community and would be glad to engage in discussions to finalize and contribute this code to upstream.

Authored by @plamenmpetrov and @ustiugov

Summary

  1. We implement functionality for:

    • Pausing a microVM - PauseVM
    • Creating a snapshot of a microVM - CreateSnapshot
    • Resuming a microVM - ResumeVM
    • Loading a snapshot of a microVM - LoadSnapshot
    • “Offloading” a microVM, which frees up the resources occupied by the microVM - Offload

    We refer to these collectively as microVM snapshotting requests.

  2. The firecracker-go-sdk does not yet support microVM snapshotting. As a result, we embedded the microVM snapshotting requests inside the runtime as HTTP requests. We use our own fork of the firecracker-go-sdk v0.21.0, in which we provide basic support for the new logging and metrics of the firecracker version that we use (see below). Without these changes in the firecracker-go-sdk, we observe an error in the containerd logs concerning the firecracker logging. This prevents us from seeing the firecracker logs and makes debugging difficult.

  3. We use the following firecracker version in our tests: firecracker.

API Extensions Description

We create an HTTP client upon creating a microVM or loading a microVM snapshot. The client sends HTTP requests directly to the firecracker process of the respective microVM, rather than going through the firecracker-go-sdk.

ResumeVM, PauseVM and CreateSnapshot

ResumeVM, PauseVM and CreateSnapshot use the HTTP client to send the respective request to the firecracker process. The return code from firecracker is checked to verify that the operation was successful.

Note that CreateSnapshot does not pause the microVM, but assumes that it is paused. This is in line with the prerequisites for creating a microVM snapshot in firecracker.

Offload

Offload kills the firecracker process for the microVM with the respective ID (using SIGKILL) and deletes the firecracker process’ sock file and vsock file so the microVM can later be loaded. This functionality is implemented in the runtime.

In addition, Offload kills the shim using SIGKILL, so that its resources stay freed until the microVM is loaded again, if ever. We disable the removal of the microVM's shim directory on shim termination, because in our use case we store the guest memory file and the state file in that directory. We also remove the shim socket file and the firecracker shim socket file and recreate the sockets upon LoadSnapshot (see below). This functionality is implemented in the control plugin.

LoadSnapshot

Before doing anything else, the shim needs to be started for the microVM. We recreate the shim socket and the fccontrol shim socket, and start the shim binary. This functionality is implemented in the control plugin.

LoadSnapshot starts a firecracker process listening on the same API socket that the microVM was using prior to being offloaded. The HTTP client is recreated and a load snapshot request is sent to the firecracker process. The return code from firecracker is checked to verify that the operation was successful. This functionality is implemented in the runtime.

Note that LoadSnapshot assumes that a tap with exactly the same name, IP, and MAC as before the VM was offloaded exists. Currently, we recommend removing the tap after calling Offload and re-creating it before calling LoadSnapshot: if these two calls are made back to back (as may happen in tests), a “Tap is busy” error occurs.
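A caller could verify part of this precondition before calling LoadSnapshot. The sketch below only checks that an interface with the expected name exists on the host; it does not verify the IP or MAC, and the helper name `checkTapExists` is hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// checkTapExists reports an error if no network interface named tapName
// is present on the host. It does not inspect the interface's addresses.
func checkTapExists(tapName string) error {
	if _, err := net.InterfaceByName(tapName); err != nil {
		return fmt.Errorf("tap %q is not available: %w", tapName, err)
	}
	return nil
}
```

Failing early with a clear message is easier to debug than a failed load snapshot request coming back from the firecracker process.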

Limitations

  1. When calling LoadSnapshot immediately after Offload, starting the shim fails with an error that the shimSocket address is in use. A workaround is to introduce a sleep of 10-100ms after Offload, depending on the system. This does not happen for the fccontrol shim socket.

     ERROR: VM with ID "3" already exists (socket: "/containerd-shim/53d9435747fdf335f1601ccebf98aa71b29f871fcdc68c595c22ca8b0a64d53d.sock")

  2. Calling StopVM on a microVM that has been loaded from a snapshot results in an error, because we lose the connection to the agent running inside the microVM.

  3. Performance: in our experiments, re-creating a shim process takes about 30ms before the snapshot can be loaded in Firecracker. We have not yet investigated this issue. Our intuition is that shim start-up should not exceed 5-10ms, as is the case for starting a Firecracker process.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Plamen Petrov <[email protected]>
Notes:
1. Uses logging-only branch from ustiugov/firecracker-go-sdk
2. Firecracker logs path is hard-coded.

Signed-off-by: Plamen Petrov <[email protected]>
firecracker update

Signed-off-by: Plamen Petrov <[email protected]>
* Check that shim dir exists when loading shim
* No longer try to create shim dir when loading shim, as it must exist

Signed-off-by: Plamen Petrov <[email protected]>
@plamenmpetrov plamenmpetrov changed the title MicroVM Snapshotting Support Firecracker Snapshots Support Sep 22, 2020