Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle various termination scenarios of the conversion process #772

Merged
merged 10 commits into from
Apr 24, 2024

Conversation

apyrgio
Copy link
Contributor

@apyrgio apyrgio commented Apr 9, 2024

The conversion process can encounter various issues during its lifetime, that can affect its termination. For example, it may fail early, linger for a little while, get stuck, etc. Previously, we assumed that any such scenario would simply be resolved with a 3-second wait timeout, provided that the conversion has already finished successfully. Turns out that this is not the case, and we have several users who reported timeout issues (see #749).

In this PR, we replace these arbitrary timeouts with some termination logic, that tries to terminate first the process gracefully, and if it doesn't comply, forcefully. This seems simple enough, but in practice we have the following issues:

  1. Killing the spawned container process does not always mean that the container will be killed. Take Docker for example, where the spawned process is connected to the Docker daemon. The Docker daemon is then responsible for starting the container.
  2. Killing the disposable qube is not possible out of the box in vanilla Qubes. Our best bet is to close its standard streams (see Shut down dangling containers / VMs #563 (comment)).

Due to the finicky nature of terminating our conversion processes, and because we haven't 100% nailed the underlying reason for #749, this PR adds cross-platform tests for virtually every termination scenario that can occur. The end goal is to never let users see another timeout exception again, no matter the underlying cause.

Note

This PR does not cover the termination scenarios of the pixels to PDF conversion process. That's because we're working on removing the need for a container in that phase (see #748).

Fixes #749
Refs #563

Copy link
Contributor

@deeplow deeplow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this! I like the solution overall and only have minor comments.

This was the first round. It included code review, except tests. I'll do those later. Just wanted to get something out.

dangerzone/isolation_provider/base.py Outdated Show resolved Hide resolved
dangerzone/isolation_provider/base.py Show resolved Hide resolved
dangerzone/isolation_provider/qubes.py Outdated Show resolved Hide resolved
tests/isolation_provider/base.py Outdated Show resolved Hide resolved
dangerzone/isolation_provider/container.py Show resolved Hide resolved
dangerzone/isolation_provider/qubes.py Show resolved Hide resolved
@apyrgio
Copy link
Contributor Author

apyrgio commented Apr 17, 2024

@deeplow I think I've addressed all your comments, and I've added a commit for something small that was missing: 1786d1b. Feel free to take a look.

apyrgio added 8 commits April 24, 2024 14:33
Pass the Document instance that will be converted to the
`IsolationProvider.start_doc_to_pixels_proc()` method. Concrete classes
can then associate this name with the started process, so that they can
later on kill it.
Set a unique name for spawned containers, based on the ID of the
provided document. This ID is not globally unique, as it has few bits of
entropy.  However, since we only want to avoid collisions within a
single Dangerzone invocation, and since we can't support multiple
containers running in parallel, this ID will suffice.
Extend the IsolationProvider class with a
`terminate_doc_to_pixels_proc()` method, which must be implemented by
the Qubes/Container providers and gracefully terminate a process started
for the doc to pixels phase.

Refs #563
Get the exit code of the spawned process for the doc-to-pixels phase,
without timing out. More specifically, if the spawned process has not
finished within a generous amount of time (hardcode to 15 seconds),
return UnexpectedConversionError, with a custom message.

This way, the happy path is not affected, and we still make our best to
learn the underlying cause of the I/O error.
Previously, we always assumed that the spawned process would quit
within 3 seconds. This was an arbitrary call, and did not work in
practice.

We can improve our standing here by doing the following:

1. Make `Popen.wait()` calls take a generous amount of time (since they
   are usually on the sad path), and handle any timeout errors that they
   throw. This way, a slow conversion process cleanup does not take too
   much of our users time, nor is it reported as an error.
2. Always make sure that once the conversion of doc to pixels is over,
   the corresponding process will finish within a reasonable amount of
   time as well.

Fixes #749
Add some test cases in the isolation provider tests, that check how it
behaves when a process completes successfully, lingers, or cannot
terminate.

These tests cannot run yet, since they must be imported by a concrete
isolation provider test class. In subsequent commits, we will start
enabling them.
Add termination-related tests for containers. To achieve this, we need
to make a change to the container isolation provider. More specifically,
we need to make the isolation provider yield control to the caller only
when the container is up and running. Failure to do so may lead to
lingering processes.
Add termination-related tests for Qubes. To achieve this, we need
to make a change to the Qubes isolation provider. More specifically,
we need to make the isolation provider yield control to the caller only
when the disposable qube is up and running.

Qubes does not provide us a solid guarantee to do so, but we've found a
hacky workaround, whereby we wait until the `qrexec-client-vm` process
opens a `/dev/xen` character device. This should happen, in theory, once
the disposable qube is ready, and has sent a `MSG_SERVICE_CONNECT` RPC
message to the caller.
Add termination tests for the Dummy provider, so that we can have
cross-platform coverage in our Windows/macOS CI runners, which can't use
the Container isolation providers.
@deeplow
Copy link
Contributor

deeplow commented Apr 24, 2024

Approved, pending lint fix.

Include the test dependencies when linting, especially `pytest`. We need
this because `mypy` cannot understand the `pytest.raises` exception, and
specifically the fact that it catches exceptions. It assumes that the
next code block is unreachable, since it doesn't see any try ... except.
@apyrgio apyrgio merged commit 94179a1 into main Apr 24, 2024
19 of 40 checks passed
@apyrgio apyrgio deleted the 749-timeouts branch April 24, 2024 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timeout after 3 seconds
2 participants