-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle various termination scenarios of the conversion process #772
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
deeplow
reviewed
Apr 15, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this! I like the solution overall and only have minor comments.
This was the first round. It included code review, except tests. I'll do those later. Just wanted to get something out.
deeplow
approved these changes
Apr 24, 2024
Pass the Document instance that will be converted to the `IsolationProvider.start_doc_to_pixels_proc()` method. Concrete classes can then associate this name with the started process, so that they can later on kill it.
Set a unique name for spawned containers, based on the ID of the provided document. This ID is not globally unique, as it has few bits of entropy. However, since we only want to avoid collisions within a single Dangerzone invocation, and since we can't support multiple containers running in parallel, this ID will suffice.
Extend the IsolationProvider class with a `terminate_doc_to_pixels_proc()` method, which must be implemented by the Qubes/Container providers and gracefully terminate a process started for the doc to pixels phase. Refs #563
Get the exit code of the spawned process for the doc-to-pixels phase, without timing out. More specifically, if the spawned process has not finished within a generous amount of time (hardcode to 15 seconds), return UnexpectedConversionError, with a custom message. This way, the happy path is not affected, and we still make our best to learn the underlying cause of the I/O error.
Previously, we always assumed that the spawned process would quit within 3 seconds. This was an arbitrary call, and did not work in practice. We can improve our standing here by doing the following: 1. Make `Popen.wait()` calls take a generous amount of time (since they are usually on the sad path), and handle any timeout errors that they throw. This way, a slow conversion process cleanup does not take too much of our users time, nor is it reported as an error. 2. Always make sure that once the conversion of doc to pixels is over, the corresponding process will finish within a reasonable amount of time as well. Fixes #749
Add some test cases in the isolation provider tests, that check how it behaves when a process completes successfully, lingers, or cannot terminate. These tests cannot run yet, since they must be imported by a concrete isolation provider test class. In subsequent commits, we will start enabling them.
Add termination-related tests for containers. To achieve this, we need to make a change to the container isolation provider. More specifically, we need to make the isolation provider yield control to the caller only when the container is up and running. Failure to do so may lead to lingering processes.
Add termination-related tests for Qubes. To achieve this, we need to make a change to the Qubes isolation provider. More specifically, we need to make the isolation provider yield control to the caller only when the disposable qube is up and running. Qubes does not provide us a solid guarantee to do so, but we've found a hacky workaround, whereby we wait until the `qrexec-client-vm` process opens a `/dev/xen` character device. This should happen, in theory, once the disposable qube is ready, and has sent a `MSG_SERVICE_CONNECT` RPC message to the caller.
Add termination tests for the Dummy provider, so that we can have cross-platform coverage in our Windows/macOS CI runners, which can't use the Container isolation providers.
deeplow
approved these changes
Apr 24, 2024
Approved, pending lint fix. |
Include the test dependencies when linting, especially `pytest`. We need this because `mypy` cannot understand the `pytest.raises` exception, and specifically the fact that it catches exceptions. It assumes that the next code block is unreachable, since it doesn't see any try ... except.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The conversion process can encounter various issues during its lifetime, that can affect its termination. For example, it may fail early, linger for a little while, get stuck, etc. Previously, we assumed that any such scenario would simply be resolved with a 3-second wait timeout, provided that the conversion has already finished successfully. Turns out that this is not the case, and we have several users who reported timeout issues (see #749).
In this PR, we replace these arbitrary timeouts with some termination logic, that tries to terminate first the process gracefully, and if it doesn't comply, forcefully. This seems simple enough, but in practice we have the following issues:
Due to the finicky nature of terminating our conversion processes, and because we haven't 100% nailed the underlying reason for #749, this PR adds cross-platform tests for virtually every termination scenario that can occur. The end goal is to never let users see another timeout exception again, no matter the underlying cause.
Note
This PR does not cover the termination scenarios of the pixels to PDF conversion process. That's because we're working on removing the need for a container in that phase (see #748).
Fixes #749
Refs #563