
[lxd] Add 60 seconds timeout for state operations #1821

Draft
townsend2010 wants to merge 1 commit into main
Conversation

townsend2010
Contributor

If no timeout is set, LXD uses a hardcoded 30 second timeout when waiting on
operations to complete and if the wait timeout occurs, it can lead to incorrect
behavior in the LXD backend.

Fixes #1777

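For reference, LXD's instance state endpoint accepts an explicit timeout field in the request body, and that is the knob this change turns. The sketch below is illustrative only: it assumes the Qt JSON types the LXD backend already uses, and make_stop_request is a made-up helper rather than the actual code in lxd_virtual_machine.cpp.

#include <QJsonDocument>
#include <QJsonObject>

// Hypothetical sketch: build a state-change body with an explicit timeout so
// LXD's internal 30-second operation wait no longer governs the shutdown.
QJsonDocument make_stop_request()
{
    QJsonObject body;
    body["action"] = "stop";
    body["timeout"] = 60; // without this, LXD falls back to its hardcoded 30 seconds
    body["force"] = false;

    return QJsonDocument{body};
}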
@codecov

codecov bot commented Oct 30, 2020

Codecov Report

Merging #1821 into master will not change coverage.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #1821   +/-   ##
=======================================
  Coverage   76.95%   76.95%           
=======================================
  Files         229      229           
  Lines        8512     8512           
=======================================
  Hits         6550     6550           
  Misses       1962     1962           
Impacted Files | Coverage Δ
src/platform/backends/lxd/lxd_virtual_machine.cpp | 98.74% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Saviq
Collaborator

Saviq commented Nov 4, 2020

The problem I see with this is that 60s is just as arbitrary as 30s… and the "Operation cancelled" error is somewhat devoid of detail (I know…). Can we try and convert an "Operation cancelled" into, say, "Likely timed out"? Is there no detail available at all about why it was cancelled?

@townsend2010
Contributor Author

townsend2010 commented Nov 4, 2020

Well, as mentioned, without setting a timeout here at all, LXD has a hardcoded timeout of 30 seconds, which is what is tripping us up on shutdown. It seems LXD has some sort of race between the hardcoded operation timeout and the actual timeout of trying to shut down an instance. I've looked through the LXD code in this area and I really don't understand why they did what they did.

That said, setting an explicit timeout for the state operation puts the onus on the state change, not the operation wait, so it gets around this issue. I can easily make it 30 seconds, since that is the arbitrary wait we have in other backends, but I think allowing more time for an instance to shut down is better. In fact, we've had complaints in the past about trying to cleanly shut down busy instances, and 30 seconds is not enough time.

Regarding "Operation cancelled", this should avoid that particular issue since we really won't be waiting on the operation itself. Also, no, LXD does not really offer anything helpful in its error messaging. We could catch this particular error and say something like "Timeout occurred waiting on operation to complete."
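Roughly what I mean, as an illustrative sketch only (the exception type and the wrapped wait callable here are placeholders, not our actual error handling):

#include <stdexcept>
#include <string>

// Hypothetical wrapper: run the operation wait and reword LXD's terse
// "Operation cancelled" error as a clearer timeout message.
template <typename WaitFn>
void wait_with_clear_error(WaitFn&& wait)
{
    try
    {
        wait();
    }
    catch (const std::runtime_error& e)
    {
        if (std::string{e.what()}.find("Operation cancelled") != std::string::npos)
            throw std::runtime_error{"Timeout occurred waiting on operation to complete."};
        throw;
    }
}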

@Saviq
Collaborator

Saviq commented Nov 5, 2020

> That said, setting an explicit timeout for the state operation puts the onus on the state change, not the operation wait, so it gets around this issue. I can easily make it 30 seconds, since that is the arbitrary wait we have in other backends, but I think allowing more time for an instance to shut down is better. In fact, we've had complaints in the past about trying to cleanly shut down busy instances, and 30 seconds is not enough time.

OK, so I was missing that context: this is different from the internal LXD timeout. Should we explain in a comment?

@townsend2010
Contributor Author

> Should we explain in a comment?

Sure, I can add a comment, but it'll be verbose in order to explain why this was added. But really, adding an explicit timeout here puts us in control of the timeout and should be done regardless of any LXD idiosyncrasies we're working around.
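For what it's worth, something along these lines might do (illustrative only; the constant name is made up):

// Explicit timeout on the state change request. Without one, LXD applies its own
// hardcoded 30-second wait on the operation, which can expire before a busy
// instance finishes shutting down and leave the backend in an inconsistent state.
constexpr auto lxd_state_timeout_seconds = 60; // hypothetical name, for illustration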

@townsend2010
Contributor Author

While testing to check on some behaviors, I found some issues with this, so I will continue working on it.

@townsend2010 townsend2010 marked this pull request as draft November 6, 2020 14:55
Base automatically changed from master to main March 3, 2021 13:41
@joes

joes commented Sep 22, 2022

When I try to delete a Multipass LXD instance, it quite often results in a timeout:

$ multipass delete k8-devnode-2
[2022-09-22T09:30:58.018] [error] [lxd request] Timeout getting response for GET operation on unix://multipass/var/snap/lxd/common/lxd/[email protected]/operations/36a26883-b3ec-4f5a-bd00-325cd5dfc150/wait?project=multipass
[2022-09-22T09:30:58.018] [error] [lxd request] Timeout getting response for GET operation on unix://multipass/var/snap/lxd/common/lxd/[email protected]/operations/36a26883-b3ec-4f5a-bd00-325cd5dfc150/wait?project=multipass
delete failed: Timeout getting response for GET operation on unix://multipass/var/snap/lxd/common/lxd/[email protected]/operations/36a26883-b3ec-4f5a-bd00-325cd5dfc150/wait?project=multipass

As for the VM, it only got stopped (not deleted):

$ multipass list
k8-devnode-2          Stopped           --               Ubuntu 20.04 LTS

Would this pull request fix this, @townsend2010, or should I file a separate issue?

$ multipass version
multipass   1.10.1
multipassd  1.10.1
$ multipass get local.driver
lxd
$ lxd version
5.5

I also expose multipassd on the network and have the environment variable set for this:

$ echo $MULTIPASS_SERVER_ADDRESS
multipass.intra:51005

Development

Successfully merging this pull request may close these issues.

[lxd] race-y deadlock on purge
3 participants