[lxd] Add 60 seconds timeout for state operations #1821

townsend2010 · 2020-10-30T13:46:13Z

If no timeout is set, LXD uses a hardcoded 30-second timeout when waiting on
operations to complete. If that wait times out, it can lead to incorrect
behavior in the LXD backend.

Fixes #1777

codecov · 2020-10-30T14:10:09Z

Codecov Report

Merging #1821 into master will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1821   +/-   ##
=======================================
  Coverage   76.95%   76.95%           
=======================================
  Files         229      229           
  Lines        8512     8512           
=======================================
  Hits         6550     6550           
  Misses       1962     1962

Impacted Files	Coverage Δ
src/platform/backends/lxd/lxd_virtual_machine.cpp	`98.74% <100.00%> (ø)`


Saviq · 2020-11-04T13:36:41Z

The problem I see with this is that 60s is just as arbitrary as 30s… and the "Operation cancelled" error is somewhat devoid of detail (I know…). Can we try and convert an "Operation cancelled" to, say, "Likely timed out"? Is there no detail available at all about why it was cancelled?

townsend2010 · 2020-11-04T13:45:02Z

Well, as mentioned, without putting a timeout at all here, LXD has a hard coded timeout of 30 seconds, which is what is tripping us up on shutdown. It seems LXD has some sort of race between the hard coded operation timeout and the actual timeout of trying to shut down an instance. I've looked through the LXD code in this area and I really don't understand why they did what they did.

That said, setting an explicit timeout for the state operation puts the onus on the state change, not the operation wait part, so it gets around this issue. I can easily make it 30 seconds since that is the arbitrary wait we have in other backends, but I think allowing more time for an instance to shut down is better. In fact, we've had complaints in the past about trying to cleanly shut down busy instances, where 30 seconds is not enough time.

Regarding "Operation cancelled", this should avoid that particular issue since we really won't be waiting on the operation itself. Also, no, LXD does not really offer anything helpful in its error messaging. We could catch this particular error and say something like "Timeout occurred waiting on operation to complete."
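The error translation suggested above could look roughly like the following. This is only a sketch under stated assumptions: `describe_lxd_error` is a hypothetical helper name, not a function from the Multipass code base, and the exact text LXD emits is assumed to contain the phrase quoted in this thread.

```cpp
#include <string>

// Hypothetical helper (not part of Multipass): turn LXD's terse cancellation
// error into a message that hints at the likely cause. Any other error text
// is returned unchanged.
std::string describe_lxd_error(const std::string& lxd_error)
{
    // Assumption: the LXD error contains this phrase, as quoted in the thread.
    static const std::string cancelled = "Operation cancelled";

    if (lxd_error.find(cancelled) != std::string::npos)
        return "Timeout occurred waiting on operation to complete.";

    return lxd_error;
}
```

Matching on the message text is fragile if LXD changes its wording, so this would be a stopgap rather than a substitute for a structured error code.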

Saviq · 2020-11-05T07:44:44Z

> That said, setting an explicit timeout for the state operation puts the onus on the state change, not the operation wait part, so it gets around this issue. I can easily make it 30 seconds since that is the arbitrary wait we have in other backends, but I think allowing more time for an instance to shut down is better. In fact, we've had complaints in the past about trying to cleanly shut down busy instances and 30 seconds is not enough time.

OK, so I was missing that context, that this is different to the internal LXD timeout. Should we explain in a comment?

townsend2010 · 2020-11-05T13:52:33Z

> Should we explain in a comment?

Sure, I can add a comment, but it'll be verbose in order to explain why this was added. But really, adding an explicit timeout here puts us in control of the timeout and should be used regardless of working around LXD idiosyncrasies.
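To make the distinction in this thread concrete, here is a minimal sketch of a state-change request body that carries its own timeout. It assumes the LXD REST API's instance state request accepts `action` and `timeout` fields; `make_state_request` is a hypothetical helper written with the standard library only and is not the code in this PR.

```cpp
#include <chrono>
#include <string>

// Hypothetical helper (not the PR's implementation): build the JSON body for
// an instance state change with an explicit timeout, so the limit applies to
// the state change itself rather than to the separate operation wait.
// Assumption: `action` is a fixed keyword such as "stop", so no JSON escaping
// is performed here.
std::string make_state_request(const std::string& action, std::chrono::seconds timeout)
{
    return "{\"action\": \"" + action + "\", \"timeout\": " + std::to_string(timeout.count()) + "}";
}
```

For example, `make_state_request("stop", std::chrono::seconds{60})` yields `{"action": "stop", "timeout": 60}`. Taking `std::chrono::seconds` rather than a bare integer keeps the unit explicit at the call site.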

townsend2010 · 2020-11-06T14:55:09Z

In some testing to check on some behaviors, I found some issues with this, so will continue working on it.

joes · 2022-09-22T09:38:58Z

When I try to delete a multipass lxd instance it will quite often result in a timeout:

$ multipass delete k8-devnode-2
[2022-09-22T09:30:58.018] [error] [lxd request] Timeout getting response for GET operation on unix://multipass/var/snap/lxd/common/lxd/[email protected]/operations/36a26883-b3ec-4f5a-bd00-325cd5dfc150/wait?project=multipass
[2022-09-22T09:30:58.018] [error] [lxd request] Timeout getting response for GET operation on unix://multipass/var/snap/lxd/common/lxd/[email protected]/operations/36a26883-b3ec-4f5a-bd00-325cd5dfc150/wait?project=multipass
delete failed: Timeout getting response for GET operation on unix://multipass/var/snap/lxd/common/lxd/[email protected]/operations/36a26883-b3ec-4f5a-bd00-325cd5dfc150/wait?project=multipass

As for the VM it only got stopped (not deleted):

$ multipass list
k8-devnode-2          Stopped           --               Ubuntu 20.04 LTS

Would this pull request fix this @townsend2010 or should I file a separate issue?

$ multipass version
multipass   1.10.1
multipassd  1.10.1

$ multipass get local.driver
lxd

$ lxd version
5.5

I also expose multipassd to the network and set the environment for this:

$ echo $MULTIPASS_SERVER_ADDRESS
multipass.intra:51005

[lxd] Add 60 seconds timeout for state operations

a76ca22

townsend2010 marked this pull request as draft November 6, 2020 14:55

Base automatically changed from master to main March 3, 2021 13:41
