Reconnect SSH after disconnect #71

jackbaker1001 · 2023-06-08T02:49:36Z

This PR implements feature request #64.

In _poll_slurm, we now do two things:

When get_status is called, every conn.run() command goes through a new method, run_remote_command_with_reconnect.
A new method, get_ssh_conn_status, runs concurrently with get_status. It polls the SSH connection with a test command and reconnects if the test command fails.
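
The two changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: FlakyConnection, Poller, poll_job, the connect factory, the retry limit, and the polling intervals are all hypothetical stand-ins, and only the method names run_remote_command_with_reconnect and get_ssh_conn_status and the counter ssh_connect_retries are taken from the description.

```python
import asyncio


class FlakyConnection:
    """Hypothetical stand-in for an SSH connection; fails a set number of times."""

    def __init__(self, failures=0):
        self.failures = failures

    async def run(self, command):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("connection lost")
        return f"ok: {command}"


class Poller:
    """Hypothetical holder for the connection and the two polling loops."""

    def __init__(self, connect, max_retries=3):
        self.connect = connect  # async factory returning a fresh connection
        self.max_retries = max_retries  # assumed limit, not from the PR
        self.ssh_connect_retries = 0
        self.conn = None

    async def run_remote_command_with_reconnect(self, command):
        """Run a command, reconnecting and retrying on connection errors."""
        if self.conn is None:
            self.conn = await self.connect()
        while True:
            try:
                result = await self.conn.run(command)
                self.ssh_connect_retries = 0
                return result
            except OSError:  # ConnectionError is a subclass of OSError
                self.ssh_connect_retries += 1
                if self.ssh_connect_retries > self.max_retries:
                    raise
                self.conn = await self.connect()

    async def get_ssh_conn_status(self, interval, done):
        """Keepalive loop: run a test command until the job poll finishes."""
        while not done.is_set():
            await self.run_remote_command_with_reconnect("echo ping")
            await asyncio.sleep(interval)

    async def poll_job(self, interval, polls_needed, done):
        """Pretend the job completes after `polls_needed` status checks."""
        for _ in range(polls_needed):
            await self.run_remote_command_with_reconnect("status")
            await asyncio.sleep(interval)
        done.set()
        return "COMPLETED"

    async def poll_slurm(self):
        """Run the job poll and the keepalive concurrently, at different intervals."""
        done = asyncio.Event()
        status, _ = await asyncio.gather(
            self.poll_job(0.01, 3, done),
            self.get_ssh_conn_status(0.005, done),
        )
        return status


async def demo():
    made = []

    async def connect():
        # The first connection drops once; every later one is healthy.
        conn = FlakyConnection(failures=1 if not made else 0)
        made.append(conn)
        return conn

    poller = Poller(connect)
    status = await poller.poll_slurm()
    return status, len(made)
```

Calling asyncio.run(demo()) simulates one dropped connection during polling: the first command fails, a second connection is opened, and the poll still finishes. A real implementation would catch the SSH library's own disconnect exceptions rather than the generic OSError used here.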

I have checked this manually on my laptop (2021 Mac M1 Pro, running macOS Ventura 13.3.1 [22E261]) by (i) running a trivial sample workflow that is dispatched to Perlmutter, (ii) turning off my Wi-Fi connection once in _poll_slurm, and (iii) turning it back on once the debug-level logger message:

app_log.debug(f"SSH connection error {e}, reconnecting, attempt {self.ssh_connect_retries}...")

appears. I can confirm that the SSH connection is reestablished and the workflow runs to completion without error.

Some things to note:

This code only checks for connection drops in _poll_slurm, whereas the connection can drop and cause hangs at any call to conn.run(). To be thorough, we should use the new method run_remote_command_with_reconnect in place of every conn.run() call.
Does this code also work with the plain SSH executor? Let's discuss.
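
On the first note, one possible way to keep any single remote call from hanging indefinitely is to bound it with a timeout, so that a hang surfaces as an error that retry logic could act on. The sketch below is an assumption for discussion, not part of this PR: run_with_timeout, hung_run, quick_run, and the timeout values are hypothetical, and only asyncio.wait_for is a real standard-library call.

```python
import asyncio


async def run_with_timeout(run, command, timeout):
    """Turn a hung remote call into a TimeoutError the caller can retry on.

    `run` is any async callable taking a command string (hypothetical interface).
    """
    try:
        return await asyncio.wait_for(run(command), timeout=timeout)
    except asyncio.TimeoutError:
        raise TimeoutError(
            f"remote command timed out after {timeout}s: {command}"
        ) from None


async def hung_run(command):
    # Simulates a call that never returns because the connection dropped.
    await asyncio.sleep(3600)


async def quick_run(command):
    # Simulates a healthy connection.
    return f"ok: {command}"


async def demo_timeout():
    ok = await run_with_timeout(quick_run, "squeue", timeout=1.0)
    try:
        await run_with_timeout(hung_run, "squeue", timeout=0.01)
    except TimeoutError:
        return ok, "timed out"
    return ok, "no timeout"
```

Whether a timeout is appropriate depends on how long legitimate remote commands can take, so any value would need to be configurable.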

I have added the tests to cover my changes.
I have updated the documentation, VERSION, and CHANGELOG accordingly.
I have read the CONTRIBUTING document.

CLAassistant · 2023-06-08T02:49:42Z

All committers have signed the CLA.

jackbaker1001 added 4 commits June 7, 2023 02:04

added new ssh connection polling and modified _poll_slurm to aync two…

9a63267

… polls with different duration

check for exception with test message instead

3ccfa85

various bug fixes. This code works if I manually turn off internet du…

3eb8d3b

…ring polling

pass forward conn to code after _poll_slurm

d2f248f

[pre-commit.ci] auto fixes from pre-commit.com hooks

bfa8049

for more information, see https://pre-commit.ci

Andrew-S-Rosen mentioned this pull request Aug 17, 2023

Add support for dropped connections Andrew-S-Rosen/covalent-hpc-plugin#26

Open
