This PR implements feature request #64.
While in `_poll_slurm`, we now do two things:

- When `get_status` is called, any `conn.run()` commands are run through a new method, `run_remote_command_with_reconnect`.
- A new method, `get_ssh_conn_status`, runs asynchronously alongside `get_status`. It polls the SSH connection with a test command and reconnects if the test command fails.

I have checked this rather manually on my laptop (2021 Mac M1 Pro, running macOS Ventura 13.3.1 [22E261]) by (i) running a trivial sample workflow which is dispatched to Perlmutter, (ii) turning off my wifi connection once in `_poll_slurm`, and (iii) turning it back on again once the debug-level logger message:

appears. I can confirm that the SSH connection is reestablished and the workflow runs to completion without error.
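For illustration only, here is a minimal sketch of the idea behind running a connection check alongside status polling. It is not the code in this PR: `FakeConn`, `connect`, `watch_connection`, and `poll_job` are hypothetical stand-ins, the dropped link is simulated with a flag, and the real SSH library calls are deliberately left out.

```python
import asyncio


class FakeConn:
    """Hypothetical stand-in for an SSH connection object."""

    def __init__(self):
        self.alive = True

    async def run(self, command):
        # Fail the way a dropped link might, instead of hanging forever.
        if not self.alive:
            raise ConnectionError("connection lost")
        return "ok"


async def connect():
    """Hypothetical connection factory (an assumption, not the plugin's API)."""
    return FakeConn()


async def watch_connection(state, stop, interval=0.01):
    """Send a test command periodically; reconnect when it fails."""
    while not stop.is_set():
        try:
            await state["conn"].run("echo ping")
        except ConnectionError:
            state["conn"] = await connect()
            state["reconnects"] += 1
        await asyncio.sleep(interval)


async def poll_job(state, polls=5, interval=0.02):
    """Stand-in for the status polling loop; drops the link once."""
    for i in range(polls):
        if i == 2:
            state["conn"].alive = False  # simulate the wifi going down
        await asyncio.sleep(interval)
    # By now the watcher has had time to replace the dead connection.
    return await state["conn"].run("squeue")


async def main():
    state = {"conn": await connect(), "reconnects": 0}
    stop = asyncio.Event()
    watcher = asyncio.create_task(watch_connection(state, stop))
    try:
        result = await poll_job(state)
    finally:
        stop.set()
        await watcher
    return result, state["reconnects"]


result, reconnects = asyncio.run(main())
```

The shared `state` dict is just a convenient way for both coroutines to see the replaced connection in this sketch; in an executor class the connection would more likely live on the instance.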
Some things to note:

- This code only checks for connection drops in `_poll_slurm`, whereas the connection can drop and cause hangs at any call to `conn.run()`. If we wanted to be super tight, we should use the new method `run_remote_command_with_reconnect` in place of every `conn.run()` command.
- Does this code also check out with the plain SSH executor? Let's discuss.
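To make the "wrap every `conn.run()`" option concrete for the discussion, a reconnect-and-retry wrapper could look roughly like the sketch below. The name `run_with_reconnect`, its parameters, and the `FlakyConn` test double are all assumptions made for this example; the actual signature of `run_remote_command_with_reconnect` in this PR may differ.

```python
import asyncio


class FlakyConn:
    """Hypothetical connection that fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures

    async def run(self, command):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("connection lost")
        return f"ran: {command}"


async def run_with_reconnect(holder, connect, command, retries=3, delay=0.0):
    """Run `command`; on a connection error, reconnect and try again.

    `holder` is a dict carrying the current connection under "conn" and
    `connect` is an async factory for a new one (both hypothetical).
    """
    for attempt in range(retries + 1):
        try:
            return await holder["conn"].run(command)
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            await asyncio.sleep(delay)
            holder["conn"] = await connect()


async def fresh_connection():
    # The replacement connection works on the first try.
    return FlakyConn(failures=0)


holder = {"conn": FlakyConn(failures=1)}
out = asyncio.run(run_with_reconnect(holder, fresh_connection, "squeue"))
```

Routing each remote call through one helper like this would keep the retry policy in a single place, at the cost of touching every call site.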