Switch to Paramiko SSHClient with Timeouts & Logging to Investigate Rare Lockups #507
This PR might also help address #494.
RMS/DownloadPlatepar.py
Outdated
    # Connect with timeouts
    ssh, sftp = getSSHClientAndSFTP(
        config.hostname,
        port=22,
Please use config.host_port here instead of hard-coding port 22.
Done. Thank you.
I really like this change. Sometimes my stations "freeze" for a long time before starting because the connection is silently dropped and RMS has to wait through all the retries before continuing. But I'd recommend finding out what connection rate limit is actually set at the UWO firewall, or stations may end up locked out "forever". Some time ago I figured out that connection requests to GMN's port 22 are silently ignored after 5 consecutive tries in a short period, whether or not the connections were established successfully. I didn't determine the length of that period, but you can surely reproduce it by connecting 5 times in 1 minute. For instance, when RMS starts it already uses up 2 or 3 tries checking for a new platepar and mask on the server or uploading processed files (this could be improved in the future by doing it all in a single shot, or by keeping the SSH session alive during startup). If a user has more than one station behind the same IP address, or a multi-station setup, that already uses up all 5 allowed connections, so it would be good to check this behavior with IT services and see whether it can be changed.
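To illustrate the "single shot" idea, here is a minimal sketch in which every startup task shares one connection, so only one attempt counts against the firewall limit. The names `run_with_single_session`, `open_session`, and `tasks` are hypothetical and do not exist in RMS. The sketch only assumes that `open_session()` returns an object with a `close()` method, such as an SFTP client.

```python
def run_with_single_session(open_session, tasks):
    """Run every task against one shared session (hypothetical helper).

    open_session: callable that performs the single connection attempt.
    tasks: callables that each take the open session and return a result.
    """
    session = open_session()  # the only connection attempt made here
    try:
        # Each task reuses the same session instead of reconnecting.
        return [task(session) for task in tasks]
    finally:
        # Close the session even if one of the tasks raises.
        session.close()
```

With this shape, the platepar check, the mask check, and an upload would be three tasks passed in one call, rather than three separate connections.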
Agreed. I'm trying to find out. Yesterday, I got locked out after testing a script that made too many consecutive requests. I'm leaving this as a draft PR until we get better clarity.
We’ve had sporadic reports of lockups that are not reproducible in our test environment. The lockups are happening in UploadManager:
This PR:
- Replaces direct Transport usage with Paramiko’s recommended SSHClient().connect() approach.
- Adds timeouts (timeout, banner_timeout, auth_timeout) to avoid infinite waits on bad networks.
- Enables keepalives (transport.set_keepalive(...)) so idle connections aren’t silently dropped on slow machines/connections.
- Wraps the connection in a try/finally block to prevent lingering connections.

Why?
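For reference, the connection pattern described in the PR's list of changes can be sketched roughly as follows. This is not the PR's actual getSSHClientAndSFTP() implementation. The helper name `sftp_session`, its parameters, and the 30-second defaults are assumptions made for illustration. The optional `client` argument exists only so the sketch can be exercised with a stub instead of a real server.

```python
import contextlib


@contextlib.contextmanager
def sftp_session(hostname, port, username, key_filename,
                 client=None, timeout=30.0, keepalive_interval=30):
    """Yield an SFTP client and always close the connection afterwards.

    Hypothetical helper for illustration only.
    """
    if client is None:
        # Imported lazily so the sketch can be run with a stub client.
        import paramiko
        client = paramiko.SSHClient()
        # Assumption: the server's host key is already in known_hosts.
        client.load_system_host_keys()

    sftp = None
    try:
        # Bound the TCP connect, the SSH banner wait, and authentication.
        client.connect(
            hostname,
            port=port,
            username=username,
            key_filename=key_filename,
            timeout=timeout,
            banner_timeout=timeout,
            auth_timeout=timeout,
        )
        transport = client.get_transport()
        if transport is not None:
            # Send keepalive packets so an idle link is not silently dropped.
            transport.set_keepalive(keepalive_interval)
        sftp = client.open_sftp()
        yield sftp
    finally:
        # Runs on success and on error, so no connection is left lingering.
        if sftp is not None:
            sftp.close()
        client.close()
```

A caller would use it as `with sftp_session(config.hostname, config.host_port, user, key_path) as sftp:`, where `user` and `key_path` stand in for whatever the station configuration provides.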
Sample Log: