Switch to Paramiko SSHClient with Timeouts & Logging to Investigate Rare Lockups #507

Cybis320 · 2025-01-22T22:22:11Z

We’ve had sporadic reports of lockups that are not reproducible in our test environment. The lockups are happening in UploadManager:

2025/01/13 14:28:25-INFO-StartCapture-line:683 - Uploading data before exiting...
2025/01/13 14:28:25-INFO-UploadManager-line:222 - Establishing SSH connection to: gmn.uwo.ca:22...
2025/01/13 14:28:26-INFO-UploadManager-line:76 - Trying ssh-agent key b'b3e09fd48cfea47fcf9d502c74436898'
2025/01/13 14:28:27-INFO-UploadManager-line:81 - ... success!

---------------- HANGS HERE INDEFINITELY --------------

2025/01/13 14:28:27-INFO-UploadManager-line:276 - Copying /home/rms/RMS_data/ArchivedFiles/US001P_20250113_003623_928590_detected.tar.bz2 (60.14MB) to files/US001P_20250113_003623_928590_detected.tar.bz2
2025/01/13 14:29:39-INFO-UploadManager-line:290 - File upload verified: files/US001P_20250113_003623_928590_detected.tar.bz2
2025/01/13 14:29:39-INFO-UploadManager-line:495 - Upload successful!

This PR

Replaces the low-level Transport usage with Paramiko’s recommended SSHClient().connect() approach.
Adds generous (5-minute) connection-level timeouts (timeout, banner_timeout, auth_timeout) to avoid infinite waits on bad networks.
Enables a 30‑second keepalive (transport.set_keepalive(...)) so idle connections aren’t silently dropped on slow machines/connections.
Refactors the authentication fallback: if the primary private key fails, it still falls back to the SSH agent.
Introduces more debug logging (including Paramiko version) to help us diagnose if issues persist.
Increases retry/backoff from 2 to 10 seconds, giving more time for the network to stabilize between attempts.
Ensures we properly close SSH and SFTP sessions in a try/finally block to prevent lingering connections.
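
The list above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names (connect_key_then_agent, list_remote_dir), the constants, and the choice to try the key file and the agent in two separate connect() calls are assumptions. The client argument is expected to be a paramiko.SSHClient; the calls used on it (connect, get_transport, open_sftp, close) and on the transport (set_keepalive) are standard Paramiko methods.

```python
import logging

try:
    import paramiko
    SSH_ERROR = paramiko.SSHException
except ImportError:
    # Paramiko missing: fall back to a broad type so the control flow can
    # still be exercised against a stand-in client object.
    SSH_ERROR = Exception

log = logging.getLogger("ssh_sketch")

TIMEOUT_S = 300    # 5 minutes for TCP connect, banner, and auth (assumed value)
KEEPALIVE_S = 30   # keepalive interval described in the PR


def connect_key_then_agent(client, hostname, port, username, key_file):
    """Connect `client` with the key file first, then fall back to ssh-agent."""
    common = dict(port=port, username=username, timeout=TIMEOUT_S,
                  banner_timeout=TIMEOUT_S, auth_timeout=TIMEOUT_S,
                  look_for_keys=False)
    try:
        client.connect(hostname, key_filename=key_file,
                       allow_agent=False, **common)
        log.info("SSHClient connected (key file).")
    except SSH_ERROR as exc:
        log.warning("Key file failed (%s); trying ssh-agent.", exc)
        client.connect(hostname, allow_agent=True, **common)
        log.info("SSHClient connected (ssh-agent).")
    # Send a keepalive packet every KEEPALIVE_S seconds on the open transport.
    client.get_transport().set_keepalive(KEEPALIVE_S)
    return client


def list_remote_dir(client, hostname, port, username, key_file, remote_dir):
    """Open an SFTP channel, list a directory, and always close everything."""
    sftp = None
    try:
        connect_key_then_agent(client, hostname, port, username, key_file)
        sftp = client.open_sftp()
        return sftp.listdir(remote_dir)
    finally:
        # Runs on success and on any exception, so no session is left open.
        if sftp is not None:
            sftp.close()
        client.close()
```

With Paramiko installed, the client would be created with paramiko.SSHClient() and given host keys (for example via load_system_host_keys()) before being passed in; the default policy rejects unknown hosts.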

Why?

I haven’t been able to reproduce the reported lockups locally. This change is my best guess at a fix and prevention.
The expanded debug logs will make it easier to pinpoint what’s happening if a station still sees hangs.
Properly closing resources should reduce the risk of orphaned sessions that might have caused issues.

Sample Log:

2025/01/22 21:31:53-INFO-DownloadMask-line:50 - Checking for new mask on the server...
2025/01/22 21:31:53-INFO-UploadManager-line:144 - Paramiko version: 3.4.1
2025/01/22 21:31:53-INFO-UploadManager-line:145 - Establishing SSH connection to: gmn.uwo.ca:22...
2025/01/22 21:31:54-INFO-UploadManager-line:161 - SSHClient connected successfully (key file).
2025/01/22 21:31:54-INFO-UploadManager-line:184 - Keepalive set to 30 seconds
2025/01/22 21:31:54-INFO-UploadManager-line:187 - Opening SFTP channel...
2025/01/22 21:31:54-INFO-UploadManager-line:189 - SFTP channel opened.
2025/01/22 21:31:55-INFO-DownloadMask-line:89 - files/masks exists
2025/01/22 21:31:55-INFO-DownloadMask-line:107 - Most recent flat /home/bolide/RMS_data/US005A/CapturedFiles/US005A_20250122_013222_354396/flat.bmp
2025/01/22 21:31:55-INFO-DownloadMask-line:112 - Uploading to files/masks/ as US005A_20250122_flat.bmp
2025/01/22 21:31:56-INFO-DownloadMask-line:122 - Not removing newly uploaded file US005A_20250122_flat.bmp
2025/01/22 21:31:56-INFO-DownloadMask-line:142 - No new mask on the server!

Cybis320 · 2025-01-22T22:42:12Z

This PR might also help address #494

adalava · 2025-01-22T23:19:16Z

RMS/DownloadPlatepar.py

+        # Connect with timeouts
+        ssh, sftp = getSSHClientAndSFTP(
+            config.hostname,
+            port=22,


please use config.host_port

Done. Thank you.

adalava · 2025-01-22T23:42:57Z

I really like this change. Sometimes my stations "freeze" for a long time before starting because the connection is silently dropped and RMS has to wait out all the retries before continuing. But I'd recommend finding out what connection rate limit is actually set at the UWO firewall, or stations may be locked out "forever".

Some time ago I figured out that connection requests to GMN's port 22 are silently ignored after 5 consecutive tries in a short period, whether or not the connection was established successfully. I didn't determine the exact window, but you can reliably reproduce it by connecting 5 times in 1 minute (for instance echo "ls" | sftp [email protected]:files/processed). The IP address gets banned temporarily and you can only reconnect after a few minutes (somewhere between 10 and 30 minutes). I don't know what happens if a station keeps retrying at 60-second intervals in a loop; there's a small chance that the firewall will never unban that IP address. Maybe the wait between retries would need to increase dynamically.
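
One way to grow the wait between retries is a capped geometric backoff. The sketch below is only an illustration of that idea: the function names, the 60-second base, the doubling factor, the 30-minute cap, and the limit of 3 retries (4 attempts in total, to stay under the 5 tries observed above) are all assumptions, since the actual firewall limits have not been confirmed.

```python
import logging
import random
import time

log = logging.getLogger("backoff_sketch")


def retry_delays(max_retries=3, base_s=60.0, factor=2.0, cap_s=1800.0,
                 jitter_s=0.0):
    """Yield the wait before each retry, growing geometrically up to cap_s."""
    for attempt in range(max_retries):
        delay = min(cap_s, base_s * factor ** attempt)
        # Optional random jitter so several stations behind one IP address
        # do not all retry at the same instant.
        yield delay + random.uniform(0.0, jitter_s)


def connect_with_backoff(connect, delays, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping longer after each failure."""
    for delay in delays:
        try:
            return connect()
        except OSError as exc:
            # Socket-level failures such as timeouts are OSError subclasses.
            log.warning("Connection failed (%s); retrying in %.0f s", exc, delay)
            sleep(delay)
    # Final attempt: let any error propagate to the caller.
    return connect()
```

With the defaults, the waits are 60, 120, and 240 seconds, so a station makes at most 4 connection attempts spread over about 7 minutes.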

When RMS starts, it already uses up 2 or 3 tries checking for a new platepar and mask on the server, or uploading processed files (this could be improved in the future by doing it all in a single shot, or by keeping the SSH session alive during startup?). If a user has more than one station on the same IP address, or a multi-station setup, that already uses up all 5 allowed connections, so it would be good to check the behavior with IT services and see whether it can be changed.
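
The "single shot" idea could look something like the sketch below: open one session and run every startup task over it, so startup costs one connection instead of two or three. This is not how RMS currently works. shared_sftp and run_startup_tasks are hypothetical names, and connect stands for any callable that returns an (ssh, sftp) pair, as getSSHClientAndSFTP does in the diff reviewed above.

```python
from contextlib import contextmanager


@contextmanager
def shared_sftp(connect):
    """Open one SSH/SFTP session via connect() and close it on exit."""
    ssh, sftp = connect()
    try:
        yield sftp
    finally:
        # Close the SFTP channel first, then the underlying SSH connection.
        sftp.close()
        ssh.close()


def run_startup_tasks(connect, tasks):
    """Run every task against the same SFTP channel and collect the results."""
    with shared_sftp(connect) as sftp:
        return [task(sftp) for task in tasks]
```

Each task is a function that takes the open SFTP client, for example one that checks for a new platepar and one that checks for a new mask. An exception in any task still closes the session, at the cost of skipping the remaining tasks.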

Cybis320 · 2025-01-23T22:52:09Z

Agreed. I'm trying to find out. Yesterday, I got locked out after testing a script that made too many consecutive requests. I'm leaving this as a draft PR until we get better clarity.

Cybis320 added 2 commits January 22, 2025 14:57

Switch to Paramiko SSHClient with Timeouts & Logging

f3eafdf

Remove duplicate logging

43581fc

Cybis320 requested a review from dvida January 22, 2025 22:22

Add log entries on closing SFTP and SSH

4b669bc

adalava reviewed Jan 22, 2025

Use config.host_port systematically

0c6fc6d

Add more debug around potential trouble makers and remove f-strings

f8987d6

dvida marked this pull request as ready for review January 25, 2025 18:17

dvida merged commit 28ee2c5 into CroatianMeteorNetwork:prerelease Jan 25, 2025
