Switch to Paramiko SSHClient with Timeouts & Logging to Investigate Rare Lockups #507

Merged
merged 5 commits into from
Jan 25, 2025

Cybis320

We’ve had sporadic reports of lockups that are not reproducible in our test environment. The lockups are happening in UploadManager:

2025/01/13 14:28:25-INFO-StartCapture-line:683 - Uploading data before exiting...
2025/01/13 14:28:25-INFO-UploadManager-line:222 - Establishing SSH connection to: gmn.uwo.ca:22...
2025/01/13 14:28:26-INFO-UploadManager-line:76 - Trying ssh-agent key b'b3e09fd48cfea47fcf9d502c74436898'
2025/01/13 14:28:27-INFO-UploadManager-line:81 - ... success!

---------------- HANGS HERE INDEFINITELY --------------

2025/01/13 14:28:27-INFO-UploadManager-line:276 - Copying /home/rms/RMS_data/ArchivedFiles/US001P_20250113_003623_928590_detected.tar.bz2 (60.14MB) to files/US001P_20250113_003623_928590_detected.tar.bz2
2025/01/13 14:29:39-INFO-UploadManager-line:290 - File upload verified: files/US001P_20250113_003623_928590_detected.tar.bz2
2025/01/13 14:29:39-INFO-UploadManager-line:495 - Upload successful!

This PR

  • Replaces the low-level Transport usage with Paramiko’s recommended SSHClient().connect() approach.
  • Adds generous (5 minutes) connection-level timeouts (timeout, banner_timeout, auth_timeout) to avoid infinite waits on bad networks.
  • Enables a 30‑second keepalive (transport.set_keepalive(...)) so idle connections aren’t silently dropped on slow machines/connections.
  • Refactors authentication fallback—if the primary private key fails, it still falls back to the SSH agent.
  • Introduces more debug logging (including Paramiko version) to help us diagnose if issues persist.
  • Increases retry/backoff from 2 to 10 seconds, giving more time for the network to stabilize between attempts.
  • Ensures we properly close SSH and SFTP sessions in a try/finally block to prevent lingering connections.
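
The connection flow listed above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: the helper names `connect_with_fallback` and `upload`, the constants, and the `client` / `auth_errors` parameters (present only so the flow can be exercised without a network) are all assumptions made for illustration. The Paramiko calls used (`SSHClient.connect` with `timeout`, `banner_timeout`, `auth_timeout`, `Transport.set_keepalive`, `open_sftp`, `SFTPClient.put`) are the library's documented API.

```python
import logging

log = logging.getLogger("ssh_sketch")

CONNECT_TIMEOUT_S = 300  # 5 minutes, as described in the PR
KEEPALIVE_S = 30         # 30-second keepalive, as described in the PR


def connect_with_fallback(hostname, port, username, key_path,
                          client=None, auth_errors=(Exception,)):
    """Hypothetical helper: return (ssh, sftp), closing the client on failure.

    `client` and `auth_errors` are illustration-only hooks; by default a
    real paramiko.SSHClient is created and SSHException triggers the fallback.
    """
    if client is None:
        import paramiko  # lazy import: only needed for a real connection
        log.info("Paramiko version: %s", paramiko.__version__)
        client = paramiko.SSHClient()
        client.load_system_host_keys()  # assumption: host is in known_hosts
        auth_errors = (paramiko.SSHException,)

    timeouts = dict(timeout=CONNECT_TIMEOUT_S,
                    banner_timeout=CONNECT_TIMEOUT_S,
                    auth_timeout=CONNECT_TIMEOUT_S)
    try:
        try:
            # First attempt: the configured private key only.
            client.connect(hostname, port=port, username=username,
                           key_filename=key_path, allow_agent=False,
                           look_for_keys=False, **timeouts)
            log.info("SSHClient connected successfully (key file).")
        except auth_errors as exc:
            # Fallback: let the SSH agent offer its keys.
            log.warning("Key file failed (%s); trying ssh-agent", exc)
            client.connect(hostname, port=port, username=username,
                           allow_agent=True, look_for_keys=False, **timeouts)
        client.get_transport().set_keepalive(KEEPALIVE_S)
        log.info("Keepalive set to %d seconds", KEEPALIVE_S)
        sftp = client.open_sftp()
        return client, sftp
    except Exception:
        client.close()  # do not leave a half-open session behind
        raise


def upload(ssh, sftp, local_path, remote_path):
    """Copy one file and always close both sessions afterwards."""
    try:
        sftp.put(local_path, remote_path)
    finally:
        sftp.close()
        ssh.close()
```

A caller would obtain the pair with `connect_with_fallback(...)` and hand it to `upload(...)`, so the sessions are closed whether or not the transfer succeeds.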

Why?

  • I haven’t been able to reproduce the reported lockups locally. This change is my best guess at a fix and prevention.
  • The expanded debug logs will make it easier to pinpoint what’s happening if a station still sees hangs.
  • Properly closing resources should reduce the risk of orphaned sessions that might have caused issues.

Sample Log:

2025/01/22 21:31:53-INFO-DownloadMask-line:50 - Checking for new mask on the server...
2025/01/22 21:31:53-INFO-UploadManager-line:144 - Paramiko version: 3.4.1
2025/01/22 21:31:53-INFO-UploadManager-line:145 - Establishing SSH connection to: gmn.uwo.ca:22...
2025/01/22 21:31:54-INFO-UploadManager-line:161 - SSHClient connected successfully (key file).
2025/01/22 21:31:54-INFO-UploadManager-line:184 - Keepalive set to 30 seconds
2025/01/22 21:31:54-INFO-UploadManager-line:187 - Opening SFTP channel...
2025/01/22 21:31:54-INFO-UploadManager-line:189 - SFTP channel opened.
2025/01/22 21:31:55-INFO-DownloadMask-line:89 - files/masks exists
2025/01/22 21:31:55-INFO-DownloadMask-line:107 - Most recent flat /home/bolide/RMS_data/US005A/CapturedFiles/US005A_20250122_013222_354396/flat.bmp
2025/01/22 21:31:55-INFO-DownloadMask-line:112 - Uploading to files/masks/ as US005A_20250122_flat.bmp
2025/01/22 21:31:56-INFO-DownloadMask-line:122 - Not removing newly uploaded file US005A_20250122_flat.bmp
2025/01/22 21:31:56-INFO-DownloadMask-line:142 - No new mask on the server!

@Cybis320 Cybis320 requested a review from dvida January 22, 2025 22:22
@Cybis320

This PR might also help address #494

# Connect with timeouts
ssh, sftp = getSSHClientAndSFTP(
    config.hostname,
    port=22,

please use config.host_port

Done. Thank you.

@adalava

adalava commented Jan 22, 2025

I really like this change; sometimes my stations "freeze" for a long time before starting because the connection is silently dropped and they have to wait out all the retries before continuing. But I'd recommend finding out what the actual connection rate limit at the UWO firewall is, or stations may be locked out "forever".

Some time ago I figured out that connection requests to GMN's port 22 are silently ignored after 5 consecutive tries in a short period, no matter whether the connection was established successfully or not. I didn't determine the exact time window, but you can surely reproduce it by connecting 5 times in 1 minute (for instance echo "ls" | sftp [email protected]:files/processed). The IP address gets banned temporarily and you can only reconnect after a few minutes (somewhere between 10 and 30 minutes). I don't know what happens if a station keeps retrying at 60-second intervals in a loop; there's a small chance that the firewall will never unban that IP address. It may be necessary to increase the wait between retries dynamically.

When RMS starts it already wastes 2 or 3 tries checking for a new platepar and mask on the server, or uploading processed files (this could be improved in the future by doing all of it in a single shot, or by keeping the SSH session alive during startup?). If a user has more than one station on the same IP address, or a multistation scheme, that already uses up all 5 allowed connections, so it would be good to check the behavior with IT services and see if it can be changed.
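
One way to picture the "increase the wait dynamically" idea is a capped exponential backoff. This is only a sketch under assumptions: `backoff_delays` and `retry_with_backoff` are hypothetical names, the 10-second base mirrors the retry interval in this PR, and the 30-minute cap mirrors the upper end of the ban duration reported above. The real firewall limits are not known in this thread, so the numbers are placeholders rather than recommended values.

```python
import time


def backoff_delays(base=10.0, factor=2.0, cap=1800.0, attempts=5):
    """Yield `attempts` waits in seconds, growing by `factor`, capped at `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def retry_with_backoff(action, delays, retry_on=(OSError,), sleep=time.sleep):
    """Call action(); after each failure wait the next delay, then retry.

    Once the delays are exhausted, the last error is re-raised.
    """
    delays = iter(delays)
    while True:
        try:
            return action()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise
            sleep(delay)


if __name__ == "__main__":
    print(list(backoff_delays()))
```

With the defaults the waits are 10, 20, 40, 80 and 160 seconds, so repeated failures quickly stop hitting the server several times a minute. Whether that spacing is enough to avoid a ban depends on the actual firewall window.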

@Cybis320

Agreed. I'm trying to find out. Yesterday, I got locked out after testing a script that made too many consecutive requests. I'm leaving this as a draft PR until we get better clarity.

@dvida dvida marked this pull request as ready for review January 25, 2025 18:17
@dvida dvida merged commit 28ee2c5 into CroatianMeteorNetwork:prerelease Jan 25, 2025