C daemon becomes unresponsive #1578
Comments
Thanks @cpswan I will try to reproduce this today |
The daemon I restarted when I opened this ticket has already become unresponsive (<2h) |
This looks like a dead monitor connection. Do you have logs so we can check when the last monitor heartbeat arrived? |
I've been collecting verbose logs for the past hour. I'll give it another hour and then try to connect and see what happens (and scoop up the logs). Right now the log is tailing in my console window and I see this every 5s:
|
When you try to connect does it pick up the notification, and if so can you send those logs? |
I just tried a connect (2h after leaving it) and everything was normal. I'll give it a while longer... |
Started my own to see if I can produce the same results |
I actually saw this on one of my machines too. I had to restart the daemon. If I see it lock up again I will grab the logs. |
Yay I saw the same before and now I have the issue again.. (All times in UTC)
Not responding to pings, and the Noop loop seems broken as I see no logs in tmux. Video to come. This is the last of the log in tmux:
Video: video1909482704.mp4, csshnpd log file |
Running a strace on the
So I sent it a HUP but it died (I captured this strace)
|
Looks like file descriptor 3 is, in normal use, the monitor socket (assuming a lot). Data is encrypted so I cannot tell for sure, but that would indicate trying to read from a socket, getting nothing, and then some sort of tight loop ensuing. GUESS not FACT
|
I have an strace running against the daemon, writing to a log file, so it might capture some clues |
Yay, it died in the same way, and I have the logs: the daemon logs
the same time in the strace log. The start of the 100% CPU is at
|
I also have logs... From the last successful monitor:
After that the monitor loop seems to fail. In the full log I can see several (successful) attempts to reconnect to the atServer, but the monitor loop never seems to get re-established. |
I think I know what's happening, will need to take a closer look at the code to confirm my hypothesis |
Implemented a monitor read fix in at_c. NoPorts branch pinned against it: c-daemon-haning |
When mbedtls_ssl_read returns 0 (socket closed without notify), we only partially handled it as a failure: the return value of the atclient function was still 0. It now returns -1 in that scenario. |
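A minimal sketch of the return-value mapping described in the comment above. This is not the at_c code: `tls_read_fn`, `monitor_read_once` and the two `fake_read_*` helpers are hypothetical names, and the function pointer stands in for a call such as mbedtls_ssl_read so the example builds without the library. For simplicity it treats every non-positive result as a failure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a TLS read call such as mbedtls_ssl_read:
 *   > 0  number of bytes read
 *   == 0 transport closed without a close notify
 *   < 0  error code */
typedef int (*tls_read_fn)(void *ctx, unsigned char *buf, size_t len);

/* Hypothetical wrapper. Returns 0 only when data was actually read and
 * -1 otherwise, so a silent close can no longer be reported as success. */
int monitor_read_once(tls_read_fn rd, void *ctx, unsigned char *buf,
                      size_t len, size_t *out_len) {
  assert(rd != NULL && buf != NULL && out_len != NULL);
  *out_len = 0;

  int n = rd(ctx, buf, len);
  if (n > 0) {
    *out_len = (size_t)n;
    return 0;
  }
  /* n == 0 (closed without notify) and n < 0 (error) are both failures. */
  return -1;
}

/* Test double: behaves like a socket that was closed without notify. */
int fake_read_closed(void *ctx, unsigned char *buf, size_t len) {
  (void)ctx; (void)buf; (void)len;
  return 0;
}

/* Test double: delivers a small fixed payload. */
int fake_read_data(void *ctx, unsigned char *buf, size_t len) {
  (void)ctx;
  const char *payload = "data";
  size_t n = strlen(payload);
  if (n > len) n = len;
  memcpy(buf, payload, n);
  return (int)n;
}
```

With this shape the caller only has to check for a non-zero return to know the connection needs attention, instead of also inspecting how many bytes came back.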
Sorry, after testing myself, have some more work to do. |
Added a new message type for monitor (EMPTY), which is returned when we receive a timeout (nothing to read). The main loop should now continue, freeing the empty monitor message, allocating a new monitor message, and trying to read again. We could remove the constant reallocation, but we keep it in place to prevent potential bugs by starting from clean memory every time. This new monitor message type is 4, which should be seen frequently in the logs of an inactive daemon. |
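The loop behaviour described above could be sketched roughly as follows. Only the value 4 for the empty type comes from the comment; every other name and enum value here (`MSG_NOTIFICATION`, `MSG_ERROR`, `run_monitor_loop`, `scripted_read`) is hypothetical, and the reader is a scripted stand-in rather than a real monitor connection.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical message types; only MSG_EMPTY = 4 is taken from the thread. */
enum monitor_msg_type {
  MSG_NOTIFICATION = 1, /* hypothetical value */
  MSG_ERROR = 2,        /* hypothetical value */
  MSG_EMPTY = 4         /* read timed out, nothing to read */
};

struct monitor_msg {
  enum monitor_msg_type type;
};

/* Hypothetical reader: reports what kind of message the next read produced. */
typedef enum monitor_msg_type (*monitor_read_fn)(void *ctx);

struct loop_stats {
  int notifications;
  int empties;
  int errors;
};

/* Allocate a fresh message per iteration, read into it, free it, and keep
 * going on MSG_EMPTY. Stops on the first error so the caller can reconnect. */
struct loop_stats run_monitor_loop(monitor_read_fn rd, void *ctx, int max_iters) {
  struct loop_stats stats = {0, 0, 0};
  assert(rd != NULL);

  for (int i = 0; i < max_iters; i++) {
    struct monitor_msg *msg = malloc(sizeof *msg);
    if (msg == NULL) {
      stats.errors++;
      break;
    }
    msg->type = rd(ctx);
    enum monitor_msg_type type = msg->type;
    free(msg); /* clean memory every iteration, as in the comment above */

    if (type == MSG_EMPTY) {
      stats.empties++; /* timeout: nothing to handle, just read again */
      continue;
    }
    if (type == MSG_NOTIFICATION) {
      stats.notifications++; /* a real daemon would handle the request here */
      continue;
    }
    stats.errors++;
    break;
  }
  return stats;
}

/* Test double: replays a fixed script of message types, then MSG_EMPTY. */
struct script_ctx {
  const enum monitor_msg_type *script;
  int len;
  int pos;
};

enum monitor_msg_type scripted_read(void *ctx) {
  struct script_ctx *sc = ctx;
  if (sc->pos >= sc->len) return MSG_EMPTY;
  return sc->script[sc->pos++];
}
```

The important property is that a timeout is neither a success carrying data nor a fatal error: the loop simply goes round again, which is why type 4 shows up so often in the logs of an idle daemon.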
I have uptaken these changes in the same NoPorts branch: c-daemon-haning |
If this fixes it, it's a new version push ASAP. Could you build an RC? I am happy to test. |
#1579 and atsign-foundation/at_c#452 are the associated changes |
Fix is holding after 8 hours.. TY! |
Reopening as both of my test instances have become unreachable after running for a few days. Most recent console logs look like:
I'm going to reconfigure logging so that I can capture the crossover from working to failing. |
What device / arch is this on? |
That's from my OpenWrt One (aarch64), but similar story with my Flint 2. |
From what I see, after a few days (sometimes) I no longer see these messages (just type 4) and the daemon never responds
So this is different from the original issue and looks like something to do with the atclient monitor dying perhaps |
I'm separating the socket code into a platform agnostic layer for Arduino support, this will allow us to do some more rigorous testing against the mbedtls socket code. |
Describe the bug
After being left running for some hours I'm unable to connect to a C sshnpd that was previously working.
This isn't an isolated occurrence, and has been seen on multiple installations. 👀 @cconstab and @gkc
From a serial console I can see that the process is running:
And there's a connection to the device atServer:
NB the CLOSE_WAIT shows that at some stage the daemon has reconnected to the atServer and got a different IP from the load balancer.
Steps to reproduce
sshnp sessions
sshnp times out at Waiting for daemon feature check response
Expected behavior
sshnp connects
Additional context
An ssh session established with the previously working sshnpd survived whatever is happening here. Closing it makes no difference.