C daemon becomes unresponsive #1578

cpswan · 2024-11-27T10:05:39Z

Describe the bug

After being left running for some hours I'm unable to connect to a C sshnpd that was previously working.

This isn't an isolated occurrence, and has been seen on multiple installations. 👀 @cconstab and @gkc

From a serial console I can see that the process is running:

root@One:~# ps -www | grep sshnpd
 4203 root      2296 R    /usr/bin/sshnpd -a @bareindoornetball -m @cpswan -d openwrtone
 4754 root      2556 S    /usr/bin/sshnpd -a @bareindoornetball -m @cpswan -d openwrtone

And there's a connection to the device atServer:

root@One:~# netstat -an | grep 15014
tcp        0      0 10.10.10.34:58654       104.198.230.74:15014    ESTABLISHED
tcp        0      0 10.10.10.34:50700       35.226.184.123:15014    CLOSE_WAIT

NB the CLOSE_WAIT shows that at some stage the daemon has reconnected to the atServer and got a different IP from the load balancer.

Steps to reproduce

First I install and configure the latest release of C sshnpd (in the case using our OpenWrt package for c0.2.4
Then I test that everything working by doing some sshnp sessions
And then leave things for a few hours
Then try again to log in. sshnp times out at Waiting for daemon feature check response

Expected behavior

sshnp connects

Additional context

An ssh session established with the previously working sshnpd survived whatever is happening here. Closing it makes no difference.

The text was updated successfully, but these errors were encountered:

gkc · 2024-11-27T10:22:43Z

Thanks @cpswan I will try to reproduce this today

cpswan · 2024-11-27T11:47:08Z

The daemon I restarted when I opened this ticket has already become unresponsive (<2h)

XavierChanth · 2024-11-27T12:43:16Z

This looks like a dead monitor connection. Do you have logs so we can check for when the last monitor heart beat arrives?

cpswan · 2024-11-27T13:00:42Z

I've been collecting verbose logs for the past hour.

I'll give it another hour and then try to connect and see what happens (and scoop up the logs).

Right now it's tailing by in my console window and I see this every 5s:

78337 | atclient_monitor |      RECV: "notification: {"id":"-1","from":"@bareindoornetball","to":"@bareindoornetball","key":"statsNotification.@bareindoornetball","value":"78937","operation":"update","epochMillis":1732712404428,"messageType":"MessageType.key","isEncrypted":false,"metadata":null}"
Wed Nov 27 13:00:04 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 13:00:04.478466 | sshnpd | Received message of type: 1
Wed Nov 27 13:00:04 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 13:00:04.478490 | sshnpd | Notification value received: 78937
Wed Nov 27 13:00:04 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 13:00:04.478510 | sshnpd | Waiting for next monitor thread message

XavierChanth · 2024-11-27T13:07:50Z

When you try to connect does it pick up the notification, and if so can you send those logs?

cpswan · 2024-11-27T13:54:26Z

I just tried a connect (2h after leaving it) and everything was normal.

I'll give it a while longer...

XavierChanth · 2024-11-27T14:04:18Z

Started my own to see if I can produce the same results

cconstab · 2024-11-27T16:29:50Z

I actually saw this on one of my machines too. I had to restatt the daemon.

If I see it lock up again I will grab the logs.

cconstab · 2024-11-27T22:39:49Z

Yay I saw the same before and now I have the issue again.. (All times in UTC)
The daemon is running at 100% CPU

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                           17754 cconstab  20   0   78904   3072   2496 R  99.9   0.6     38,38 /home/cconstab/.local/bin/sshnpd -a @ssh_1 -m +

Not responding to pings.. And the Noop loop seems broken as I see no logs in the tmux.. Video to come.. this is the last of the log in TMUX

EBG] 2024-11-27 21:12:05.929943 | connection |        RECV: "data:218516"
[DEBG] 2024-11-27 21:12:05.929956 | refresh_device_entry | Released the atclient lock
[INFO] 2024-11-27 22:12:07.149668 | refresh_device_entry | Refreshing device info entries for this device
[DEBG] 2024-11-27 22:12:07.149764 | connection |        SENT: "noop:0"
[ERROR] 2024-11-27 22:12:07.227495 | connection | mbedtls_ssl_read failed with exit code: -80
[ERROR] 2024-11-27 22:12:07.227500 | connection | Failed to send "noop:0
" to connection: -80
[INFO] 2024-11-27 22:12:07.227505 | reconnect | Worker client is not connected, attempting to reconnect:
[INFO] 2024-11-27 22:12:07.227509 | atclient | Missing atServer host or port. Using production atDirectory to look up atServer host and port
[DEBG] 2024-11-27 22:12:07.678725 | connection |        SENT: "ssh_1"
[DEBG] 2024-11-27 22:12:07.746729 | connection |        RECV: "92a7c219-e2eb-5f50-ac23-d7acba106cf5.swarm0002.atsign.zone:5703"
[DEBG] 2024-11-27 22:12:08.239173 | connection |        SENT: "from:ssh_1"
[DEBG] 2024-11-27 22:12:08.306705 | connection |        RECV: "data:_a2ada53a-9f2c-41e7-9acf-a12d36ad48e4@ssh_1:da63d2b8-6ab3-4098-96d0-94d26c4c3a53"
[DEBG] 2024-11-27 22:12:08.324493 | connection |        SENT: "pkam:KUDMZuqAa/naesCStDxx75XQ0ne5AZDLNIuvZZ4aStyHJJrqIvG6hUqR/rk7LEmotD6Egbc237haWotVbYGGkfsGeh4C+ZbE4A93ZFxu5+tM1bDaUPNbiFoDBqORQJJB6gyxzzIpS8pdHU8RTdhchJN8j0dw8A+Jt/TGyw34VMcUy2PlxEaqonC6Ej0VtXHibXzBQLx/LBXjNVO2GoqzaBCFHzQULRFXVS1QxIz69Ax36RzSC0IjaPjIWRJ65+f4YAopSapWFzwMblOIda8s17Gh1eSuNOVfKFMEiCqLCyNvBXQB9zSJUMCXO/W0yyjroPob2A5NFD4oMHptx4n0vA=="
[DEBG] 2024-11-27 22:12:08.403668 | connection |        RECV: "data:success"
[INFO] 2024-11-27 22:12:08.403674 | reconnect | Reconnected to the atServer!
[DEBG] 2024-11-27 22:12:08.403690 | connection |        SENT: "llookup:shared_key.cconstab@ssh_1"
[DEBG] 2024-11-27 22:12:08.479672 | connection |        RECV: "data:k0Er2lWbM2bWWjlGDTII4b8xI8pg+++3DJ8Vouw7rPcKu8oXIsg+3bWCcMWOJ1VpqmshXN76Ry4OAENJwF8iXIqxC7UznDdCq0/cdCDD9UIQol7Dxwj5I+1fIzLbJNDWkey1LBkqcVxCzbOZvn323e4+Hua+UnFQPtU6TCkwKstlA62vT7WoInlDPRnwoOojABD7037N8DGFPeZjQAGnlQYK7+d9SZgQqIhgJhWAHX3TUUHpQRVPBmTBIWVZYbOsi+k7WRWVDIFP4rmu2DP5cJklBbSPCM2Pa8KkYszEfa5/HQGYcI8n6b3B3lyCbL/UT+iDl9ay6yY6epkHVL1fOw=="
[DEBG] 2024-11-27 22:12:08.493921 | connection |        SENT: "noop:0"
[DEBG] 2024-11-27 22:12:08.562817 | connection |        RECV: "data:ok"
[DEBG] 2024-11-27 22:12:08.562879 | connection |        SENT: "update:ttl:2592000000:ttr:-1:ccd:true:isEncrypted:true:ivNonce:t3+Pyv31/kVb2bfytXxg5w==:@cconstab:device_info.csshnp01.sshnp@ssh_1 JVMw3MIUZWBHZMHIHk6m7oeyfk0JR9cr4GTuNldOlXFuPTdwFsXVfipA4HzoNsL7yL+RRJuwAVnnwk338wPSmhz5AmccVDq5xq99WtEB3TWYPHj+S2tVFwulktLM/+HBHsc6xa1Lm32VVboeaxYs63nT2hg4bXbllVplpWzFwdu6xJg1AcmnQevszWsYnIz+SgPscRfTHTBAZtpNDMQvvr3tEBvF4LMHHuysMYkGxmOIcfp1q+tAVFQ3z3Rto2NeQVgsRf4NN703KHeYQ4rlUBrKOaKlmxD51ARGqzhk0LWcQlx8vrUFglm01Ugs/xOD"
[DEBG] 2024-11-27 22:12:08.659503 | connection |        RECV: "data:218549"
[DEBG] 2024-11-27 22:12:08.659509 | refresh_device_entry | Released the atclient lock

Video

video1909482704.mp4

csshnpd log file

csshnpd.log

cconstab · 2024-11-27T23:21:16Z

Running a strace on the sshnpd it's in a hard loop..

pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999298})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999309})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999319})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999299})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999269})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999269})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999269})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999308})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999308})
read(3, "", 5)                          = 0

So I sent it a HUP but it died (I captured this strace)

pid 17754] 23:27:57 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999149})
[pid 17754] 23:27:57 read(3, "", 5)     = 0
[pid 17754] 23:27:57 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999169})
[pid 17754] 23:27:57 read(3, "", 5)     = 0
[pid 17754] 23:27:57 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999098})
[pid 17754] 23:27:57 read(3, "", 5)     = 0
[pid 17754] 23:27:57 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 17756] 23:27:57 <... clock_nanosleep resumed>{tv_sec=0, tv_nsec=868448275}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 17754] 23:27:57 <... pselect6 resumed>) = 1 (in [3], left {tv_sec=2, tv_nsec=999998988})
[pid 17756] 23:27:57 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=19602, si_uid=0} ---
[pid 17756] 23:27:57 +++ killed by SIGHUP +++
23:27:57 +++ killed by SIGHUP +++

cconstab · 2024-11-27T23:52:22Z

Loks like file descriptor 3 is in normal use the monitor socket (assuming a lot).. Data is encrypted so cannot tell for sure.. But that would indicated tring to read from a socket and getting nothing then some sort of tight loop ensues..

GUESS not FACT

pid 19619] 23:47:13 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 23:47:14 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 23:47:14 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 23:47:15 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 23:47:16 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 23:47:16 <... pselect6 resumed>) = 1 (in [3], left {tv_sec=0, tv_nsec=17040877})
[pid 19619] 23:47:16 read(3, "\27\3\3\0\360", 5) = 5
[pid 19619] 23:47:16 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998717})
[pid 19619] 23:47:16 read(3, "G\332\10\335R\68\366qO\3X\337^\327Q\243Y\215\31\261r\340\377-by\"w\n\366\1"..., 240) = 240
[pid 19619] 23:47:16 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-27 23:"..., 316) = 316
[pid 19619] 23:47:16 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-27 23:"..., 84) = 84
[pid 19619] 23:47:16 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-27 23:"..., 92) = 92
[pid 19619] 23:47:16 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-27 23:"..., 96) = 96

cconstab · 2024-11-27T23:54:57Z

I have an strace running in aginst the daemon to a log file so it might capture some clues

cconstab · 2024-11-28T06:45:28Z

yay died in same way, And I have the logs:

the daemon logs

[DEBG] 2024-11-28 05:51:55.249705 | sshnpd | Received message of type: 1                                                                     
[DEBG] 2024-11-28 05:51:55.249904 | sshnpd | Notification value received: 218795                                                             
[DEBG] 2024-11-28 05:51:55.249952 | sshnpd | Waiting for next monitor thread message                                                         
[DEBG] 2024-11-28 05:51:56.033871 | atclient_monitor |  RECV: "notification: {"id":"-1","from":"@ssh_1","to":"@ssh_1","key":"statsNotificatio
n.@ssh_1","value":"218796","operation":"update","epochMillis":1732773115997,"messageType":"MessageType.key","isEncrypted":false,"metadata":nu
ll}"                               
[DEBG] 2024-11-28 05:51:56.034048 | sshnpd | Received message of type: 1                                                                     
[DEBG] 2024-11-28 05:51:56.034274 | sshnpd | Notification value received: 218796                                                             
[DEBG] 2024-11-28 05:51:56.034332 | sshnpd | Waiting for next monitor thread message

the same time at the strace log..

The start of the 100% cpu is at [pid 19619] 06:05:51 read(3, "", 5) = 0

[pid 19619] 05:51:46 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998607})
[pid 19619] 05:51:46 read(3, "p\16J{\205\22\312\254\237\37\23\270<\214\0\211\275\314%\354\260i\253:\302\247\3$\223\6\216\2"..., 240) = 240
[pid 19619] 05:51:46 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 316) = 316
[pid 19619] 05:51:46 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 84) = 84
[pid 19619] 05:51:46 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 92) = 92
[pid 19619] 05:51:46 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 96) = 96
[pid 19619] 05:51:46 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:46 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:46 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:47 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:48 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:49 <... pselect6 resumed>) = 0 (Timeout)
[pid 19619] 05:51:49 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:49 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:49 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:50 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:51 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:52 <... pselect6 resumed>) = 0 (Timeout)
[pid 19619] 05:51:52 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:52 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:52 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:53 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:54 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:55 <... pselect6 resumed>) = 1 (in [3], left {tv_sec=0, tv_nsec=527201003})
[pid 19619] 05:51:55 read(3, "\27\3\3\0\360", 5) = 5
[pid 19619] 05:51:55 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998327})
[pid 19619] 05:51:55 read(3, "\2111\312\345\242\234\241\334\244\322\302\303/&\221\35B l\25\275\204\214\34\360?\346F\10.\263\2"..., 240) = 240
[pid 19619] 05:51:55 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 316) = 316
[pid 19619] 05:51:55 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 84) = 84
[pid 19619] 05:51:55 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 92) = 92
[pid 19619] 05:51:55 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 96) = 96
[pid 19619] 05:51:55 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:55 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:55 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:56 <... pselect6 resumed>) = 1 (in [3], left {tv_sec=2, tv_nsec=217408026})
[pid 19619] 05:51:56 read(3, "\27\3\3\0$", 5) = 5
[pid 19619] 05:51:56 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998627})
[pid 19619] 05:51:56 read(3, "\201\20\203\20\237[Y\201\200\221\213g\252$\270\303<%\v\336&.\303g\276_\277\305\37\345\\,"..., 36) = 36
[pid 19619] 05:51:56 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998828})
[pid 19619] 05:51:56 read(3, "\27\3\3\0\335", 5) = 5
[pid 19619] 05:51:56 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998887})
[pid 19619] 05:51:56 read(3, "\342\267\250\233\216\2433\316\316\272\210a\6\240\16\37\304\262\22\31\307\275c\332\300@Y\331C\372\300 "..., 221)
 = 221
[pid 19619] 05:51:56 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 316) = 316
[pid 19619] 05:51:56 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 84) = 84
[pid 19619] 05:51:56 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 92) = 92
[pid 19619] 05:51:56 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 96) = 96
[pid 19619] 05:51:56 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:56 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:56 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:57 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:58 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:59 <... pselect6 resumed>) = 0 (Timeout)
[pid 19619] 05:51:59 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:59 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:59 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999219})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998888})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999209})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999158})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999199})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999149})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999119})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999119})

cpswan · 2024-11-28T10:31:39Z

I also have logs...

From the last successful monitor:


Wed Nov 27 17:01:19 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:19.474620 | sshnpd | Waiting for next monitor thread message
Wed Nov 27 17:01:34 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:34.474036 | atclient_monitor |      RECV: "notification: {"id":"-1","from":"@bareindoornetball","to":"@bareindoornetball","key":"statsNotification.@bareindoornetball","value":"78978","operation":"update","epochMillis":1732726894425,"messageType":"MessageType.key","isEncrypted":false,"metadata":null}"
Wed Nov 27 17:01:34 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:34.474167 | sshnpd | Received message of type: 1
Wed Nov 27 17:01:34 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:34.474189 | sshnpd | Notification value received: 78978
Wed Nov 27 17:01:34 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:34.474209 | sshnpd | Waiting for next monitor thread message
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 17:48:12.124339 | refresh_device_entry | Refreshing device info entries for this device
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:48:12.124564 | connection |    SENT: "noop:0"
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124640 | connection | mbedtls_ssl_read failed with exit code: 0
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:48:12.124676 | connection |    SENT: "llookup:shared_key.cpswan@bareindoornetball"
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124702 | connection | mbedtls_ssl_read failed with exit code: 0
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124720 | base64 | atchops_base64_decode: src_len is less than or equal to 0
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124737 | encryption_key_helpers | atchops_base64_decode: 1
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124754 | atclient_put_shared_key | atclient_get_shared_encryption_key_shared_by_me: 1
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124773 | refresh_device_entry | Failed to refresh device entry for @cpswan
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:48:12.124790 | refresh_device_entry | Released the atclient lock
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 18:48:13.402860 | refresh_device_entry | Refreshing device info entries for this device
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 18:48:13.402967 | connection | mbedtls_ssl_write failed with exit code: -80
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 18:48:13.402987 | connection | Failed to send "noop:0
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: " to connection: -80
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 18:48:13.403004 | reconnect | Worker client is not connected, attempting to reconnect:
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 18:48:13.403020 | atclient | Missing atServer host or port. Using production atDirectory to look up atServer host and port
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:13.984917 | connection |    SENT: "bareindoornetball"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.086475 | connection |    RECV: "e6697d6a-2567-50a7-9e61-396157b733d8.swarm0001.atsign.zone:15014"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.662178 | connection |    SENT: "from:bareindoornetball"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.763326 | connection |    RECV: "data:_0400ed33-9583-43bc-8fbd-bb93e594eab6@bareindoornetball:8bf6e0ab-bfe0-4610-905d-07e5542aa021"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.805014 | connection |    SENT: "pkam:bjoNoPQsSexUP0WKcX3lm3xMs04AxVoD/ZeCDDkF3bO6jjto6egJA1+XiHPUyi4s7BhMYzn50D4Fnj6LRv6X/BF28f2a/SQL9b+psuEhU7UDoxjvs31JcOKA4JH6ZwwyNzC0OQQPolT1Ld/GZ2SyFXcd71LQFoC2WdZyBS8VZaXKW2T1TltSMnkgj3rJqO27knAJQJEIGfVRb9mChxKoSeY0XaYnH1G59UvQUYsa6Mh6PgjUIha3ulNrJ+NybwRafLx8XhsUOvqMvSjjPTsFSr737RQ2n5VaS4VP3KQFgSDIOzq4b9GlTMUK7F8kao6iyH9OroFbEaZa0yuGh73eeA=="
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.906585 | connection |    RECV: "data:success"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 18:48:14.907140 | reconnect | Reconnected to the atServer!

After that the monitor loop seems to fail

In the full log I can see several (successful) attempts to reconnect to the atServer, but the monitor loop never seems to get re-established.

XavierChanth · 2024-11-28T12:24:19Z

I think I know what's happening, will need to take a closer look at the code to confirm my hypothesis

XavierChanth · 2024-11-28T16:25:02Z

Implemented a monitor read fix in at_c. NoPorts branch pinned against it: c-daemon-haning

XavierChanth · 2024-11-28T16:26:11Z

When mbedtls_ssl_read returns 0 (socket closed without notify) we only partially handled it as a failure, but the return value of the atclient function was still 0, it now returns -1 in that scenario

XavierChanth · 2024-11-28T16:28:37Z

Sorry, after testing myself, have some more work to do.

XavierChanth · 2024-11-28T16:41:10Z

Added a new message type for monitor (EMPTY), which is returned when we receive a timeout (nothing to read). The main loop should now continue, freeing the empty monitor message, allocating a new monitor message, and trying to read again.

We could remove the constant reallocation, but we have it in place to prevent potential bugs, clean memory every time.

This new monitor message type is 4, which should be seen frequently on an inactive daemon in the logs.

XavierChanth · 2024-11-28T16:43:09Z

I have uptaken these changes in the same NoPorts branch: c-daemon-haning

cconstab · 2024-11-28T17:44:46Z

If this fixes its a new version push asap..

Could you build an RC and I am happy to test.

XavierChanth · 2024-11-28T18:09:28Z

#1579 and atsign-foundation/at_c#452 are the associated changes

cconstab · 2024-11-29T04:36:50Z

Fix is holding aftwr 8 hours.. TY!

cpswan · 2024-12-02T16:16:33Z

Reopening as both of my test instances have become unreachable after running for a few days.

Most recent console logs look like:

Mon Dec  2 16:06:23 2024 daemon.info sshnpd[22555]: [DEBG] 2024-12-02 16:06:23.140631 | sshnpd | Waiting for next monitor thread message
Mon Dec  2 16:06:26 2024 daemon.info sshnpd[22555]: [DEBG] 2024-12-02 16:06:26.143630 | sshnpd | Received message of type: 4

I'm going to reconfigure logging so that I can capture the crossover from working to failing.

XavierChanth · 2024-12-03T17:34:00Z

Reopening as both of my test instances have become unreachable after running for a few days.

Most recent console logs look like:
Mon Dec  2 16:06:23 2024 daemon.info sshnpd[22555]: [DEBG] 2024-12-02 16:06:23.140631 | sshnpd | Waiting for next monitor thread message
Mon Dec  2 16:06:26 2024 daemon.info sshnpd[22555]: [DEBG] 2024-12-02 16:06:26.143630 | sshnpd | Received message of type: 4
I'm going to reconfigure logging so that I can capture the crossover from working to failing.

What device / arch is this on?

cpswan · 2024-12-03T18:03:52Z

What device / arch is this on?

That's from my OpenWrt One (aarch64), but similar story with my Flint 2.

cconstab · 2024-12-09T04:10:57Z

From what i see after a few sometimes days I no longer see these messages (just type 4) and the daemon never responds

[DEBG] 2024-12-09 04:08:39.835539 | atclient_monitor |  RECV: "notification: {"id":"-1","from":"@ssh_1","to":"@ssh_1","key":"statsNotification.@ssh_1","value":"227455","operation":"update","epochMillis":1733717319794,"messageType":"MessageType.key","isEncrypted":false,"metadata":null}"
[DEBG] 2024-12-09 04:08:39.835572 | sshnpd | Received message of type: 1
[DEBG] 2024-12-09 04:08:39.835576 | sshnpd | Notification value received: 227455

So this is different from the original issue and looks like something to do with the atclient monitor dying perhaps

XavierChanth · 2024-12-09T05:15:12Z

I'm separating the socket code into a platform agnostic layer for Arduino support, this will allow us to do some more rigorous testing against the mbedtls socket code.

cpswan added the bug Something isn't working label Nov 27, 2024

cpswan assigned XavierChanth Nov 27, 2024

XavierChanth mentioned this issue Nov 28, 2024

fix: C daemon hanging #1579

Merged

XavierChanth closed this as completed in #1579 Nov 28, 2024

cpswan reopened this Dec 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

C daemon becomes unresponsive #1578

C daemon becomes unresponsive #1578

cpswan commented Nov 27, 2024 •

edited

Loading

gkc commented Nov 27, 2024

cpswan commented Nov 27, 2024

XavierChanth commented Nov 27, 2024

cpswan commented Nov 27, 2024

XavierChanth commented Nov 27, 2024

cpswan commented Nov 27, 2024

XavierChanth commented Nov 27, 2024

cconstab commented Nov 27, 2024

cconstab commented Nov 27, 2024 •

edited

Loading

cconstab commented Nov 27, 2024 •

edited

Loading

cconstab commented Nov 27, 2024

cconstab commented Nov 27, 2024

cconstab commented Nov 28, 2024

cpswan commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

XavierChanth commented Nov 28, 2024 •

edited

Loading

XavierChanth commented Nov 28, 2024

cconstab commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

cconstab commented Nov 29, 2024

cpswan commented Dec 2, 2024

XavierChanth commented Dec 3, 2024

cpswan commented Dec 3, 2024

cconstab commented Dec 9, 2024

XavierChanth commented Dec 9, 2024

C daemon becomes unresponsive #1578

C daemon becomes unresponsive #1578

Comments

cpswan commented Nov 27, 2024 • edited Loading

Describe the bug

Steps to reproduce

Expected behavior

Additional context

gkc commented Nov 27, 2024

cpswan commented Nov 27, 2024

XavierChanth commented Nov 27, 2024

cpswan commented Nov 27, 2024

XavierChanth commented Nov 27, 2024

cpswan commented Nov 27, 2024

XavierChanth commented Nov 27, 2024

cconstab commented Nov 27, 2024

cconstab commented Nov 27, 2024 • edited Loading

cconstab commented Nov 27, 2024 • edited Loading

cconstab commented Nov 27, 2024

cconstab commented Nov 27, 2024

cconstab commented Nov 28, 2024

cpswan commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

XavierChanth commented Nov 28, 2024 • edited Loading

XavierChanth commented Nov 28, 2024

cconstab commented Nov 28, 2024

XavierChanth commented Nov 28, 2024

cconstab commented Nov 29, 2024

cpswan commented Dec 2, 2024

XavierChanth commented Dec 3, 2024

cpswan commented Dec 3, 2024

cconstab commented Dec 9, 2024

XavierChanth commented Dec 9, 2024

cpswan commented Nov 27, 2024 •

edited

Loading

cconstab commented Nov 27, 2024 •

edited

Loading

cconstab commented Nov 27, 2024 •

edited

Loading

XavierChanth commented Nov 28, 2024 •

edited

Loading