Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C daemon becomes unresponsive #1578

Open
cpswan opened this issue Nov 27, 2024 · 28 comments · Fixed by #1579
Open

C daemon becomes unresponsive #1578

cpswan opened this issue Nov 27, 2024 · 28 comments · Fixed by #1579
Assignees
Labels
bug Something isn't working

Comments

@cpswan
Copy link
Member

cpswan commented Nov 27, 2024

Describe the bug

After being left running for some hours I'm unable to connect to a C sshnpd that was previously working.

This isn't an isolated occurrence, and has been seen on multiple installations. 👀 @cconstab and @gkc

From a serial console I can see that the process is running:

root@One:~# ps -www | grep sshnpd
 4203 root      2296 R    /usr/bin/sshnpd -a @bareindoornetball -m @cpswan -d openwrtone
 4754 root      2556 S    /usr/bin/sshnpd -a @bareindoornetball -m @cpswan -d openwrtone

And there's a connection to the device atServer:

root@One:~# netstat -an | grep 15014
tcp        0      0 10.10.10.34:58654       104.198.230.74:15014    ESTABLISHED
tcp        0      0 10.10.10.34:50700       35.226.184.123:15014    CLOSE_WAIT

NB the CLOSE_WAIT shows that at some stage the daemon has reconnected to the atServer and got a different IP from the load balancer.

Steps to reproduce

  1. First I install and configure the latest release of C sshnpd (in the case using our OpenWrt package for c0.2.4
  2. Then I test that everything working by doing some sshnp sessions
  3. And then leave things for a few hours
  4. Then try again to log in. sshnp times out at Waiting for daemon feature check response

Expected behavior

sshnp connects

Additional context

An ssh session established with the previously working sshnpd survived whatever is happening here. Closing it makes no difference.

@cpswan cpswan added the bug Something isn't working label Nov 27, 2024
@gkc
Copy link
Contributor

gkc commented Nov 27, 2024

Thanks @cpswan I will try to reproduce this today

@cpswan
Copy link
Member Author

cpswan commented Nov 27, 2024

The daemon I restarted when I opened this ticket has already become unresponsive (<2h)

@XavierChanth
Copy link
Member

This looks like a dead monitor connection. Do you have logs so we can check for when the last monitor heart beat arrives?

@cpswan
Copy link
Member Author

cpswan commented Nov 27, 2024

I've been collecting verbose logs for the past hour.

I'll give it another hour and then try to connect and see what happens (and scoop up the logs).

Right now it's tailing by in my console window and I see this every 5s:

78337 | atclient_monitor |      RECV: "notification: {"id":"-1","from":"@bareindoornetball","to":"@bareindoornetball","key":"statsNotification.@bareindoornetball","value":"78937","operation":"update","epochMillis":1732712404428,"messageType":"MessageType.key","isEncrypted":false,"metadata":null}"
Wed Nov 27 13:00:04 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 13:00:04.478466 | sshnpd | Received message of type: 1
Wed Nov 27 13:00:04 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 13:00:04.478490 | sshnpd | Notification value received: 78937
Wed Nov 27 13:00:04 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 13:00:04.478510 | sshnpd | Waiting for next monitor thread message

@XavierChanth
Copy link
Member

When you try to connect does it pick up the notification, and if so can you send those logs?

@cpswan
Copy link
Member Author

cpswan commented Nov 27, 2024

I just tried a connect (2h after leaving it) and everything was normal.

I'll give it a while longer...

@XavierChanth
Copy link
Member

Started my own to see if I can produce the same results

@cconstab
Copy link
Member

I actually saw this on one of my machines too. I had to restatt the daemon.

If I see it lock up again I will grab the logs.

@cconstab
Copy link
Member

cconstab commented Nov 27, 2024

Yay I saw the same before and now I have the issue again.. (All times in UTC)
The daemon is running at 100% CPU

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                           17754 cconstab  20   0   78904   3072   2496 R  99.9   0.6     38,38 /home/cconstab/.local/bin/sshnpd -a @ssh_1 -m + 

Not responding to pings.. And the Noop loop seems broken as I see no logs in the tmux.. Video to come.. this is the last of the log in TMUX

EBG] 2024-11-27 21:12:05.929943 | connection |        RECV: "data:218516"
[DEBG] 2024-11-27 21:12:05.929956 | refresh_device_entry | Released the atclient lock
[INFO] 2024-11-27 22:12:07.149668 | refresh_device_entry | Refreshing device info entries for this device
[DEBG] 2024-11-27 22:12:07.149764 | connection |        SENT: "noop:0"
[ERROR] 2024-11-27 22:12:07.227495 | connection | mbedtls_ssl_read failed with exit code: -80
[ERROR] 2024-11-27 22:12:07.227500 | connection | Failed to send "noop:0
" to connection: -80
[INFO] 2024-11-27 22:12:07.227505 | reconnect | Worker client is not connected, attempting to reconnect:
[INFO] 2024-11-27 22:12:07.227509 | atclient | Missing atServer host or port. Using production atDirectory to look up atServer host and port
[DEBG] 2024-11-27 22:12:07.678725 | connection |        SENT: "ssh_1"
[DEBG] 2024-11-27 22:12:07.746729 | connection |        RECV: "92a7c219-e2eb-5f50-ac23-d7acba106cf5.swarm0002.atsign.zone:5703"
[DEBG] 2024-11-27 22:12:08.239173 | connection |        SENT: "from:ssh_1"
[DEBG] 2024-11-27 22:12:08.306705 | connection |        RECV: "data:_a2ada53a-9f2c-41e7-9acf-a12d36ad48e4@ssh_1:da63d2b8-6ab3-4098-96d0-94d26c4c3a53"
[DEBG] 2024-11-27 22:12:08.324493 | connection |        SENT: "pkam:KUDMZuqAa/naesCStDxx75XQ0ne5AZDLNIuvZZ4aStyHJJrqIvG6hUqR/rk7LEmotD6Egbc237haWotVbYGGkfsGeh4C+ZbE4A93ZFxu5+tM1bDaUPNbiFoDBqORQJJB6gyxzzIpS8pdHU8RTdhchJN8j0dw8A+Jt/TGyw34VMcUy2PlxEaqonC6Ej0VtXHibXzBQLx/LBXjNVO2GoqzaBCFHzQULRFXVS1QxIz69Ax36RzSC0IjaPjIWRJ65+f4YAopSapWFzwMblOIda8s17Gh1eSuNOVfKFMEiCqLCyNvBXQB9zSJUMCXO/W0yyjroPob2A5NFD4oMHptx4n0vA=="
[DEBG] 2024-11-27 22:12:08.403668 | connection |        RECV: "data:success"
[INFO] 2024-11-27 22:12:08.403674 | reconnect | Reconnected to the atServer!
[DEBG] 2024-11-27 22:12:08.403690 | connection |        SENT: "llookup:shared_key.cconstab@ssh_1"
[DEBG] 2024-11-27 22:12:08.479672 | connection |        RECV: "data:k0Er2lWbM2bWWjlGDTII4b8xI8pg+++3DJ8Vouw7rPcKu8oXIsg+3bWCcMWOJ1VpqmshXN76Ry4OAENJwF8iXIqxC7UznDdCq0/cdCDD9UIQol7Dxwj5I+1fIzLbJNDWkey1LBkqcVxCzbOZvn323e4+Hua+UnFQPtU6TCkwKstlA62vT7WoInlDPRnwoOojABD7037N8DGFPeZjQAGnlQYK7+d9SZgQqIhgJhWAHX3TUUHpQRVPBmTBIWVZYbOsi+k7WRWVDIFP4rmu2DP5cJklBbSPCM2Pa8KkYszEfa5/HQGYcI8n6b3B3lyCbL/UT+iDl9ay6yY6epkHVL1fOw=="
[DEBG] 2024-11-27 22:12:08.493921 | connection |        SENT: "noop:0"
[DEBG] 2024-11-27 22:12:08.562817 | connection |        RECV: "data:ok"
[DEBG] 2024-11-27 22:12:08.562879 | connection |        SENT: "update:ttl:2592000000:ttr:-1:ccd:true:isEncrypted:true:ivNonce:t3+Pyv31/kVb2bfytXxg5w==:@cconstab:device_info.csshnp01.sshnp@ssh_1 JVMw3MIUZWBHZMHIHk6m7oeyfk0JR9cr4GTuNldOlXFuPTdwFsXVfipA4HzoNsL7yL+RRJuwAVnnwk338wPSmhz5AmccVDq5xq99WtEB3TWYPHj+S2tVFwulktLM/+HBHsc6xa1Lm32VVboeaxYs63nT2hg4bXbllVplpWzFwdu6xJg1AcmnQevszWsYnIz+SgPscRfTHTBAZtpNDMQvvr3tEBvF4LMHHuysMYkGxmOIcfp1q+tAVFQ3z3Rto2NeQVgsRf4NN703KHeYQ4rlUBrKOaKlmxD51ARGqzhk0LWcQlx8vrUFglm01Ugs/xOD"
[DEBG] 2024-11-27 22:12:08.659503 | connection |        RECV: "data:218549"
[DEBG] 2024-11-27 22:12:08.659509 | refresh_device_entry | Released the atclient lock

Video

video1909482704.mp4

csshnpd log file

csshnpd.log

@cconstab
Copy link
Member

cconstab commented Nov 27, 2024

Running a strace on the sshnpd it's in a hard loop..

pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999298})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999309})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999319})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999299})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999269})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999269})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999269})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999308})
read(3, "", 5)                          = 0
pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999308})
read(3, "", 5)                          = 0

So I sent it a HUP but it died (I captured this strace)

pid 17754] 23:27:57 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999149})
[pid 17754] 23:27:57 read(3, "", 5)     = 0
[pid 17754] 23:27:57 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999169})
[pid 17754] 23:27:57 read(3, "", 5)     = 0
[pid 17754] 23:27:57 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999098})
[pid 17754] 23:27:57 read(3, "", 5)     = 0
[pid 17754] 23:27:57 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 17756] 23:27:57 <... clock_nanosleep resumed>{tv_sec=0, tv_nsec=868448275}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid 17754] 23:27:57 <... pselect6 resumed>) = 1 (in [3], left {tv_sec=2, tv_nsec=999998988})
[pid 17756] 23:27:57 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=19602, si_uid=0} ---
[pid 17756] 23:27:57 +++ killed by SIGHUP +++
23:27:57 +++ killed by SIGHUP +++

@cconstab
Copy link
Member

Loks like file descriptor 3 is in normal use the monitor socket (assuming a lot).. Data is encrypted so cannot tell for sure.. But that would indicated tring to read from a socket and getting nothing then some sort of tight loop ensues..

GUESS not FACT

pid 19619] 23:47:13 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 23:47:14 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 23:47:14 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 23:47:15 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 23:47:16 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 23:47:16 <... pselect6 resumed>) = 1 (in [3], left {tv_sec=0, tv_nsec=17040877})
[pid 19619] 23:47:16 read(3, "\27\3\3\0\360", 5) = 5
[pid 19619] 23:47:16 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998717})
[pid 19619] 23:47:16 read(3, "G\332\10\335R\68\366qO\3X\337^\327Q\243Y\215\31\261r\340\377-by\"w\n\366\1"..., 240) = 240
[pid 19619] 23:47:16 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-27 23:"..., 316) = 316
[pid 19619] 23:47:16 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-27 23:"..., 84) = 84
[pid 19619] 23:47:16 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-27 23:"..., 92) = 92
[pid 19619] 23:47:16 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-27 23:"..., 96) = 96

@cconstab
Copy link
Member

I have an strace running in aginst the daemon to a log file so it might capture some clues

@cconstab
Copy link
Member

yay died in same way, And I have the logs:

the daemon logs

[DEBG] 2024-11-28 05:51:55.249705 | sshnpd | Received message of type: 1                                                                     
[DEBG] 2024-11-28 05:51:55.249904 | sshnpd | Notification value received: 218795                                                             
[DEBG] 2024-11-28 05:51:55.249952 | sshnpd | Waiting for next monitor thread message                                                         
[DEBG] 2024-11-28 05:51:56.033871 | atclient_monitor |  RECV: "notification: {"id":"-1","from":"@ssh_1","to":"@ssh_1","key":"statsNotificatio
n.@ssh_1","value":"218796","operation":"update","epochMillis":1732773115997,"messageType":"MessageType.key","isEncrypted":false,"metadata":nu
ll}"                               
[DEBG] 2024-11-28 05:51:56.034048 | sshnpd | Received message of type: 1                                                                     
[DEBG] 2024-11-28 05:51:56.034274 | sshnpd | Notification value received: 218796                                                             
[DEBG] 2024-11-28 05:51:56.034332 | sshnpd | Waiting for next monitor thread message 

the same time at the strace log..

The start of the 100% cpu is at [pid 19619] 06:05:51 read(3, "", 5) = 0

[pid 19619] 05:51:46 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998607})
[pid 19619] 05:51:46 read(3, "p\16J{\205\22\312\254\237\37\23\270<\214\0\211\275\314%\354\260i\253:\302\247\3$\223\6\216\2"..., 240) = 240
[pid 19619] 05:51:46 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 316) = 316
[pid 19619] 05:51:46 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 84) = 84
[pid 19619] 05:51:46 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 92) = 92
[pid 19619] 05:51:46 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 96) = 96
[pid 19619] 05:51:46 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:46 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:46 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:47 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:48 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:49 <... pselect6 resumed>) = 0 (Timeout)
[pid 19619] 05:51:49 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:49 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:49 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:50 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:51 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:52 <... pselect6 resumed>) = 0 (Timeout)
[pid 19619] 05:51:52 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:52 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:52 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:53 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:54 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:55 <... pselect6 resumed>) = 1 (in [3], left {tv_sec=0, tv_nsec=527201003})
[pid 19619] 05:51:55 read(3, "\27\3\3\0\360", 5) = 5
[pid 19619] 05:51:55 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998327})
[pid 19619] 05:51:55 read(3, "\2111\312\345\242\234\241\334\244\322\302\303/&\221\35B l\25\275\204\214\34\360?\346F\10.\263\2"..., 240) = 240
[pid 19619] 05:51:55 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 316) = 316
[pid 19619] 05:51:55 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 84) = 84
[pid 19619] 05:51:55 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 92) = 92
[pid 19619] 05:51:55 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 96) = 96
[pid 19619] 05:51:55 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:55 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:55 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:56 <... pselect6 resumed>) = 1 (in [3], left {tv_sec=2, tv_nsec=217408026})
[pid 19619] 05:51:56 read(3, "\27\3\3\0$", 5) = 5
[pid 19619] 05:51:56 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998627})
[pid 19619] 05:51:56 read(3, "\201\20\203\20\237[Y\201\200\221\213g\252$\270\303<%\v\336&.\303g\276_\277\305\37\345\\,"..., 36) = 36
[pid 19619] 05:51:56 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998828})
[pid 19619] 05:51:56 read(3, "\27\3\3\0\335", 5) = 5
[pid 19619] 05:51:56 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998887})
[pid 19619] 05:51:56 read(3, "\342\267\250\233\216\2433\316\316\272\210a\6\240\16\37\304\262\22\31\307\275c\332\300@Y\331C\372\300 "..., 221)
 = 221
[pid 19619] 05:51:56 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 316) = 316
[pid 19619] 05:51:56 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 84) = 84
[pid 19619] 05:51:56 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 92) = 92
[pid 19619] 05:51:56 write(1, "\33[0;34m[DEBG]\33[0m 2024-11-28 05:"..., 96) = 96
[pid 19619] 05:51:56 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:56 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:56 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:57 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19620] 05:51:58 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid 19619] 05:51:59 <... pselect6 resumed>) = 0 (Timeout)
[pid 19619] 05:51:59 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL <unfinished ...>
[pid 19620] 05:51:59 <... clock_nanosleep resumed>0x77a7145ffd60) = 0
[pid 19620] 05:51:59 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x77a7145ffd60) = 0
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999219})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999998888})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999209})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999158})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999199})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999149})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999119})
[pid 19619] 06:05:51 read(3, "", 5)     = 0
[pid 19619] 06:05:51 pselect6(4, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 1 (in [3], left {tv_sec=2, tv_nsec=999999119})

@cpswan
Copy link
Member Author

cpswan commented Nov 28, 2024

I also have logs...

From the last successful monitor:


Wed Nov 27 17:01:19 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:19.474620 | sshnpd | Waiting for next monitor thread message
Wed Nov 27 17:01:34 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:34.474036 | atclient_monitor |      RECV: "notification: {"id":"-1","from":"@bareindoornetball","to":"@bareindoornetball","key":"statsNotification.@bareindoornetball","value":"78978","operation":"update","epochMillis":1732726894425,"messageType":"MessageType.key","isEncrypted":false,"metadata":null}"
Wed Nov 27 17:01:34 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:34.474167 | sshnpd | Received message of type: 1
Wed Nov 27 17:01:34 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:34.474189 | sshnpd | Notification value received: 78978
Wed Nov 27 17:01:34 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:01:34.474209 | sshnpd | Waiting for next monitor thread message
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 17:48:12.124339 | refresh_device_entry | Refreshing device info entries for this device
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:48:12.124564 | connection |    SENT: "noop:0"
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124640 | connection | mbedtls_ssl_read failed with exit code: 0
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:48:12.124676 | connection |    SENT: "llookup:shared_key.cpswan@bareindoornetball"
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124702 | connection | mbedtls_ssl_read failed with exit code: 0
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124720 | base64 | atchops_base64_decode: src_len is less than or equal to 0
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124737 | encryption_key_helpers | atchops_base64_decode: 1
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124754 | atclient_put_shared_key | atclient_get_shared_encryption_key_shared_by_me: 1
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 17:48:12.124773 | refresh_device_entry | Failed to refresh device entry for @cpswan
Wed Nov 27 17:48:12 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 17:48:12.124790 | refresh_device_entry | Released the atclient lock
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 18:48:13.402860 | refresh_device_entry | Refreshing device info entries for this device
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 18:48:13.402967 | connection | mbedtls_ssl_write failed with exit code: -80
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [ERROR] 2024-11-27 18:48:13.402987 | connection | Failed to send "noop:0
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: " to connection: -80
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 18:48:13.403004 | reconnect | Worker client is not connected, attempting to reconnect:
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 18:48:13.403020 | atclient | Missing atServer host or port. Using production atDirectory to look up atServer host and port
Wed Nov 27 18:48:13 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:13.984917 | connection |    SENT: "bareindoornetball"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.086475 | connection |    RECV: "e6697d6a-2567-50a7-9e61-396157b733d8.swarm0001.atsign.zone:15014"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.662178 | connection |    SENT: "from:bareindoornetball"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.763326 | connection |    RECV: "data:_0400ed33-9583-43bc-8fbd-bb93e594eab6@bareindoornetball:8bf6e0ab-bfe0-4610-905d-07e5542aa021"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.805014 | connection |    SENT: "pkam:bjoNoPQsSexUP0WKcX3lm3xMs04AxVoD/ZeCDDkF3bO6jjto6egJA1+XiHPUyi4s7BhMYzn50D4Fnj6LRv6X/BF28f2a/SQL9b+psuEhU7UDoxjvs31JcOKA4JH6ZwwyNzC0OQQPolT1Ld/GZ2SyFXcd71LQFoC2WdZyBS8VZaXKW2T1TltSMnkgj3rJqO27knAJQJEIGfVRb9mChxKoSeY0XaYnH1G59UvQUYsa6Mh6PgjUIha3ulNrJ+NybwRafLx8XhsUOvqMvSjjPTsFSr737RQ2n5VaS4VP3KQFgSDIOzq4b9GlTMUK7F8kao6iyH9OroFbEaZa0yuGh73eeA=="
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [DEBG] 2024-11-27 18:48:14.906585 | connection |    RECV: "data:success"
Wed Nov 27 18:48:14 2024 daemon.info sshnpd[10188]: [INFO] 2024-11-27 18:48:14.907140 | reconnect | Reconnected to the atServer!

After that the monitor loop seems to fail

In the full log I can see several (successful) attempts to reconnect to the atServer, but the monitor loop never seems to get re-established.

@XavierChanth
Copy link
Member

I think I know what's happening, will need to take a closer look at the code to confirm my hypothesis

@XavierChanth
Copy link
Member

Implemented a monitor read fix in at_c. NoPorts branch pinned against it: c-daemon-haning

@XavierChanth
Copy link
Member

When mbedtls_ssl_read returns 0 (socket closed without notify) we only partially handled it as a failure, but the return value of the atclient function was still 0, it now returns -1 in that scenario

@XavierChanth
Copy link
Member

Sorry, after testing myself, have some more work to do.

@XavierChanth
Copy link
Member

XavierChanth commented Nov 28, 2024

Added a new message type for monitor (EMPTY), which is returned when we receive a timeout (nothing to read). The main loop should now continue, freeing the empty monitor message, allocating a new monitor message, and trying to read again.

We could remove the constant reallocation, but we have it in place to prevent potential bugs, clean memory every time.

This new monitor message type is 4, which should be seen frequently on an inactive daemon in the logs.

@XavierChanth
Copy link
Member

I have uptaken these changes in the same NoPorts branch: c-daemon-haning

@cconstab
Copy link
Member

If this fixes its a new version push asap..

Could you build an RC and I am happy to test.

@XavierChanth
Copy link
Member

#1579 and atsign-foundation/at_c#452 are the associated changes

@cconstab
Copy link
Member

Fix is holding aftwr 8 hours.. TY!

@cpswan cpswan reopened this Dec 2, 2024
@cpswan
Copy link
Member Author

cpswan commented Dec 2, 2024

Reopening as both of my test instances have become unreachable after running for a few days.

Most recent console logs look like:

Mon Dec  2 16:06:23 2024 daemon.info sshnpd[22555]: [DEBG] 2024-12-02 16:06:23.140631 | sshnpd | Waiting for next monitor thread message
Mon Dec  2 16:06:26 2024 daemon.info sshnpd[22555]: [DEBG] 2024-12-02 16:06:26.143630 | sshnpd | Received message of type: 4

I'm going to reconfigure logging so that I can capture the crossover from working to failing.

@XavierChanth
Copy link
Member

Reopening as both of my test instances have become unreachable after running for a few days.

Most recent console logs look like:

Mon Dec  2 16:06:23 2024 daemon.info sshnpd[22555]: [DEBG] 2024-12-02 16:06:23.140631 | sshnpd | Waiting for next monitor thread message
Mon Dec  2 16:06:26 2024 daemon.info sshnpd[22555]: [DEBG] 2024-12-02 16:06:26.143630 | sshnpd | Received message of type: 4

I'm going to reconfigure logging so that I can capture the crossover from working to failing.

What device / arch is this on?

@cpswan
Copy link
Member Author

cpswan commented Dec 3, 2024

What device / arch is this on?

That's from my OpenWrt One (aarch64), but similar story with my Flint 2.

@cconstab
Copy link
Member

cconstab commented Dec 9, 2024

From what i see after a few sometimes days I no longer see these messages (just type 4) and the daemon never responds

[DEBG] 2024-12-09 04:08:39.835539 | atclient_monitor |  RECV: "notification: {"id":"-1","from":"@ssh_1","to":"@ssh_1","key":"statsNotification.@ssh_1","value":"227455","operation":"update","epochMillis":1733717319794,"messageType":"MessageType.key","isEncrypted":false,"metadata":null}"
[DEBG] 2024-12-09 04:08:39.835572 | sshnpd | Received message of type: 1
[DEBG] 2024-12-09 04:08:39.835576 | sshnpd | Notification value received: 227455

So this is different from the original issue and looks like something to do with the atclient monitor dying perhaps

@XavierChanth
Copy link
Member

I'm separating the socket code into a platform agnostic layer for Arduino support, this will allow us to do some more rigorous testing against the mbedtls socket code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants