Asterisk taskprocessors issue due to infinite executions of apr_thread_cond_timedwait #67
Another lock that shows up in every backtrace I have collected since the beginning of this investigation is the following:
Thread 49 (Thread 0x7faf2f447700 (LWP 3551528) "asterisk"):
#0 futex_abstimed_wait_cancelable (private=0, abstime=0x7faf2f43c2a0, clockid=792969776, expected=0, futex_word=0x7faf5800e208) at ../sysdeps/nptl/futex-internal.h:323
#1 __pthread_cond_wait_common (abstime=0x7faf2f43c2a0, clockid=792969776, mutex=0x7faf5800e190, cond=0x7faf5800e1e0) at pthread_cond_wait.c:520
#2 __pthread_cond_timedwait (cond=cond@entry=0x7faf5800e1e0, mutex=mutex@entry=0x7faf5800e190, abstime=abstime@entry=0x7faf2f43c2a0) at pthread_cond_wait.c:656
#3 0x00007fafb4331d92 in apr_thread_cond_timedwait (cond=0x7faf5800e1d8, mutex=0x7faf5800e188, timeout=30000000) at locks/unix/thread_cond.c:89
#4 0x00007fafb43d6b19 in speech_channel_open (schannel=0x7faf5800e0b0, profile=<optimized out>) at speech_channel.c:1676
#5 0x00007fafb43e1014 in app_synthandrecog_exec (chan=<optimized out>, data=<optimized out>) at app_synthandrecog.c:1853
#6 0x000055a17c4126ce in pbx_exec (c=0x7faf3c11aae8, app=0x55a17ef9e4a0, data=0x7faf2f43e3c3 XXXX
#7 0x00007faf5752443f in handle_exec (chan=0x7faf3c11aae8, agi=0x7faf2f4434d0, argc=3, argv=0x7faf2f43df10) at res_agi.c:3175
#8 0x00007faf5752717e in agi_handle_command (chan=0x7faf3c11aae8, agi=0x7faf2f4434d0, buf=0x7faf2f43e3b0 "EXEC", dead=0) at res_agi.c:4084
#9 0x00007faf57527bc3 in run_agi (chan=0x7faf3c11aae8, request=0x7faf2f443450 "agi://secret-ip.secret-domain:secret-port/my.agi?XXX=YYYYYY", agi=0x7faf2f4434d0, pid=-1, status=0x7faf2f4434c4, dead=0, argc=1, argv=0x>
#10 0x00007faf57528f2f in agi_exec_full (chan=0x7faf3c11aae8, data=0x7faf2f443a30 "agi://secret-ip.secret-domain:secret-port/my.agi?XXX=YYYYYY\"", enhanced=0, dead=0) at res_agi.c:4562
#11 0x00007faf5752907f in agi_exec (chan=0x7faf3c11aae8, data=0x7faf2f443a30 "agi://secret-ip.secret-domain:secret-port/my.agi?XXX=YYYYYY\"") at res_agi.c:4596
#12 0x000055a17c4126ce in pbx_exec (c=0x7faf3c11aae8, app=0x55a17e341550, data=0x7faf2f443a30 "agi://secret-ip.secret-domain:secret-port/my.agi?XXX=YYYYYY\"") at pbx_app.c:492
#13 0x000055a17c3fc816 in pbx_extension_helper (c=0x7faf3c11aae8, con=0x0, context=0x7faf3c11b4a8 "my_context", exten=0x7faf3c11b4f8 "00", priority=3, label=0x0, callerid=0x7faf6c0c7400 "507840350", action=E_S>
#14 0x000055a17c400a29 in ast_spawn_extension (c=0x7faf3c11aae8, context=0x7faf3c11b4a8 "mycontext", exten=0x7faf3c11b4f8 "00", priority=3, callerid=0x7faf6c0c7400 "507840350", found=0x7faf2f446ccc, combined_>
#15 0x000055a17c4017bd in __ast_pbx_run (c=0x7faf3c11aae8, args=0x0) at pbx.c:4376
#16 0x000055a17c40315e in pbx_thread (data=0x7faf3c11aae8) at pbx.c:4700
#17 0x000055a17c4b5715 in dummy_start (data=0x7faf3c5a3a10) at utils.c:1607
#18 0x00007fafb77feea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
This points to the following part of the code: asterisk-unimrcp/app-unimrcp/speech_channel.c, line 497 in f16eacb
Before we reach this point in speech_channel_open(), we have already checked that the channel is not closed (if it is, we exit speech_channel_open()). However, the channel can still end up closed within the following nanoseconds or microseconds if the ASR server crashes. After examining this case, I concluded that every thread that ends up here sleeps forever, because the channel always remains closed: even with a timeout of 30 seconds, the thread sleeps for 30 s, wakes up on the timeout, and goes back to sleep. This repeats indefinitely, so the thread is trapped here forever. Therefore, I replaced the while with an if statement here: |
When using Asterisk with chan_sip, these locks may result in just another thread being deadlocked. However, in the case of Asterisk with pjsip, this issue causes Asterisk to become unresponsive, making it by far a more serious problem. |
Also, once I had fixed the above locks, another hidden issue was revealed:
(gdb) bt
#0 0x00007f3b6d4f8283 in __pthread_mutex_unlock_usercnt () from /lib64/libpthread.so.0
#1 0x00007f3b2f3d2c70 in speech_channel_read (schannel=schannel@entry=0x7f3b502d1048, data=0x7f3b50189a20, len=len@entry=0x7f3b19195ca8, block=block@entry=0) at speech_channel.c:699
#2 0x00007f3b2f3dbc69 in recog_stream_read (stream=<optimized out>, frame=0x7f3b50189928) at app_synthandrecog.c:1190
#3 0x00007f3b2f1a7b29 in mpf_audio_stream_frame_read (frame=0x7f3b50189928, stream=<optimized out>) at ../../libs/mpf/include/mpf_stream.h:136
#4 mpf_decoder_process (stream=<optimized out>, frame=0x7f3b50189c08) at src/mpf_decoder.c:60
#5 0x00007f3b2f1a210b in mpf_bridge_process (object=0x7f3b50189bd0) at src/mpf_bridge.c:63
#6 0x00007f3b2f1a493e in mpf_context_process (context=context@entry=0x7f3b50188490) at src/mpf_context.c:438
#7 0x00007f3b2f1a4979 in mpf_context_factory_process (factory=0x3059ac8) at src/mpf_context.c:105
#8 0x00007f3b2f1a7580 in timer_thread_proc (thread=0x30729a0, data=0x3059b28) at src/mpf_scheduler.c:212
#9 0x00007f3b6d4f444b in start_thread () from /lib64/libpthread.so.0
#10 0x00007f3b6afaa52f in clone () from /lib64/libc.so.6
From the logs, we can see that the session has been marked as disconnected:
[2024-09-16 10:22:38.310] NOTICE[20011] src/rtsp_client.c: Cancel RTSP Request 0x7fdf5407ad98 <be04e9f83ffe4fc9aea4e1762d864369> CSeq:6 [500]
[2024-09-16 10:22:38.311] DEBUG[20008] src/mrcp_client_session.c: Mark Session as Disconnected ASR-592 <be04e9f83ffe4fc9aea4e1762d864369>
[2024-09-16 10:22:39.996] ERROR[22835][C-00000250] app_synthandrecog.c: (ASR-592) Unable to load grammar
...
[2024-09-16 10:22:40.000] DEBUG[22835][C-00000250] speech_channel.c: Destroy speech channel: Name=ASR-592, Type=RECOGNIZER, Codec=PCMA, Rate=8000
[2024-09-16 10:22:40.000] DEBUG[22835][C-00000250] src/apt_task.c: Signal Message to [MRCP Client] [0x7fdf60058480;4;0]
[2024-09-16 10:22:40.012] DEBUG[22835][C-00000250] speech_channel.c: (ASR-592) Waiting for MRCP session to terminate
[2024-09-16 10:22:40.013] NOTICE[20008] src/mrcp_client_session.c: Receive App Request ASR-592 <be04e9f83ffe4fc9aea4e1762d864369> [1]
[2024-09-16 10:22:40.013] DEBUG[20008] src/mrcp_client_session.c: Push Request to Queue ASR-592 <be04e9f83ffe4fc9aea4e1762d864369>
[2024-09-16 10:22:42.013] WARNING[22835][C-00000250] speech_channel.c: (ASR-592) MRCP session has not terminated after 2000 ms
[2024-09-16 10:22:42.014] ERROR[22835][C-00000250] speech_channel.c: (ASR-592) Failed to destroy channel. Continuing
[2024-09-16 10:22:42.014] DEBUG[22835][C-00000250] audio_queue.c: (ASR-592) Audio queue destroyed
[2024-09-16 10:22:42.014] DEBUG[22835][C-00000250] speech_channel.c: (ASR-592) MPF generator has been reset
[2024-09-16 10:22:42.015] DEBUG[22835][C-00000250] speech_channel.c: (ASR-592) DTMF generator destroyed
...
For now I have not managed to provide a fix, so I guess I will have to live with either the locks or the crash. I think the locks have been there for a long time, in all previous releases of asterisk-unimrcp. If I find some time, I will also study the unimrcp implementation, especially the MRCP client, in order to provide a fix for the crash revealed after removing the deadlocks. |
I found the issue that leads to the above crash. The fix has to be made in unimrcp, and I will do it at the first chance. So, as a recap: |
I added a patch in unimrcp related to this issue here: |
WIP for this issue:
#66
Issue Overview:
The issue is about a lock caused by an infinite loop in speech_channel_destroy, which results in apr_thread_cond_timedwait being executed an unbounded number of times.
Asterisk taskprocessors keep growing until they reach the high water limit, because of the infinite while loop around apr_thread_cond_timedwait(), and then Asterisk cannot serve any new calls.
Investigation Process:
… null frame. Hangup detected
sipp my_secret.domain -sf 1SecSilence_hup.xml -s +1XXXXXXX -l 30 -r 10 -t t1
asterisk -x 'core show taskprocessors'
pjsip/distributor-00000027 250 38 3 450 500
ast_coredumper --running --no-default-search
I cannot share the full files until I mask sensitive info (It will take a lot of time)
sudo watch -n 2 "asterisk -x 'core show locks'"
speech_channel_destroy (schannel=0x7f7b84050c48) at speech_channel.c:419
Solution Explanation:
Instead of waiting once for globals.speech_channel_timeout, which is 30 seconds by default, we wait for an unbounded number of 30-second intervals, and the reason is that the check is wrong:
In C, the operands of the && operator are evaluated left to right, and the right operand is evaluated only if the left one is true.
So, the sleep in apr_thread_cond_timedwait() is always executed.
A single 30-second wait is enough, so in our first fix we replace the while with an if statement and also remove the warned condition.
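The effect of the left-to-right evaluation can be reproduced in isolation. The sketch below assumes the check has the shape `timedwait(...) == TIMEUP && !warned`; the original snippet is not shown here, so that shape is a guess based on the mention of the warned condition. `demo_timedwait`, `DEMO_TIMEUP`, and `demo_destroy_loop` are hypothetical names, and the loop is capped at `max_rounds` only so that the example terminates.

```c
#include <assert.h>

#define DEMO_TIMEUP 1 /* hypothetical stand-in for a "timed out" status */

/* Counts how many times the (simulated) timed wait was entered. */
static int demo_wait_calls = 0;

/* Stub for a timed wait that always times out, as with a dead session. */
static int demo_timedwait(void)
{
    demo_wait_calls++;
    return DEMO_TIMEUP;
}

/*
 * Mimics the assumed loop shape. Because the left operand of && is
 * evaluated first, demo_timedwait() runs on every round; the warned
 * flag only suppresses repeated warnings, it never skips the wait.
 * Returns the number of warnings that would have been logged.
 */
static int demo_destroy_loop(int closed, int max_rounds)
{
    int warned = 0;
    int warnings = 0;
    int rounds = 0;

    while (!closed && rounds < max_rounds) {
        if (demo_timedwait() == DEMO_TIMEUP && !warned) {
            warned = 1;
            warnings++;
        }
        rounds++;
    }
    return warnings;
}
```

With a session that never closes, five rounds produce one warning but five waits; in the real loop each of those waits would be a full timeout interval, and there is no cap on the number of rounds.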
Impact of the Change:
The investigation will continue...
Note: In my application I use APP_SESSION_LIFETIME_PERSISTENT