Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asterisk taskprocessors issue due to infinite executions of apr_thread_cond_timedwait #67

Open
stavroskladis opened this issue Aug 29, 2024 · 6 comments

Comments

@stavroskladis
Copy link

stavroskladis commented Aug 29, 2024

WIP for this issue:

#66

Issue Overview:

The issue is about a lock originated due to an infinite loop occurring in speech_channel_destroy and infinite times of execution of apr_thread_cond_timedwait function.
Asterisk taskprocessors are increasing until they reach high water limit due to the usage of infinite while loop and apr_thread_cond_timedwait() and then asterisk cannot serve any new calls.

Investigation Process:

  1. Scenario: in first step I use SynthAndRecog to play a prompt and for speech recognition , in second step I stream silence I get:
    … null frame. Hangup detected
  2. sipp my_secret.domain -sf 1SecSilence_hup.xml -s +1XXXXXXX -l 30 -r 10 -t t1
  3. While calls are running I crash/force restart 1 ASR Server out of 2
  4. Monitor taskprocessors in another terminal with: asterisk -x 'core show taskprocessors'
  5. As a result we see many of taskprocessors being increased and never be freed again (for short I paste only one):
    pjsip/distributor-00000027 250 38 3 450 500
  6. Retrieve stack traces from asterisk running instance:
    ast_coredumper --running --no-default-search
Dumping running asterisk process to /tmp/core-asterisk-running-2024-08-29T10-48-27Z
Dump is complete.
Processing /tmp/core-asterisk-running-2024-08-29T10-48-27Z
    ASTBIN: /apps/asterisk/bin/sbin/asterisk
    MODDIR: /apps/asterisk/bin/lib/asterisk/modules
    ETCDIR: /apps/asterisk/etc/asterisk
    LIBDIR: /apps/asterisk/bin/lib
    Creating /tmp/core-asterisk-running-2024-08-29T10-48-27Z-thread1.txt
    Creating /tmp/core-asterisk-running-2024-08-29T10-48-27Z-brief.txt
    Creating /tmp/core-asterisk-running-2024-08-29T10-48-27Z-full.txt
    Creating /tmp/core-asterisk-running-2024-08-29T10-48-27Z-locks.txt
    Creating /tmp/core-asterisk-running-2024-08-29T10-48-27Z-info.txt

I cannot share the full files until I mask sensitive info (It will take a lot of time)

  1. Checking the active calls with no traffic
    sudo watch -n 2 "asterisk -x 'core show locks'"
Every 2.0s: asterisk -x 'core show locks'                                                                                                                                                                                              Thu Aug 29 10:57:04 2024
=======================================================================
=== 18.23.1
=== Currently Held Locks at Time: 479146:57:04.575956 =================
=======================================================================
===
=== <pending> <lock#> (<file>): <lock type> <line num> <function> <lock name> <lock addr>
=== <locked at>, <failed at>, <held for> (attempts, times locked)
===
=== Thread ID: 0x7f7b68980700 LWP:11718 (worker_start         started at [ 1111] threadpool.c worker_thread_start())
=== ---> Lock #0 (channel.c): MUTEX 2228 ast_channel_destructor chan 0x7f7b8c130660
===	 479146:46:26.751422, 479146:46:26.748890, 00:10:37.824547 (1, 1)
        [0x52e7a8] asterisk lock.c:277 __ast_pthread_mutex_lock()
        [0x46916e] asterisk astobj2.c:241 __ao2_lock()
        [0x4ade71] asterisk channel.c:2231 ast_channel_destructor()
        [0x469f96] asterisk astobj2.c:619 __ao2_ref()
        [0x7f7b6d0f27ef] chan_pjsip.so chan_pjsip.c:2524 hangup_data_destroy()
        [0x469f96] asterisk astobj2.c:619 __ao2_ref()
        [0x46a2a4] asterisk astobj2.c:675 __ao2_cleanup_debug()
        [0x7f7b6d0f2b9a] chan_pjsip.so chan_pjsip.c:2587 hangup()
        [0x5e907a] asterisk taskprocessor.c:1304 ast_taskprocessor_execute()
        [0x5f2c1b] asterisk threadpool.c:1350 execute_tasks()
        [0x5e907a] asterisk taskprocessor.c:1304 ast_taskprocessor_execute()
        [0x5f0999] asterisk threadpool.c:367 threadpool_execute()
        [0x5f2458] asterisk threadpool.c:1137 worker_active()
        [0x5f21f2] asterisk threadpool.c:1057 worker_start()
        [0x5feb10] asterisk utils.c:1607 dummy_start()
        [0x7f7bba7e444b] libpthread.so.0 :0 __pthread_get_minstack()
        [0x7f7bb829a52f] libc.so.6 :0 clone()
=== -------------------------------------------------------------------
===
=== Thread ID: 0x7f7bb655c700 LWP:13298 (worker_start         started at [ 1111] threadpool.c worker_thread_start())
=== ---> Lock #0 (channel.c): MUTEX 2228 ast_channel_destructor chan 0x7f7b9c18ac70
===	 479146:42:35.762519, 479146:42:35.630516, 00:14:28.899403 (1, 1)
        [0x52e7a8] asterisk lock.c:277 __ast_pthread_mutex_lock()
        [0x46916e] asterisk astobj2.c:241 __ao2_lock()
        [0x4ade71] asterisk channel.c:2231 ast_channel_destructor()
        [0x469f96] asterisk astobj2.c:619 __ao2_ref()
        [0x7f7b6d0f27ef] chan_pjsip.so chan_pjsip.c:2524 hangup_data_destroy()
        [0x469f96] asterisk astobj2.c:619 __ao2_ref()
        [0x46a2a4] asterisk astobj2.c:675 __ao2_cleanup_debug()
        [0x7f7b6d0f2b9a] chan_pjsip.so chan_pjsip.c:2587 hangup()
        [0x5e907a] asterisk taskprocessor.c:1304 ast_taskprocessor_execute()
        [0x5f2c1b] asterisk threadpool.c:1350 execute_tasks()
        [0x5e907a] asterisk taskprocessor.c:1304 ast_taskprocessor_execute()
        [0x5f0999] asterisk threadpool.c:367 threadpool_execute()
        [0x5f2458] asterisk threadpool.c:1137 worker_active()
        [0x5f21f2] asterisk threadpool.c:1057 worker_start()
        [0x5feb10] asterisk utils.c:1607 dummy_start()
        [0x7f7bba7e444b] libpthread.so.0 :0 __pthread_get_minstack()
        [0x7f7bb829a52f] libc.so.6 :0 clone()
=== -------------------------------------------------------------------
===
=======================================================================
  1. Based on the thread id of the locks we search the *-brief file for the stack trace of this thread:
Thread 22 (Thread 0x7f7b68980700 (LWP 11718)):
#0  0x00007f7bba7ea71e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f7b77de9662 in apr_thread_cond_timedwait (cond=0x7f7b84050d48, mutex=0x7f7b84050cf8, timeout=30000000) at locks/unix/thread_cond.c:89
#2  0x00007f7bb479e26a in speech_channel_destroy (schannel=0x7f7b84050c48) at speech_channel.c:419
#3  0x00007f7bb47a16b0 in app_session_destroy (app_session=0x7f7b84050b08) at app_datastore.c:174
#4  app_datastore_destroy (data=0x7f7b84050698) at app_datastore.c:191
#5  0x00000000004ed6f4 in ast_datastore_free (datastore=0x7f7b8401ddb0) at datastore.c:78
#6  0x00000000004ade7f in ast_channel_destructor (obj=0x7f7b8c1306b8) at channel.c:2233
#7  0x0000000000469f96 in __ao2_ref (user_data=0x7f7b8c1306b8, delta=-1, tag=0x0, file=0x7f7b6d100134 "chan_pjsip.c", line=2524, func=0x7f7b6d1029c0 <__PRETTY_FUNCTION__.30695> "hangup_data_destroy") at astobj2.c:615
#8  0x00007f7b6d0f27ef in hangup_data_destroy (obj=0x7f7b84037608) at chan_pjsip.c:2524
#9  0x0000000000469f96 in __ao2_ref (user_data=0x7f7b84037608, delta=-1, tag=0x0, file=0x7f7b6d100134 "chan_pjsip.c", line=2586, func=0x7f7b6d1029f9 <__PRETTY_FUNCTION__.30727> "hangup") at astobj2.c:615
#10 0x000000000046a2a4 in __ao2_cleanup_debug (obj=0x7f7b84037608, tag=0x0, file=0x7f7b6d100134 "chan_pjsip.c", line=2586, function=0x7f7b6d1029f9 <__PRETTY_FUNCTION__.30727> "hangup") at astobj2.c:673
#11 0x00007f7b6d0f2b9a in hangup (data=0x7f7b84037608) at chan_pjsip.c:2586
#12 0x00000000005e907a in ast_taskprocessor_execute (tps=0x2e34118) at taskprocessor.c:1302
#13 0x00000000005f2c1b in execute_tasks (data=0x2e34118) at threadpool.c:1350
#14 0x00000000005e907a in ast_taskprocessor_execute (tps=0x28fcde8) at taskprocessor.c:1302
#15 0x00000000005f0999 in threadpool_execute (pool=0x27de0f8) at threadpool.c:367
#16 0x00000000005f2458 in worker_active (worker=0x7f7b940042d8) at threadpool.c:1137
#17 0x00000000005f21f2 in worker_start (arg=0x7f7b940042d8) at threadpool.c:1056
#18 0x00000000005feb10 in dummy_start (data=0x28e3250) at utils.c:1607
#19 0x00007f7bba7e444b in start_thread () from /lib64/libpthread.so.0
#20 0x00007f7bb829a52f in clone () from /lib64/libc.so.6
  1. Based on the stack trace of this thread we can see clearly that the lock takes place in:
    speech_channel_destroy (schannel=0x7f7b84050c48) at speech_channel.c:419

Solution Explanation:

if ((apr_thread_cond_timedwait(schannel->cond, schannel->mutex, globals.speech_channel_timeout) == APR_TIMEUP) && (!warned)) {...}

Instead of waiting for globals.speech_channel_timeout which is 30 seconds by default we are waiting an infinite times of 30 seconds intervals and the reason is that the check is wrong:
In C, the association (or associativity) of most operators, including the && operator is left-to-right.
So, the sleep due to apr_thread_cond_timedwait() is always executed.
30 seconds of sleep is enough so in our first fix we will substitute while with if command and delete also warned condition.

Impact of the Change:

The investigation will continue...

Note: In my application I use APP_SESSION_LIFETIME_PERSISTENT

@stavroskladis stavroskladis changed the title Asterisk taskprocessors issue due to a trap in an infinite apr_thread_cond_timedwait Asterisk taskprocessors issue due infinite execution of apr_thread_cond_timedwait Aug 29, 2024
@stavroskladis stavroskladis changed the title Asterisk taskprocessors issue due infinite execution of apr_thread_cond_timedwait Asterisk taskprocessors issue due to infinite executions of apr_thread_cond_timedwait Aug 29, 2024
@stavroskladis
Copy link
Author

stavroskladis commented Sep 1, 2024

Another lock that is obvious all over all of the backtraces I have collected from the beginning of this investigation is the following:

Thread 49 (Thread 0x7faf2f447700 (LWP 3551528) "asterisk"):
#0  futex_abstimed_wait_cancelable (private=0, abstime=0x7faf2f43c2a0, clockid=792969776, expected=0, futex_word=0x7faf5800e208) at ../sysdeps/nptl/futex-internal.h:323
#1  __pthread_cond_wait_common (abstime=0x7faf2f43c2a0, clockid=792969776, mutex=0x7faf5800e190, cond=0x7faf5800e1e0) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x7faf5800e1e0, mutex=mutex@entry=0x7faf5800e190, abstime=abstime@entry=0x7faf2f43c2a0) at pthread_cond_wait.c:656
#3  0x00007fafb4331d92 in apr_thread_cond_timedwait (cond=0x7faf5800e1d8, mutex=0x7faf5800e188, timeout=30000000) at locks/unix/thread_cond.c:89
#4  0x00007fafb43d6b19 in speech_channel_open (schannel=0x7faf5800e0b0, profile=<optimized out>) at speech_channel.c:1676
#5  0x00007fafb43e1014 in app_synthandrecog_exec (chan=<optimized out>, data=<optimized out>) at app_synthandrecog.c:1853
#6  0x000055a17c4126ce in pbx_exec (c=0x7faf3c11aae8, app=0x55a17ef9e4a0, data=0x7faf2f43e3c3 XXXX
#7  0x00007faf5752443f in handle_exec (chan=0x7faf3c11aae8, agi=0x7faf2f4434d0, argc=3, argv=0x7faf2f43df10) at res_agi.c:3175
#8  0x00007faf5752717e in agi_handle_command (chan=0x7faf3c11aae8, agi=0x7faf2f4434d0, buf=0x7faf2f43e3b0 "EXEC", dead=0) at res_agi.c:4084
#9  0x00007faf57527bc3 in run_agi (chan=0x7faf3c11aae8, request=0x7faf2f443450 "agi://secret-ip.secret-domain:secret-port/my.agi?XXX=YYYYYY", agi=0x7faf2f4434d0, pid=-1, status=0x7faf2f4434c4, dead=0, argc=1, argv=0x>
#10 0x00007faf57528f2f in agi_exec_full (chan=0x7faf3c11aae8, data=0x7faf2f443a30 "agi://secret-ip.secret-domain:secret-port/my.agi?XXX=YYYYYY\"", enhanced=0, dead=0) at res_agi.c:4562
#11 0x00007faf5752907f in agi_exec (chan=0x7faf3c11aae8, data=0x7faf2f443a30 "agi://secret-ip.secret-domain:secret-port/my.agi?XXX=YYYYYY\"") at res_agi.c:4596
#12 0x000055a17c4126ce in pbx_exec (c=0x7faf3c11aae8, app=0x55a17e341550, data=0x7faf2f443a30 "agi://secret-ip.secret-domain:secret-port/my.agi?XXX=YYYYYY\"") at pbx_app.c:492
#13 0x000055a17c3fc816 in pbx_extension_helper (c=0x7faf3c11aae8, con=0x0, context=0x7faf3c11b4a8 "my_context", exten=0x7faf3c11b4f8 "00", priority=3, label=0x0, callerid=0x7faf6c0c7400 "507840350", action=E_S>
#14 0x000055a17c400a29 in ast_spawn_extension (c=0x7faf3c11aae8, context=0x7faf3c11b4a8 "mycontext", exten=0x7faf3c11b4f8 "00", priority=3, callerid=0x7faf6c0c7400 "507840350", found=0x7faf2f446ccc, combined_>
#15 0x000055a17c4017bd in __ast_pbx_run (c=0x7faf3c11aae8, args=0x0) at pbx.c:4376
#16 0x000055a17c40315e in pbx_thread (data=0x7faf3c11aae8) at pbx.c:4700
#17 0x000055a17c4b5715 in dummy_start (data=0x7faf3c5a3a10) at utils.c:1607
#18 0x00007fafb77feea7 in start_thread (arg=<optimized out>) at pthread_create.c:477

Which points to this part of code:

while (schannel->state == SPEECH_CHANNEL_CLOSED)

Before we reach this part, in speech_channel_open() we have already checked that the channel is not closed (if it is we exit the speech_channel_open()) but it gets closed because within these nano seconds or micro seconds ASR server crashed.

After examining this case I concluded that each thread that falls in this case, will sleep forever at this point, because the channel will always remain closed (even we have a timeout of 30 seconds thread will be sleeping for 30s, wake up due to the timeout and sleep again. This will be happening infinite times, thus be trapped here forever!!!).

Thus, I substituted the while with an if statement here:
a920a8d

@stavroskladis
Copy link
Author

stavroskladis commented Sep 1, 2024

When using Asterisk with chan_sip, these locks may result in just another thread being deadlocked. However, in the case of Asterisk with pjsip, this issue causes Asterisk to become unresponsive, making it by far a more serious problem.

@stavroskladis
Copy link
Author

The way I see it (with a quick look on the logs related) the following issues are related to the above fix:
#52
#45

@stavroskladis
Copy link
Author

stavroskladis commented Sep 16, 2024

Also, by the time I fixed the above locks another hidden issue was revealed.
In case of an asr server crash (mrcp server) asterisk-unimrcp and mrcp_client do not terminate properly the mpf_engine / stream and all the relevant to mpf resources. So the following crash is takes place.

(gdb) bt
#0  0x00007f3b6d4f8283 in __pthread_mutex_unlock_usercnt () from /lib64/libpthread.so.0
#1  0x00007f3b2f3d2c70 in speech_channel_read (schannel=schannel@entry=0x7f3b502d1048, data=0x7f3b50189a20, len=len@entry=0x7f3b19195ca8, block=block@entry=0) at speech_channel.c:699
#2  0x00007f3b2f3dbc69 in recog_stream_read (stream=<optimized out>, frame=0x7f3b50189928) at app_synthandrecog.c:1190
#3  0x00007f3b2f1a7b29 in mpf_audio_stream_frame_read (frame=0x7f3b50189928, stream=<optimized out>) at ../../libs/mpf/include/mpf_stream.h:136
#4  mpf_decoder_process (stream=<optimized out>, frame=0x7f3b50189c08) at src/mpf_decoder.c:60
#5  0x00007f3b2f1a210b in mpf_bridge_process (object=0x7f3b50189bd0) at src/mpf_bridge.c:63
#6  0x00007f3b2f1a493e in mpf_context_process (context=context@entry=0x7f3b50188490) at src/mpf_context.c:438
#7  0x00007f3b2f1a4979 in mpf_context_factory_process (factory=0x3059ac8) at src/mpf_context.c:105
#8  0x00007f3b2f1a7580 in timer_thread_proc (thread=0x30729a0, data=0x3059b28) at src/mpf_scheduler.c:212
#9  0x00007f3b6d4f444b in start_thread () from /lib64/libpthread.so.0
#10 0x00007f3b6afaa52f in clone () from /lib64/libc.so.6

From logs perspective we can see that the session has been marked as disconnected:

[2024-09-16 10:22:38.310] NOTICE[20011] src/rtsp_client.c: Cancel RTSP Request 0x7fdf5407ad98 <be04e9f83ffe4fc9aea4e1762d864369> CSeq:6 [500]
[2024-09-16 10:22:38.311] DEBUG[20008] src/mrcp_client_session.c: Mark Session as Disconnected ASR-592 <be04e9f83ffe4fc9aea4e1762d864369>
[2024-09-16 10:22:39.996] ERROR[22835][C-00000250] app_synthandrecog.c: (ASR-592) Unable to load grammar
...
[2024-09-16 10:22:40.000] DEBUG[22835][C-00000250] speech_channel.c: Destroy speech channel: Name=ASR-592, Type=RECOGNIZER, Codec=PCMA, Rate=8000
[2024-09-16 10:22:40.000] DEBUG[22835][C-00000250] src/apt_task.c: Signal Message to [MRCP Client] [0x7fdf60058480;4;0]
[2024-09-16 10:22:40.012] DEBUG[22835][C-00000250] speech_channel.c: (ASR-592) Waiting for MRCP session to terminate
[2024-09-16 10:22:40.013] NOTICE[20008] src/mrcp_client_session.c: Receive App Request ASR-592 <be04e9f83ffe4fc9aea4e1762d864369> [1]
[2024-09-16 10:22:40.013] DEBUG[20008] src/mrcp_client_session.c: Push Request to Queue ASR-592 <be04e9f83ffe4fc9aea4e1762d864369>
[2024-09-16 10:22:42.013] WARNING[22835][C-00000250] speech_channel.c: (ASR-592) MRCP session has not terminated after 2000 ms
[2024-09-16 10:22:42.014] ERROR[22835][C-00000250] speech_channel.c: (ASR-592) Failed to destroy channel. Continuing
[2024-09-16 10:22:42.014] DEBUG[22835][C-00000250] audio_queue.c: (ASR-592) Audio queue destroyed
[2024-09-16 10:22:42.014] DEBUG[22835][C-00000250] speech_channel.c: (ASR-592) MPF generator has been reset
[2024-09-16 10:22:42.015] DEBUG[22835][C-00000250] speech_channel.c: (ASR-592) DTMF generator destroyed
...

For now I haven't managed to provide a fix, so I guess either I will live with the locks or the crash

I think the locks are there a long time now, with all the previous releases of asterisk-unimrcp
They were not detected so far because most of the people or companies haven't migrated to pjsip yet (so no taskprocessors lock there).
With chan_sip asterisk module the problem exists, it just doen't affect asterisk and it is not that obvious.

If I find some time I will study the unimrcp implementation as well and especially mrcp client in order to provide a fix for the crash revealed after removing the deadlocks.

@stavroskladis
Copy link
Author

stavroskladis commented Sep 20, 2024

I detected the issue that leads to the above crash.
mrcp_client of unimrcp (implements mrcp protocol and is used by asterisk-unimrcp asterisk module) in case of a disconnection in any of its’ connections, didn’t destroy the mrcp channel and its’ resources (mpf topology, engine, etc). Without this destroy action mpf engine was trying to read frames from the speech channel even after its termination from the asterisk-unimrcp module (pointers pointed to garbage).
After this realisation the solution was 1 line of code
Stresstesting for 40 minutes, ~4000 calls processed per pjsip/distributor, 10 asr server crashes, no ivr lock and no ivr crash.

This fix has to be added on unimrcp and I will do it given the first chance...
Also I will duplicate this issue in the reported issues of unimrcp.

So, as a recap:
When the deadlocks existed, a disconnection lead to a lock and with 450 locks I got an unresponsive asterisk.
When I solved the locks on the first disconnected channel I got the crash I reported above.
A patch is required in unimrcp for this.

@stavroskladis
Copy link
Author

I added a patch in unimrcp related to this issue here:
unispeech/unimrcp#332

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant