You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have read the documentation for esp-protocols components and the issue is not addressed there.
I have updated my esp-protocols branch (master or release) to the latest version and checked that the issue is present there.
I have searched the issue tracker for a similar issue and not found a similar issue.
General issue report
I've been using esp_websocket_client 1.1.0 together with a GSM module (SIM7070G) connected via USB. Microcontroller is an ESP32S3 with 16 MB Flash and 8 MB SPIRAM, the PCB is a custom one with the GSM module on it. ESP IDF version 5.1.1.
There are rather frequent crashes (probability is around 10-20% in my project) when the Internet connection dies. Crashes occur because worker task esp_websocket_client_task() accesses deleted heap memory behind pointer client before exiting. So there is a race between this worker task exiting and its context being deleted with the same heap area re-used by another task. The sequence of events leading to a crash is following:
My code creates a new WS client where disable_auto_reconnect is set to true (otherwise the issue is masked). This malloc-s the client context from heap. It then connects to a WS server and happily exchanges some data.
At a random moment the GSM module stops communicating data (they have a nasty habit of occasionally doing that).
The WS worker task esp_websocket_client_task() is in the process of reading data from the server using esp_websocket_client_recv(), which fails due to the communication blackout. It aborts the connection by calling esp_websocket_client_abort_connection()here:
esp_websocket_client_abort_connection() does two things relevant to this issue. Firstly it sets the client state to WEBSOCKET_STATE_WAIT_TIMEOUT, secondly it dispatches event WEBSOCKET_EVENT_DISCONNECTED to my code.
Things now branch off into two different tasks, as my code is not allowed to kill the WS client in the same task where it received WEBSOCKET_EVENT_DISCONNECTED. My event handler posts a message to another thread to start the process of cleaning up and re-connecting the Websocket connection (remember, auto-reconnect is disabled). I'll describe what happens in the sequence of events that I observe from logs:
Task running esp_websocket_client_task()
The loop in esp_websocket_client_task() next handles state WEBSOCKET_STATE_WAIT_TIMEOUT. It delays for half the network timeout period here, then stops the loop by clearing client->runin the next iteration.
The race is triggered when worker task delays again for another half network timeout period while client->run is set to false. During this time window I'm handling the restarting of the WS connection in another task.
Task running WS cleanup and re-connect
While worker task is in delay (step 6), my task has started processing WEBSOCKET_EVENT_DISCONNECTED. My code calls esp_websocket_client_destroy() to destroy the client. This function evaluates client->run, finds it to be false and immediately frees the heap memory used by client context here.
Still while worker task is in delay, my code creates a new WS client which allocates heap, partially overlapping the heap area previously used by the now freed client context. Incidentally it occasionally initiates the memory address previously occupied by client->run with a non-zero value.
Again task running esp_websocket_client_task()
The vTaskDelay() in esp_websocket_client_task() returns. It's still working with a value of client which points to memory freed in step 7 and re-used in step 8.
Several bad things can happen at this point. In my case the next loop iteration often evaluates client->run as true and decides that it should continue looping - it then generates a memory access violation while trying to lock the non-existing semaphore at the start of the loop. Even if that loop didn't trigger, there's some actions taken at the end of esp_websocket_client_task() which operate on released memory.
PS: there's a bonus bug.
I tried to work around the problem. In the teardown code in step 7 I tried calling esp_websocket_client_stop() before calling esp_websocket_client_destroy(), because normally esp_websocket_client_stop() would wait for the worker task to signal its termination via event flag STOPPED_BIT. However, because client->run evaluates to false here it refuses to wait and returns an error :)
The text was updated successfully, but these errors were encountered:
github-actionsbot
changed the title
Destructor race in esp_websocket_client v1.1.0: context deleted under worker task before it exits
Destructor race in esp_websocket_client v1.1.0: context deleted under worker task before it exits (IDFGH-11360)
Nov 1, 2023
websocket client tries to destruct the same ssl connection from 2 threads at the same time, the thread that calls websocket_send (our main task) and the websocket client internal thread, when their read polling fails.
We have at least 100-400 such crashes per hour, possibly more because of undefined behaviour.
Answers checklist.
General issue report
I've been using esp_websocket_client 1.1.0 together with a GSM module (SIM7070G) connected via USB. Microcontroller is an ESP32S3 with 16 MB Flash and 8 MB SPIRAM, the PCB is a custom one with the GSM module on it. ESP IDF version 5.1.1.
There are rather frequent crashes (probability is around 10-20% in my project) when the Internet connection dies. Crashes occur because worker task
esp_websocket_client_task()
accesses deleted heap memory behind pointerclient
before exiting. So there is a race between this worker task exiting and its context being deleted with the same heap area re-used by another task. The sequence of events leading to a crash is following:disable_auto_reconnect
is set totrue
(otherwise the issue is masked). This malloc-s the client context from heap. It then connects to a WS server and happily exchanges some data.esp_websocket_client_task()
is in the process of reading data from the server usingesp_websocket_client_recv()
, which fails due to the communication blackout. It aborts the connection by callingesp_websocket_client_abort_connection()
here:esp_websocket_client_abort_connection()
does two things relevant to this issue. Firstly it sets the client state toWEBSOCKET_STATE_WAIT_TIMEOUT
, secondly it dispatches eventWEBSOCKET_EVENT_DISCONNECTED
to my code.Things now branch off into two different tasks, as my code is not allowed to kill the WS client in the same task where it received
WEBSOCKET_EVENT_DISCONNECTED
. My event handler posts a message to another thread to start the process of cleaning up and re-connecting the Websocket connection (remember, auto-reconnect is disabled). I'll describe what happens in the sequence of events that I observe from logs:Task running
esp_websocket_client_task()
esp_websocket_client_task()
next handles stateWEBSOCKET_STATE_WAIT_TIMEOUT
. It delays for half the network timeout period here, then stops the loop by clearingclient->run
in the next iteration.client->run
is set tofalse
. During this time window I'm handling the restarting of the WS connection in another task.Task running WS cleanup and re-connect
WEBSOCKET_EVENT_DISCONNECTED
. My code callsesp_websocket_client_destroy()
to destroy the client. This function evaluatesclient->run
, finds it to befalse
and immediately frees the heap memory used by client context here.client->run
with a non-zero value.Again task running
esp_websocket_client_task()
esp_websocket_client_task()
returns. It's still working with a value ofclient
which points to memory freed in step 7 and re-used in step 8.Several bad things can happen at this point. In my case the next loop iteration often evaluates
client->run
astrue
and decides that it should continue looping - it then generates a memory access violation while trying to lock the non-existing semaphore at the start of the loop. Even if that loop didn't trigger, there's some actions taken at the end ofesp_websocket_client_task()
which operate on released memory.PS: there's a bonus bug.
I tried to work around the problem. In the teardown code in step 7 I tried calling
esp_websocket_client_stop()
before callingesp_websocket_client_destroy()
, because normallyesp_websocket_client_stop()
would wait for the worker task to signal its termination via event flagSTOPPED_BIT
. However, becauseclient->run
evaluates to false here it refuses to wait and returns an error :)The text was updated successfully, but these errors were encountered: