Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Destructor race in esp_websocket_client v1.1.0: context deleted under worker task before it exits (IDFGH-11360) #412

Closed
3 tasks done
DaStoned opened this issue Nov 1, 2023 · 2 comments · Fixed by #626

Comments

@DaStoned
Copy link

DaStoned commented Nov 1, 2023

Answers checklist.

  • I have read the documentation for esp-protocols components and the issue is not addressed there.
  • I have updated my esp-protocols branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

I've been using esp_websocket_client 1.1.0 together with a GSM module (SIM7070G) connected via USB. Microcontroller is an ESP32S3 with 16 MB Flash and 8 MB SPIRAM, the PCB is a custom one with the GSM module on it. ESP IDF version 5.1.1.

There are rather frequent crashes (probability is around 10-20% in my project) when the Internet connection dies. Crashes occur because worker task esp_websocket_client_task() accesses deleted heap memory behind pointer client before exiting. So there is a race between this worker task exiting and its context being deleted with the same heap area re-used by another task. The sequence of events leading to a crash is following:

  1. My code creates a new WS client where disable_auto_reconnect is set to true (otherwise the issue is masked). This malloc-s the client context from heap. It then connects to a WS server and happily exchanges some data.
  2. At a random moment the GSM module stops communicating data (they have a nasty habit of occasionally doing that).
  3. The WS worker task esp_websocket_client_task() is in the process of reading data from the server using esp_websocket_client_recv(), which fails due to the communication blackout. It aborts the connection by calling esp_websocket_client_abort_connection() here:
  4. esp_websocket_client_abort_connection() does two things relevant to this issue. Firstly it sets the client state to WEBSOCKET_STATE_WAIT_TIMEOUT, secondly it dispatches event WEBSOCKET_EVENT_DISCONNECTED to my code.

Things now branch off into two different tasks, as my code is not allowed to kill the WS client in the same task where it received WEBSOCKET_EVENT_DISCONNECTED. My event handler posts a message to another thread to start the process of cleaning up and re-connecting the Websocket connection (remember, auto-reconnect is disabled). I'll describe what happens in the sequence of events that I observe from logs:

Task running esp_websocket_client_task()

  1. The loop in esp_websocket_client_task() next handles state WEBSOCKET_STATE_WAIT_TIMEOUT. It delays for half the network timeout period here, then stops the loop by clearing client->run in the next iteration.
  2. The race is triggered when worker task delays again for another half network timeout period while client->run is set to false. During this time window I'm handling the restarting of the WS connection in another task.

Task running WS cleanup and re-connect

  1. While worker task is in delay (step 6), my task has started processing WEBSOCKET_EVENT_DISCONNECTED. My code calls esp_websocket_client_destroy() to destroy the client. This function evaluates client->run, finds it to be false and immediately frees the heap memory used by client context here.
  2. Still while worker task is in delay, my code creates a new WS client which allocates heap, partially overlapping the heap area previously used by the now freed client context. Incidentally it occasionally initiates the memory address previously occupied by client->run with a non-zero value.

Again task running esp_websocket_client_task()

  1. The vTaskDelay() in esp_websocket_client_task() returns. It's still working with a value of client which points to memory freed in step 7 and re-used in step 8.
    Several bad things can happen at this point. In my case the next loop iteration often evaluates client->run as true and decides that it should continue looping - it then generates a memory access violation while trying to lock the non-existing semaphore at the start of the loop. Even if that loop didn't trigger, there's some actions taken at the end of esp_websocket_client_task() which operate on released memory.

PS: there's a bonus bug.

I tried to work around the problem. In the teardown code in step 7 I tried calling esp_websocket_client_stop() before calling esp_websocket_client_destroy(), because normally esp_websocket_client_stop() would wait for the worker task to signal its termination via event flag STOPPED_BIT. However, because client->run evaluates to false here it refuses to wait and returns an error :)

@github-actions github-actions bot changed the title Destructor race in esp_websocket_client v1.1.0: context deleted under worker task before it exits Destructor race in esp_websocket_client v1.1.0: context deleted under worker task before it exits (IDFGH-11360) Nov 1, 2023
@0xFEEDC0DE64
Copy link
Contributor

websocket client tries to destruct the same ssl connection from 2 threads at the same time, the thread that calls websocket_send (our main task) and the websocket client internal thread, when their read polling fails.

We have at least 100-400 such crashes per hour, possibly more because of undefined behaviour.
image

@DaStoned
Copy link
Author

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants