
WebSocket may cause a failure while the semaphore lock in esp_websocket_client_task is in place and esp_websocket_client_send_text is used elsewhere (IDFGH-13387) #625

filzek opened this issue Aug 2, 2024 · 10 comments · May be fixed by #704


@filzek

filzek commented Aug 2, 2024

Answers checklist.

  • I have read the documentation for esp-protocols components and the issue is not addressed there.
  • I have updated my esp-protocols branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

The WebSocket client has a problematic lock mechanism that hinders concurrent use, blocking sends while data is being received. While receiving a stream of data packets and simultaneously sending data, the lock in esp_websocket_client_task at the line:

if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {

can take an excessive amount of time to release, sometimes exceeding 10 seconds. This delay can cause the send operation to fail, which is not aligned with resilient transmission practices.

It appears that the task loop runs continuously without any delay, so it re-acquires the lock immediately after releasing it. In a production environment there should be pauses between lock cycles, so that other tasks can run and user-side calls into the client can acquire the lock.
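A simplified sketch of the starvation pattern described here (illustrative only, not the actual library code; names follow the snippets quoted below in this thread):

#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"
#include "freertos/task.h"

/* Simplified illustration of the starvation pattern: the client task loops
 * tightly and re-takes its recursive mutex right after giving it, so a
 * sender task waiting on the same mutex can starve for many seconds. */
static void client_task_pattern(SemaphoreHandle_t lock, TickType_t lock_timeout)
{
    for (;;) {
        if (xSemaphoreTakeRecursive(lock, lock_timeout) != pdPASS) {
            break;  /* could not lock within the timeout */
        }
        /* ... poll the transport and dispatch RX events ... */
        xSemaphoreGiveRecursive(lock);
        /* No delay here: the loop re-takes the lock immediately, leaving
         * almost no window for esp_websocket_client_send_text() callers. */
    }
}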

@filzek
Author

filzek commented Aug 2, 2024

It has been confirmed that the WebSocket client task holds and re-takes its lock in a way that starves concurrent tasks: the lock is re-acquired by the task before other pending semaphore requests can be served.

To solve this issue, add a minor delay of 100ms in the task loop as follows:
while (client->run) {
    ...
    xSemaphoreGiveRecursive(client->lock);
    vTaskDelay(pdMS_TO_TICKS(100)); // 100 ms, tested and it worked well (vTaskDelay takes ticks, not ms)
}
This small delay allows other tasks to acquire the semaphore, ensuring proper concurrency and improving the overall performance of the WebSocket communication.
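For illustration, a hypothetical sender task like the sketch below can then acquire the internal lock during that idle window (names and timeouts are placeholders, not part of the actual change):

#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_websocket_client.h"

/* Hypothetical sender task: with the 100 ms delay in the client task loop,
 * this send can obtain the internal lock instead of timing out. */
static void sender_task(void *arg)
{
    esp_websocket_client_handle_t client = (esp_websocket_client_handle_t)arg;
    const char *msg = "hello";
    for (;;) {
        if (esp_websocket_client_is_connected(client)) {
            esp_websocket_client_send_text(client, msg, strlen(msg),
                                           pdMS_TO_TICKS(10000));
        }
        vTaskDelay(pdMS_TO_TICKS(500));
    }
}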

I will also prepare this change with a dedicated option in the websocket config so that users can adjust the delay as needed.

@euripedesrocha
Collaborator

Hi @filzek, thanks for reporting. To help my investigation of the issue, could you please describe the scenario in which you saw such a long time for the lock to be released?
If you could provide more details of your system, or ideally a reproducer, it would be very helpful.

@filzek
Author

filzek commented Aug 5, 2024

We cannot provide a working example since we cannot detach it from our application, but the scenario where this issue occurs is quite simple: the internal websocket task cycles so quickly that the semaphore is effectively only ever taken by the task itself.

We use a queue to send messages to the websocket. This queue handles a large number of packets, and when multiple packets are received from the server at the same time, the websocket task keeps holding this lock. We are not sure why the websocket behaves this way, but we added a deadlock detection mechanism to the code for debugging purposes:

if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
    ESP_LOGE(TAG, "Failed to lock ws-client tasks, exiting the task...");
    break;
}
printf("LOCK FROM esp_websocket_client_task - [%d]\n", __LINE__);

xSemaphoreGiveRecursive(client->lock);
printf("LOCK RELEASED FROM esp_websocket_client_task - [%d]\n", __LINE__);

This helped us identify the root cause. We found that using vTaskDelay solves the concurrency lock issue: with a 100 ms delay there is no lock-up, but with a shorter delay of 30 ms, or none at all, the lock-up occurs.

On our side, we have several places where we attempt to send messages to the websocket. We use our own semaphore to lock before calling esp_websocket_client_send_text. Despite having a timeout of more than 10 seconds, we still encounter timeouts while trying to send messages. We have split the process into several timeouts and retries without success.
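A rough sketch of the pattern we use on our side (illustrative names only, not our actual code):

#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"
#include "esp_websocket_client.h"
#include "esp_log.h"

static const char *TAG = "ws_tx";
static SemaphoreHandle_t s_tx_lock;  /* our own lock around send calls */

/* Illustrative wrapper: serialize our senders, then call the client API
 * with a long timeout and a few retries. Even so, the send still times
 * out while the client task holds its internal lock. */
static bool ws_send_with_retry(esp_websocket_client_handle_t client, const char *msg)
{
    if (xSemaphoreTake(s_tx_lock, pdMS_TO_TICKS(10000)) != pdTRUE) {
        return false;
    }
    bool ok = false;
    for (int attempt = 0; attempt < 3 && !ok; attempt++) {
        int sent = esp_websocket_client_send_text(client, msg, strlen(msg),
                                                  pdMS_TO_TICKS(10000));
        ok = (sent >= 0);
        if (!ok) {
            ESP_LOGW(TAG, "send attempt %d timed out", attempt + 1);
        }
    }
    xSemaphoreGive(s_tx_lock);
    return ok;
}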

We moved from simple retries to identifying where the issue could occur in the esp_websocket_client_send_text routine. After analyzing this function, we found that the lock was being acquired in the esp_websocket_client_send_with_exact_opcode function, which led us to the following code:

if (xSemaphoreTakeRecursive(client->lock, timeout) != pdPASS) {
    ESP_LOGE(TAG, "Could not lock ws-client within %" PRIu32 " timeout", timeout);
    return -1;
}

The problem was that the lock was being acquired inside the internal task loop. The task was running so fast that the lock was caught immediately after being released by the same routine.

The workaround we have found so far is to add a vTaskDelay, which breaks the back-to-back re-locking caused by the internal task.

@euripedesrocha
Collaborator

@filzek thanks for the clarification.
The first point is that this mutex exists to allow the user-facing API to be called from multiple tasks without users having to handle the locking themselves.
Have you tried adjusting the priority of the websocket client task, so that your task can acquire the mutex as soon as it is ready? From your description, I believe your extra locking is unnecessary if it is only used to serialize access to the websocket client.
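For example, the client task priority can be set through the client configuration (a minimal sketch, assuming the task_prio field of esp_websocket_client_config_t; the URI and priority values are placeholders):

#include "esp_websocket_client.h"

/* Minimal sketch: lower the client task priority relative to the sending
 * task. Values here are placeholders, not recommendations. */
static esp_websocket_client_handle_t start_client(void)
{
    esp_websocket_client_config_t cfg = {
        .uri = "ws://example.local:8080",  /* placeholder URI */
        .task_prio = 5,                    /* below the priority of the sender task */
    };
    esp_websocket_client_handle_t client = esp_websocket_client_init(&cfg);
    esp_websocket_client_start(client);
    return client;
}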

I'll still look into the code to find any concurrency problems in the design.

@ho-229

ho-229 commented Sep 8, 2024

Hi, I'm encountering the same issue. My goal is to stream two-way audio in real time, but in practice I almost have to receive all incoming data before I can send anything.

@filzek
Author

filzek commented Oct 14, 2024

@euripedesrocha

Both scenarios show the same issue: the semaphore gets stuck for much more than 10 seconds. Something is happening with the transaction inside the websocket, so it seems this is not just a concurrency lock, but some logic that holds the semaphore for a long time.

Our tasks set to priority 10, websocket task to 20: same issue.
Our tasks set to priority 20, websocket task to 10: same issue.

@bryghtlabs-richard
Contributor

I've seen this also when trying to send a small message while downloading ~100KB. In my case, the websocket RX thread sends messages to the main thread through a FreeRTOS message queue, but sometimes the main thread needs to write to the websocket itself. If the main thread tries esp_websocket_client_send_text(), it is blocked from doing so until the websocket RX thread releases the lock. In our system the main thread is responsible for processing messages from the websocket RX thread, so this causes a deadlock (at least until the queue times out and drops the message).
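A condensed sketch of this pattern (hypothetical names and sizes, not our production code):

#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "esp_event.h"
#include "esp_websocket_client.h"

static QueueHandle_t s_rx_queue;  /* created at startup with xQueueCreate(8, sizeof(int)) */

/* Runs in the websocket client (RX) task while the client's internal lock is
 * held; if the queue is full, blocking here keeps that lock held as well. */
static void ws_event_handler(void *arg, esp_event_base_t base, int32_t id, void *data)
{
    if (id == WEBSOCKET_EVENT_DATA) {
        esp_websocket_event_data_t *evt = (esp_websocket_event_data_t *)data;
        xQueueSend(s_rx_queue, &evt->data_len, pdMS_TO_TICKS(5000));
    }
}

/* Main thread: drains the queue, but sometimes must also send. The send
 * blocks on the same internal lock held by the RX task above, so nothing
 * progresses until one of the timeouts fires. */
static void main_loop(esp_websocket_client_handle_t client)
{
    int len;
    for (;;) {
        if (xQueueReceive(s_rx_queue, &len, portMAX_DELAY) == pdTRUE) {
            esp_websocket_client_send_text(client, "ack", 3, pdMS_TO_TICKS(10000));
        }
    }
}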

General topic: websocket RX thread should not block other threads from websocket TX for too long.

Idea1(1 mutex): websocket RX thread could release the client->lock during WEBSOCKET_EVENT_DATA callback processing. This would allow @filzek to add a vTaskDelay(100); into their handler, and my WEBSOCKET_EVENT_DATA handler writing to a FreeRTOS queue would block similarly only when the queue is full. Both would allow another thread to esp_websocket_client_send_text() in this period.

Idea2(multi-mutex): a mutex for writes, a mutex for reads, a mutex for state. If a reading or writing thread needs to update the websocket state, it also has to get the state-mutex.
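A rough sketch of what Idea 2 could look like (a design sketch only, not the actual client structure; all names are invented for illustration):

#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"

/* Design sketch for Idea 2: split the single recursive mutex into separate
 * locks so readers and writers only contend on shared client state. */
typedef struct {
    SemaphoreHandle_t tx_lock;     /* serializes writers at fragment boundaries */
    SemaphoreHandle_t rx_lock;     /* held by the client RX task while reading */
    SemaphoreHandle_t state_lock;  /* protects connection state and error info */
} ws_locks_t;

/* A reader or writer takes state_lock only for the short moments it reads
 * or updates shared state, instead of holding one big lock the whole time. */
static void update_shared_state(ws_locks_t *locks)
{
    xSemaphoreTake(locks->state_lock, portMAX_DELAY);
    /* ... update connection state / error flags ... */
    xSemaphoreGive(locks->state_lock);
}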

@bryghtlabs-richard
Contributor

A gross workaround: ensure any threads calling esp_websocket_client_send_text() are bound to the same core as the websocket client thread, and at a higher priority than the websocket client thread. This way the writer interrupts the client reader thread as soon as it unlocks the mutex.
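For example (a sketch only; the core, priority, and stack values are placeholders, and sender_task stands in for whatever task calls esp_websocket_client_send_text()):

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

/* Sketch of the workaround: pin the sending task to the same core as the
 * websocket client task, at a higher priority, so it preempts the reader as
 * soon as the mutex is released. */
#define WS_CLIENT_CORE  0   /* core the websocket client task is pinned to */
#define WS_CLIENT_PRIO  5   /* priority configured for the websocket client task */

extern void sender_task(void *arg);  /* task that calls esp_websocket_client_send_text() */

static void start_sender(void *client_handle)
{
    xTaskCreatePinnedToCore(sender_task, "ws_sender", 4096, client_handle,
                            WS_CLIENT_PRIO + 1, NULL, WS_CLIENT_CORE);
}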

bryghtlabs-richard added a commit to bryghtlabs-richard/esp-protocols that referenced this issue Nov 1, 2024
These should all fit in the TCP window size of 5744 bytes.
But we cannot fill the TCP window, because
the writer thread (main) is too often blocked on
the reader thread (the websocket client task).

Demonstrates issue espressif#625

Ex: in this case, the writer thread was blocked while
  the reader thread processed 6 packets back-to-back
  without allowing the writer thread to send to the websocket:

  I (18093) websocket_client: Sending 4 bytes took 5.990 seconds
  I (18193) websocket: Sending 0013

After this, packets 0013-0018 were sent together quickly:

  I (18093) websocket_client: Sending 4 bytes took 5.990 seconds
  I (18193) websocket: Sending 0013
  I (18193) websocket_client: Sending 4 bytes took 0.000 seconds
  I (18293) websocket: Sending 0014
  I (18293) websocket_client: Sending 4 bytes took 0.000 seconds
  I (18393) websocket: Sending 0015
  I (18393) websocket_client: Sending 4 bytes took 0.000 seconds
  I (18493) websocket: Sending 0016
  I (18493) websocket_client: Sending 4 bytes took 0.000 seconds
  I (18593) websocket: Sending 0017
  I (18593) websocket_client: Sending 4 bytes took 0.000 seconds
  W (18653) websocket: Received=0012
  I (18693) websocket: Sending 0018

But then reader-thread blocked writer-thread again:
  W (19653) websocket: Received=0013
  W (20653) websocket: Total payload length=4, data_len=4, current payload offset=0

  W (20653) websocket: Received=0014
  W (21653) websocket: Total payload length=4, data_len=4, current payload offset=0

  W (21653) websocket: Received=0015
  W (22653) websocket: Total payload length=4, data_len=4, current payload offset=0

  W (22653) websocket: Received=0016
  W (23653) websocket: Total payload length=4, data_len=4, current payload offset=0

  W (23653) websocket: Received=0017
  W (24653) websocket: Total payload length=4, data_len=4, current payload offset=0

  I (24653) websocket_client: Sending 4 bytes took 5.960 seconds
  I (24753) websocket: Sending 0019
@bryghtlabs-richard
Contributor

@euripedesrocha , I've extended the esp_websocket client example to demonstrate this problem: https://github.com/bryghtlabs-richard/esp-protocols/tree/issue625demonstrator

It's a little silly because it uses the echo server to trigger it, but I hope the general problem of mutex starvation between the reader and writer threads in esp_websocket_client is clear. In my linked example, all writer data should fit in the default TCP window, yet it is still blocked, often for many seconds, by reader-thread processing.

If the writer thread did not vTaskDelay(100 / portTICK_PERIOD_MS) between packets, it's also possible for writers to starve the reader thread.

@bryghtlabs-richard
Contributor

I checked in with the FreeRTOS forum; there's currently nothing like a fair mutex that we could use here to limit thread starvation.

bryghtlabs-richard added a commit to bryghtlabs-richard/esp-protocols that referenced this issue Dec 2, 2024
This resolves:

 1) Deadlock when trying to reserve a lock in WEBSOCKET_EVENT_DATA,
    but lock is held by a thread trying to send a websocket message.
 2) High latency caused by writers serialized with WEBSOCKET_EVENT_DATA
    while calling esp_websocket_client_send(), even when TCP window
    has enough space for the entire message being queued to send.

Multiple writers are still serialized at fragment boundaries, but
only with other writers and websocket error updates.

Fixes espressif#625