WebSocket may cause a failure while the semaphore lock in esp_websocket_client_task is in place and esp_websocket_client_send_text is used elsewhere (IDFGH-13387) #625
It has been confirmed that the WebSocket task routine takes a lock that is not released fairly under concurrent multithreaded use: the task re-acquires the lock before other pending semaphore requests can be served. To work around this, add a small 100 ms delay in the task loop, as sketched below. I will also be preparing an update that adds an option to the websocket config so the user can adjust the delay as needed.
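A minimal sketch of the idea, assuming the task loop has roughly this shape (names such as `client->run`, `client->lock`, and `lock_timeout` follow the component's conventions, but the exact code differs between versions of esp_websocket_client):

```c
/* Abbreviated sketch of esp_websocket_client_task(); only the added
 * vTaskDelay() at the end of the loop is the point of this workaround. */
while (client->run) {
    if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
        ESP_LOGE(TAG, "Failed to lock ws-client task");
        break;
    }
    /* ... receive and dispatch data while holding client->lock ... */
    xSemaphoreGiveRecursive(client->lock);

    /* Added: yield for 100 ms so other tasks blocked on client->lock
     * (e.g. callers of esp_websocket_client_send_text()) can acquire it. */
    vTaskDelay(pdMS_TO_TICKS(100));
}
```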
Hi @filzek, thanks for reporting. To help my investigation of the issue, could you please describe the scenario in which you saw the lock held for such a long time before being released?
We cannot provide a working example since we cannot detach it from the application, but the scenario where this issue occurs is quite simple. The internal websocket task cycles so quickly that the semaphore is effectively only ever taken by the task itself. We use a queue to send messages to the websocket. This queue handles a large number of packets, and when multiple packets arrive from the server at once, the websocket task creates this lock. We are not sure why the websocket behaves this way, but we used a deadlock detection mechanism in the code for debugging purposes:
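A minimal sketch of such a detection helper, assuming a hypothetical application-level mutex `app_ws_lock` and log tag (these names are illustrative, not taken from the application above):

```c
#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"
#include "esp_log.h"

static const char *TAG = "ws_lock_debug";   /* hypothetical log tag */
static SemaphoreHandle_t app_ws_lock;       /* hypothetical app-level mutex */

/* Take the lock with a bounded timeout and log loudly when it cannot be
 * obtained; a long-held lock (or deadlock) shows up as these log lines. */
static bool app_ws_lock_take(TickType_t timeout_ticks)
{
    if (xSemaphoreTake(app_ws_lock, timeout_ticks) != pdTRUE) {
        ESP_LOGE(TAG, "Possible deadlock: lock not obtained within %u ms",
                 (unsigned)(timeout_ticks * portTICK_PERIOD_MS));
        return false;
    }
    return true;
}
```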
This helped us identify the root cause. We found that adding a vTaskDelay resolves the concurrency lock: with a 100 ms delay there is no lock, but with a shorter delay of 30 ms, or none at all, the lock occurs. On our side, we have several places that attempt to send messages to the websocket. We use our own semaphore to lock before calling esp_websocket_client_send_text. Despite using a timeout of more than 10 seconds, we still hit timeouts while trying to send messages. We split the process into several timeouts and retries without success, then moved from simple retries to identifying where the issue could occur inside the esp_websocket_client_send_text routine. After analyzing that function, we found that the lock was being acquired in esp_websocket_client_send_with_exact_opcode, which led us to the following code:
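Roughly, the lock acquisition in that send path looks like the following (paraphrased; the exact code depends on the component version):

```c
/* Paraphrased from esp_websocket_client_send_with_exact_opcode(): the send
 * path and the internal client task contend for the same recursive mutex. */
if (xSemaphoreTakeRecursive(client->lock, timeout) != pdPASS) {
    ESP_LOGE(TAG, "Could not lock ws-client within the given timeout");
    return -1;
}
/* ... send the frame, then xSemaphoreGiveRecursive(client->lock) ... */
```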
The problem is that the same lock is also acquired inside the internal task loop. The task runs so fast that the lock is re-taken immediately after the same routine releases it. The only solution we have found so far is to add a vTaskDelay to break the back-to-back re-acquisition by the internal task.
@filzek thanks for the clarification. I'll still look into the code to find any concurrency problems in the design.
Hi, I'm encountering the same issue. My goal is two-way real-time audio streaming, but in practice I almost have to receive all incoming data before I can send anything.
Both scenarios show the same issue: the semaphore gets stuck for much more than 10 seconds. Something is happening with the transaction inside the websocket, so it seems this is not just a concurrency lock but some logic that holds the semaphore for a long time. Tested with tasks set at low priority (10) and with tasks set at priority 20.
I've seen this also when trying to send a small message while downloading ~100 KB. In my case, the websocket RX thread sends messages to the main thread through a FreeRTOS message queue, but sometimes the main thread needs to write to the websocket itself. If the main thread tries to send while the RX thread holds the client lock, it can be blocked for a long time.

General topic: the websocket RX thread should not block other threads from websocket TX for too long.

Idea 1 (single mutex): the websocket RX thread could release client->lock during WEBSOCKET_EVENT_DATA callback processing. This would allow @filzek's sending code to take the lock while the callback runs.

Idea 2 (multi-mutex): a mutex for writes, a mutex for reads, and a mutex for state. If a reading or writing thread needs to update the websocket state, it also has to take the state mutex. A sketch of this idea follows below.
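A minimal sketch of the multi-mutex idea, with hypothetical lock names and structure (not the component's actual fields):

```c
#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"

/* Hypothetical lock layout for Idea 2; illustrative only. */
typedef struct {
    SemaphoreHandle_t tx_lock;     /* serializes writers at fragment boundaries */
    SemaphoreHandle_t rx_lock;     /* held by the client task while reading     */
    SemaphoreHandle_t state_lock;  /* guards shared connection/error state      */
} ws_locks_t;

/* A writer only needs tx_lock (plus state_lock for brief state updates),
 * so it is no longer serialized behind WEBSOCKET_EVENT_DATA processing. */
static int ws_send_fragment(ws_locks_t *locks, TickType_t timeout)
{
    if (xSemaphoreTake(locks->tx_lock, timeout) != pdTRUE) {
        return -1;                 /* timed out waiting on other writers */
    }
    /* ... write one fragment to the transport ... */
    xSemaphoreTake(locks->state_lock, portMAX_DELAY);
    /* ... update shared state (e.g. error flags) if needed ... */
    xSemaphoreGive(locks->state_lock);
    xSemaphoreGive(locks->tx_lock);
    return 0;
}
```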
A gross workaround: ensure any threads calling esp_websocket_client_send_text() run at a higher priority than the websocket client task, so they are scheduled as soon as the lock is released.
These should all fit in the TCP window size of 5744 bytes. But we cannot fill the TCP window, because the writer thread (main) is too often blocked on the reader thread (websocket client thread). Demonstrates issue espressif#625.

Example: in this case, the writer thread was blocked while the reader thread processed 6 packets back-to-back without allowing the writer thread to send to the websocket:

```
I (18093) websocket_client: Sending 4 bytes took 5.990 seconds
I (18193) websocket: Sending 0013
```

After this, packets 0013-0018 were sent together quickly:

```
I (18093) websocket_client: Sending 4 bytes took 5.990 seconds
I (18193) websocket: Sending 0013
I (18193) websocket_client: Sending 4 bytes took 0.000 seconds
I (18293) websocket: Sending 0014
I (18293) websocket_client: Sending 4 bytes took 0.000 seconds
I (18393) websocket: Sending 0015
I (18393) websocket_client: Sending 4 bytes took 0.000 seconds
I (18493) websocket: Sending 0016
I (18493) websocket_client: Sending 4 bytes took 0.000 seconds
I (18593) websocket: Sending 0017
I (18593) websocket_client: Sending 4 bytes took 0.000 seconds
W (18653) websocket: Received=0012
I (18693) websocket: Sending 0018
```

But then the reader thread blocked the writer thread again:

```
W (19653) websocket: Received=0013
W (20653) websocket: Total payload length=4, data_len=4, current payload offset=0
W (20653) websocket: Received=0014
W (21653) websocket: Total payload length=4, data_len=4, current payload offset=0
W (21653) websocket: Received=0015
W (22653) websocket: Total payload length=4, data_len=4, current payload offset=0
W (22653) websocket: Received=0016
W (23653) websocket: Total payload length=4, data_len=4, current payload offset=0
W (23653) websocket: Received=0017
W (24653) websocket: Total payload length=4, data_len=4, current payload offset=0
I (24653) websocket_client: Sending 4 bytes took 5.960 seconds
I (24753) websocket: Sending 0019
```
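For reference, the "Sending N bytes took X seconds" lines above can be produced by timing the send call; a minimal sketch with illustrative names:

```c
#include "freertos/FreeRTOS.h"
#include "esp_timer.h"
#include "esp_log.h"
#include "esp_websocket_client.h"

/* Time how long a single send blocks on the client lock and the TCP stack;
 * this is the style of the "Sending N bytes took X seconds" lines above. */
static void timed_send(esp_websocket_client_handle_t client,
                       const char *data, int len)
{
    int64_t start_us = esp_timer_get_time();
    esp_websocket_client_send_text(client, data, len, portMAX_DELAY);
    int64_t elapsed_us = esp_timer_get_time() - start_us;
    ESP_LOGI("websocket_client", "Sending %d bytes took %.3f seconds",
             len, elapsed_us / 1000000.0);
}
```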
@euripedesrocha, I've extended the esp_websocket client example to demonstrate this problem: https://github.com/bryghtlabs-richard/esp-protocols/tree/issue625demonstrator It's a little silly because it uses the echo server to trigger it, but I hope the general problem of mutex starvation in esp_websocket_client between reader and writer threads is clear. In my linked example all writer data should fit in the default TCP window, yet it is still blocked, often for many seconds, by reader-thread processing. If the writer thread did not have to contend with the reader thread for client->lock, each of these sends would complete almost immediately.
I checked in with the FreeRTOS forum; there's currently nothing like a fair mutex that we could use here to limit thread starvation.
This resolves:

1) Deadlock when trying to reserve a lock in WEBSOCKET_EVENT_DATA while the lock is held by a thread trying to send a websocket message.
2) High latency caused by writers being serialized with WEBSOCKET_EVENT_DATA while calling esp_websocket_client_send(), even when the TCP window has enough space for the entire message being queued to send.

Multiple writers are still serialized at fragment boundaries, but only with other writers and websocket error updates.

Fixes espressif#625
General issue report
The WebSocket client has a problematic lock mechanism that hinders concurrent usage and the dispatching of received and sent data. While receiving data packets in a stream and simultaneously sending data, the lock in esp_websocket_client_task at the line:
if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
can take an excessive amount of time to be released, sometimes exceeding 10 seconds. This delay can cause the send operation to fail, which is not compatible with resilient transmission.
It appears that the task runs in a tight loop with no delay, so it executes continuously without yielding. In a production environment there should be pauses within the semaphore lock cycle to allow proper multitasking and to ensure that user-side send operations can proceed.
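For reference, a minimal sketch of the writer-side call that hits this timeout, assuming an already connected client handle (names and the 10-second timeout mirror the description above and are illustrative):

```c
#include "freertos/FreeRTOS.h"
#include "esp_log.h"
#include "esp_websocket_client.h"

/* Writer-side send that can fail while esp_websocket_client_task holds
 * client->lock; the 10 s timeout mirrors the report above. */
static void send_with_timeout(esp_websocket_client_handle_t client,
                              const char *payload, int len)
{
    int sent = esp_websocket_client_send_text(client, payload, len,
                                              pdMS_TO_TICKS(10000));
    if (sent < 0) {
        ESP_LOGE("app", "send_text failed: lock not released within 10 s");
    }
}
```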