
WebSocket may cause a failure while the semaphore lock in esp_websocket_client_task is in place and esp_websocket_client_send_text is used elsewhere (IDFGH-13387) #625

filzek opened this issue Aug 2, 2024 · 10 comments · May be fixed by #704


@filzek

filzek commented Aug 2, 2024

Answers checklist.

  • I have read the documentation for esp-protocols components and the issue is not addressed there.
  • I have updated my esp-protocols branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

The WebSocket client has a problematic lock mechanism that hinders concurrent use, blocking sends while data is being received. While receiving a stream of data packets and simultaneously sending data, the lock in esp_websocket_client_task at the line:

if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {

can take an excessive amount of time to release, sometimes exceeding 10 seconds. This delay can cause the send operation to fail, which is not aligned with resilient transmission practices.

It appears that the task loop runs continuously without any delay, so it re-acquires the lock immediately after releasing it. In a production environment there should be pauses between lock cycles, so that other tasks can run and user-side calls into the client can acquire the lock.
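A simplified sketch of the starvation pattern described here (illustrative only, not the actual library code; names follow the snippets quoted below in this thread):

#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"
#include "freertos/task.h"

/* Simplified illustration of the starvation pattern: the client task loops
 * tightly and re-takes its recursive mutex right after giving it, so a
 * sender task waiting on the same mutex can starve for many seconds. */
static void client_task_pattern(SemaphoreHandle_t lock, TickType_t lock_timeout)
{
    for (;;) {
        if (xSemaphoreTakeRecursive(lock, lock_timeout) != pdPASS) {
            break;  /* could not lock within the timeout */
        }
        /* ... poll the transport and dispatch RX events ... */
        xSemaphoreGiveRecursive(lock);
        /* No delay here: the loop re-takes the lock immediately, leaving
         * almost no window for esp_websocket_client_send_text() callers. */
    }
}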

@filzek
Author

filzek commented Aug 2, 2024

It has been confirmed that the WebSocket client task holds and re-takes its lock in a way that starves concurrent tasks: the lock is re-acquired by the task before other pending semaphore requests can be served.

To solve this issue, add a minor delay of 100ms in the task loop as follows:
while (client->run) {
    ...
    xSemaphoreGiveRecursive(client->lock);
    vTaskDelay(pdMS_TO_TICKS(100)); // 100 ms, tested and it worked well (vTaskDelay takes ticks, not ms)
}
This small delay allows other tasks to acquire the semaphore, ensuring proper concurrency and improving the overall performance of the WebSocket communication.
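For illustration, a hypothetical sender task like the sketch below can then acquire the internal lock during that idle window (names and timeouts are placeholders, not part of the actual change):

#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_websocket_client.h"

/* Hypothetical sender task: with the 100 ms delay in the client task loop,
 * this send can obtain the internal lock instead of timing out. */
static void sender_task(void *arg)
{
    esp_websocket_client_handle_t client = (esp_websocket_client_handle_t)arg;
    const char *msg = "hello";
    for (;;) {
        if (esp_websocket_client_is_connected(client)) {
            esp_websocket_client_send_text(client, msg, strlen(msg),
                                           pdMS_TO_TICKS(10000));
        }
        vTaskDelay(pdMS_TO_TICKS(500));
    }
}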

I will also prepare this change with a dedicated option in the websocket config so that users can adjust the delay as needed.

@euripedesrocha
Collaborator

Hi @filzek, thanks for reporting. To help my investigation of the issue, could you please describe the scenario in which you saw such a long time for the lock to be released?
If you could provide more details of your system, or ideally a reproducer, it would be very helpful.

@filzek
Author

filzek commented Aug 5, 2024

We cannot provide a working example since we cannot detach it from our application, but the scenario where this issue occurs is quite simple: the internal websocket task cycles so quickly that the semaphore is effectively only ever taken by the task itself.

We use a queue to send messages to the websocket. This queue handles a large number of packets, and when multiple packets are received from the server at the same time, the websocket task keeps holding this lock. We are not sure why the websocket behaves this way, but we added a deadlock detection mechanism to the code for debugging purposes:

if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
    ESP_LOGE(TAG, "Failed to lock ws-client tasks, exiting the task...");
    break;
}
printf("LOCK FROM esp_websocket_client_task - [%d]\n", __LINE__);

xSemaphoreGiveRecursive(client->lock);
printf("LOCK RELEASED FROM esp_websocket_client_task - [%d]\n", __LINE__);

This helped us identify the root cause. We found that using vTaskDelay solves the concurrency lock issue: with a 100 ms delay there is no lock-up, but with a shorter delay of 30 ms, or none at all, the lock-up occurs.

On our side, we have several places where we attempt to send messages to the websocket. We use our own semaphore to lock before calling esp_websocket_client_send_text. Despite having a timeout of more than 10 seconds, we still encounter timeouts while trying to send messages. We have split the process into several timeouts and retries without success.
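A rough sketch of the pattern we use on our side (illustrative names only, not our actual code):

#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"
#include "esp_websocket_client.h"
#include "esp_log.h"

static const char *TAG = "ws_tx";
static SemaphoreHandle_t s_tx_lock;  /* our own lock around send calls */

/* Illustrative wrapper: serialize our senders, then call the client API
 * with a long timeout and a few retries. Even so, the send still times
 * out while the client task holds its internal lock. */
static bool ws_send_with_retry(esp_websocket_client_handle_t client, const char *msg)
{
    if (xSemaphoreTake(s_tx_lock, pdMS_TO_TICKS(10000)) != pdTRUE) {
        return false;
    }
    bool ok = false;
    for (int attempt = 0; attempt < 3 && !ok; attempt++) {
        int sent = esp_websocket_client_send_text(client, msg, strlen(msg),
                                                  pdMS_TO_TICKS(10000));
        ok = (sent >= 0);
        if (!ok) {
            ESP_LOGW(TAG, "send attempt %d timed out", attempt + 1);
        }
    }
    xSemaphoreGive(s_tx_lock);
    return ok;
}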

We moved from simple retries to identifying where the issue could occur in the esp_websocket_client_send_text routine. After analyzing this function, we found that the lock was being acquired in the esp_websocket_client_send_with_exact_opcode function, which led us to the following code:

if (xSemaphoreTakeRecursive(client->lock, timeout) != pdPASS) {
    ESP_LOGE(TAG, "Could not lock ws-client within %" PRIu32 " timeout", timeout);
    return -1;
}

The problem was that the lock was being acquired inside the internal task loop. The task was running so fast that the lock was caught immediately after being released by the same routine.

The workaround we have found so far is to add a vTaskDelay, which breaks the back-to-back re-locking caused by the internal task.

@euripedesrocha
Collaborator

@filzek thanks for the clarification.
The first point is that this mutex exists to allow the user-facing API to be called from multiple tasks without users having to handle the locking themselves.
Have you tried adjusting the priority of the websocket client task, so that your task can acquire the mutex as soon as it is ready? From your description, I believe your extra locking is unnecessary if it is only used to serialize access to the websocket client.
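For example, the client task priority can be set through the client configuration (a minimal sketch, assuming the task_prio field of esp_websocket_client_config_t; the URI and priority values are placeholders):

#include "esp_websocket_client.h"

/* Minimal sketch: lower the client task priority relative to the sending
 * task. Values here are placeholders, not recommendations. */
static esp_websocket_client_handle_t start_client(void)
{
    esp_websocket_client_config_t cfg = {
        .uri = "ws://example.local:8080",  /* placeholder URI */
        .task_prio = 5,                    /* below the priority of the sender task */
    };
    esp_websocket_client_handle_t client = esp_websocket_client_init(&cfg);
    esp_websocket_client_start(client);
    return client;
}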

I'll still look into the code to find any concurrency problems in the design.

@ho-229

ho-229 commented Sep 8, 2024

Hi, I'm encountering the same issue. My goal is to stream two-way audio in real time, but in practice I almost have to receive all incoming data before I can send anything.

@filzek
Author

filzek commented Oct 14, 2024

@euripedesrocha

Both scenarios show the same issue: the semaphore gets stuck for much more than 10 seconds. Something is happening with the transaction inside the websocket, so it seems this is not just a concurrency lock, but some logic that holds the semaphore for a long time.

Our tasks set to priority 10, websocket task to 20: same issue.
Our tasks set to priority 20, websocket task to 10: same issue.

@bryghtlabs-richard
Contributor

I've seen this also when trying to send a small message while downloading ~100KB. In my case, the websocket RX thread sends messages to the main thread through a FreeRTOS message queue, but sometimes the main thread needs to write to the websocket itself. If the main thread tries esp_websocket_client_send_text(), it is blocked from doing so until the websocket RX thread releases the lock. In our system the main thread is responsible for processing messages from the websocket RX thread, so this causes a deadlock (at least until the queue times out and drops the message).
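A condensed sketch of this pattern (hypothetical names and sizes, not our production code):

#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "esp_event.h"
#include "esp_websocket_client.h"

static QueueHandle_t s_rx_queue;  /* created at startup with xQueueCreate(8, sizeof(int)) */

/* Runs in the websocket client (RX) task while the client's internal lock is
 * held; if the queue is full, blocking here keeps that lock held as well. */
static void ws_event_handler(void *arg, esp_event_base_t base, int32_t id, void *data)
{
    if (id == WEBSOCKET_EVENT_DATA) {
        esp_websocket_event_data_t *evt = (esp_websocket_event_data_t *)data;
        xQueueSend(s_rx_queue, &evt->data_len, pdMS_TO_TICKS(5000));
    }
}

/* Main thread: drains the queue, but sometimes must also send. The send
 * blocks on the same internal lock held by the RX task above, so nothing
 * progresses until one of the timeouts fires. */
static void main_loop(esp_websocket_client_handle_t client)
{
    int len;
    for (;;) {
        if (xQueueReceive(s_rx_queue, &len, portMAX_DELAY) == pdTRUE) {
            esp_websocket_client_send_text(client, "ack", 3, pdMS_TO_TICKS(10000));
        }
    }
}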

General topic: websocket RX thread should not block other threads from websocket TX for too long.

Idea1(1 mutex): websocket RX thread could release the client->lock during WEBSOCKET_EVENT_DATA callback processing. This would allow @filzek to add a vTaskDelay(100); into their handler, and my WEBSOCKET_EVENT_DATA handler writing to a FreeRTOS queue would block similarly only when the queue is full. Both would allow another thread to esp_websocket_client_send_text() in this period.

Idea2(multi-mutex): a mutex for writes, a mutex for reads, a mutex for state. If a reading or writing thread needs to update the websocket state, it also has to get the state-mutex.
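A rough sketch of what Idea 2 could look like (a design sketch only, not the actual client structure; all names are invented for illustration):

#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"

/* Design sketch for Idea 2: split the single recursive mutex into separate
 * locks so readers and writers only contend on shared client state. */
typedef struct {
    SemaphoreHandle_t tx_lock;     /* serializes writers at fragment boundaries */
    SemaphoreHandle_t rx_lock;     /* held by the client RX task while reading */
    SemaphoreHandle_t state_lock;  /* protects connection state and error info */
} ws_locks_t;

/* A reader or writer takes state_lock only for the short moments it reads
 * or updates shared state, instead of holding one big lock the whole time. */
static void update_shared_state(ws_locks_t *locks)
{
    xSemaphoreTake(locks->state_lock, portMAX_DELAY);
    /* ... update connection state / error flags ... */
    xSemaphoreGive(locks->state_lock);
}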

@bryghtlabs-richard
Contributor

A gross workaround: ensure any threads calling esp_websocket_client_send_text() are bound to the same core as the websocket client thread, and at a higher priority than the websocket client thread. This way the writer interrupts the client reader thread as soon as it unlocks the mutex.
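For example (a sketch only; the core, priority, and stack values are placeholders, and sender_task stands in for whatever task calls esp_websocket_client_send_text()):

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

/* Sketch of the workaround: pin the sending task to the same core as the
 * websocket client task, at a higher priority, so it preempts the reader as
 * soon as the mutex is released. */
#define WS_CLIENT_CORE  0   /* core the websocket client task is pinned to */
#define WS_CLIENT_PRIO  5   /* priority configured for the websocket client task */

extern void sender_task(void *arg);  /* task that calls esp_websocket_client_send_text() */

static void start_sender(void *client_handle)
{
    xTaskCreatePinnedToCore(sender_task, "ws_sender", 4096, client_handle,
                            WS_CLIENT_PRIO + 1, NULL, WS_CLIENT_CORE);
}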

bryghtlabs-richard added a commit to bryghtlabs-richard/esp-protocols that referenced this issue Nov 1, 2024
These should all fit in the TCP window size of 5744 bytes.
But we cannot fill the TCP window, because
the writer thread (main) is too often blocked on
the reader thread (the websocket client task).

Demonstrates issue espressif#625

Ex: in this case, the writer thread was blocked while
  the reader thread processed 6 packets back-to-back
  without allowing the writer thread to send to the websocket:

  I (18093) websocket_client: Sending 4 bytes took 5.990 seconds
  I (18193) websocket: Sending 0013

After this, packets 0013-0018 were sent together quickly:

  I (18093) websocket_client: Sending 4 bytes took 5.990 seconds
  I (18193) websocket: Sending 0013
  I (18193) websocket_client: Sending 4 bytes took 0.000 seconds
  I (18293) websocket: Sending 0014
  I (18293) websocket_client: Sending 4 bytes took 0.000 seconds
  I (18393) websocket: Sending 0015
  I (18393) websocket_client: Sending 4 bytes took 0.000 seconds
  I (18493) websocket: Sending 0016
  I (18493) websocket_client: Sending 4 bytes took 0.000 seconds
  I (18593) websocket: Sending 0017
  I (18593) websocket_client: Sending 4 bytes took 0.000 seconds
  W (18653) websocket: Received=0012
  I (18693) websocket: Sending 0018

But then reader-thread blocked writer-thread again:
  W (19653) websocket: Received=0013
  W (20653) websocket: Total payload length=4, data_len=4, current payload offset=0

  W (20653) websocket: Received=0014
  W (21653) websocket: Total payload length=4, data_len=4, current payload offset=0

  W (21653) websocket: Received=0015
  W (22653) websocket: Total payload length=4, data_len=4, current payload offset=0

  W (22653) websocket: Received=0016
  W (23653) websocket: Total payload length=4, data_len=4, current payload offset=0

  W (23653) websocket: Received=0017
  W (24653) websocket: Total payload length=4, data_len=4, current payload offset=0

  I (24653) websocket_client: Sending 4 bytes took 5.960 seconds
  I (24753) websocket: Sending 0019
@bryghtlabs-richard
Contributor

@euripedesrocha , I've extended the esp_websocket client example to demonstrate this problem: https://github.com/bryghtlabs-richard/esp-protocols/tree/issue625demonstrator

It's a little silly because it uses the echo server to trigger it, but I hope the general problem of mutex starvation between the reader and writer threads in esp_websocket_client is clear. In my linked example, all writer data should fit in the default TCP window, yet it is still blocked, often for many seconds, by reader-thread processing.

If the writer thread did not vTaskDelay(100 / portTICK_PERIOD_MS) between packets, it's also possible for writers to starve the reader thread.

@bryghtlabs-richard
Contributor

I checked in with the FreeRTOS forum; there's currently nothing like a fair mutex that we could use here to limit thread starvation.

bryghtlabs-richard added a commit to bryghtlabs-richard/esp-protocols that referenced this issue Dec 2, 2024
This resolves:

 1) Deadlock when trying to reserve a lock in WEBSOCKET_EVENT_DATA,
    but lock is held by a thread trying to send a websocket message.
 2) High latency caused by writers serialized with WEBSOCKET_EVENT_DATA
    while calling esp_websocket_client_send(), even when TCP window
    has enough space for the entire message being queued to send.

Multiple writers are still serialized at fragment boundaries, but
only with other writers and websocket error updates.

Fixes espressif#625