Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using Wi-Fi to publish MQTT data, transport timeout, then stuck. (IDFGH-13901) #14743

Open
3 tasks done
Sky-Soo-Ap opened this issue Oct 17, 2024 · 26 comments
Open
3 tasks done
Assignees
Labels
Status: Opened Issue is new Type: Bug bugs in IDF

Comments

@Sky-Soo-Ap
Copy link

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.3.1

Espressif SoC revision.

ESP32-D0WDR2-V3-V3 (revision v3.1)

Operating System used.

Windows

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

None

Development Kit.

Custom Board

Power Supply used.

USB

What is the expected behavior?

I expected Wi-Fi or MQTT disconnected would not be stuck here.
It will keep re-connecting Wi-Fi or MQTT;

What is the actual behavior?

Using Wi-Fi to publish MQTT data, transport timeout, then stuck.

Steps to reproduce.

  1. Wi-Fi & BLE coexist;
  2. ESP32 keep receiving BLE notificaiton data;
  3. Using ESP32 Wi-Fi publish notification data to MQTT;
    ...

Debug Logs.

I (11:44:10.037) azp_mqtt: iometer/agent/devices/notification/bb:a0:50:78:97:da, drop packet count:0
I (11:44:10.248) azp_mqtt: iometer/agent/devices/notification/bb:a0:50:78:97:da, drop packet count:0
I (11:44:10.428) azp_mqtt: iometer/agent/devices/notification/bb:a0:50:78:97:da, drop packet count:0
W (11:44:20.670) transport_base: Poll timeout or error, errno=Success, fd=48, timeout_ms=10000
E (11:44:20.672) mqtt_client: Writing didn't complete in specified timeout: errno=0
E (11:44:20.680) azp_mqtt: MQTT_EVENT_DISCONNECTED

More Information.

No response

@Sky-Soo-Ap Sky-Soo-Ap added the Type: Bug bugs in IDF label Oct 17, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Oct 17, 2024
@github-actions github-actions bot changed the title Using Wi-Fi to publish MQTT data, transport timeout, then stuck. Using Wi-Fi to publish MQTT data, transport timeout, then stuck. (IDFGH-13901) Oct 17, 2024
@Sky-Soo-Ap
Copy link
Author

Hello,
Let me add some information. When verbose-level log is turned on, there are no more logs, only the same three lines of logs.
W (15:45:45.560) transport_base: Poll timeout or error, errno=Success, fd=48, timeout_ms=10000
E (15:45:45.560) mqtt_client: Writing didn't complete in specified timeout: errno=0
E (15:45:45.569) azp_mqtt: MQTT_EVENT_DISCONNECTED

Anyway, attach the verbose-level log.
2024-10-21_Verbose_Level.log

According to the verbose level log, when stuck, BLE still seems to be receiving data normally.
Is it because MQTT has not reconnected or is it stuck on Wi-Fi?
I have done Wi-Fi reconnection in the app, and if the Wi-Fi disconnected.
MQTT will automatically reconnect when disconnected, right?

This problem is very serious and requires a manual reboot to return to normal.

@euripedesrocha
Copy link
Collaborator

Hi @Sky-Soo-Ap, the MQTT client will try to reconnect, unless you set it not to.
To help me investigate the issue, could you share the configuration you have for the mqtt client?
It looks like you don't have the mqtt client in verbose level log, I'm missing a few messages that would appear in debug level.

@Sky-Soo-Ap
Copy link
Author

Hi Euripedesrocha,
I’m attaching mqtt client settings code and sdkconfig files.
MQTT_settings.txt
sdkconfig.txt

How to set mqtt client to verbose level log?
Is it using the API esp_log_level_set("*", ESP_LOG_VERBOSE) to set it?

@euripedesrocha
Copy link
Collaborator

Hi, about the logs from the code sent, you should use lower case tags to set the logs.
Also, that memcpy for the username and password are unnecessary, just pass the pointers and the client will copy. At first glance, it also looks incorrect. Aren't you writing over the struct data?

@Sky-Soo-Ap
Copy link
Author

Hi, about the logs from the code sent, migrated from old esp-idf v5.0.x.
I will change it to lower case as following and trying to get the log.
esp_log_level_set("*", ESP_LOG_INFO); // Is this line needed to change to ESP_LOG_VERBOSE?
 esp_log_level_set("mqtt_client", ESP_LOG_VERBOSE);
 esp_log_level_set("mqtt_example", ESP_LOG_VERBOSE);
 esp_log_level_set("transport_base", ESP_LOG_VERBOSE);
 esp_log_level_set("esp-tls", ESP_LOG_VERBOSE);
 esp_log_level_set("transport", ESP_LOG_VERBOSE);
 esp_log_level_set("outbox", ESP_LOG_VERBOSE);

Aren't you writing over the struct data?
Sorry, I didn't catch what you were trying to say.
Could you point it out?
Thanks

@euripedesrocha
Copy link
Collaborator

About the logs: The first line set it to all the tags, the following esp_log_level_set define the level for specific components of interest. The tags are lower case from IDF > 5.0
About the memcpy, you If you are not allocating the necessary space that pointer is likely pointing to invalid memory. That copy is unnecessary anyway. The client do the copy already.

@Sky-Soo-Ap
Copy link
Author

I turned on the verbose log level of "mqtt_client" and did not see any more logs.
Anyway, the attachment is a recaptured log.
2024-10-23__mqtt_client_verbose_log_level.log

@Sky-Soo-Ap
Copy link
Author

Hi euripedesrocha,
Following are the reproducible steps:

  1. ESP32 connects to phone Wi-Fi AP;
  2. ESP32 connects to an online MQTT broker;
  3. ESP32 keeps receiving notification data and publishes it to the broker;
  4. Disable the phone 4G/5G network;

@Sky-Soo-Ap
Copy link
Author

Hi euripedesrocha,
Is there any progress?
If this issue is not solved, the product cannot be shipped.

@euripedesrocha
Copy link
Collaborator

@Sky-Soo-Ap sorry for the delay, this seems to be unlikely an issue on the MQTT client, but I'm having a hard time identifying the root cause.
To be able to help you, I'll need a code reproducer to evaluate the scenario. From the configuration code you sent, you should see the reconnection attempts from the mqtt client.

@Sky-Soo-Ap
Copy link
Author

Hi euripedesrocha,
Finally found the root cause, it was stuck in API vTaskSuspend.
Sorry, I don't have detailed debugging here.
The attachment is a reproducible program.
tcp.zip

Following are the reproducible steps:
ESP32 connects to phone Wi-Fi AP;
ESP32 connects to an online MQTT broker;
ESP32 keeps receiving notification data and publishes it to the broker;
Disable the phone 4G/5G network;

@euripedesrocha
Copy link
Collaborator

I see. Have you considered using MQTT_SKIP_PUBLISH_IF_DISCONNECTED?

@Sky-Soo-Ap
Copy link
Author

Hi euripedesrocha,
The actual situation is this: sometimes Wi-Fi, or MQTT is disconnected and reconnected;
To prevent packet loss, we store the notification data packet in a large queue and pause it by calling the API vTaskSuspend. Because QoS 0 is used to publish, mqtt client disconnected will cause packet loss.
Why does calling API vTaskSuspend in an MQTT disconnected event get stuck?

@euripedesrocha
Copy link
Collaborator

euripedesrocha commented Nov 5, 2024

@Sky-Soo-Ap from reading your code you sent, the only thing I see is the notify_task not being available and the task handler being NULL at that point. This would suspend the mqtt task.
I have a few suggestions.

  • You can use the App trace library to debug the situation and find the issue. We'll spend some time on it also.

  • For the problem you are trying to solve, I'll suggest using esp_mqtt_client_enqueue instead of publish to handle this. Even QoS0 messages will be stored in the outbox queue and you can keep pushing to it, limited of course by the available memory.

@euripedesrocha
Copy link
Collaborator

Hi @Sky-Soo-Ap I just tried the reproducer sent and wasn't able to reproduce the issue. Is there something missing in the code?

@Sky-Soo-Ap
Copy link
Author

Hi euripedesrocha,
The reproducer is able to reproduce this issue; following is the log file.
Reproducer.log
And, I checked that the task handler is not NULL after creation, then called “configASSERT(xNotifyTaskHandle);”.

void suspendNotifyTask() {
if (xNotifyTaskHandle != NULL) {
vTaskSuspend(xNotifyTaskHandle);
ESP_LOGW(TAG, "Suspended publish notification packets");
}
}

How does the API esp_mqtt_client_enqueue behave when the outbox is full? It will drop the current data or the first queued data?

@euripedesrocha
Copy link
Collaborator

@Sky-Soo-Ap I wasn't able to reproduce on my side using the code you sent. There were other parts of your system that you omitted in this reproducer that might affect the behavior?

When you have outbox memory exhausted, either by system without enough memory or outbox memory usage limit, you get an error, no messages are dropped from the outbox.

@Sky-Soo-Ap
Copy link
Author

@euripedesrocha
Could you try the bin file I make? Did you use IDF v5.3.1?
mqtt_tcp.zip

no messages are dropped from the outbox.

So, the get an error data will drop, right?

@euripedesrocha
Copy link
Collaborator

@Sky-Soo-Ap I retried and noticed that I had one step incorrect and was able to reproduce it, thanks for the help.

Not exactly. The message is not created, your data isn't touched, so no data is lost/drop. You'll not be able to publish/enqueue at that point, but that is because your system memory is exhausted.

@euripedesrocha
Copy link
Collaborator

@Sky-Soo-Ap we were able to identify the source of your problem.
The disconnect event is dispatched from the notify task context, at this point publish is running and holding the mutex of the mqtt client task. Since you suspend the notify task, the mutex is never released and the mqtt client isn't able to run and notify task never runs gain as well.

@Sky-Soo-Ap
Copy link
Author

@euripedesrocha
This is strange, because I intuitively think that the mqtt event handler is handled by the mqtt client task; the API esp_mqtt_client_publish does not explain it, so no one will know.
Furthermore, the disconnected event will not be deemed relevant, while the published event will have a slight relevance.
Will all mqtt events be handled by mqtt client task?

@euripedesrocha
Copy link
Collaborator

@Sky-Soo-Ap you are partially correct.
The event is dispatched from the task context that is running at the moment, TBH it took me a while to figure out the deadlock here as well, I was considering the mqtt task had acquired the mutex.

I didn't follow your thoughts on relevance of the disconnected/published events. Could you elaborate?

Events will most of the time be dispatched from the mqtt task, I'll check other points where this could happen and improve the documentation about it.

Have you experimented with enqueue function? Does it solve your original problem?

@Sky-Soo-Ap
Copy link
Author

@euripedesrocha

I didn't follow your thoughts on relevance of the disconnected/published events. Could you elaborate?

What I mean is that the API and event tasks must be disconnected; in this example, the upper layer calls the API, returns an error after time out, and the event generated by time out is processed by the mqtt task.
If the API and event tasks are not disconnected, the upper layer calls the publish API, receives the published event, and then calls the publish API, which will form a loop, and the stack size will be full soon.

Have you experimented with enqueue function? Does it solve your original problem?

The queuing function does not meet the functional requirements. What I need is that when the queue is full, the oldest data is discarded and the latest is saved.

@euripedesrocha
Copy link
Collaborator

@Sky-Soo-Ap

What I mean is that the API and event tasks must be disconnected; in this example, the upper layer calls the API, returns an error after time out, and the event generated by time out is processed by the mqtt task.
If the API and event tasks are not disconnected, the upper layer calls the publish API, receives the published event, and then calls the publish API, which will form a loop, and the stack size will be full soon.

The published event is dispatched only by the mqtt task. I still need to double-check, but I believe this behavior is only for the disconnection event.

The queuing function does not meet the functional requirements. What I need is that when the queue is full, the oldest data is discarded and the latest is saved.

Could still be a solution. The outbox behavior can be fully customized.

@Sky-Soo-Ap
Copy link
Author

@euripedesrocha

The published event is dispatched only by the mqtt task. I still need to double-check, but I believe this behavior is only for the disconnection event.

Will you solve this problem? Disconnect event dispatched by the mqtt task.

Could still be a solution. The outbox behavior can be fully customized.

I didn't find the relevant APIs or structural data to customize it.

@euripedesrocha
Copy link
Collaborator

@Sky-Soo-Ap

Will you solve this problem? Disconnect event dispatched by the mqtt task.

This behavior isn't likely to change at the moment. I'll work on it later though.

I didn't find the relevant APIs or structural data to customize it.

You can find the relevant example here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Opened Issue is new Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

6 participants