Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

esp_modem: silent disconnection leads to endless OTA (IDFGH-10572) #322

Closed
3 tasks done
bbinet opened this issue Jul 5, 2023 · 14 comments
Closed
3 tasks done

esp_modem: silent disconnection leads to endless OTA (IDFGH-10572) #322

bbinet opened this issue Jul 5, 2023 · 14 comments

Comments

@bbinet
Copy link
Contributor

bbinet commented Jul 5, 2023

Answers checklist.

  • I have read the documentation for esp-protocols components and the issue is not addressed there.
  • I have updated my esp-protocols branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

Context: We have some battery powered esp32c3 systems which use Quectel BG95-M3 modem to publish sensor data though MQTT.

I've noticed that when pushing OTA updates (using esp_https_ota component), the OTA update often fails and drain the battery because BG95 modem remain powered until the OTA ends, but it never ends...
Looking at the DEBUG logs of esp_https_ota component, the OTA seems blocked in an endless loop saying "ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready".
This is probably caused by a silent PPP disconnection.

So I think I have multiple issues here:

  • we regularly face PPP disconnection whereas our signal quality is good (AT+CSQ reports a value of 19). Any idea how can I debug the root cause of these disconnections?
  • this disconnections are not detected by esp_modem and the PPP connection is not terminated
  • the esp_https_ota component can enter an endless loop in these conditions

I've come across issue #287: may it be related to this one?

@github-actions github-actions bot changed the title esp_modem: silent disconnection leads to endless OTA esp_modem: silent disconnection leads to endless OTA (IDFGH-10572) Jul 5, 2023
@bbinet
Copy link
Contributor Author

bbinet commented Jul 5, 2023

Here are the related esp logs:

I (2371) uart: ESP_INTR_FLAG_IRAM flag not set while CONFIG_UART_ISR_IN_IRAM is enabled, flag updated
I (2381) uart: queue free spaces: 30
W (2421) uart_terminal: Rx Break
I (13121) esp-netif_lwip-ppp: Connected
W (13141) esp_rmaker_time: Waiting for time to be synchronized. This may take time.
I (13151) esp_rmaker_time: The current time is: Wed Jul  5 14:13:13 2023 +0000[GMT], DST: No.
I (13161) esp_mqtt_glue: Connecting to mq.helioslite.net
I (13171) esp_rmaker_core: Waiting for MQTT connection
I (13171) main_task: Returned from app_main()
I (15491) esp_mqtt_glue: MQTT Connected
I (15491) esp_rmaker_node_config: Reporting Node Configuration of length 904 bytes.
I (15571) esp_rmaker_param: Params MQTT Init done.
I (15571) esp_rmaker_param: Reporting params (init): {"HL":{"SPM":{}},"OTA":{"Status":"","Info":"","URL":""},"Schedule":{"Schedules":[]}}
I (15581) esp_rmaker_cmd_resp: Enabling Command-Response Module.
I (15581) esp_rmaker_core: End of wifi_wait task
I (16091) esp_rmaker_param: Received params: {"OTA":{"URL":"20230705_141204362780.bin"}}
I (16091) esp_rmaker_ota_using_params: Received value = 20230705_141204362780.bin for OTA - URL
I (17151) NimBLE: GAP procedure initiated: stop advertising.
I (17151) NimBLE: GAP procedure initiated: stop advertising.
I (17151) wifi_prov_mgr: Provisioning stopped
I (17191) wifi_prov_scheme_ble: BTDM memory released
I (17201) esp_rmaker_ota: Reporting in-progress: Starting OTA Upgrade
I (17211) esp_rmaker_param: Reporting params: {"OTA":{"Info":"Starting OTA Upgrade"}}
I (17221) esp_rmaker_param: Reporting params: {"OTA":{"Status":"in-progress"}}
W (17221) esp_rmaker_ota: Starting OTA. This may take time.
I (19581) esp_https_ota: Starting OTA...
I (19581) esp_https_ota: Writing to partition subtype 17 at offset 0x1a0000
W (19581) wifi: [pm.c,3517] Set ps type: 0
I (19591) esp_rmaker_ota: Reporting in-progress: Downloading Firmware Image
I (19601) esp_rmaker_param: Reporting params: {"OTA":{"Info":"Downloading Firmware Image"}}
I (19611) esp_rmaker_param: Reporting params: {"OTA":{"Status":"in-progress"}}
D (19691) esp_https_ota: Written image length 1024
D (19691) esp_https_ota: Written image length 2048
D (19761) esp_https_ota: Written image length 3072
D (20351) esp_https_ota: Written image length 4096
D (20431) esp_https_ota: Written image length 5120
D (20431) esp_https_ota: Written image length 6144
D (20481) esp_https_ota: Written image length 7168
D (20571) esp_https_ota: Written image length 8192
D (20981) esp_https_ota: Written image length 9216
D (20981) esp_https_ota: Written image length 10240
D (20981) esp_https_ota: Written image length 11264
D (21061) esp_https_ota: Written image length 12288
D (21131) esp_https_ota: Written image length 13312
D (21131) esp_https_ota: Written image length 14336
D (22271) esp_https_ota: Written image length 15360
D (22351) esp_https_ota: Written image length 16384
D (22411) esp_https_ota: Written image length 17408
D (22411) esp_https_ota: Written image length 18432
D (22411) esp_https_ota: Written image length 19456
D (22451) esp_https_ota: Written image length 20480
D (22521) esp_https_ota: Written image length 21504
D (22521) esp_https_ota: Written image length 22528
E (27521) transport_base: esp_tls_conn_read error, errno=No more processes
D (27521) esp_https_ota: Written image length 22733
D (32531) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (37541) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (42551) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (47561) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (52571) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (57581) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (62591) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (67601) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (72611) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (77621) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (82631) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (87641) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (92651) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (97661) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (102671) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (107681) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (112691) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (117701) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (122711) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (127721) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (132731) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
E (135521) mqtt_client: No PING_RESP, disconnected
W (135521) esp_mqtt_glue: MQTT Disconnected. Will try reconnecting in a while...
D (137741) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (142751) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (147761) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (152771) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (157781) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
E (160531) esp-tls: [sock=54] select() timeout
E (160531) esp-tls: Failed to open new connection
E (160531) transport_base: Failed to open a new connection
E (160531) mqtt_client: Error transport connect
E (160551) esp_mqtt_glue: MQTT_EVENT_ERROR
W (160551) esp_mqtt_glue: MQTT Disconnected. Will try reconnecting in a while...
D (162791) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
I (162791) esp_rmaker_ota: Image bytes read: 22733
D (167801) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (418301) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
...

@bbinet
Copy link
Contributor Author

bbinet commented Jul 6, 2023

I've activated DEBUG log level for more components in the following logs:
https://gist.github.com/bbinet/26720eac93f0fd4280ca1d195840add9

I have the impression that the more component with DEBUG log level are activated, the earlier this silent disconnection issue happens. Could the PPP silent disconnection be due to cpu being busy doing other things (like printing debug logs to uart) and then HTTP connection timeouts?

@bbinet
Copy link
Contributor Author

bbinet commented Jul 6, 2023

Also note that when I change my code to connect through WiFi instead of LTE with BG95 (esp_modem), the OTA always succeeds and is much faster to complete.

@david-cermak
Copy link
Collaborator

@bbinet Which IDF version are you using? Are you also seeing this issue on IDF v5.0 or v5.1 (if it's easy for you to switch and check)?

I could only reproduce the problem when I physically disconnect the UART wires from my modem. When connected normally, it shouldn't happen and and the disconnection should be reported to the http client and OTA layers.

By silent disconnection, you mean that the PPP interface is still considered up, so you receive no event about the disconnection?
You can enable LCP keepalive to prevent this issue, but I'd have to investigate why it's happening during normal operation. Is it possible that the modem restarts during connection (that would be one reason)?

@bbinet
Copy link
Contributor Author

bbinet commented Jul 21, 2023

@david-cermak I was using a specific commit between IDF v5.0 and v5.1, but I've just updated to v5.1, and I still face the same issue.

By silent disconnection, I mean that I don't receive any connection error, and OTA is still trying to continue indefinitly, so I suppose that there is an issue with the PPP connection but it is still considered up.

I will have a look to LCP keepalive, and report here when ready.

No, the modem does not restart during connection: I have checked its rx/tx line of the modem uart, and the modem just stops sending data to the esp, and I can't see any visible error on uart lines.

I also tried to increase baudrate speed to 460800 for the modem uart just to see if the OTA could last shorter, hoping to reduce the risk of OTA failure, but same issue again.

Also, I noticed that when trying to connect to MQTT server, it regularly fails in the TLS handshake phase:

I (7479) app_connect: Waiting for LTE modem to be registered on network
I (13819) app_connect: Modem Connect to PPP Server
I (13819) app_connect: ~~~~~~~~~~~~~~
I (13819) app_connect: IP          : 10.181.144.244
I (13829) app_connect: Netmask     : 255.255.255.255
I (13829) app_connect: Gateway     : 10.64.64.64
I (13839) esp-netif_lwip-ppp: Connected
I (13839) app_connect: Name Server1: 8.8.8.8
I (13849) app_connect: Name Server2: 8.8.4.4
I (13849) app_connect: ~~~~~~~~~~~~~~
I (13859) app_connect: GOT ip event!!!
I (13859) sensornode: Connected: got ip
W (13869) esp_rmaker_time: Waiting for time to be synchronized. This may take time.
I (13869) main_task: Returned from app_main()
I (15819) hl_spsn_service: SNTP Synchronised.
I (15869) esp_rmaker_time: The current time is: Fri Jul 21 14:49:56 2023 +0000[GMT], DST: No.
I (15869) esp_mqtt_glue: Connecting to mq.helioslite.net
I (15879) esp_rmaker_core: Waiting for MQTT connection
I (16649) wifi:flush txq
I (16649) wifi:stop sw txq
I (16649) wifi:lmac stop hw txq
E (17659) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x004C
I (17659) esp-tls-mbedtls: Certificate verified.
E (17659) esp-tls: Failed to open new connection
E (17669) transport_base: Failed to open a new connection
E (17669) mqtt_client: Error transport connect
E (17679) esp_mqtt_glue: MQTT_EVENT_ERROR
W (17679) esp_mqtt_glue: MQTT Disconnected. Will try reconnecting in a while...

I think this is a different issue, but this is still an issue with the modem only (I don't see such handshake issues while connecting through wifi).

@bbinet
Copy link
Contributor Author

bbinet commented Jul 25, 2023

@david-cermak I've just implemented LCP keepalive, and it seems the PPP session is still up (I don't reveive any PPP STATUS about ppp termination...)

Here are the full logs: https://gist.githubusercontent.com/bbinet/b8e7621aed9bdc1183a7b92a05c5dcf9/raw/1dc42a316193717f33653ac869bc5e8d0d4714d9/uart.log

If PPP is still up, I don't understand why I can't get any more data for OTA and why MQTT is disconnected.
Any idea?

@david-cermak
Copy link
Collaborator

Could you please also enable PPP_DEBUG (it's just next to the LCP in menuconfig) and turn off the MQTT task.

What I find really strange is that the MQTT client got disconnected, while OTA still keeps trying.

This definitely looks like a bug, but need to find out if it's in the networking layers or in OTA.

@bbinet
Copy link
Contributor Author

bbinet commented Jul 26, 2023

PPP_DEBUG was already enabled. Here are new logs with all DEBUG logs activated except for some verbose components (nvs, gdma, CMUX):
https://gist.githubusercontent.com/bbinet/b0cf418a48f5851ab852d6f88d8c6dff/raw/b272e8197bc6316e3a73ec56b965f307cd148779/uart2.log

Also, here is my sdkconfig file if you want to try to reproduce the issue: https://gist.githubusercontent.com/bbinet/b0cf418a48f5851ab852d6f88d8c6dff/raw/b272e8197bc6316e3a73ec56b965f307cd148779/sdkconfig

I kept MQTT task running for now as it's not easy to turn off for me because I use MQTT to trigger the OTA update. I'll send you an update with MQTT task disabled soon.

I think that the MQTT client got disconnected because the MQTT keepalive timeout expires at MQTT level (see log: "mqtt_client: No PING_RESP, disconnected").

@david-cermak
Copy link
Collaborator

interesting, it looks like the link is still up and running, so the LCP echo replies are still being sent by the device.

I think that the MQTT client got disconnected because the MQTT keepalive timeout expires at MQTT level

Oh, okay. makes sense.

To workaround this issue, you can enable TCP keepalive checks for OTA as well, here an example:

https://github.com/espressif/esp-idf/blob/28167ea5a3cdd49227216c4131a9df7f4eda260e/examples/system/ota/simple_ota_example/main/simple_ota_example.c#L100C35-L100C35

@bbinet
Copy link
Contributor Author

bbinet commented Jul 27, 2023

I've just checked and keep_alive_enable = true was already set.

I've already implemented a temporary workaround for the endless OTA: I've setup a timer to restart the ESP after 30min.

Do you have any idea why the connection may not be working whereas the PPP link is still up?

@david-cermak
Copy link
Collaborator

I've just checked and keep_alive_enable = true was already set.

Weird, this line should've appeared in your log, but I didn't see it anywhere...

I've already implemented a temporary workaround for the endless OTA: I've setup a timer to restart the ESP after 30min.

The TCP level keepalive mechanism should also work, but need to understand what's wrong in your configuration.

Do you have any idea why the connection may not be working whereas the PPP link is still up?

No idea, one theory could be that the internet connectivity got broken somewhere upstream? In that case you would be able to ping your remote IP, but cannot access the OTA endpoint.

@bbinet
Copy link
Contributor Author

bbinet commented Jul 27, 2023

I've just checked and keep_alive_enable = true was already set.

Weird, this line should've appeared in your log, but I didn't see it anywhere...

You're right, this line does not appear, so the tcp keepalive is finally not enabled, but I've double checked, and the keep_alive_enable = true is correctly set: I've added a log message in esp_http_client_init to confirm (here: https://github.com/espressif/esp-idf/blob/master/components/esp_http_client/esp_http_client.c#L668).
So maybe a bug: this keepalive config may be lost somewhere before the esp_tls_set_socket_options function is called.

Do you have any idea why the connection may not be working whereas the PPP link is still up?

No idea, one theory could be that the internet connectivity got broken somewhere upstream? In that case you would be able to ping your remote IP, but cannot access the OTA endpoint.

Ok, I will try to use my 1NCE SIM card in a 4G Mobile WiFi Hot Spot product to see if it comes from my upstream 1NCE operator, or if it was really an issue with my own esp+bg95 modem.

@bbinet
Copy link
Contributor Author

bbinet commented Jul 28, 2023

Ok, I will try to use my 1NCE SIM card in a 4G Mobile WiFi Hot Spot product to see if it comes from my upstream 1NCE operator, or if it was really an issue with my own esp+bg95 modem.

So I've just tried to insert my SIM card in a 4G modem and then connect my esp32c3 to the 4G modem through WiFi: everything is running fine and OTA performs with a much higher data rate than when using my BG95 with esp_modem.
So this is probably not an issue upstream with my operator...

@david-cermak
Copy link
Collaborator

Some updates about these issues (sorry for the late reply):

  1. Default config (no TCP keepalive, nor LCP echo) disconnection:

    • We have no way to tell if the connection was dropped (unless we got disconnected before receiving an ACK or something)
    • Ethernet and WiFi interfaces handle such disconnection on link layer; PPP doesn't provide anything but the LCP protocol.
  2. LCP echo: This works correctly on my end, so either there's something wrong with your configuration (saw you're using rainmaker?) or your modem actually sends the LCP echo replies.

  • Could you please enable LWIP debug, so we can see the echo requests and replies in the log? (I think you were missing CONFIG_LWIP_DEBUG=y)
  • Could you run the default cmux-client example and enable OTA and configure the address of your binary? This way we could be on the same page with the code and configuration.
  1. TCP keepalive: There's a bug in http client's implementation in IDF, I'll post a fix MR soon. Here's a quick patch to check:
diff --git a/components/esp_http_client/esp_http_client.c b/components/esp_http_client/esp_http_client.c
index 0deb011253..f10b487224 100644
--- a/components/esp_http_client/esp_http_client.c
+++ b/components/esp_http_client/esp_http_client.c
@@ -692,6 +692,14 @@ esp_http_client_handle_t esp_http_client_init(const esp_http_client_config_t *co
         goto error;
     }
 
+    if (config->keep_alive_enable == true) {
+        client->keep_alive_cfg.keep_alive_enable = true;
+        client->keep_alive_cfg.keep_alive_idle = (config->keep_alive_idle == 0) ? DEFAULT_KEEP_ALIVE_IDLE : config->keep_alive_idle;
+        client->keep_alive_cfg.keep_alive_interval = (config->keep_alive_interval == 0) ? DEFAULT_KEEP_ALIVE_INTERVAL : config->keep_alive_interval;
+        client->keep_alive_cfg.keep_alive_count =  (config->keep_alive_count == 0) ? DEFAULT_KEEP_ALIVE_COUNT : config->keep_alive_count;
+        esp_transport_ssl_set_keep_alive(ssl, &client->keep_alive_cfg);
+    }
+
     if (config->crt_bundle_attach != NULL) {
 #ifdef CONFIG_MBEDTLS_CERTIFICATE_BUNDLE

(the problem is that the keepalive was enabled correctly for TCP transport, but wasn't enabled for SSL transport which you're using)

CommanderRedYT pushed a commit to CommanderRedYT/esp-idf that referenced this issue Sep 24, 2023
Foundation transport contained TCP properties for both TCP and SSL
transport, so it was enough to set the TCP connection properties
(keepalive, interface binding) to one transport only. After merging
5778a7c we have separate TCP properties for these transports and need
to set the same for both.
This commit also fixes unnecessary allocation of 1 more byte for if_name

Closes espressif/esp-protocols#322
espressif-bot pushed a commit to espressif/esp-idf that referenced this issue Sep 30, 2023
Foundation transport contained TCP properties for both TCP and SSL
transport, so it was enough to set the TCP connection properties
(keepalive, interface binding) to one transport only. After merging
5778a7c we have separate TCP properties for these transports and need
to set the same for both.
This commit also fixes unnecessary allocation of 1 more byte for if_name

Closes espressif/esp-protocols#322
espressif-bot pushed a commit to espressif/esp-idf that referenced this issue Oct 6, 2023
Foundation transport contained TCP properties for both TCP and SSL
transport, so it was enough to set the TCP connection properties
(keepalive, interface binding) to one transport only. After merging
5778a7c we have separate TCP properties for these transports and need
to set the same for both.
This commit also fixes unnecessary allocation of 1 more byte for if_name

Closes espressif/esp-protocols#322
espressif-bot pushed a commit to espressif/esp-idf that referenced this issue Oct 10, 2023
Foundation transport contained TCP properties for both TCP and SSL
transport, so it was enough to set the TCP connection properties
(keepalive, interface binding) to one transport only. After merging
5778a7c we have separate TCP properties for these transports and need
to set the same for both.
This commit also fixes unnecessary allocation of 1 more byte for if_name

Closes espressif/esp-protocols#322
hathach pushed a commit to adafruit/esp-idf that referenced this issue Mar 27, 2024
Foundation transport contained TCP properties for both TCP and SSL
transport, so it was enough to set the TCP connection properties
(keepalive, interface binding) to one transport only. After merging
5778a7c we have separate TCP properties for these transports and need
to set the same for both.
This commit also fixes unnecessary allocation of 1 more byte for if_name

Closes espressif/esp-protocols#322
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants