Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WiFi APSTA Mode, esp_wifi_scan_start in blocking mode triggers a WDT timeout (IDFGH-13854) #14703

Open
3 tasks done
eroom1966 opened this issue Oct 10, 2024 · 9 comments
Open
3 tasks done
Assignees
Labels
Status: Reviewing Issue is being reviewed Type: Bug bugs in IDF

Comments

@eroom1966
Copy link

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v4.4.7

Espressif SoC revision.

ESP32-S3

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

Custom Board

Power Supply used.

USB

What is the expected behavior?

I am running the WIFI in APSTA mode.
I run successive blocking wifi scans every 5 seconds using
esp_err_t err = esp_wifi_scan_start(&scan_config, true);
the scan_config defines

    wifi_scan_config_t scan_config = {
      .ssid          = 0,
      .bssid         = 0,
      .channel       = channel,
      .show_hidden   = false,
      .scan_type     = WIFI_SCAN_TYPE_ACTIVE,
    };
    scan_config.scan_time.active.min = 120;
    scan_config.scan_time.active.max = 120;
    scan_config.home_chan_dwell_time = 0;

The channel scan is changed for each call from 1-13 iteratively

I encounter a non-deterministic bug whereby after a long runtime, I hit a WDT timer, here is the output from the monitor
the call to esp_wifi_scan_start(&scan_config, true) does not return - it is blocked indefinitely
I think I have noticed a specific pattern that when the scan does not return, I have requested a scan on the same channel that I am currently connected
I am still confirming this is the case, but I have not yet seen a lockup on a channel scan, that is not the one I am currently connected, which in my case is channel 6

What is the actual behavior?

unfortunately the system locks up in a call to esp_wifi_scan_start()

I encounter a non-deterministic bug whereby after a long runtime, I hit a WDT timer, here is the output from the monitor
the call to esp_wifi_scan_start(&scan_config, true) does not return - it is blocked indefinitely
I think I have noticed a specific pattern that when the scan does not return, I have requested a scan on the same channel that I am currently connected
I am still confirming this is the case, but I have not yet seen a lockup on a channel scan, that is not the one I am currently connected, which in my case is channel 6

Steps to reproduce.

This may take some work my entire code stack is around 100,000 lines

Debug Logs.

E (52848383) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (52848383) task_wdt:  - IDLE0 (CPU 0)
E (52848383) task_wdt: Tasks currently running:
E (52848383) task_wdt: CPU 0: wifi
E (52848383) task_wdt: CPU 1: IDLE1
E (52848383) task_wdt: Print CPU 0 (current core) backtrace

0x42044a9a: task_wdt_isr at /home/user/esp/esp-idf-v4.4.7/components/esp_system/task_wdt.c:183 (discriminator 3)
0x40377d35: _xt_lowint1 at /home/user/esp/esp-idf-v4.4.7/components/freertos/port/xtensa/xtensa_vectors.S:1118
0x4038971d: ppTxPkt at ??:?
0x4038985e: ppTxPktForceWaked at ??:?
0x4209ddae: ic_tx_pkt at ??:?
0x42094291: scan_flush_all_tx_buf at ??:?
0x42094f4f: clear_bss_queue at ??:?
0x4209548b: scan_inter_channel_timeout_process at ??:?
0x4209999b: chm_end_op at ??:?
0x420999ba: chm_end_op_timeout_process at ??:?
0x42099520: esp_wifi_register_mgmt_frame_internal at ??:?
0x420997a7: ieee80211_timer_do_process at ??:?
0x40389c51: ppTask at ??:?
0x4038495e: vPortTaskWrapper at /home/user/esp/esp-idf-v4.4.7/components/freertos/port/xtensa/port.c:142


E (52923383) task_wdt: Print CPU 1 backtrace
0x4037babd: esp_crosscore_isr at /home/user/esp/esp-idf-v4.4.7/components/esp_system/crosscore_int.c:92
0x40377d35: _xt_lowint1 at /home/user/esp/esp-idf-v4.4.7/components/freertos/port/xtensa/xtensa_vectors.S:1118
0x40384c4e: vPortClearInterruptMaskFromISR at /home/user/esp/esp-idf-v4.4.7/components/freertos/port/xtensa/include/freertos/portmacro.h:571
 (inlined by) vPortExitCritical at /home/user/esp/esp-idf-v4.4.7/components/freertos/port/xtensa/port.c:332
0x42044d5f: esp_task_wdt_reset at /home/user/esp/esp-idf-v4.4.7/components/esp_system/task_wdt.c:334
0x42044d6b: idle_hook_cb at /home/user/esp/esp-idf-v4.4.7/components/esp_system/task_wdt.c:80
0x42005679: esp_vApplicationIdleHook at /home/user/esp/esp-idf-v4.4.7/components/esp_system/freertos_hooks.c:51 (discriminator 1)
0x40382669: prvIdleTask at /home/user/esp/esp-idf-v4.4.7/components/freertos/tasks.c:4099 (discriminator 1)
0x4038495e: vPortTaskWrapper at /home/user/esp/esp-idf-v4.4.7/components/freertos/port/xtensa/port.c:142


### More Information.

_No response_
@eroom1966 eroom1966 added the Type: Bug bugs in IDF label Oct 10, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Oct 10, 2024
@github-actions github-actions bot changed the title WiFi APSTA Mode, esp_wifi_scan_start in blocking mode triggers a WDT timeout WiFi APSTA Mode, esp_wifi_scan_start in blocking mode triggers a WDT timeout (IDFGH-13854) Oct 10, 2024
@eroom1966
Copy link
Author

A clarification to the following comment

I think I have noticed a specific pattern that when the scan does not return, I have requested a scan on the same channel that I am currently connected

After running upto many crashes - this is not true, I am also seeing the same lockup in the call to esp_wifi_scan_start() when the channel scan requested is NOT the channel currently connected

@hansw123
Copy link
Collaborator

@eroom1966
sorry for late reply!
Can you describe the logic of your application code, especially regarding the scanning part?

@eroom1966
Copy link
Author

eroom1966 commented Oct 11, 2024

@hansw123
Sure thanks for replying.

I am configuring an APSTA mode WiFi, and need to manage both sides of the connection.

For the STA side, I have a user thread which is triggering a scan every 5 seconds, it is effectively this (channel is initially set to 0)

        if (tick_sec5) {
            // scan one channel at a time
            INFO("Scan Channel=%d\n", channel);
            esp_err_t err = wifi_scan_start(channel);
            if (err == ESP_OK) {
                channel++;
                if (channel > 13) channel = 1;
            }
        }

The intention is that the (esp)STA can obtain a list of available AP's after 120 seconds, with minimal 'off-air' time to the connected Stations on the (esp)AP
The function wifi_scan_start() is a wrapper around esp_wifi_scan_start() which gets called with the following scan settings

static esp_err_t wifi_scan_start(Uns32 channel) {
    wifi_scan_config_t scan_config = {
      .ssid          = 0,
      .bssid         = 0,
      .channel       = channel,
      .show_hidden   = false,
      .scan_type     = WIFI_SCAN_TYPE_PASSIVE,
    };
    scan_config.scan_time.active.min = 120;
    scan_config.scan_time.active.max = 120;

    esp_err_t err = esp_wifi_scan_start(&scan_config, false);

There is error checking on the return code from esp_wifi_scan_start() to check that a connect is not currently in progress.

I am using this in background mode, so I need an event handler to detect WIFI_EVENT_SCAN_DONE, then perform some housekeeping to update a list of available WiFi Access Points

There is an interlock to ensure that a scan request cannot be called again, until the WIFI_EVENT_SCAN_DONE event has been received.

The goal is to manage a set of Known AP's that can be connected to the (esp)STA should one of them disappear.

Hopefully this makes sense.
I suspect there is either a corruption, or a race occurring, but very difficult to debug as the WiFi stack is closed source. All I tend to see is an error code something like
(E) wifi:pp.c:37
I will capture this again on the next failure

UPDATE
The error message that appears when this problem occurs is
E (1653554) wifi:pp.c 837
I presume this refers to line 837 of file pp.c - which of course I do not have access.

I have instrumented the code and what appears to happen is the scan is requested, but the event_handler never triggers with an event to indicate the SCAN_DONE. Hence the detection of the watch dog timeout seems legitimate

@eroom1966
Copy link
Author

@hansw123
When I say legitimate - I mean the reason for it happening is legitimate.
The BUG is that their is some kind of lockup in the wifi code.
How can I proceed to debug this when it is closed source ?

@eroom1966
Copy link
Author

Can anybody suggest a way forward on this ?
We can run for less than an hour before this issue occurs and locks up the system

@hansw123
Copy link
Collaborator

hansw123 commented Oct 18, 2024

@eroom1966
the watch dog was caused by the sending wifi packet's QID doesn't match it's AC,this means the packet was not a proper wifi packet。

  1. you should checkout your code ,esp_err_t err = esp_wifi_scan_start(&scan_config, false) means no blocking scan ,but both blocking and no blocking should be ok

2.beacuse your mode is apsta,so you should checkout what ap do (espcially connecting process)during sta scan ,or some ohter sending packet action, or some wifi action special,you can post more logic of your code

3.apsta mode example can refer https://github.com/espressif/esp-idf/blob/master/examples/wifi/softap_sta/main/softap_sta.c

@espressif-bot espressif-bot added Status: In Progress Work is in progress Status: Reviewing Issue is being reviewed and removed Status: Opened Issue is new Status: In Progress Work is in progress labels Oct 18, 2024
@hansw123
Copy link
Collaborator

hansw123 commented Oct 23, 2024

release/v4.4 is end of life, and you can update idf to latest v5.x and have a try. thank you!

@eroom1966
Copy link
Author

This will take quite some time as many of the API calls I am using are deprecated in v5.x
How likely is this to have fixed my issue ?

@hansw123
Copy link
Collaborator

hansw123 commented Nov 5, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Reviewing Issue is being reviewed Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

3 participants