WiFi Mesh unstable when parent offline (IDFGH-13875) #14720

Closed
3 tasks done
michaelsimp opened this issue Oct 14, 2024 · 109 comments
Assignees
Labels
Resolution: NA Issue resolution is unavailable Status: In Progress Work is in progress Type: Bug bugs in IDF

Comments

@michaelsimp

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.3.0

Espressif SoC revision.

Chip is ESP32-S3 (QFN56) (revision v0.2)

Operating System used.

Windows

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

PowerShell

Development Kit.

ESP32-S3-WROOM-1

Power Supply used.

USB

What is the expected behavior?

I expect the ESP32 to continue to run the application without crashing when the WIFI Mesh parent disappears.
If the MESH_ROOT was powered off, I expect a MESH_NODE to assume the role of MESH_ROOT
If the WIFI Router is powered off, when I restore it, I expect the mesh network to establish itself

What is the actual behavior?

  1. Sometimes these tests work perfectly. The Mesh network goes down and the nodes start scanning. If I restore the WiFi router, the Mesh network is reestablished
  2. Sometimes the Mesh network goes down, and can't recover. It doesn't crash but it doesn't scan properly and reestablish the Mesh network.
  3. Very regularly, if I power off the WiFi router the MESH_ROOT intermittently crashes, OR if I power off the MESH_ROOT a MESH_NODE intermittently crashes.

Steps to reproduce.

  1. Power on system comprising 2 x ESP32-S3 dev boards and a Wifi router
  2. Connect a serial terminal (I am using PUTTY) to each serial port for monitoring
  3. Let the Mesh network get established and verify MESH_ROOT and MESH_NODE connected.
  4. Power off the WiFi Router
  5. In the example (logs below) the MESH_ROOT crashed

Debug Logs.

I (00:58:03.336) aWifiMesh: <MESH_EVENT_MESH_STARTED>ID:77:77:77:77:77:76
I (136546) mesh: <MESH_NWK_LOOK_FOR_NETWORK>need_scan:0x3, need_scan_router:0x0, look_for_nwk_count:1
I (00:58:03.336) aWifiMesh: This node MAC:48:ca:43:9b:53:d8
I (00:58:03.354) aWifiMesh: WiFi Mesh started successfully, heap:141084, root not fixed
WIN> I (140766) mesh: [S6]VONETS, 00:17:13:20:bd:74, channel:8, rssi:-12
I (140776) mesh: find router:[ssid_len:6]VONETS, rssi:-12, 00:17:13:20:bd:74(encrypted), new channel:8, old channel:0
I (140786) mesh: [FIND][ch:0]AP:11, otherID:0, MAP:1, idle:1, candidate:0, root:0[00:17:13:20:bd:74]router found
I (140796) mesh: [FIND:1]find a network, channel:8, cfg<channel:0, router:VONETS, 00:00:00:00:00:00>

I (00:58:07.590) aWifiMesh: <MESH_EVENT_FIND_NETWORK>new channel:8, router BSSID:00:00:00:00:00:00
W (140796) wifi:<MESH AP>adjust channel:1, secondary channel offset:1(40U)
W (140816) wifi:<MESH AP>adjust channel:8, secondary channel offset:1(40U)
I (141126) mesh: [SCAN][ch:8]AP:1, other(ID:0, RD:0), MAP:0, idle:0, candidate:1, root:0, topMAP:0[c:0,i:0][00:17:13:20:bd:74]router found<>
I (141126) mesh: 1330[SCAN]init rc[48:ca:43:9b:53:d9,-9], mine:0, voter:0
I (141136) mesh: 1368, vote myself, router rssi:-9 > voted rc_rssi:-120
I (141146) mesh: [SCAN:1/10]rc[128][48:ca:43:9b:53:d9,-9], self[48:ca:43:9b:53:d8,-9,reason:0,votes:1,idle][mine:1,voter:1(1.00)percent:1.00][128,1,48:ca:43:9b:53:d9]

I (141456) mesh: [SCAN][ch:8]AP:2, other(ID:0, RD:0), MAP:1, idle:1, candidate:1, root:0, topMAP:0[c:0,i:1][00:17:13:20:bd:74]router found<>
I (141466) mesh: [SCAN:2/10]rc[128][48:ca:43:9b:53:d9,-8], self[48:ca:43:9b:53:d8,-8,reason:0,votes:1,idle][mine:1,voter:2(0.50)percent:1.00][128,1,48:ca:43:9b:53:d9]

I (141776) mesh: [SCAN][ch:8]AP:2, other(ID:0, RD:0), MAP:1, idle:0, candidate:1, root:1, topMAP:0[c:0,i:0][00:17:13:20:bd:74]router found<>
I (141776) mesh: 7391[selection]try rssi_threshold:-78, backoff times:0, max:5<-78,-82,-85>
I (141796) mesh: [DONE]connect to parent:ESPM_3372B8, channel:8, rssi:-15, 30:30:f9:33:72:b9[layer:1, assoc:0], my_vote_num:0/voter_num:0, rc[00:00:00:00:00:00/-8/0]
I (141806) mesh: set router bssid:00:17:13:20:bd:74
I (142596) mesh: <MESH_NWK_MIE_CHANGE><><><><ROOT ADDR><><><>
I (142596) mesh: <MESH_NWK_ROOT_ADDR>from assoc, layer:2, root_addr:30:30:f9:33:72:b9, root_cap:1
I (142616) mesh: <MESH_NWK_ROOT_ADDR>idle, layer:2, root_addr:30:30:f9:33:72:b9, conflict_roots.num:0<>
I (00:58:09.409) aWifiMesh: <MESH_EVENT_ROOT_ADDRESS>root address:30:30:f9:33:72:b9
I (142616) mesh: [scan]new scanning time:600ms, beacon interval:300ms
I (142636) mesh: 2012<arm>parent monitor, my layer:2(cap:6)(node), interval:7286ms, retries:1<normal connected>
I (00:58:09.436) aWifiMesh: <MESH_EVENT_PARENT_CONNECTED>layer:1-->2, parent:30:30:f9:33:72:b9<layer2>, ID:77:77:77:77:77:76
I (00:58:09.451) mesh_netif: It was a wifi station removing stuff
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4212753c  PS      : 0x00060030  A0      : 0x82127613  A1      : 0x3fcc1660
A2      : 0xffffffff  A3      : 0x00000000  A4      : 0xff000000  A5      : 0x00000001
A6      : 0x3fcc0a64  A7      : 0xff000000  A8      : 0x3c1505e4  A9      : 0x00000000
A10     : 0x3fcc0a64  A11     : 0x00000000  A12     : 0x00000101  A13     : 0x3c1505e4
A14     : 0x00000007  A15     : 0x3fcd8024  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0xff00000c  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0xffffffff


Backtrace: 0x42127539:0x3fcc1660 0x42127610:0x3fcc16b0 0x4037e0aa:0x3fcc16d0

More Information.

My application integrates a number of IDF example programs including ip_internal_network
I went back to the example project ip_internal_network and built it unmodified, and can reproduce the same problems quite readily.

Also, for when the ESP32 nodes don't completely crash, I would like to know how to restart the Mesh network in software.
I have tried stopping the Mesh network with:
ESP_ERROR_CHECK(esp_mesh_stop());
ESP_ERROR_CHECK(esp_mesh_deinit());
ESP_ERROR_CHECK(mesh_netifs_destroy()); // I have tried with and without this line. Without it, the logs continually report:
I (135746) mesh: mesh is not started
E (00:58:02.547) mesh_netif: Received with err code 16388 ESP_ERR_MESH_NOT_START

I then try to restart the Mesh network with:
/* mesh initialization */
ESP_ERROR_CHECK(esp_mesh_init());
ESP_ERROR_CHECK(esp_mesh_set_max_layer(CONFIG_MESH_MAX_LAYER));
ESP_ERROR_CHECK(esp_mesh_set_vote_percentage(1));
ESP_ERROR_CHECK(esp_mesh_set_ap_assoc_expire(10));
/* set blocking time of esp_mesh_send() to 5s, to prevent esp_mesh_send() from blocking permanently for some reason */
ESP_ERROR_CHECK(esp_mesh_send_block_time(5000)); // was 30 seconds
mesh_cfg_t cfg = MESH_INIT_CONFIG_DEFAULT();
cfg.crypto_funcs = NULL;
/* mesh ID */
memcpy((uint8_t *) &cfg.mesh_id, MESH_ID, MAC_SIZE);
/* router */
cfg.channel = CONFIG_MESH_CHANNEL;

cfg.router.ssid_len = strlen(meshProvisionData.ssid);
memcpy((uint8_t *) &cfg.router.ssid, meshProvisionData.ssid, cfg.router.ssid_len);
memcpy((uint8_t *) &cfg.router.password, meshProvisionData.password, strlen(meshProvisionData.password));

ESP_ERROR_CHECK(esp_mesh_set_ap_authmode((wifi_auth_mode_t) CONFIG_MESH_AP_AUTHMODE));
cfg.mesh_ap.max_connection = CONFIG_MESH_AP_CONNECTIONS;
cfg.mesh_ap.nonmesh_max_connection = CONFIG_MESH_NON_MESH_AP_CONNECTIONS;
memcpy((uint8_t *) &cfg.mesh_ap.password, CONFIG_MESH_AP_PASSWD, strlen(CONFIG_MESH_AP_PASSWD));
ESP_ERROR_CHECK(esp_mesh_set_config(&cfg));
ESP_ERROR_CHECK(esp_mesh_start());

Doing the above when the system is running normally often causes the ESP32s to crash with various errors,
e.g. after start, MESH_NODE does a scan and then crashes with Guru Meditation Error: Core 0 panic'ed
I (00:22:02.752) aWifiMesh: <MESH_EVENT_FIND_NETWORK>new channel:8, router BSSID:00:00:00:00:00:00
W (1323864) wifi:adjust channel:1, secondary channel offset:1(40U)
W (1323874) wifi:adjust channel:8, secondary channel offset:1(40U)
I (1324184) mesh: [SCAN][ch:8]AP:2, other(ID:0, RD:0), MAP:1, idle:0, candidate:1, root:1, topMAP:0[c:0,i:0][00:17:13:20:bd:74]router found<>
I (1324184) mesh: 7391[selection]try rssi_threshold:-78, backoff times:0, max:5<-78,-82,-85>
I (1324204) mesh: [DONE]connect to parent:ESPM_3372B8, channel:8, rssi:-14, 30:30:f9:33:72:b9[layer:1, assoc:0], my_vote_num:0/voter_num:0, rc[00:00:00:00:00:00/-120/0]
I (1324214) mesh: set router bssid:00:17:13:20:bd:74
I (1324834) mesh: <MESH_NWK_MIE_CHANGE><><><><><><>
I (1324834) mesh: <MESH_NWK_ROOT_ADDR>from assoc, layer:2, root_addr:30:30:f9:33:72:b9, root_cap:1
I (1324844) mesh: <MESH_NWK_ROOT_ADDR>idle, layer:2, root_addr:30:30:f9:33:72:b9, conflict_roots.num:0<>
I (1324854) mesh: [scan]new scanning time:600ms, beacon interval:300ms
I (00:22:03.744) aWifiMesh: <MESH_EVENT_ROOT_ADDRESS>root address:30:30:f9:33:72:b9
I (1324854) mesh: 2012parent monitor, my layer:2(cap:6)(node), interval:4526ms, retries:1
I (00:22:03.771) aWifiMesh: <MESH_EVENT_PARENT_CONNECTED>layer:2-->2, parent:30:30:f9:33:72:b9, ID:77:77:77:77:77:76
I (00:22:03.785) mesh_netif: It was a wifi station removing stuff
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.

Core 0 register dump:
PC : 0x4212753c PS : 0x00060830 A0 : 0x82127613 A1 : 0x3fcc15d0
A2 : 0xffffffff A3 : 0x00000000 A4 : 0x00000278 A5 : 0x00000001
A6 : 0x3fcc09d0 A7 : 0x00000278 A8 : 0x3c1505e4 A9 : 0x3fcd778c
A10 : 0x3fcc09d0 A11 : 0x00000000 A12 : 0x00000101 A13 : 0x3c1505e4
A14 : 0x00000007 A15 : 0x3fcaa7f4 SAR : 0x00000004 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000284 LBEG : 0x40056f5c LEND : 0x40056f72 LCOUNT : 0xffffffff

Backtrace: 0x42127539:0x3fcc15d0 0x42127610:0x3fcc1620 0x4037e0aa:0x3fcc1640

I sometimes get MTX task stack overflows too when I try this, same as #13882

@michaelsimp michaelsimp added the Type: Bug bugs in IDF label Oct 14, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Oct 14, 2024
@github-actions github-actions bot changed the title WiFi Mesh unstable when parent offline WiFi Mesh unstable when parent offline (IDFGH-13875) Oct 14, 2024
@zhangyanjiaoesp
Collaborator

@michaelsimp
I have tested using the ip_internal_network example, but I didn't reproduce your problem. Can you provide the .elf file when the crash issue happen? Or can you provide the core dump decode file?

@michaelsimp
Author

michaelsimp commented Oct 24, 2024 via email

@michaelsimp
Author

michaelsimp commented Oct 24, 2024

Hi
I did a quick set of tests today.

I created the ip_internal_network project from examples and configured as follows:
Set IDF version to 5.3.0
Set target device to ESP32-S3 with jtag integrated debugger
Set partition table Factory partition to 0x400000
Set device flash size to 16MB - matches my ESP32-S3
Set the Router SSID to "VONETS" and Password to "pass9999"
Set Panic Handler to "Print registers and halt"

Clean and Build project.
Load into 2 ESP32-S3 with serial terminals connected for monitoring.
One becomes MESH_ROOT and the other connects as MESH_NODE
Took turns at powering off the MESH_ROOT and watching the other become MESH_ROOT and then powering it back on and it connects as a MESH_NODE. This seemed to work ok today.

But what I did find easy to reproduce was:
Power off MESH_ROOT and power back on BEFORE other MESH_NODE becomes MESH_ROOT.
The original MESH_ROOT I power cycled, becomes MESH_ROOT again, but the MESH_NODE remains disconnected.

See attached files:
MESH_ROOT powered off at line 171
MESH_NODE loses connection around line 33 and never recovers

Also see .elf and .bin files in attachment. I don't know what or where the "Core dump decode file" is, but these tests don't show a CPU crash. I can't run under the debugger due to the power cycle tests.

Please see attachment MESH-Testing.zip two comments down

@michaelsimp
Author

I did some more tests which can easily cause CPU crashes.

Power on both nodes, one becomes MESH_ROOT and one MESH_NODE.
Power off WIFI router.
MESH_ROOT crashes. See file attached MESH_ROOT "Router power off.txt" (crash at end)

Second test. Power up only one node, becomes MESH_ROOT
Power off Router
This time MESH_ROOT does not crash
Power on Router
MESH_ROOT does not crash
Power on a second node which connects to the first MESH_ROOT - see line 492
MESH_ROOT crashes.
See file attached "MESH_ROOT crash on MESH_NODE connect.txt"
MESH_ Testing 2.zip

@michaelsimp
Author

MESH-Testing.zip
This is the attachment for the first tests, 2 entries up. It did not upload properly before.

@brianignacio5
Contributor

Hi @michaelsimp

The esp-idf vscode extension allows you to save settings in multiple places: User (global settings for vscode), Workspace, and Workspace folder (your project's .vscode/settings.json). The ESP-IDF: Show Examples command shows you the esp-idf path used in the current vscode window. You can change where settings are saved with the ESP-IDF: Select where to save configuration settings command. It sounds confusing, but it does allow you to use multiple projects, each with a different esp-idf version (even at the same time, using a vscode workspace). More information here.

It seems the example you are trying to use has some components with behavior specific to each esp-idf version, so building a v5.2.2 example using esp-idf v5.3 might produce some compilation problems. How about creating an example using esp-idf v5.3?

  1. Open a vscode window.
  2. Select esp-idf v5.3 from status bar (recommended) or the ESP-IDF: Configure ESP-IDF extension.
  3. Run the ESP-IDF: Doctor command. Check that esp-idf is indeed using v5.3
  4. Run the ESP-IDF: Show examples. The esp-idf path shown should be v5.3 now
  5. Create your project from esp-idf example and try to build.

We will try to update the Show examples command to show all available esp-idf versions from esp-idf vscode extension to make this easier.

@zhangyanjiaoesp
Collaborator

@michaelsimp The backtrace of the crash issue is here:

xtensa-esp32s3-elf-addr2line -piaf 0x4208c922:0x3fca7e60 0x4201c951:0x3fca7ee0 0x4201f96f:0x3fca7f20 0x4202345e:0x3fca7f50 0x420167bd:0x3fca7f70 -e ip_internal_network.elf

0x4208c922: parse_msg at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/apps/dhcpserver/dhcpserver.c:993
 (inlined by) handle_dhcp at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/apps/dhcpserver/dhcpserver.c:1190
 (inlined by) handle_dhcp at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/apps/dhcpserver/dhcpserver.c:1106
0x4201c951: udp_input at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/core/udp.c:404
0x4201f96f: ip4_input at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:746
0x4202345e: ethernet_input at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/netif/ethernet.c:186
0x420167bd: tcpip_thread_handle_msg at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/api/tcpip.c:174
 (inlined by) tcpip_thread at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/api/tcpip.c:148
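
For anyone decoding their own crash the same way: ESP-IDF prints the backtrace as `PC:SP` pairs, and only the PC half identifies the code location. The helper below is a minimal sketch (the name `backtrace_pcs` is made up for illustration, it is not an IDF function) that pulls the PC values out of a backtrace line so they can be handed to addr2line one by one.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Extract the PC half of each "PC:SP" pair from an ESP-IDF backtrace line.
 * Hypothetical helper for illustration; not part of ESP-IDF.
 * Returns the number of PCs written to out (at most max_out). */
size_t backtrace_pcs(const char *line, uint32_t *out, size_t max_out)
{
    size_t n = 0;
    const char *p = line;

    while (n < max_out && (p = strstr(p, "0x")) != NULL) {
        char *end;
        unsigned long pc = strtoul(p, &end, 16);
        if (end == p || *end != ':') {
            /* Not a PC:SP pair; move past this "0x" and keep looking. */
            p += 2;
            continue;
        }
        out[n++] = (uint32_t)pc;
        /* Skip the SP value that follows the colon. */
        strtoul(end + 1, &end, 16);
        p = end;
    }
    return n;
}
```

Fed the backtrace line from the first post, this yields three PCs, starting with 0x42127539.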

I think this issue is caused by the mismatch between your IDF version and the version in the example. Please update your version according to Brian's suggestion and test it again.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Oct 29, 2024
@michaelsimp
Author

I installed IDF version 5.3.1
I had a problem at step 2 of your instructions two entries up, which states: "Select esp-idf v5.3 from status bar (recommended) or the ESP-IDF: Configure ESP-IDF extension."
When I try, it reports "Open a folder first."
So I opened the folder of a project I made in version IDF 3.0. Then on the status bar I could select version 5.3.1.
When I run the ESP-IDF: Doctor command, I get the following errors:

  • Extension configuration report has been copied to the clipboard with errors.
  • Cannot open file ../report.txt. Detail: Files above 50MB cannot be synchronized with extensions.

I checked my report.txt and found it was over 181MB.
I tried continuing:
When I select Show examples it only shows 5.3.1, which is good.
I created ip_internal_network, but when completed the status bar reports ESP-IDF v5.2.2 again.
So I tried deleting the large report.txt and trying again.
Same problem: it created a report.txt of 181MB again.

@brianignacio5
Contributor

Delete this file:

%USERPROFILE%\.vscode\extensions\espressif.esp-idf-extension-VERSION\esp_idf_vsc_ext.log

and try to run the ESP-IDF: Doctor command again. It seems your extension log has grown very large and exceeded the vscode limit.

About the ESP-IDF v5.2.2 again, it is because the newly created project does not set settings when created. You can select the v5.3.0 from status bar again.

Again sorry for this issue, will work to make it easier to use in the next release of esp-idf extension.

@michaelsimp
Author

I need to make some real progress on this so I have completely uninstalled esp-idf and manually deleted all the ESP and espressif folders including all 3 IDF versions.
I have reinstalled ESP-IDF and only IDF version 5.3.1 to remove all doubt.
I will rebuild and test and report
Thanks

@michaelsimp
Author

Hi again
It's hard to tell, but it seems as if it might be a little more robust, especially with the router power off and on test.
Attachments.zip
But it still crashes, see files attached including my .elf

"Fail 1.txt" is taken from the Mesh_Root
Line 1458 Mesh_Node disconnected
Line 1495 Mesh_Node reconnects
Line 1496 crash

"Fails 2.txt" is taken from the Mesh_Node
Line 1789 Mesh_Root disconnected
Line 1822 Reconnect
Line 1832 Crash divide by zero

@zhangyanjiaoesp
Collaborator

It's weird, I have tested it multiple times as you said (the following two cases) and it can connect normally without any crashing issues.

Power on both nodes, one becomes MESH_ROOT and one MESH_NODE. Power off WIFI router. MESH_ROOT crashes. See file attached MESH_ROOT "Router power off.txt" (crash at end)

Second test. Power up only one node, becomes MESH_ROOT Power off Router This time MESH_ROOT does not crash Power on Router MESH_ROOT does not crash Power on a second node which connects to the first MESH_ROOT - see line 492 MESH_ROOT crashes.

I'm using the Github IDF, and I will try to test with the vscode extension

@michaelsimp
Author

michaelsimp commented Oct 30, 2024 via email

@zhangyanjiaoesp
Collaborator

@michaelsimp

Are you using the completely unmodified example code during your testing?

@michaelsimp
Author

michaelsimp commented Oct 30, 2024

Yes. I did not change anything except:

  • set target device to ESP32-S3 - Internal JTAG debug - What device are you testing with? Could this be a factor?
  • change partition table, factory partition size to 0x400000
  • Using vscode GUI menuconfig:
    • change device flash size to 16MB
    • set WiFi router SSID to VONETS and password to "pass9999"

See my source files attached where you can see most are untouched with the original install date 30/10/24 02:13pm NZ time.
Source.zip

@michaelsimp
Author

FYI I use vscode for coding and building and sometimes JTAG debugging
Most of the time in testing I am monitoring the serial COM ports using PuTTY terminals.

In addition to the crashing, sometimes a MESH_NODE will not reconnect to the MESH network. As a workaround for this, I want to stop and restart the wifi mesh network. No matter what I try, my nodes intermittently crash on restart. Is this perhaps related?

I am hoping you could please answer a few questions to help me.

What are the recommended steps to stop and then restart the WiFi Mesh network? I currently have :

    ESP_ERROR_CHECK(esp_mesh_stop());
    ESP_ERROR_CHECK(esp_mesh_deinit());

but this causes a lot of error logging which stops if I add ...
ESP_ERROR_CHECK(mesh_netifs_destroy());

My restart is as follows:

void wifiMeshStart() {
    ESP_LOGW(TAG, "Wifi Mesh switch on");
    /*  mesh initialization */
    ESP_ERROR_CHECK(esp_mesh_init());
    ESP_ERROR_CHECK(esp_mesh_set_max_layer(CONFIG_MESH_MAX_LAYER));
    ESP_ERROR_CHECK(esp_mesh_set_vote_percentage(1));
    ESP_ERROR_CHECK(esp_mesh_set_ap_assoc_expire(10));
    /* set blocking time of esp_mesh_send() to 30s, to prevent the esp_mesh_send() from permanently for some reason */
    ESP_ERROR_CHECK(esp_mesh_send_block_time(30000));
    mesh_cfg_t cfg = MESH_INIT_CONFIG_DEFAULT();
#if !MESH_IE_ENCRYPTED
    cfg.crypto_funcs = NULL;
#endif
    /* mesh ID */
    memcpy((uint8_t *) &cfg.mesh_id, MESH_ID, MAC_SIZE);
    /* router */
    cfg.channel = CONFIG_MESH_CHANNEL;

    cfg.router.ssid_len = strlen(meshProvisionData.ssid);
    memcpy((uint8_t *) &cfg.router.ssid, meshProvisionData.ssid, cfg.router.ssid_len);
    memcpy((uint8_t *) &cfg.router.password, meshProvisionData.password, strlen(meshProvisionData.password));
    
    ESP_ERROR_CHECK(esp_mesh_set_ap_authmode((wifi_auth_mode_t) CONFIG_MESH_AP_AUTHMODE));
    cfg.mesh_ap.max_connection = CONFIG_MESH_AP_CONNECTIONS;
    cfg.mesh_ap.nonmesh_max_connection = CONFIG_MESH_NON_MESH_AP_CONNECTIONS;
    memcpy((uint8_t *) &cfg.mesh_ap.password, CONFIG_MESH_AP_PASSWD, strlen(CONFIG_MESH_AP_PASSWD));
    ESP_ERROR_CHECK(esp_mesh_set_config(&cfg));
    /* mesh start */
    ESP_ERROR_CHECK(esp_mesh_start());
    ESP_LOGI(TAG, "WiFi Mesh started successfully");
}
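
One thing worth guarding in a restart routine like this: the `memcpy` calls copy `strlen()` bytes of provisioned data into fixed-size config fields without checking that they fit. Below is a minimal sketch of a bounded copy. The helper name `copy_field` is hypothetical, and the field sizes mentioned afterwards (32-byte SSID, 64-byte password in the router config) are from memory of `esp_mesh.h`, so treat them as assumptions and use `sizeof` on the real fields.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a NUL-terminated string into a fixed-size byte field.
 * Hypothetical helper for illustration; not part of ESP-IDF.
 * Fails (and leaves dst zeroed) if src does not fit; when it fits, the
 * remainder of the field stays zero-filled. */
bool copy_field(uint8_t *dst, size_t dst_size, const char *src, size_t *len_out)
{
    size_t len = strlen(src);

    memset(dst, 0, dst_size);
    if (len > dst_size) {
        return false;
    }
    memcpy(dst, src, len);
    if (len_out != NULL) {
        *len_out = len;
    }
    return true;
}
```

In `wifiMeshStart()` this could replace the raw copies, for example `copy_field(cfg.router.ssid, sizeof(cfg.router.ssid), meshProvisionData.ssid, &len)`, with the restart aborted cleanly if it returns false.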

I notice with my custom application:
I have 1 MESH_ROOT and 5 MESH_NODEs spread around the office. Mesh_Node seem to connect to parents not based on the RSSI as documented.

The IDF documentation states "To prevent nodes from forming a weak upstream connection, ESP-WIFI-MESH implements an RSSI threshold mechanism for beacon frames." Is this configurable and if so where? I can't find it in the API or in MenuConfig. What is the default RSSI threshold value?

The IDF documentation states in Preferred Parent Node "The preferred parent node is determined based on the following criteria: Which layer the parent node candidate is situated on. The number of downstream connections (child nodes) the parent node candidate currently has".
Does this mean RSSI is not part of the parent selection process?

Is it recommended to use self-organized networking, or for serious applications should I manually build the mesh network? I will only have a max of 10 mesh nodes altogether but they do a reasonable amount of MQTT5 communications to the cloud.

@zhangyanjiaoesp
Collaborator

@michaelsimp The following are the answers for your questions:

  1. What are the recommended steps to stop and then restart the WiFi Mesh network? I currently have :

        ESP_ERROR_CHECK(esp_mesh_stop());
        ESP_ERROR_CHECK(esp_mesh_deinit());
    

    Calling esp_mesh_stop() is enough.

  2. but this causes a lot of error logging which stops if I add ...
    ESP_ERROR_CHECK(mesh_netifs_destroy());
    Where did you add the mesh_netifs_destroy() function? What does the error log look like? Can you provide an example?

  3. Where did you call the wifiMeshStart() function?

  4. I have 1 MESH_ROOT and 5 MESH_NODEs spread around the office. Mesh_Node seem to connect to parents not based on
    the RSSI as documented.

    RSSI is not the only criterion for selecting the parent node, the layer and connections also need to be considered.

  5. The IDF documentation states "To prevent nodes from forming a weak upstream connection, ESP-WIFI-MESH implements an RSSI threshold mechanism for beacon frames." Is this configurable and if so where? I can't find it in the API or in MenuConfig. What is the default RSSI threshold value?

    You can call this API:

    esp_err_t esp_mesh_set_rssi_threshold(const mesh_rssi_threshold_t *threshold);

  6. The IDF documentation states in Preferred Parent Node "The preferred parent node is determined based on the following criteria: Which layer the parent node candidate is situated on. The number of downstream connections (child nodes) the parent node candidate currently has". Does this mean RSSI is not part of the parent selection process?

    As with the fourth point, parent selection needs to consider RSSI, layer, and connections; the doc needs to be updated.

  7. Is it recommended to use self-organized networking, or for serious applications should I manually build the mesh network? I will only have a max of 10 mesh nodes altogether but they do a reasonable amount of MQTT5 communications to the cloud.

    You can use self-organized network.
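
To make answers 4 and 6 concrete, here is a toy model of a selection that weighs layer, child count, and RSSI together. It is only a sketch: the ordering (layer first, then fewest children, then strongest RSSI) and every name in it are assumptions for illustration, not the actual ESP-WIFI-MESH algorithm, which is not spelled out in this thread.

```c
#include <stddef.h>

/* Hypothetical view of one candidate parent; not an ESP-IDF type. */
typedef struct {
    int rssi;          /* dBm, as seen by the node that is choosing */
    int layer;         /* 1 = root */
    int children;      /* current downstream connections */
    int max_children;  /* capacity of the candidate's mesh AP */
} candidate_t;

/* Return the index of the preferred candidate, or -1 if none qualifies.
 * Assumed ordering: lowest layer, then fewest children, then best RSSI. */
int pick_parent(const candidate_t *c, size_t n, int rssi_floor)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (c[i].rssi < rssi_floor || c[i].children >= c[i].max_children) {
            continue; /* too weak or already full */
        }
        if (best < 0) {
            best = (int)i;
            continue;
        }
        const candidate_t *b = &c[best];
        if (c[i].layer != b->layer) {
            if (c[i].layer < b->layer) {
                best = (int)i;
            }
        } else if (c[i].children != b->children) {
            if (c[i].children < b->children) {
                best = (int)i;
            }
        } else if (c[i].rssi > b->rssi) {
            best = (int)i;
        }
    }
    return best;
}
```

Under this model a root heard at -72 dBm beats a layer-2 node heard at -35 dBm as long as -72 is above the floor, which is one way a "far away" parent can win even though RSSI is considered.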

@zhangyanjiaoesp
Collaborator

@michaelsimp I can reproduce the crash using the vscode, I will check the difference between VSCode and standard IDF

@michaelsimp
Author

Hi Zhangyanjiaoesp
This is excellent news for me. Hopefully it is just something simple you will find soon and be able to offer me a fix.
Thank you

@michaelsimp
Author

Hi Zhangyanjiaoesp

Thanks so much for taking the time to answer all my questions.

Please note THESE tests are with MY application (NOT with example program ip_internal_network) running on a network of 6 nodes - all ESP32-S3. My application has a CLI console integrated so I can trigger actions and see the responses on the COM port.

Q1 I will go back to just calling esp_mesh_stop() and see what happens.
To restart should I just be able to call ESP_ERROR_CHECK(esp_mesh_start());

Q2 Triggered from the CLI Console I was calling:

    ESP_ERROR_CHECK(esp_mesh_stop());
    ESP_ERROR_CHECK(esp_mesh_deinit());
    ESP_ERROR_CHECK(mesh_netifs_destroy());

Q3 wifiMeshStart() is also called from my CLI console

The CLI Console is started from my Mainline as is my WiFi Mesh application (built on top of the ip_internal_network source).
Triggering the Mesh Stop and Start would be called from the CLI Console thread. I am assuming this is ok and does not need any mutex protection. Please advise how I should call it if this is a problem.
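
Whether `esp_mesh_stop()` and `esp_mesh_start()` tolerate being called from an arbitrary task is not answered in this thread. Independently of that, the application can at least make sure it never issues a second start or stop while one is still in flight. The following is a minimal sketch using C11 atomics; all names (`mesh_ctl_t`, `mesh_ctl_begin_start`, and so on) are hypothetical application code, not ESP-IDF API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Application-side view of the mesh lifecycle (hypothetical). */
enum { CTL_STOPPED, CTL_STARTING, CTL_RUNNING, CTL_STOPPING };

typedef struct {
    atomic_int state;
} mesh_ctl_t;

/* Atomically move from one state to another; only one caller can win. */
static bool ctl_transition(mesh_ctl_t *ctl, int from, int to)
{
    int expected = from;
    return atomic_compare_exchange_strong(&ctl->state, &expected, to);
}

/* Call before esp_mesh_start(); proceed only if this returns true. */
bool mesh_ctl_begin_start(mesh_ctl_t *ctl)
{
    return ctl_transition(ctl, CTL_STOPPED, CTL_STARTING);
}

/* Call from the handler for MESH_EVENT_STARTED. */
void mesh_ctl_started(mesh_ctl_t *ctl)
{
    ctl_transition(ctl, CTL_STARTING, CTL_RUNNING);
}

/* Call before esp_mesh_stop(); proceed only if this returns true. */
bool mesh_ctl_begin_stop(mesh_ctl_t *ctl)
{
    return ctl_transition(ctl, CTL_RUNNING, CTL_STOPPING);
}

/* Call from the handler for MESH_EVENT_STOPPED. */
void mesh_ctl_stopped(mesh_ctl_t *ctl)
{
    ctl_transition(ctl, CTL_STOPPING, CTL_STOPPED);
}
```

A CLI command would then do `if (mesh_ctl_begin_stop(&ctl)) { esp_mesh_stop(); }` and refuse the request otherwise, so a restart typed while the previous stop is still completing is rejected instead of racing the mesh task.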

Q4 I understand this

Q5 Thanks

Q6 Thanks for clarifying this but I am not finding this to be the case. I distributed some MESH_Nodes across the office with the aim of creating a multi-hop network between the far extremes. But it does not form as expected or at all well for healthy RSSI. I have nodes which are close to my Mesh_Root or 2nd layer Mesh_Nodes which are not at parent capacity numbers. When my other Mesh_Nodes do connect to these parents they provide an RSSI on the child to the parent of -35dBm. But they most frequently want to connect to nodes a much longer distance away, getting an RSSI of < -70dBm.

I read somewhere the default RSSI threshold is -120dBm, but I am finding nodes with RSSIs < -70dBm often lose their MQTT connection to the broker. I have an office environment and I have located the nodes approx 10 to 20 meters apart with a max of one wall between, but they are not all line of sight. I very much doubt I could even get a connection at RSSI less than -100dBm. I am thinking it may be a signal to noise ratio issue, so I have scanned the office for WiFi channel usage and selected channel 1 on my WIFI Router as nothing else is using this channel and no other channels overlap. I know this is not an easy question to answer with precision and I appreciate the many influences, but realistically, what is the ballpark minimum RSSI at which I can expect a node to work reliably, and at what sort of distance?

Q7 Because of my Q6 response above, I have started evaluating the example project "manual_networking" to make my MESH_NODES manually scan and select MESH_NODES with the healthiest RSSI. It sounds like you are saying the IDF framework should already be doing this?
So I am now wondering if the vscode crashing issue is also causing this to not work properly and your fix might fix both.
Should I put manual scanning and parent selection changes to one side and wait for the outcome of the vscode crashing?
I guess I would prefer to use the self-organized network as much as possible, if it works as you describe.
Please advise / confirm manual scanning and parent selection should not be required and I might need you to look at the node parent selection for healthy RSSI next.

Best regards

@zhangyanjiaoesp
Collaborator

@michaelsimp

  1. According to the backtrace of the crash issue, it seems related to DHCP, it won't affect the mesh networking.
  2. I'm sorry I didn't quite understand your question regarding the selection of the parent node. Can you draw a picture to explain it? For example, where are nodes A, B, and C located? What level? How many child nodes are connected below? What is the RSSI of A, B, and C scanning each other? Do you expect A to connect to B but actually connect A to C?

@michaelsimp
Author

See attached:
MeshMaps.zip

"Target Network .png" shows the walls as black lines and nodes as circles. The blue lines are approx what I was expecting to see.

It forms very randomly but with bad choices like the file "Actual example Network.png" with links and dBm in red.

My project is configured for up to 50 nodes and 3 children per node as I wanted to force some layers.

@michaelsimp
Author

"According to the backtrace of the crash issue, it seems related to DHCP, it won't affect the mesh networking."

That may be the case with the crash issue you found, but the root cause of the vscode IDF environment may cause more than one issue. I guess you will know better when you get to the bottom of the vscode IDF environment issue

Do the diagrams I sent help you understand my issue better now?

@zhangyanjiaoesp
Collaborator

Yes, I now understand your question. Once the ROOT node is formed, the chances for other idle nodes to connect to the root node are the same; as long as they can scan the root node within the RSSI threshold range, they can connect and become second-layer nodes. Therefore, it is reasonable for C and D to connect to A and become second-layer nodes. What is the RSSI that B, E, and F receive from A? Since each node can connect to 3 child nodes, if they are within the RSSI threshold range and root A is not yet fully connected, at least one of B, E, or F should be able to connect to A.

I think you can call esp_mesh_set_rssi_threshold() to limit the RSSI threshold for optional parents, which would allow nearby nodes to connect as much as possible. However, nodes D and E are too far from node A. If you set the same RSSI threshold for all nodes, the connection results might still not meet your expectations, unless you configure different RSSI thresholds for each node. Alternatively, you could only call the esp_mesh_set_parent() function to specify the parent of each node.
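
Before giving each node its own threshold, it can help to check offline that the chosen floor still leaves that node at least one parent it can hear. The sketch below is such a check in plain C. The type, the function name, and the RSSI numbers in the usage note are all made up for illustration; it models the idea only and does not call the mesh API.

```c
#include <stddef.h>

/* One scan result as seen by the node being configured (hypothetical type). */
typedef struct {
    const char *name; /* label only, e.g. "A" */
    int rssi;         /* dBm */
} scan_entry_t;

/* Count scan entries that meet this node's own RSSI floor. */
size_t admissible_parents(const scan_entry_t *scan, size_t n, int node_floor)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (scan[i].rssi >= node_floor) {
            count++;
        }
    }
    return count;
}
```

For example, a node that hears A at -82 dBm and C at -48 dBm keeps only C with a floor of -60 dBm, and is left with no admissible parent at all if the floor is tightened to -40 dBm, which is the failure mode to avoid when tuning per-node values.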

Screenshot 2024-10-31 18 04 10

@michaelsimp
Author

Hi

In response to your last post:

At present I only consider the RSSI value.
The test in findClosestParent() does take into account nodes that already have their maximum number of children, and will not try to switch to them.

My thoughts were, I am not wanting to build the mesh network from scratch as I start with a self configured network. I am only planning to make changes to nodes with poor RSSIs. So far my tests have been successful network architecture wise (when I have a fixed ROOT so I don't get the broken mesh problem).

I appreciate what you are saying and will certainly do more testing and add more intelligence to the parent selection if necessary. I already send all my node network attributes to the MESH_ROOT, where I keep a table of all nodes with their parent, children, layer and RSSI. I could broadcast this to all nodes if necessary to enable smarter selection logic.

But I can't keep the manual scan and parent switch code while it causes my network to break which leaves me in a real predicament performance wise.

I really need a resolution to this as my priority.

Thanks for your ongoing help

@zhangyanjiaoesp
Collaborator

The definition of reason code 100/101 is here:

} mesh_disconnect_reason_t;

image

@zhangyanjiaoesp
Collaborator

I don't feel comfortable with this solution unless it is endorsed by you guys, but anyway after several successful cycles, it failed again with lots of:

You can use the reason code to categorize the issues, but this is not entirely reliable, as different scenarios may generate the same reason code.

I have tried this but no success.

Are you saying that using disconnected->reason == WIFI_REASON_BEACON_TIMEOUT for judgment is completely ineffective, or it can work but can't work as well as (disconnected->reason != WIFI_REASON_ASSOC_LEAVE) && (disconnected->reason != WIFI_REASON_NO_AP_FOUND) && (disconnected->reason != WIFI_REASON_AP_TSF_RESET)?

@zhangyanjiaoesp
Collaborator

  1. Do "test 2 1" through "test 2 4" refer to the four devices in a single round of testing?
  2. Where is test 3?
    image

@michaelsimp
Author

Yes, "test 2 1.txt" through "test 2 4.txt" were the 4 devices in a single round of testing.
Sorry, here are the missing test logs from yesterday:
Nov22.zip
Tests 1, 2, 3 were done on the software with:
disconnected->reason == WIFI_REASON_BEACON_TIMEOUT

Test 4 was done with:
(disconnected->reason != WIFI_REASON_ASSOC_LEAVE) && (disconnected->reason != WIFI_REASON_NO_AP_FOUND) && (disconnected->reason != WIFI_REASON_AP_TSF_RESET)

Are you saying that using disconnected->reason == WIFI_REASON_BEACON_TIMEOUT for judgment is completely ineffective, or it can work but can't work as well as (disconnected->reason != WIFI_REASON_ASSOC_LEAVE) && (disconnected->reason != WIFI_REASON_NO_AP_FOUND) && (disconnected->reason != WIFI_REASON_AP_TSF_RESET)

Neither works reliably. After 2 or 3 cycles, some nodes fail, go to MESH_IDLE and stop scanning, and the network is broken.

@zhangyanjiaoesp
Collaborator

I just reviewed the log for test3, and the device behavior is normal. The device being in the MESH_IDLE state is not permanent; it is a temporary state. Below is my analysis:

  1. At the beginning, the self-organizing network formed the following topology:
    root(53:d8) --- node A (39:d4)
    |--- node B (72:b8)
    |--- node C (5c:68)
  2. Node C runs a manual scan, selects A as the better parent, and becomes a layer-3 node (self-organized disabled, parent set).
  3. Node A runs a manual scan, still selects the root as the better parent, and remains a layer-2 node (self-organized disabled, parent set).
  4. The root is powered off.
  5. Node B detects that the root has left (beacon timeout, parent disconnected), enables self-organized mode, and becomes the root.
  6. Node A detects that the root has left (beacon timeout, parent disconnected) and enables self-organized mode. However, at that moment it was sending data and trying to reconnect, so when you queried it, the device was shown in the MESH_IDLE state. I believe that if the device remains in the idle state and cannot recover, then this is an issue. However, if there are no subsequent logs, I don't consider it a problem. You cannot expect the device to always be in a non-idle state whenever the application layer checks the mesh status.

@zhangyanjiaoesp
Collaborator

In the test4 log, the device eventually connected successfully.
image

The log you referred to is just a part of the intermediate process.
image

@zhangyanjiaoesp
Collaborator

According to your test log, I think disconnected->reason == WIFI_REASON_BEACON_TIMEOUT will work better than (disconnected->reason != WIFI_REASON_ASSOC_LEAVE) && (disconnected->reason != WIFI_REASON_NO_AP_FOUND) && (disconnected->reason != WIFI_REASON_AP_TSF_RESET), because there are too many disconnect reasons, and it is unreasonable to switch to self-organized mode whenever the reason is not 8, 201, or 206.
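The difference between the two conditions can be made concrete with a small stand-alone sketch. It does not use the ESP-IDF headers: the enum values are local stand-ins (8, 201 and 206 are the values quoted in this comment, and 200 for the beacon timeout matches the "reason:200" log line elsewhere in this thread), and the names narrow_rule, broad_rule and count_triggers are hypothetical helpers, not part of any Espressif API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the reason codes discussed in this thread.
 * In a real project, use the WIFI_REASON_* constants from the ESP-IDF
 * Wi-Fi headers instead of these assumed values. */
enum {
    REASON_ASSOC_LEAVE    = 8,
    REASON_BEACON_TIMEOUT = 200,
    REASON_NO_AP_FOUND    = 201,
    REASON_AP_TSF_RESET   = 206,
};

/* Narrow rule: only a beacon timeout re-enables self-organized mode. */
bool narrow_rule(int reason)
{
    return reason == REASON_BEACON_TIMEOUT;
}

/* Broad rule: every reason except the three excluded ones re-enables it. */
bool broad_rule(int reason)
{
    return reason != REASON_ASSOC_LEAVE &&
           reason != REASON_NO_AP_FOUND &&
           reason != REASON_AP_TSF_RESET;
}

/* Count how many of the observed reasons would trigger a given rule. */
size_t count_triggers(const int *reasons, size_t n, bool (*rule)(int))
{
    assert(rule != NULL);
    assert(n == 0 || reasons != NULL);
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (rule(reasons[i])) {
            hits++;
        }
    }
    return hits;
}
```

Fed with reasons that appear in the logs of this thread (139, 201, 201, 200), the narrow rule fires once and the broad rule fires twice, because the broad rule also reacts to 139. Neither rule reacts to 201, which is the reason reported in the repeating disconnect loop.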

@michaelsimp
Author

How long should it take for a MESH_IDLE node to find a parent again? I am sure I waited tens of seconds and it wasn't even scanning.
I am setting up for another run of tests where I will wait longer. I am just worried about my trace overflowing in PuTTY.

@zhangyanjiaoesp
Collaborator

In the test2 logs, I cannot analyze the entire network change process as I did with test3 because the log only contains part of the information. It is curious why such a reason would occur.
image

It seems that in test2_2 and test2_3, there was no opportunity to switch to the self-organized network, and the device kept trying to connect to the originally configured parent, but the parent could not be detected.

@michaelsimp
Author

Hi

Today I am getting problems where nodes get stuck in a loop logging forever. I can't keep my trace open long enough as I lose the start, but take my word for it please, once in this state it never comes out no matter how long (minutes). eg

I (00:02:07.516) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:139
W (00:02:07.528) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
W (00:02:07.529) aWifiMesh: <MESH_EVENT_ROUTING_TABLE_REMOVE>remove 1, new:3
I (128652) mesh: [wifi]disconnected reason:201(), continuous:1/max:12, non-root, vote(,stopped)<><>
I (00:02:07.644) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:07.645) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (128772) mesh: [wifi]disconnected reason:201(), continuous:2/max:12, non-root, vote(,stopped)<><>
I (00:02:07.769) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:07.769) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (128882) mesh: 1145[xrsp:1]the asked:19, max window:2, force to increase/decrease(up) xseqno:17 for child 48:ca:43:9b:5d:20, xrsp_seqno:14, heap:101160
I (128892) mesh: 1307[recv]cidx[0]48:ca:43:9b:5d:20 xseqno loss, current/new:15/19, in:17, out:17, pending:0
I (128892) mesh: [wifi]disconnected reason:201(), continuous:3/max:12, non-root, vote(,stopped)<><>
I (00:02:07.893) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:07.894) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (129022) mesh: [wifi]disconnected reason:201(), continuous:4/max:12, non-root, vote(,stopped)<><>
I (00:02:08.018) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:08.019) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (129142) mesh: [wifi]disconnected reason:201(), continuous:5/max:12, non-root, vote(,stopped)<><>
I (00:02:08.143) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:08.144) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1
I (129272) mesh: [wifi]disconnected reason:201(), continuous:6/max:12, non-root, vote(,stopped)<><>
I (00:02:08.268) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201
W (00:02:08.269) aWifiMesh: WiFi Disconnected
>>>last layer = 4, layer = -1

Nov 25.zip

I found I am the author of one way that this can happen, if I scan and switch parents manually. See test1 logs attached where
Node A MAC 48:27:e2:18:39:80 switches to node B 48:ca:43:9b:5d:20
Node B MAC 48:ca:43:9b:5d:20 switches to node A 48:27:e2:18:39:80

This is one cause of the above. I think I can fix this by checking that I am not switching the node's parent to one of its children.
I think it probably also makes sense to not swap to a parent node which has a higher layer than this node too.
But while not ideal that I am doing this, it shouldn't result in the node getting stuck in a disconnect loop?
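The two checks proposed here (never pick one of your own children, never pick a node on a deeper layer) could look roughly like the stand-alone sketch below. It is an illustration under assumptions, not code from the project: candidate_is_safe and bssid_belongs_to are hypothetical names, the list of descendants is assumed to come from somewhere like the node's routing table (for example via esp_mesh_get_routing_table()), and the "+1 in the last byte" comparison is an assumption based on the logs in this thread, where a parent's address appears one higher than that node's own MAC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* True if the scanned BSSID is the node's station MAC, or that MAC with the
 * last byte incremented (assumed here to be the node's softAP address). */
bool bssid_belongs_to(const uint8_t bssid[MAC_LEN], const uint8_t sta_mac[MAC_LEN])
{
    uint8_t ap_mac[MAC_LEN];
    memcpy(ap_mac, sta_mac, MAC_LEN);
    ap_mac[MAC_LEN - 1] = (uint8_t)(ap_mac[MAC_LEN - 1] + 1);
    return memcmp(bssid, sta_mac, MAC_LEN) == 0 ||
           memcmp(bssid, ap_mac, MAC_LEN) == 0;
}

/* Reject a candidate parent that lives in this node's own sub-tree or that
 * sits on a deeper (numerically higher) layer than this node. */
bool candidate_is_safe(const uint8_t bssid[MAC_LEN], int candidate_layer, int my_layer,
                       const uint8_t (*descendants)[MAC_LEN], size_t n_descendants)
{
    assert(bssid != NULL);
    assert(n_descendants == 0 || descendants != NULL);

    if (candidate_layer > my_layer) {
        return false;               /* deeper than this node */
    }
    for (size_t i = 0; i < n_descendants; i++) {
        if (bssid_belongs_to(bssid, descendants[i])) {
            return false;           /* would create a parent/child cycle */
        }
    }
    return true;
}
```

With such a check in place, node A would refuse node B as a parent while B is still listed among A's descendants, which is the mutual switch seen in the test 1 logs.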

But Test 2 looks like the same problem, and it is not triggered by the above. The MESH_ROOT is powered off. There is a panic crash on MESH_NODE 48:ca:43:9b:5d:20, which still happens from time to time, but my bigger concern is that after this, MESH_NODE MAC 48:27:e2:18:39:80 gets stuck in the disconnect loop.

Are you able to reproduce these problems with the ip_internal_network you modified a week or so back? I get lost in all of this and feel we would make better progress if you were able to test, analyze and debug directly.

@michaelsimp
Author

Hi again
Regarding your analysis of test 3 specifically node A where you said.
Node A detects that the root has left (beacon timeout, parent disconnected) and enables self-organized mode. However, at that moment it was sending data and trying to reconnect, so when you queried it, the device was shown in the MESH_IDLE state. I believe that if the device remains in the idle state and cannot recover, then this is an issue. However, if there are no subsequent logs, I don't consider it a problem. You cannot expect the device to always be in a non-idle state whenever the application layer checks the mesh status.
The disconnect came at 00:01:18
I (00:01:18.371) aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:200
I stopped the log after 00:01:58, 40 seconds later and nothing was happening, no visible signs of scanning for a parent.
I am sure that when this occurs, no matter how long I leave it, it does not recover.
Also when it gets stuck in the disconnect loop, logs the repetitive sequence (above) indefinitely.

@michaelsimp
Author

michaelsimp commented Nov 25, 2024

Two posts back I wrote:

I found I am the author of one way that this can happen, if I scan and switch parents manually. See test1 logs attached where
Node A MAC 48:27:e2:18:39:80 switches to node B 48:ca:43:9b:5d:20
Node B MAC 48:ca:43:9b:5d:20 switches to node A 48:27:e2:18:39:80

This is one cause of the above. I think I can fix this by checking that I am not switching the node's parent to one of its children.
I think it probably also makes sense to not swap to a parent node which has a higher layer than this node too.
But while not ideal that I am doing this, it shouldn't result in the node getting stuck in a disconnect loop?

When I looked at the code, I am finding it difficult to decipher the variables

  • parent_record is the best parent candidate found so far
  • assoc I think is each node found in the scan esp_mesh_scan_get_ap_record(&record, &assoc); is this correct?
  • parent_assoc I am not clear on what this is or how it gets set.
    It is initialized to: mesh_assoc_t parent_assoc = { .layer = CONFIG_MESH_MAX_LAYER, .rssi = -120 }; as a worst case record
    and updated to contents of assoc when a better parent is found

the original source (taken from the example project "manual_networking") seems to be already checking :
if (assoc.layer < parent_assoc.layer || assoc.layer2_cap < parent_assoc.layer2_cap) {
But I am not sure if this stops a MESH_NODE selecting a child as a new parent, or do I need to add the line:
if (esp_mesh_get_layer() >= assoc.layer)

Could you take a look at the test 1 logs as they do appear to be setting parent to each other.

The entire routine is currently as follows if you could check and make any changes please.

void findClosestParent(int num) { // after a WiFi scan
    ESP_LOGW(TAG, "findClosestParent  Current RSSI: %d", currentRSSI);
    int i;
    int ie_len = 0;
    mesh_assoc_t assoc;
    mesh_assoc_t parent_assoc = { .layer = CONFIG_MESH_MAX_LAYER, .rssi = -120 };
    wifi_ap_record_t record;
    wifi_ap_record_t parent_record = { 0, };
    parent_record.rssi = currentRSSI; // has to be better than current RSSI to change parent
    bool parent_found = false;
    mesh_type_t my_type = MESH_IDLE;
    int my_layer = -1;
    wifi_config_t parent = { 0, };
    wifi_scan_config_t scan_config = { 0 };

    for (i = 0; i < num; i++) { // iterate through scan records looking for eligible closer parent node
        ESP_ERROR_CHECK(esp_mesh_scan_get_ap_ie_len(&ie_len));
        ESP_ERROR_CHECK(esp_mesh_scan_get_ap_record(&record, &assoc));
        ESP_LOGD(TAG, "ie_len: %d  sizeof(assoc): %d", ie_len, sizeof(assoc));
        if (ie_len == sizeof(assoc)) {
            ESP_LOGI(TAG,
                     "<MESH>[%d]%s, layer:%d/%d, assoc:%d/%d, %d, "MACSTR", channel:%u, rssi:%d, ID<"MACSTR"><%s>",
                     i, record.ssid, assoc.layer, assoc.layer_cap, assoc.assoc, assoc.assoc_cap, assoc.layer2_cap, MAC2STR(record.bssid),
                     record.primary, record.rssi, MAC2STR(assoc.mesh_id), assoc.encrypted ? "IE Encrypted" : "IE Unencrypted");

            // ESP_LOGI(MESH_TAG, "Type: %d  layer_cap %d:  assoc %d  assoc_cap: %d  rssi: %d", assoc.mesh_type, assoc.layer_cap, assoc.assoc, assoc.assoc_cap, record.rssi);
            if (assoc.mesh_type != MESH_IDLE && assoc.layer_cap && assoc.assoc < assoc.assoc_cap) { 
                // ESP_LOGI(MESH_TAG, "assoc.layer: %d  parent_assoc.layer %d:  assoc.layer2_cap %d  parent_assoc.layer2_cap: %d", assoc.layer, parent_assoc.layer, assoc.layer2_cap, parent_assoc.layer2_cap);
                if (assoc.layer < parent_assoc.layer || assoc.layer2_cap < parent_assoc.layer2_cap) {
                    if (record.rssi > parent_record.rssi) { // closer parent found
                        if (memcmp(parent_record.bssid, record.bssid, MAC_SIZE) != 0) { // dont switch to same parent
                            ESP_LOGW(TAG, "Closer Parent found: %s  RSSI: %d", record.ssid, record.rssi);
                            parent_found = true;
                            memcpy(&parent_record, &record, sizeof(record));
                            memcpy(&parent_assoc, &assoc, sizeof(assoc));
                            if (parent_assoc.layer_cap != 1) {
                                my_type = MESH_NODE;
                            } else {
                                my_type = MESH_LEAF;
                            }
                            my_layer = parent_assoc.layer + 1;
                            // break; // MSB removed, keep searching for the closest parent
                        }
                    }
                }
            }
        } else {
            ESP_LOGD(TAG, "[%d]%s, "MACSTR", channel:%u, rssi:%d", i, record.ssid, MAC2STR(record.bssid), record.primary, record.rssi);
        }
    }

    esp_mesh_flush_scan_result();
    if (parent_found) { // parent: Both channel and SSID of the parent are mandatory
        parent.sta.channel = parent_record.primary;
        memcpy(&parent.sta.ssid, &parent_record.ssid, sizeof(parent_record.ssid));
        parent.sta.bssid_set = 1;
        memcpy(&parent.sta.bssid, parent_record.bssid, 6);
        if ((my_type == MESH_NODE) || (my_type == MESH_LEAF) || (my_type == MESH_IDLE)) {
            ESP_ERROR_CHECK(esp_mesh_set_ap_authmode(parent_record.authmode));
            if (parent_record.authmode != WIFI_AUTH_OPEN) {
                memcpy(&parent.sta.password, CONFIG_MESH_AP_PASSWD, strlen(CONFIG_MESH_AP_PASSWD));
            }
            ESP_LOGW(TAG,
                     "<PARENT>%s, layer:%d/%d, assoc:%d/%d, %d, "MACSTR", channel:%u, rssi:%d",
                     parent_record.ssid, parent_assoc.layer,
                     parent_assoc.layer_cap, parent_assoc.assoc,
                     parent_assoc.assoc_cap, parent_assoc.layer2_cap,
                     MAC2STR(parent_record.bssid), parent_record.primary,
                     parent_record.rssi);
            esp_err_t err = esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer);
            switchParentTimer = currentTimeMs(); // reset timer for event <MESH_EVENT_PARENT_DISCONNECTED>
            if (err != ESP_OK) {
                ESP_LOGE(TAG, "esp_mesh_set_parent Error %d  my_type: %d  my_layer: %d", err, my_type, my_layer);
            }
            selfOrganizeReactivateTimer = SELF_ORGANIZE_REACTIVATE_TIME; // start self organize reactivation timer
        }
    } else {
        ESP_LOGE(TAG, "No eligible closer Parent found");
        if (currentRSSI == NO_RSSI) { // scan again if no connection yet
            esp_mesh_set_self_organized(false, false);
            esp_wifi_scan_stop();
            scan_config.show_hidden = 1;
            scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
            esp_wifi_scan_start(&scan_config, 0);
        }
    }
}

@michaelsimp
Author

michaelsimp commented Nov 25, 2024

Hi

By the way, all of yesterday's and today's tests and logs were made with your recommendation of only using
disconnected->reason == WIFI_REASON_BEACON_TIMEOUT

I have been testing by stacking the 4 nodes across 4 layers and powering off the node on layer 2 rather than the MESH_ROOT, as this provides a cleaner set of logs.
Test 1 Nov 26.zip

See test 1

Layer 1 MESH_ROOT 48:ca:43:9b:53:d8
NODE A Layer 2 48:27:e2:18:39:80
NODE B Layer 3 48:ca:43:9b:54:c0
NODE C Layer 4 48:ca:43:9b:5d:20

Then power down NODE A on layer 2

NODE B switched from layer 3 to layer 2 and parent from NODE A to MESH_ROOT - perfect
NODE C stayed on layer 4 with parent 48:ca:43:9b:54:c1 which is now on layer 2, and does not show in the

Is this valid?
Node B moved from layer 3 to 2 when its parent dropped. Why did Node C not move to layer 3 ?

It stayed like this for minutes while I wrote this up

Then I powered off the MESH_ROOT, see test 1 MESH_ROOT line 469. This node 48:ca:43:9b:53:d8 now becomes a MESH_NODE and a child of Node B.

See test 1 Node B.txt line 2038
NODE B which was on layer 2 connected to MESH_ROOT goes to MESH_IDLE with 2 children Node C and the old MESH_ROOT 48:ca:43:9b:53:d8

Remains broken like this indefinitely.

@zhangyanjiaoesp
Collaborator

  1. Regarding the issue of the log infinite loop (aWifiMesh: <MESH_EVENT_PARENT_DISCONNECTED>reason:201), I have already explained it in my previous comment.

    It seems that in test2_2 and test2_3, there was no opportunity to switch to the self-organized network, and the device kept trying to connect to the originally configured parent, but the parent could not be detected.

    Maybe we should first investigate why the specified parent node cannot be found at this point. Is it due to a power failure, has it become idle, or is there another underlying reason?

Are you able to reproduce these problems with the ip_internal_network you modified a week or so back? I get lost in all of this and feel we would make better progress if you were able to test, analyze and debug directly.

Sorry, I can't reproduce your issue on my side.

  1. I have already discussed this with you in my previous comment: when selecting a better parent node, what criteria do you prioritize? I believe you can completely disregard the conditions in the example and instead design your own criteria based on your specific needs. First, you can move the definitions of parent_assoc and parent_record outside,
    image

then update the parent_assoc->layer when connecting to the parent.
image

Before scanning, retrieve the current parent information.
image

Finally, within the findClosestParent() function, design the criteria for selecting a better parent based on your requirements and the issues encountered during testing.

My thoughts were, I am not wanting to build the mesh network from scratch as I start with a self configured network. I am only planning to make changes to nodes with poor RSSIs. So far my tests have been successful network architecture wise (when I have a fixed ROOT so I don't get the broken mesh problem).

You mentioned that you don't want to rebuild the network from scratch, but instead, you want to adjust the initial network formed by the self-organizing process. However, during the actual testing, I've observed that you often call scan at the application layer while the initial network is still being formed, which forcibly interrupts the self-organizing process.
image

So, when you call scan at the application layer, is it completely random? Would it make sense to first check whether the initial network has been fully formed before manually triggering the scan?
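Nothing in this thread defines what "fully formed" means, so the following is only one possible application-level interpretation, sketched under assumptions: a manual scan is allowed only when the node currently has a parent and no connect or disconnect event has occurred for a settle period. The names link_state_t, link_on_event and scan_allowed are hypothetical, and the settle time is a value the application would have to choose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal link state the application could maintain from its mesh event
 * handler (updated on parent connected / parent disconnected events). */
typedef struct {
    bool     parent_connected;  /* true while a parent link is up */
    uint32_t last_change_ms;    /* time of the last connect or disconnect */
} link_state_t;

/* Record a connect (connected = true) or disconnect (connected = false). */
void link_on_event(link_state_t *s, bool connected, uint32_t now_ms)
{
    assert(s != NULL);
    s->parent_connected = connected;
    s->last_change_ms = now_ms;
}

/* Allow a manual scan only if the link is up and has been quiet for at
 * least settle_ms. The unsigned subtraction stays correct across a wrap
 * of the millisecond counter. */
bool scan_allowed(const link_state_t *s, uint32_t now_ms, uint32_t settle_ms)
{
    assert(s != NULL);
    if (!s->parent_connected) {
        return false;
    }
    return (uint32_t)(now_ms - s->last_change_ms) >= settle_ms;
}
```

This is stricter than testing only whether the node type is not MESH_IDLE, because a node that has just connected or is flapping between connect and disconnect would be held back until the link has been stable for a while.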

I believe we must first resolve the issues mentioned in points 3 and 4 before proceeding with further problem analysis. If the initial logic framework isn't properly established, it could lead to a range of unforeseen issues down the line, which would be quite painful for me to handle.

@michaelsimp
Author

  1. The problem I am trying to solve all the way through this is triggered by the MESH_ROOT or the parent of a node being powered off or rebooted. So yes definitely I have been doing both powering off and rebooting the parent and both trigger this issue. So this is why parent can not be found at this point, the question is how to fix it. This can and will happen in the field.
    The latest catch of one of these today, I powered off the parent on layer 2 and the MESH_NODE on layer 3 continued to report this see attached. This was made after the changes documented in point 3 below.
    MESH_EVENT_PARENT_DISCONNECTED.txt
    So the symptoms are continuous <MESH_EVENT_PARENT_DISCONNECTED> events OR nothing. Node just sits in MESH_IDLE state and does nothing.

  2. I have made the changes you suggested, but as you commented this would stop my nodes finding a better parent. So I modified the event <MESH_EVENT_PARENT_CONNECTED> to:
    parent_assoc.layer = mesh_layer; // MBS consider can switch to parent on same level or better

and function:

void findClosestParent(int num) { // after a WiFi scan
...
                if (assoc.layer <= parent_assoc.layer || assoc.layer2_cap < parent_assoc.layer2_cap) { // parents on same layer or better qualify
...

I understand this could move a node (numerically) up one layer, but I think this is acceptable: in my testing, it is much better to sit one layer further back with a stronger RSSI. Do you understand and accept this strategy, at least as far as not causing the broken network issue?
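The strategy described here can be written as a small stand-alone selection function. This is a simplified sketch, not the project's findClosestParent(): candidate_t and pick_parent are hypothetical names, the layer2_cap term of the original condition is left out, and each candidate's layer is compared against a fixed value for this node's own layer rather than against a record that is overwritten during the loop.

```c
#include <assert.h>
#include <stddef.h>

/* The fields of a scan result that matter for this selection. */
typedef struct {
    int layer;      /* candidate's layer */
    int assoc;      /* children currently attached to the candidate */
    int assoc_cap;  /* maximum number of children the candidate accepts */
    int rssi;       /* signal strength seen in the scan, in dBm */
} candidate_t;

/* Return the index of the strongest candidate that has a free child slot,
 * is on the same layer as this node or closer to the root, and beats the
 * RSSI of the current parent link. Returns -1 if no candidate qualifies. */
int pick_parent(const candidate_t *c, size_t n, int my_layer, int current_rssi)
{
    assert(n == 0 || c != NULL);
    int best = -1;
    int best_rssi = current_rssi;   /* a new parent must beat the current link */

    for (size_t i = 0; i < n; i++) {
        if (c[i].assoc >= c[i].assoc_cap) {
            continue;               /* already at capacity */
        }
        if (c[i].layer > my_layer) {
            continue;               /* deeper than this node */
        }
        if (c[i].rssi <= best_rssi) {
            continue;               /* not stronger than the best so far */
        }
        best = (int)i;
        best_rssi = c[i].rssi;
    }
    return best;
}
```

If a candidate is chosen, the node's new layer would be that candidate's layer plus one, which is how a node can end up one layer further from the root in exchange for a stronger link.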

  1. My background task only scans for a better parent, if (esp_mesh_get_type() != MESH_IDLE), as I assumed this meant the network was established.
    As mentioned earlier, I have been manually triggering the scan from my console (my main task) when I consider the network to be established based on the event ip_event_handler triggering which calls my printMeshInfo(); function.
    Two questions:

1 You said Would it make sense to first check whether the initial network has been fully formed before manually triggering the scan? What is a suitable test to determine the network is fully formed?
2 Am I potentially doing too much in the event_handler functions? Is there any restriction on how long they can run (within reason of course)?
3 How long should it take for a node which is disconnected or MESH_IDLE to start to search for a parent again, or to scan for the MESH_ROOT if the MESH_ROOT disappears? Just roughly, e.g. 10 seconds, 30 seconds, longer? I have waited tens of minutes while writing up logs etc...
Also once the network is broken and not able to fix itself, I sometimes execute the following commands from my console to try and fix it.
wifi scan - triggers my wifiMeshScan() function to look for a new parent. But only if the node is not MESH_IDLE.
wifi root - triggers my wifiMeshRoot() function which calls ESP_ERROR_CHECK(esp_mesh_set_self_organized(true, true));
But neither of these has ever fixed it.

@zhangyanjiaoesp
Collaborator

  1. For the <MESH_EVENT_PARENT_DISCONNECTED>reason:201 loop issue, is it possible to add a timer or counter, so that once the time or count exceeds a certain threshold, the function esp_mesh_set_self_organized(true, true) is called to reselect the parent?
  2. How about directly comparing with mesh_layer ?
void findClosestParent(int num) { // after a WiFi scan
...
                if (assoc.layer <= mesh_layer || assoc.layer2_cap < parent_assoc.layer2_cap) { // parents on same layer or better qualify
...
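The counter suggested in point 1 could be sketched as a tiny stand-alone helper like the one below. It is only an illustration: disconnect_watchdog_t and the watchdog_* functions are hypothetical names, the limit is a value the application would have to choose, and the call to esp_mesh_set_self_organized(true, true) is left to the caller when the helper reports that the limit has been reached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts consecutive parent-disconnect events since the last connect. */
typedef struct {
    int consecutive;  /* disconnects seen since the last successful connect */
    int limit;        /* threshold chosen by the application */
} disconnect_watchdog_t;

void watchdog_init(disconnect_watchdog_t *w, int limit)
{
    assert(w != NULL && limit > 0);
    w->consecutive = 0;
    w->limit = limit;
}

/* Call on every parent-disconnected event. Returns true when the limit is
 * reached; the caller would then re-enable self-organized networking. The
 * counter restarts so that the watchdog can fire again if the loop persists. */
bool watchdog_on_disconnect(disconnect_watchdog_t *w)
{
    assert(w != NULL);
    w->consecutive++;
    if (w->consecutive >= w->limit) {
        w->consecutive = 0;
        return true;
    }
    return false;
}

/* Call on every parent-connected event: the link recovered by itself. */
void watchdog_on_connect(disconnect_watchdog_t *w)
{
    assert(w != NULL);
    w->consecutive = 0;
}
```

A time-based variant would work the same way, with a timestamp of the first disconnect in place of the counter.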

My background task only scans for a better parent, if (esp_mesh_get_type() != MESH_IDLE), as I assumed this meant the network was established.

Why does this phenomenon occur if you only call scanning when it is not idle? It's obvious that the device hasn't been connected to WiFi yet.
image
4.

How long should it take for a node which is disconnected or MESH_IDLE to start to search for a parent again, or to scan for the MESH_ROOT if the MESH_ROOT disappears?

If the device is in self-organized network, it will attempt to reconnect the original parent for at least 6 seconds before selecting a new parent.
5.

wifi root - triggers my wifiMeshRoot() function which calls ESP_ERROR_CHECK(esp_mesh_set_self_organized(true, true));

Have you ever called this command during the testing process? I don't think I saw this command in the log.

wifi scan - triggers my wifiMeshScan() function to look for a new parent. But only if the node is not MESH_IDLE.

This scan function cannot cover all situations because it relies on a single selection criterion. Perhaps we can add more checks?

@michaelsimp
Author

  1. Yes I am happy to add this if it worked. But as I said for now I can invoke esp_mesh_set_self_organized(true, true) manually from my console command "mesh root". See logs attached, where I rebooted the MESH_ROOT to start the problems:
    Nov 29.zip

test 1 mesh root
line 378 Reboot the mesh root

test 1 became 2nd root
Line 524 the disconnect loop started.
Line 946 invoke wifi root command
Node becomes MESH_ROOT but there already is a MESH_ROOT

test 1 became node on layer 5
Line 126 starts logging I (65677) wifi:>>>intv = 102400, max = 0. I don't know what this means
Line 437 disconnect loop starts
line 1621 run wifi root command
Node becomes MESH_NODE on layer 5 to parent 48:ca:43:9b:53:d9 which is on layer 2

test 1 node 48ca439b53d8
for your reference. Ends up mesh node on layer 2 to the second MESH_ROOT

  1. You said "How about directly comparing with mesh_layer ?"
    I don't understand what you mean here. Please elaborate

  2. You said "Why does this phenomenon occur if you only call scanning when it is not idle? It's obvious that the device hasn't been connected to WiFi yet."
    It is not obvious to me. I don't know what half the logs mean reported from "wifi" and "mesh"
    I only run wifi scan when the node looks stable (not logging a whole lot of stuff) and it is reporting MESH_NODE
    Can you please advise a suitable test I can implement to determine if it is safe to scan for closer parents.

  3. My test logs clearly show I have waited much longer than 6 seconds for a node type MESH_IDLE to take some action - more like up to 60 seconds before I give up.

  4. See notes above and logs running "mesh root" to manually trigger esp_mesh_set_self_organized(true, true)

You said "This scan function cannot cover all situations because it relies on a single selection criterion. Perhaps we can add more checks?"
Can you please propose suitable tests.

I suspect the reason you can't reproduce the problem using your modified version of "ip_internal_network" is that you are only looking for new parents on higher layers and so you don't find one to switch to. You raised this with me and your code was only looking for layer -1
image
Can you please send me the source file (not just the patch file) so I can play with it.
I would also appreciate it if you could try this while looking for a closer parent on the same or a lower (numeric) layer, so that it actually finds one and calls esp_err_t err = esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer);
I suspect this breaks something which triggers the problems later when the parent or MESH_ROOT either power off or reboot causing combinations of the following:

  • Repeating <MESH_EVENT_PARENT_DISCONNECTED> event loop
  • Nodes stuck indefinitely on type MESH_IDLE
  • Nodes sitting on layers more than one level up from the parent eg node on layer 5 with parent on layer 2
  • Nodes connected to parent which is MESH_IDLE
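Two of the symptoms in this list (a node whose layer does not follow its parent's layer, and a node attached to a MESH_IDLE parent) can be detected mechanically from a table of nodes such as the one kept on the MESH_ROOT. The sketch below is an assumption-laden illustration: node_entry_t, check_entry and the enum values are hypothetical and do not mirror the ESP-IDF types, and it relies on the convention seen in this thread that the root is on layer 1 and a child is on its parent's layer plus one.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the node types named in this thread. */
typedef enum { T_IDLE, T_ROOT, T_NODE, T_LEAF } node_type_t;

/* One row of an application-level topology table. */
typedef struct {
    int         parent;  /* index of the parent row, or -1 for no parent */
    int         layer;   /* layer reported by the node */
    node_type_t type;    /* type reported by the node */
} node_entry_t;

typedef enum { TOPO_OK, TOPO_BAD_PARENT, TOPO_IDLE_PARENT, TOPO_BAD_LAYER } topo_status_t;

/* Check row i against its parent row. */
topo_status_t check_entry(const node_entry_t *table, size_t n, size_t i)
{
    assert(table != NULL && i < n);
    const node_entry_t *e = &table[i];

    if (e->parent < 0) {
        /* A row without a parent is expected to be the layer-1 root. */
        return (e->layer == 1) ? TOPO_OK : TOPO_BAD_LAYER;
    }
    if ((size_t)e->parent >= n || (size_t)e->parent == i) {
        return TOPO_BAD_PARENT;     /* unknown parent or self-reference */
    }
    const node_entry_t *p = &table[e->parent];
    if (p->type == T_IDLE) {
        return TOPO_IDLE_PARENT;    /* attached to a parent that is idle */
    }
    if (e->layer != p->layer + 1) {
        return TOPO_BAD_LAYER;      /* e.g. layer 5 under a layer-2 parent */
    }
    return TOPO_OK;
}
```

Running such a check after each topology report would at least flag the broken states automatically instead of relying on manual log inspection.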

@zhangyanjiaoesp
Collaborator

  1. From the Nov 29.zip log, running "mesh root" to manually trigger esp_mesh_set_self_organized(true, true) seems to work well. You can then use it to stop the 201 loop.
  2. I (65677) wifi:>>>intv = 102400, max = 0 this log was added as a debug log while addressing the crash issue. The solution to the crash has already been merged into release v5.3, but it has not yet been synced to GitHub. Once it is synced, you can update your IDF version.

You said "How about directly comparing with mesh_layer ?"
I don't understand what you mean here. Please elaborate

In your last comment, you let parent_assoc.layer = mesh_layer and in findClosestParent() function, you use assoc.layer <= parent_assoc.layer for comparison. So I suggest using mesh_layer for the comparison instead.

void findClosestParent(int num) { // after a WiFi scan
...
                if (assoc.layer <= mesh_layer || assoc.layer2_cap < parent_assoc.layer2_cap) { // parents on same layer or better qualify
...
  1. The mesh network by default allows multiple roots. However, users can call the esp_mesh_allow_root_conflicts(false) to disable this feature, and once configured, only a single root will be allowed to exist.
  2. Here is my mesh_main.c
    mesh_main_1203.zip

@michaelsimp
Author

Hi

Thanks for your ongoing work on this. Very much appreciated as I am in deep with this contract.

  1. Yes it works in terms of breaking out of the 201 loop, but sometimes it results in the software locking up.
  2. Are you saying you have made some more changes to address the crash issue, which I don't have yet? I am currently running ESP IDF ver 5.3.1. Will the update be ver 5.3.2 or later so I can recognize it? Do you have an approx date for release?
  3. Thank you for clarifying I will correct this
  4. I have added esp_mesh_allow_root_conflicts(false) to my WiFi Mesh configuration.
  5. Thanks for this, but if there are more updates which fix the crash (per item 2), I might wait for them to be released in interest of saving time.

Can I ask another unrelated question please.
I want to use a 4G modem on the USB bus of the ESP32-S3. I found an example called usb_cdc_4g_module in an Espressif Git repository, https://github.com/espressif/esp-iot-solution. But I don't know how esp-iot-solution relates to IDF.
https://docs.espressif.com/projects/esp-iot-solution/en/latest/ states, "ESP-IoT-Solution contains device drivers and code frameworks for the development of IoT system, which works as extra components of ESP-IDF and much easier to start."
But I have found that esp-iot-solution seems to have a parallel overlapping directory structure of components and examples to IDF.
I copied usb_cdc_4g_module project and I tried to build it using IDF 5.3.1 but the USB modem component was missing, so I manually installed this. Now it builds but crashes soon after startup.
I don't want to enter a debugging session for this in this ticket, but I would appreciate it if you could explain how esp-iot-solution and IDF relate to each other?
Will the USB modem components be released under IDF? If so, roughly when?
If not, should I
The "Getting started" page states that the latest version "master" corresponds to IDF ver >=4.4, so it looks like it is not updated as often as IDF. So for me, switching from IDF v5.3.1 to esp-iot-solution does not look viable.
I would appreciate any advice on how I can connect to a 4G modem using USB.

@zhangyanjiaoesp
Collaborator

zhangyanjiaoesp commented Dec 3, 2024

2. Are you saying you have made some more changes to address the crash issue, which I don't have yet? I am currently running ESP IDF ver 5.3.1. Will the update be ver 5.3.2 or later so I can recognize it? Do you have an approx date for release?

Yes, v5.3.2 will contain the fix. (The fix in v5.4 is here.) As for the log I (65677) wifi:>>>intv = 102400, max = 0, please refer to the previous comment: it solved the node crash issue. I hope you remember it.

I have added esp_mesh_allow_root_conflicts(false) to my WiFi Mesh configuration.

In a self-organized mesh network, if multiple roots are not allowed, when more than one root is created, the two roots will compare their RSSI and capacity values. The one with the better metrics will continue to act as the root, while the other will relinquish its root role and reconnect to the network.
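The comparison described above can be sketched in plain C. This is only an illustration under stated assumptions: the real decision is made inside the ESP-WIFI-MESH library, and the type `root_info_t`, the function names, the "RSSI first, then capacity" ordering, and the idea that a larger capacity value is better are all hypothetical, not the library's API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the metrics one root advertises. */
typedef struct {
    int8_t router_rssi; /* RSSI towards the router in dBm; higher is better */
    int capacity;       /* assumed: a larger value is better */
} root_info_t;

/* Returns true if `a` should keep the root role when it meets `b`.
 * Assumed ordering: compare RSSI first, use capacity as the tie-breaker. */
static bool root_keeps_role(const root_info_t *a, const root_info_t *b)
{
    if (a->router_rssi != b->router_rssi) {
        return a->router_rssi > b->router_rssi;
    }
    return a->capacity >= b->capacity;
}

/* Returns the root that stays root; the other one would relinquish its
 * role and reconnect to the network as an ordinary node. */
static const root_info_t *resolve_root_conflict(const root_info_t *a,
                                                const root_info_t *b)
{
    return root_keeps_role(a, b) ? a : b;
}
```

With these assumptions, a root at -40 dBm keeps its role against one at -60 dBm regardless of capacity, and capacity only matters when both roots see the same RSSI.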

@zhangyanjiaoesp
Collaborator

For the ESP-IoT-Solution issue, please create a new issue under the esp-iot-solution project; the colleague in charge of it will reply to you there.

@Xiehanxin

Hi @michaelsimp, after you set IDF_PATH, you can build the esp-iot-solution examples directly from within the esp-iot-solution directory.

@michaelsimp
Author

Hi

I don't think we understand each other's position.

  1. Are you saying you have made some more changes to address the crash issue, which I don't have yet? I am currently running ESP IDF ver 5.3.1. Will the update be ver 5.3.2 or later so I can recognize it? Do you have an approx date for release?

Yes, the v5.3.2 will contain the fix. (The fix in v5.4 is here). For the I (65677) wifi:>>>intv = 102400, max = 0, please refer to the #14720 (comment), it solved the node crash issue, hope you have an impression of this.

You refer back to a post you made a couple of weeks ago, which said:

@michaelsimp
Yes, I added a fix to the wifi lib I provided to you. wifi:>>>intv = 102400, max = 0 this log was added by me to locate the problem. I'm glad to hear that the problem has been solved.

Have you tested the router reboot case? Does the root node crash still exist?

But I am confused, because since then I have posted heaps of logs which still show multiple problems with the mesh network when the parent is powered off or rebooted, including:
A) the 201 loop. esp_mesh_set_self_organized(true, true) works in terms of breaking out of the 201 loop, but sometimes it results in the software locking up. This is still a serious problem for me.
B) nodes locking up
C) the child going to MESH_IDLE and not recovering or attempting to find a new parent. My attempts to trigger a scan for a new parent using

                esp_mesh_set_self_organized(false, false);
                esp_wifi_scan_stop();
                scan_config.show_hidden = 1;
                scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
                esp_wifi_scan_start(&scan_config, 0);

... result in lock up.

My posts continue through to the end of last week and then yesterday you stated

"The solution to the crash has already been merged into release v5.3, but it has not yet been synced to GitHub. Once it is synced, you can update your IDF version."

So if the latest update you have made was back before #14720 (comment), you need to understand that my problem is most certainly not fixed.
I think you reported that this was done when looking for a parent on a lower numeric layer. In the tests both you and I did, no such nodes were found, so it never ran the code
esp_err_t err = esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer);
which I think is the trigger for breaking the mesh network.
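To make the "lower numeric layer" condition concrete, here is a minimal host-side sketch of that kind of parent selection. It is an assumption-laden illustration, not the code from my project or from IDF: `candidate_t`, `pick_parent`, and the -70 dBm `RSSI_FLOOR` are hypothetical names and values, and the fields only loosely mirror what a mesh scan reports.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one mesh node found by a scan. */
typedef struct {
    int layer;      /* layer of the candidate (1 = root) */
    int layer_cap;  /* layers still available below it; 0 = none */
    int assoc;      /* children currently attached */
    int assoc_cap;  /* maximum number of children */
    int8_t rssi;    /* signal strength in dBm */
    bool idle;      /* candidate is itself idle and cannot act as a parent */
} candidate_t;

#define RSSI_FLOOR (-70) /* assumed minimum usable signal strength */

/* Returns the index of the eligible candidate on the lowest numeric layer,
 * or -1 if none qualifies. my_layer == 0 means "no layer assigned yet". */
int pick_parent(const candidate_t *c, size_t n, int my_layer)
{
    int best = -1;
    int best_layer = my_layer;

    for (size_t i = 0; i < n; i++) {
        if (c[i].idle || c[i].layer_cap <= 0) {
            continue; /* cannot accept children */
        }
        if (c[i].assoc >= c[i].assoc_cap) {
            continue; /* already full */
        }
        if (c[i].rssi <= RSSI_FLOOR) {
            continue; /* signal too weak */
        }
        /* Accept only a layer lower than ours or the best found so far. */
        if (best_layer == 0 || c[i].layer < best_layer) {
            best = (int)i;
            best_layer = c[i].layer;
        }
    }
    return best;
}
```

In this sketch, a return value of -1 corresponds to the situation described above: no candidate on a lower layer is found, so the caller never reaches esp_mesh_set_parent.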

Please confirm again whether you have made more changes since this date. I tried to check the referenced "fix in v5.4 is here" link, but I can't view it; I just get "action not available". (I am not very familiar with Git.)
image

@zhangyanjiaoesp
Collaborator

You said in the comment that you couldn't understand what this log meant.

Line 126 starts logging I (65677) wifi:>>>intv = 102400, max = 0. I don't know what this means

Then I replied to you that this log was added to solve the previous crash issue. I quoted my previous comment just to let you know that there have been similar logs before, and this fix has not been updated to GitHub yet, that's all. It has nothing to do with the other issues you reported later !!!

@michaelsimp
Author

Ok, thanks for clarifying. I didn't understand the context and got excited at the prospect of a fix.
So is my case still open for action from your side, or do you need something more from me?

@michaelsimp
Author

Can I have an update, please?

@zhangyanjiaoesp
Collaborator

@michaelsimp

  1. The IDF v5.3.2 has been released.
  2. Regarding the other issues, could you provide a simpler and reproducible demo? First, during my testing, I am unable to set up the mesh devices in the same way as you. Additionally, the demo you previously provided involves operations such as QR code-based provisioning, console control, and others, while my local testing only uses the ip_internal_network example and performs periodic scanning. Our testing methods also differ. If you can simplify the steps and methods for reproducing the issue, I can conduct local tests and work on resolving it.

@michaelsimp
Author

Ok, but I will need to park it and come back as soon as I can. My project is far behind now and I have to make progress in other areas. Please keep this ticket open

@zhangyanjiaoesp
Collaborator

sure.
