Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash: timer_insert #30

Closed
mullender opened this issue Jan 7, 2025 · 5 comments
Closed

Crash: timer_insert #30

mullender opened this issue Jan 7, 2025 · 5 comments

Comments

@mullender
Copy link

After enabling crash dumps in #2 I was able to collect the following log

Depending on the version of esp_timer in use, the crash happens between these lines of code:
https://github.com/espressif/esp-idf/blob/release/v5.3/components/esp_timer/src/esp_timer.c#L327-L337
possibly the assert, if there are multiple threads editing the list being iterated over. There is an older commit that suggests this file has thread safety issues: espressif/esp-idf@3ba7049

After symbolication of the stack:

➜  ~ /opt/homebrew/opt/binutils/bin/addr2line -p -f -C -e ~/Downloads/homekit-ratgdo32-v3.0.5.elf \
 -a 0x400847B1 0x400848E5 0x401054C3 0x400DD669 0x400E405A 0x400DDC91 \
    0x400DCFD8 0x400F7E16 0x400DA8F2 0x40090512
0x400847b1: timer_insert at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp_timer/src/esp_timer.c:328
0x400848e5: esp_timer_start_once at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp_timer/src/esp_timer.c:230
0x401054c3: Ticker::_attach_us(unsigned long long, bool, void (*)(void*), void*) at /home/runner/.platformio/packages/framework-arduinoespressif32/libraries/Ticker/src/Ticker.cpp:47
0x400dd669: void Ticker::once_ms<LED*>(unsigned long long, void (*)(LED*), LED*) at /home/runner/.platformio/packages/framework-arduinoespressif32/libraries/Ticker/src/Ticker.h:96
0x400e405a: SSEBroadcastState(char const*, BroadcastType) at /home/runner/work/homekit-ratgdo32/homekit-ratgdo32/src/web.cpp:973
0x400ddc91: LOG::logToBuffer(char const*, ...) at /home/runner/work/homekit-ratgdo32/homekit-ratgdo32/src/log.cpp:138
0x400dcfd8: DEV_Motion::loop() at /home/runner/work/homekit-ratgdo32/homekit-ratgdo32/src/homekit.cpp:627 (discriminator 6)
0x400f7e16: Span::pollTask() at /home/runner/work/homekit-ratgdo32/homekit-ratgdo32/.pio/libdeps/ratgdo_esp32dev/HomeSpan/src/HomeSpan.cpp:269 (discriminator 1)
0x400da8f2: Span::autoPoll(unsigned long, unsigned long, unsigned long)::{lambda(void*)#1}::operator()(void*) const at /home/runner/work/homekit-ratgdo32/homekit-ratgdo32/.pio/libdeps/ratgdo_esp32dev/HomeSpan/src/HomeSpan.h:399
0x40090512: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:139

Raw crash report

Messages logged immediately before last system crash...
 Time of crash: 05-Jan-2025 22:05:48 PST
Crash reason: Interrupt wdt timeout on CPU0
Firmware version: 3.0.5

00001193 000000E09B1D0126 0260C181
>>> [  13508.527] ratgdo-packet: PACKET(0x1D0126 @ 0x1193) Status - Status: [DoorState Open, Parity 0xC, Obs 1, Lock 0, Light 1]
>>> [  13508.547] ratgdo-comms: tgt 0 curr 0
>>> [  13512.745] ratgdo-reader: reader completed packet
>>> [  13512.745] ratgdo-packet: DECODED  00000FD5 000000622E9CA914 01018280
>>> [  13512.746] ratgdo-packet: PACKET(0x9CA914 @ 0xFD5) DoorAction - DoorAction: [DoorAction Toggle, Pressed 1, Id 01]
>>> [  13512.756] ratgdo-comms: Door Action
>>> [  13512.767] ratgdo-comms: Push count reset
>>> [  13512.767] ratgdo-comms: Push count 0
>>> [  13512.781] ratgdo-homekit: Motion detected
>>> [  13512.888] ratgdo-reader: reader completed packet
>>> [  13512.888] ratgdo-packet: DECODED  00001194 000000E09B1D0126 000931A1
>>> [  13512.889] ratgdo-packet: PACKET(0x1D0126 @ 0x1194) Pair3Resp - NoData: [Zero: 0x00090100, Parity: 0x3]
>>> [  13512.899] ratgdo-comms: Support for Pair3Resp packet unimplemented. Ignoring.
>>> [  13513.020] ratgdo-reader: reader completed packet
>>> [  13513.020] ratgdo-packet: DECODED  00000FD5 000000622E9CA914 01009280
>>> [  13513.021] ratgdo-packet: PACKET(0x9CA914 @ 0xFD5) DoorAction - DoorAction: [DoorAction Toggle, Pressed 0, Id 01]
>>> [  13513.031] ratgdo-comms: Door Action
>>> [  13513.412] ratgdo-reader: reader completed packet
>>> [  13513.413] ratgdo-packet: DECODED  00000FD6 000000622E9CA914 0000F085
>>> [  13513.413] ratgdo-packet: PACKET(0x9CA914 @ 0xFD6) Motion - NoData: [Zero: 0x00000000, Parity: 0xF]
>>> [  13513.423] ratgdo-comms: Motion Detected
>>> [  13513.455] ratgdo-packet: ENCODING 00000AE8 0000000000FBE539 00000080
>>> [  13513.587] ratgdo-reader: reader completed packet
>>> [  13513.587] ratgdo-packet: DECODED  00000AE8 0000000000FBE539 00008080
>>> [  13513.588] ratgdo-packet: PACKET(0xFBE539 @ 0xAE8) GetStatus - NoData: [Zero: 0x00000000, Parity: 0x8]
>>> [  13513.598] ratgdo-comms: Support for GetStatus packet unimplemented. Ignoring.
>>> [  13513.719] ratgdo-reader: reader completed packet
!!! [  13513.719] ratgdo-packet: Failed to decode packet
>>> [  13513.719] ratgdo-packet: DECODED  00000000 0000000000000000 00000000
>>> [  13513.730] ratgdo-packet: PACKET(0x0 @ 0x0) UNKNOWN - Unknown: [000]
>>> [  13513.740] ratgdo-comms: Support for UNKNOWN packet unimplemented. Ignoring.
>>> [  13513.852] ratgdo-reader: reader completed packet
>>> [  13513.852] ratgdo-packet: DECODED  00001195 000000E09B1D0126 02608581
>>> [  13513.853] ratgdo-packet: PACKET(0x1D0126 @ 0x1195) Status - Status: [DoorState Closing, Parity 0x8, Obs 1, Lock 0, Light 1]
>>> [  13513.873] ratgdo-comms: tgt 1 curr 3
>>> [  13513.874] ratgdo-vehicle: Vehicle status: Arriving
>>> [  13513.874] ratgdo-http: Current Door State changing from 0 to 3
>>> [  13513.885] ratgdo-homekit: Garage door set CurrentDoorState: 3
>>> [  13513.886] ratgdo-homekit: Arriving detected
>>> [  13513.902] ratgdo-homekit: Garage door set TargetDoorState: 1
>>> [  13513.910] ratgdo-vehicle: Vehicle status: Parked
>>> [  13513.911] ratgdo-homekit: Arriving reset


Crash in task: pollTask, at address: 0x400847B1
Decode backtrace with this Linux command:

addr2line -p -f -C -e homekit-ratgdo32-v3.0.5.elf \
 -a 0x400847B1 0x400848E5 0x401054C3 0x400DD669 0x400E405A 0x400DDC91 \
    0x400DCFD8 0x400F7E16 0x400DA8F2 0x40090512 

Make sure that the ELF file matches the binary that crashed.
@dkerr64
Copy link
Collaborator

dkerr64 commented Jan 7, 2025

The last part of my code before the crash is a call to led.flash() in SSEBroadcastState. The purpose of this is to blink the blue activity LED for 500ms every time there is a message logged. In this specific case it is being called from the HomeSpan loop... which is its own thread. led.flash() is also called from the comms loop (to indicate communications activity with the GDO) and when a vehicle arrives/departs. Those last two are in the same main loop thread.

So your thread theory may be valid. And it is interesting that the log messages show that vehicle status just changed to park, and that is in the function that flashes the LED... so within 500ms two threads are calling on the LED to flash.

But... I'm 99% sure that the esp_timer code we're using includes the commit you pointed at that talks about thread safety.

Things we could try...

  • If it really is the interrupt watchdog then consider suggestions in this documentation.
  • I could remove the led.flash() from the vehicle arrive/depart code. It feels worthless anyway.
  • I could try and make flashing LED thread safe... either a) wrap in a mutex semaphore. Though I am not sure how much good that would do, or b) don't flash if the LED is already in the midst of flashing (ie, don't start another once timer if a current one has not expired yet).
  • And/or remove LED flashing from the message log code and leave it only in the comms loop... guaranteeing that LED flashing will only be requested from that thread and not a.n.other thread.

I'm not a fan of the first option. But some combination of the others sounds wise.

But I bet this is hard to reproduce (unless you tell me otherwise!)

dkerr64 added a commit to dkerr64/homekit-ratgdo32 that referenced this issue Jan 8, 2025
@mullender
Copy link
Author

It looks like the commit I linked only handled the delete case, not the insert.

But I like your suggestion of simply avoiding this edge case by not blinking the led :) That seems straight-forward and will at least confirm the hypothesis of a thread-safety issue being the root cause.

@dkerr64
Copy link
Collaborator

dkerr64 commented Jan 8, 2025

As you can see above, I've gone with 2 and 3b from above bulleted list. Once I merge and release we'll see how successful that is.

@dkerr64
Copy link
Collaborator

dkerr64 commented Jan 18, 2025

Fixed in v3.0.6

@dkerr64 dkerr64 closed this as completed Jan 18, 2025
@mullender
Copy link
Author

Confirming this crash is no longer happening. No other crashes observed either ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants