HCI: after MTU change, gatt client's answer is not pushed to HCI (IDFGH-13790) #14648
Comments
Anyone please?! This is unbelievable. HCI times out in every single 10-12 hours and NO log is dumped/printed on debug serial. I'm now on latest latest latest idf, with latest latest lib. |
If I disable LE ad filtering, the ESP32 can't survive for more than 3 hrs; it needs to be RESET every 3 hrs. Please someone say something! NO coredump, NO debug log, NO nothing on the serial console. Only a physical HW RESET helps, and only for 3 hrs. |
I got some updates - although you don't care - shame on you. After constant stress load: LE scanning and occasionally gatt connection to a device. Gatt device wants to increase MTU from 23 to 247 in every connection. We accept this. But after a couple of hours, this is simply NOT working anymore: gatt device not answering. Now big update! After the stress test fails: I can still connect manually to this gatt device (via gatttool)!!! Gatttool refuses to accept the MTU request, so it stays 23, and the gatt device will also answer to this. I believe the gatt device would answer anyway (no matter whether we accept/refuse its MTU change request), but the ESP is NOT putting it into HCI anymore until it gets a reset. Does this make sense to you? So: Is it some buffer handling problem in the ESP32 lib? Please folks. |
Note: gatt client's reply is not received in either case:
Only in one single case will the gatt client's reply reach us: when we don't accept the new MTU at all (AND we don't request an MTU increase either) |
Anyone knows anybody who can provide me the source of libbtdm_app.a, or I will have to reverse engineer it? Maybe @BetterJincheng? |
Ping |
pong |
@danergo |
@esp-zhp: sure. Thank you! I have something to add: when the ESP starts this behavior (i.e. doesn't forward packets after MTU negotiation has been done), a 100% fix is to restart the ESP with its reset button (/en switch).
This is (I believe) sending a hci reset command which seems to solve the issue for another 10-12 hours. |
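For reference, a minimal sketch of how an HCI Reset command would be framed on a UART (H4) transport, assuming the standard opcode layout from the Bluetooth Core Specification. The helper names are illustrative and are not part of ESP-IDF or BlueZ; nothing here confirms what the reset button actually triggers inside the controller.

```python
import struct

H4_COMMAND = 0x01                 # H4 packet indicator for HCI commands
OGF_CONTROLLER_BASEBAND = 0x03    # opcode group of the Reset command
OCF_RESET = 0x0003                # opcode command field of Reset


def hci_opcode(ogf: int, ocf: int) -> int:
    """Combine the 6-bit OGF and the 10-bit OCF into a 16-bit opcode."""
    return (ogf << 10) | ocf


def h4_command(ogf: int, ocf: int, params: bytes = b"") -> bytes:
    """Frame a command: indicator, opcode (little-endian), length, parameters."""
    header = struct.pack("<BHB", H4_COMMAND, hci_opcode(ogf, ocf), len(params))
    return header + params


reset_packet = h4_command(OGF_CONTROLLER_BASEBAND, OCF_RESET)
print(reset_packet.hex(" "))  # 01 03 0c 00
```

On a BlueZ host the same command is normally issued through `hciconfig hci0 reset` rather than by writing raw bytes.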
Currently, there isn't enough information for me to pinpoint the issue. |
HCI reset command can resolve my issue. Other than that, I have thousands (if not millions) of lines of "btmon" logs. BlueZ doesn't send any extra command (at least nothing extra is visible in btmon logs). In case hci reset fixes the mtu problem, how would it be ideal? Thank you! |
I have some news: while I was away from this device, ESP started behaving wrong again (same MTU problem). But after constant thorough connection requests (1130 connection retrials for more than 3 hours!!!), it 'fixed' itself: now MTU negotiation doesn't ruin the incoming indication packets, but I believe after 10-12hours it will break again. |
If resetting the HCI resolves the MTU issue, I believe it's acceptable. |
Sorry, I don't think it's acceptable:
Now, this device doesn't move a single millimeter, it's staying in one place (and so is the ESP). My point here is that, from my application, I can't judge whether a device timeout is due to a real timeout or to ESP misbehavior. Anyway, in dmesg, I see these errors a lot:
This is a timeout: the ESP doesn't answer my requests. This happens when the ESP reaches the problematic phase. |
Could you capture the packets to verify if the peer device is indeed sending Indications above 23? If the peer device doesn't send the Indications, then the ESP device wouldn't be able to receive them either. If the issue still persists, I need more information to further diagnose the issue. Could you provide the complete HCI data? |
Dear @esp-zhp: I would happily provide any logs if it can help you with diag. "LE Create connection" always succeeds. The MTU exchange request (from 23 to 247) also always succeeds. Shorter characteristic writes and their confirmations also always succeed. Longer characteristic writes and their confirmations also always succeed. Shorter indications are also always received. Indications longer than 23 are received for 10-12 hours, then they are not forwarded back to HCI anymore. I doubt the client has any problems, for 2 quite solid reasons:
HCI data: I have many. All created by btmon -w. Is it okay for you? Can I share this privately? |
One more thing needs to be confirmed to narrow down the issue. Do you think your problem is related to classic Bluetooth? If you are not using classic Bluetooth, does the issue still persist? I'm responsible for BLE and don't have much knowledge about classic Bluetooth. If you believe the issue is related to classic Bluetooth, I can ask my colleagues who handle classic Bluetooth to assist you. |
Could you please send me all the HCI logs from the ESP32? It would be best to use GitHub so that other colleagues can also view them. If it's not convenient for you to share publicly on GitHub, you can also send them to my email ([email protected]). Also, do you have any packet capture devices on your side? It would be even better if you could capture the packets to confirm whether the ESP32 has sent an indication (ATT_HANDLE_VALUE_IND) when the MTU is above 23. |
Sorry for the confusion! Let me clear things up! I'm using ESP for Dual-Mode Bluetooth Controller (controller_hci_uart example from this repo). The issue is related purely to BLE. My RPi is connected via UART to this ESP. ESP is responsible for providing Bluetooth to this RPi (it has no onboard Bt). RPi is attaching the ESP with btattach, therefore RPi sees hci0. BlueZ uses this hci0 interface to provide Bluetooth functionality to my app on RPi. My app is constantly monitoring LE advertisements from a bunch of devices (about 10). No filtering is enabled. Occasionally (4-5 times per 10hrs) my app connects to a remote client with gatt-characteristics, notifications and indications. This occasional "LE Create Connection" always succeeds, but the client is sending an "MTU Exchange Request" after the connection is established (23->247). My app always accepts this new mtu, and responds as intended. Then my app subscribes to indications by writing a specific data to a gatt handle. If we don't do reset, longer indications are not arriving anymore. All the rest details are provided earlier: Does this change anything? Thank you! |
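To make the sequence above concrete, here is a rough sketch of the ATT PDUs involved in an MTU exchange and an indication subscription. The opcodes and the `MTU - 3` payload rule come from the Bluetooth Core Specification; `CCCD_HANDLE` is a hypothetical handle chosen for illustration and is not taken from this thread.

```python
import struct

# ATT opcodes from the Bluetooth Core Specification
ATT_EXCHANGE_MTU_REQ = 0x02
ATT_EXCHANGE_MTU_RSP = 0x03
ATT_WRITE_REQ = 0x12
ATT_HANDLE_VALUE_IND = 0x1D

CCCD_ENABLE_INDICATIONS = b"\x02\x00"  # CCCD value with the indication bit set
CCCD_HANDLE = 0x0010                   # hypothetical handle, illustration only


def exchange_mtu_rsp(rx_mtu: int) -> bytes:
    """Build an Exchange MTU Response carrying our receive MTU."""
    return struct.pack("<BH", ATT_EXCHANGE_MTU_RSP, rx_mtu)


def write_req(handle: int, value: bytes) -> bytes:
    """Build a Write Request for the given attribute handle."""
    return struct.pack("<BH", ATT_WRITE_REQ, handle) + value


def negotiated_mtu(requested: int, offered: int) -> int:
    """Both sides use the smaller of the two exchanged MTU values."""
    return min(requested, offered)


def max_indication_value(mtu: int) -> int:
    """An indication spends 3 bytes on opcode and handle."""
    return mtu - 3


mtu = negotiated_mtu(247, 247)
print(exchange_mtu_rsp(247).hex(" "))
print(write_req(CCCD_HANDLE, CCCD_ENABLE_INDICATIONS).hex(" "))
print(max_indication_value(mtu), max_indication_value(23))
```

With an MTU of 23 a single indication can carry at most 20 bytes of value, which is why longer replies only fit in one PDU after the MTU has been raised.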
@esp-zhp: mail sent to you, would be appreciated if you could take a look at it. |
Hi, @esp-zhp:
Yes, I do have esp terminal logs yes, will attach here.
It is issuing the "LE Create Connection", but "btmon" recorded the timestamps in UTC, while "dmesg" output has 2 hours later timings.
Yes, I believe this is the correct terminology.
That's correct, there is a 2hour difference:
You will find it at 08:31:15 in the HCI log (packet no: 37761: LE Create Connection).
Please check the problematic parts: from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): constant, thorough trying of connection. LE Create Connection succeeds! GATT writes confirmed! Short GATT Indications are received! But not a single sign of the long indication. You can see a working example starting in 14584 (06:59:55,57) please pay attention to long indication for this connection in 14611 (06:59:55,69): 86 bytes long, response for our "0x0092" GATT write request in 14608 (06:59:55,64). Connection to this client is always done by: Now, with correct sequences at the beginning of the HCI logs, you can investigate this behavior, then you will see the problematic parts from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): step5 is missing. There are more than 2 hours, and more than 30000 HCI Packets trying to connect to this Client. During this period, any other device can perfectly connect to the same Client (therefore Client is behaving correctly). Also, during this period, in case I deny the MTU change, ESP will forward the indication in step5, with multiple indications (23-23-23-17). Thank you very much, I appreciate your time spent on this. |
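The 23-23-23-17 pattern above is what you get when an 86-byte reply is cut into pieces of at most 23 bytes. A small sketch of that arithmetic, assuming the sizes quoted in the comment; the thread does not say whether they count whole ATT PDUs or only the value bytes, so the chunk size is left as a parameter:

```python
def split_sizes(total_len: int, chunk_len: int) -> list[int]:
    """Return the sizes of the pieces needed to carry total_len bytes."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    full, rest = divmod(total_len, chunk_len)
    return [chunk_len] * full + ([rest] if rest else [])


print(split_sizes(86, 23))  # [23, 23, 23, 17]
```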
@esp-zhp: i have a new idea: Creating a firmware in which we can turn on/off logs dynamically via debug uart (i.e.: not hardcode log setting into firmware). In this case we can start with hci logging turned off, then once issue occurs, we can turn on hci logging over debug port, without rebooting esp. What do you think? Can this help? How shall we read from debug uart without disturbing the btdm controller? |
@danergo Based on your description, there is a version with HCI logging enabled. If HCI logging is always on, it works fine, but power consumption will be higher. For now, you can use the library with HCI logging enabled to continue your development. You mentioned: “Creating a firmware that allows us to dynamically turn logging on and off via the debug UART (i.e., without hardcoding log settings into the firmware). In this case, we can start with HCI logging turned off, and then, if an issue occurs, enable HCI logging through the debug port without rebooting the ESP.” I’m not sure if this approach will work, as I haven’t yet identified the root cause of the issue. I will update the library and plan to add debug information for suspected areas. When an issue occurs, you can dump this debug information (I’ll provide a new API for this), and I’ll use the data to help resolve the problem. I’m still working on the debug library, and this may take some time. In the meantime, please proceed with your testing and development. |
Okay, thank you! |
Hi, @esp-zhp: did you have some time to progress? We are running our main service with HCI logs enabled, and it's smooth. But I think this shall not be considered as a final solution. Thank you! |
Yes, I will give you the lib this week. |
Dear @esp-zhp: We are sending you a new "proof" capture: this time with:
This contains a LOT of garbage, please forgive us. Please focus on the ATT frames. You will see normal procedures of GATT writes and confirmations. Focus on 0x0092 Write Response! You'll see: 0x0092 responses are coming until ID#634057 (last Write Response in log). Afterwards, you'll see no more Write Responses (for 0x0092). As this time we have the debug log enabled, we can state that simply enabling "HCI verbose log" doesn't solve this problem. (Luckily! At least, logging is not hiding this bug :) ) Please check our logs, maybe you'll see something interesting which can help you with this topic. |
From the HCI log on your host side, it appears that BlueZ sent a write request (handle 0x92), but the ESP32 controller either did not send the packet or failed to send it promptly. This seems to be an issue on the ESP32 controller's side. However, from the ESP32 controller's perspective, it did not receive the write request (handle 0x92). I’m unsure how the HCI log on BlueZ was captured or whether it is entirely accurate. |
That's exactly the root cause of the problem! Please note: all ATT 146 messages are long (longer than 63 bytes). I have tested this many times: after ESP (or BlueZ) gets into this problematic stage, any write request (ATT handle doesn't matter) with data length above 63 won't be sent out (anything below or equal to 63 will be sent out perfectly)! Let's try to identify the wrong part here: For the 100% proof, let me wait for a couple of days to let the problematic stage arrive, and then I'll hook up the oscilloscope to the ESP's RX pin, to make you a final proof. :) Thank you! |
Hi, all (@esp-zhp). We have concluded a final test with an O-scope tapped onto the TX pin of RPi (RX pin of ESP32). Our assumption has been 100% validated now: RPi sends the long gatt write, but ESP32 doesn't acknowledge it. Please see the shared log: it's a small one! Facts:
For the last step (3), we have recorded the screen of the scope: this is the proof, that ESP's RX pin is receiving this long write (takes almost 700us), but I assume you will not see this in the debug log from ESP (shared with you too): This is proved now an ESP-side bug, and it is present in the closed-source library, so we can't do anything else on our side to fix it. |
@danergo
|
Thank you for extensive description. I am more than sure that my long write is correct although I have shared a scope screenshot and not a signal analyser (measured shorter gatt writes, and scope showed shorter data flow). However, we do have signal analyser as well, will provide you that evidence too. Will also share our sdkconfig, sleep mode (as far as I remember) is enabled. Will attach it here. Thank you! |
Great! Make sure to print the relevant registers when an issue occurs. Disable sleep mode. |
how do you think A potential solution? |
Uart flow control is enabled. Anyway, for VHCI we have just a slight knowledge. Thank you. Will get back to you with details later today. |
OMG, my sdkconfig is huge, compared to example's. Please find it attached here. Many sleep configs are enabled (which might have been modified by us). Will do now 2 things: 1.) Signal-trace the RX pin for the HCI data We have high hopes about this sleep configuration that it can solve this problem :) Thank you! |
Hi, @esp-zhp: I have some news! We hooked up our signal tracer onto the RX pin, and see what happened. We measured two cases: 1.) gatt write request with data of This is great, works OK. 2.) gatt write request with data of As you can see, RPi is sending only 64bytes of data! By this time, we thought this must be on RPi (BlueZ or kernel) side. But we went a little further and reverted back our sdkconfig to the defaults (no sleep enabled now). But then we have some amazing result: As you can see, there are two "batches"! First batch is exactly 64bytes long. Delay between two batches is approximately 100us. Then we pushed even further with a very long gatt write: No need to say anything. Our SPI to UART component has a 64byte FIFO buffer. Is it possible that ESP is holding down (or up) the CTS (or RTS?) pin, in order to prevent further data transmission because it must process the current batch? In full UART communication, which pin is used (driven) by ESP in order to signal the RPi to prevent data transmission? RTS or CTS? Thank you very much! |
Based on your feedback, the current issue seems to be that the Raspberry Pi is not transmitting data correctly, rather than a problem with the ESP32 receiving end. Is that correct? When hardware and drivers support RTS and CTS, their behavior is automatically managed, and user intervention is typically not required. If you wish to manually intervene, it should be feasible. Feel free to test it.
RTS and CTS Collaboration
RTS and CTS work together to achieve hardware flow control:
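The two "batches" described above match what a 64-byte transmit FIFO would produce for a frame longer than 64 bytes. A rough sketch of that arithmetic follows. The header sizes are the standard H4, HCI ACL, L2CAP and ATT Write Request overheads; whether the 63/64-byte limits quoted in this thread refer to the ATT value or to the whole UART frame is not stated, so treat the mapping as an assumption.

```python
import math

FIFO_SIZE = 64  # FIFO depth of the SPI-to-UART bridge, as quoted in the thread

H4_INDICATOR = 1    # packet type byte on the UART
ACL_HEADER = 4      # connection handle/flags + data length
L2CAP_HEADER = 4    # length + channel ID
ATT_WRITE_HEADER = 3  # opcode + attribute handle


def h4_write_frame_len(value_len: int) -> int:
    """Bytes on the UART for one unfragmented ATT Write Request."""
    return H4_INDICATOR + ACL_HEADER + L2CAP_HEADER + ATT_WRITE_HEADER + value_len


def fifo_fills(frame_len: int, fifo_size: int = FIFO_SIZE) -> int:
    """How many times the FIFO must be filled to send the whole frame."""
    return math.ceil(frame_len / fifo_size)


for value_len in (20, 52, 53, 86):
    frame = h4_write_frame_len(value_len)
    print(value_len, frame, fifo_fills(frame))
```

Any frame that needs more than one fill depends on the driver refilling the FIFO correctly after the first 64 bytes, which is the step that would be affected by a FIFO or interrupt-timing fault on the host side.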
From the logs:
Pin CTS 23 is used to signal the Raspberry Pi to prevent data transmission. |
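As a toy model of the hardware flow control discussed here: in the usual wiring each side's RTS output drives the other side's CTS input, and a receiver deasserts its RTS when its buffer is full so that the sender pauses. The sketch below simulates only that handshake; the class and function names are made up for illustration, and the numbers do not come from the ESP32 or the Raspberry Pi.

```python
from collections import deque


class Receiver:
    """Receiver with a bounded buffer; `ready` models its RTS output."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buf = deque()

    @property
    def ready(self) -> bool:
        # Seen by the sender on its CTS input.
        return len(self.buf) < self.capacity

    def push(self, byte: int) -> None:
        self.buf.append(byte)

    def drain(self, count: int) -> None:
        # The receiver consumes up to `count` buffered bytes.
        for _ in range(min(count, len(self.buf))):
            self.buf.popleft()


def send(data: bytes, rx: Receiver, drain_per_stall: int, max_stalls: int = 1000):
    """Send byte by byte, pausing whenever the receiver is not ready."""
    sent = stalls = 0
    while sent < len(data) and stalls < max_stalls:
        if rx.ready:
            rx.push(data[sent])
            sent += 1
        else:
            stalls += 1
            rx.drain(drain_per_stall)
    return sent, stalls


print(send(bytes(100), Receiver(64), drain_per_stall=16))               # (100, 3)
print(send(bytes(100), Receiver(64), drain_per_stall=0, max_stalls=5))  # (64, 5)
```

The second call shows the failure shape a scope would reveal if the receiver never released the line: exactly one buffer's worth of bytes goes out and then the sender waits indefinitely.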
Thank you.
Yes, it seems to be on our side, so we wish to say a huge sorry and thanks for your kind help through all of this. We don't want to control the CTS/RTS behavior. We just wanted to check whether it's the ESP not accepting more, or the RPi that can't send more.
In this case we need to trace PIN 23 on the ESP to see if it needs to stop us sending, right? One final test we must do to confirm this is really on the RPi side: once it fails again, we need to signal trace the CTS too: it still can be that in the failed case the ESP is not pulling the CTS line low fast enough for the RPi to continue sending and it halts the transmission (although it's fairly unlikely - BUT! In case we don't reset the ESP after the failed case, it's not getting back to normal, so still pretty mysterious). Thank you! |
Since you plan to continue testing, you can track the following signals:
By monitoring these signals, I believe we can ultimately determine whether it's the ESP32 not accepting more data, or the Raspberry Pi being unable to send more. Looking forward to your further response. |
Thank you, I really appreciate this. Test is now running, waiting for the stuck. After it happens again, we will analyze these signals and report it back here for sure. Best regards until then |
Hi! We got a recommendation to update our kernel (from 6.1 to 6.12). Our SPI2UART silicon might have had a "silicon bug" which is published as an erratum. Kernel developers worked on that, and it seems to be fixed in the latest kernel. The bug was a corrupted FIFO due to wrong interrupt timings, therefore losing data. We are still not 100% sure, but fact is fact: it has now been running for 18hrs without a single lost byte. As there were some cases when it ran longer than that, we will still wait for a couple of days, but if there is no lost byte or missed packet, we shall consider this as "wontfix", because this was indeed not caused by the ESP in any manner. In this case I wish to apologize again for taking your precious time. Thank you! |
@danergo |
We forwarded you our latest log. With the new kernel, communication between Host and ESP is super-smooth, running since more than 4 days without any single lost byte. There were 2 "glitches" though, which you might give us some input on: In the hci log (btmon_v612.pcap), we see "Hardware Error"s from controller at IDs:
Can you check the minicom log (debug log from ESP, also shared), in regards for these two events? Do you see any uncommon communication from our side, or any reason why ESP is presenting this Hardware Error? Please note, the logs are quite large, as we tried to stress the system as much as we can, disabling LE ad filtering completely. Thank you! |
Thank you for the detailed information. The original issue related to the Linux kernel appears resolved, so that can be considered closed for now. Regarding the new issue, I will examine the provided Since this is a new issue, it would be better to track it separately. Please create a new GitHub issue specifically for this problem. This will help us streamline the analysis and resolution process. I'll provide feedback here as soon as I've reviewed the logs! |
Exactly, thank you! It's running now over a week (8days), without a single byte of lost data, so we are pretty confident this original issue was caused by a kernel bug (which was related to a silicon bug). So we are thankful for you providing so much information on this topic. We created now a new topic for the hardware error here: #14964. Let's continue there, and finally close this one :) Thank you. |
Answers checklist.
IDF version.
Latest
Espressif SoC revision.
NodeMCU-ESP-32S
Operating System used.
Linux
How did you build your project?
Command line with idf.py
If you are using Windows, please specify command line type.
None
Development Kit.
NodeMCU-ESP-32S
Power Supply used.
External 5V
What is the expected behavior?
Stable operation
What is the actual behavior?
A manual power cycle is needed every 12hrs.
hciconfig hci0 reset is timing out.
btattach also times out.
Watchdog enabled but it is not triggering a reset.
Coredump enabled but no coredump is being written.
Verbose logging also enabled but only few log items are shown.
Steps to reproduce.
Ble ad scanning with hardware filtering (based on device mac and ad) at least 8 devices.
Every 5 mins, try connecting to a standard (not BLE) device (which is out of range) - so the connection will always have to fail.
Occasionally connect to a BLE device (which is in range, so the connection should succeed).
Every 12 hours (roughly) we have to manually reset the esp. Otherwise hci0 will eventually go down.
Before hci0 goes down, we can still try connecting to a BLE device, but we can't receive longer data from it.
(The BLE device asks us for an MTU increase, and we accept it, but then we can't receive data: but this happens ONLY after 10-12 hours of constantly stressing the ESP with the above advertisement scanning and the 5-minute connection attempts to the inactive device).
I guess some buffer is overfilling, but I couldn't enable any practical logging in menuconfig.
What do you suggest?
Debug Logs.
No response
More Information.
No response