Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HCI: after MTU change, gatt client's answer is not pushed to HCI (IDFGH-13790) #14648

Closed
3 tasks done
danergo opened this issue Sep 30, 2024 · 107 comments
Closed
3 tasks done
Assignees
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally Type: Bug bugs in IDF

Comments

@danergo
Copy link

danergo commented Sep 30, 2024

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

Latest

Espressif SoC revision.

NodeMCU-ESP-32S

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

NodeMCU-ESP-32S

Power Supply used.

External 5V

What is the expected behavior?

Stable operation

What is the actual behavior?

Manual power recycle is needed in every 12hrs.

hciconfig hci0 reset is timing out.
btattach also times out.

Watchdog enabled but it is not triggering a reset.
Coredump enabled but no coredump is being written.
Verbose logging also enabled but only few log items are shown.

Steps to reproduce.

Ble ad scanning with hardware filtering (based on device mac and ad) at least 8 devices.

In every 5mins, try connecting to a standard (not ble) devices (which is out of range) - so connection will have to fail always.

Occasionally connect to a ble device (which is in range and shall be succeed).

Every 12 hours (roughly) we have to manually reset the esp. Otherwise hci0 will eventually go down.

Before hci0 going down, we can still try connecting to a ble device but we can't receive longer data from it.

(Ble device asks us an MTU increase, and we accept it, but then we can't receive data: but this happens ONLY after 10-12 hours of constant stressing esp with the above advertising scaninngs and 5mins inactive device connect trials).

I guess some buffer is overfilling but I couldnt enable any practical logging in menuconfig.

What do you suggest?

Debug Logs.

No response

More Information.

No response

@danergo danergo added the Type: Bug bugs in IDF label Sep 30, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Sep 30, 2024
@github-actions github-actions bot changed the title HCI dual-mode controller BTDM with many devices: unstable! HCI dual-mode controller BTDM with many devices: unstable! (IDFGH-13790) Sep 30, 2024
@danergo
Copy link
Author

danergo commented Oct 3, 2024

Anyone please?! This is unbelievable. HCI times out in every single 10-12 hours and NO log is dumped/printed on debug serial.

I'm now on latest latest latest idf, with latest latest lib.

@danergo
Copy link
Author

danergo commented Oct 3, 2024

If I disable le ad filtering, esp32 can't survive for more than 3hrs. it needs to be RESETTED in every single 3 hrs.

Please someone say something! NO coredump, NO debug log, NO nothing on serial console. Only physical HW RESET is the solution for 3hrs.

@danergo
Copy link
Author

danergo commented Oct 4, 2024

I got some updates - although you don't care - shame on you.

After constant stress load: LE scanning and occasionally gatt connection to a device.

Gatt device wants to increase MTU from 23 to 247 in every connection. We accept this.

But after a couple of hours, this is simply NOT working anymore: gatt device not answering.

Now big update! After the stress test fails: I can still connect manually to this gatt device (via gatttool)!!! Gatttool refuses accepting MTU request so it stays 23 and gatt device will also answer to this.

I believe gatt device would answer anyway (no matter weather we accept/refuse its MTU change request), but esp is NOT putting it into hci anymore until it gets a reset.

Does this makes sense to you?

So:
Phase1: test: gatt client changes MTU, we accept, gatt client replies and we receive it
.
.
.
After 10hrs
Phase2: test: gatt client changes MTU, we accept, gatt client MIGHT answer but we NEVER receives it via hci.
Phase2: workaround: don't accept the MTU request and gatt client's reply is arriving to hci.

Is it some buffer handling problem in esp32 lib?

Please folks.

@danergo danergo changed the title HCI dual-mode controller BTDM with many devices: unstable! (IDFGH-13790) HCI: after MTU change, gatt client's answer is not pushed to HCI (IDFGH-13790) Oct 4, 2024
@danergo
Copy link
Author

danergo commented Oct 4, 2024

Note: gate's reply is not received in either case:

  1. When we accept client's mtu
  2. When we initiate mtu change from server (with gatttool mtu 247).

Only in one single case will gatt client's reply reach us: when we don't accept the new mtu at all (AND we don't request MTU increase either)

@danergo
Copy link
Author

danergo commented Oct 6, 2024

Anyone knows anybody who can provide me the source of libbtdm_app.a, or I will have to reverse engineer it? Maybe @BetterJincheng?

@danergo
Copy link
Author

danergo commented Oct 7, 2024

Ping

@danergo
Copy link
Author

danergo commented Oct 8, 2024

pong

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 8, 2024

@danergo
hi,I sincerely apologize for the delay in getting back to you. The recent National Day holiday in China meant that I was out of the office, and I was unable to respond to messages during that time.
I will check your issue,give me some time...

@danergo
Copy link
Author

danergo commented Oct 8, 2024

@esp-zhp: sure. Thank you!

I have some addition: when ESP starts this behavior (i.e doesn't forwards packets after mtu negotiations has been done), a 100% fix is to restart the ESP with its reset button (/en switch).
However (update!), sometimes it's enough to do:

systemctl stop bluetooth.service
systemctl stop my-btattach.service (this is running btattach to get the hci0 interface)
systemctl start bluetooth.service
systemctl start my-btattach.service

This is (I believe) sending a hci reset command which seems to solve the issue for another 10-12 hours.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 8, 2024

Currently, there isn't enough information for me to pinpoint the issue.
I have a few points of suspicion:
ESP32 only support BLE4.2 HCI command,but BlueZ might send latest HCI command, causing a timeout. You can refer to #12650 for more information.
Additionally, if the 'HCI reset command' doesn't resolve the issue, more debug information will be needed, such as providing all HCI commands and event data.If the 'HCI reset command' can resolve your problem, that would be ideal.

@danergo
Copy link
Author

danergo commented Oct 8, 2024

HCI reset command can resolve my issue.

Other than that, I have thousands (if not millions) of lines of "btmon" logs.

BlueZ doesn't send any extra command (at least nothing extra is visible in button logs).

In case hci reset fixes the mtu problem, how would it be ideal?

Thank you!

@danergo
Copy link
Author

danergo commented Oct 8, 2024

I have some news: while I was away from this device, ESP started behaving wrong again (same MTU problem).

But after constant thorough connection requests (1130 connection retrials for more than 3 hours!!!), it 'fixed' itself: now MTU negotiation doesn't ruin the incoming indication packets, but I believe after 10-12hours it will break again.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

If resetting the HCI resolves the MTU issue, I believe it's acceptable.
In fact, MTU doesn't have an HCI command; it's only transmitted via ACL.

@danergo
Copy link
Author

danergo commented Oct 9, 2024

Sorry, I don't think it's acceptable:

  • MTU 23->247, every party accepts it.
  • Subscribe to indications -> we got acknowledge
  • GATT write -> we got confirmation
  • GATT read -> we can read
  • Indications below 24 -> we receive it
  • Indications above 23 -> we don't receive it -> this can be due to:
    A.) ESP fails to handle, or
    B.) Device moved out from range during this exact moment

Now, this device doesn't move a single millimeter, it's staying in one place (so as the ESP).

My point here is, that from my application, I can't judge weather device timeout is due to real timeout, or ESP misbehavior.
Due to this, I can't accept that HCI reset solution is fair, sorry.

Anyway, in dmesg, I see these errors a lot:

[54034.156769] Bluetooth: hci0: Opcode 0x200d failed: -110
[54034.156891] Bluetooth: hci0: request failed to create LE connection: err -110

This is timeout, ESP doesn't answer for my requests. This happens, when ESP reaches the problematic phase.
Problematic phase: when it accept higher MTU, but won't forward longer messages to HCI.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

according to log “[54034.156769] Bluetooth: hci0: Opcode 0x200d failed: -110
[54034.156891] Bluetooth: hci0: request failed to create LE connection: err -110”
The log you provided is too limited, so I couldn’t extract much useful information. However, I’d like to remind you that if the 'LE Create Connection' command is issued, it will keep attempting to establish a connection until it's successful, as this action has a high priority. If the connection fails, you will need to send the 'LE Create Connection Cancel' command to stop the connection attempts.
ps:
BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 4, Part E page 2366
image

image

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

Could you capture the packets to verify if the peer device is indeed sending Indications above 23? If the peer device doesn't send the Indications, then the ESP device wouldn't be able to receive them either.

If the issue still persists, I need more information to further diagnose the issue. Could you provide the complete HCI data?

@danergo
Copy link
Author

danergo commented Oct 9, 2024

Dear @esp-zhp:

I would happily provide any logs if it can help you with diag.

"LE Create connection" always succeeds. Also MTU exchange request (from 23 to 247) also always succeeds. Shorter characteristic writes, and their confirmations also always succeeds. Longer characteristics writes and their confirmations also always succeeds. Shorter indications also always received. Longer than 23 indications received for 10-12 hours, then they are not forwarded back to hci anymore.

I doubt the client has any problems because 2 quite hard reasons:

  1. After ESP restart, everything starts working again (Longer-than-23 indications are arriving), without restarting the client.
  2. When ESP is not forwarding the Longer indications, I tried with another bluetooth controller: it can perfectly receive the Longer indications from the client. Client hasn't changed in any way.

HCI data: I have many. All created by button -w. Is it okay for you? Can I share this privately?

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

One more thing needs to be confirmed to narrow down the issue. Do you think your problem is related to classic Bluetooth? If you are not using classic Bluetooth, does the issue still persist? I'm responsible for BLE and don't have much knowledge about classic Bluetooth. If you believe the issue is related to classic Bluetooth, I can ask my colleagues who handle classic Bluetooth to assist you.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 9, 2024

Could you please send me all the HCI logs from the ESP32? It would be best to use GitHub so that other colleagues can also view them. If it's not convenient for you to share publicly on GitHub, you can also send them to my email ([email protected]).

and Do you have any packet capture devices on your side?It would be even better if you could capture the packets to confirm whether the ESP32 has sent an indication(ATT_HANDLE_VALUE_IND)when mtu above 23.
ps ATT_HANDLE_VALUE_IND:
image

@danergo
Copy link
Author

danergo commented Oct 9, 2024

@esp-zhp:

Sorry for the confusion! Let me clear things up!

I'm using ESP for Dual-Mode Bluetooth Controller (controller_hci_uart example from this repo).

The issue is related purely to BLE.

My RPi is connected via UART to this ESP. ESP is responsible for providing Bluetooth to this RPi (it has no onboard Bt).

RPi is attaching the ESP with btattach, therefore RPi sees hci0. BlueZ uses this hci0 interface to provide Bluetooth functionality to my app on RPi.

My app is constantly monitoring LE advertisements from a bunch of devices (about 10). No filtering is enabled. Occasionally (4-5 times per 10hrs) my app connects to a remote client with gatt-charactetistics, notifications and indications.

This occasional "LE Create Connection" always succeeds, but the client is asking an "MTU Exchange Request" after the connection is established (23->247). My app always accepts this new mtu, and responds as intended. Then my app subscribes to indications by writing a specific data to a gatt handle.
This write is confirmed by the controller, and my app gets reply from the client.
Then my app writes a long data to the client with gatt char-write-req, and this is also always gets acknowledged from ESP side over hci. Then for this large data, Client responds with a long indication (longer than 23). This long indication is arriving to my application for about 10-12 hours, then a reset to ESP is needed (sometimes, HCI reset is enough, but not always: it usually times out in this phase).

If we don't do reset, longer indications are not arriving anymore. All the rest details are provided earlier:
In case my app doesn't accept the new MTU after LE Create Connection, even the longer indication arrives (but in multiple parts: 23-23-23-10 lengths).

Does this change anything?

Thank you!

@danergo
Copy link
Author

danergo commented Oct 9, 2024

@esp-zhp: mail sent to you, would be appreciated if you could take a look at it.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 10, 2024

From the log you provided, the ESP32 is acting as a master role in the GAP layer and as a client role in the GATT layer. right?

In your dmesg log, it shows a connection failure at "Wed Oct 9 10:31:15 2024": "[Wed Oct 9 10:31:15 2024] Bluetooth: hci0: Opcode 0x200d failed: -110 [Wed Oct 9 10:31:15 2024] Bluetooth: hci0: request failed to create LE connection: err -110"

I tried to find the failed connection HCI command and event in the HCI log you provided, but I couldn’t locate the connection establishment command at "Wed Oct 9 10:31:15 2024."
image

There are two possibilities:

1-The timestamps in the dmesg log and HCI log do not match.
Please check why they aren’t aligned.
2-The BlueZ host didn’t send the create connection HCI command.
If the BlueZ host didn’t issue the create connection command, there might be a bug in BlueZ, and you should investigate it further from the BlueZ side.

Additionally, I didn’t find any connection failure information in the HCI log.

The last connection attempt was at 09:07:25, and it was successful. After that, the HCI log only shows continuous scanning without any further connection attempts.
ps:
09:07:25 create connect
image

scan
image

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 10, 2024

First, please check why the BlueZ host didn’t issue the create connect HCI command.

Do you have the ESP32 terminal logs on your side? The ESP32 terminal logs might contain some useful information.

Here’s an example of the ESP32 terminal log(I want the full log):
image

@danergo
Copy link
Author

danergo commented Oct 10, 2024

Hi, @esp-zhp:

Do you have the ESP32 terminal logs on your side? The ESP32 terminal logs might contain some useful information.

Yes, I do have esp terminal logs yes, will attach here.

First, please check why the BlueZ host didn’t issue the create connect HCI command.

It is issuing the "LE Create Connection", but "btmon" recorded the timestamps in UTC, while "dmesg" output has 2 hours later timings.

From the log you provided, the ESP32 is acting as a master role in the GAP layer and as a client role in the GATT layer. right?

Yes, I believe this is the correct terminology.

The timestamps in the dmesg log and HCI log do not match.

That's correct, there is a 2hour difference:

I tried to find the failed connection HCI command and event in the HCI log you provided, but I couldn’t locate the connection establishment command at "Wed Oct 9 10:31:15 2024."

You will find it at 08:31:15 in the HCI log (packet no: 37761: LE Create Connection).

Additionally, I didn’t find any connection failure information in the HCI log.
The last connection attempt was at 09:07:25, and it was successful. After that, the HCI log only shows continuous scanning without any further connection attempts.

Please check the problematic parts: from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): constant, thorough trying of connection. LE Create Connection succeeds! GATT writes confirmed! Short GATT Indications are received! But not a single sing of the long indication.

You can see a working example starting in 14584 (06:59:55,57) please pay attention to long indication for this connection in 14611 (06:59:55,69): 86 bytes long, response for our "0x0092" GATT write request in 14608 (06:59:55,64).

Connection to this client is always done by:
1.) We send LE Create Connection
2.) We Accept (by default)/Decline the Client's new MTU of 247
3.) We send GATT Write Request (Handle: 0x0093) -> Write Response received from Client (short length)
4.) We send GATT Write Request (Handle: 0x0092) -> Write Response received from Client (short length)
5.) Client send GATT Handle Value Indication (Handle: 0x0092), long length -> We send Handle Value Confirmation
6.) Disconnect

Now, with correct sequences at the beginning of the HCI logs, you can investigate this behavior, then you will see the problematic parts from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): step5 is missing. There are more than 2 hours, and more than 30000 HCI Packets trying to connect to this Client. During this period, any other device can perfectly connect to the same Client (therefore Client is behaving correctly). Also, during this period, in case I deny the MTU change, ESP will forward the indication in step5, with multiple indications (23-23-23-17).

Thank you very much, I appreciate your time spent on this.

esp.log

@esp-zhp
Copy link
Collaborator

esp-zhp commented Oct 10, 2024

1-create connection
In the HCI log at line 37761, I noticed that the connection was not successfully established. I also observed that the BlueZ host canceled the connection after 4 seconds.

I believe it’s normal for a connection to occasionally fail within 4 seconds, as there could be interference over the air.
image

I reviewed the subsequent logs and saw that the connection was successfully established later. Therefore, I didn’t find any issues.
image
image

2-GATT writes confirmed! Short GATT Indications are received! But not a single sing of the long indication.
At line 14593, everything works well, but at line 15312, there is no indication.

I don’t believe this is an issue with the ESP32.
For the ESP32 controller, it does not differentiate between the types and contents of ATT packets; all packets are reported to the host for processing. The ESP32 controller did not receive the indication, because other ATT packets are still being reported to the host normally.
I think the peer device didn’t send the indication at all, so the ESP32 controller couldn’t receive it. BlueZ wouldn't receive the indication either.

line 14593(work well)
image

line 15312(no indicate)
image

You should check why the peer device didn’t send the indication.
you can debug why the peer device did not send the indication based on the following points:

1-Why does the peer device consistently send MTU requests?
2-Are the contents of the write operations we perform on the peer device all correct?
I suspect this is an issue with the application layer logic, not a bug in the Bluetooth protocol stack.
I noticed that the write content sent by the ESP32 is not entirely consistent. For example:
image
image

@danergo
Copy link
Author

danergo commented Nov 7, 2024

@esp-zhp: i have a new idea:

Creating a firmware in which we can turn on/off logs dynamically via debug uart (i.e.: not hardcode log setting into firmware). In this case we can start with hci logging turned off, then once issue occurs, we can turn on hci logging over debug port, without rebooting esp.

What do you think? Can this help? How shall we read from debug uart without disturbing the btdm controller?

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 8, 2024

@danergo
Sorry for the late reply again.

Based on your description, there is a version with HCI logging enabled. If HCI logging is always on, it works fine, but power consumption will be higher. For now, you can use the library with HCI logging enabled to continue your development.

You mentioned: “Creating a firmware that allows us to dynamically turn logging on and off via the debug UART (i.e., without hardcoding log settings into the firmware). In this case, we can start with HCI logging turned off, and then, if an issue occurs, enable HCI logging through the debug port without rebooting the ESP.”

I’m not sure if this approach will work, as I haven’t yet identified the root cause of the issue. I will update the library and plan to add debug information for suspected areas. When an issue occurs, you can dump this debug information (I’ll provide a new API for this), and I’ll use the data to help resolve the problem.

I’m still working on the debug library, and this may take some time. In the meantime, please proceed with your testing and development.

@danergo
Copy link
Author

danergo commented Nov 8, 2024

Okay, thank you!

@danergo
Copy link
Author

danergo commented Nov 12, 2024

Hi, @esp-zhp: did you have some time to progress?

We are running our main service with HCI logs enabled, and it's smooth. But I think this shall not be considered as a final solution.

Thank you!

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 13, 2024

yes,I will give you lib this week

@danergo
Copy link
Author

danergo commented Nov 13, 2024

Dear @esp-zhp:

We are sending you a new "proof" capture: this time with:

  • local btmon log (on RPi which ESP is attached to)
  • minicom debug log (ESP's debug log), with full HCI log enabled!

This contain a LOT of garbage, please forgive us. Please focus on the ATT frames. You will see normal procedures of GATT writes and confirmations. Focus on 0x0092 Write Response!

You'll see: 0x0092 responses are coming until ID#634057 (last Write Response in log). Afterwards, you'll see no more Write Responses (for 0x0092).

As this time we have the debug log enabled, we can state, that simply enabling "HCI verbose log" doesn't solve this problem. (Luckily! At least, logging is not hiding this bug :) )

Please check our logs, maybe you'll see something interesting which can help you with this topic.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 15, 2024

Here’s the English translation and refinement:


I analyzed the HCI log file “minicom_newproof.cap” that you provided. Using internal tools, I converted the HCI log into a BTsnoop file format, which is compatible with software like Wireshark and "Wireless Protocol Suite."

At timestamp 654,633, there is an ATT_WRITE_REQ with handle 147:

ATT_WRITE_REQ  147  Not Mapped  0(C)  Unknown [3]  14  00:00:00.000063  2024/11/14 11:29:20.135370  147

I noticed that the controller did not subsequently receive any HCI data related to Attribute Handle: 146. In other words, the controller did not receive the write request (handle 0x92) issued by the Bluz host. This behavior appears abnormal.
I don’t fully understand this behavior. Why didn’t the host continue to send the write request (handle 0x92)? Or did it send the request, but the controller simply didn’t receive it?
image

ps:HCI log from ESP32(decoded)
I will send you by email to protect your privacy.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 15, 2024

From the HCI log on your host side, it appears that Bluz sent a write request (handle 0x92), but the ESP32 controller either did not send the packet or failed to send it promptly. This seems to be an issue on the ESP32 controller's side.

However, from the ESP32 controller's perspective, it did not receive the write request (handle 0x92).

I’m unsure how the HCI log on Bluz was captured or whether it is entirely accurate.
If this issue occurs, could restarting Bluz while keeping the ESP32 controller running resolve the problem?

@danergo
Copy link
Author

danergo commented Nov 17, 2024

@esp-zhp:

I noticed that the controller did not subsequently receive any HCI data related to Attribute Handle: 146. In other words, the controller did not receive the write request (handle 0x92) issued by the Bluz host. This behavior appears abnormal.
I don’t fully understand this behavior. Why didn’t the host continue to send the write request (handle 0x92)? Or did it send the request, but the controller simply didn’t receive it?

That's exactly the root cause of the problem! Please note: all ATT 146 messages are long (longer than 63 bytes).

I have tested this many times: after ESP (or BlueZ) gets into this problematic stage, any write request (ATT handle doesn't matter) with data length above 63 won't send out (anything below or equal 63 will be sent out perfectly)!

Let's try to identify the wrong part here:
1.) BlueZ doesn't forwards the 146: can be, but highly unlikely, as btmon (part of BlueZ) still logs the long gatt writes even during the problematic stage.
2.) ESP doesn't receive the 146: can be, as there is no 146 in the decoded sniff.

For the 100% proof, let me wait for a couple of days to let the problematic stage arrive, and then I'll hook up the oscillator to the ESP's RX pin, to make you a final proof. :)

Thank you!

@danergo
Copy link
Author

danergo commented Nov 20, 2024

Hi, all (@esp-zhp).

We have concluded a final test with an O-scope tapped onto the TX pin of RPi (RX pin of ESP32).

Our assumption has been 100% validated now: RPi sends the long gatt write, but ESP32 doesn't acknowledge it.

Please see the shared log: it's a small one!

Facts:

  1. First we send a few short gatt writes (0x00001111): works OK (Write Request + Write Response)
  2. Secondly, we send a few medium length gatt writes (0x0000111222333444555): works OK (Write Request + Write Response)
  3. Lastly, we send long gatt writes (0x0000000011111111...eeeeeeeeffffffff): FAIL at ESP side!: it is not acknowledging the write.

For the last step (3), we have recorded the screen of the scope: this is the proof, that ESP's RX pin is receiving this long write (takes almost 700us), but I assume you will not see this in the debug log from ESP (shared with you too):

image

This is proved now an ESP-side bug, and it is present in the closed-source library, so we can't do anything else on our side to fix it.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 22, 2024

@danergo
Thank you for your feedback.

  1. You've demonstrated with an oscilloscope that the host is sending data to the ESP32 controller, which is great. However, there's still a question: the oscilloscope lacks the functionality of a logic analyzer. This means we cannot be 100% certain that the transmitted data is correct unless we use a logic analyzer to parse the UART data from the ESP32's RX pin.
    That said, we currently have strong reason to believe that the ESP32 UART is not correctly receiving the data. While this cannot be confirmed with absolute certainty at the moment, it is highly probable that the ESP32 is failing to receive the data properly.

  2. Additionally, I reanalyzed the minicom_newproof.cap file you provided, which shows all the data received by the ESP32 controller.
    I found an anomaly (which is crucial):
    Normally, ACL data begins with 0x02, as per the specification. However, during the issue, the data received by the ESP32 was missing the first 5 bytes. Within the received data, "12 92 00" indicates a write request to handle 0x92, which is the data we need!

image
image

  1. Continuing with debugging:
    (Assuming the host's transmitted data is 100% correct)
    The reason the write request wasn't sent out is that the UART failed to receive the data correctly and dropped part of it. Here are some potential points to investigate:

    3.1 Has the ESP32 entered sleep mode?
    By default, it doesn't unless you've modified the sdkconfig. If you have, could you share your sdkconfig file with us?

    3.2 Is UART flow control enabled on the ESP32?
    Please confirm whether CTS (GPIO 23) and RTS (GPIO 19) are being used. If not, UART issues may arise.

    3.3 Add some UART debugging:
    When the issue occurs (i.e., when the write request fails to send), print the UART registers. Here's a sample code snippet:

    #include "soc/uart_reg.h"
    #include "driver/uart.h"
    uint32_t int_raw_status = REG_READ(UART_INT_RAW_REG(CONFIG_BTDM_CTRL_HCI_UART_NO));
  2. A potential solution:
    The ESP32 controller supports both UART and VHCI methods.
    (Top) → Component config → Bluetooth → Controller Options → HCI mode

    Currently, the esp-idf/examples/bluetooth/hci/controller_hci_uart_esp32 example uses UART directly instead of VHCI, and it doesn't leverage the IDF's UART driver.
    For other BLE and BT examples, we typically use VHCI directly.
    Here are the relevant functions:

    /* VHCI */
    extern bool API_vhci_host_check_send_available(void);
    extern void API_vhci_host_send_packet(uint8_t *data, uint16_t len);
    extern int API_vhci_host_register_callback(const vhci_host_callback_t *callback);

    Currently, the issue lies in the ESP32 controller not receiving data (by uart) from the host correctly.
    Therefore, you can transmit HCI data via the VHCI interface instead of using the controller's UART.
    Specifically:

    • Implement a UART program using the IDF driver to receive data, then send it to the controller via the VHCI interface.
    • Similarly, data from the controller can also be returned through VHCI.

    VHCI has already undergone extensive testing, so I believe it can address your issue.

    This task can be undertaken by either you or us, though it will take some time.
    Additionally, I noticed that someone has already started working on this, but I haven’t tested their implementation yet.
    Here is the code link:
    [ESP-Hosted Network Adapter Bluetooth Slave](https://github.com/espressif/esp-hosted/blob/master/esp_hosted_ng/esp/esp_driver/network_adapter/main/slave_bt.c)

@danergo
Copy link
Author

danergo commented Nov 22, 2024

Thank you for extensive description.

I am more than sure that my long write is correct although I have shared a scope screenshot and not a signal analyser (measured shorter gatt writes, and scope showed shorter data flow).

However, we do have signal analyser as well, will provide you that evidence too.

Will also share our sdkconfig, sleep mode (as far as I remember) is enabled.

Will attach it here.

Thank you!

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 22, 2024

Great! Make sure to print the relevant registers when an issue occurs.
Additionally:

Disable sleep mode.
Enable UART flow control.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 22, 2024

how do you think A potential solution?

@danergo
Copy link
Author

danergo commented Nov 22, 2024

Uart flow control is enabled.

Anyway, for VHCI we have just a slight knowledge.
How shall we attach to the BlueZ? With btattach? Will vhci present a device in /dev?

Thank you. Will get back to you with details later today.

@danergo
Copy link
Author

danergo commented Nov 22, 2024

OMG, my sdkconfig is huge, compared to example's. Please find it attached here.

Many sleep configs are enabled (which might have been modified by us). Will do now 2 things:

1.) Signal-trace the RX pin for the HCI data
2.) Recompile the controller's code with default sdkconfig (or a slightly modified, without sleep).

We have high hopes about this sleep configuration that it can solve this problem :)

Thank you!

sdkconfig.zip

@danergo
Copy link
Author

danergo commented Nov 24, 2024

Hi, @esp-zhp:

I have some news!

We hooked up our signal tracer onto the RX pin, and see what happened. We measured two cases:

1.) gatt write request with data of 0000000011111111:

Correct

This is great, works OK.

2.) gatt write request with data of 0000000011111111...ffffffff:

Incorrect

As you can see, RPi is sending only 64bytes of data!

By this time, we thought this must be on RPi (BlueZ or kernel) side.

But we went a little further and reverted back our sdkconfig to the defaults (no sleep enabled now).
So, we also had to reflash the ESP, and of course we had to restart BlueZ too. Signal tracer was still hooked onto the RX pin, and by some random idea we did another measurement (now gatt long write is working, as usual after reboot).

But then we have some amazing result:

image

As you can see, there are two "batches"! First batch is exactly 64bytes long. Delay between two batches is approximately 100us.

Then we pushed even further with a very long gatt write:

image

No need to say anything. Our SPI to UART component have a 64byte FIFO buffer.

Is is possible that ESP is holding down (or up) the CTS (or RTS?) pin, in order to prevent further data transmission because it must process the current batch?

In full UART communication, which pin is used (driven) by ESP in order to signal the RPi to prevent data transmission? RTS or CTS?

Thank you very much!

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 25, 2024

Based on your feedback, the current issue seems to be that the Raspberry Pi is not transmitting data correctly, rather than a problem with the ESP32 receiving end. Is that correct?


When hardware and drivers support RTS and CTS, their behavior is automatically managed, and user intervention is typically not required.

If you wish to manually intervene, it should be feasible. Feel free to test it.
However, may I ask why you want to take this approach?


RTS and CTS Collaboration

RTS and CTS work together to achieve hardware flow control:

  1. When Device A is ready to send data, it pulls its RTS pin low to notify Device B.
  2. Device B checks if it can receive data. If it can, it pulls its CTS pin low to notify Device A.
  3. Once Device A detects a valid CTS signal, it starts transmitting data.
  4. If Device B finds it cannot continue receiving data during transmission (e.g., buffer full), it pulls CTS high to signal Device A to stop transmitting.

From the logs:

I (2168) CONTROLLER_UART_HCI: HCI UART1 Pin select: TX 5, RX 18, CTS 23, RTS 19 Baudrate: 921600

Pin CTS 23 is used to signal the Raspberry Pi to prevent data transmission.

@danergo
Copy link
Author

danergo commented Nov 25, 2024

Thank you.

Based on your feedback, the current issue seems to be that the Raspberry Pi is not transmitting data correctly, rather than a problem with the ESP32 receiving end. Is that correct?

Yes, it seems being on our side, so we wish to say a huge sorry and thank for your kind help through all of this.

We don't want to control the CTS/RTS behavior. We just wanted to check weather it's ESP not accepting more, or RPi can't send more.

uart_set_pin(CONFIG_BT_HCI_UART_NO, 5, 18, 19, 23);

This case we need to trace PIN 23 on ESP to see if it's need to stop us sending, right?

One final test we must do to confirm this is really on the RPi side: once it fails again, we need to signal trace the CTS too: it still can be that in failed case ESP is not pulling the CTS line low fast enough for RPi to continue sending and it halts the transmission (although it's fairly unlikely - BUT! In case we don't reset ESP after failed case, it's not getting back to normal, so still pretty mysterious).

Thank you!

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 25, 2024

Since you plan to continue testing, you can track the following signals:

  • CTS 23
  • RTS 19
  • Raspberry Pi's TX signal
  • ESP32's TX signal printed in the terminal (the terminal will display all data received from the Raspberry Pi, so it's recommended to disable other unrelated prints)

By monitoring these signals, I believe we can ultimately determine whether it's the ESP32 not accepting more data, or the Raspberry Pi being unable to send more.

Looking forward to your further response.

@danergo
Copy link
Author

danergo commented Nov 25, 2024

Thank you, I really appreciate this.

Test is now running, waiting for the stuck. After it happens again, we will analyze these signals and report it back here for sure.

Best regards until then

@danergo
Copy link
Author

danergo commented Nov 27, 2024

Hi!

We got a recommendation of update our kernel (to 6.12 from 6.1). Our SPI2UART silicon might had some "silicon-bug" which is published as errata. Kernel developers worked on that, and seems to be fixed in latest kernel.

Bug was corrupted FIFO due to wrong interrupt timings, therefore losing data.

We are still not 100% sure, but fact is fact: it runs now since 18hrs, without a single lost byte. As it was some cases when it reached to more, we still wait for a couple of days, but in case it won't be any lost byte and missed packet, we shall consider this as "wontfix", because this was indeed not caused by ESP in any manner. In this case I wish to apologize again for taking your precious time.

Thank you!

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 28, 2024

@danergo
nothing.
thanks for your update.
If there are any updates, please let me know.

@danergo
Copy link
Author

danergo commented Dec 1, 2024

@esp-zhp:

We forwarded you our latest log. With the new kernel, communication between Host and ESP is super-smooth, running since more than 4 days without any single lost byte.

There were 2 "glitches" though, which you might give us some input on:

In the hci log (btmon_v612.pcap), we see "Hardware Error"s from controller at IDs:

  • 423920
  • 443764

Can you check the minicom log (debug log from ESP, also shared), in regards for these two events? Do you see any uncommon communication from our side, or any reason why ESP is presenting this Hardware Error?

Please note, the logs are quite large, as we tried to stress the system as much as we can, disabling LE ad filtering completely.

Thank you!

@esp-zhp
Copy link
Collaborator

esp-zhp commented Dec 2, 2024

Thank you for the detailed information. The original issue related to the Linux kernel appears resolved, so that can be considered closed for now.

Regarding the new issue, I will examine the provided btmon_v612.pcap and minicom logs for the two identified "Hardware Error" events at IDs 423920 and 443764. I'll check for any anomalies or potential reasons for these errors in the ESP debug log.

Since this is a new issue, it would be better to track it separately. Please create a new GitHub issue specifically for this problem. This will help us streamline the analysis and resolution process.

I'll provide feedback here as soon as I've reviewed the logs!

@danergo
Copy link
Author

danergo commented Dec 2, 2024

The original issue related to the Linux kernel appears resolved, so that can be considered closed for now.

Exactly, thank you! It's running now over a week (8days), without a single byte of lost data, so we are pretty confident this original issue was caused by a kernel bug (which was related to a silicon bug).

So we are thankful for you providing so much information on this topic.

We created now a new topic for the hardware error here: #14964.

Let's continue there, and finally close this one :)

Thank you.

@danergo danergo closed this as completed Dec 2, 2024
@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: In Progress Work is in progress labels Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

3 participants