I2C data corruption with timer and UDP server running (IDFGH-11762) #12860
Comments
Q1: From your description, do you mean that the data on the oscilloscope is correct, but the data the ESP received is wrong?
@mythbuster5
Hi, when disabling the UDP server or the BLE scanning, the problem seems to occur much less often. Our guess is that the problem is coming from the I2C driver (or how we use it). The UDP server and BLE scanning just use the heap heavily, which creates an environment where memory corruption is much more likely. We have tested and reproduced the issue on v5.0 and v5.1.
Seems to be related to #7781
Some more tests have been performed, and they gave new insights into the errors.
Error analysis
Measurement format
In the original code, a current, voltage and temperature measurement (using I2C as described in the issue) are performed sequentially in a FreeRTOS task. This task then waits 10 ms (using vTaskDelay) and performs the three measurements again, repeating indefinitely. Each measurement consists of 2 bytes of data. A diagram is shown below. I1 is the most significant byte of the current measurement and I0 the least significant byte. The same convention is used for the voltage and temperature.
Error format
A pattern was discovered in the errors, which can be separated into multiple cases. Keep in mind that the I2C bus was monitored and that this bus contains no errors. All the errors happen internally in the ESP32.
Temperature errors
The temperature errors are always the same. The first byte of the temperature, T1, is replaced by the last byte of the voltage measurement, V0, as shown below.
Current errors
There are two types of current errors. The first is similar to the temperature error: the first byte of the current, I1, is replaced by the last byte of the previous measurement, T0. This is shown below.
The second error is more complicated, as I have no idea where the erroneous byte comes from. The last byte of the current, I0, is replaced by a value that is constant within one run of the code. Restarting the ESP32 (without rebuilding/reflashing) can change that value, but not always. The observed values are 0x04 and 0x8D, but I have no idea what causes these bytes to appear there.
FreeRTOS timing
While I was monitoring the bus with an oscilloscope and logging the measured data with timestamps using a logging script, a weird timing behavior was discovered. The FreeRTOS tick rate is set to 100 ticks per second, which gives a minimum interval of 10 ms between two measurement timestamps. An example of the normal behavior is shown below.
Whenever a temperature error occurs, the OS misses its 10 ms mark and instead logs a timestamp at a 15 ms mark. It then waits another 15 ms before proceeding with the expected 10 ms between two logged timestamps. I would expect that whenever the OS cannot keep up with its configured tick rate (due to, for example, CPU overload), it would skip a tick, resulting in a delta of 20 ms between two timestamps. An example of the error and the expected behavior is shown below.
The current error is again more complex. Whenever the erroneous current is negative (MSB of I1 being 1), the OS misses its 10 ms mark and instead logs a timestamp at the 12 ms mark. It then waits 18 ms before proceeding with the expected 10 ms delta between two timestamps. Whenever the erroneous current is positive (MSB of I1 being 0), the 10 ms mark is reached and no timing problems are visible in the timestamp logging.
Example code ESP32-Ethernet-Kit V1.2
I provided example code in the original issue to easily reproduce the errors. The new insights into the replaced bytes explain the error pattern of the example code. As only temperature measurements are performed, the first byte of the newly measured temperature is replaced by the last byte of the previously measured temperature. The temperature is relatively stable, which results in two identical bytes in the reported erroneous temperature measurement.
Notes and other tests
@mythbuster5 Would you have any insights on this weird behavior? Thanks in advance. RVB
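The byte-substitution cases from the error analysis above can be expressed as a small host-side classifier, which may help when post-processing logs that contain both the oscilloscope capture and the value the ESP32 reported. This is only a sketch: the type and function names (`i2c_err_kind_t`, `classify_read`) are hypothetical and not part of ESP-IDF, and it assumes each measurement is exactly 2 bytes, MSB first, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the error cases described in the analysis. */
typedef enum {
    I2C_ERR_NONE,              /* reported bytes match the bus            */
    I2C_ERR_MSB_FROM_PREV_LSB, /* first byte == last byte of previous read */
    I2C_ERR_LSB_REPLACED,      /* only the last byte differs (e.g. 0x04)   */
    I2C_ERR_OTHER              /* anything not covered by the cases above  */
} i2c_err_kind_t;

/*
 * on_wire  : the 2 bytes seen on the bus (e.g. decoded from the scope)
 * reported : the 2 bytes the firmware logged for the same read
 * prev_lsb : last byte of the measurement that preceded this one
 */
static i2c_err_kind_t classify_read(const uint8_t on_wire[2],
                                    const uint8_t reported[2],
                                    uint8_t prev_lsb)
{
    assert(on_wire != NULL && reported != NULL);

    int msb_ok = (reported[0] == on_wire[0]);
    int lsb_ok = (reported[1] == on_wire[1]);

    if (msb_ok && lsb_ok) {
        return I2C_ERR_NONE;
    }
    if (!msb_ok && lsb_ok && reported[0] == prev_lsb) {
        return I2C_ERR_MSB_FROM_PREV_LSB;
    }
    if (msb_ok && !lsb_ok) {
        return I2C_ERR_LSB_REPLACED;
    }
    return I2C_ERR_OTHER;
}
```

For a temperature read, `prev_lsb` would be V0; for a current read it would be T0 from the previous loop iteration.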
I can totally reproduce this issue on the Espressif devkit
We have been able to make the problematic code a lot smaller: it's only 170 lines of C now, in a single file, based on the 'i2c simple example' https://github.com/espressif/esp-idf/tree/v5.1.2/examples/peripherals/i2c/i2c_simple. We have determined that the network stack is not involved in this bug, so we've eliminated it from the code. I've attached the code as a zip file. As for other suggestions:
This does not fix the problem, corruption still happens (but less often)
Increased the timer from 1 to 10 us, corruption still happens
However, with the affinity set to CPU1, the corruption does not seem to happen anymore. We're not satisfied with this solution yet, for the following reasons:
We have also reduced the hardware needed to reproduce this bug. We are able to reproduce this on just an official ESP32-Ethernet-Kit_A_V1.2 with a PCT2075 sensor module from Adafruit (https://www.adafruit.com/product/4369). This sensor is likely also available from other vendors, in case Adafruit does not ship to your region.
We received support from Espressif. There was indeed an issue with the I2C FIFO; the following patch, provided by one of their employees, fixes it:
From fb0c921cc6c93a755f3f39f472fc88b59d130dad Mon Sep 17 00:00:00 2001
From: Jacques_Zhao <[email protected]>
Date: Fri, 30 Aug 2024 19:23:45 +0800
Subject: [PATCH] i2c: fix i2c read error
---
components/hal/esp32/include/hal/i2c_ll.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/components/hal/esp32/include/hal/i2c_ll.h b/components/hal/esp32/include/hal/i2c_ll.h
index f2903de44a..f1aa6aacf9 100644
--- a/components/hal/esp32/include/hal/i2c_ll.h
+++ b/components/hal/esp32/include/hal/i2c_ll.h
@@ -518,6 +518,7 @@ static inline void i2c_ll_get_scl_timing(i2c_dev_t *hw, int *high_period, int *l
 __attribute__((always_inline))
 static inline void i2c_ll_write_txfifo(i2c_dev_t *hw, const uint8_t *ptr, uint8_t len)
 {
+    hw->fifo_conf.nonfifo_en = 0;
     uint32_t fifo_addr = (hw == &I2C0) ? 0x6001301c : 0x6002701c;
     for(int i = 0; i < len; i++) {
         WRITE_PERI_REG(fifo_addr, ptr[i]);
@@ -536,9 +537,14 @@ static inline void i2c_ll_write_txfifo(i2c_dev_t *hw, const uint8_t *ptr, uint8_
 __attribute__((always_inline))
 static inline void i2c_ll_read_rxfifo(i2c_dev_t *hw, uint8_t *ptr, uint8_t len)
 {
+    hw->fifo_conf.nonfifo_en = 1;
     for(int i = 0; i < len; i++) {
-        ptr[i] = HAL_FORCE_READ_U32_REG_FIELD(hw->fifo_data, data);
+        ptr[i] = hw->ram_data[i];
     }
+    hw->fifo_conf.nonfifo_en = 0;
+
+    hw->fifo_conf.rx_fifo_rst = 1;
+    hw->fifo_conf.rx_fifo_rst = 0;
 }
 /**
--
2.34.1
@mythbuster5 Could you review #12860 (comment) ?
We have tested this on 10 boards for two weeks and haven't had a single error since (we used to have multiple per 10 minutes). The patch above was sent by an Espressif employee, thanks for the support!
Is there anything wrong with the above fix? (I'm wondering why it's still not fixed on GitHub.)
@AxelLin This was provided by me, actually. I'm still doing internal testing. It is really hard to debug because it is related to Wi-Fi/BT etc. Sorry for the long wait.
That fix looks like a workaround; however, it also proves the issue is real.
@mythbuster5
We spent a while troubleshooting an I2C corruption issue on our product. Occasionally an I2C read value is corrupted. The data on the wire is correct, but the ESP32 reads it incorrectly. I found several other reports of issues reading I2C:
We tried the suggestions mentioned in those posts:
Finally found this recent report with a patch! And this commit that looks like a simpler workaround based on the same idea:
I applied this one-line change to IDF 4.4.8 (also had to #include "soc/dport_access.h") and the problem has not occurred after running for several days. I suspect those other issues on esp32.com are related to this and should be updated to indicate there may be a fix.
Answers checklist.
IDF version.
v5.1.2 (also tested on master)
Espressif SoC revision.
ESP32 (revision v3.1)
Operating System used.
Windows
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
None
Development Kit.
ESP32-Ethernet-Kit-V1.2 and custom board
Power Supply used.
External 5V
What is the expected behavior?
The temperature is monitored using a PCT2075 over an I2C bus, while an auto-reload esp timer that triggers a level 3 interrupt is running. A UDP server is also set up. It is expected that the temperature is output without corrupted data.
What is the actual behavior?
The temperature that is monitored by the setup described above gives reasonable data most of the time, but randomly logs temperature spikes. These spikes seem to happen at random moments. Sometimes the corrupted data consists of two identical consecutive bytes; other times it looks random. No pattern has been seen yet. The data on the I2C bus has been checked and does not show any of those temperature spikes. The device does not crash.
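To illustrate why a single wrong byte shows up as a large temperature spike, the sketch below decodes the two bytes read from the sensor and applies the 20°C / 60°C logging window used in the reproduction steps. It assumes the LM75-style register layout commonly documented for the PCT2075 (11-bit two's-complement value, left-justified in 16 bits, 0.125°C per step); please check this against the datasheet. The function names are hypothetical and do not come from the attached project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 2-byte temperature read (MSB first).
 * Assumption: 11-bit two's-complement value in bits 15..5,
 * 0.125 degC per step, low 5 bits unused. */
static double pct2075_to_celsius(const uint8_t raw[2])
{
    assert(raw != NULL);

    int32_t word = ((int32_t)raw[0] << 8) | raw[1];
    word &= 0xFFE0;            /* drop the unused low 5 bits       */
    if (word & 0x8000) {
        word -= 0x10000;       /* sign-extend the 16-bit pattern   */
    }
    return (double)(word / 32) * 0.125;
}

/* Logging window from the reproduction steps: below 20 or above 60 degC. */
static int is_spike(double celsius)
{
    return celsius < 20.0 || celsius > 60.0;
}

/* Hypothetical helper: build a corrupted read in which the first byte
 * holds the last byte of the previous read. */
static void corrupt_msb(const uint8_t prev[2], const uint8_t cur[2],
                        uint8_t out[2])
{
    assert(prev != NULL && cur != NULL && out != NULL);
    out[0] = prev[1];
    out[1] = cur[1];
}
```

For example, two consecutive reads of 0x19 0xE0 decode to 25.875°C each. If the first byte of the second read is replaced by the last byte of the first read, the result is 0xE0 0xE0, which decodes to -31.125°C and falls outside the logging window.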
Steps to reproduce.
Connect the SDA and SCL pin of the Adafruit PCT2075 to IO2 and IO4 of the ESP32-Ethernet-Kit V1.2, respectively.
Connect the address pins of the PCT2075 to ground or 3V3 (make sure to change to the appropriate address in the code: ec_control.c --> PCT2075_I2C_ADDR).
Connect the PCT2075 to GND and 3V3.
Connect the ESP32-Ethernet-Kit V1.2 to a PoE capable device.
Make sure the interrupt level of the 'High resolution timer (esp_timer)' is set to '3' and the 'Support ISR dispatch method' checkbox is active in the sdkconfig.
Build and flash the project found in the attached files.
Open the monitor; an IP address will be assigned to the device and temperatures below 20°C and above 60°C will be logged. The measurements before and after the erroneous data are also logged.
The occurrence of errors can be significantly increased by flooding the device with ARP messages. This can be done by:
EC_controller_test.zip
Debug Logs.
More Information.
Initial Setup
Custom PCB
The custom design is a PCB containing:
Errors on the custom PCB
I2C
The custom board sporadically reported current and temperature spikes (both positive and negative) at random moments. Those spikes do not happen at the same time. The I2C bus was monitored with an oscilloscope and did not show any sign of corrupted data sent over the bus. The time between two spikes ranges from a few seconds to a couple of hours.
We discovered later that the rate of erroneous values is increased by flooding the network with ARP messages. Disabling the initialization of the UDP server removed the current and temperature spikes.
It was also discovered that disabling the esp timer callback removes the current and temperature spikes as well. However, enabling the callback with an empty function still gives erroneous data. Increasing the timer's frequency increases the error rate. The frequency cannot be set too high, as that introduces watchdog timeouts.
Together, the increase in timer frequency and the ARP flooding consistently raise the error rate to a couple of spikes per 10 minutes.
SPI
The SPI bus reads from the DAC are randomly converted to writes, which gives unwanted values at the output of the DAC (confirmed by monitoring the SPI bus with an oscilloscope). ARP flooding has no impact on the rate of these converted SPI reads. However, increasing the timer's frequency increases their number. It is still unclear whether the SPI and I2C errors are related to each other.
Tests
The system has been tested for stack overflows, task sizes, memory leaks...
The power supply is stable.
Also tested:
but none of the above helped to resolve the weird behavior of the system.
ESP32-Ethernet-Kit V1.2
First, the code has been reduced to its minimum, while still showing erroneous data on the custom board. Therefore only the UDP server initialization (no active task), the auto-reload timer with an empty callback and a task that reads the temperature sensor using I2C have been preserved. This reduces the errors to only temperature spikes. This code has been ported to the ESP32-Ethernet-Kit V1.2 in combination with an Adafruit PCT2075.
To increase the number of errors, the timer auto-reload value has been set to 1 µs and the number of I2C reads has been increased. To be clear, the errors still occur without those changes, but they can then take hours to appear.
Does anyone know what is going on with this specific combination of UDP server, auto-reload timer and I2C bus?
Thanks in advance
RVB