Document Wire timeout API on website #895
Comments
What about the WIRE_HAS_END macro? It is used to indicate the existence of end(). Also, it seems like the WIRE_HAS_END macro should show up somewhere on the end() web page. Perhaps add a section called "NOTES" or "OTHER INFORMATION" to the documentation web pages for the information about the macro. |
Thanks, forgot about that one. I added it now.
Good point. It was already on the
I added a "Portability Notes" section now, which seems more specific. |
Sounds great. |
Got here after reading arduino/ArduinoCore-avr#42 and related issues/pr. It would be great to also add a note about it directly to the Wire page too https://www.arduino.cc/en/Reference/Wire. Having it visible like that can potentially save a lot of pain. Also should there be some extra warning or links about the concerns mentioned here (if confirmed to be valid)? arduino/ArduinoCore-avr#42 (comment). The claim seems to be that it is a good idea to take timeout cases as a warning sign of hardware issues in some cases, which if left alone can lead to other issues. |
@freddyrios, But I do agree with you that it would be a good idea to have some sort of note/information about the potential causes of timeouts and seriousness of i2c bus signal corruption. |
Good suggestion, I added a small paragraph to the first post.
There was already this bit:
But, you make a good point, so I added a more specific warning to the main description. How do these look? |
IMO, there needs to be a bit more about timeouts. In the real-world cases I've seen, there is often quite a bit of data corruption before a lockup (now timeout) would happen. The reason I think additional text is needed is that, in the issue thread about adding the timeout, several of the posters seem to be under the incorrect assumption that just having a timeout in the Wire library and retries on top of that, either in a higher level library that uses Wire or in the sketch, can fix things. But that is definitely not the case. |
The way it reads now for me is that there is a note in the Wire page that leads you to the method doc, which has this fairly high up in the text:
Sounds good to me, as very early it is talking about the actual problems underneath. Of course the more info to help us outsiders avoid the pitfalls the better. Links to any relevant topics that point people in the right direction(s) to help solve the real root causes would be incredibly helpful (and almost certainly get people to give it a shot). |
My issue is with
It hints that even though there may be timeouts, the system is still capable of properly functioning. So, IMO, the text could/should carry a bit stronger message than just saying it is "not recommended". |
The way I meant this is that if there are bus lockups or other problems, you should not just enable timeouts and expect that to make things run properly. But we can make it stronger, how about this?
I don't want to go into too much detail here, also since this page is about the Wire library in general, not necessarily specific to the AVR hardware. But external links could probably be added, if anyone knows of appropriate ones. |
I like the added detail. Here is an additional tweak:
Not sure how much, if any, information to provide on how to identify/solve potential issues, although it might be worth mentioning some of the more common issues, such as poor wiring/connections, or attempting to use "long" wires (I know, we would need to try to somehow describe what "long" means). I think the most important thing is the information about timeouts that hopefully gets the message out that Wire library lockups or timeouts are an indication of other h/w issues that cannot be fully resolved in s/w in the sketch or the library. |
Because I2C uses both a strong low state and a weak high state with a pullup resistor, it's very susceptible to all kinds of electrical noise that are not the sign of h/w issues. Position your device close to a dimmer controlled brushed motor and see the error count increase, or just be in a thunderstorm. Shielding can be effective but has its limits.
As a side note, the problem with hd44780 LCD with a PCF8574 backpack is tricky and very specific. It occurs because this display is 'write only' and there is no way to read from it and know if the nibbles are in the correct order or inverted causing display corruption. This is a very rare case as all other devices I can think of have some ability to be read that would allow such a problem to be detected. From a reliability point of view, the hd44780 LCD + PCF8574 combination is not a good solution. If it needs to be used, a periodic display reset sequence (the only way to reorder the nibbles) should be implemented, the delay being a compromise between how long a garbled display can be tolerated and the short visible glitch that's visible each time the display is reset. |
@ermtl, most of what you write seems like good advice, though I'm not entirely sure what you intend with this? I'm not sure if this much detail is appropriate for this reference documentation page, or did you have something else in mind? Concerning timeouts, two additional remarks:
|
@matthijskooijman what I mean is that a major cause for I2C errors is unavoidable random electrical noise. Given enough time, these will occur regardless of how well a device is designed. The easiest way to get a taste of it is to listen to an AM radio background noise between stations. Most of it is mild, but you'll occasionally hear some very strong ones related to all sorts of electrical noise sources occurring in the neighbourhood. The weak high state on SCL and SDA makes them a lot more susceptible to that interference than regular digital lines that are driven in both states. Short connections, use of shielding, etc. will reduce the problem, but never completely eliminate it. My goal here was to counter the idea that timeouts would automatically mean a design flaw that needs to be corrected, timeout merely being a band aid. In any transmission protocol such as I2C, RS485, Ethernet or even IP networks, occasional errors and timeouts are an annoying but normal occurrence that's not almost always an indication of an underlying problem. The whole uncertainty about it does not sit well with the deterministic nature of digital electronics, but that's the world we live in ... Here is how I would change the paragraph: |
This was already discussed above and some notice was already added, but I just noticed that there was a half-finished sentence in there, and reading your proposal, I like it, so I added it verbatim. Your suggestion adds a bit more detail in a nice and concise way, thanks! |
Most of this discussion was using the words "h/w issue". @ermtl For example: I definitely don't like this statement, I believe it is very misleading in that it seems to indicate that these types of h/w errors are rare but are "normal". Also, how to fully recover from a timeout issue, can vary substantially and is not as simple as doing retries on operations that received a timeout error. So, IMO, this type of language is not very helpful:
Doing a retry after a timeout will not ensure reliable operation. The reason being that when timeouts are occurring, particularly during a write to a slave, there is no way to know what really transferred on the bus to the slave and to which slave it really went to. So, IMO, the most important thing is to explain what the timeouts mean, along with explaining to users that these timeouts are a sign of a h/w issue and that just doing retries is very unlikely to be enough to keep things going or make the transmission reliable. |
@bperrybap In most cases where a timeout occurs, it's because the communication glitch prevented it from completing. In such cases, the peripheral won't get the stop condition where it expects it and nothing is written to the peripheral, either the one that was intended or another. While a scenario where the wrong peripheral gets the data, misses the problem, stores the data while the master gets locked is possible, it would be very rare. What could happen more often is a case where SDA is affected, the data is corrupted, but the transaction completes without timeout and there is no way to know. That's an I2C limitation, it's beyond the Wire library's control. When glitches are the result of electrical noise, retrying the failed read from a sensor will in practice give a valid result. However, your suggestion that, in case of a timeout, a safe practice would be to reinitialize all the settings for all the connected I2C peripherals makes a lot of sense. Also, as a high number of timeouts could indicate a very noisy environment that could have stability implications for the device beyond its I2C peripherals, in some applications, too many timeouts could be treated like a watchdog error and trigger a software reset. Since there is no CRC check in I2C, a safer (paranoid?) option would be to read back any data written to an I2C peripheral, the odds of an undetected error at the same place on both the write and the subsequent read being incredibly low, and even lower if the read is made twice. Probably the "To ensure reliable operation, " sentence should be replaced with "To ensure a more reliable operation, " as no workaround will ensure 100% reliability. Also, I mentioned that the timeout error count should be monitored (at least during development) and that's a very important part as it's how you separate electrical glitches from design errors.
If a developer sees a glitch every few days, after carefully reviewing possible causes in the design, chances are it's random electrical disturbance. However, if it happens several times a minute without a clear reason (huge unshielded contactor, brushed motor, fluo tube ballast, etc.) there is clearly something wrong. It's impossible to say everything in a single paragraph. The old wording of the paragraph made it seem like the timeout was a band aid for sloppy design. All I wanted to add is that such a timeout can also occur in a well designed device, so the developer won't get in a deadlock, searching for a mistake that is nowhere to be found when mitigation techniques offer a way to solve the problem, and to give a hint of what such mitigation techniques could be like. |
@ermtl It isn't that simple. Some devices are not readable, some are state driven, some use a combination of writes and reads to control which register you are accessing. My biggest concern is that I don't want to give users the impression that they can work around these timeout issues by simply doing something like retries. It isn't that simple. When the signal corruption happens during the address transfer, the address is corrupted and will address the incorrect slave. In most cases a user may have very few slaves or even a single slave so the corrupted address will be addressing a non existent slave and Wire will return an error since no slave will respond. However, if you have multiple slaves, the corrupted address can match a different slave. I have also seen this happen in real world testing with many i2c lcds hooked up. I get error code 2 or 4, but usually what happens is that a nibble transfer is lost to one slave or an extra one sent to another slave and now both LCDs are hopelessly out of nibble sync and will need re-initialization. When the signal corruption happens during the actual data transfer during writes, it can be undetectable and corrupted data is written to the slave. I have seen this happen in real world tests, and when the signals are marginal and there is corruption, it happens WAY more often than these new timeout errors or some of the other Wire error codes. The point is if you are getting these new timeout errors and there are not multiple masters, then signal corruption is happening, and more than likely there is other data corruption happening that may or may not be detected and cannot ever be detected. In the case of a hd44780 LCD using an i/o expander like the PCF8574, if you get data errors or a timeout when writing to the expander, it is more often than not non recoverable since the host and the LCD will lose nibble sync.
This is why I say, at least on writes, when one of these timeout errors occurs, the user must be aware that more than likely there has been some sort of silent data corruption on previous "successful" writes so there is absolutely no way of knowing the state of all the slaves, so the only thing the user can do to ensure 100% proper operation is to re-initialize everything on all the slaves. |
Bad design is a preventable cause of I2C transmission errors and timeouts, but it's only one of the possible causes. The other major cause is electromagnetic interference beyond the control of the designer, and the problem is so pervasive there are mitigation norms (European EMC directives) that regulate what a device can emit and how resistant it must be to electromagnetic interference depending on the intended use and environment. There is a good overview document made by Texas Instruments called Understanding and Eliminating EMI in Microcontroller Applications and another called EMC design guide for STM8, STM32 and Legacy MCUs by ST that gives hints about building more resistant circuits. It won't eliminate the problem, but make the devices resistant enough to pass EMI tests and regulations. However, the goal of this page is simply to document the Wire library, not give a complete guide about good design and electromagnetic interference. As such, it needs to give the designers leads about the possible causes and remedies for timeouts. The goal for the designer should be to reduce those timeouts as much as possible, but knowing they can never be completely eliminated (even more so if multiple devices are deployed in uncontrolled environments). Considering the possibility that data to/from an I2C (or any other bus connected device) might occasionally be wrong, with or without timeouts, and hardening the software to detect and recover from it (design dependent, from a simple retry read to a complete reset) will make it overall more robust and also help solve many issues such as missing, disconnected or damaged devices.
Digital electronics is a largely successful attempt to insulate electronics from the analog chaos so that we can reason and process data with some level of certainty, however, given a chance (such as when a digital level is weakly enforced by a resistor instead of being low impedance driven as is the case on both signals with I2C), the analog world creeps in and somewhat 'interesting' stuff happens reminding us we still live in an analog world and we need to deal with it... |
@ermtl I'm aware of the real world noise issues in analog environments. I started a company that designed and shipped over 90% of the world's DSL modems in the late 90s/early 2000s. Literally 100+ million devices to pretty much every telco around the world. In fact we designed the tests for the FCC for their certifications since, at the time, this type of device was not covered by any existing tests. With DSL, signal levels, noise, and cross talk are much more complex and a much bigger issue in that environment since you can have hundreds of subscriber wires all terminating together with very low signal levels. The signaling of DSL is so complex it uses DSPs to process the signals. Where we disagree on this i2c stuff (it seems pretty heavily) is the effects of i2c signal issues and how to recover from them. There are just so many cases where there is no recovery other than to start over and re-initialize all the slaves, not just the one that received an error/exception. I disagree with the text you proposed as I believe it is misleading and in some cases factually incorrect.
I would be very light on what the user could / should do to mitigate this, as the most important thing is to convey the information that when seeing these types of timeouts, all sorts of data corruption is likely to be happening. |
Re-reading everything again, I think I agree that the proposal from @ermtl might be presenting noise and similar problems a bit too much as "fact of life" (sure, noise is not always under control of the designer, but one can always argue that if noise affects data, then the design has too little noise immunity for the environment it is in, so that's always a design problem). Anyway, I've rewritten the paragraph again, I now wrote:
I think this now makes more clear that:
@ermtl, @bperrybap What do you think of this version? |
I like it. It might be useful/helpful to mention that it is possible that data and/or state corruption can happen to any slave on the bus not just the one that was being addressed or that has just gotten the error. |
Thanks, made the changes you suggested.
Good point. Even more, if timeouts occur, there's likely noise on the bus that affects all slaves, so even without corruption on the address bits, all slaves can have seen corrupt data. I added "(also to other slaves)" to:
How's that? |
I obviously don't like this new version as it once again puts the blame on bad design and considers electrical glitches as a non issue. bperrybap wrote: "bit errors on the signals can occur anywhere, not just at the tail end of the message." but I never pretended otherwise and that is a misrepresentation of what I wrote. Any glitch occurring on SCL will result in either missed or additional clock pulses as seen by one or more slaves relative to what the master sent. The master and slave will thus get out of sync and this loss of synchronicity will show when the stop transaction fails resulting in a timeout no matter where the glitch was in the transmission, not just at the precise time of the stop condition. Another way to see the problem is from the user (programmer) perspective, since the goal here is user documentation. In the "If there are many timeouts, your design is crappy, but if it's only a rare occurrence it could be unavoidable glitches and you need to monitor it and add remediation in software" approach, the user is still urged to improve the design, but when rare glitches occur in the field, the device will be able to handle them. I think it's way more constructive and will result in more resilient designs. |
@ermtl, I'm not seeing what you are saying is in the new/updated text from @matthijskooijman
I just don't see that in the text. It says that there could be various types of issues. Perhaps it should mention that these timeouts or other issues are caused by something outside of any of the s/w in the Wire library or the sketch, to make it clear that the problem is not caused by and cannot be eliminated in s/w. Perhaps it could mention voltage/power supply issues, as I've also seen that show up quite a bit in various forums where a heater element goes on and the voltage sags or creates massive undershoot/overshoot which can cause all sorts of signal integrity issues. Also, I missed this when I read it at first but I think this part could use a tweak/correction:
What I see in real world heavy write testing in certain environments is that 100% of the time I get a Wire timeout error, there was undetected data corruption occurring prior to the timeout. In some cases going to the wrong slave.
Again, IMO, I think the most important thing to try to get across to people is to understand that these errors / timeout issues mean that there are i2c signal integrity issues. Once there are i2c signal integrity issues, lots of odd/bad things can happen like The reason to point all this out to people is to try to get them to understand that after getting these types of i2c errors, you can not just do something simple like retry the operation to always get things working again. I can tell you from reading many threads on the Arduino forum, and a long standing github issue about adding timeouts to the Wire library, that a large percentage of the people believe that all i2c errors/issues can be detected and that doing a retry of the operation after getting an i2c timeout can resolve the issue. But if they really understood what could be happening to the i2c bus and to their slaves when these timeouts are happening, then they should understand that recovery is not as simple as doing a retry on a failed operation. |
@ermtl, you write:
I don't think that a glitch on SCL necessarily results in a timeout. In most cases, the data will be shifted by one bit, resulting in invalid data, or a NACK instead of an ACK, but all of that is handled without a timeout. The only cause I've seen that causes a timeout is when noise results in either an arbitration error (noise on SDA during transmission of 1 bits) or an errant start condition (which needs noise on both channels during idle, I think). In both cases, the hardware assumes that another master is present and has an active transaction, so the hardware stalls indefinitely until the "other master" releases the bus (i.e. never).
So in my experience, this is accurate.
But enhancing the design is mitigation? I have the impression that you might have a different or more narrow definition of "design" (and "bad design") in mind than @bperrybap and myself. For me, "design" involves the entire system: Circuit, PCB layout, cabling, shielding, chassis, power supply. And IMHO any noise that results in bitflips means that your design is not sufficient to deal with the particular environment that the system is operating in at that moment. @bperrybap As for your last suggestion, I included it. I already didn't like that sentence all that much, I think your suggestion is a bit nicer indeed. |
You are all wrong. The instructions should only contain what YOU know about the event. Not your guesses. Data was not received / sent within the allotted time. |
Ran into this issue... super fun to troubleshoot. |
@per1234 I have made an internal issue about this for the content team to take a look at. |
The "Wire" library reference content is now hosted in the I see that a claim was made that the suggestions made here were incorporated during the migration of the content to this repository: #870. I haven't had the chance to review that though so I will leave this open until the resolution is verified. The published content is here: https://www.arduino.cc/reference/en/language/functions/communication/wire/ If anyone spots any omissions or errors, please let us know. |
Recently, some new timeout API methods were added to the AVR Wire library (see arduino/ArduinoCore-avr#42), which should be documented. Given there is no repository for the library reference, I'm going to report this here. While looking at the Wire docs at https://www.arduino.cc/en/Reference/Wire I noticed that the end() method is also not documented yet.
Please find a proposal for documentation below, comments welcome. I've tried to match the formatting (heading levels etc.) to the existing doc pages, but it's likely that this still needs some handwork to integrate. Also, there's a fair chance that I've written this in too much detail or technically too complex for the novice audience, so feedback on that aspect is also welcome.
Wire
Just above the "Note" section, add:
Recent versions of the Wire library can use timeouts to prevent a lockup in the face of certain problems on the bus, but this is not enabled by default (yet) in current versions. It is recommended to always enable these timeouts when using the Wire library. See the Wire.setWireTimeout function for more details.
Wire.end()
Description
Disable the Wire library, reversing the effect of Wire.begin(). To use the Wire library again after this, call Wire.begin() again.
Syntax
Wire.end()
Parameters
None.
Returns
None.
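As an illustration only, here is a minimal sketch of a portable call to Wire.end(), guarded by the WIRE_HAS_END macro described under "Portability Notes". The helper name releaseWireBus() and the FakeWire stand-in are hypothetical; the stand-in exists only so the snippet can be compiled and checked off-target, and is not part of the Wire library.

```cpp
#ifdef ARDUINO
#include <Wire.h>
#else
// Host-only stand-in (hypothetical, not part of Wire) so the logic can be
// compiled and checked without Arduino hardware.
#include <cassert>
#define WIRE_HAS_END 1
struct FakeWire {
  bool active = false;
  void begin() { active = true; }
  void end() { active = false; }
};
static FakeWire Wire;
#endif

// Disables the Wire library when the core supports it.
// Returns true if Wire.end() was actually called.
bool releaseWireBus() {
#if defined(WIRE_HAS_END)
  Wire.end();
  return true;
#else
  return false;  // Wire.end() is not available: the bus stays enabled
#endif
}
```

A sketch could call releaseWireBus() before handing the I2C pins to other code, and call Wire.begin() again afterwards to resume using the library.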
Portability Notes
This function was not available in the original version of the Wire library and might still not be available on all platforms. Code that needs to be portable across platforms and versions can use the WIRE_HAS_END macro, which is only defined when Wire.end() is available.
Wire.endTransmission()
Under "Returns", add:
Wire.setWireTimeout()
Description
Sets the timeout for Wire transmissions in master mode.
On platforms that support it, these timeouts can help handle unexpected situations on the Wire bus, such as another device or a short-circuit that keeps the bus blocked indefinitely, or noise that looks like a start condition, making it look like there is another master active that keeps the bus claimed.
Note that these timeouts are almost always an indication of an underlying problem, such as misbehaving devices, noise, insufficient shielding, or other electrical problems. These timeouts will prevent your sketch from locking up, but not solve these problems. In such situations there will often (also) be data corruption which doesn't result in a timeout or other error and remains undetected. So when a timeout happens, it is likely that some data previously read or written is also corrupted. Additional measures might be needed to more reliably detect such issues (e.g. checksums or reading back written values) and recover from them (e.g. full system reset). This timeout and such additional measures should be seen as a last line of defence; when possible, the underlying cause should be fixed instead.
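To make the "reading back written values" suggestion concrete, the following is a rough sketch, assuming a device that exposes byte-wide registers selected by first writing a register number (many devices do not work this way, and write-only devices cannot be verified like this at all). The helper writeRegisterVerified(), the address 0x20 used in the usage note and the FakeWire stand-in are hypothetical.

```cpp
#include <stdint.h>
#ifdef ARDUINO
#include <Wire.h>
#else
// Host-only stand-in (hypothetical): one device with 256 byte-wide registers.
#include <cassert>
struct FakeWire {
  uint8_t regs[256] = {};
  uint8_t buf[2] = {};
  uint8_t len = 0, ptr = 0, pending = 0;
  bool corruptWrites = false;  // simulates undetected corruption on the bus
  void beginTransmission(uint8_t) { len = 0; }
  void write(uint8_t b) { if (len < 2) buf[len++] = b; }
  uint8_t endTransmission() {
    if (len > 0) ptr = buf[0];
    if (len > 1) regs[ptr] = corruptWrites ? (uint8_t)(buf[1] ^ 0x01) : buf[1];
    return 0;
  }
  uint8_t requestFrom(uint8_t, uint8_t n) { pending = n; return n; }
  int read() { if (!pending) return -1; pending--; return regs[ptr]; }
};
static FakeWire Wire;
#endif

// Writes one register, then reads it back to catch silent corruption.
// Returns true only if the device reports the value that was written.
bool writeRegisterVerified(uint8_t addr, uint8_t reg, uint8_t value) {
  Wire.beginTransmission(addr);
  Wire.write(reg);
  Wire.write(value);
  if (Wire.endTransmission() != 0) return false;  // NACK, timeout, ...

  Wire.beginTransmission(addr);
  Wire.write(reg);  // select the same register again for the read
  if (Wire.endTransmission() != 0) return false;
  if (Wire.requestFrom(addr, (uint8_t)1) != 1) return false;
  return Wire.read() == value;
}
```

A call such as writeRegisterVerified(0x20, 0x01, 0xAB) returning false only says that something went wrong; as discussed in the comments, other devices on the bus may have been affected too, so recovery may need to go further than repeating the write.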
Syntax
Wire.setWireTimeout(timeout, reset_on_timeout)
Wire.setWireTimeout()
Parameters
timeout: timeout in microseconds, if zero then timeout checking is disabled
reset_on_timeout: if true then Wire hardware will be automatically reset on timeout
When this function is called without parameters, a default timeout is configured that should be sufficient to prevent lockups in a typical single-master configuration.
Returns
None.
Example Code
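A possible example, offered as a sketch and not a definitive implementation: it enables a 3000 microsecond timeout with hardware reset, sends one command byte and checks the result. The device address 8, the command value 123, the helper sendCommand() and the FakeWire stand-in are hypothetical. The value 5 for a timeout error is an assumption based on the AVR core and may differ on other platforms.

```cpp
#include <stdint.h>
#ifdef ARDUINO
#include <Wire.h>
#else
// Host-only stand-in (hypothetical, not part of Wire) for off-target checks.
#include <cassert>
#define WIRE_HAS_TIMEOUT 1
struct FakeWire {
  uint32_t timeoutUs = 0;
  bool resetOnTimeout = false;
  bool stuck = false;  // pretend the bus is blocked and the timeout fires
  void begin() {}
  void setWireTimeout(uint32_t t, bool r) { timeoutUs = t; resetOnTimeout = r; }
  void beginTransmission(uint8_t) {}
  void write(uint8_t) {}
  uint8_t endTransmission() { return stuck ? 5 : 0; }
};
static FakeWire Wire;
#endif

// Assumption: the AVR core reports a timeout from endTransmission() as 5.
const uint8_t kTimeoutError = 5;

void setup() {
  Wire.begin();  // join the bus as master
#if defined(WIRE_HAS_TIMEOUT)
  Wire.setWireTimeout(3000 /* us */, true /* reset_on_timeout */);
#endif
}

// Sends one command byte; returns the endTransmission() status (0 = success).
uint8_t sendCommand(uint8_t address, uint8_t command) {
  Wire.beginTransmission(address);
  Wire.write(command);
  return Wire.endTransmission();
}

void loop() {
  uint8_t error = sendCommand(8, 123);
  if (error == kTimeoutError) {
    // A timeout suggests a bus problem: earlier transfers may be corrupted
    // too, so consider re-initializing the attached devices here.
  }
}
```

The WIRE_HAS_TIMEOUT guard keeps the sketch compiling on cores that lack the timeout API; on those cores a blocked bus can still lock up the sketch.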
Notes and Warnings
How this timeout is implemented might vary between different platforms, but typically a timeout condition is triggered when waiting for (some part of) the transaction to complete (e.g. waiting for the bus to become available again, waiting for an ACK bit, or maybe waiting for the entire transaction to be completed).
When such a timeout condition occurs, the transaction is aborted and endTransmission() or requestFrom() will return an error code or zero bytes respectively. While this will not resolve the bus problem by itself (i.e. it does not remove a short-circuit), it will at least prevent blocking potentially indefinitely and allow your software to detect and maybe solve this condition.
If reset_on_timeout was set to true and the platform supports this, the Wire hardware is also reset, which can help to clear any incorrect state inside the Wire hardware module. For example, on the AVR platform, this can be required to restart communications after a noise-induced timeout.
When a timeout is triggered, a flag is set that can be queried with getWireTimeoutFlag() and must be cleared manually using clearWireTimeoutFlag() (and is also cleared when setWireTimeout() is called).
Note that this timeout can also trigger while waiting for clock stretching or waiting for a second master to complete its transaction. So make sure to adapt the timeout to accommodate those cases if needed. A typical timeout would be 25ms (which is the maximum clock stretching allowed by the SMBus protocol), but (much) shorter values will usually also work.
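The flag handling described in these notes could be sketched roughly as follows. The helper names enableSmbusTimeout() and pollWireTimeout(), the timeoutCount counter and the FakeWire stand-in are hypothetical; counting timeouts follows the suggestion in the comments to monitor how often they occur, at least during development.

```cpp
#include <stdint.h>
#ifdef ARDUINO
#include <Wire.h>
#else
// Host-only stand-in (hypothetical, not part of Wire) for off-target checks.
#include <cassert>
#define WIRE_HAS_TIMEOUT 1
struct FakeWire {
  uint32_t timeoutUs = 0;
  bool flag = false;
  // As in the notes above, setting the timeout also clears the flag.
  void setWireTimeout(uint32_t t, bool) { timeoutUs = t; flag = false; }
  bool getWireTimeoutFlag() { return flag; }
  void clearWireTimeoutFlag() { flag = false; }
};
static FakeWire Wire;
#endif

const uint32_t kSmbusTimeoutUs = 25000;  // 25 ms, the SMBus clock stretching limit
static uint32_t timeoutCount = 0;        // how many timeouts were seen so far

// Enables a 25 ms timeout with hardware reset, where the API is available.
void enableSmbusTimeout() {
#if defined(WIRE_HAS_TIMEOUT)
  Wire.setWireTimeout(kSmbusTimeoutUs, true);
#endif
}

// Returns true if a timeout happened since the last call. The flag is
// cleared so that the next timeout can be detected separately.
bool pollWireTimeout() {
#if defined(WIRE_HAS_TIMEOUT)
  if (Wire.getWireTimeoutFlag()) {
    Wire.clearWireTimeoutFlag();
    timeoutCount++;
    return true;
  }
#endif
  return false;
}
```

Calling pollWireTimeout() after each transaction and logging timeoutCount gives a rough idea of whether timeouts are rare events or frequent enough to point at a wiring or power problem.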
Portability Notes
This function was not available in the original version of the Wire library and might still not be available on all platforms. Code that needs to be portable across platforms and versions can use the WIRE_HAS_TIMEOUT macro, which is only defined when Wire.setWireTimeout(), Wire.getWireTimeoutFlag() and Wire.clearWireTimeoutFlag() are all available.
When this timeout feature was introduced on the AVR platform, it was initially kept disabled by default for compatibility, expecting it to become enabled at a later point. This means the default value of the timeout can vary between (versions of) platforms. The default timeout settings are available from the WIRE_DEFAULT_TIMEOUT and WIRE_DEFAULT_RESET_WITH_TIMEOUT macros.
If you require the timeout to be disabled, it is recommended you disable it explicitly using setWireTimeout(0), even though that is currently the default.
See Also
Wire.getWireTimeoutFlag()
Description
Checks whether a timeout has occurred since the last time the flag was cleared.
This flag is set whenever a timeout occurs and cleared when Wire.clearWireTimeoutFlag() is called, or when the timeout is changed using Wire.setWireTimeout().
Timeouts might not be enabled by default. See the documentation for Wire.setWireTimeout() for more information on how to configure timeouts and how they work.
Syntax
Wire.getWireTimeoutFlag()
Parameters
None.
Returns
bool: The current value of the flag
Portability Notes
This function was not available in the original version of the Wire library and might still not be available on all platforms. Code that needs to be portable across platforms and versions can use the WIRE_HAS_TIMEOUT macro, which is only defined when Wire.setWireTimeout(), Wire.getWireTimeoutFlag() and Wire.clearWireTimeoutFlag() are all available.
See Also
Wire.clearWireTimeoutFlag()
Description
Clear the timeout flag.
Timeouts might not be enabled by default. See the documentation for Wire.setWireTimeout() for more information on how to configure timeouts and how they work.
Syntax
Wire.clearWireTimeoutFlag()
Parameters
None.
Returns
None.
Portability Notes
This function was not available in the original version of the Wire library and might still not be available on all platforms. Code that needs to be portable across platforms and versions can use the WIRE_HAS_TIMEOUT macro, which is only defined when Wire.setWireTimeout(), Wire.getWireTimeoutFlag() and Wire.clearWireTimeoutFlag() are all available.
See Also