Document Wire timeout API on website #895

Open
matthijskooijman opened this issue Sep 26, 2020 · 30 comments

@matthijskooijman
Collaborator

matthijskooijman commented Sep 26, 2020

Recently, some new timeout API methods were added to the AVR Wire library (see arduino/ArduinoCore-avr#42), which should be documented. Given there is no repository for the library reference, I'm going to report this here. While looking at the Wire docs at https://www.arduino.cc/en/Reference/Wire I noticed that the end() method is also not documented yet.

Please find a proposal for documentation below, comments welcome. I've tried to match the formatting (heading levels etc.) to the existing doc pages, but it's likely that this still needs some handwork to integrate. Also, there's a fair chance that I've written this in too much detail or technically too complex for the novice audience, so feedback on that aspect is also welcome.

Wire

Just above the "Note" section, add:

Recent versions of the Wire library can use timeouts to prevent a lockup in the face of certain problems on the bus, but this is not enabled by default (yet) in current versions. It is recommended to always enable these timeouts when using the Wire library. See the Wire.setWireTimeout function for more details.

Wire.end()

Description

Disable the Wire library, reversing the effect of Wire.begin(). To use the Wire library again after this, call Wire.begin() again.

Syntax

Wire.end()

Parameters

None.

Returns

None.

Portability Notes

This function was not available in the original version of the Wire library and might still not be available on all platforms. Code that needs to be portable across platforms and versions can use the WIRE_HAS_END macro, which is only defined when Wire.end() is available.

Wire.endTransmission()

Under "Returns", add:

  • 5:timeout
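
A small helper can turn the return codes of Wire.endTransmission() into readable messages. This is only a minimal sketch: the function name describeWireError is hypothetical, codes 0 to 4 follow the existing Wire.endTransmission() reference page, and code 5 is the timeout value proposed above.

```cpp
#include <stdint.h>

// Hypothetical helper (not part of the Wire library): maps the value
// returned by Wire.endTransmission() to a short description.
const char *describeWireError(uint8_t code) {
  switch (code) {
    case 0: return "success";
    case 1: return "data too long to fit in transmit buffer";
    case 2: return "received NACK on transmit of address";
    case 3: return "received NACK on transmit of data";
    case 4: return "other error";
    case 5: return "timeout";  // only on platforms with timeout support
    default: return "unknown error code";
  }
}
```

In a sketch this could be used as Serial.println(describeWireError(Wire.endTransmission())).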

Wire.setWireTimeout()

Description

Sets the timeout for Wire transmissions in master mode.

On platforms that support it, these timeouts can help handle unexpected situations on the Wire bus, such as another device or a short-circuit that keeps the bus blocked indefinitely, or noise that looks like a start condition, making it look like there is another master active that keeps the bus claimed.

Note that these timeouts are almost always an indication of an underlying problem, such as misbehaving devices, noise, insufficient shielding, or other electrical problems. These timeouts will prevent your sketch from locking up, but not solve these problems. In such situations there will often (also) be data corruption which doesn't result in a timeout or other error and remains undetected. So when a timeout happens, it is likely that some data previously read or written is also corrupted. Additional measures might be needed to more reliably detect such issues (e.g. checksums or reading back written values) and recover from them (e.g. full system reset). This timeout and such additional measures should be seen as a last line of defence; when possible, the underlying cause should be fixed instead.

Syntax

Wire.setWireTimeout(timeout, reset_on_timeout)
Wire.setWireTimeout()

Parameters

timeout: timeout in microseconds; if zero, timeout checking is disabled
reset_on_timeout: if true then Wire hardware will be automatically reset on timeout

When this function is called without parameters, a default timeout is configured that should be sufficient to prevent lockups in a typical single-master configuration.

Returns

None.

Example Code

#include <Wire.h>

void setup() {
  Serial.begin(9600); // needed for the Serial.println() calls below
  Wire.begin(); // join i2c bus (address optional for master)
  #if defined(WIRE_HAS_TIMEOUT)
    Wire.setWireTimeout(3000 /* us */, true /* reset_on_timeout */);
  #endif
}

byte x = 0;

void loop() {
  /* First, send a command to the other device */
  Wire.beginTransmission(8); // transmit to device #8
  Wire.write(123);           // send command
  byte error = Wire.endTransmission(); // run transaction
  if (error) {
    Serial.println("Error occurred when writing");
    if (error == 5)
      Serial.println("It was a timeout");
  }

  delay(100);

  /* Then, read the result */
  #if defined(WIRE_HAS_TIMEOUT)
  Wire.clearWireTimeoutFlag();
  #endif
  byte len = Wire.requestFrom(8, 1); // request 1 byte from device #8
  if (len == 0) {
    Serial.println("Error occurred when reading");
    #if defined(WIRE_HAS_TIMEOUT)
    if (Wire.getWireTimeoutFlag())
      Serial.println("It was a timeout");
    #endif
  }

  delay(100);
}

Notes and Warnings

How this timeout is implemented might vary between different platforms, but typically a timeout condition is triggered when waiting for (some part of) the transaction to complete (e.g. waiting for the bus to become available again, waiting for an ACK bit, or maybe waiting for the entire transaction to be completed).

When such a timeout condition occurs, the transaction is aborted and endTransmission() or requestFrom() will return an error code or zero bytes respectively. While this will not resolve the bus problem by itself (i.e. it does not remove a short-circuit), it will at least prevent blocking potentially indefinitely and allow your software to detect and maybe solve this condition.

If reset_on_timeout was set to true and the platform supports this, the Wire hardware is also reset, which can help to clear any incorrect state inside the Wire hardware module. For example, on the AVR platform, this can be required to restart communications after a noise-induced timeout.

When a timeout is triggered, a flag is set that can be queried with getWireTimeoutFlag() and must be cleared manually using clearWireTimeoutFlag() (and is also cleared when setWireTimeout() is called).

Note that this timeout can also trigger while waiting for clock stretching or waiting for a second master to complete its transaction. So make sure to adapt the timeout to accommodate those cases if needed. A typical timeout would be 25ms (which is the maximum clock stretching allowed by the SMBus protocol), but (much) shorter values will usually also work.
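
Since the typical value above is given in milliseconds while setWireTimeout() expects microseconds, a small conversion helper can avoid unit mistakes. This is a minimal sketch; the names SMBUS_MAX_STRETCH_MS and wireTimeoutMicros are hypothetical and not part of the Wire library.

```cpp
#include <stdint.h>

// 25 ms: the maximum clock stretching allowed by SMBus, as noted above.
const uint32_t SMBUS_MAX_STRETCH_MS = 25;

// Hypothetical helper: converts milliseconds to the microseconds that
// Wire.setWireTimeout() expects as its first argument.
uint32_t wireTimeoutMicros(uint32_t milliseconds) {
  return milliseconds * 1000UL;
}

// On a platform that defines WIRE_HAS_TIMEOUT this could be used as:
//   Wire.setWireTimeout(wireTimeoutMicros(SMBUS_MAX_STRETCH_MS), true);
```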

Portability Notes

This function was not available in the original version of the Wire library and might still not be available on all platforms. Code that needs to be portable across platforms and versions can use the WIRE_HAS_TIMEOUT macro, which is only defined when Wire.setWireTimeout(), Wire.getWireTimeoutFlag() and Wire.clearWireTimeoutFlag() are all available.

When this timeout feature was introduced on the AVR platform, it was initially kept disabled by default for compatibility, expecting it to become enabled at a later point. This means the default value of the timeout can vary between (versions of) platforms. The default timeout settings are available from the WIRE_DEFAULT_TIMEOUT and WIRE_DEFAULT_RESET_WITH_TIMEOUT macros.

If you require the timeout to be disabled, it is recommended you disable it explicitly using setWireTimeout(0), even though that is currently the default.

See Also

  • Wire.getWireTimeoutFlag()
  • Wire.clearWireTimeoutFlag()
  • Wire.endTransmission()
  • Wire.requestFrom()

Wire.getWireTimeoutFlag()

Description

Checks whether a timeout has occurred since the last time the flag was cleared.

This flag is set whenever a timeout occurs and cleared when Wire.clearWireTimeoutFlag() is called, or when the timeout is changed using Wire.setWireTimeout().

Timeouts might not be enabled by default. See the documentation for Wire.setWireTimeout() for more information on how to configure timeouts and how they work.

Syntax

Wire.getWireTimeoutFlag()

Parameters

None.

Returns

bool: The current value of the flag

Portability Notes

This function was not available in the original version of the Wire library and might still not be available on all platforms. Code that needs to be portable across platforms and versions can use the WIRE_HAS_TIMEOUT macro, which is only defined when Wire.setWireTimeout(), Wire.getWireTimeoutFlag() and Wire.clearWireTimeoutFlag() are all available.

See Also

  • Wire.clearWireTimeoutFlag()
  • Wire.setWireTimeout()

Wire.clearWireTimeoutFlag()

Description

Clear the timeout flag.

Timeouts might not be enabled by default. See the documentation for Wire.setWireTimeout() for more information on how to configure timeouts and how they work.

Syntax

Wire.clearWireTimeoutFlag()

Parameters

None.

Returns

None.

Portability Notes

This function was not available in the original version of the Wire library and might still not be available on all platforms. Code that needs to be portable across platforms and versions can use the WIRE_HAS_TIMEOUT macro, which is only defined when Wire.setWireTimeout(), Wire.getWireTimeoutFlag() and Wire.clearWireTimeoutFlag() are all available.

See Also

  • Wire.getWireTimeoutFlag()
  • Wire.setWireTimeout()
@bperrybap

What about WIRE_HAS_END macro? It is used to indicate the existence of end()

Also, it seems like the macro WIRE_HAS_END should show up somewhere on the end() web page.
Likewise it seems like WIRE_HAS_TIMEOUT should show up somewhere on the pages for
setWireTimeout(), Wire.getWireTimeoutFlag(), Wire.clearWireTimeoutFlag() i.e. all the pages that are for functions that exist when the macro exists.

Perhaps add a section called "NOTES" or "OTHER INFORMATION" to the documentation web pages for the information about the macro.

@matthijskooijman
Collaborator Author

What about WIRE_HAS_END macro? It is used to indicate the existence of end()

Thanks, forgot about that one. I added it now.

Likewise it seems like WIRE_HAS_TIMEOUT should show up somewhere on the pages for

Good point. It was already on the setWireTimeout() page above, but I now added it to all three (and changed the wording on getWireTimeoutFlag() to be more complete).

Perhaps add a section called "NOTES" or "OTHER INFORMATION" to the documentation web pages for the information about the macro.

I added a "Portability Notes" section now, which seems more specific.

@bperrybap

Sounds great.

@freddyrios

freddyrios commented Oct 2, 2020

Got here after reading arduino/ArduinoCore-avr#42 and related issues/pr.

It would be great to also add a note about it directly to the Wire page too https://www.arduino.cc/en/Reference/Wire. Having it visible like that can potentially save a lot of pain.

Also should there be some extra warning or links about the concerns mentioned here (if confirmed to be valid)? arduino/ArduinoCore-avr#42 (comment). The claim seems to be that it is a good idea to take timeout cases as a warning sign of hardware issues in some cases, which if left alone can lead to other issues.

@bperrybap

@freddyrios,
The concerns are valid. If there are timeout errors and there are not multi masters on the bus, then there is some sort of h/w issue causing bit errors on the bus. The effects of those bit errors are unpredictable.
Not only that, but most bit errors on the bus can not be detected. The only bit errors that can be detected are those that happen to occur during the address, or the status portion of the transfer because they cause some sort of issue like trying to address a non existent slave, or confuse the master into thinking that there is another master on the bus.

But I do agree with you that it would be a good idea to have some sort of note/information about the potential causes of timeouts and seriousness of i2c bus signal corruption.

@matthijskooijman
Collaborator Author

It would be great to also add a note about it directly to the Wire page too https://www.arduino.cc/en/Reference/Wire. Having it visible like that can potentially save a lot of pain.

Good suggestion, I added a small paragraph to the first post.

Also should there be some extra warning or links about the concerns mentioned here (if confirmed to be valid)?

There was already this bit:

When such a timeout condition occurs, the transaction is aborted and endTransmission() or requestFrom() will return an error code or zero bytes respectively. While this will not resolve the bus problem by itself (i.e. it does not remove a short-circuit), it will at least prevent blocking potentially indefinitely and allow your software to detect and maybe solve this condition.

But, you make a good point, so I added a more specific warning to the main description.

How do these look?

@bperrybap

IMO, there needs to be a bit more about timeouts.
The language around timeouts does not seem strong enough. It does not mention some of the other possible things that could and often do happen and can go undetected when there are issues like misbehaving slaves and/or bus noise.
Things like data corruption or writing/reading to the wrong slave.
And that this data corruption can cause the need to re-initialize or restart the slave.
So getting a timeout likely means that a slave will have to be fully re-initialized since there is no way to know what data/commands the slave has received.

In the real-world cases I've seen, there is often quite a bit of data corruption before a lockup (now timeout) would happen.

The reason I think additional text is needed is that in the issue thread about adding the timeout, several of the posters seem to be under the incorrect assumption that just having a timeout in the Wire library and retries on top of that, either in a higher level library that uses Wire or in the sketch, can fix things. But that is definitely not the case.
For example, on i2c LCDs, you can end up with lots of garbage on the display from data corruption before the lockup/timeout.
So even if you did do a retry on an operation when there was a timeout, it would likely not keep the display from getting corrupted.
In some cases like a hd44780 LCD with a PCF8574 based backpack, the i2c data corruption can cause the host and the LCD to lose nibble sync from garbage commands. When out of nibble sync, the display will continue to be corrupted as the host sends more data/commands since it is being misinterpreted.
The only way to get back into nibble sync is to start the full initialization over.

@freddyrios

freddyrios commented Oct 5, 2020

The way it reads now for me is that there is a note in the Wire page that leads you to the method doc, which has this fairly high up in the text:

Note that such a timeout is almost always an indication of an underlying problem, such as misbehaving devices, bus noise or other electrical problems. Relying on this timeout for proper operation is not recommended, it is better to fix the underlying problem instead.

Sounds good to me, as very early it is talking about the actual problems underneath. Of course the more info to help us outsiders avoid the pitfalls, the better. Links to any relevant topics that point people in the right direction(s) to help solve the real root causes would be incredibly helpful (and almost certainly get people to give it a shot).

@bperrybap

My issue is with

Relying on this timeout for proper operation is not recommended

It hints that even though there may be timeouts, the system is still capable of functioning properly.
There will never be proper operation when timeouts are occurring due to signal corruption. When there is enough signal corruption to cause a timeout, there is plenty more that is also causing silent/undetected data corruption.

So, IMO, the text could/should be a bit stronger than just saying it is "not recommended".
It should indicate that the presence of timeouts very likely indicates the presence of other issues which cannot be detected and are silently occurring like data corruption.

@matthijskooijman
Collaborator Author

Relying on this timeout for proper operation is not recommended

The way I meant this is that if there are bus lockups or other problems, you should not just enable timeouts and expect it to make things run properly.

But we can make it stronger, how about this?

Note that such a timeout is almost always an indication of an underlying problem, such as misbehaving devices, bus noise or other electrical problems which. These timeouts will prevent your sketch from locking up, but not solve these problems. In addition to locking the bus, there might also be data corruption. To ensure reliable operation, whenever timeouts occur, make sure to find and fix the underlying problem, rather than assuming that these timeouts will fix those problems.

Of course the more info to help us outsiders avoid the pitfalls, the better. Links to any relevant topics that point people in the right direction(s) to help solve the real root causes would be incredibly helpful (and almost certainly get people to give it a shot).

I don't want to go into too much detail here, also since this page is about the Wire library in general, not necessarily specific to the AVR hardware. But external links could probably be added, if anyone knows of appropriate ones.

@bperrybap

I like the added detail. Here is an additional tweak:

Note that such a timeout is almost always an indication of an underlying problem, such as misbehaving devices, bus noise or other electrical problems which. These timeouts will prevent your sketch from locking up, but not solve these problems. In addition to locking the bus, there may also be undetectable data corruption or accesses to slave addresses other than the ones specified. To ensure reliable operation, whenever timeouts occur, make sure to find and fix the underlying problem, rather than assuming that these timeouts will fix those problems.

Not sure how much, if any, information to provide on how to identify/solve potential issues.
It is kind of a can of worms and it is a pretty complex subject getting into analog electrical issues.

Although it might be worth mentioning some of the more common issues, such as poor wiring/connections, or attempting to use "long" wires (I know, we would need to try to somehow describe what "long" means), since posts related to these issues do come up from time to time on the forum.

I think the most important thing is the information about timeouts that hopefully gets the message out that Wire library lockups or timeouts are an indication of other h/w issues that cannot be fully resolved in s/w in the sketch or the library.

@ermtl

ermtl commented Nov 21, 2020

Because I2C uses a strong low state and a weak high state with a pull-up resistor, it is very susceptible to all kinds of electrical noise that are not the sign of h/w issues. Position your device close to a dimmer-controlled brushed motor and see the error count increase, or just be in a thunderstorm. Shielding can be effective but has its limits.
A way to approach the problem is to consider I2C as a transmission protocol that might have errors and see what could be done about it. Designers of the OneWire protocol that's electrically similar with a pullup and a long line added CRC checks to all communications, but nearly all I2C circuits don't have that.
Fortunately, the protocol itself includes a 'start' condition that resets the state machine of any I2C compliant device, making each transaction independent (errors don't propagate) and a lot can be done in software, the exact details depend on the chip used, but here are a few hints:

  • When a timeout occurs, read the value again (if in an interrupt, using the previous value and increasing an error counter can be an alternative)
  • Have a min value, max value, default value, max change value (according to the expected range for the value being measured) and previous measurement value.
    If the sensor value fails the plausibility test, replace it with either the previous value, the default value or an average of the previous and default value. Have a counter for such occurrences. This can handle both I2C errors and sensor malfunctions.
  • Oversample and discard. Take 4 samples or more, clear the highest and lowest value(s) and take the average of the (2 or more) remaining values. This is resistant to high error rates (and to error bursts if more than one high/one low value is discarded) and mitigates the uncaught errors by averaging them with good ones (while decreasing measurement noise if data is from a sensor).
  • Count errors and go to a failsafe mode when too many happen (you need to decrease error count based on time, maybe 1 error every 10 seconds).
  • If the I2C is used for communication between Arduinos, add CRC at the end of each transmission.
  • if the I2C is used to store values in an external EEPROM, add CRC to the stored values to check data integrity. This can also detect EEPROM wear.
  • In high reliability applications, use redundant sensors / write data twice (or 3 times with majority vote)
  • To prevent long delays due to timeout, set the timeout value to the shortest possible before communicating with each chip. Some chips always respond fast, so any delay should trigger the timeout ASAP
  • Use repeated timeouts to warn the user about a disconnected / poorly connected sensor
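
Three of the hints above (plausibility test, oversample and discard, CRC) can be sketched in plain C++ that does not depend on Wire. This is only an illustration under assumptions: the function names are hypothetical, the trimmed mean discards exactly one highest and one lowest sample, and the CRC is the Dallas/Maxim CRC-8 used by OneWire devices, chosen here only as an example (both ends of a link would have to agree on it).

```cpp
#include <stddef.h>
#include <stdint.h>

// Plausibility test: the value must lie inside [minValue, maxValue] and
// must not differ from the previous accepted value by more than maxChange.
bool isPlausible(int32_t value, int32_t previous, int32_t minValue,
                 int32_t maxValue, int32_t maxChange) {
  if (value < minValue || value > maxValue) return false;
  int64_t change = static_cast<int64_t>(value) - previous;
  if (change < 0) change = -change;
  return change <= maxChange;
}

// Oversample and discard: drops one highest and one lowest sample and
// averages the rest. Returns false when fewer than 3 samples are given.
bool trimmedMean(const int32_t *samples, size_t count, int32_t *result) {
  if (count < 3) return false;
  int64_t sum = 0;
  int32_t lowest = samples[0];
  int32_t highest = samples[0];
  for (size_t i = 0; i < count; i++) {
    sum += samples[i];
    if (samples[i] < lowest) lowest = samples[i];
    if (samples[i] > highest) highest = samples[i];
  }
  sum -= lowest;
  sum -= highest;
  // Cast the divisor so the division stays signed.
  *result = static_cast<int32_t>(sum / static_cast<int64_t>(count - 2));
  return true;
}

// Dallas/Maxim CRC-8 (polynomial x^8 + x^5 + x^4 + 1, bit-reflected,
// initial value 0). The sender appends the result to the message; a
// receiver that runs the CRC over message plus CRC byte gets 0 if no
// error was detected.
uint8_t crc8Maxim(const uint8_t *data, size_t len) {
  uint8_t crc = 0;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (uint8_t bit = 0; bit < 8; bit++) {
      if (crc & 0x01) {
        crc = static_cast<uint8_t>((crc >> 1) ^ 0x8C);
      } else {
        crc = static_cast<uint8_t>(crc >> 1);
      }
    }
  }
  return crc;
}
```

A value that fails isPlausible() would be replaced by the previous or default value and counted, as described in the list. None of this detects every error; it only lowers the chance that a corrupted value is used.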

As a side note, the problem with hd44780 LCD with a PCF8574 backpack is tricky and very specific. It occurs because this display is 'write only' and there is no way to read from it and know if the nibbles are in the correct order or inverted causing display corruption. This is a very rare case as all other devices I can think of have some ability to be read that would allow such a problem to be detected. From a reliability point of view, the hd44780 LCD + PCF8574 combination is not a good solution. If it needs to be used, a periodic display reset sequence (the only way to reorder the nibbles) should be implemented, the delay being a compromise between how long a garbled display can be tolerated and the short visible glitch that's visible each time the display is reset.

@matthijskooijman
Collaborator Author

@ermtl, most of what you write seems like good advice, though I'm not entirely sure what you intend with this? I'm not sure if this much detail is appropriate for this reference documentation page, or did you have something else in mind?

Concerning timeouts, two additional remarks:

  • In normal communication, timeouts are not really needed, since a chip will respond directly and if it does not (e.g. because it is disconnected), you'll know directly because no ACK is received. The only exception is when the slave does clock stretching, in which case it could actively (and potentially indefinitely) keep the master waiting, so the implemented timeout is useful in that case.
  • The other main case that these new timeouts guard against, is somewhat AVR-specific, where the AVR Wire hardware locks up because it thinks there is a secondary master on the bus (typically due to noise that looks like a start condition). It's very likely that other I²C hardware implements this more elegantly, with more feedback.

@ermtl

ermtl commented Nov 21, 2020

@matthijskooijman what I mean is that a major cause for I2C errors is unavoidable random electrical noise. Given enough time, these will occur regardless of how well a device is designed. The easiest way to get a taste of it is to listen to an AM radio background noise between stations. Most of it is mild, but you'll occasionally hear some very strong ones related to all sorts of electrical noise sources occurring in the neighbourhood. The weak high state on SCL and SDA makes them a lot more susceptible to such interference than regular digital lines that are driven in both states. Short connections, use of shielding, etc. will reduce the problem, but never completely eliminate it.

My goal here was to counter the idea that timeouts would automatically mean a design flaw that needs to be corrected, timeout merely being a band aid. In any transmission protocol such as I2C, Rs485, Ethernet or even IP networks, occasional errors and timeouts are an annoying but normal occurrence that's not almost always an indication of an underlying problem. The whole uncertainty about it does not sit well with the deterministic nature of digital electronics, but that's the world we live in ...

Here is how I would change the paragraph:
"Note that while such a timeout can be a rare but normal occurrence caused by random outside electrical noise beyond our control that does not indicate a design flaw, frequent, repeated timeouts are almost always an indication of an underlying problem, such as misbehaving devices, bus noise or other electrical problems. These timeouts will prevent your sketch from locking up, but not solve these problems. In addition to locking the bus, there might also be data corruption. To ensure reliable operation, transmissions that ended in a timeout should be retried, the timeout error count should be monitored both during application debugging and as an alert for the end user and data values from I2C devices should be checked for plausibility or, whenever possible, CRC checks should be added. Whenever several timeouts occur in a short period of time (application dependant), make sure to find and fix any underlying problem, rather than assuming that these timeouts will fix those problems."

@matthijskooijman
Collaborator Author

My goal here was to counter the idea that timeouts would automatically mean a design flaw that needs to be corrected, timeout merely being a band aid.

This was already discussed above and some notice was already added, but I just noticed that there was a half-finished sentence in there, and reading your proposal, I like it, so I added it verbatim. Your suggestion adds a bit more detail in a nice and concise way, thanks!

@bperrybap

Most of this discussion was using the words "h/w issue".
When I used those words, it did not necessarily mean that there is a design issue, I simply meant that there is an issue in the h/w that the Wire library cannot work around.
That issue could be an actual h/w design issue, poor wiring, or it could be some sort of induced electrical noise.
Often it was poor wiring on the part of the user that was a big contributor.

@ermtl
I'm not in favor of this update - as it is written.
IMO, it is lots of additional information that is both misleading and incomplete.
I think in its attempt to provide additional information it has the potential to create some confusion and offers incomplete or even incorrect information about how to recover from these types of issues.
I think in this case less can be more.

For example:
Note that while such a timeout can be a rare but normal occurrence caused by random outside electrical noise beyond our control that does not indicate a design flaw

I definitely don't like this statement, I believe it is very misleading in that it seems to indicate that these types of h/w errors are rare but are "normal".
i.e. it leads the sentence with language that mentions a single type of issue (electrical noise) being rare but then goes on to talk about other issues so the reader can easily confuse the overall issue of electrical issues as being rare but normal.
In some situations they can happen often (like when users use crappy wiring or super long wires) or even on every single transfer in the case of misbehaving slaves and are not caused by electrical noise.

Also, how to fully recover from a timeout issue, can vary substantially and is not as simple as doing retries on operations that received a timeout error.
I think it is a can of worms trying to add information about how to recover from the issue since in many cases just doing a retry can not resolve the h/w to the desired state, particularly since it is possible to have corrupted a slave that is different than the one that was being addressed when the timeout occurred.

So, IMO, this type of language is not very helpful:

To ensure reliable operation, transmissions that ended in a timeout should be retried, the timeout error count should be monitored both during application debugging and as an alert for the end user and data values from I2C devices should be checked for plausibility or, whenever possible, CRC checks should be added. Whenever several timeouts occur in a short period of time (application dependent), make sure to find and fix any underlying problem, rather than assuming that these timeouts will fix those problems.

Doing a retry after a timeout will not ensure reliable operation.

The reason being that when timeouts are occurring, particularly during a write to a slave, there is no way to know what really transferred on the bus to the slave and to which slave it really went to.
Like I have said several times, when I have seen real world signal corruption which would create these timeouts, it is almost always preceded by lots of undetected data corruption from prior writes that were successfully written, not from the write that got the timeout.
Think about it, bits during the data stream are being flipped or at least misinterpreted by the slave(s).
Depending on where the bit corruption occurs, it could corrupt an address, data, or end bits.
The timeout error is an indication of a bit error during the start / end / stop phase of the transfer.
When there is corruption happening, it is much more likely to occur somewhere else and not be detected.
So because of the silent errors (corrupted data being successfully written to slaves) occurring prior to the timeout, you have no way of knowing what the state of your slave(s) is.
At a minimum some amount of incorrect data has been written to slave.
In the case of multiple slaves with adjacent addresses, data or corrupted data can be written to the wrong slave.

So, IMO, the most important thing is to explain what the timeouts mean, along with explaining to users that these timeouts are a sign of a h/w issue and that just doing retries is very unlikely to be enough to keep things going or make the transmission reliable.
i.e. users need to understand that if they are seeing these timeouts, that there can be data corruption occurring that is not being detected, and so the slaves can be pushed into unknown / undesired states and all the slaves, not just the one that received the timeout, will likely require re-initialization to ensure reliable operation.

@ermtl

ermtl commented Nov 21, 2020

@bperrybap In most cases where a timeout occurs, it's because the communication glitch prevented it from completing. In such cases, the peripheral won't get the stop condition where it expects it and nothing is written to the peripheral, either the one that was intended or another. While a scenario where the wrong peripheral gets the data, misses the problem, stores the data while the master gets locked is possible, it would be very rare. What could happen more often is a case where SDA is affected, the data is corrupted, but the transaction completes without timeout and there is no way to know. That's an I2C limitation, it's beyond the Wire library's control.

When glitches are the result of electrical noise, retrying the failed read from a sensor will in practice give a valid result. However, your suggestion that, in case of a timeout, a safe practice would be to reinitialize all the settings for all the connected I2C peripherals makes a lot of sense. Also, as a high number of timeouts could indicate a very noisy environment that could have stability implications for the device beyond its I2C peripherals, in some applications, too many timeouts could be treated like a watchdog error and trigger a software reset.

Since there is no CRC check in I2C, a safer (paranoid ?) option would be to read back any data written to an I2C peripheral, the odds of an undetected error at the same place on both the write and the subsequent read being incredibly low, and even lower if the read is made twice.
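A minimal sketch of the read-back idea above, assuming a device whose registers can actually be read back (not every I2C device allows this). `writeVerified`, the `Bus` template parameter and its `writeReg`/`readReg` members are hypothetical names used for illustration, not part of the Wire library; on an Arduino they would wrap the usual `Wire.beginTransmission()` / `Wire.write()` / `Wire.endTransmission()` and `Wire.requestFrom()` / `Wire.read()` calls.

```cpp
#include <stdint.h>

// Hypothetical helper: write a register, read it back, and compare.
// `Bus` is any type providing
//   bool writeReg(uint8_t addr, uint8_t reg, uint8_t value);
//   bool readReg(uint8_t addr, uint8_t reg, uint8_t &value);
// where a false return means the transfer itself reported an error.
template <typename Bus>
bool writeVerified(Bus &bus, uint8_t addr, uint8_t reg, uint8_t value,
                   uint8_t maxAttempts = 3) {
  for (uint8_t attempt = 0; attempt < maxAttempts; ++attempt) {
    uint8_t readBack = 0;
    if (bus.writeReg(addr, reg, value) &&
        bus.readReg(addr, reg, readBack) &&
        readBack == value) {
      return true;  // the device reports the value we intended to write
    }
  }
  return false;  // still mismatching: the caller should escalate
}
```

A `false` return only says that this one register could not be confirmed; it says nothing about the state of other registers or other devices on the bus.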

Probably the "To ensure reliable operation, " sentence should be replaced with "To ensure a more reliable operation, " as no workaround will ensure 100% reliability.

Also, I mentioned that the timeout error count should be monitored (at least during development) and that's a very important part as it's how you separate electrical glitches from design errors. If a developer sees a glitch every few days, after carefully reviewing possible causes in the design, chances are it's random electrical disturbance. However, if it happens several times a minute without a clear reason (huge unshielded contactor, brushed motor, fluo tube ballast, etc ...) there is clearly something wrong. It's impossible to say everything in a single paragraph. The old wording of the paragraph made it seem like the timeout was a band-aid for sloppy design. All I wanted to add is that such a timeout can also occur in a well designed device, so the developer won't get in a deadlock, searching for a mistake that is nowhere to be found when mitigation techniques offer a way to solve the problem, and to give a hint of what such mitigation techniques could be like.
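As a rough illustration of monitoring the timeout count, the sketch below keeps a running total and a per-window count. `TimeoutMonitor` is a hypothetical helper, not part of the Wire library, and the window length and limit passed to it are arbitrary assumptions that would need tuning for a real application.

```cpp
#include <stdint.h>

// Hypothetical helper that counts Wire timeouts over time.
class TimeoutMonitor {
 public:
  TimeoutMonitor(uint32_t windowMs, uint16_t maxPerWindow)
      : windowMs_(windowMs), maxPerWindow_(maxPerWindow) {}

  // Call whenever a transfer reported a timeout. On an Arduino, nowMs
  // would be millis(). Returns true once the number of timeouts in the
  // current window exceeds the configured limit.
  bool recordTimeout(uint32_t nowMs) {
    // Unsigned subtraction keeps working across a millis() rollover.
    if (nowMs - windowStartMs_ >= windowMs_) {
      windowStartMs_ = nowMs;
      countInWindow_ = 0;
    }
    ++total_;
    ++countInWindow_;
    return countInWindow_ > maxPerWindow_;
  }

  // Total timeouts seen since start, useful to log during development.
  uint32_t total() const { return total_; }

 private:
  uint32_t windowMs_;
  uint16_t maxPerWindow_;
  uint32_t windowStartMs_ = 0;
  uint16_t countInWindow_ = 0;
  uint32_t total_ = 0;
};
```

A `true` return could feed whatever escalation the application uses, for example the software reset mentioned earlier, while `total()` gives the long-term count that helps tell a rare glitch from a systematic problem.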

@bperrybap

@ermtl
In most cases where a timeout occurs, it's because the communication glitch prevented it from completing.
I would say it is beyond most, at least on writes, it is pretty much all because the signal error/corruption occurred during stop/start phase of the transaction.
However, like I keep saying over and over again, and will keep saying, bit errors on the signals can occur anywhere, not just at the tail end of the message. I have observed this in real world testing.
Since there is no CRC check in I2C, a safer (paranoid ?) option would be to read back any data written to an I2C peripheral, the odds of an undetected error at the same place on both the write and the subsequent read being incredibly low, and even lower if the read is made twice.

It isn't that simple. Some devices are not readable, some are state driven, some use a combination of writes and reads to control which register you are accessing.
The old wording of the paragraph made it seem like the timeout was a band aid for sloppy design, all I wanted to add is that such timeout can also occur in a well designed device so the developer won't get in a deadlock, searching for a mistake that is nowhere to be found when mitigation techniques offer a way to solve the problem
For the most part it was/is a band-aid. While I agree that the Wire code shouldn't lock up and the timeouts prevent this, from my observations in the forums over the years what mainly drove the fixing of the timeout was for the most part bad h/w. In some cases it was actual bad h/w design in the slave or on the Arduino board, but in many if not most cases, it was from poor wiring such as users using wires that are WAY too long, i.e. feet long.

My biggest concern is that I don't want to give users the impression that they can work around these timeout issues by simply doing something like retries. It isn't that simple.

When the signal corruption happens during the address transfer, the address is corrupted and will address the incorrect slave. In most cases a user may have very few slaves or even a single slave so the corrupted address will be addressing a non existent slave and Wire will return an error since no slave will respond. However, if you have multiple slaves, the corrupted address can match a different slave. I have also seen this happen in real world testing with many i2c lcds hooked up. I get error code 2 or 4, but usually what happens is that a nibble transfer is lost to one slave or an extra one sent to another slave and now both LCDs are hopelessly out of nibble sync and will need re-initialization.
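To make the address-corruption case concrete, the sketch below counts how many single-bit flips of a 7-bit address would land on another device that is present on the bus. `collidingBitFlips` is a hypothetical helper written for this illustration, and it models only single-bit errors, which is a simplifying assumption.

```cpp
#include <stddef.h>
#include <stdint.h>

// Count how many single-bit corruptions of a 7-bit I2C address would
// match another device that is present on the bus.
inline int collidingBitFlips(uint8_t target, const uint8_t *present,
                             size_t n) {
  int collisions = 0;
  for (int bit = 0; bit < 7; ++bit) {
    uint8_t corrupted = (target ^ (1u << bit)) & 0x7F;
    for (size_t i = 0; i < n; ++i) {
      if (present[i] == corrupted) {
        ++collisions;  // this flipped address belongs to a real device
        break;
      }
    }
  }
  return collisions;
}
```

With a single device on the bus the result is 0, so every corrupted address goes unanswered. With eight expanders occupying 0x20 to 0x27 (the address range of the PCF8574), three of the seven possible flips of 0x27 reach another device instead of producing an error.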

When the signal corruption happens during the actual data transfer during writes, it can be undetectable and corrupted data is written to the slave. I have seen this happen in real world test and when the signals are marginal and there is corruption, it happens WAY more often than these new timeout errors or some of the other Wire error codes.

The point is if you are getting these new timeout errors and there are not multiple masters, then signal corruption is happening, and more than likely there is other data corruption happening that may or may not be detected and cannot ever be detected.

In the case of a hd44780 LCD using an i/o expander like the PCF8574, if you get data errors or a timeout when writing to the expander, it is more often than not non recoverable since the host and the LCD will lose nibble sync.
You can't just send the failed transfer again, since prior to this error the host likely has sent a few corrupted bytes to the expander that went undetected.
For that device, you pretty much have to fully re-initialize the LCD which means you lose everything that is currently on the display.

This is why I say, at least on writes, when one of these timeout errors occurs on writes, the user must be aware that more than likely there has been some sort of silent data corruption on previous "successful" writes so there is absolutely no way of knowing the state of all the slaves, so the only thing the user can do to ensure 100% proper operation is to re-initialize everything on all the slaves.
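One possible shape for that recovery policy, as a sketch only. `writeByteOrReinit` and the `reinitAll` callback are hypothetical names; `Bus` is assumed to offer the same calls as the AVR `Wire` object (`beginTransmission`, `write`, `endTransmission`, `getWireTimeoutFlag`, `clearWireTimeoutFlag`), and 5 is assumed to be the value `endTransmission()` returns on a timeout, as listed in the Wire reference.

```cpp
#include <stdint.h>

// Hypothetical helper: write one byte and, if the transfer timed out,
// re-initialize every device on the bus rather than just retrying.
// `reinitAll` is a sketch-provided callable that restores all devices
// to a known state.
template <typename Bus, typename ReinitAll>
bool writeByteOrReinit(Bus &bus, uint8_t addr, uint8_t value,
                       ReinitAll reinitAll) {
  bus.beginTransmission(addr);
  bus.write(value);
  uint8_t err = bus.endTransmission();

  // Assumption: 5 is the timeout result of endTransmission().
  bool timedOut = (err == 5) || bus.getWireTimeoutFlag();
  if (timedOut) {
    bus.clearWireTimeoutFlag();
    reinitAll();  // earlier "successful" writes may have been corrupted too
    return false;
  }
  return err == 0;  // other errors are left for the caller to handle
}
```

On an AVR board this could be called as `writeByteOrReinit(Wire, address, value, reinitAllDevices)`, where `reinitAllDevices` is a function the sketch would have to supply. Whether other error codes should also trigger a full re-initialization is a policy choice this sketch leaves to the caller.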

@ermtl

ermtl commented Nov 28, 2020

Bad design is a preventable cause of I2C transmission errors and timeouts, but it's only one of the possible causes. The other major cause is electromagnetic interference beyond the control of the designer, and the problem is so pervasive that there are mitigation norms (European EMC directives) that regulate what devices can emit and how resistant they must be to electromagnetic interference depending on the intended use and environment.

There is a good overview document made by Texas Instruments called Understanding and Eliminating EMI in Microcontroller Applications and another called EMC design guide for STM8, STM32 and Legacy MCUs by ST that gives hints about building more resistant circuits. It won't eliminate the problem, but make the devices resistant enough to pass EMI tests and regulations.

However, the goal of this page is simply to document the Wire library, not give a complete guide about good design and electromagnetic interference. As such, it needs to give the designers leads about the possible causes and remedies for timeouts. The goal for the designer should be to reduce those timeouts as much as possible, while knowing they can never be completely eliminated (even more so if multiple devices are deployed in uncontrolled environments). Considering the possibility that data to/from an I2C (or any other bus connected) device might occasionally be wrong, with or without timeouts, and hardening the software to detect and recover from it (design dependent, from a simple retry read to a complete reset) will make it overall more robust and also help solve many issues such as missing, disconnected or damaged devices.

Digital electronics is a largely successful attempt to insulate electronics from the analog chaos so that we can reason and process data with some level of certainty, however, given a chance (such as when a digital level is weakly enforced by a resistor instead of being low impedance driven as is the case on both signals with I2C), the analog world creeps in and somewhat 'interesting' stuff happens reminding us we still live in an analog world and we need to deal with it...

@bperrybap

@ermtl I'm aware of the real world noise issues in analog environments. I started a company that designed and shipped over 90% of the world's DSL modems in the late 90s/early 2000s. Literally 100+ million devices to pretty much every telco around the world. In fact we designed the tests for the FCC for their certifications since, at the time, this type of device was not covered by any existing tests. With DSL, signal levels, noise, and cross talk are much more complex and a much bigger issue in that environment since you can have hundreds of subscriber wires all terminating together with very low signal levels. The signaling of DSL is so complex it uses DSPs to process the signals.
Given our deployment footprint was literally the entire planet, we saw all kinds of crazy stuff.
In many cases we got blamed for functional or performance issues that were due to wiring issues either along the way or at the Central office or in some cases due to external influences including radio, tv stations and even noise from nearby cable tv wiring or power lines.

Where we disagree on this i2c stuff (it seems pretty heavily) is the effects of i2c signal issues and how to recover from them.
Like I keep saying over and over again, and will keep saying, bit errors on the signals can occur anywhere, not just at the tail end of the message.
It is only in a very tiny section of the i2c message protocol that any sort of signal corruption can be detected.
Most often it is detected by the signal being corrupted when the I2C address is being transferred which changes it to an address that maps to a non existent slave.
It is an even smaller timing window that will generate one of these new timeout errors.
Most i2c signal corruption will silently corrupt the data.

There are just so many cases where there is no recovery other than to start over and re-initialize all the slaves, not just the one that received an error/exception.
Sure you can try to do simplistic things like try to do a retry or even attempt to read back what was written. But for many environments that won't work for a variety of reasons.

I disagree with the text you proposed as I believe it is misleading and in some cases factually incorrect.
I'll repeat what I said before:

IMO, the most important thing is to explain what the timeouts mean, along with explaining to users that these timeouts are a sign of a h/w issue and that there is very likely data corruption occurring including sending data to the wrong slave so just doing retries when getting an error is very unlikely to be enough to keep things going or make the transmission reliable.
i.e. users need to understand that if they are seeing these timeouts, that there can be data corruption occurring that is not being detected, and so the slaves can be pushed into unknown / undesired states and all the slaves, not just the one that received the timeout, will likely require re-initialization to ensure reliable operation.

I would be very light on what the user could / should do to mitigate this as the most important thing is to convey the information that when seeing these types of timeouts, all sorts of data corruption is likely to be happening.

@matthijskooijman
Collaborator Author

Re-reading everything again, I think I agree that the proposal from @ermtl might be presenting noise and similar problems a bit too much as "fact of life" (sure, noise is not always under control of the designer, but one can always argue that if noise affects data, then the design has too little noise immunity for the environment it is in, so that's always a design problem).

Anyway, I've rewritten the paragraph again, I now wrote:

Note that these timeouts are almost always an indication of an underlying problem, such as misbehaving devices, noise, insufficient shielding, or other electrical problems. These timeouts will prevent your sketch from locking up, but not solve these problems. In such situations there will often be data corruption which only rarely results in a timeout or other error and remain undetected in other cases, so when a timeout happens, it is likely that some data previously read or written is also corrupted. Additional measures might be needed to more reliably detect such issues (e.g. checksums or reading back written values) and recover from them (e.g. full system reset). This timeout and such additional measures should be seen as a last line of defence, when possible the underlying cause should be fixed instead.

I think this now makes more clear that:

  • timeouts are not normally acceptable and a sign of underlying issues
  • when timeouts happen, there is probably a lot more corruption of data too, so "handling" a timeout needs more work
  • such data corruption often happens without a timeout, so there might be additional measures needed to detect such corruption
  • "handling" a timeout is a stop-gap, you should really be fixing the underlying issue

@ermtl, @bperrybap What do you think of this version?

@bperrybap

I like it.
I would put a period instead of a comma between "in other cases, so when"
and change the "is" to "was" here: "likely that some data previously read or written is also corrupted."

It might be useful/helpful to mention that it is possible that data and/or state corruption can happen to any slave on the bus not just the one that was being addressed or that has just gotten the error.
i.e. a write to slave A can end up causing corruption on slave B.

@matthijskooijman
Collaborator Author

matthijskooijman commented Dec 11, 2020

Thanks, made the changes you suggested.

It might be useful/helpful to mention that it is possible that data and/or state corruption can happen to any slave on the bus not just the one that was being addressed or that has just gotten the error.
i.e. a write to slave A can end up causing corruption on slave B.

Good point. Even more, if timeouts occur, there's likely noise on the bus that affects all slaves, so even without corruption on the address bits, all slaves can have seen corrupt data.

I added "(also to other slaves)" to:

So when a timeout happens, it is likely that some data previously read or written (also to other slaves) was also corrupted.

How's that?

@ermtl

ermtl commented Dec 11, 2020

I obviously don't like this new version as it once again puts the blame on bad design and considers electrical glitches as a non issue.

@bperrybap wrote: "bit errors on the signals can occur anywhere, not just at the tail end of the message." But I never claimed otherwise and that is a misrepresentation of what I wrote. Any glitch occurring on SCL will result in either missed or additional clock pulses as seen by one or more slaves relative to what the master sent. The master and slave will thus get out of sync and this loss of synchronicity will show when the stop transaction fails, resulting in a timeout no matter where the glitch was in the transmission, not just at the precise time of the stop condition.
While cases where SDA gets a glitch and not SCL are possible, and they will result in data corruption without timeout, in most cases where a severe glitch occurs (and that's the ones even careful design can't avoid) SCL will be affected and the transmission will timeout. The idea that timeouts only capture a small portion of problems is not exact.

Another way to see the problem is from the user(programmer) perspective, since the goal here is user documentation.
In the "if there is a timeout, your design is crappy" approach, the user is not compelled to implement any mitigation. He will enhance his design until he does not see timeouts in the lab and declare it good, only to see it fail in the field. He might also test extensively in the field and no matter how careful or even paranoid, given enough time and enough devices, there will still be timeouts with no practical remedy.

In the "If there are many timeouts, your design is crappy, but if it's only a rare occurrence it could be unavoidable glitches and you need to monitor it and add remediation in software" approach, the user is still urged to improve the design, but when rare glitches occur in the field, the device will be able to handle them. I think it's way more constructive and will result in more resilient designs.

@bperrybap

@ermtl, I'm not seeing what you are saying is in the new/updated text from @matthijskooijman

once again puts the blame on bad design

I just don't see that in the text. It says that there could be various types of issues.
It never states that they are due to bad design.
Even the words "insufficient shielding" don't indicate a bad design, just insufficient for the environment.

Perhaps it should mention that these timeouts or other issues are caused by something outside of any of the s/w in the Wire library or the sketch. To make it clear that the problem is not caused by and cannot be eliminated in s/w.

Perhaps it could mention voltage/power supply issues as I've also seen that show up quite a bit in various forums where a heater element goes on and the voltage sags or creates massive undershoot/overshoot which can cause all sorts of signal integrity issues.


Also, I missed this when I read it at first but I think this part could use a tweak/correction:

In such situations there will often be data corruption which only rarely results in a timeout or other error and remain undetected in other cases, so when a timeout happens...."

What I see in real world heavy write testing in certain environments is that 100% of the time I get a Wire timeout error, there was undetected data corruption occurring prior to the timeout. In some cases going to the wrong slave.
So I would change the above to something perhaps like this:

In such situations there will often be data corruption which doesn't result in a timeout or other error and remains undetected. So when a timeout happens...."

Again, IMO, I think the most important thing to try to get across to people is to understand that these errors / timeout issues mean that there are i2c signal integrity issues.
And that if the sketch or library is getting timeout or other i2c issues, that there is some sort of issue creating i2c signal integrity issues caused by something outside of the sketch or library s/w.

Once there are i2c signal integrity issues, lots of odd/bad things can happen.
For example, it is very likely that silent (undetected) data corruption is also happening prior to the error reported by the Wire library (at least on writes). Further, in the case of an SDA signal issue, the master can end up addressing and communicating with a different slave than the slave that was intended. i.e. An attempt to write to i2c slave A can end up writing to slave B.

The reason to point all this out to people is to try to get them to understand that after getting these types of i2c errors, you can not just do something simple like retry the operation to always get things working again.
Depending on how much data corruption has occurred, it may involve having to reinitialize all the slaves on the bus, including those that have not received any errors.

I can tell you from reading many threads on the Arduino forum, and a long standing github issue about adding timeouts to the Wire library, that a large percentage of the people believe that all i2c errors/issues can be detected and that doing a retry of the operation after getting an i2c timeout can resolve the issue.

But if they really understood what could be happening to the i2c bus and to their slaves when these timeouts are happening, then they should understand that recovery is not as simple as doing a retry on a failed operation.

@matthijskooijman
Collaborator Author

@ermtl, you write:

While cases where SDA gets a glitch and not SCL are possible, and they will result in data corruption without timeout, in most cases where a severe glitch occurs (and that's the ones even careful design can't avoid) SCL will be affected and the transmission will timeout.

I don't think that a glitch on SCL necessarily results in a timeout. In most cases, the data will be shifted by one bit, resulting in invalid data, or a NACK instead of an ACK, but all of that is handled without a timeout. The only cause I've seen that causes a timeout is when noise results in either an arbitration error (noise on SDA during transmission of 1 bits) or an errant start condition (which needs noise on both channels during idle, I think). In both cases, the hardware assumes that another master is present and has an active transaction, so the hardware stalls indefinitely until the "other master" releases the bus (i.e. never).
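The arbitration case can be illustrated with a small model of the open-drain SDA line, where any device (or noise) pulling the line low wins. This is a simplified sketch, not how the TWI hardware is implemented, and `firstArbitrationLoss` is a hypothetical name: it reports the first bit at which a master that sent a 1 would read back a 0.

```cpp
#include <stdint.h>

// Returns the bit position (7 = MSB, the first bit on the wire) at which
// a master sending `sent` would detect arbitration loss if something
// else pulls SDA low during the bits set in `pulledLowMask`.
// Returns -1 if no arbitration loss would be detected.
inline int firstArbitrationLoss(uint8_t sent, uint8_t pulledLowMask) {
  for (int bit = 7; bit >= 0; --bit) {
    bool released = (sent >> bit) & 1;            // 1 = master releases SDA
    bool pulledLow = (pulledLowMask >> bit) & 1;  // noise or another master
    bool onWire = released && !pulledLow;         // open drain: any low wins
    if (released && !onWire) {
      return bit;  // sent a 1 but read back a 0
    }
  }
  return -1;
}
```

In this model a disturbance that pulls SDA low while the master is already sending a 0 goes unnoticed, which is why only noise during 1 bits shows up as an arbitration error.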

The idea that timeouts only capture a small portion of problems is not exact.

So in my experience, that idea is accurate: timeouts really do capture only a small portion of the problems.

Another way to see the problem is from the user(programmer) perspective, since the goal here is user documentation.
In the "if there is a timeout, your design is crappy" approach, the user is not compelled to implement any mitigation. he will enhance his design until he does not see timeouts (...)

But enhancing the design is mitigation? I have the impression that you might have a different or more narrow definition of "design" (and "bad design") in mind than @bperrybap and myself. For me, "design" involves the entire system: Circuit, PCB layout, cabling, shielding, chassis, power supply. And IMHO any noise that results in bitflips means that your design is not sufficient to deal with the particular environment that the system is operating in at that moment.

@bperrybap As for your last suggestion, I included it. I already didn't like that sentence all that much, I think your suggestion is a bit nicer indeed.

@Aleev2007

You are all wrong. The instructions should only contain what YOU know about the event. Not your guesses.

Data was not received / sent within the allotted time.

@doppelhub

Ran into this issue... super fun to troubleshoot.
Just wanted to comment that if you want to prevent this hang, you can add "setWireTimeout" immediately following Wire.begin():
Wire.begin(); Wire.setWireTimeout(1000, true); //timeout after 1000 microseconds

@kengdahl
Member

@per1234 I have made an internal issue about this for the content team to take a look at.

@per1234
Collaborator

per1234 commented Jul 17, 2022

The "Wire" library reference content is now hosted in the arduino/reference-en repository so I have transferred this issue to the relevant tracker.

I see that a claim was made that the suggestions made here were incorporated during the migration of the content to this repository: #870. I haven't had the chance to review that though so I will leave this open until the resolution is verified.

The published content is here:

https://www.arduino.cc/reference/en/language/functions/communication/wire/

If anyone spots any omissions or errors, please let us know.
