Error: An unknown internal error occurred #17
Hi Tysonite, What sort of load is the device under? How much are you transmitting, approximately how much are you receiving, and are you using both channels on the device? If you have a bus capture, we could take a look at it as well to get a better idea of what's going on; you can send it to [email protected] and mention me. How long is the device running before you see these messages? Some debugging steps:
--Paul |
Thanks for the detailed response. Based on it, we have updated the firmware on the impacted device, and it looks to be working well now. We are seeing one more error message on another device, but it does not have up-to-date firmware. We will try to update the firmware there as well. Just in advance, if that does not help, what could be the reason for this error? |
Glad that updating the firmware helped. Both of the error messages you list are related and likely stem from the same underlying issue, so I suspect that updating the firmware on that one would help as well. Otherwise, I'd apply the same debugging steps as before. --Paul |
We still observe the error. Is it possible to print in the logs, as-is, the frame/packet that has the checksum error? |
One more question: how do we get a bus capture? |
@hollinsky-intrepid, I can't reach you either here or via [email protected]. Any chance of getting your support? |
I have created the branch. Ensure also that libicsneo gets updated when you switch to this branch of icsscand. Once running, the revisions should match the ones listed.
Perhaps by running this revision we can get to the bottom of the checksum errors. |
Thank you. Hope we will get new logs soon. |
When we switched to The version of
The
The output is not as mentioned by you for
which looks to be correct. One issue came up on |
The issue related to the inability to send/receive any CAN messages happened again. Some observations are below. The error that shows this issue:
The recovery method is to restart And some outputs:
The stack trace gathered from the core file dumped for the running
|
Hi tysonite, I can see where the issue is occurring. One thing that will help narrow down the problem: are you unable to receive any CAN during this time? i.e. not counting the CAN you are transmitting, can you see other CAN coming in? --Paul |
According to Actually, we tried to do |
Okay. I will have to look further into the issue. Your cansend is failing due to the backpressure system when the write queues fill up. It would be expected that your writes would eventually fail if you were consistently sending more than the bus could handle, though I'd expect this to resolve itself after your application stopped transmitting at that rate. Is your application sending a lot of traffic on this bus? Is it possible that you are sending in a while loop without a sleep? I think something may be wrong there and the application is getting stuck in the "queue full" state. However, there are other things I should look into if your application normally does not transmit much on this channel. --Paul |
Thanks. It is not resolving itself after our application stops working. Only an OrangePI reboot or a restart helps. There are about 115,000 CAN messages in total transferred within 2 hours in both directions, according to the log. Not sure whether this is much or not. As well, when we are sending CAN messages out to the system under test from our application, there is a 100ms delay after each message. However, this loop usually includes no more than 20-50 messages in a row. Then there are some pre/post steps that don't include exercising the system under test with CAN messages; only general periodic traffic is present. Just for the sake of history: this started happening after we upgraded to the branch that includes the additional logging for the checksum errors. I am wondering if the additional changes made on the logging branch cause this; it looks like there are changes unrelated to |
Some new observations: when the issue discussed above happens, we restart
And
Before the interfaces come up:
|
Hi @hollinsky-intrepid, did you get a chance to look further? |
We are actively looking into the issue and running some tests to see if we can reproduce your issue here. Due to the fact that the errant behavior only appears after several days straight of running, it may take some time to reproduce and fix. --Paul |
Thank you. If you need any other info, please feel free to ask. If possible, I will share it. |
A few more logs: on one of our OrangePIs that sometimes experiences the issue, the following log records appeared. However,
Also,
|
@tysonite I have still not been able to reproduce here. In many cases when I simulated an error though, these "discarding byte" messages were preceded by a line "A checksum error has occurred!" along with more information about the error. Did you see anything like that in your logs? |
@hollinsky-intrepid, "checksum error" message went away after we switched to branch with additional logs. Only "discarding bytes" present sometimes. I already asked 2 times, but didn't receive a reply. It looks for me that
Can you please double-check whether all of this is really necessary, and that we are really running the same versions on both sides? I feel we are testing some changes that are still in development. That's how we switched to
Did we do anything wrong? |
@hollinsky-intrepid Any feedback on previous message? |
Hi @tysonite, I see the line in your logs indicating that journald is rate-limiting messages. You can remove/change the rate limiting on journald:

# in /etc/systemd/journald.conf
RateLimitInterval=0
RateLimitBurst=0

$ systemctl restart systemd-journald

(see more here) For system performance reasons, once you capture the logs we need, I'd recommend setting this back to the default. After this, all messages should be collected and we can look further. About the new version:
Thank you for the clarifications and the nice catch of the systemd logs rate limit. We've also found some inconsistencies in how our system is used by consumers, and we are trying to prevent this in the future as well as testing a defense mechanism. In short, periodic messages being sent by our system never terminate, new periodic messages queue up in addition, and this probably leads to a malfunction of icsscand after some period of time. We plan to fix this misbehavior first, enable systemd logs for icsscand, and wait for a reproduction if it happens. One more question, if you don't mind: there is a so-called bus-off state of the CAN hardware, and recovery for this state can be enabled by
Hi @tysonite, That all sounds good; let us know if you encounter any other problems once your end has the fix applied. Our hardware is programmed to avoid bus-off, so recovery will never be necessary. The TEC/REC are exposed to the user currently as a CAN message, though I'd like to make a better API for this in the near future. --Paul |
After some more experiments I have come to the conclusion that there is something weird happening with a ValueCAN3 device when it is used for a long period of time. This method does not exactly reproduce the complete freeze of the icsscand daemon, but it at least gives a similar error while sending CAN messages continuously. Please see my findings below. Here is a Python 2.7 script that after some time produces the error:

#!/usr/bin/env python
from __future__ import print_function
import time
import threading
import can

# Open the bus; the channel name differs between the ICS daemon and plain SocketCAN.
try:
    bus = can.ThreadSafeBus(interface='socketcan', channel='ics0can1')
except:
    bus = can.ThreadSafeBus(interface='socketcan', channel='can1')

def send_periodic():
    # Register 101 periodic messages, each sent every 100 ms.
    msg = can.Message(arbitration_id=0x122, data=[0x04, 0x02, 0, 0, 0, 0, 0, 0])
    bus.send_periodic(msg, float(0.1))
    print("periodic thread")
    for i in range(100):
        msg = can.Message(arbitration_id=i,
                          data=[0, 25, 0, 1, 3, 1, 4, 1])
        bus.send_periodic(msg, float(0.1))
    print("periodic thread done")

def par_thread():
    print("send thread")
    def func():
        # Send a single message every 0.5 s and report any send failures.
        msg = can.Message(arbitration_id=0x999, data=[0x04, 0x02, 0, 0, 0, 0, 0, 0])
        i = 0
        while True:
            try:
                i = i + 1
                if i % 100 == 0:
                    print("i = {}".format(i))
                bus.send(msg, 0.1)
                time.sleep(0.5)
            except Exception as e:
                print("{} - {} - {}".format(threading.active_count(), i, e))
    thread = threading.Thread(target=func)
    thread.start()
    print("send thread started")

if __name__ == '__main__':
    par_thread()
    send_periodic()
    print("sending messages")
    time.sleep(10000)

In order to run it, you need to have python-can installed.
This script creates a huge load that is not a real case in our production environment, but at least it moves us a bit towards understanding the issue (hopefully). It sends periodic messages (100) using the native Linux SocketCAN interface ( After about 5-10 minutes, it starts printing While the script runs, I observe the TX/RX counters using
At the very first run, they increase fast enough, but over time, and after several restarts of the script, the speed at which they increase degrades considerably. Finally they increase by only 10-20 per second, while at the beginning it was around 300-400. Even if the OrangePI is rebooted to exclude any potential issues in the CAN drivers/daemon and to free up any stale resources, the counters still increase very slowly, which makes me think that the ValueCAN3 is busy with some unexpected tasks. After some time (around 40-60 minutes), the ValueCAN3 gets back to normal. The same happens with either CAN driver/daemon (v1 or v2). No errors are reported by I would like to get your thoughts/suggestions on this, as I feel that probably the same kind of thing is happening in our production environment, but as it runs for a very long time (days/weeks), the circumstances look different. |
I am able to reproduce an issue with The network topology is the following: the OrangePI is connected to a ValueCAN3/4, and another device is connected to the same ValueCAN3/4 that produces about 5-8 periodic messages with a 0.2-second period. There are 2 interfaces (can0, can1); can1 has traffic, can0 has none at all. The only recovery is to restart
|
Very peculiar. I left mine running over the weekend on my workstation as well and have 242704818 packets sent thus far without error. There are no logs in the Python console other than from i = 0 to i = 479700. I'm running on the I also have yet to observe the CPU usage you mentioned before, so I'm starting to wonder whether the OrangePi-specific kernel (or possibly just an older kernel) is doing something different. It is also curious since you are only loading the bus to ~30%, assuming a 500k bus. |
The ValueCAN is configured with a 125k can1 bus. However, the same happens with a very small amount of traffic (7%-10% bus load). The kernel version is I have the following statistics now:
When it reproduces, the script prints continuously:
As well, |
I think we will try to get the same hardware setup that you have in order to track down the issue further. You're using the OrangePi One with 512MB of RAM? And what type/speed of SD card are you using for it? |
Are you sure? We have several other devices connected via different interfaces; it will be hard to mimic our setup. We are actually open to having a call for a live debugging session. Would that work? |
Yes, an OrangePI One with 512MB of RAM. I will share a picture of the SD card soon, as well as a pre-built image to flash onto it, as my team says it is possible to share it without a signed NDA. |
The SD card is this one: https://www.amazon.com/SanDisk-Extreme-MicroSDXC-UHS-I-Adapter/dp/B07G3GMRYF/ref=sr_1_3?keywords=sandisk+64gb+extreme+pro&qid=1576681133&sr=8-3 And a few more updates:
So I suspect that this is a combination of 2 reasons:
|
Hi @hollinsky-intrepid, did you get a chance to reproduce this on the OPI? Is there anything we could do to help? |
Hi @tysonite, I didn't expect it to take quite this long to get hardware. In the meantime, could you get us a core dump where it's stuck? You can do Perhaps you could send this, along with the binaries you've built, to the support team so we can take a closer look while we're waiting for ours to arrive and exhibit the behavior you're talking about. --Paul |
I've sent the data through David. The icsscand daemon was in a state where it rejected CAN messages and did not recover automatically after a while, assuming high bus load prior to |
|
Here is one more backtrace with debug info included:
I feel that the issue is in thread 1 and thread 6, which probably deadlock each other. Thread 1 tries to write a message in |
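For context, here is a minimal sketch of the kind of lost wakeup this analysis describes (hypothetical names and capacity, not the actual libicsneo/icsscand code): the queue size is checked outside the lock that guards the condition variable, so the consumer's notification can slip into the gap between the check and the wait.

#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>

struct WriterSide {
    std::atomic<std::size_t> approxSize{0};   // stand-in for the lock-free queue's size_approx()
    std::size_t capacity = 512;               // hypothetical write-queue limit
    std::mutex mtx;
    std::condition_variable roomAvailable;

    // Producer (e.g. thread 1): blocks when the queue looks full.
    void blockingPush() {
        if (approxSize.load() >= capacity) {   // (1) queue looks full
            // (2) if the consumer dequeues and notifies exactly here,
            //     the notification is lost because nobody is waiting yet
            std::unique_lock<std::mutex> lk(mtx);
            roomAvailable.wait(lk);            // (3) can sleep forever -> apparent deadlock
        }
        approxSize.fetch_add(1);               // enqueue the element here in real code
    }

    // Consumer (e.g. thread 6): drains the queue and signals that there is room.
    void popOne() {
        if (approxSize.load() > 0)
            approxSize.fetch_sub(1);           // dequeue the element here in real code
        roomAvailable.notify_one();            // wakeup is lost if it races with (1)-(3)
    }
};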
Hi @tysonite I think you may well be correct in your analysis. The valgrind results you posted are quite odd to me, since if you look at I can definitely see what your second post is talking about with the condition variables. I suppose I had not expected that I want to push a fix for this and see where we end up. My fix will be two-fold. As to your question of why we decided on concurrentqueue, we've had a great experience with it so far and have found it to be very performant. It doesn't make much of a difference with a ValueCAN, but some devices can accept megabits (or even gigabits) of data with networks like Ethernet and FlexRay so the performance makes a big difference. We may also decide to use thread-pools for decoding if necessary, making multi-producer/consumer a nice feature to have waiting in the wings. If we do find it to be a source of problems, it would not be too difficult to swap it out for another solution or something homegrown that performs a similar task. |
I did it a bit simpler for the sake of proof during our testing. The blocking logic you have with a condition variable might be replaced with See this patch:
It will be running for a few days on our side so we can gain confidence that the issue goes away, unless you have a better patch. BTW,
I see, understood. The queue implementation also has to be upgraded, as it contains the known issue I noted earlier. According to the original author, it can happen in the case of long-running enqueue/dequeue operations. |
Okay that sounds good. I have a patch here I planned to give you which still uses the condition_variable to prevent excessive spinning, but we can wait on that and check that your solution works. Then we can apply it to v0.2.0 and get you guys back onto the mainline. We'll also get the concurrentqueue implementation updated to the latest. |
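For readers following the thread, a rough sketch of what a condition_variable-based bounded push can look like when the size bookkeeping, the wait predicate, and the notify are all coordinated (hypothetical names and capacity, not Paul's actual patch): the wakeup can no longer be lost and there is no busy-spinning.

#include <condition_variable>
#include <cstddef>
#include <mutex>

struct BoundedPush {
    std::size_t size = 0;                  // bookkeeping kept under the mutex in this sketch
    std::size_t capacity = 512;            // hypothetical write-queue limit
    std::mutex mtx;
    std::condition_variable roomAvailable;

    void push() {
        std::unique_lock<std::mutex> lk(mtx);
        // The predicate is re-checked after every wakeup, so a notification
        // that fires before the wait starts cannot be lost.
        roomAvailable.wait(lk, [&] { return size < capacity; });
        ++size;                            // enqueue the element here in real code
    }

    void pop() {
        {
            std::lock_guard<std::mutex> lk(mtx);
            if (size > 0) --size;          // dequeue the element here in real code
        }
        roomAvailable.notify_one();        // safe: the state change happened under the mutex
    }
};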
It runs well so far. However, I forgot to note that I've added additional logging of USB errors via the following patch, as noted on the libftdi help page:
During the last 12 hours, the following error appeared in the log:
But I am not sure where it comes from, as there are several places where an unknown error may be raised. Would it also be possible to somehow enhance the logging, e.g. by adding a separate category/type for USB errors as well as the return code of the ftdi read/write call? Also, would it make sense to retry ftdi calls in case of some errors, as I am a bit worried about missed messages? |
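For illustration, a minimal sketch of the kind of extra USB-error logging described above (readWithLogging is a hypothetical wrapper, not the actual patch): it records the raw libftdi return code together with ftdi_get_error_string() so "unknown error" reports can be traced to a concrete code.

#include <ftdi.h>      // libftdi1; link with -lftdi1
#include <cstdio>

// Hypothetical wrapper around the read path: forwards the call and logs failures.
int readWithLogging(struct ftdi_context* ctx, unsigned char* buf, int size) {
    int ret = ftdi_read_data(ctx, buf, size);
    if (ret < 0) {
        // Negative return values generally map to libusb error codes
        // (for example, LIBUSB_ERROR_TIMEOUT is -7).
        std::fprintf(stderr, "ftdi_read_data failed: ret=%d (%s)\n",
                     ret, ftdi_get_error_string(ctx));
    }
    return ret;        // >= 0 is the number of bytes actually read
}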
I don't think it is a false positive. The EventManager destructor does not protect |
I would still propose getting rid of size_approx(), enqueue() and the condition variable used to wake up a thread, in order to make the logic less complex and error-prone. Regarding excessive spinning, would it help if The
Even the daemon's responsiveness might be higher in this case. |
The Also, I tried to google, and people invoke ftdi_read_data continuously to get rid of some errors, e.g. like here. But I am not sure if that is the case now. Sorry for spamming, but the reliability of icsscand is really critical for us. |
Unfortunately this would drastically affect latency. For some applications, round trip latency is extremely important, and the current system allows for latency in the hundreds of microseconds in some cases.
These mean that somehow we missed something (maybe a single byte, maybe more) in the bytestream coming from the device. You're seeing these messages because we somehow missed the "start of packet" identifier. This doesn't necessarily mean you're missing data, as it's likely one of the "heartbeat" status messages from the device that happen very often while connected. Nevertheless, it should not be happening and we should get to the bottom of it. I'm still very curious why we're not seeing this on any of our testing machines. It may be down to the device or kernel, so I have finally acquired an OrangePi and will be doing further testing on that.
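To illustrate the resynchronization being described (with a placeholder start-of-packet value, since the real packet format is not shown in this thread): when a byte is lost, the parser scans forward, logging and discarding bytes until it finds the next start-of-packet marker, so individual bytes can be dropped without losing the whole stream.

#include <cstdint>
#include <cstdio>
#include <deque>

constexpr std::uint8_t kStartOfPacket = 0xAA;   // placeholder value, not the real marker

// Drop bytes until the next start-of-packet marker is at the front of the buffer.
void resync(std::deque<std::uint8_t>& rxBuffer) {
    while (!rxBuffer.empty() && rxBuffer.front() != kStartOfPacket) {
        std::fprintf(stderr, "Discarding byte 0x%02x\n", rxBuffer.front());
        rxBuffer.pop_front();
    }
    // Normal packet decoding resumes from rxBuffer.front() once the marker is found.
}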
The read call will be immediately retried. The write will not, but I think it's best to understand what errors are occurring so the device doesn't receive duplicates.
You're correct that this error occurs in a few places. However, there are only two places it occurs during continued operation of icsscand. The first is after For debugging,
You're correct, we only flag up from a list of predefined errors in libicsneo at the moment. Having arbitrary data attached to them is definitely something we can consider going forward. Like I said,
It's understandable. In the coming weeks and months we will be getting ready to release Vehicle Spy X, and libicsneo is the hardware driver layer of that. You'll see this repo getting a lot more attention and stringent testing as we do more in that space. |
I understand. The original concern was about spinning; however, as far as I see
Thanks. I added that in each place where an unknown error might be reported while processing CAN data.
Is it possible to prove that somehow internally in icsscand? |
I got the first error message:
As per libusb error codes, it looks to be LIBUSB_ERROR_TIMEOUT. However, there is nothing suspicious in the |
It looks to me as though libusb will return partial reads in case of a timeout; however, libftdi does not handle this properly and throws out that data. Per the libusb docs:
However, the latest libftdi handles this as:

ret = libusb_bulk_transfer (ftdi->usb_dev, ftdi->out_ep, ftdi->readbuffer, ftdi->readbuffer_chunksize, &actual_length, ftdi->usb_read_timeout);
if (ret < 0)
    ftdi_error_return(ret, "usb bulk read failed");

I will have to dig into this properly and make modifications to libftdi. For the sake of your testing, we can change |
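As an illustration of the suggested handling (a sketch, not an actual libftdi patch; readChunk and its parameters are hypothetical), a read wrapper could keep the partially transferred bytes that libusb reports even when the transfer times out:

#include <libusb-1.0/libusb.h>   // adjust the include path to your libusb installation

// Hypothetical read wrapper: returns the number of bytes received, or a libusb error code.
int readChunk(libusb_device_handle* dev, unsigned char endpoint,
              unsigned char* buf, int len, unsigned int timeoutMs) {
    int actualLength = 0;
    int ret = libusb_bulk_transfer(dev, endpoint, buf, len, &actualLength, timeoutMs);
    if (ret == 0 || (ret == LIBUSB_ERROR_TIMEOUT && actualLength > 0))
        return actualLength;     // keep whatever arrived, even on a timeout
    return ret;                  // a real error, or a timeout with no data at all
}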
It has been running without issues for about a week now. The final patch used for testing is this one:
As well, The last thing I am a bit concerned about is intrepidcs/icsscand#3. |
Hi @hollinsky-intrepid, is there any way to upgrade ValueCAN firmware on OrangePI/Linux? We are thinking about a way to automate this process somehow. |
Hi @hollinsky-intrepid, I see you have released a new version of the library, but it looks like it does not include a fix for this issue. Would it be possible to plan those changes/fixes for inclusion into the mainline? |
Yes, I'll make a note to get them merged in. I think the right way to fix this will be adding the patch to libftdi to handle partial reads. |
Use a spin lock to recheck the queue size until it has room to push.
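For reference, a compact sketch of the approach named in that commit message (hypothetical names and capacity, not the actual change): instead of sleeping on a condition variable, the producer re-checks the approximate queue size in a loop until there is room to push.

#include <atomic>
#include <cstddef>
#include <thread>

std::atomic<std::size_t> approxSize{0};      // stand-in for the queue's size_approx()
constexpr std::size_t kCapacity = 512;       // hypothetical write-queue limit

void spinningPush() {
    while (approxSize.load() >= kCapacity)   // re-check until the consumer has drained enough
        std::this_thread::yield();           // avoid burning a full core while waiting
    approxSize.fetch_add(1);                 // enqueue the element here in real code
}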
Hey, just wanted to ask if anything is pending here... |
Nice! Thanks for the support! |
Messages like the ones below started to appear in syslog continuously. There are already around ~25,000 messages like that, and their frequency increases. We are using recent versions of all CAN repositories.
The outcome is that CAN messages can't be received from the device.
CAN repos versions:
Hardware:
OrangePI One
Linux kernel:
4.13.15-sunxi #1 SMP Tue Nov 21 23:35:46 MSK 2017 armv7l armv7l armv7l GNU/Linux
Can you please suggest how to debug it further?