Create a parallel implementation that wraps libusb directly, moves code into C. #7

Open
rpcope1 opened this issue Apr 22, 2015 · 7 comments

@rpcope1
Owner

rpcope1 commented Apr 22, 2015

Currently, it's possible that we're getting close to the performance limits that can reasonably be expected from polling data off the device in pure CPython. One possible upgrade, if all the data can be streamed, would be to implement the trigger completely and correctly in software; doing so will probably require threading, and will thus likely be hampered by the GIL. In addition to true parallelism, having a high-performance circular buffer would be excellent.
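
For illustration, a minimal sketch of the kind of circular buffer discussed above (the class name and sizes are made up, and a production version would want bulk slicing or a C implementation rather than byte-by-byte copies):

```python
import threading

class RingBuffer(object):
    """Fixed-size byte ring buffer; the oldest data is dropped on overflow."""

    def __init__(self, size):
        self._buf = bytearray(size)
        self._size = size
        self._lock = threading.Lock()
        self._start = 0   # index of the oldest byte
        self._count = 0   # number of valid bytes currently stored

    def write(self, data):
        # Append data, overwriting the oldest bytes if the buffer is full.
        with self._lock:
            for byte in bytearray(data):
                end = (self._start + self._count) % self._size
                self._buf[end] = byte
                if self._count < self._size:
                    self._count += 1
                else:
                    self._start = (self._start + 1) % self._size

    def read(self, n):
        # Pop up to n of the oldest bytes.
        with self._lock:
            n = min(n, self._count)
            out = bytearray(self._buf[(self._start + i) % self._size]
                            for i in range(n))
            self._start = (self._start + n) % self._size
            self._count -= n
            return bytes(out)
```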

@rpcope1 rpcope1 added this to the v0.1 milestone Apr 22, 2015
@rpcope1 rpcope1 self-assigned this Apr 22, 2015
@vpelletier

Just my 2 cents as the python-libusb1 author:

With async transfer usage, I can easily saturate the USB 2 bus with bulk transfers (~45 MB/s IIRC, with insignificant CPU usage and only very coarse-grained tuning). Besides using the async API, I think it may be important to adopt a pipeline design (one process to capture, (an)other(s) to process) in order to scale better than a single Python process can.
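
As a rough sketch of what the async API usage could look like (the vendor/product IDs, endpoint, and transfer sizes below are placeholders, not necessarily the scope's actual values):

```python
import usb1

VENDOR_ID, PRODUCT_ID = 0x04b4, 0x6022       # placeholder device IDs
BULK_EP = 0x86                               # placeholder bulk IN endpoint
TRANSFER_SIZE, NUM_TRANSFERS = 512 * 64, 8   # placeholder sizes

def on_transfer(transfer):
    # Called from handleEvents() whenever a transfer finishes.
    if transfer.getStatus() != usb1.TRANSFER_COMPLETED:
        return  # not resubmitting lets the loop below drain and exit
    data = transfer.getBuffer()[:transfer.getActualLength()]
    # hand "data" off to another process / queue here
    transfer.submit()  # resubmit to keep the bus busy

with usb1.USBContext() as context:
    handle = context.openByVendorIDAndProductID(VENDOR_ID, PRODUCT_ID)
    handle.claimInterface(0)
    transfers = []
    for _ in range(NUM_TRANSFERS):
        transfer = handle.getTransfer()
        transfer.setBulk(BULK_EP, TRANSFER_SIZE, callback=on_transfer)
        transfer.submit()
        transfers.append(transfer)
    while any(t.isSubmitted() for t in transfers):
        context.handleEvents()
```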

That said, I like the idea I read somewhere of using isochronous transfers (a bit of loss shouldn't matter much for analog signals, I believe), but I haven't used them myself. You may have surprises, and I would be very happy to get feedback & bug reports.

Also, in my experience PyPy provided ~6x faster execution on code written with CPython in mind - you may want to try it to get an edge and to get things running on, say, a Raspberry Pi.

...And to shamelessly advertise another of my pet projects: pprofile wants to help you find that bottleneck! Try the statistical profiling feature, which lets you reduce the performance hit from profiling as much as desired (of course, this is a trade-off against measurement duration and precision). And it's still pure Python, so it works with PyPy.
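
Roughly, the statistical mode can be used like this (the sampling period is arbitrary, and `capture_loop` is just a stand-in for whatever code is being measured):

```python
import pprofile

prof = pprofile.StatisticalProfile()
with prof(period=0.001):    # sample the call stack roughly once per millisecond
    capture_loop()          # placeholder for the code under test
prof.print_stats()          # line-by-line annotated output
```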

@rpcope1
Owner Author

rpcope1 commented May 1, 2015

Vincent,

All excellent ideas here. It's looking like I am going to try the multiprocess route; hopefully I can get a really clean solution that doesn't get bogged down anywhere. I am also going to try PyPy - I think that's a good idea, since some of this will be CPU-bound.
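
For what it's worth, a bare-bones sketch of the capture/process split under multiprocessing might look like the following (block size, block count, and the "processing" are all placeholders):

```python
import multiprocessing as mp

BLOCK_SIZE = 512  # placeholder block size

def capture(queue):
    # In the real code this would be the USB reader; here it just
    # emits dummy blocks and a None sentinel when it is done.
    for _ in range(1000):
        queue.put(b'\x80' * BLOCK_SIZE)
    queue.put(None)

def process(queue):
    # Consumes blocks until the sentinel arrives; the software trigger,
    # scaling, plotting, etc. would live here.
    while True:
        block = queue.get()
        if block is None:
            break
        _ = max(block)  # stand-in for real processing

if __name__ == '__main__':
    queue = mp.Queue()
    workers = [mp.Process(target=capture, args=(queue,)),
               mp.Process(target=process, args=(queue,))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```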

Your profiler looks really cool, and I am probably at the stage where I need better tools than timeit and friends. I'll certainly give it a go.

On a mostly related note, it appears the state of libusb on Windows is rather ugly, and I seem to recall that python-libusb1 doesn't support Windows. Do you have any experience with libusb on Windows?

@vpelletier

I think libusb works on Windows... But there are at least 3 libraries:

  • libusb-win32, which implements the libusb 0.1 API - hence incompatible with python-libusb1.
  • libusb1, although I cannot get my hands on a Windows release at the moment.
  • The "artist" library formerly known as libusbx, a.k.a. the aggressive libusb1 fork/partial takeover; it now goes by the libusb1 name on another website. Politics aside, they worked a lot on Windows support. Although I practically never use libusb1 (whichever flavour) on Windows, I believe the fork is technically superior there.

There is another alternative for installing libusb on Windows: Cygwin. Very convenient from a *nixian perspective, but I guess I wouldn't specifically target this platform for releases (i.e., it is just another GNU/x distribution as far as your package is concerned - GNU/Windows instead of GNU/Linux).

@jhoenicke
Collaborator

I just ran the async branch, letting the scope analyse itself by connecting it to the TX+ line. Of course, I cannot follow the data, since high-speed USB runs at 480 MHz. Also, due to the 20 MHz bandwidth of the scope the signal is attenuated, but it is still clear to see.

A bulk transfer of 512 bytes with handshake takes about 12 µs. Also, a bulk transfer may never cross a microframe boundary. So there are at most 10 transfers per microframe, corresponding to a theoretical maximum of 40 MB/s (45 MB/s would be possible if the handshake were a bit faster; I think the built-in rate-switching hub of my laptop is the bottleneck). With isochronous transfers we can have at most 3 kB per microframe (3 packets of 1 kB each), i.e. 24 MB/s.
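
A back-of-the-envelope check of those numbers, assuming the usual 8000 high-speed microframes per second:

```python
MICROFRAMES_PER_SECOND = 8000   # high-speed USB: one microframe every 125 µs

# Bulk: at most 10 transfers of 512 bytes fit into one microframe.
bulk_max = 10 * 512 * MICROFRAMES_PER_SECOND    # 40960000 bytes/s, ~40 MB/s

# Isochronous: at most 3 packets of 1024 bytes per microframe.
iso_max = 3 * 1024 * MICROFRAMES_PER_SECOND     # 24576000 bytes/s, ~24 MB/s

# Sampling at 24 MHz with one byte per sample needs 24000000 bytes/s,
# so isochronous transfers leave only a little headroom.
print(bulk_max, iso_max)
```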

The device can buffer 4 blocks, but a block cannot be filled while it is in transit. According to my computation, the gap between two blocks must be smaller than 61 µs at 24 MHz and smaller than 44 µs at 30 MHz to sample continuously (assuming the queue was full). At 24 MHz, I observe gaps of 60-70 µs between individual bulk reads (probably because the kernel needs some time to register the next bulk read with the USB driver). This is the reason why we lose some samples.

At 24 MHz we should be able to transfer continuously with isochronous transfers, provided the PC side requests a transfer for every microframe. We even have some slack, since USB can transfer 25165824 bytes per second. The big advantage is that we can use the full 4 kB buffer on the device. The new 12 MHz mode should also be handy for sampling two channels continuously.

I'll try to patch the firmware to always use isochronous transfers. Switching between both modes at runtime would be quite a hassle; it's probably easier to have two different firmware images, one for each mode.

@jhoenicke
Collaborator

I started an isochronous branch, but unfortunately that cannot handle 24 MHz without sample loss.

https://github.com/jhoenicke/Hantek6022API/commits/isochron

When, at the beginning of a microframe, the third buffer is almost but not yet full, only the first two buffers are sent. Now if the fourth buffer starts filling shortly after the start of the microframe, it takes 128 µs to fill the remaining buffer space. The next buffer will only be freed once the third block has been sent, but the third block only starts sending with the next microframe, 125 µs later, and needs 17 µs to be transferred. Thus we get a buffer overrun, even though we do quad buffering.

@jhoenicke
Collaborator

There's a flag for everything :) In this case I shouldn't have set the AADJ flag in the EP2ISOINPKTS register. The current version of the above branch can sample at 24 MHz without losing any samples.

I haven't checked for 2 channels at 12 MHz, but that should just behave the same way.

@rpcope1
Owner Author

rpcope1 commented May 1, 2015

Wow!! Very cool, dude! Awesome work! I'll make sure to check it out when I get home.
