XDMA: Low performance numbers on Gen3x16 #293
Comments
Dmitriy, I have similar problems with XDMA PCIe3x8 data transfers into BRAM memory, but not as severe as yours. The maxima I found are 3,381.66 Mbytes/s read speed and 3,399.44 Mbytes/s write speed for a maximum packet size of 256 Kbytes (limited by the number of BRAM blocks in the FPGA). That's a lot better than 2.2 Gbytes/s, but still less than half of the theoretical maximum according to Wikipedia. Your graph shows similar values for 256 Kbyte packets. Do you think it makes sense to make the packets larger? It looks like your curve is still picking up speed for small packets, while my curve is concave from the start. I have not touched any IRQ-related driver settings yet; unrelated to the packet size, there might be room for improvement in the graph, but for now it's under-performing like your setup.

Mischa.
Mischa, you have created very interesting graphs; however, I have not worked with BRAM and am an absolute beginner in FPGA, and this project is my first. I am only reading data from a FIFO, but its size is limited to one megabyte, so for testing I am just transmitting meaningless data ('1) to the c2h channel. Did I understand correctly that you are copying data from the computer and transferring it to BRAM on the FPGA?

Diman.
Bonjour, I am a specialist in device driver development on Windows and Linux. I developed DrvDMA (https://www.kms-quebec.com/Cards/0041_en.pdf), a driver supporting the XDMA engine. I would be happy to find a way to test my driver with your specific hardware to see whether the performance is better. For now, I have tested it on a x4 Gen3 link and get ~20 Gb/s (~2.5 GB/s) from the card to PC memory, but the speed was limited by the FPGA data processing. Do you use an FPGA dev kit? If yes, which one? Would you want to share your bitstream?

Regards,
Hi Diman, I've been doing mostly C/C++ and Assembly all these years. The same here, I am also inexperienced with FPGAs. I am indeed copying data from the computer into BRAM and back into main memory. In fact, the graph shows the speed tests for all the different input combinations of the 'PG195: AXI4 Memory Mapped Default Example Design'. That's how far I've come. Not very far yet.

Mischa.
Hi Martin, thank you for your interest, but I don't think the bitstream file will give you much to work with, because you'll need our board anyway. I would recommend creating an example design on AXI Stream and testing it there.

Diman.
Hi Mischa, then I think this unusual graph is indeed related to your computer's memory limitations. Could you tell me what program you're using to measure bandwidth?

Diman.
Hi Diman, thanks for taking the time to answer. If we want to try DrvDMA with your board, we can discuss it.

Regards,
Hi Diman, no, it does not have anything to do with the computer's main memory. These are DDR4-3200 DIMMs and they are fast enough; I wouldn't have measured 17 Gb/s read speed with the Kingston Fury Renegade PCIe 4.0 x4 SSD otherwise. I think the FPGA transfer speed is limited by the BRAM, but I can only be sure once I've tried RAM plus a memory controller instead of the BRAM. Another factor that could influence the transfer speed is the use of IRQs, which are disabled in the current driver, if I'm correct. To answer your question, I've written my own software to measure the transfer speed (by placing timers around the read and write system calls).

Mischa.
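For reference, a minimal sketch of the kind of measurement described above (timing plain `read()` calls on the XDMA character device) could look like the following. The device path, chunk size, and iteration count are assumptions to adjust for the actual setup; it is only a sketch, not the tool Mischa used.

```c
/* Minimal throughput probe for an XDMA C2H channel: time a series of
 * read() calls from the character device and report MB/s.
 * Device path and sizes are assumptions; adjust for your board/driver. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/xdma0_c2h_0";   /* assumed device node */
    const size_t chunk = 1 << 20;           /* 1 MiB per read */
    const int iters = 1024;                 /* 1 GiB in total */

    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    void *buf = NULL;
    if (posix_memalign(&buf, 4096, chunk)) { perror("posix_memalign"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    size_t total = 0;
    for (int i = 0; i < iters; i++) {
        ssize_t n = read(fd, buf, chunk);
        if (n <= 0) { perror("read"); break; }
        total += (size_t)n;
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("read %zu bytes in %.3f s -> %.2f MB/s\n",
           total, sec, total / sec / 1e6);

    free(buf);
    close(fd);
    return 0;
}
```

Comparing the number this prints with what `dd` reports on the same device is a quick way to check that the user-space tool is not the bottleneck.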
Hi, friends! I am measuring the performance of XDMA on the Z19-P board over PCIe Gen4x16 (the IP core only supports Gen3x16), and I can't reach the theoretical speed of at least 12 GB/s; I get only 5-6 GB/s.
My system is Fedora 39, Linux Kernel 6.5.2
As a result of extensive work, I have tried absolutely all debugging options available to me, but the speed remained the same. Here are my observations and assumptions:
1. The processor is not fully utilized; the maximum is 60%. I find this strange because, for example, the command `dd of=/dev/null if=/dev/zero bs=1MB count=10000` fully loads the processor to 100%, while `dd of=/dev/null if=/dev/xdma0_c2h_0 bs=1MB count=10000`, which also transfers bytes via XDMA (I checked), also shows 60% and the same bandwidth. This detail is one of my arguments for why the programs I am using (like dma_from_device.c, etc.) are not the cause of the speed limitation; rather, it is the driver that limits it. Even the oldest Linux command, `dd`, cannot handle the transfer properly. Perhaps the current state of the driver is incompatible with some component of the system, for example with the new Fedora 39, and XDMA simply does not deliver full performance due to some bug.
2. The Hardware Numbers program (the second graph) shows excellent results. This Xilinx program was created so that we could learn the potential performance figures of our PCIe interface without software and drivers. Thus, the problem is definitely not in the hardware but somewhere in the OS or XDMA.
3. When reconfiguring the XDMA IP core from Gen3x16 to Gen3x8, I expected to see the same 5-6 GB/s as with Gen3x16, but I saw 2.2 GB/s. Both values are roughly 30% of their respective theoretical maxima (a rough calculation is sketched below, after this list). Something during operation cuts the speed by ~70%.
4. I noticed that the speed differs depending on which Linux kernel I build and insmod xdma against: on 6.5.2 it works 10-20% faster than on 6.9.9.
5. Different versions of the dma_ip_drivers repository do not make a significant difference.
6. Previously, I had a problem with Poll Mode: it was performing worse. With the help of `#define XDMA_DEBUG 1` (AR71435) I fixed that issue; however, it did not help the overall performance. I didn't find anything else strange in this debug log, except, perhaps, that for some reason a strange number of descriptors is allocated. For example, if the transfer is 1 MB, 255 descriptors are allocated, but only 16 are actually used (the log also writes "nents 16/256"). These messages in `dmesg` are output by the function in libxdma.c at line 3040. (One possible reading of these numbers is sketched below, after this list.)

The first graph shows the final results of my measurements with honest figures.
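On item 3, a rough back-of-the-envelope check of the "~30% of maximum" figures, assuming the usual PCIe Gen3 parameters (8 GT/s per lane, 128b/130b encoding) and ignoring TLP/protocol overhead; the 5.5 GB/s value is simply the midpoint of the reported 5-6 GB/s:

```c
/* Rough theoretical PCIe Gen3 link bandwidth vs. the measured numbers
 * reported above (5-6 GB/s on x16, 2.2 GB/s on x8).
 * 8 GT/s per lane, 128b/130b encoding; TLP/DLLP overhead is ignored,
 * so the real achievable maximum is somewhat lower. */
#include <stdio.h>

static double gen3_gbytes_per_s(int lanes)
{
    return 8.0 /* GT/s */ * lanes * (128.0 / 130.0) / 8.0; /* GB/s */
}

int main(void)
{
    double x16 = gen3_gbytes_per_s(16);   /* ~15.75 GB/s raw */
    double x8  = gen3_gbytes_per_s(8);    /* ~7.88 GB/s raw  */

    printf("Gen3 x16: %.2f GB/s raw, measured 5.5 GB/s -> %.0f%%\n",
           x16, 100.0 * 5.5 / x16);
    printf("Gen3 x8 : %.2f GB/s raw, measured 2.2 GB/s -> %.0f%%\n",
           x8, 100.0 * 2.2 / x8);
    return 0;
}
```

After typical TLP and flow-control overhead, a Gen3 x16 link is often quoted at around 12-13 GB/s of payload bandwidth, which matches the 12 GB/s target mentioned at the top of this post.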
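On the "nents 16/256" message in item 6, one possible (unverified) explanation, assuming 4 KiB pages: a 1 MB user buffer spans 256 pages, so up to 256 descriptors are reserved, while the kernel's scatter-gather mapping can merge physically contiguous pages into fewer entries, which would leave only 16 actually used. The arithmetic would be:

```c
/* Descriptor-count arithmetic for a 1 MB transfer, assuming 4 KiB pages.
 * If physically contiguous pages get merged into larger scatter-gather
 * entries (here a hypothetical 64 KiB per entry), "nents" ends up much
 * smaller than the page count.  This is only a guess consistent with the
 * reported 256-vs-16 figures, not behaviour verified against libxdma.c. */
#include <stdio.h>

int main(void)
{
    const unsigned long transfer = 1UL << 20;  /* 1 MiB */
    const unsigned long page     = 4UL << 10;  /* 4 KiB */
    const unsigned long merged   = 64UL << 10; /* 64 KiB, hypothetical contiguous run */

    printf("pages (max descriptors): %lu\n", transfer / page);   /* 256 */
    printf("merged sg entries      : %lu\n", transfer / merged); /* 16  */
    return 0;
}
```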
I will be very grateful for any hint from you.