Controlling hardware simulation: Python test benches? #226
I agree with you on the problem completely, but I am not sure I quite follow the part about the test benches. Are you thinking that we pass an "input generator" as a function along with the simulation? I think this makes sense from a fast-simulation standpoint, but I am not sure how it maps to the FPGA space (unless the generator was something we could then synthesize). For the hardware side, I think the idea is that often processors operate in "bursts" of activity. For example, we might send over a set of commands which invoke some function or set some memory bits, and then "run" the processor, which might take many thousands or even millions of cycles with no new inputs being needed -- obviously the input wires will be set to some value, but they don't necessarily need to change. For example, if we were to have a "terminal" on this hardware machine, we might need to send something on each keystroke, but nothing is sent in between. So what I was thinking was that this could be captured by some sort of "run until" operation with an explicit watchdog counter. However, perhaps there is a middle way -- if functions or classes are how we generate inputs, we could have a particular type of function/class that was then hardware accelerated? Open to more discussion.
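The "run until" with a watchdog counter described above might look something like the following. This is a minimal sketch: `CounterSim` is a toy stand-in with PyRTL-like `step`/`inspect` methods, and the names `run_until`, `condition`, and `max_cycles` are illustrative, not actual PyRTL API.

```python
class CounterSim:
    """Toy stand-in for a simulation: a 4-bit counter that wraps at 16."""
    def __init__(self):
        self.value = 0

    def step(self, inputs):
        # Advance one cycle with the given (constant) input values.
        self.value = (self.value + inputs.get('enable', 1)) % 16

    def inspect(self, name):
        # Real PyRTL inspects a named wire; the toy has only one value.
        return self.value


def run_until(sim, condition, inputs, max_cycles=10_000):
    """Step the simulation with constant inputs until condition(sim)
    is true, or raise once the explicit watchdog counter expires."""
    for cycle in range(max_cycles):
        if condition(sim):
            return cycle
        sim.step(inputs)
    raise RuntimeError(f"watchdog expired after {max_cycles} cycles")


sim = CounterSim()
cycles = run_until(sim, lambda s: s.inspect('count') == 10, {'enable': 1})
print(cycles)  # prints 10
```

The watchdog keeps a "run until some output changes" request from hanging forever when the condition is never met, which matters more once the loop is running on hardware rather than in a debugger.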
That's basically my idea with the test bench, but in addition to being an input generator it could also be an output pruner of sorts, discarding or compressing uninteresting outputs to reduce memory requirements. You're right that this wouldn't work on a typical FPGA, but the PYNQ has a couple of ARM Cortex-A9 cores (capable of running Python) that are directly connected to the FPGA fabric, so a Python test bench running on the ARM wouldn't necessarily have the same performance problems as one on the host computer. For the simple case you describe of setup followed by constant inputs until some end condition, a "run until" would suffice. For anything more complicated, though, it would be easiest to express the testing conditions in code. Hardware is a pain to write compared to standard Python, so if we can get reasonable performance without requiring a synthesizable test bench, that would be ideal.
I agree that having synthesizable test benches is less than ideal. However, I thought that the Cortex-A9 cores were on-board but still off chip -- having the processor provide inputs every cycle might be pretty slow. I think each transaction will require both a transfer across the off-chip bus and interrupt handling. From a bandwidth standpoint my guess is that it is pretty decent, but from a latency perspective (invoke on each cycle) I am less sure. It might make sense to estimate the performance of the schemes (or run some tests)?
You probably want to store a bunch of pre-generated inputs in host memory, DMA the inputs to the programmable logic, and then DMA the outputs back, interrupting the processor when the transfer completes. There are some IP blocks that should help you with that.
From a hardware performance standpoint, I agree that DMA is the way to go. I think the question is how to encapsulate that interaction. Abe is absolutely right that a generator is the right way to capture that interaction without requiring the "test bench" to explicitly manage the reentrant nature. However, if you want performance out of the hardware, some structure is required of that generator (for example, that there are "blocks" where inputs and outputs have no dependency). That is where I was going with the "types" of generators... only some of which get you hardware performance (but all should still be functional as long as the hardware supports single step). Another option would be that there is a .softsim and .hardsim method that is either hand written or automatically generated from some other specification... however, it becomes YASL (yet another specification language) pretty quickly down that path.
I think the right approach in general is to start with functionality and convenience, and only worry about performance where necessary. This would suggest Python test benches in the general case, with an option to add a synthesizable test bench when you need pure-FPGA speed. In this view, PyRTL would provide a framework for connecting the Python test benches to the simulated logic (as well as some basic test benches), so that users can write their test benches normally, then port to logic only those pieces of the test bench that need higher speed. We could perhaps provide some pieces of logic (using this framework) that implement the optimizations you mentioned, such as running several steps without changing the inputs. As far as I can tell, this approach meets all the requirements:
This would work particularly well on the PYNQ, since the CPUs and FPGA are in the same package, but even if the only connection between Python and the FPGA is over USB, we should be able to present the same API. |
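One way to read the framework idea above: the test bench sees only a narrow `step`/`inspect` interface, so the same bench code can drive a pure-Python simulation, a PYNQ-attached FPGA, or a USB-attached one. The following is a hedged sketch of that separation; the names (`SimBackend`, `PythonSim`, `bench`) are illustrative and not part of PyRTL, and the "hardware" here is just an accumulator.

```python
from abc import ABC, abstractmethod


class SimBackend(ABC):
    """Minimal backend interface a test bench is allowed to use.
    A pure-software simulator, a PYNQ-fabric backend, or a USB-attached
    FPGA backend would all implement these same two methods."""

    @abstractmethod
    def step(self, inputs):
        """Advance one cycle with the given input values."""

    @abstractmethod
    def inspect(self, wire_name):
        """Read the current value of a named wire."""


class PythonSim(SimBackend):
    """Pure-software backend; here just an accumulator for illustration."""
    def __init__(self):
        self.total = 0

    def step(self, inputs):
        self.total += inputs['data']

    def inspect(self, wire_name):
        return self.total


def bench(sim: SimBackend):
    """A test bench written against the interface, not a backend."""
    for x in [1, 2, 3]:
        sim.step({'data': x})
    return sim.inspect('total')


print(bench(PythonSim()))  # prints 6
```

Swapping `PythonSim` for an FPGA-backed class would leave `bench` untouched, which is the "same API over USB or on-package" property described above.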
I agree that a synthesizable test bench is not the right way to go to start, as well. In fact, what I originally proposed did not have test benches at all -- but I am convinced that the generator approach both would work and is a good idea in some cases. However, if performance is not part of the equation, then I guess I don't understand how the proposed approach takes us any closer to running things in hardware (which was the topic of my original discussion). I.e., why is it "better for hardware" to have a generator-based test bench than just calling sim.step? Both can be easily done on the ARM. BTW, if you are convinced it is a good idea, then by all means do it -- sometimes it is much easier to explain when there is running code in place :)
I can imagine encapsulating that interaction would have applications beyond simulation. Being able to stream data between the programmable logic and the processor system (which is the best part of having the processor on chip) seems fairly useful.
Got it. Thanks for the very clear write-up! I would say a next step would be to connect with George and work out a game plan -- I am not sure if you two have had a chance to talk yet or not (but I think you have very overlapping interests!). I am happy to be involved too if you can fit a meeting into some lab time or free time.
I talked with George, and here is my attempt to summarize the points he made:
Based on all of this, it seems that my plan for a testing system would not be very effective. Does anyone disagree with George on the first point and think that they would benefit from doing functional simulation on an FPGA? If so, I'm happy to continue with my plan. If not, there are other things George suggested I can work on. |
From the chat:
As the simulation becomes faster, communication between it and PyRTL has the potential to become a bottleneck, so giving it enough information to run independently seems like a very good idea. However, if the simulation can run for an indefinite number of steps, specifying the input at each step becomes more complicated than simply giving a list of inputs.
For other HDLs, the standard approach is to write a test bench—a separate piece of logic that can control the inputs and monitor the outputs. If we only intend to run hardware simulations on PYNQ, though, we can use its processor, allowing test benches to be arbitrary Python code. Under this approach, the current setup of fixed inputs would simply be the default test bench, but users could create more sophisticated test benches, such as ones that discard uninteresting outputs or stop when a specific output changes.
These Python test benches would also be used with current simulation techniques, so the interface would stay consistent: if no test bench is specified, the default one would be used, meaning no change to existing code. I'm envisioning a generator-based API for the test bench:
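The generator-based API mentioned above (the original snippet is not preserved here) could plausibly look like the following sketch. Everything here is hypothetical and illustrative, not actual PyRTL API: the bench yields an input dict for each cycle, receives that cycle's outputs back via `send`, and ends the simulation by returning; `sim_step` stands in for the real simulation backend.

```python
def count_to_three():
    """Example bench: drive input `a` high until output `out` reaches 3.
    Each `yield` supplies one cycle's inputs; the yield expression's
    value is that cycle's outputs."""
    outputs = yield {'a': 1}
    while outputs['out'] < 3:
        outputs = yield {'a': 1}
    # Falling off the end (StopIteration) stops the simulation.


def run(bench, sim_step):
    """Drive a generator test bench against a step function that maps
    an input dict to an output dict; return the number of cycles run."""
    gen = bench()
    inputs = next(gen)          # bench's inputs for the first cycle
    cycles = 0
    try:
        while True:
            cycles += 1
            outputs = sim_step(inputs)
            inputs = gen.send(outputs)  # hand outputs back, get next inputs
    except StopIteration:
        return cycles


# Toy "hardware": `out` accumulates input `a` each cycle.
state = {'out': 0}
def sim_step(inputs):
    state['out'] += inputs['a']
    return {'out': state['out']}


print(run(count_to_three, sim_step))  # prints 3
```

The "fixed list of inputs" default test bench described above would just be a generator that yields each dict from the list in turn and ignores the outputs sent back.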
Does this seem like a useful addition? Does PYNQ work in a way that allows this efficiently? Any other thoughts?