- Patch 1 (small vectors unit stride loads/stores) has been approved and pushed to riscv-to-apply.next.
- Patch 2 (large vectors unit stride loads/stores) has been updated to address Richard Henderson's review.
- latest review from Richard Henderson: https://lists.gnu.org/archive/html/qemu-devel/2024-12/msg05246.html
- new version of the patch: https://lore.kernel.org/all/[email protected]/.
- A new version of the patch for the whole-register loads/stores is being tested for performance and will soon be submitted upstream.
- this new version of the patch addresses the latest review from Max Chou: https://lore.kernel.org/all/[email protected]/
The previous implementation called QEMU functions that perform 128-bit loads and stores in order to quickly emulate loads and stores of whole vector registers. In the current QEMU, however, these functions generate at best a pair of 64-bit loads and stores. That meant that if a fault happened on the second of these memory operations, we would not be in a position to update vstart from our routine in target/riscv/insn_trans.
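For illustration only (this is not the actual patch code), the old shape looked roughly like the sketch below. It assumes a QEMU translation context where ctx, addr, vd, off and a vreg_ofs()-style offset helper are available (all placeholder names), and uses tcg_env as the env pointer (named cpu_env in older QEMU). On hosts without native 128-bit memory operations, TCG itself lowers the 128-bit access into two 64-bit ones:

```c
/*
 * Rough sketch of the previous approach (placeholder names: ctx, addr,
 * vd, off, vreg_ofs). One 128-bit guest access covers 16 bytes of the
 * vector register; TCG may lower it into two 64-bit accesses, and a
 * fault on the second one is taken before we get any chance to write
 * vstart from target/riscv/insn_trans.
 */
TCGv_i128 t16 = tcg_temp_new_i128();
TCGv_i64 lo = tcg_temp_new_i64();
TCGv_i64 hi = tcg_temp_new_i64();

tcg_gen_qemu_ld_i128(t16, addr, ctx->mem_idx, MO_LE | MO_128);
tcg_gen_extr_i128_i64(lo, hi, t16);   /* split the 128-bit value */
tcg_gen_st_i64(lo, tcg_env, vreg_ofs(ctx, vd) + off);
tcg_gen_st_i64(hi, tcg_env, vreg_ofs(ctx, vd) + off + 8);
/* vstart can only be updated here, after the whole 16-byte access. */
```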
The new version of the patch:
https://github.com/PaoloS02/rise-rvv-tcg-qemu/commit/db95037b428e28b084ce550872406da9ba4217bf
directly generates 64-bit loads and stores to emulate the whole-register accesses. That gives us the chance to update vstart after every load/store performed. The new version also handles the possibility that the machine we run on has 32-bit registers, and generates 32-bit loads and stores in that case. If the vector elements are 64 bits long we fall back to the helper function, so that we avoid loading and storing half vector elements and thus exposing ourselves to the risk of a fault in the middle of an element.
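Sketched in the same spirit (again with placeholder names, not the exact patch logic), the 64-bit host path could look like the following, updating vstart after each committed 64-bit chunk; nbytes, base, vd, ctx and vreg_ofs() are assumptions, and SEW=8 is assumed purely so that byte counts equal element counts:

```c
/*
 * Rough sketch of the new approach on a 64-bit host (placeholder names:
 * ctx, base, vd, nbytes, vreg_ofs; SEW=8 assumed so bytes == elements).
 * One 64-bit guest load per 8 bytes of the register, followed by a
 * vstart update, so a fault on a later load leaves vstart pointing at
 * the first uncommitted element.
 */
TCGv_i64 t0 = tcg_temp_new_i64();
TCGv addr = tcg_temp_new();

tcg_gen_mov_tl(addr, base);                  /* guest base address */
for (int i = 0; i < nbytes; i += 8) {
    tcg_gen_qemu_ld_i64(t0, addr, ctx->mem_idx, MO_LE | MO_UQ);
    tcg_gen_st_i64(t0, tcg_env, vreg_ofs(ctx, vd) + i);

    /* mark the elements just written as committed */
    tcg_gen_st_tl(tcg_constant_tl(i + 8), tcg_env,
                  offsetof(CPURISCVState, vstart));

    tcg_gen_addi_tl(addr, addr, 8);          /* advance the guest address */
}
```

On a 32-bit host the same loop would step by 4 bytes and use 32-bit loads/stores, falling back to the helper function when the element size is 64 bits, as described above.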
Generate TCG Ops for vector whole register load/store.
- IN PROGRESS.
- We have updated the patch to address the latest review: https://lists.gnu.org/archive/html/qemu-devel/2024-12/msg05246.html.
- once performance testing shows no regressions, we'll submit the patch upstream.
Improve first-fault handling for vector load/store helper functions.
- IN PROGRESS.
- No new work to report this week.
Improve strided load/store helper functions.
- IN PROGRESS.
- No new work to report this week.
There is no new general benchmark run this week.
See the results for the new patch here.
We ran the new, more extensive version of the memcpy benchmark and saw that the new patch is overall slightly less efficient, in particular for VLEN=1024, where very large data sizes show a performance regression. This could be due to the overhead of the new, more complex TCG generation loop, which builds up over many iterations; the iterations with larger vector registers and groupings seem to be affected the most. Because of the new shape of the benchmark, for large data sizes the iterations with NF=8 dominate most of the benchmark execution, which explains why the regression in the largest cases shows up so strongly in the overall result. We are running benchmarks individually on different combinations of VLEN and NF in order to better identify the bottlenecks and possibly optimize the implementation of the loop.
Next meeting 15 January 2025.