
Update memcpy-advsimd.S to preload the src into L1 at beginning of the function #61

Open
wants to merge 1 commit into master

Conversation

AmaxGuan commented Dec 14, 2023


I ran the bionic benchmark on Android on a Cortex-A78C CPU. With the preload instruction, memcpy performed 7.5% better on the 16-byte and 32-byte benchmark tests without affecting the performance of any other benchmark.
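For illustration, the effect of the patch can be sketched in C with the compiler's prefetch builtin, a rough analogue of a `prfm pldl1keep, [src]` at the top of memcpy (this is only a sketch of the idea, not the actual change to memcpy-advsimd.S):

```c
#include <stddef.h>
#include <string.h>

/* Sketch only: issue a prefetch for the source cache line before copying,
 * roughly what a `prfm pldl1keep, [src]` does at function entry.
 * __builtin_prefetch(addr, rw, locality): rw=0 means read,
 * locality=3 means high temporal locality (keep in L1). */
static void copy_with_prefetch(void *dst, const void *src, size_t n)
{
    __builtin_prefetch(src, 0, 3);
    memcpy(dst, src, n);
}
```

Whether the prefetch helps depends on the core; as discussed below, on some cores the measured gain comes from code alignment rather than the prefetch itself.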

Wilco1 (Contributor) commented Dec 14, 2023

I tried your patch on Neoverse V1, and I get a 2% gain on the random benchmark. However, I get the same performance if I use a nop instead of prfm, so the effect is due to code alignment. Can you try this on your CPU? Also, what does the benchmark actually do? Does it repeat the same copy multiple times, or vary randomly like in the AOR random benchmark?

AmaxGuan (Author) commented Dec 14, 2023

> Also what does the benchmark actually do?

Given a size to memcpy, the benchmark just repeats memcpy on the same src and dst addresses as many times as it can.
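As a sketch (my reconstruction of the loop described above, not the actual bionic benchmark code), the measurement loop looks roughly like:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the throughput-style loop described above:
 * the same copy, with the same src and dst addresses, repeated back
 * to back. A real harness would time this and divide by `done`. */
static long run_fixed_copy(char *dst, const char *src, size_t size, long iters)
{
    long done = 0;
    for (long i = 0; i < iters; i++) {
        memcpy(dst, src, size);
        done++;
    }
    return done;
}
```

Because the addresses never change, this kind of loop keeps everything hot in cache and trains the branch predictor perfectly, unlike a randomized-size benchmark.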

How do I run the AOR random benchmark?

AmaxGuan (Author)

I also tried using a nop instruction instead, and it gives the same benefit, so it could be just code alignment. Could you explain more about this, though?

  1. Why doesn't the prfm work as intended?
  2. How does the alignment allow it to be faster?

AmaxGuan (Author)

Also, could you recommend a better way to write this diff, since we know a nop also works?

AmaxGuan (Author) commented Dec 18, 2023

How do we move forward on this patch? I'm totally okay with dropping it if we have better alternatives.

The reason I started looking into optimizing the memcpy implementation for small copies is that on our VR devices the memcpy pattern is dominated by small sizes (millions of calls per second). So the performance of small-size requests matters more than that of large-size requests (where the __memcpy_aarch64_simd version is around 30% faster than the previous bionic implementation that doesn't use ASIMD). However, for small requests, specifically 32-64 bytes, a previous bionic implementation seems to perform better.

https://android.googlesource.com/platform/bionic/+/d5ac40cc9fda63ad01bb2925833ddbbc176cd0af/libc/arch-arm64/generic/bionic/memcpy.S

# bpftrace -e 'uprobe:/system/lib64/libc.so:__memcpy_aarch64_simd { @result = hist(arg2); } interval:s:10 { exit(); }'                                                                                                                                            
Attaching 2 probes...

@result: 
[0]                27702 |                                                    |
[1]              2191144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)            314015 |@@@@@@@                                             |
[4, 8)            864856 |@@@@@@@@@@@@@@@@@@@@                                |
[8, 16)          1855867 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[16, 32)          703316 |@@@@@@@@@@@@@@@@                                    |
[32, 64)          296883 |@@@@@@@                                             |
[64, 128)        1615851 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              |
[128, 256)        542010 |@@@@@@@@@@@@                                        |
[256, 512)        401805 |@@@@@@@@@                                           |
[512, 1K)         298042 |@@@@@@@                                             |
[1K, 2K)           73294 |@                                                   |
[2K, 4K)           36976 |                                                    |
[4K, 8K)           15529 |                                                    |
[8K, 16K)           2267 |                                                    |
[16K, 32K)          6873 |                                                    |
[32K, 64K)          1607 |                                                    |
[64K, 128K)         1161 |                                                    |
[128K, 256K)         217 |                                                    |
[256K, 512K)         154 |                                                    |

Wilco1 (Contributor) commented Dec 19, 2023

Software prefetching is not generally beneficial because hardware prefetching works well on modern cores. And prefetching just a few instructions early doesn't help. Code alignment matters because of instruction fetch boundaries and branch prediction, which is why functions and loops are aligned.

Unfortunately the nop ends up slower on Neoverse N1. I'm reordering the initial instructions a bit more and moving the nop down further. I think it is feasible to still get the gain on Neoverse V1 (and similar cores) while keeping performance the same on older cores.

Your histogram shows a lot of really tiny copies - particularly size 1 looks odd. Typically larger sizes like 8, 16, 32 are the most common.

Wilco1 (Contributor) commented Dec 22, 2023

Could you try the latest version and check whether that works for you? This showed good gains on Neoverse V1, while it didn't slow down Neoverse N1 or Cortex-A72.

AmaxGuan (Author)

> Could you try the latest version and check whether that works for you? This showed good gains on Neoverse V1, while it didn't slow down Neoverse N1 or Cortex-A72.

Thanks! I’ll give it a try after the holidays! Happy holidays!

AmaxGuan (Author)

I've tested the latest version on Cortex-A78C; it seems to cause a regression when the size to move is 8, 16, or 32 bytes, while on larger memcpy sizes it shows improvements.

Bionic memcpy benchmark (modified to improve stability) with latest version:

--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
BM_string_memcpy/8/0/0             292 ns          291 ns    113072109 bytes_per_second=26.2312M/s
BM_string_memcpy/16/0/0            202 ns          201 ns    136270897 bytes_per_second=75.8732M/s
BM_string_memcpy/32/0/0            203 ns          202 ns    137812740 bytes_per_second=150.914M/s
BM_string_memcpy/64/0/0            395 ns          394 ns     71358922 bytes_per_second=154.934M/s
BM_string_memcpy/512/0/0          1288 ns         1284 ns     21768833 bytes_per_second=380.291M/s
BM_string_memcpy/1024/0/0         2634 ns         2625 ns     10472358 bytes_per_second=371.973M/s
BM_string_memcpy/8192/0/0        18780 ns        18720 ns      1494600 bytes_per_second=417.328M/s
BM_string_memcpy/16384/0/0       34711 ns        34600 ns       810460 bytes_per_second=451.594M/s
BM_string_memcpy/32768/0/0       76747 ns        76498 ns       365951 bytes_per_second=408.51M/s
BM_string_memcpy/65536/0/0      234575 ns       233792 ns       120465 bytes_per_second=267.331M/s
BM_string_memcpy/131072/0/0     466876 ns       465276 ns        60670 bytes_per_second=268.658M/s

Bionic memcpy benchmark with previous version which only nop instruction were added:

Running /data/benchmarktest64/bionic-benchmarks-static/bionic-benchmarks-static
Run on (6 X 2054.4 MHz CPU s)
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
BM_string_memcpy/8/0/0             246 ns          245 ns    106464651 bytes_per_second=31.1043M/s
BM_string_memcpy/16/0/0            187 ns          186 ns    149909316 bytes_per_second=82.0805M/s
BM_string_memcpy/32/0/0            190 ns          189 ns    149436759 bytes_per_second=161.061M/s
BM_string_memcpy/64/0/0            395 ns          394 ns     71046449 bytes_per_second=154.89M/s
BM_string_memcpy/512/0/0          1296 ns         1292 ns     21699527 bytes_per_second=377.981M/s
BM_string_memcpy/1024/0/0         2653 ns         2644 ns     10582461 bytes_per_second=369.373M/s
BM_string_memcpy/8192/0/0        19079 ns        19016 ns      1469810 bytes_per_second=410.83M/s
BM_string_memcpy/16384/0/0       34767 ns        34652 ns       805790 bytes_per_second=450.912M/s
BM_string_memcpy/32768/0/0       76504 ns        76252 ns       367412 bytes_per_second=409.826M/s
BM_string_memcpy/65536/0/0      234766 ns       233970 ns       120177 bytes_per_second=267.128M/s
BM_string_memcpy/131072/0/0     465222 ns       463616 ns        60288 bytes_per_second=269.62M/s

AmaxGuan (Author)

In case you were wondering why it takes so long to run memcpy with size 8: I repeated memcpy 100 times within the same loop. Google Benchmark does quite a lot of work in each loop iteration, so it's questionable whether we are benchmarking the cost of the loop condition or the function under test when the call takes only a few nanoseconds to run.

AmaxGuan (Author)

How does the AOR random benchmark work? Does it favor larger sizes (> 128 bytes) over smaller ones?

Wilco1 (Contributor) commented Dec 28, 2023

The random benchmark uses the distribution of copy sizes and alignment from SPEC2017. It follows the standard distribution of copy sizes, so small sizes are much more frequent than larger sizes. Average size is around 24 bytes IIRC. It runs a large set of randomized copies on blocks of different sizes so that it exercises caches as well as branch predictors.
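As a rough illustration of a size mix skewed toward small copies (my own sketch with a made-up geometric distribution, not the SPEC2017-derived tables the AOR benchmark actually uses):

```c
#include <stddef.h>

/* Deterministic LCG so the draws are reproducible. */
static unsigned next_rand(unsigned *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Hypothetical sketch: draw copy sizes geometrically, so each extra
 * doubling happens with probability 1/2 and tiny copies dominate,
 * loosely mimicking a "small sizes are most frequent" distribution. */
static size_t next_copy_size(unsigned *state)
{
    size_t size = 1;
    while (size < 4096 && (next_rand(state) & 1))
        size *= 2;
    return size;
}
```

Running such randomized sizes over many different blocks is what exercises the caches and branch predictors, unlike a fixed-size, fixed-address loop.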

Once you've set up your config.mk with a compiler, all you need to do is make bench-string. What I typically do is compare different variants by adding e.g. F(__memcpy_aarch64_old/new) entries to string/bench/memcpy.c, and then repeating the functions of interest a few times.

AmaxGuan (Author)

I've enabled the AOR benchmark build on Android and run the memcpy benchmark on a Cortex-A78C. However, my environment has some background workloads, such as timer interrupts, which introduce noise that makes it hard to detect a 1-2% improvement.
