
Update memcpy-advsimd.S to preload the src into L1 at beginning of the function #61

Open
wants to merge 1 commit into master

Conversation

AmaxGuan commented Dec 14, 2023


I ran the bionic benchmark on Android on a Cortex-A78C CPU. With the preload instruction, memcpy performed 7.5% better on the 16-byte and 32-byte benchmark tests without affecting the performance of any other benchmark.
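For illustration, the effect of the patch can be sketched in C with the compiler's prefetch builtin, a rough analogue of a `prfm pldl1keep, [src]` at the top of memcpy (this is only a sketch of the idea, not the actual change to memcpy-advsimd.S):

```c
#include <stddef.h>
#include <string.h>

/* Sketch only: issue a prefetch for the source cache line before copying,
 * roughly what a `prfm pldl1keep, [src]` does at function entry.
 * __builtin_prefetch(addr, rw, locality): rw=0 means read,
 * locality=3 means high temporal locality (keep in L1). */
static void copy_with_prefetch(void *dst, const void *src, size_t n)
{
    __builtin_prefetch(src, 0, 3);
    memcpy(dst, src, n);
}
```

Whether the prefetch helps depends on the core; as discussed below, on some cores the measured gain comes from code alignment rather than the prefetch itself.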

Wilco1 (Contributor) commented Dec 14, 2023

I tried your patch on Neoverse V1, and I get a 2% gain on the random benchmark. However, I get the same performance if I use a nop instead of prfm, so the effect is due to code alignment. Can you try this on your CPU? Also, what does the benchmark actually do? Does it repeat the same copy multiple times, or vary randomly like in the AOR random benchmark?

AmaxGuan (Author) commented Dec 14, 2023

> Also what does the benchmark actually do?

Given a size to memcpy, the benchmark just repeats memcpy on the same src and dst addresses as many times as it can.
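As a sketch (my reconstruction of the loop described above, not the actual bionic benchmark code), the measurement loop looks roughly like:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the throughput-style loop described above:
 * the same copy, with the same src and dst addresses, repeated back
 * to back. A real harness would time this and divide by `done`. */
static long run_fixed_copy(char *dst, const char *src, size_t size, long iters)
{
    long done = 0;
    for (long i = 0; i < iters; i++) {
        memcpy(dst, src, size);
        done++;
    }
    return done;
}
```

Because the addresses never change, this kind of loop keeps everything hot in cache and trains the branch predictor perfectly, unlike a randomized-size benchmark.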

How do I run the AOR random benchmark?

AmaxGuan (Author)

I also tried using a nop instruction instead, and it gives the same benefit, so it could be just code alignment. Could you explain more about this, though?

  1. Why doesn't the prfm work as intended?
  2. How does the alignment allow it to be faster?

AmaxGuan (Author)

Also, could you recommend a better way to write this diff, since we know a nop also works?

AmaxGuan (Author) commented Dec 18, 2023

How do we move forward on this patch? I'm totally okay with dropping it if we have better alternatives.

The reason I started looking into optimizing the memcpy implementation for small copies is that on our VR devices the memcpy pattern is dominated by small sizes (millions of calls per second). So the performance of small-size requests matters more than that of large-size requests (where the __memcpy_aarch64_simd version is around 30% faster than the previous bionic implementation that doesn't use ASIMD). However, for small requests, specifically 32-64 bytes, a previous bionic implementation seems to perform better.

https://android.googlesource.com/platform/bionic/+/d5ac40cc9fda63ad01bb2925833ddbbc176cd0af/libc/arch-arm64/generic/bionic/memcpy.S

# bpftrace -e 'uprobe:/system/lib64/libc.so:__memcpy_aarch64_simd { @result = hist(arg2); } interval:s:10 { exit(); }'                                                                                                                                            
Attaching 2 probes...

@result: 
[0]                27702 |                                                    |
[1]              2191144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)            314015 |@@@@@@@                                             |
[4, 8)            864856 |@@@@@@@@@@@@@@@@@@@@                                |
[8, 16)          1855867 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[16, 32)          703316 |@@@@@@@@@@@@@@@@                                    |
[32, 64)          296883 |@@@@@@@                                             |
[64, 128)        1615851 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              |
[128, 256)        542010 |@@@@@@@@@@@@                                        |
[256, 512)        401805 |@@@@@@@@@                                           |
[512, 1K)         298042 |@@@@@@@                                             |
[1K, 2K)           73294 |@                                                   |
[2K, 4K)           36976 |                                                    |
[4K, 8K)           15529 |                                                    |
[8K, 16K)           2267 |                                                    |
[16K, 32K)          6873 |                                                    |
[32K, 64K)          1607 |                                                    |
[64K, 128K)         1161 |                                                    |
[128K, 256K)         217 |                                                    |
[256K, 512K)         154 |                                                    |

Wilco1 (Contributor) commented Dec 19, 2023

Software prefetching is not generally beneficial because hardware prefetching works well on modern cores. And prefetching just a few instructions early doesn't help. Code alignment matters because of instruction fetch boundaries and branch prediction, which is why functions and loops are aligned.

Unfortunately the nop ends up slower on Neoverse N1. I'm reordering the initial instructions a bit more and moving the nop down further. I think it is feasible to still get the gain on Neoverse V1 (and similar cores) while keeping performance the same on older cores.

Your histogram shows a lot of really tiny copies - particularly size 1 looks odd. Typically larger sizes like 8, 16, 32 are the most common.

Wilco1 (Contributor) commented Dec 22, 2023

Could you try the latest version and check whether that works for you? This showed good gains on Neoverse V1, while it didn't slow down Neoverse N1 or Cortex-A72.

AmaxGuan (Author)

> Could you try the latest version and check whether that works for you? This showed good gains on Neoverse V1, while it didn't slow down Neoverse N1 or Cortex-A72.

Thanks! I’ll give it a try after the holidays! Happy holidays!

AmaxGuan (Author)

I've tested the latest version on Cortex-A78C; it seems to cause a regression when the size to move is 8, 16, or 32 bytes, while on larger memcpy sizes it shows improvements.

Bionic memcpy benchmark (modified to improve stability) with latest version:

--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
BM_string_memcpy/8/0/0             292 ns          291 ns    113072109 bytes_per_second=26.2312M/s
BM_string_memcpy/16/0/0            202 ns          201 ns    136270897 bytes_per_second=75.8732M/s
BM_string_memcpy/32/0/0            203 ns          202 ns    137812740 bytes_per_second=150.914M/s
BM_string_memcpy/64/0/0            395 ns          394 ns     71358922 bytes_per_second=154.934M/s
BM_string_memcpy/512/0/0          1288 ns         1284 ns     21768833 bytes_per_second=380.291M/s
BM_string_memcpy/1024/0/0         2634 ns         2625 ns     10472358 bytes_per_second=371.973M/s
BM_string_memcpy/8192/0/0        18780 ns        18720 ns      1494600 bytes_per_second=417.328M/s
BM_string_memcpy/16384/0/0       34711 ns        34600 ns       810460 bytes_per_second=451.594M/s
BM_string_memcpy/32768/0/0       76747 ns        76498 ns       365951 bytes_per_second=408.51M/s
BM_string_memcpy/65536/0/0      234575 ns       233792 ns       120465 bytes_per_second=267.331M/s
BM_string_memcpy/131072/0/0     466876 ns       465276 ns        60670 bytes_per_second=268.658M/s

Bionic memcpy benchmark with previous version which only nop instruction were added:

Running /data/benchmarktest64/bionic-benchmarks-static/bionic-benchmarks-static
Run on (6 X 2054.4 MHz CPU s)
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
BM_string_memcpy/8/0/0             246 ns          245 ns    106464651 bytes_per_second=31.1043M/s
BM_string_memcpy/16/0/0            187 ns          186 ns    149909316 bytes_per_second=82.0805M/s
BM_string_memcpy/32/0/0            190 ns          189 ns    149436759 bytes_per_second=161.061M/s
BM_string_memcpy/64/0/0            395 ns          394 ns     71046449 bytes_per_second=154.89M/s
BM_string_memcpy/512/0/0          1296 ns         1292 ns     21699527 bytes_per_second=377.981M/s
BM_string_memcpy/1024/0/0         2653 ns         2644 ns     10582461 bytes_per_second=369.373M/s
BM_string_memcpy/8192/0/0        19079 ns        19016 ns      1469810 bytes_per_second=410.83M/s
BM_string_memcpy/16384/0/0       34767 ns        34652 ns       805790 bytes_per_second=450.912M/s
BM_string_memcpy/32768/0/0       76504 ns        76252 ns       367412 bytes_per_second=409.826M/s
BM_string_memcpy/65536/0/0      234766 ns       233970 ns       120177 bytes_per_second=267.128M/s
BM_string_memcpy/131072/0/0     465222 ns       463616 ns        60288 bytes_per_second=269.62M/s

AmaxGuan (Author)

In case you were wondering why it takes so long to run memcpy with size 8: I repeated memcpy 100 times within the same loop. Google Benchmark does quite a lot of work in each loop iteration, so it's questionable whether we are benchmarking the cost of the loop condition or the function under test when the call takes only a few nanoseconds to run.

AmaxGuan (Author)

How does the AOR random benchmark work? Does it favor larger sizes (> 128 bytes) over smaller ones?

Wilco1 (Contributor) commented Dec 28, 2023

The random benchmark uses the distribution of copy sizes and alignment from SPEC2017. It follows the standard distribution of copy sizes, so small sizes are much more frequent than larger sizes. Average size is around 24 bytes IIRC. It runs a large set of randomized copies on blocks of different sizes so that it exercises caches as well as branch predictors.
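As a rough illustration of a size mix skewed toward small copies (my own sketch with a made-up geometric distribution, not the SPEC2017-derived tables the AOR benchmark actually uses):

```c
#include <stddef.h>

/* Deterministic LCG so the draws are reproducible. */
static unsigned next_rand(unsigned *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Hypothetical sketch: draw copy sizes geometrically, so each extra
 * doubling happens with probability 1/2 and tiny copies dominate,
 * loosely mimicking a "small sizes are most frequent" distribution. */
static size_t next_copy_size(unsigned *state)
{
    size_t size = 1;
    while (size < 4096 && (next_rand(state) & 1))
        size *= 2;
    return size;
}
```

Running such randomized sizes over many different blocks is what exercises the caches and branch predictors, unlike a fixed-size, fixed-address loop.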

Once you've set up your config.mk with a compiler, all you need to do is make bench-string. What I typically do is compare different variants by adding e.g. F(__memcpy_aarch64_old/new) entries to string/bench/memcpy.c, and then repeating the functions of interest a few times.

AmaxGuan (Author)

I've enabled the AOR benchmark build on Android and run the memcpy benchmark on a Cortex-A78C. However, my environment has some background workloads, such as timer interrupts, which introduce noise that makes it hard to detect a 1-2% improvement.
