crypto/tls: linux/arm64 Go 1.8 performance is slow, max 12.5 MB/sec #19840
Comments
It only uses 1 core per connection, FWIW, so the 96 core part is not related.
Thanks @bradfitz for the 1 core issue. Still, these are slower single cores than I'd expect at 2 GHz, and the resulting performance hit causes tests in
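To illustrate the point about one core per connection, here is a minimal sketch (not code from this issue) in which each goroutine stands in for one connection and runs its own AES-GCM sealing loop. The names `sealWorker` and `sealParallel`, the all-zero key, and the record size are assumptions chosen for illustration. A single worker can occupy at most one core, so aggregate throughput only grows when more workers (connections) run.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"runtime"
	"sync"
	"time"
)

// sealWorker stands in for one connection (hypothetical helper): it seals
// a size-byte record iters times on the calling goroutine and returns the
// number of plaintext bytes processed.
func sealWorker(size, iters int) (int64, error) {
	// All-zero key and fixed nonce: acceptable only for a timing loop.
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return 0, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return 0, err
	}
	nonce := make([]byte, aead.NonceSize())
	plaintext := make([]byte, size)
	out := make([]byte, 0, size+aead.Overhead())
	var n int64
	for i := 0; i < iters; i++ {
		out = aead.Seal(out[:0], nonce, plaintext, nil)
		n += int64(size)
	}
	return n, nil
}

// sealParallel runs one sealWorker per goroutine and returns the total
// number of bytes sealed across all workers.
func sealParallel(workers, size, iters int) (int64, error) {
	var wg sync.WaitGroup
	totals := make([]int64, workers)
	errs := make([]error, workers)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			totals[w], errs[w] = sealWorker(size, iters)
		}(w)
	}
	wg.Wait()
	var sum int64
	for w := range totals {
		if errs[w] != nil {
			return 0, errs[w]
		}
		sum += totals[w]
	}
	return sum, nil
}

func main() {
	for _, workers := range []int{1, runtime.NumCPU()} {
		start := time.Now()
		total, err := sealParallel(workers, 16384, 2000)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		secs := time.Since(start).Seconds()
		fmt.Printf("%d worker(s): %.1f MB/s aggregate\n", workers, float64(total)/secs/1e6)
	}
}
```

On a many-core machine the aggregate figure should rise with the worker count while the single-worker figure stays bounded by one core, which is the figure that matters for a single TLS connection.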
There are no arm64 assembly implementations of any of the crypto stuff. amd64 has:
There are some for 32-bit ARM, and some for s390x and ppc64, but nothing for arm64.
A related issue is minio/sha256-simd#7. Based on that, I'll alert @williamweixiao @fwessels @Gaillard @ncw and @harshavardhana to this report.
The poor performance doesn't surprise me, since crypto has not been hardware-accelerated for arm64. We plan to optimize AES and others this year.
Following up on this for @williamweixiao - is there a project schedule or an open, trackable issue for crypto improvements in Go for AES, and specifically for this library?
https://golang.org/wiki/Go-Release-Cycle documents the project schedule. This would need to happen in the first three (ideally two) months of the development window.
We plan to optimize AES for arm64 in Go 1.10. But the AES work and the other optimizations are uncertain if our patch (CL41654), a Go syntax extension for SIMD, can't be merged as soon as the tree opens.
For reference, CL41654 can be read here:
It looks like CL41654 has been merged. @williamweixiao - should this be good for AES support as mentioned in CL64490?
Yes, CL41654 is the base for other crypto optimisations such as CL64490, CL61550 and CL61570. As for TLS, we are also optimising AES-GCM, which will be submitted soon. But upstream seems to be busy with other things, and we may need to raise the priority of these CLs for review and merging.
Upstream is always busy! I would really like to see these merged so that we can get benchmarks for Go 1.10 that are better than Go 1.9 for TLS, since there's so much code that depends on fast TLS for performance.
In https://blog.cloudflare.com/arm-takes-wing/ there are a number of benchmarks with poor results for Go on Arm compared to Go on Intel. From an issue tracking point of view, I think it makes sense to open individual issue reports for each of them, rather than overloading this one.
@vielmetti @cherrymui @ianlancetaylor @vkrasnov Sorry for the late response; I have been on sick leave recently. Engineers from Cloudflare and Arm will cooperate on fixing these issues.
Change https://golang.org/cl/78935 mentions this issue:
A report of work done to address this:
Change https://golang.org/cl/107298 mentions this issue:
Using go1.11beta1:
Faster results throughout. Device under test is a Packet c1.large.arm "Type 2A" Cavium ThunderX.
Some perf results for nats are noted at nats-io/nats-server#695, showing positive results on their benchmark suite with Go 1.11beta1.
Use the dedicated AES* and PMULL* instructions to accelerate AES-GCM

name              old time/op    new time/op    delta
AESGCMSeal1K-46   12.1µs ± 0%    0.9µs ± 0%     -92.66%  (p=0.000 n=9+10)
AESGCMOpen1K-46   12.1µs ± 0%    0.9µs ± 0%     -92.43%  (p=0.000 n=10+10)
AESGCMSign8K-46   58.6µs ± 0%    2.1µs ± 0%     -96.41%  (p=0.000 n=9+8)
AESGCMSeal8K-46   92.8µs ± 0%    5.7µs ± 0%     -93.86%  (p=0.000 n=9+9)
AESGCMOpen8K-46   92.9µs ± 0%    5.7µs ± 0%     -93.84%  (p=0.000 n=8+9)

name              old speed      new speed       delta
AESGCMSeal1K-46   84.7MB/s ± 0%  1153.4MB/s ± 0%  +1262.21%  (p=0.000 n=9+10)
AESGCMOpen1K-46   84.4MB/s ± 0%  1115.2MB/s ± 0%  +1220.53%  (p=0.000 n=10+10)
AESGCMSign8K-46   140MB/s ± 0%   3894MB/s ± 0%    +2687.50%  (p=0.000 n=9+10)
AESGCMSeal8K-46   88.2MB/s ± 0%  1437.5MB/s ± 0%  +1529.30%  (p=0.000 n=9+9)
AESGCMOpen8K-46   88.2MB/s ± 0%  1430.5MB/s ± 0%  +1522.01%  (p=0.000 n=8+9)

This change mirrors the current amd64 implementation, and provides optimal performance on a range of arm64 processors including Centriq 2400 and Apple A12. By and large it is implicitly tested by the robustness of the already existing amd64 implementation.

The implementation interleaves GHASH with CTR mode to achieve the highest possible throughput; it also aggregates GHASH with a factor of 8, to decrease the cost of the reduction step.

Even though there is a significant amount of assembly, the code reuses the Go code for the amd64 implementation, so there is little additional Go code.

Since AES-GCM is critical for performance of all web servers, this change is required to level the playing field for arm64 CPUs, where amd64 currently enjoys an unfair advantage.

Ideally both amd64 and arm64 codepaths could be replaced by hypothetical AES and CLMUL intrinsics, with a few additional vector instructions.

Fixes golang#18498
Fixes golang#19840

Change-Id: Icc57b868cd1f67ac695c1ac163a8e215f74c7910
Reviewed-on: https://go-review.googlesource.com/107298
Run-TryBot: Vlad Krasnov <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
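For anyone who wants a rough sanity check of AES-GCM throughput on their own arm64 or amd64 machine, the following is a minimal sketch using only the standard `crypto/aes` and `crypto/cipher` packages. It is not the benchmark used in the commit message above; the function name `sealThroughputMBps`, the iteration count, and the use of AES-128 are assumptions made for illustration, and a wall-clock loop like this is much less rigorous than `go test -bench` with benchstat.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"time"
)

// sealThroughputMBps (hypothetical helper) seals a size-byte buffer iters
// times with AES-128-GCM and returns the observed throughput in MB/s.
func sealThroughputMBps(size, iters int) (float64, error) {
	key := make([]byte, 16)
	if _, err := rand.Read(key); err != nil {
		return 0, err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return 0, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return 0, err
	}

	nonce := make([]byte, aead.NonceSize())
	plaintext := make([]byte, size)
	// Preallocate the output so the loop measures sealing, not allocation.
	out := make([]byte, 0, size+aead.Overhead())

	start := time.Now()
	for i := 0; i < iters; i++ {
		// Reusing a nonce is tolerable only in a timing loop on dummy
		// data; never reuse a nonce with a real key and real messages.
		out = aead.Seal(out[:0], nonce, plaintext, nil)
	}
	elapsed := time.Since(start).Seconds()
	if elapsed <= 0 {
		elapsed = 1e-9 // guard against a coarse clock reporting zero
	}
	return float64(size) * float64(iters) / elapsed / 1e6, nil
}

func main() {
	for _, size := range []int{1024, 8192} {
		mbps, err := sealThroughputMBps(size, 20000)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("AES-128-GCM Seal %5d bytes: %.1f MB/s\n", size, mbps)
	}
}
```

Running the same program under a Go release before and after the arm64 assembly landed should show a gap of roughly the order reported above, though absolute numbers will depend on the CPU.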
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version go1.8 linux/arm64
What operating system and processor architecture are you using (go env)?
96-core Cavium ThunderX, Packet type "2A" server
What did you do?
What did you expect to see?
TLS performance on my 96-core ARMv8 server faster than my laptop.
What did you see instead?