crypto/tls: linux/arm64 Go 1.8 performance is slow, max 12.5 MB/sec #19840

Closed
vielmetti opened this issue Apr 4, 2017 · 19 comments

Comments

@vielmetti

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.8 linux/arm64

What operating system and processor architecture are you using (go env)?

96-core Cavium ThunderX, Packet type "2A" server

root@docker-build-test:~# go env
GOARCH="arm64"
GOBIN=""
GOEXE=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root"
GORACE=""
GOROOT="/usr/lib/go-1.8"
GOTOOLDIR="/usr/lib/go-1.8/pkg/tool/linux_arm64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build192910166=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

root@docker-build-test:~# go test crypto/tls -bench BenchmarkThroughput

What did you expect to see?

TLS performance on my 96-core ARMv8 server faster than my laptop.

What did you see instead?

root@docker-build-test:~# go test crypto/tls -bench BenchmarkThroughput

BenchmarkThroughput/MaxPacket/1MB-96                  10         128291858 ns/op    8.17 MB/s
BenchmarkThroughput/MaxPacket/2MB-96                   5         211866625 ns/op    9.90 MB/s
BenchmarkThroughput/MaxPacket/4MB-96                   3         378852259 ns/op   11.07 MB/s
BenchmarkThroughput/MaxPacket/8MB-96                   2         715603298 ns/op   11.72 MB/s
BenchmarkThroughput/MaxPacket/16MB-96                  1        1387017225 ns/op   12.10 MB/s
BenchmarkThroughput/MaxPacket/32MB-96                  1        2713806130 ns/op   12.36 MB/s
BenchmarkThroughput/MaxPacket/64MB-96                  1        5402023727 ns/op   12.42 MB/s
BenchmarkThroughput/DynamicPacket/1MB-96              10         128462369 ns/op    8.16 MB/s
BenchmarkThroughput/DynamicPacket/2MB-96               5         211779553 ns/op    9.90 MB/s
BenchmarkThroughput/DynamicPacket/4MB-96               3         378591737 ns/op   11.08 MB/s
BenchmarkThroughput/DynamicPacket/8MB-96               2         711548140 ns/op   11.79 MB/s
BenchmarkThroughput/DynamicPacket/16MB-96              1        1385720232 ns/op   12.11 MB/s
BenchmarkThroughput/DynamicPacket/32MB-96              1        2711156682 ns/op   12.38 MB/s
BenchmarkThroughput/DynamicPacket/64MB-96              1        5378659024 ns/op   12.48 MB/s
PASS
ok      crypto/tls      36.894s
@bradfitz
Contributor

bradfitz commented Apr 4, 2017

TLS performance on my 96-core

It only uses 1 core per connection, FWIW, so the 96 core part is not related.

@vielmetti
Author

Thanks @bradfitz for pointing out the one-core-per-connection limit. Still, these single cores are slower than I'd expect at 2 GHz, and the resulting performance hit causes tests in gnatsd to fail; see nats-io/nats-server#466.

@bradfitz
Contributor

bradfitz commented Apr 4, 2017

There are no arm64 assembly implementations of any of the crypto stuff.

amd64 has:

./vendor/golang_org/x/crypto/chacha20poly1305/chacha20poly1305_amd64.s
./vendor/golang_org/x/crypto/curve25519/mul_amd64.s
./vendor/golang_org/x/crypto/curve25519/square_amd64.s
./vendor/golang_org/x/crypto/curve25519/cswap_amd64.s
./vendor/golang_org/x/crypto/curve25519/const_amd64.s
./vendor/golang_org/x/crypto/curve25519/ladderstep_amd64.s
./vendor/golang_org/x/crypto/curve25519/freeze_amd64.s
./vendor/golang_org/x/crypto/poly1305/sum_amd64.s
./crypto/aes/gcm_amd64.s
./crypto/aes/asm_amd64.s
./crypto/elliptic/p256_asm_amd64.s

There are some for 32-bit ARM, and some for s390x and ppc64, but nothing for arm64.

@vielmetti
Author

A related issue is minio/sha256-simd#7 and

Based on that I'll alert @williamweixiao @fwessels @Gaillard @ncw and @harshavardhana to this report.

@williamweixiao
Member

The poor performance doesn't surprise me: crypto has not yet been hardware-accelerated on arm64. We plan to optimize AES and other algorithms this year.
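
Hardware acceleration only helps if the CPU exposes the relevant instructions. As a rough sketch, and assuming a Linux arm64 kernel that lists CPU flags on a `Features` line in `/proc/cpuinfo` (with `aes` and `pmull` as the flag names), one could check for them as follows. The function `hasFeatures` is hypothetical, and the check says nothing about whether a given Go release actually uses the instructions. It needs Go 1.18 or later for `strings.Cut`.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasFeatures reports whether every wanted flag appears on the first
// "Features" line of cpuinfo-formatted text. Hypothetical helper.
func hasFeatures(cpuinfo string, wanted ...string) bool {
	for _, line := range strings.Split(cpuinfo, "\n") {
		name, value, ok := strings.Cut(line, ":")
		if !ok || strings.TrimSpace(name) != "Features" {
			continue
		}
		// Collect the space-separated flags into a set.
		present := make(map[string]bool)
		for _, f := range strings.Fields(value) {
			present[f] = true
		}
		for _, w := range wanted {
			if !present[w] {
				return false
			}
		}
		return true
	}
	return false // no Features line found
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	fmt.Println("aes and pmull reported:", hasFeatures(string(data), "aes", "pmull"))
}
```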

@vielmetti
Author

Following up on this for @williamweixiao - is there a project schedule or open trackable issue for crypto improvements on golang for AES and specifically for this library?

@bradfitz
Contributor

bradfitz commented Jun 8, 2017

https://golang.org/wiki/Go-Release-Cycle documents the project schedule. This would need to happen in the first three (ideally two) months of the development window.

@williamweixiao
Member

We plan to optimize AES for arm64 in Go 1.10. However, the AES work and the other optimizations are uncertain if our patch (CL41654), a Go syntax extension for SIMD, can't be merged as soon as the tree opens.

@vielmetti
Author

For reference CL41654 can be read here:

https://go-review.googlesource.com/c/41654/

@vielmetti
Author

It looks like CL41654 has been merged. @williamweixiao - should this be good for the AES support mentioned in CL64490?

@williamweixiao
Member

Yes, CL41654 is the base for other crypto optimisations such as CL64490, CL61550 and CL61570. As for TLS, we are also optimising AES-GCM, and that work will be submitted soon. Upstream seems to be busy with other things, though, so we may need to raise the priority of these CLs for review and merging.

@vielmetti
Author

Upstream is always busy! I would really like to see these merged so that we can get benchmarks for Go 1.10 that are better than Go 1.9 for TLS, since there's so much code that depends on fast TLS for performance.

@vielmetti
Author

In https://blog.cloudflare.com/arm-takes-wing/ there are a number of benchmarks with poor results for Go on Arm compared to Go on Intel.

From an issue tracking point of view, I think it makes sense to open up individual issue reports on each of them, rather than overloading this one.

@williamweixiao
Member

@vielmetti @cherrymui @ianlancetaylor @vkrasnov

Sorry for the late response; I have been on sick leave recently.
Since some of the performance issues mentioned by Cloudflare have been fixed or are being fixed, I have created the following 4 issues to track the most important ones, as confirmed by Vlad from Cloudflare.
#22806
#22807
#22808
#22809

Engineers from Cloudflare and Arm will cooperate on fixing these issues.

@gopherbot
Contributor

Change https://golang.org/cl/78935 mentions this issue: crypto/tls: enable AES-GCM mode

@vielmetti
Author

A report of work done to address this:

https://twitter.com/jgrahamc/status/988812004499607553

@gopherbot
Contributor

Change https://golang.org/cl/107298 mentions this issue: crypto/aes: implement AES-GCM AEAD for arm64

@vielmetti
Author

Using go1.11beta1:

ed@ed-2a-bcc-llvm:~$ ~/go/bin/go1.11beta1 test crypto/tls -bench BenchmarkThroughput
goos: linux
goarch: arm64
pkg: crypto/tls
BenchmarkThroughput/MaxPacket/1MB-96                  20          69582510 ns/op          15.07 MB/s
BenchmarkThroughput/MaxPacket/2MB-96                  10         131829325 ns/op          15.91 MB/s
BenchmarkThroughput/MaxPacket/4MB-96                   5         251071157 ns/op          16.71 MB/s
BenchmarkThroughput/MaxPacket/8MB-96                   3         484649847 ns/op          17.31 MB/s
BenchmarkThroughput/MaxPacket/16MB-96                  2         942398337 ns/op          17.80 MB/s
BenchmarkThroughput/MaxPacket/32MB-96                  1        1876532891 ns/op          17.88 MB/s
BenchmarkThroughput/MaxPacket/64MB-96                  1        3749993113 ns/op          17.90 MB/s
BenchmarkThroughput/DynamicPacket/1MB-96              20          69777398 ns/op          15.03 MB/s
BenchmarkThroughput/DynamicPacket/2MB-96              10         125700576 ns/op          16.68 MB/s
BenchmarkThroughput/DynamicPacket/4MB-96               5         240230680 ns/op          17.46 MB/s
BenchmarkThroughput/DynamicPacket/8MB-96               3         472440932 ns/op          17.76 MB/s
BenchmarkThroughput/DynamicPacket/16MB-96              2         933129898 ns/op          17.98 MB/s
BenchmarkThroughput/DynamicPacket/32MB-96              1        1843988300 ns/op          18.20 MB/s
BenchmarkThroughput/DynamicPacket/64MB-96              1        3614544855 ns/op          18.57 MB/s
PASS
ok      crypto/tls      32.844s

Faster results throughout. The device under test is a Packet c1.large.arm "Type 2A" Cavium ThunderX.
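
To put a number on "faster", the MB/s columns of the two runs can be compared row by row. This is a small sketch, not a replacement for a proper statistical comparison with a tool such as benchstat; `parseMBps` and `speedup` are hypothetical helpers, and the two input lines are copied from the Go 1.8 and go1.11beta1 runs in this thread.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMBps extracts the benchmark name and MB/s value from one line of
// `go test -bench` output. ok is false for lines without an MB/s column.
func parseMBps(line string) (name string, mbps float64, ok bool) {
	f := strings.Fields(line)
	// Expected shape: name, iterations, ns value, "ns/op", MB/s value, "MB/s".
	if len(f) < 6 || f[len(f)-1] != "MB/s" {
		return "", 0, false
	}
	v, err := strconv.ParseFloat(f[len(f)-2], 64)
	if err != nil {
		return "", 0, false
	}
	return f[0], v, true
}

// speedup returns after/before throughput for two lines of the same benchmark.
func speedup(beforeLine, afterLine string) (float64, error) {
	bName, b, ok1 := parseMBps(beforeLine)
	aName, a, ok2 := parseMBps(afterLine)
	if !ok1 || !ok2 {
		return 0, fmt.Errorf("line without an MB/s column")
	}
	if bName != aName {
		return 0, fmt.Errorf("benchmark names differ: %s vs %s", bName, aName)
	}
	if b == 0 {
		return 0, fmt.Errorf("zero baseline throughput")
	}
	return a / b, nil
}

func main() {
	before := "BenchmarkThroughput/MaxPacket/64MB-96  1  5402023727 ns/op  12.42 MB/s"
	after := "BenchmarkThroughput/MaxPacket/64MB-96  1  3749993113 ns/op  17.90 MB/s"
	ratio, err := speedup(before, after)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("MaxPacket/64MB speedup: %.2fx\n", ratio)
}
```

For the 64MB MaxPacket row this works out to about 1.44x.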

@vielmetti
Author

Some perf results for nats noted at nats-io/nats-server#695 showing positive results on their benchmark suite with Go 1.11beta1.

FiloSottile pushed a commit to FiloSottile/go that referenced this issue Oct 12, 2018
Use the dedicated AES* and PMULL* instructions to accelerate AES-GCM

name              old time/op    new time/op      delta
AESGCMSeal1K-46     12.1µs ± 0%       0.9µs ± 0%    -92.66%  (p=0.000 n=9+10)
AESGCMOpen1K-46     12.1µs ± 0%       0.9µs ± 0%    -92.43%  (p=0.000 n=10+10)
AESGCMSign8K-46     58.6µs ± 0%       2.1µs ± 0%    -96.41%  (p=0.000 n=9+8)
AESGCMSeal8K-46     92.8µs ± 0%       5.7µs ± 0%    -93.86%  (p=0.000 n=9+9)
AESGCMOpen8K-46     92.9µs ± 0%       5.7µs ± 0%    -93.84%  (p=0.000 n=8+9)

name              old speed      new speed        delta
AESGCMSeal1K-46   84.7MB/s ± 0%  1153.4MB/s ± 0%  +1262.21%  (p=0.000 n=9+10)
AESGCMOpen1K-46   84.4MB/s ± 0%  1115.2MB/s ± 0%  +1220.53%  (p=0.000 n=10+10)
AESGCMSign8K-46    140MB/s ± 0%    3894MB/s ± 0%  +2687.50%  (p=0.000 n=9+10)
AESGCMSeal8K-46   88.2MB/s ± 0%  1437.5MB/s ± 0%  +1529.30%  (p=0.000 n=9+9)
AESGCMOpen8K-46   88.2MB/s ± 0%  1430.5MB/s ± 0%  +1522.01%  (p=0.000 n=8+9)

This change mirrors the current amd64 implementation, and provides optimal performance
on a range of arm64 processors including Centriq 2400 and Apple A12. By and large it is
implicitly tested by the robustness of the already existing amd64 implementation.

The implementation interleaves GHASH with CTR mode to achieve the highest possible
throughput. It also aggregates GHASH with a factor of 8, to decrease the cost of the
reduction step.

Even though there is a significant amount of assembly, the code reuses the Go
code for the amd64 implementation, so there is little additional Go code.

Since AES-GCM is critical for performance of all web servers, this change is
required to level the playing field for arm64 CPUs, where amd64 currently enjoys an
unfair advantage.

Ideally both amd64 and arm64 codepaths could be replaced by hypothetical AES and
CLMUL intrinsics, with a few additional vector instructions.

Fixes golang#18498
Fixes golang#19840

Change-Id: Icc57b868cd1f67ac695c1ac163a8e215f74c7910
Reviewed-on: https://go-review.googlesource.com/107298
Run-TryBot: Vlad Krasnov <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
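
To see what AES-GCM throughput a particular Go toolchain and machine deliver, the cipher can be timed directly using only the standard library. The sketch below is not the benchmark quoted in the commit message; `benchSeal` is a hypothetical helper, and the all-zero key and fixed nonce are acceptable only because the output is discarded.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"testing"
)

// benchSeal times AES-128-GCM Seal on a buffer of the given size.
// Hypothetical helper for measurement only.
func benchSeal(size int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		key := make([]byte, 16)   // all-zero key: fine for timing, never for real data
		nonce := make([]byte, 12) // fixed nonce: must be unique per message in real use
		block, err := aes.NewCipher(key)
		if err != nil {
			b.Fatal(err)
		}
		aead, err := cipher.NewGCM(block)
		if err != nil {
			b.Fatal(err)
		}
		plaintext := make([]byte, size)
		// Reuse one output buffer so allocation does not dominate the timing.
		out := make([]byte, 0, size+aead.Overhead())
		b.SetBytes(int64(size))
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			out = aead.Seal(out[:0], nonce, plaintext, nil)
		}
	})
}

func main() {
	for _, size := range []int{1024, 8192} {
		r := benchSeal(size)
		// Same convention as `go test -bench`: decimal megabytes per second.
		mbps := float64(r.Bytes) * float64(r.N) / 1e6 / r.T.Seconds()
		fmt.Printf("AES-GCM Seal %d bytes: %.1f MB/s\n", size, mbps)
	}
}
```

Running the same program under two Go releases on the same arm64 machine gives a rough before-and-after comparison; the absolute numbers depend on the CPU and will not match the figures in the commit message.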
@golang golang locked and limited conversation to collaborators Jul 20, 2019