crypto/tls: linux/arm64 Go 1.8 performance is slow, max 12.5 MB/sec #19840

vielmetti · 2017-04-04T18:30:04Z

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (`go version`)?

go version go1.8 linux/arm64

What operating system and processor architecture are you using (`go env`)?

96-core Cavium ThunderX, Packet type "2A" server

root@docker-build-test:~# go env
GOARCH="arm64"
GOBIN=""
GOEXE=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root"
GORACE=""
GOROOT="/usr/lib/go-1.8"
GOTOOLDIR="/usr/lib/go-1.8/pkg/tool/linux_arm64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build192910166=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

root@docker-build-test:~# go test crypto/tls -bench BenchmarkThroughput

What did you expect to see?

TLS performance on my 96-core ARMv8 server faster than my laptop.

What did you see instead?

root@docker-build-test:~# go test crypto/tls -bench BenchmarkThroughput

BenchmarkThroughput/MaxPacket/1MB-96                  10         128291858 ns/op    8.17 MB/s
BenchmarkThroughput/MaxPacket/2MB-96                   5         211866625 ns/op    9.90 MB/s
BenchmarkThroughput/MaxPacket/4MB-96                   3         378852259 ns/op   11.07 MB/s
BenchmarkThroughput/MaxPacket/8MB-96                   2         715603298 ns/op   11.72 MB/s
BenchmarkThroughput/MaxPacket/16MB-96                  1        1387017225 ns/op   12.10 MB/s
BenchmarkThroughput/MaxPacket/32MB-96                  1        2713806130 ns/op   12.36 MB/s
BenchmarkThroughput/MaxPacket/64MB-96                  1        5402023727 ns/op   12.42 MB/s
BenchmarkThroughput/DynamicPacket/1MB-96              10         128462369 ns/op    8.16 MB/s
BenchmarkThroughput/DynamicPacket/2MB-96               5         211779553 ns/op    9.90 MB/s
BenchmarkThroughput/DynamicPacket/4MB-96               3         378591737 ns/op   11.08 MB/s
BenchmarkThroughput/DynamicPacket/8MB-96               2         711548140 ns/op   11.79 MB/s
BenchmarkThroughput/DynamicPacket/16MB-96              1        1385720232 ns/op   12.11 MB/s
BenchmarkThroughput/DynamicPacket/32MB-96              1        2711156682 ns/op   12.38 MB/s
BenchmarkThroughput/DynamicPacket/64MB-96              1        5378659024 ns/op   12.48 MB/s
PASS
ok      crypto/tls      36.894s
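
For reading the table: the MB/s column follows from ns/op and the payload size. The figures above are consistent with each "N MB" case moving N × 2^20 bytes per operation and the rate being reported in units of 10^6 bytes per second. That is inferred from the numbers, not stated in the report. A minimal sketch of the arithmetic, where the helper name `throughputMBps` is made up for illustration:

```go
package main

import "fmt"

// throughputMBps converts a payload size in bytes and a benchmark time in
// ns/op into MB/s, where 1 MB/s means 1e6 bytes per second.
// (Hypothetical helper, not part of crypto/tls or the testing package.)
func throughputMBps(payloadBytes, nsPerOp int64) float64 {
	seconds := float64(nsPerOp) / 1e9
	return float64(payloadBytes) / 1e6 / seconds
}

func main() {
	// MaxPacket/1MB and MaxPacket/64MB rows from the Go 1.8 run above.
	fmt.Printf("%.2f MB/s\n", throughputMBps(1<<20, 128291858))   // prints 8.17 MB/s
	fmt.Printf("%.2f MB/s\n", throughputMBps(64<<20, 5402023727)) // prints 12.42 MB/s
}
```

The rate flattens out near 12.4 MB/s as the payload grows, which is where the "max 12.5 MB/sec" in the title comes from.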


bradfitz · 2017-04-04T18:31:46Z

> TLS performance on my 96-core

It only uses 1 core per connection, FWIW, so the 96 core part is not related.

vielmetti · 2017-04-04T18:34:04Z

Thanks @bradfitz for pointing out the one-core-per-connection behavior. Still, these single cores are slower than I'd expect at 2 GHz, and the resulting performance hit causes tests in gnatsd to fail at nats-io/nats-server#466.

bradfitz · 2017-04-04T18:36:08Z

There are no arm64 assembly implementations of any of the crypto code.

amd64 has:

./vendor/golang_org/x/crypto/chacha20poly1305/chacha20poly1305_amd64.s
./vendor/golang_org/x/crypto/curve25519/mul_amd64.s
./vendor/golang_org/x/crypto/curve25519/square_amd64.s
./vendor/golang_org/x/crypto/curve25519/cswap_amd64.s
./vendor/golang_org/x/crypto/curve25519/const_amd64.s
./vendor/golang_org/x/crypto/curve25519/ladderstep_amd64.s
./vendor/golang_org/x/crypto/curve25519/freeze_amd64.s
./vendor/golang_org/x/crypto/poly1305/sum_amd64.s
./crypto/aes/gcm_amd64.s
./crypto/aes/asm_amd64.s
./crypto/elliptic/p256_asm_amd64.s

There are some for 32-bit ARM, and some for s390x and ppc64, but nothing for arm64.
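
To see what a single core manages for the cipher on its own, outside the TLS stack, a rough timing loop like the one below can be run on both an amd64 and an arm64 machine. This is a sketch, not the `crypto/tls` benchmark: it times AES-128-GCM sealing only (the primitive that the later change in this thread accelerates), and the function name `sealThroughput`, the buffer sizes, and the iteration count are arbitrary choices made for illustration.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"time"
)

// sealThroughput seals a buffer of the given size `iterations` times with
// AES-128-GCM on the calling goroutine and returns the rate in MB/s
// (1 MB = 1e6 bytes). The all-zero key and the reused nonce are for timing
// only and must never be used to protect real data.
func sealThroughput(size, iterations int) (float64, error) {
	block, err := aes.NewCipher(make([]byte, 16))
	if err != nil {
		return 0, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return 0, err
	}
	nonce := make([]byte, gcm.NonceSize())
	plaintext := make([]byte, size)
	// Reuse one output buffer so allocation does not dominate the timing.
	out := make([]byte, 0, size+gcm.Overhead())

	start := time.Now()
	for i := 0; i < iterations; i++ {
		out = gcm.Seal(out[:0], nonce, plaintext, nil)
	}
	elapsed := time.Since(start).Seconds()
	return float64(size) * float64(iterations) / 1e6 / elapsed, nil
}

func main() {
	for _, size := range []int{1024, 8192} {
		mbps, err := sealThroughput(size, 2000)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("AES-128-GCM seal, %d-byte buffers: %.1f MB/s\n", size, mbps)
	}
}
```

The loop runs on one goroutine, so it reflects per-core speed. For stable numbers, the standard library's own benchmarks under `go test -bench` are the better tool; this loop is only a quick cross-check.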

vielmetti · 2017-04-04T18:45:51Z

A related issue is minio/sha256-simd#7 and

Based on that I'll alert @williamweixiao @fwessels @Gaillard @ncw and @harshavardhana to this report.

williamweixiao · 2017-04-06T07:21:26Z

The poor performance doesn't surprise me, since crypto has not yet been hardware-accelerated on arm64. We plan to optimize AES and others this year.

vielmetti · 2017-06-08T14:40:07Z

Following up on this for @williamweixiao - is there a project schedule or open trackable issue for crypto improvements on golang for AES and specifically for this library?

bradfitz · 2017-06-08T15:01:28Z

https://golang.org/wiki/Go-Release-Cycle documents the project schedule. This would need to happen in the first three (ideally two) months of the development window.

williamweixiao · 2017-06-09T04:01:24Z

We plan to optimize AES for arm64 in Go 1.10. However, AES and the other optimizations are uncertain if our patch (CL41654), a Go syntax extension for SIMD, can't be merged as soon as the tree opens.

vielmetti · 2017-06-12T04:07:14Z

For reference CL41654 can be read here:

https://go-review.googlesource.com/c/41654/

vielmetti · 2017-10-31T15:39:20Z

It looks like CL41654 has been merged. @williamweixiao - should this be good for AES support as mentioned in CL64490?

williamweixiao · 2017-11-01T02:11:56Z

Yes, CL41654 is the base for other crypto optimisations such as CL64490, CL61550 and CL61570. As for TLS, we are also optimising AES-GCM, which will be submitted soon. But upstream seems to be busy with other things, and we may need to raise the priority of these CLs for review and merging.

vielmetti · 2017-11-05T22:01:49Z

Upstream is always busy! I would really like to see these merged so that we can get benchmarks for Go 1.10 that are better than Go 1.9 for TLS, since there's so much code that depends on fast TLS for performance.

vielmetti · 2017-11-08T21:35:19Z

In https://blog.cloudflare.com/arm-takes-wing/ there are a number of benchmarks with poor results for Go on Arm compared to Go on Intel.

From an issue tracking point of view, I think it makes sense to open up individual issue reports on each of them, rather than overloading this one.

williamweixiao · 2017-11-19T12:34:43Z

@vielmetti @cherrymui @ianlancetaylor @vkrasnov

Sorry for the late response; I have been on sick leave recently.
Since some of the performance issues mentioned by Cloudflare have been fixed or are being fixed, I have just created the following 4 issues to track the most important ones, as confirmed by Vlad from Cloudflare.
#22806
#22807
#22808
#22809

Engineers from Cloudflare and Arm will cooperate on fixing these issues.

gopherbot · 2017-11-21T04:26:24Z

Change https://golang.org/cl/78935 mentions this issue: crypto/tls: enable AES-GCM mode

vielmetti · 2018-04-24T16:39:09Z

A report of work done to address this:

https://twitter.com/jgrahamc/status/988812004499607553

gopherbot · 2018-06-06T09:25:59Z

Change https://golang.org/cl/107298 mentions this issue: crypto/aes: implement AES-GCM AEAD for arm64

vielmetti · 2018-06-26T22:07:41Z

Using go1.11beta1:

ed@ed-2a-bcc-llvm:~$ ~/go/bin/go1.11beta1 test crypto/tls -bench BenchmarkThroughput
goos: linux
goarch: arm64
pkg: crypto/tls
BenchmarkThroughput/MaxPacket/1MB-96                  20          69582510 ns/op          15.07 MB/s
BenchmarkThroughput/MaxPacket/2MB-96                  10         131829325 ns/op          15.91 MB/s
BenchmarkThroughput/MaxPacket/4MB-96                   5         251071157 ns/op          16.71 MB/s
BenchmarkThroughput/MaxPacket/8MB-96                   3         484649847 ns/op          17.31 MB/s
BenchmarkThroughput/MaxPacket/16MB-96                  2         942398337 ns/op          17.80 MB/s
BenchmarkThroughput/MaxPacket/32MB-96                  1        1876532891 ns/op          17.88 MB/s
BenchmarkThroughput/MaxPacket/64MB-96                  1        3749993113 ns/op          17.90 MB/s
BenchmarkThroughput/DynamicPacket/1MB-96              20          69777398 ns/op          15.03 MB/s
BenchmarkThroughput/DynamicPacket/2MB-96              10         125700576 ns/op          16.68 MB/s
BenchmarkThroughput/DynamicPacket/4MB-96               5         240230680 ns/op          17.46 MB/s
BenchmarkThroughput/DynamicPacket/8MB-96               3         472440932 ns/op          17.76 MB/s
BenchmarkThroughput/DynamicPacket/16MB-96              2         933129898 ns/op          17.98 MB/s
BenchmarkThroughput/DynamicPacket/32MB-96              1        1843988300 ns/op          18.20 MB/s
BenchmarkThroughput/DynamicPacket/64MB-96              1        3614544855 ns/op          18.57 MB/s
PASS
ok      crypto/tls      32.844s

Faster results throughout. Device under test is a Packet c1.large.arm "Type 2A" Cavium ThunderX.
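
To put a number on "faster", the ns/op figures from the Go 1.8 run at the top of the thread can be divided by the go1.11beta1 figures for the same benchmark case. A small sketch of that comparison follows; the helper name `speedup` is made up for illustration, and the inputs are copied from the two runs posted in this thread.

```go
package main

import "fmt"

// speedup returns how many times faster the new run is, given ns/op from
// the old and new runs of the same benchmark case.
// (Hypothetical helper for illustration only.)
func speedup(oldNsPerOp, newNsPerOp int64) float64 {
	return float64(oldNsPerOp) / float64(newNsPerOp)
}

func main() {
	// ns/op from the Go 1.8 run and the go1.11beta1 run in this thread.
	fmt.Printf("MaxPacket/1MB:  %.2fx\n", speedup(128291858, 69582510))    // prints 1.84x
	fmt.Printf("MaxPacket/64MB: %.2fx\n", speedup(5402023727, 3749993113)) // prints 1.44x
}
```

So the end-to-end TLS benchmark improved by roughly 1.4x to 1.8x on this machine between the two runs, depending on payload size.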

vielmetti · 2018-06-29T03:54:12Z

Some perf results for nats noted at nats-io/nats-server#695 showing positive results on their benchmark suite with Go 1.11beta1.

Use the dedicated AES* and PMULL* instructions to accelerate AES-GCM

name             old time/op    new time/op    delta
AESGCMSeal1K-46  12.1µs ± 0%    0.9µs ± 0%     -92.66%  (p=0.000 n=9+10)
AESGCMOpen1K-46  12.1µs ± 0%    0.9µs ± 0%     -92.43%  (p=0.000 n=10+10)
AESGCMSign8K-46  58.6µs ± 0%    2.1µs ± 0%     -96.41%  (p=0.000 n=9+8)
AESGCMSeal8K-46  92.8µs ± 0%    5.7µs ± 0%     -93.86%  (p=0.000 n=9+9)
AESGCMOpen8K-46  92.9µs ± 0%    5.7µs ± 0%     -93.84%  (p=0.000 n=8+9)

name             old speed      new speed        delta
AESGCMSeal1K-46  84.7MB/s ± 0%  1153.4MB/s ± 0%  +1262.21%  (p=0.000 n=9+10)
AESGCMOpen1K-46  84.4MB/s ± 0%  1115.2MB/s ± 0%  +1220.53%  (p=0.000 n=10+10)
AESGCMSign8K-46  140MB/s ± 0%   3894MB/s ± 0%    +2687.50%  (p=0.000 n=9+10)
AESGCMSeal8K-46  88.2MB/s ± 0%  1437.5MB/s ± 0%  +1529.30%  (p=0.000 n=9+9)
AESGCMOpen8K-46  88.2MB/s ± 0%  1430.5MB/s ± 0%  +1522.01%  (p=0.000 n=8+9)

This change mirrors the current amd64 implementation, and provides optimal performance on a range of arm64 processors including Centriq 2400 and Apple A12. By and large it is implicitly tested by the robustness of the already existing amd64 implementation.

The implementation interleaves GHASH with CTR mode to achieve the highest possible throughput; it also aggregates GHASH with a factor of 8, to decrease the cost of the reduction step.

Even though there is a significant amount of assembly, the code reuses the Go code for the amd64 implementation, so there is little additional Go code.

Since AES-GCM is critical for performance of all web servers, this change is required to level the playing field for arm64 CPUs, where amd64 currently enjoys an unfair advantage.

Ideally both amd64 and arm64 codepaths could be replaced by hypothetical AES and CLMUL intrinsics, with a few additional vector instructions.

Fixes golang#18498
Fixes golang#19840

Change-Id: Icc57b868cd1f67ac695c1ac163a8e215f74c7910
Reviewed-on: https://go-review.googlesource.com/107298
Run-TryBot: Vlad Krasnov <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
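
The speedup described in the commit message depends on the CPU offering the AES and PMULL instructions. On Linux/arm64 the kernel lists the corresponding flags (`aes`, `pmull`) on the `Features` line of `/proc/cpuinfo`. A small sketch for checking a given machine is below; the helper name `hasFeatures` is made up for illustration, and the Go standard library performs its own detection independently of this check.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasFeatures reports whether every name in want appears on the first
// "Features" line of the given /proc/cpuinfo text.
// (Hypothetical helper; it only shows what the kernel reports.)
func hasFeatures(cpuinfo string, want ...string) bool {
	for _, line := range strings.Split(cpuinfo, "\n") {
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 || strings.TrimSpace(parts[0]) != "Features" {
			continue
		}
		present := make(map[string]bool)
		for _, f := range strings.Fields(parts[1]) {
			present[f] = true
		}
		for _, w := range want {
			if !present[w] {
				return false
			}
		}
		return true
	}
	// No Features line at all (for example, on a non-arm64 machine).
	return false
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	fmt.Println("aes and pmull present:", hasFeatures(string(data), "aes", "pmull"))
}
```

If the flags are missing, the AES-GCM figures quoted above should not be expected on that machine.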

bradfitz added the Performance label Apr 4, 2017

bradfitz added this to the Unplanned milestone Apr 4, 2017

vielmetti mentioned this issue Apr 4, 2017

Support for linux/arm64 (ARMv8, aarch64) nats-io/nats-server#466

Closed

1 task

kozlovic mentioned this issue Sep 22, 2017

Very high CPU usage during initial TLS setup nats-io/nats-server#589

Closed

bradfitz mentioned this issue Feb 7, 2018

net/http: Slow HTTPS #23727

Closed

vielmetti mentioned this issue Jun 29, 2018

Test Go 1.11beta1 for performance vs Go 1.10.x nats-io/nats-server#695

Closed

1 task

gopherbot closed this as completed in 4f1f503 Jul 20, 2018

elgatito mentioned this issue Dec 17, 2018

Need help with proxy on different platforms elazarl/goproxy#319

Closed

vielmetti mentioned this issue Mar 29, 2019

plugin/forward: maxes out CPU (TLS connection negotiating) coredns/coredns#2624

Closed

golang locked and limited conversation to collaborators Jul 20, 2019

gopherbot added the FrozenDueToAge label Jul 20, 2019
