
netty io_uring was slower than epoll #152

Closed
xbzhang99 opened this issue Mar 23, 2022 · 18 comments

Comments

@xbzhang99

I collected some stats on x86, kernel 5.13, using netty-incubator-iouring 13 with Cassandra 4.0. The network throughput is about 8% lower than with netty epoll, context switches are 69% higher than with epoll, and interrupts are 2% higher. CPI (cycles per instruction) is also 8% higher.

|  | iouring | epoll | diff | actual % | calculated % |
| --- | --- | --- | --- | --- | --- |
| load | 1000 | 1000 |  |  |  |
| throughput | 142868 | 154465 | 1.081 | 8.1 | 11.4 |
| util | 52.0 | 50.3 | 0.967 | -3.3 |  |
| freq | 3.09 | 3.13 | 1.013 | 1.3 |  |
| CPI | 1.33 | 1.22 | 0.915 | 8.5 |  |
| Pathlength (Kinstruction/throughput) | 842.7 | 801.6 | 0.951 | 4.9 |  |

netty io_uring flamegraph
iouring-read-cache-run002_profiler

netty epoll flamegraph
epoll-read-cache-run002_profiler

@franz1981
Contributor

franz1981 commented Mar 23, 2022

Could you please attach the flamegraphs, or send them via email?
I can't navigate them properly to understand what's going on in their current form, i.e. as images.
Looking at the flames from a 100ft perspective, the test doesn't seem I/O bound; in that case the Netty threads (thanks to io_uring) can go to sleep faster (and more frequently), and waking them costs more, causing more context switches etc.
and worse performance overall.

So please share a bit more about the test and the data results.

@xbzhang99
Author

xbzhang99 commented Mar 23, 2022 via email

@xbzhang99
Author

Tests were done using the READ.sh script in Cassandra (cassandra-stress).

IOuring:
./READ.sh 120 50
Results:
Op rate : 188,259 op/s [simple1: 188,259 op/s]
Partition rate : 188,259 pk/s [simple1: 188,259 pk/s]
Row rate : 1,882,971 row/s [simple1: 1,882,971 row/s]
Latency mean : 0.3 ms [simple1: 0.3 ms]
Latency median : 0.2 ms [simple1: 0.2 ms]
Latency 95th percentile : 0.3 ms [simple1: 0.3 ms]
Latency 99th percentile : 0.5 ms [simple1: 0.5 ms]
Latency 99.9th percentile : 1.0 ms [simple1: 1.0 ms]
Latency max : 55.6 ms [simple1: 55.6 ms]
Total partitions : 22,778,788 [simple1: 22,778,788]
Total errors : 0 [simple1: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:02:00

Epoll:
./READ.sh 120 50
Op rate : 204,277 op/s [simple1: 204,277 op/s]
Partition rate : 204,277 pk/s [simple1: 204,277 pk/s]
Row rate : 2,042,924 row/s [simple1: 2,042,924 row/s]
Latency mean : 0.2 ms [simple1: 0.2 ms]
Latency median : 0.2 ms [simple1: 0.2 ms]
Latency 95th percentile : 0.3 ms [simple1: 0.3 ms]
Latency 99th percentile : 0.3 ms [simple1: 0.3 ms]
Latency 99.9th percentile : 0.9 ms [simple1: 0.9 ms]
Latency max : 55.5 ms [simple1: 55.5 ms]
Total partitions : 24,601,614 [simple1: 24,601,614]
Total errors : 0 [simple1: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:02:00

@franz1981
Contributor

franz1981 commented Mar 23, 2022

Thanks @xbzhang99, but could you still attach the flamegraphs you've sent somewhere? Please also share details of the machines (not just the OS), the JVM, etc.

@xbzhang99
Author

It was JDK 14. Sorry, I can't seem to attach the flamegraphs (I already dragged and dropped them above).

@njhill
Member

njhill commented Mar 23, 2022

@franz1981 try downloading the .svg files first and then opening them in a browser from your local disk. If you just click on them directly, the JavaScript is disabled.

@franz1981
Contributor

Thanks @njhill :O I thought it was just an image, not an SVG :O
I apologize, @xbzhang99

@franz1981
Contributor

franz1981 commented Mar 24, 2022

@xbzhang99 I need a few more pieces of info to understand this better:

  1. profiling duration
  2. number of cores (logical too) of the machine
  3. a profiler run using the -t argument
  4. is the machine physical or virtual?
  5. how many Netty event loops are created?

And please consider applying #106 (comment) too; in general, io_uring doesn't need as many event loop threads as epoll.

Another suggestion is related to this threshold:

`Math.max(0, SystemPropertyUtil.getInt("io.netty.iouring.iosqeAsyncThreshold", 25));`
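Since the default is read through `SystemPropertyUtil`, it can be overridden with a plain JVM system property. A minimal sketch, assuming the value is captured once when the io_uring transport classes initialize (so it must be set before they are loaded; the class name here is just illustrative):

```java
public final class IoUringThresholdTuning {
    public static void main(String[] args) {
        // Equivalent to passing -Dio.netty.iouring.iosqeAsyncThreshold=2147483647
        // on the command line; run this before any io_uring transport class is
        // loaded, since the threshold is read in a static context (assumption).
        System.setProperty("io.netty.iouring.iosqeAsyncThreshold",
                String.valueOf(Integer.MAX_VALUE));

        // ... bootstrap the Netty io_uring server afterwards
    }
}
```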

@xbzhang99
Author

I tried several io.netty.iouring.iosqeAsyncThreshold values, but the client timed out when it was > 2.
The machine is physical and has 64 cores.
How do I check the number of Netty event loops?

@franz1981
Contributor

There are still some missing answers that would be useful for the investigation:

  • duration of the profiling session, i.e. the -d argument of async-profiler

I tried several io.netty.iouring.iosqeAsyncThreshold values, but the client timed out when it was > 2

Try with a number much higher than 25: how many connections are created against the Cassandra node?
You can bump it all the way to 2147483647.

But I suggest considering the suggestion from @normanmaurer on #106 (comment) first.

How do I check the number of Netty event loops?

That depends on Cassandra, I think. Can you share the GitHub link of the exact test? I am not familiar with the Cassandra codebase.

@xbzhang99
Author

xbzhang99 commented Mar 28, 2022

We profiled with -d 60, but have not done a -t run.

I increased the value to 2000000000, and the client (cassandra-stress) gave:
com.datastax.driver.core.exceptions.OperationTimedOutException: [/xxx.xxx.xxx.xxx:9042] Timed out waiting for server response
java.util.NoSuchElementException

There seem to be 3 event loop groups; in debug.log there are:
DEBUG [main] 2022-03-28 01:41:31,893 SocketFactory.java:179 - using netty IOURING event loop for pool prefix Messaging-AcceptLoop, threadCount= 1
DEBUG [main] 2022-03-28 01:41:31,906 SocketFactory.java:179 - using netty IOURING event loop for pool prefix Messaging-EventLoop, threadCount= 64
DEBUG [main] 2022-03-28 01:41:31,930 SocketFactory.java:179 - using netty IOURING event loop for pool prefix Streaming-EventLoop, threadCount= 64

@xbzhang99
Author

xbzhang99 commented Apr 1, 2022

I reduced the threads as per #106 (comment).
The throughput is now on par with epoll (also reduced to the same number of threads), and the context switches were much lower.
So can we have a smaller default for the number of threads? Can we also make it a runtime parameter?
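For what it's worth, the thread count is already a runtime choice at the Netty level: it can be passed to the event loop group constructor, or the default can be changed with -Dio.netty.eventLoopThreads. A minimal sketch, not Cassandra's actual wiring (the class names come from the incubator package io.netty.incubator.channel.uring; the port and thread counts are illustrative):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelInitializer;
import io.netty.incubator.channel.uring.IOUringEventLoopGroup;
import io.netty.incubator.channel.uring.IOUringServerSocketChannel;

public final class SmallIoUringServer {
    public static void main(String[] args) throws InterruptedException {
        // Far fewer event loop threads than one per core (64 on this machine).
        IOUringEventLoopGroup boss = new IOUringEventLoopGroup(1);
        IOUringEventLoopGroup workers = new IOUringEventLoopGroup(4);
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                    .group(boss, workers)
                    .channel(IOUringServerSocketChannel.class)
                    .childHandler(new ChannelInitializer<Channel>() {
                        @Override
                        protected void initChannel(Channel ch) {
                            // application handlers would be added here
                        }
                    });
            bootstrap.bind(9042).sync().channel().closeFuture().sync();
        } finally {
            workers.shutdownGracefully();
            boss.shutdownGracefully();
        }
    }
}
```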

A side question: why was the low-level io_uring API used rather than liburing?

@franz1981
Contributor

@xbzhang99 Because accessing the ring buffers via JNI still has a cost (many, TBH), but here, thanks to Unsafe (and VarHandle in the future), there's no need to pay that cost.
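To make that concrete: after a one-time native setup that mmaps the rings, the submission/completion head and tail counters can be read and written directly from Java, with no JNI call per access. A rough, self-contained sketch of the idea using sun.misc.Unsafe against a plain off-heap allocation (the real transport works on the mmapped io_uring memory; the offset below is made up):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public final class RingIndexAccess {
    private static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the mmapped ring memory; in the real transport this
        // address would come from a single JNI/mmap setup call.
        long ringBase = UNSAFE.allocateMemory(64);
        long tailOffset = 0; // made-up offset of the tail counter

        UNSAFE.putIntVolatile(null, ringBase + tailOffset, 42);        // producer bumps the tail
        int tail = UNSAFE.getIntVolatile(null, ringBase + tailOffset); // consumer reads it back
        System.out.println("tail = " + tail);

        UNSAFE.freeMemory(ringBase);
    }
}
```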

@xbzhang99
Author

xbzhang99 commented Apr 4, 2022

I reduced the number of threads for NioEventLoopGroup, and I got about the same throughput as with Epoll and IOUring.
So basically I am not able to see the benefit of IOUring over Epoll or even Nio.

@normanmaurer
Member

@xbzhang99 One important aspect is that, at the moment, io_uring will only give you improvements when you are I/O heavy in terms of network syscalls. It will also only show a big improvement when a lot of connections doing I/O sit on the same IOUringEventLoop.

@xbzhang99
Author

xbzhang99 commented Apr 11, 2022

"because JNI to access the ring buffers has still a cost (many, TBH) but here thanks to unsafe (and VarHandle in the future) there's not need to pay that cost."

@franz1981
Could you explain more? I thought that with both liburing and the raw io_uring API you have to use JNI to access the ring buffers?
What about Unsafe?

@franz1981
Contributor

Nope, in the current version there's no need for JNI to access the io_uring ring buffers.

@franz1981
Contributor

Closing this as explained 😃

franz1981 added a commit to franz1981/FrameworkBenchmarks that referenced this issue Jun 10, 2023
The plaintext benchmark can cause Netty io_uring to use ASYNC SQEs, which are not handled nicely at the OS level (creating tons of kernel threads): 16384 connections / 28 event loops ≈ 585 connections per event loop, which exceeds the default threshold of 25 connections that can use SYNC SQEs.

This PR fixes this behaviour and forces SYNC SQEs: see netty/netty-incubator-transport-io_uring#152 (comment) for more info
NateBrady23 pushed a commit to TechEmpower/FrameworkBenchmarks that referenced this issue Jun 12, 2023
The plaintext benchmark can cause Netty io_uring to use ASYNC SQEs, which are not handled nicely at the OS level (creating tons of kernel threads): 16384 connections / 28 event loops ≈ 585 connections per event loop, which exceeds the default threshold of 25 connections that can use SYNC SQEs.

This PR fixes this behaviour and forces SYNC SQEs: see netty/netty-incubator-transport-io_uring#152 (comment) for more info
franz1981 added a commit to franz1981/FrameworkBenchmarks that referenced this issue Jun 23, 2023
The plaintext benchmark can cause Netty io_uring to use ASYNC SQEs, which are not handled nicely at the OS level (creating tons of kernel threads): 16384 connections / 28 event loops ≈ 585 connections per event loop, which exceeds the default threshold of 25 connections that can use SYNC SQEs.

This PR fixes this behaviour and forces SYNC SQEs: see netty/netty-incubator-transport-io_uring#152 (comment) for more info