netty io_uring was slower than epoll #152
Please can you attach the flamegraphs or send them via email? I cannot navigate the flamegraphs in this form to understand what's going on. Looking at the flames, it doesn't seem to be an I/O-bound benchmark: if the system is not I/O bound and the Netty threads (thanks to io_uring) go to sleep faster, waking them will cost more, cause more context switches, etc. So please share a bit more about the test and the results.
Tests were done using the READ.sh script in Cassandra (cassandra-stress).
IOuring:
./READ.sh 120 50
Results:
Op rate : 188,259 op/s [simple1: 188,259 op/s]
Partition rate : 188,259 pk/s [simple1: 188,259 pk/s]
Row rate : 1,882,971 row/s [simple1: 1,882,971 row/s]
Latency mean : 0.3 ms [simple1: 0.3 ms]
Latency median : 0.2 ms [simple1: 0.2 ms]
Latency 95th percentile : 0.3 ms [simple1: 0.3 ms]
Latency 99th percentile : 0.5 ms [simple1: 0.5 ms]
Latency 99.9th percentile : 1.0 ms [simple1: 1.0 ms]
Latency max : 55.6 ms [simple1: 55.6 ms]
Total partitions : 22,778,788 [simple1: 22,778,788]
Total errors : 0 [simple1: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:02:00
Epoll:
./READ.sh 120 50
Op rate : 204,277 op/s [simple1: 204,277 op/s]
Partition rate : 204,277 pk/s [simple1: 204,277 pk/s]
Row rate : 2,042,924 row/s [simple1: 2,042,924 row/s]
Latency mean : 0.2 ms [simple1: 0.2 ms]
Latency median : 0.2 ms [simple1: 0.2 ms]
Latency 95th percentile : 0.3 ms [simple1: 0.3 ms]
Latency 99th percentile : 0.3 ms [simple1: 0.3 ms]
Latency 99.9th percentile : 0.9 ms [simple1: 0.9 ms]
Latency max : 55.5 ms [simple1: 55.5 ms]
Total partitions : 24,601,614 [simple1: 24,601,614]
Total errors : 0 [simple1: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:02:00
Best regards,
Xubo
Thanks @xbzhang99, but can you still attach the flamegraphs you've sent somewhere, and share details of the machines (not just the OS), JVM, etc.?
It was JDK 14. Sorry, I can't seem to attach the flamegraphs (I already dragged and dropped them above).
@franz1981 try downloading the .svg files first and then open them in a browser from your local disk. If you just click on them directly, the JavaScript is disabled.
Thanks @njhill :O I thought it was just an image, not an SVG :O
@xbzhang99 I need a few more pieces of info to understand it better.
And please consider applying #106 (comment) too; in general, io_uring doesn't need as many event loop threads as epoll. Another suggestion relates to Line 40 in 60f7b89.
I tried several io.netty.iouring.iosqeAsyncThreshold values, but the client got timed out when it was >2.
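Putting the two suggestions above together (a smaller io_uring event loop group plus a higher SQE-async threshold), here is a minimal, hypothetical sketch of how a server bootstrap might be configured; the thread count, threshold value, and port are illustrative assumptions, not values validated in this thread:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.incubator.channel.uring.IOUringEventLoopGroup;
import io.netty.incubator.channel.uring.IOUringServerSocketChannel;

public final class IOUringServerSketch {
    public static void main(String[] args) throws Exception {
        // Assumption: an illustrative threshold value; set it before the transport
        // is used so SQEs stay synchronous even with many connections per loop.
        System.setProperty("io.netty.iouring.iosqeAsyncThreshold", "1024");

        // io_uring generally needs fewer event loop threads than epoll (see #106),
        // so pick a small explicit count instead of the default.
        EventLoopGroup group = new IOUringEventLoopGroup(4);
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                    .group(group)
                    .channel(IOUringServerSocketChannel.class)
                    .childHandler(new ChannelInitializer<Channel>() {
                        @Override
                        protected void initChannel(Channel ch) {
                            // application pipeline handlers would be added here
                        }
                    });
            // 9042 is only a placeholder port for this example.
            bootstrap.bind(9042).sync().channel().closeFuture().sync();
        } finally {
            group.shutdownGracefully();
        }
    }
}
```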
There are still some missing answers that would be useful for the investigation:
Try with a number much higher than 25. How many connections are created against the Cassandra node? But I suggest first considering the suggestion from @normanmaurer on #106 (comment).
It depends on Cassandra, I think. Can you share the GitHub link to the exact test? I am not familiar with the Cassandra codebase.
We have done -d 60 in profiling, but have not done -t. I increased the value to 2000000000, and the client (cassandra-stress) gave: there seemed to be 3 event loops, in the debug.log there are
I reduced the threads as per #106 (comment). A side question: why was the low-level API used instead of liburing?
@xbzhang99 because going through JNI to access the ring buffers still has a cost (many, TBH), but here, thanks to Unsafe (and VarHandle in the future), there's no need to pay that cost.
I reduced the number of threads for the NioEventLoopGroup, and I got about the same throughput for Epoll and IOUring.
@xbzhang99 also, one important aspect is that io_uring will only give you improvements when you are IO-heavy in terms of network syscalls at the moment. Also, it will only show a big improvement when a lot of connections that sit on the same event loop are doing IO.
"because JNI to access the ring buffers has still a cost (many, TBH) but here thanks to unsafe (and VarHandle in the future) there's not need to pay that cost." @franz1981 |
Nope, in the current version there's no need for JNI to access the io_uring ring buffers.
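As a rough, hypothetical illustration of that point (this is not Netty's actual code): once the address of the ring's mmap'ed memory has been obtained natively a single time, counters such as the tail index can be read with volatile semantics straight from Java via Unsafe, so no JNI call is needed on the hot path. The class and field names below are invented for the sketch:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Toy sketch: read a ring counter at a raw off-heap address without JNI.
public final class RingIndexReader {

    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final long tailAddress; // absolute address of the ring's tail counter

    RingIndexReader(long tailAddress) {
        this.tailAddress = tailAddress;
    }

    // Volatile (acquire) load of the 32-bit tail counter, no native call involved.
    int tail() {
        return UNSAFE.getIntVolatile(null, tailAddress);
    }

    public static void main(String[] args) {
        // Stand-in for the mmap'ed ring memory: plain off-heap allocation here.
        long addr = UNSAFE.allocateMemory(4);
        UNSAFE.putIntVolatile(null, addr, 42);
        System.out.println("tail = " + new RingIndexReader(addr).tail());
        UNSAFE.freeMemory(addr);
    }
}
```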
Closing this as explained 😃
The plaintext benchmark can cause Netty io_uring to use ASYNC SQEs that are not handled nicely at the OS level (creating tons of kernel threads), i.e. 16384 connections / 28 event loops ≈ 585 connections per event loop that could use SYNC SQEs, which exceeds the default threshold of 25. This PR fixes this behaviour and forces SYNC SQEs: see netty/netty-incubator-transport-io_uring#152 (comment) for more info.
I collected some stats on x86, 5.13 kernel, using netty-incubator-iouring 13 with Cassandra 4.0. The network throughput is about 8% lower than with netty epoll. Context switches are 69% higher than with epoll, and interrupts are 2% higher. Also, CPI (cycles per instruction) is 8% higher.
netty io_uring flamegraph
netty epoll flamegraph