
netty io_uring was slower than epoll #152

Closed
xbzhang99 opened this issue Mar 23, 2022 · 18 comments

Comments

@xbzhang99

I collected some stats on x86, kernel 5.13, using netty-incubator-iouring 13 with Cassandra 4.0. The network throughput is about 8% lower than with netty epoll, context switches are 69% higher than with epoll, and interrupts are 2% higher. CPI (cycles per instruction) is also 8% higher.

|  | iouring | epoll | diff | actual % | calculated % |
| --- | --- | --- | --- | --- | --- |
| load | 1000 | 1000 |  |  |  |
| throughput | 142868 | 154465 | 1.081 | 8.1 | 11.4 |
| util | 52.0 | 50.3 | 0.967 | -3.3 |  |
| freq | 3.09 | 3.13 | 1.013 | 1.3 |  |
| CPI | 1.33 | 1.22 | 0.915 | 8.5 |  |
| Pathlength (Kinstruction/throughput) | 842.7 | 801.6 | 0.951 | 4.9 |  |

netty io_uring flamegraph
iouring-read-cache-run002_profiler

netty epoll flamegraph
epoll-read-cache-run002_profiler

@franz1981
Contributor

franz1981 commented Mar 23, 2022

Could you please attach the flamegraphs, or send them via email?
I can't navigate them properly to understand what's going on in their current form, i.e. as images.
Looking at the flames from a 100ft perspective, the test doesn't seem I/O bound; in that case the Netty threads (thanks to io_uring) can go to sleep faster (and more frequently), and waking them costs more, causing more context switches etc.
and worse performance overall.

So please share a bit more about the test and the data results.

@xbzhang99
Author

xbzhang99 commented Mar 23, 2022 via email

@xbzhang99
Author

Tests were done using the READ.sh script in Cassandra (cassandra-stress).

IOuring:
./READ.sh 120 50
Results:
Op rate : 188,259 op/s [simple1: 188,259 op/s]
Partition rate : 188,259 pk/s [simple1: 188,259 pk/s]
Row rate : 1,882,971 row/s [simple1: 1,882,971 row/s]
Latency mean : 0.3 ms [simple1: 0.3 ms]
Latency median : 0.2 ms [simple1: 0.2 ms]
Latency 95th percentile : 0.3 ms [simple1: 0.3 ms]
Latency 99th percentile : 0.5 ms [simple1: 0.5 ms]
Latency 99.9th percentile : 1.0 ms [simple1: 1.0 ms]
Latency max : 55.6 ms [simple1: 55.6 ms]
Total partitions : 22,778,788 [simple1: 22,778,788]
Total errors : 0 [simple1: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:02:00

Epoll:
./READ.sh 120 50
Op rate : 204,277 op/s [simple1: 204,277 op/s]
Partition rate : 204,277 pk/s [simple1: 204,277 pk/s]
Row rate : 2,042,924 row/s [simple1: 2,042,924 row/s]
Latency mean : 0.2 ms [simple1: 0.2 ms]
Latency median : 0.2 ms [simple1: 0.2 ms]
Latency 95th percentile : 0.3 ms [simple1: 0.3 ms]
Latency 99th percentile : 0.3 ms [simple1: 0.3 ms]
Latency 99.9th percentile : 0.9 ms [simple1: 0.9 ms]
Latency max : 55.5 ms [simple1: 55.5 ms]
Total partitions : 24,601,614 [simple1: 24,601,614]
Total errors : 0 [simple1: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:02:00

@franz1981
Contributor

franz1981 commented Mar 23, 2022

Thanks @xbzhang99, but could you still attach the flamegraphs you've sent somewhere? Please also share details of the machines (not just the OS), the JVM, etc.

@xbzhang99
Author

It was JDK 14. Sorry, I can't seem to attach the flamegraphs (I already dragged and dropped them above).

@njhill
Member

njhill commented Mar 23, 2022

@franz1981 try downloading the .svg files first and then opening them in a browser from your local disk. If you just click on them directly, the JavaScript is disabled.

@franz1981
Contributor

Thanks @njhill :O I thought it was just an image, not an SVG :O
I apologize, @xbzhang99

@franz1981
Contributor

franz1981 commented Mar 24, 2022

@xbzhang99 I need a few more pieces of info to understand this better:

  1. profiling duration
  2. number of cores (logical too) of the machine
  3. a profiler run using the -t argument
  4. is the machine physical or virtual?
  5. how many Netty event loops are created?

And please consider applying #106 (comment) too; in general, io_uring doesn't need as many event loop threads as epoll.

Another suggestion is related to this threshold:

`Math.max(0, SystemPropertyUtil.getInt("io.netty.iouring.iosqeAsyncThreshold", 25));`
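Since the default is read through `SystemPropertyUtil`, it can be overridden with a plain JVM system property. A minimal sketch, assuming the value is captured once when the io_uring transport classes initialize (so it must be set before they are loaded; the class name here is just illustrative):

```java
public final class IoUringThresholdTuning {
    public static void main(String[] args) {
        // Equivalent to passing -Dio.netty.iouring.iosqeAsyncThreshold=2147483647
        // on the command line; run this before any io_uring transport class is
        // loaded, since the threshold is read in a static context (assumption).
        System.setProperty("io.netty.iouring.iosqeAsyncThreshold",
                String.valueOf(Integer.MAX_VALUE));

        // ... bootstrap the Netty io_uring server afterwards
    }
}
```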

@xbzhang99
Author

I tried several io.netty.iouring.iosqeAsyncThreshold values, but the client timed out when it was > 2.
The machine is physical and has 64 cores.
How do I check the number of Netty event loops?

@franz1981
Contributor

There are still some missing answers that would be useful for the investigation:

  • duration of the profiling session, i.e. the -d argument of async-profiler

I tried several io.netty.iouring.iosqeAsyncThreshold values, but the client timed out when it was > 2

Try with a number much higher than 25: how many connections are created against the Cassandra node?
You can bump it all the way to 2147483647.

But I suggest considering the suggestion from @normanmaurer on #106 (comment) first.

How do I check the number of Netty event loops?

That depends on Cassandra, I think. Can you share the GitHub link of the exact test? I am not familiar with the Cassandra codebase.

@xbzhang99
Author

xbzhang99 commented Mar 28, 2022

We profiled with -d 60, but have not done a -t run.

I increased the value to 2000000000, and the client (cassandra-stress) gave:
com.datastax.driver.core.exceptions.OperationTimedOutException: [/xxx.xxx.xxx.xxx:9042] Timed out waiting for server response
java.util.NoSuchElementException

There seem to be 3 event loop groups; in debug.log there are:
DEBUG [main] 2022-03-28 01:41:31,893 SocketFactory.java:179 - using netty IOURING event loop for pool prefix Messaging-AcceptLoop, threadCount= 1
DEBUG [main] 2022-03-28 01:41:31,906 SocketFactory.java:179 - using netty IOURING event loop for pool prefix Messaging-EventLoop, threadCount= 64
DEBUG [main] 2022-03-28 01:41:31,930 SocketFactory.java:179 - using netty IOURING event loop for pool prefix Streaming-EventLoop, threadCount= 64

@xbzhang99
Author

xbzhang99 commented Apr 1, 2022

I reduced the threads as per #106 (comment).
The throughput is now on par with epoll (also reduced to the same number of threads), and the context switches were much lower.
So can we have a smaller default for the number of threads? Can we also make it a runtime parameter?
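For what it's worth, the thread count is already a runtime choice at the Netty level: it can be passed to the event loop group constructor, or the default can be changed with -Dio.netty.eventLoopThreads. A minimal sketch, not Cassandra's actual wiring (the class names come from the incubator package io.netty.incubator.channel.uring; the port and thread counts are illustrative):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelInitializer;
import io.netty.incubator.channel.uring.IOUringEventLoopGroup;
import io.netty.incubator.channel.uring.IOUringServerSocketChannel;

public final class SmallIoUringServer {
    public static void main(String[] args) throws InterruptedException {
        // Far fewer event loop threads than one per core (64 on this machine).
        IOUringEventLoopGroup boss = new IOUringEventLoopGroup(1);
        IOUringEventLoopGroup workers = new IOUringEventLoopGroup(4);
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                    .group(boss, workers)
                    .channel(IOUringServerSocketChannel.class)
                    .childHandler(new ChannelInitializer<Channel>() {
                        @Override
                        protected void initChannel(Channel ch) {
                            // application handlers would be added here
                        }
                    });
            bootstrap.bind(9042).sync().channel().closeFuture().sync();
        } finally {
            workers.shutdownGracefully();
            boss.shutdownGracefully();
        }
    }
}
```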

A side question: why was the low-level io_uring API used rather than liburing?

@franz1981
Contributor

@xbzhang99 Because accessing the ring buffers via JNI still has a cost (many, TBH), but here, thanks to Unsafe (and VarHandle in the future), there's no need to pay that cost.
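To make that concrete: after a one-time native setup that mmaps the rings, the submission/completion head and tail counters can be read and written directly from Java, with no JNI call per access. A rough, self-contained sketch of the idea using sun.misc.Unsafe against a plain off-heap allocation (the real transport works on the mmapped io_uring memory; the offset below is made up):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public final class RingIndexAccess {
    private static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the mmapped ring memory; in the real transport this
        // address would come from a single JNI/mmap setup call.
        long ringBase = UNSAFE.allocateMemory(64);
        long tailOffset = 0; // made-up offset of the tail counter

        UNSAFE.putIntVolatile(null, ringBase + tailOffset, 42);        // producer bumps the tail
        int tail = UNSAFE.getIntVolatile(null, ringBase + tailOffset); // consumer reads it back
        System.out.println("tail = " + tail);

        UNSAFE.freeMemory(ringBase);
    }
}
```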

@xbzhang99
Author

xbzhang99 commented Apr 4, 2022

I reduced the number of threads for NioEventLoopGroup, and I got about the same throughput as with Epoll and IOUring.
So basically I am not able to see the benefit of IOUring over Epoll or even Nio.

@normanmaurer
Member

@xbzhang99 One important aspect is that, at the moment, io_uring will only give you improvements when you are I/O heavy in terms of network syscalls. It will also only show a big improvement when a lot of connections doing I/O sit on the same IOUringEventLoop.

@xbzhang99
Author

xbzhang99 commented Apr 11, 2022

"because JNI to access the ring buffers has still a cost (many, TBH) but here thanks to unsafe (and VarHandle in the future) there's not need to pay that cost."

@franz1981
Could you explain more? I thought that with both liburing and the raw io_uring API you have to use JNI to access the ring buffers?
What about Unsafe?

@franz1981
Contributor

Nope, in the current version there's no need for JNI to access the io_uring ring buffers.

@franz1981
Contributor

Closing this as explained 😃

franz1981 added a commit to franz1981/FrameworkBenchmarks that referenced this issue Jun 10, 2023
The plaintext benchmark can cause Netty io_uring to use ASYNC SQEs, which are not handled nicely at the OS level (creating tons of kernel threads): 16384 connections / 28 event loops ≈ 585 connections per event loop, which exceeds the default threshold of 25 connections that can use SYNC SQEs.

This PR fixes this behaviour and forces SYNC SQEs: see netty/netty-incubator-transport-io_uring#152 (comment) for more info
NateBrady23 pushed a commit to TechEmpower/FrameworkBenchmarks that referenced this issue Jun 12, 2023
The plaintext benchmark can cause Netty io_uring to use ASYNC SQEs, which are not handled nicely at the OS level (creating tons of kernel threads): 16384 connections / 28 event loops ≈ 585 connections per event loop, which exceeds the default threshold of 25 connections that can use SYNC SQEs.

This PR fixes this behaviour and forces SYNC SQEs: see netty/netty-incubator-transport-io_uring#152 (comment) for more info
franz1981 added a commit to franz1981/FrameworkBenchmarks that referenced this issue Jun 23, 2023
The plaintext benchmark can cause Netty io_uring to use ASYNC SQEs, which are not handled nicely at the OS level (creating tons of kernel threads): 16384 connections / 28 event loops ≈ 585 connections per event loop, which exceeds the default threshold of 25 connections that can use SYNC SQEs.

This PR fixes this behaviour and forces SYNC SQEs: see netty/netty-incubator-transport-io_uring#152 (comment) for more info