Use io_uring to batch handle clients pending writes to reduce SYSCALL count. #112

Open
wants to merge 6 commits into
base: unstable

lipzhu
Contributor

@lipzhu lipzhu commented Apr 1, 2024

Description

This patch tries to take advantage of io_uring's batching feature to reduce the syscall count in valkey's handleClientsWithPendingWrites.
With this patch, we observe a perf gain of more than 6% for SET/GET.
The patch was implemented based on the following points from the review discussion:

  1. Introduce io_uring.h to hold the io_uring-related API and keep it separate from the server logic.
  2. Make io_uring.h independent of server.h.
  3. Use io_uring only for writes of the client static buffer, where it gives a performance gain.

Benchmark Result

Test Env

  • OPERATING SYSTEM: Ubuntu 22.04.4 LTS
  • Kernel: 5.15.0-116-generic
  • PROCESSOR: Intel Xeon Platinum 8380
  • Base: 5b9fc46
  • Server and client run on the same CPU socket.

Test Steps

  1. Start valkey-server with the config below.
taskset -c 0-3 ~/src/valkey-server /tmp/valkey_1.conf

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""
  2. Start valkey-benchmark, making sure valkey-server CPU utilization is 1 (fully utilized).
taskset -c 16-19 ~/src/valkey-benchmark -p 9001 -t set -d 100 -r 1000000 -n 5000000 -c 50 --threads 4

Test Result

QPS of SET and GET increases by 6.5% and 6.6% respectively.

Perf Stat

perf stat shows that only 1 CPU was used during the test and that IPC increased by about 6%, so the gain does not come from more CPU resources.

perf stat -p `pidof valkey-server` sleep 10

# w/o io_uring
 Performance counter stats for process id '2267781':

          9,993.95 msec task-clock                #    0.999 CPUs utilized
               625      context-switches          #   62.538 /sec
                 0      cpu-migrations            #    0.000 /sec
            94,933      page-faults               #    9.499 K/sec
    33,894,880,825      cycles                    #    3.392 GHz
    39,284,579,699      instructions              #    1.16  insn per cycle
     7,750,350,988      branches                  #  775.504 M/sec
        73,791,242      branch-misses             #    0.95% of all branches
   169,474,584,465      slots                     #   16.958 G/sec
    39,212,071,735      topdown-retiring          #     23.1% retiring
    11,962,902,869      topdown-bad-spec          #      7.1% bad speculation
    43,199,367,984      topdown-fe-bound          #     25.5% frontend bound
    75,159,711,305      topdown-be-bound          #     44.3% backend bound

      10.001262795 seconds time elapsed

# w/ io_uring
 Performance counter stats for process id '2273716':

          9,970.38 msec task-clock                #    0.997 CPUs utilized
             1,077      context-switches          #  108.020 /sec
                 1      cpu-migrations            #    0.100 /sec
           124,080      page-faults               #   12.445 K/sec
    33,813,062,268      cycles                    #    3.391 GHz
    41,455,816,158      instructions              #    1.23  insn per cycle
     8,063,017,730      branches                  #  808.697 M/sec
        68,008,453      branch-misses             #    0.84% of all branches
   169,066,451,360      slots                     #   16.957 G/sec
    38,077,547,648      topdown-retiring          #     22.0% retiring
    28,509,121,765      topdown-bad-spec          #     16.5% bad speculation
    41,083,738,441      topdown-fe-bound          #     23.8% frontend bound
    65,062,545,805      topdown-be-bound          #     37.7% backend bound

      10.001785198 seconds time elapsed
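
The IPC claim can be rechecked from the raw cycle and instruction counters in the two runs. The sketch below is only a sanity check of that arithmetic; the helper names (ipc, percent_change, print_ipc_summary) are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Raw counters copied from the two perf stat runs above. */
static const double CYCLES_SYNC  = 33894880825.0; /* w/o io_uring */
static const double INSNS_SYNC   = 39284579699.0;
static const double CYCLES_URING = 33813062268.0; /* w/ io_uring */
static const double INSNS_URING  = 41455816158.0;

/* Instructions per cycle from raw counter values. */
static double ipc(double instructions, double cycles) {
    assert(cycles > 0.0);
    return instructions / cycles;
}

/* Relative change from base to next, in percent. */
static double percent_change(double base, double next) {
    assert(base > 0.0);
    return (next - base) / base * 100.0;
}

/* Print both IPC values and the relative IPC change between the runs. */
static void print_ipc_summary(void) {
    double sync_ipc = ipc(INSNS_SYNC, CYCLES_SYNC);
    double uring_ipc = ipc(INSNS_URING, CYCLES_URING);
    printf("IPC w/o io_uring: %.3f\n", sync_ipc);
    printf("IPC w/  io_uring: %.3f\n", uring_ipc);
    printf("IPC change: %+.1f%%\n", percent_change(sync_ipc, uring_ipc));
}
```

The two IPC values round to the 1.16 and 1.23 shown in the insn per cycle column, and the change comes out a little under 6%. Since the cycle counts of the two runs are nearly identical, the extra work shows up as more instructions retired in the same number of cycles.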

NOTE

  • io_uring was introduced in kernel 5.1; if the running kernel doesn't support io_uring, the original implementation is used.
  • This patch introduces a dependency on liburing. It is installed in my local env; to keep things simple I didn't include the liburing dependency in this patch, so the CI build may fail.
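
A runtime check for the fallback mentioned in the first note could look roughly like the sketch below. This is not the code from the patch: it probes the kernel with a raw io_uring_setup syscall so that it builds without liburing, the helper name io_uring_probably_supported is made up, and the hard-coded syscall number 425 is an assumption used only when the system headers do not define __NR_io_uring_setup.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/syscall.h>
#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425 /* assumed number, for headers that predate io_uring */
#endif
#endif

/* Returns 1 if the kernel appears to accept io_uring_setup, 0 otherwise.
 * The call is made with deliberately invalid arguments (0 entries, NULL
 * params), so it is not expected to create a ring. A kernel that has
 * io_uring rejects the arguments (for example with EFAULT), ENOSYS means the
 * syscall does not exist, and EPERM/EACCES usually means it is blocked by a
 * sandbox or a sysctl. */
static int io_uring_probably_supported(void) {
#ifdef __linux__
    long ret = syscall(__NR_io_uring_setup, 0, NULL);
    if (ret >= 0) {
        /* Not expected with invalid arguments, but do not leak the fd. */
        close((int)ret);
        return 1;
    }
    assert(ret == -1); /* the libc wrapper reports failure as -1 plus errno */
    if (errno == ENOSYS || errno == EPERM || errno == EACCES) return 0;
    return 1;
#else
    return 0; /* io_uring is Linux-only */
#endif
}
```

A caller would take the io_uring path only when this returns 1 and keep the original write path otherwise.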

@zuiderkwast
Contributor

zuiderkwast commented Apr 2, 2024

If you merge latest unstable, the spellcheck is fixed.

Can you add a check for <liburing.h> in Makefile, something like this:

HAS_LIBURING := $(shell sh -c 'echo "$(NUMBER_SIGN_CHAR)include <liburing.h>" > foo.c; \
	$(CC) -E foo.c > /dev/null 2>&1 && echo yes; \
	rm foo.c')
ifeq ($(HAS_LIBURING),yes)
	...
else
	...
endif

@PingXie
Member

PingXie commented Apr 15, 2024

I have not taken a closer look at the PR but at the minimum I think we need a config to opt in to io_uring.

I also suspect that the gain is likely coming from tapping into the additional cores that couldn't be utilized efficiently by the current io-threads. If so, this would lead to two questions:

  1. On lower spec machines with fewer cores, say 2 or 4, will we see similar improvements?

  2. Where do we see io_uring's place in light of the planned multi-threading improvements? This improvement would potentially allow Valkey to use cores beyond 4 or 8 a lot more efficiently, hence leaving not much room for io-uring?

Lastly, I haven't checked recently, but in the past io-uring has seen quite a number of vulnerabilities. So in addition to the design and PR review, we should also take a hard look at the security implications.

@lipzhu
Contributor Author

lipzhu commented Apr 15, 2024

  1. On lower spec machines with fewer cores, say 2 or 4, will we see similar improvements?

The number of cores does not affect the perf gain. You can refer to the server config I posted in the top comment: io-threads is disabled, the maximum CPU utilization is 1, and the benchmark clients make sure the server CPU is fully utilized. As I described in the top comment, the gain comes from the reduction in syscalls.

  2. Where do we see io_uring's place in light of the planned multi-threading improvements? This improvement would potentially allow Valkey to use cores beyond 4 or 8 a lot more efficiently, hence leaving not much room for io-uring?

Could you share more context about the multi-threading improvements planned by the community?

Lastly, I haven't checked recently, but in the past io-uring has seen quite a number of vulnerabilities. So in addition to the design and PR review, we should also take a hard look at the security implications.

I am not a security expert. Can you give more details about your concern? Will this be a blocker for the community to adopt io_uring?

@PingXie
Member

PingXie commented Apr 15, 2024

The number of cores does not affect the perf gain. You can refer to the server config I posted in the top comment: io-threads is disabled, the maximum CPU utilization is 1, and the benchmark clients make sure the server CPU is fully utilized. As I described in the top comment, the gain comes from the reduction in syscalls.

Start valkey-benchmark (taskset -c 16-19 ~/src/valkey-benchmark -p 9001 -t set -d 100 -r 1000000 -n 5000000 -c 50 --threads 4), making sure valkey-server CPU utilization is 1 (fully utilized).

io-uring comes with busy polling outside of the Valkey (io/main) threads. Does this CPU usage include that or just the CPU cycles accumulated by the Valkey threads? Going back to your original post, it seems to indicate this is just the Valkey CPU usage? I think a more deterministic setup would be using a 2/4 core machine.

Could you share more context about the multi-threading improvements planned by the community?

#22

I am not a security expert. Can you give more details about your concern?

https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html

Will this be a blocker for the community to adopt io_uring?

I would say this would be a serious concern for me. There are two possible outcomes

  1. the vulns are mostly in the kernel and there isn't much application developers (us) can do.
  2. the vulns can be mitigated by changes in the application, which is Valkey here

Outcome 1 would reduce the reach of this feature; outcome 2 would add more work for the Valkey team.

@lipzhu
Contributor Author

lipzhu commented Apr 16, 2024

io-uring comes with busy polling outside of the Valkey (io/main) threads. Does this CPU usage include that or just the CPU cycles accumulated by the Valkey threads?

Actually, this patch doesn't use io_uring's busy-polling mode, so io_uring starts no background threads and all cycles spent in io_uring are attributed to the Valkey server. As I posted in the top comment, the perf gain comes from the batching feature: https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023#batching

Going back to your original post, it seems to indicate this is just the Valkey CPU usage? I think a more deterministic setup would be using a 2/4 core machine.

I can get a similar result with 2-4 CPUs allocated.

I would say this would be a serious concern for me. There are two possible outcomes

  1. the vulns are mostly in the kernel and there isn't much application developers (us) can do.

  2. the vulns can be mitigated by changes in the application, which is Valkey here

Outcome 1 would reduce the reach of this feature; outcome 2 would add more work for the Valkey team.

I just glanced at the vulns list: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring. It seems most of the fixes happened on the kernel side, so there isn't much application developers can do. I think we can still support io_uring, and users can disable it if they want.

@zuiderkwast
Contributor

I think this is really good stuff. Performance is one of the areas we should prioritize IMO.

@lipzhu
Contributor Author

lipzhu commented Apr 17, 2024

I think this is really good stuff. Performance is one of the areas we should prioritize IMO.

Thanks @zuiderkwast, the question is how can we push this patch forward?

@zuiderkwast
Contributor

@lipzhu We are a new team, new project, first release and we are still busy with rebranding from redis to valkey and getting a website up. I think you just need some patience to let other team members have some time to look and think about it.

I'll add it to the backlog for Valkey 8. It will not be forgotten.

@PingXie
Member

PingXie commented Apr 18, 2024

Yeah, this kind of change requires reviewers to block off a decent amount of time and really think it through holistically. This week is really busy for the team as many of us are having in-person engagements with the OSS community. We really appreciate your patience.

@Wenwen-Chen
Contributor

@lipzhu

Description

This patch tries to take advantage of the io_uring batching feature to reduce the syscall count for Valkey in handleClientsWithPendingWrites. With this patch, we can observe more than 4% perf gain for SET/GET, and didn't see an obvious performance regression.

As far as I know, io_uring is a highly efficient I/O engine.
Do you have any plans to optimize other Valkey modules with io_uring?
For example, the ae framework or snapshot operations.

@lipzhu
Contributor Author

lipzhu commented Apr 25, 2024

As far as I know, io_uring is a highly efficient I/O engine. Do you have any plans to optimize other Valkey modules with io_uring? For example, the ae framework or snapshot operations.

Sure, but when we first decided to introduce io_uring, we wanted to find the scenarios where io_uring really delivers a perf gain, and this patch is a straightforward one.
Another scenario I came up with is disk-related operations; I opened #255 to work out the details.
As for the ae framework, my understanding is that replacing epoll and the sync workflow needs a lot of work. I had a POC before and could observe a performance gain, but at the cost of more CPU resources.
So I want to integrate io_uring incrementally, and I also need help from the community; as you know, they are currently busy rebranding :)

@PingXie
Member

PingXie commented Apr 26, 2024

Thanks @lipzhu!

I am generally aligned with the high level idea (and good to know that you don't use polling).

I do have some high level feedback around the code structure and I will list them here too

  1. we should go with an opt-in approach and keep io-uring off by default
  2. I think we should avoid mixing sync read()/write() calls with io_uring. Let's explore a way to have a cleaner separation
  3. the io_uring support seems incomplete - we are missing the support for scatter/gather IOs; also not sure about the rationale behind excluding the replication stream

BTW, I don't have all the details on #22 at the moment so there is a chance that we might have to revisit/rethink this PR, depending on the relative pace of the two. That said, let's continue collaborating on this PR, assuming we would like to incorporate io_uring in Valkey.

@PingXie
Member

PingXie commented Apr 27, 2024

@lipzhu, looking at your results above, the amount of the read calls jumps out too. It will be great if you could apply io-uring to the query path as well.

image

@lipzhu
Contributor Author

lipzhu commented Apr 28, 2024

@PingXie Thanks for your comments :)

@lipzhu, looking at your results above, the amount of the read calls jumps out too.

image

The counters cover a fixed time window (10s), and each query is paired with a readQueryFromClient call, so I think the higher read syscall count makes sense because the QPS increased too.

It will be great if you could apply io-uring to the query path as well.

@PingXie I have done this before, but some issues I found are:

  1. I didn't find a batch read scenario in the read query path. If we simply use io_uring_prep_read followed by io_uring_submit_and_wait to emulate the read, the syscall count isn't reduced, and io_uring_enter is more expensive than read.
  2. I prefer small PRs, each focused on only one thing.

@lipzhu
Contributor Author

lipzhu commented Apr 28, 2024

Thanks @lipzhu!

I am generally aligned with the high level idea (and good to know that you don't use polling).

I do have some high level feedback around the code structure and I will list them here too

  1. we should go with an opt-in approach and keep io-uring off by default

Ok, I will introduce a new config like io-uring (default off) in valkey.conf.

  2. I think we should avoid mixing sync read()/write() calls with io_uring. Let's explore a way to have a cleaner separation

I will refactor the code to explore a cleaner separation.

  3. the io_uring support seems incomplete - we are missing the support for scatter/gather IOs; also not sure about the rationale behind excluding the replication stream

I left out scatter/gather IOs because, as I recall, I didn't observe a perf gain with io_uring there; I will double-check this. As for the replication stream, thanks for pointing it out, I will check it as well.

BTW, I don't have all the details on #22 at the moment so there is a chance that we might have to revisit/rethink this PR, depending on the relative pace of the two. That said, let's continue collaborating on this PR, assuming we would like to incorporate io_uring in Valkey.

Sure, thanks for your guidance and patience.


codecov bot commented May 21, 2024

Codecov Report

Attention: Patch coverage is 25.30120% with 62 lines in your changes missing coverage. Please review.

Project coverage is 70.53%. Comparing base (20d583f) to head (efc4fe4).

Files with missing lines   Patch %   Lines
src/networking.c           30.30%    46 Missing ⚠️
src/io_uring.c             0.00%     12 Missing ⚠️
src/server.c               20.00%    4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #112      +/-   ##
============================================
- Coverage     70.54%   70.53%   -0.02%     
============================================
  Files           114      115       +1     
  Lines         61644    61718      +74     
============================================
+ Hits          43488    43530      +42     
- Misses        18156    18188      +32     
Files with missing lines   Coverage Δ
src/config.c               78.69% <ø> (ø)
src/server.h               100.00% <ø> (ø)
src/server.c               88.49% <20.00%> (-0.11%) ⬇️
src/io_uring.c             0.00% <0.00%> (ø)
src/networking.c           86.78% <30.30%> (-1.70%) ⬇️

... and 10 files with indirect coverage changes

@lipzhu
Contributor Author

lipzhu commented May 30, 2024

Update for this patch:

  1. Introduce a new config io_uring (yes|no) that lets the user decide whether to enable io_uring, then validate that the running system supports io_uring. When both conditions are met, take the io_uring code path; otherwise fall back.
  2. Separate the sync write from io_uring; most of the logic moved to io_uring.c.
  3. Measured scatter/gather IOs and slave clients: io_uring gave no perf gain in these 2 scenarios, only some regressions. A simple analysis shows why: the number of write syscalls is not high in these 2 scenarios, so syscalls account for a small share of the cycles, and per my measurement a single io_uring_enter syscall is more expensive than a write syscall. I think it is not a good idea to use io_uring batch handling for these 2 scenarios.

I suggest using io_uring only for writes of the client static buffer in the first phase. Most static response buffers are small, so the total syscall count is high, the share of cycles spent in the write syscall is correspondingly high, and the perf gain is indeed significant.
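
The trade-off described here can be written as a rough cost model. This is an illustration, not a measurement: it assumes each synchronous write costs c_write cycles, while a batched submission costs one fixed c_enter plus c_sqe cycles per queued request. All names and parameters are hypothetical.

```c
#include <assert.h>

/* Cycles spent on n separate synchronous write() calls. */
static double cost_sync(unsigned n, double c_write) {
    return n * c_write;
}

/* Cycles spent on one io_uring_enter that carries n queued writes. */
static double cost_batched(unsigned n, double c_enter, double c_sqe) {
    return c_enter + n * c_sqe;
}

/* Smallest batch size for which batching is strictly cheaper than n sync
 * writes, or 0 if batching never wins under this model.
 * Derived from c_enter + n * c_sqe < n * c_write,
 * i.e. n > c_enter / (c_write - c_sqe). */
static unsigned break_even_batch(double c_write, double c_enter, double c_sqe) {
    assert(c_write > 0.0 && c_enter > 0.0 && c_sqe >= 0.0);
    if (c_sqe >= c_write) return 0;
    return (unsigned)(c_enter / (c_write - c_sqe)) + 1;
}
```

With few writes per event-loop iteration the fixed c_enter term dominates, which is consistent with the regressions reported for the scatter/gather and replica cases. With many small static-buffer writes the per-request saving is multiplied by n.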

@PingXie @zuiderkwast What do you think?

@lipzhu lipzhu force-pushed the io_uring branch 2 times, most recently from 958ff60 to 0e9afa9 Compare May 30, 2024 10:01
@lipzhu lipzhu requested a review from PingXie June 3, 2024 01:40
Contributor

@zuiderkwast zuiderkwast left a comment


I think this looks mostly good now. It has some refactorings that will conflict with the Async IO threading feature, so I think we should merge the async IO threading first.

The error messages and log messages can probably be improved, but I will review those later.

---------

Signed-off-by: Lipeng Zhu <[email protected]>
Co-authored-by: Wangyang Guo <[email protected]>
@secwall
Contributor

secwall commented Aug 3, 2024

It seems that this change (even without uring enabled) breaks operation with multiple threads.
Just running valkey-benchmark against an instance with io-threads set to 4 makes valkey fail:

6966:M 03 Aug 2024 18:34:36.931 # === ASSERTION FAILED ===
6966:M 03 Aug 2024 18:34:36.931 # ==> io_threads.c:384 'c->clients_pending_write_node.prev == NULL && c->clients_pending_write_node.next == NULL' is not true

The reason is simple: trySendWriteToIOThreads expects ln to be already unlinked. A simple patch like this makes the benchmark pass:

--- a/src/networking.c
+++ b/src/networking.c
@@ -2540,14 +2540,18 @@ int handleClientsWithPendingWrites(void) {
         }

         /* If we can send the client to the I/O thread, let it handle the write. */
-        if (trySendWriteToIOThreads(c) == C_OK) {
+        if (server.io_threads_num > 1) {
             listUnlinkNode(server.clients_pending_write, ln);
-            continue;
+            if (trySendWriteToIOThreads(c) == C_OK) {
+                continue;
+            }
         }

         /* We can't write to the client while IO operation is in progress. */
         if (c->io_write_state != CLIENT_IDLE || c->io_read_state != CLIENT_IDLE) {
-            listUnlinkNode(server.clients_pending_write, ln);
+            if (server.io_threads_num == 1) {
+                listUnlinkNode(server.clients_pending_write, ln);
+            }
             continue;
         }

@@ -2559,7 +2563,9 @@ int handleClientsWithPendingWrites(void) {
                 continue;
             }
         } else {
-            listUnlinkNode(server.clients_pending_write, ln);
+            if (server.io_threads_num == 1) {
+                listUnlinkNode(server.clients_pending_write, ln);
+            }
             /* Try to write buffers to the client socket. */
             if (writeToClient(c) == C_ERR) continue;
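
The invariant behind the assertion can be shown with a small standalone model. The types and helpers below (node, list, list_unlink, send_to_io_thread) are simplified stand-ins invented for this sketch, not Valkey's adlist API. The hand-off helper asserts that the node has already been detached, which is what trySendWriteToIOThreads expects according to the report.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *prev, *next;
} node;

typedef struct list {
    node *head, *tail;
    size_t len;
} list;

/* Append n at the tail of l. */
static void list_append(list *l, node *n) {
    n->prev = l->tail;
    n->next = NULL;
    if (l->tail) l->tail->next = n; else l->head = n;
    l->tail = n;
    l->len++;
}

/* Detach n from l and clear its links so it can be queued elsewhere. */
static void list_unlink(list *l, node *n) {
    if (n->prev) n->prev->next = n->next; else l->head = n->next;
    if (n->next) n->next->prev = n->prev; else l->tail = n->prev;
    n->prev = n->next = NULL;
    l->len--;
}

/* Same condition as the failed assertion: both links must be NULL. */
static int node_is_detached(const node *n) {
    return n->prev == NULL && n->next == NULL;
}

/* Stand-in for handing a client to an IO thread: the node must already be
 * detached from the pending-write list before it joins another queue. */
static int send_to_io_thread(list *io_queue, node *n) {
    assert(node_is_detached(n));
    list_append(io_queue, n);
    return 1;
}
```

Note that a node that is the only element of a list also has both links set to NULL, so a check of this form only trips once at least two clients are pending. That may be why the failure needs a benchmark with many connections to show up.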

The second issue: enabling both io_uring and TLS makes even simple info with cli fail:

./src/valkey-cli --tls --cacert tests/tls/ca.crt
127.0.0.1:6379> info
Error: Success

It seems that we should not try to use io_uring for tls-enabled clients like this?

--- a/src/networking.c
+++ b/src/networking.c
@@ -2429,7 +2429,7 @@ int processIOThreadsWriteDone(void) {
 static inline int _canWriteUsingIOUring(client *c) {
     if (server.io_uring_enabled && server.io_threads_num == 1) {
         /* Currently, we only use io_uring to handle the static buffer write requests. */
-        return getClientType(c) != CLIENT_TYPE_REPLICA && listLength(c->reply) == 0 && c->bufpos > 0;
+        return connIsTLS(c->conn) == 0 && getClientType(c) != CLIENT_TYPE_REPLICA && listLength(c->reply) == 0 && c->bufpos > 0;
     }
     return 0;
 }

@lipzhu lipzhu force-pushed the io_uring branch 2 times, most recently from b3169ab to f6c6dd7 Compare August 5, 2024 09:47
@PingXie
Member

PingXie commented Aug 12, 2024

I saw that Async IO is already merged. I rebased the io_uring batch optimization onto the unstable branch; shall we resume this pull request?

@lipzhu - are the performance numbers in the PR description updated? If not, can you help re-benchmark the improvements? Let's make sure there is still meaningful improvement with async IO changes merged before diving into the code review?

@lipzhu
Contributor Author

lipzhu commented Aug 12, 2024

I saw that Async IO is already merged. I rebased the io_uring batch optimization onto the unstable branch; shall we resume this pull request?

@lipzhu - are the performance numbers in the PR description updated? If not, can you help re-benchmark the improvements? Let's make sure there is still meaningful improvement with async IO changes merged before diving into the code review?

@PingXie Thanks, I just updated the performance numbers in the top comment; we still get a 6% performance boost on the SET/GET benchmark.

@PingXie
Member

PingXie commented Aug 12, 2024

Thanks, @lipzhu!

Sorry I didn't make it clear earlier.

I don't think the current test setup (controlling CPU allocation via taskset) represents a real-world workload. The reason is that, with this test setup, the server can "steal" compute from the CPUs not explicitly allocated by taskset through io-uring, while in the async IO case the server sticks to the CPUs allocated; therefore, the results are not an apples-to-apples comparison. In my opinion, for this test to be valid, we would need to separate out the client and the server on two different machines and allow the server to use all the CPUs for io-threading. Then we toggle io-uring on and off and compare the two sets of performance numbers.

@lipzhu
Contributor Author

lipzhu commented Aug 13, 2024

@PingXie I set up an environment that separates server and client and double-checked the perf boost. Below is a brief summary of my local test env.
Both server and client have 8 CPUs (Intel(R) Xeon(R) Platinum 8380 CPU) enabled, and they are connected through an Ethernet Controller XXV710 for 25GbE SFP28 NIC. Running the same commands without taskset, we observe a ~5% perf boost.

Start server.

~/valkey/src/valkey-server /tmp/valkey.conf

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""

Start client.

~/valkey/src/valkey-benchmark -h 192.168.2.1 -p 9001 -t set,get -d 100 -r 1000000 -n 5000000 -c 50 --threads 4

Signed-off-by: Lipeng Zhu <[email protected]>
@PingXie
Member

PingXie commented Aug 13, 2024

Both server and client have 8 CPUs (Intel(R) Xeon(R) Platinum 8380 CPU) enabled

Just a quick confirmation - there were 8 CPUs in total and 8 io-threads in these tests?

@lipzhu
Contributor Author

lipzhu commented Aug 13, 2024

Both server and client have 8 CPUs (Intel(R) Xeon(R) Platinum 8380 CPU) enabled

Just a quick confirmation - there were 8 CPUs in total and 8 io-threads in these tests?

Both server and client have 8 CPUs. io-threads was not enabled for this test; I don't quite understand why io-threads should be enabled here, because this optimization only works for the main thread.

Signed-off-by: Lipeng Zhu <[email protected]>
@PingXie
Member

PingXie commented Aug 13, 2024

io-threads was not enabled for this test; I don't quite understand why io-threads should be enabled here, because this optimization only works for the main thread.

Io-threading is important because both io-threading and io-uring are targeted at the same problem, which is how to better utilize the CPUs on the system. It is not a fair comparison when one test can use only one CPU (when io-uring is off) while the other can use other CPUs via io-uring.

In a broader sense, io-uring is essentially a more generic form of io-threading done in the kernel.

Do you mind trying out the tests one more time but with 8 io-threads?

@lipzhu
Contributor Author

lipzhu commented Aug 13, 2024

Io-threading is important because both io-threading and io-uring are targeted at the same problem, which is how to better utilize the CPUs on the system. It is not a fair comparison when one test can use only one CPU (when io-uring is off) while the other can use other CPUs via io-uring.

Actually, io_uring will not steal CPU resources in this scenario. Maybe you are talking about the feature of async_io_threads which can be set by IOSQE_ASYNC, but we didn't use this feature.
As the title says, the perf boost mainly comes from the reduced number of write syscalls.

perf also shows that only 1 CPU was used during the test and that IPC increased by about 6%, so the gain does not come from more CPU resources.
Another experiment that could prove this is to start valkey on a server that has only 1 CPU and run the test there; I'm not sure whether that would dispel your concern.

perf stat -p `pidof valkey-server` sleep 10

# w/o io_uring
 Performance counter stats for process id '2267781':

          9,993.95 msec task-clock                #    0.999 CPUs utilized
               625      context-switches          #   62.538 /sec
                 0      cpu-migrations            #    0.000 /sec
            94,933      page-faults               #    9.499 K/sec
    33,894,880,825      cycles                    #    3.392 GHz
    39,284,579,699      instructions              #    1.16  insn per cycle
     7,750,350,988      branches                  #  775.504 M/sec
        73,791,242      branch-misses             #    0.95% of all branches
   169,474,584,465      slots                     #   16.958 G/sec
    39,212,071,735      topdown-retiring          #     23.1% retiring
    11,962,902,869      topdown-bad-spec          #      7.1% bad speculation
    43,199,367,984      topdown-fe-bound          #     25.5% frontend bound
    75,159,711,305      topdown-be-bound          #     44.3% backend bound

      10.001262795 seconds time elapsed

# w/ io_uring
 Performance counter stats for process id '2273716':

          9,970.38 msec task-clock                #    0.997 CPUs utilized
             1,077      context-switches          #  108.020 /sec
                 1      cpu-migrations            #    0.100 /sec
           124,080      page-faults               #   12.445 K/sec
    33,813,062,268      cycles                    #    3.391 GHz
    41,455,816,158      instructions              #    1.23  insn per cycle
     8,063,017,730      branches                  #  808.697 M/sec
        68,008,453      branch-misses             #    0.84% of all branches
   169,066,451,360      slots                     #   16.957 G/sec
    38,077,547,648      topdown-retiring          #     22.0% retiring
    28,509,121,765      topdown-bad-spec          #     16.5% bad speculation
    41,083,738,441      topdown-fe-bound          #     23.8% frontend bound
    65,062,545,805      topdown-be-bound          #     37.7% backend bound

      10.001785198 seconds time elapsed

@PingXie
Member

PingXie commented Aug 14, 2024

Maybe you are talking about the feature of async_io_threads which can be set by IOSQE_ASYNC, but we didn't use this feature.

I, again, forgot this point. I am convinced by your test results. Will find time next to resume the code review :). Thanks a lot for your patience, @lipzhu!

@PingXie
Member

PingXie commented Aug 14, 2024

perf also shows that only 1 CPU was used during the test and that IPC increased by about 6%, so the gain does not come from more CPU resources.

Whenever you get a chance, can you incorporate these performance numbers along with your test setup to the PR description so they are more discoverable?

@lipzhu
Contributor Author

lipzhu commented Aug 14, 2024

Thanks @PingXie.

Maybe you are talking about the feature of async_io_threads which can be set by IOSQE_ASYNC, but we didn't use this feature.

I, again, forgot this point. I am convinced by your test results. Will find time next to resume the code review :). Thanks a lot for your patience, @lipzhu!

Really appreciate your effort on this :).

Whenever you get a chance, can you incorporate these performance numbers along with your test setup to the PR description so they are more discoverable?

Done.

@lipzhu
Contributor Author

lipzhu commented Sep 9, 2024

Kindly ping @PingXie @zuiderkwast .

Signed-off-by: Lipeng Zhu <[email protected]>
@zuiderkwast
Contributor

Just reading through this discussion again. It seems there are no blockers for this one. @PingXie Can we continue to get it merged?

Recap: It doesn't use more CPUs, just fewer syscalls.

I remember we did a lot of reviewing and discussion on #599 but in the end we didn't merge it because it affects the read-committed guarantees. But whatever we decided about io-uring configs and abstraction should apply to this PR as well. @lipzhu maybe you want to refresh the memory.

@PingXie
Member

PingXie commented Dec 8, 2024

Recap: It doesn't use more CPUs, just fewer syscalls.

Yeah that is the tl;dr.

@PingXie Can we continue to get it merged?

I am aligned directionally. That said, I think a more seamless integration could be achieved at a lower level, specifically within the connection layer, in connSocketWrite and connSocketWritev. This approach would eliminate the need for explicit policy decisions in the high level flow, such as "no TLS" and the restriction of io-thread-num == 1, which feels unnecessarily limiting.

@lipzhu what do you think?

BTW, we've decided to phase out the use of the _ prefix for internal function names.

@lipzhu
Contributor Author

lipzhu commented Dec 9, 2024

@zuiderkwast @PingXie Thanks for revisiting this thread.

I am aligned directionally. That said, I think a more seamless integration could be achieved at a lower level, specifically within the connection layer, in connSocketWrite and connSocketWritev. This approach would eliminate the need for explicit policy decisions in the high level flow, such as "no TLS" and the restriction of io-thread-num == 1, which feels unnecessarily limiting.

@lipzhu what do you think?

+1, I will recover the context :) and try to refactor it based on the suggestion.

I've been busy with other things recently; sorry for the late response.

@lipzhu
Contributor Author

lipzhu commented Dec 27, 2024

That said, I think a more seamless integration could be achieved at a lower level, specifically within the connection layer, in connSocketWrite and connSocketWritev. This approach would eliminate the need for explicit policy decisions in the high level flow, such as "no TLS" and the restriction of io-thread-num == 1, which feels unnecessarily limiting

@PingXie after reconsidering the suggestion, I realized that integrating io_uring into the connection layer would be somewhat complex. Or maybe I'm overcomplicating it? :) The existing connection implementations (socket, unix, tls, rdma) all use synchronous methods, and the high-level call stack (e.g. writeToReplica and _writeToClient) is based on this synchronous mode (input data and return the written bytes).

This patch attempts to submit write requests to io_uring, which will handle them asynchronously, and later retrieve all the clients’ written bytes in a batch mode. It is somewhat similar to what io-threads did. How about adding a new method like trySendWriteToIOUring in handleClientsWithPendingWrites and processIOUringWriteDone in beforeSleep? The key difference is that the performance boost comes from the reduced syscalls, not from using more CPUs, as we previously discussed.
Just want to get aligned before the refactor, cc @zuiderkwast.
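
A minimal sketch of the two-phase flow proposed above, assuming a toy in-memory ring in place of a real io_uring instance. The names `ToyRing`, `toySubmitWrite`, and `toyReapCompletions` are hypothetical and exist only for illustration; a real implementation would queue submission entries through liburing instead. The point is only that many queued writes are covered by one submit step, and the written byte counts come back later in a batch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_DEPTH 8 /* hypothetical queue depth */

/* One queued write: which client it belongs to and how many bytes to send. */
typedef struct {
    int client_id;
    size_t len;
} ToyWrite;

/* Stand-in for an io_uring instance (assumption: this is not the real API). */
typedef struct {
    ToyWrite pending[TOY_RING_DEPTH];
    size_t npending;
    unsigned long submit_calls; /* models the number of syscalls issued */
} ToyRing;

/* Phase 1 (would run in handleClientsWithPendingWrites): queue a write
 * without issuing a syscall. Returns 0 on success, -1 if the ring is full
 * and the caller has to fall back to a plain synchronous write. */
int toySubmitWrite(ToyRing *ring, int client_id, size_t len) {
    if (ring->npending == TOY_RING_DEPTH) return -1;
    ring->pending[ring->npending].client_id = client_id;
    ring->pending[ring->npending].len = len;
    ring->npending++;
    return 0;
}

/* Phase 2 (would run in beforeSleep): one simulated syscall flushes the
 * whole batch, then each completion credits the written bytes to its
 * client. Returns the number of completions processed. */
size_t toyReapCompletions(ToyRing *ring, size_t *written_per_client, size_t nclients) {
    size_t done = 0;
    if (ring->npending == 0) return 0;
    ring->submit_calls++; /* a single submit covers every queued write */
    for (size_t i = 0; i < ring->npending; i++) {
        int id = ring->pending[i].client_id;
        if (id >= 0 && (size_t)id < nclients) written_per_client[id] += ring->pending[i].len;
        done++;
    }
    ring->npending = 0;
    return done;
}
```

With three clients queued, the sketch issues one simulated submit instead of three separate writes, which is the syscall reduction discussed in this thread.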

Contributor

@zuiderkwast zuiderkwast left a comment

I like this approach in general. I'd be fine with merging this soon. It's a first introduction of io_uring.

I don't know if there is a better integration point for this, such as in the connection abstraction or in the event loop logic, but the abstractions can be improved in the future, as well as using uring for more things, like reads and fsync.

@pizhenwei you have created much of the connection abstraction. Do you have any ideas about how we can integrate io_uring?

}

void freeIOUring(void) {
}
Contributor

These dummy stubs are never called, right? They're defined just to make it compile for when we don't have liburing?

Should we mark them as dead code in some way? Assert that they're never called?
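
One way to do that, as a rough sketch: the build flag `HAVE_LIBURING` is an assumption here, and whether each stub is actually reachable in the patch is not stated in this thread. In this sketch the init stub reports failure so the feature never gets enabled, and the free stub asserts because it should be unreachable once init has failed.

```c
#include <assert.h>

#define IO_URING_OK 0
#define IO_URING_ERR -1

#ifdef HAVE_LIBURING /* assumed build flag, set when liburing is available */
int initIOUring(void);
void freeIOUring(void);
#else
/* Built without liburing: initialization reports failure, so the caller
 * never turns the feature on. */
int initIOUring(void) {
    return IO_URING_ERR;
}

/* Nothing should try to free a ring that could never be created, so
 * reaching this stub indicates a logic error in the caller. */
void freeIOUring(void) {
    assert(0 && "freeIOUring called in a build without liburing");
}
#endif
```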

if (server.io_uring_enabled && server.io_threads_num == 1) {
/* Currently, we only use io_uring to handle the static buffer write requests.
* If io-threads or tls is enabled, skip the io_uring. */
return connIsTLS(c->conn) == 0 && getClientType(c) != CLIENT_TYPE_REPLICA && listLength(c->reply) == 0 &&
Contributor

These conditions don't cover RDMA. Does it work or should we exclude that too? What about other fake clients, like the fake client used from Lua?

Rather than defining a negated condition for skipping it, like "not TLS", it's usually better to have a positive condition for when it's known to work. In the future, we may add more connection types.
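
A sketch of what a positive condition could look like. Every type and function name below (`SketchConnKind`, `SketchClient`, `sketchCanWriteUsingIOUring`) is hypothetical, and treating only plain TCP sockets as known to work is an assumption made for the example. The part of the real condition that is cut off in the snippet above is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical connection kinds, for illustration only. */
typedef enum {
    SKETCH_CONN_SOCKET,
    SKETCH_CONN_UNIX,
    SKETCH_CONN_TLS,
    SKETCH_CONN_RDMA,
    SKETCH_CONN_NONE /* fake client with no connection, e.g. for scripting */
} SketchConnKind;

typedef struct {
    SketchConnKind kind;
    bool is_replica;
    size_t reply_list_len; /* entries queued beyond the static buffer */
} SketchClient;

/* Returns true only for combinations that are explicitly allowed. A new
 * connection kind added later is rejected until someone opts it in. */
bool sketchCanWriteUsingIOUring(bool io_uring_enabled, int io_threads_num, const SketchClient *c) {
    if (!io_uring_enabled || io_threads_num != 1) return false;
    if (c->kind != SKETCH_CONN_SOCKET) return false; /* allowlist, not "!TLS" */
    if (c->is_replica) return false;
    return c->reply_list_len == 0;
}
```

With this shape, TLS, RDMA, and connection-less fake clients are all excluded without being named one by one.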

Contributor

Hi,
In fact, RDMA supports only an async API. It uses a send queue, a receive queue, and a completion queue, but these are different from the io_uring queues, so io_uring can't support RDMA.

@zuiderkwast
Contributor

I realized that integrating io_uring into the connection layer would be somewhat complex. Or maybe I'm overcomplicating it? :) The existing connection implementations (socket, unix, tls, rdma) all use synchronous methods, and the high-level call stack (e.g. writeToReplica and _writeToClient) is based on this synchronous mode (input data and return the written bytes).

@lipzhu If we want to use IO uring from within the connection abstractions, maybe we can try to change the connection abstraction to an async API? I'm just speculating.

This patch attempts to submit write requests to io_uring, which will handle them asynchronously, and later retrieve all the clients’ written bytes in a batch mode. It is somewhat similar to what io-threads did. How about adding a new method like trySendWriteToIOUring in handleClientsWithPendingWrites and processIOUringWriteDone in beforeSleep?

Yes, this sounds good too.

There seem to be some similarities to IO threads. Is there any chance we can use IO uring for TLS too in the future? That would be a good thing. I like an incremental approach. We don't need a perfect solution immediately IMO.

@pizhenwei
Contributor

I realized that integrating io_uring into the connection layer would be somewhat complex. Or maybe I'm overcomplicating it? :) The existing connection implementations (socket, unix, tls, rdma) all use synchronous methods, and the high-level call stack (e.g. writeToReplica and _writeToClient) is based on this synchronous mode (input data and return the written bytes).

@lipzhu If we want to use IO uring from within the connection abstractions, maybe we can try to change the connection abstraction to an async API? I'm just speculating.

The current connection framework (sync APIs) and event handling framework (driven by both POLLIN and POLLOUT events) are designed for the classic TCP programming API. As far as I can see, a new async IO framework would change the networking code a lot. For example:

  • the write handler may be removed
  • each IO request would have its own completion callback function
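
To make the second point concrete, here is a small sketch of a request that carries its own completion callback. `SketchIORequest`, `sketchCompleteRequest`, and `sketchCountWritten` are hypothetical names, not part of the existing connection framework.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical async request: the caller supplies a callback instead of
 * waiting for a return value or for a write-ready event. */
typedef struct SketchIORequest {
    const char *buf;
    size_t len;
    void (*on_complete)(struct SketchIORequest *req, long result);
    void *privdata; /* caller-owned context passed through to the callback */
} SketchIORequest;

/* Stand-in for the backend finishing a request: result is the number of
 * bytes written, or a negative value on error. */
void sketchCompleteRequest(SketchIORequest *req, long result) {
    if (req->on_complete) req->on_complete(req, result);
}

/* Example callback: add successfully written bytes to a counter owned by
 * the caller; errors leave the counter unchanged. */
void sketchCountWritten(SketchIORequest *req, long result) {
    if (result > 0) *(size_t *)req->privdata += (size_t)result;
}
```

In this model there is no separate write handler to install or remove; progress is reported only through the callback.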

This patch attempts to submit write requests to io_uring, which will handle them asynchronously, and later retrieve all the clients’ written bytes in a batch mode. It is somewhat similar to what io-threads did. How about adding a new method like trySendWriteToIOUring in handleClientsWithPendingWrites and processIOUringWriteDone in beforeSleep?

Yes, this sounds good too.

There seem to be some similarities to IO threads. Is there any chance we can use IO uring for TLS too in the future? That would be a good thing. I like an incremental approach. We don't need a perfect solution immediately IMO.

What about trying to reuse .has_pending_data and .process_pending_data instead of adding new functions? io_uring is Linux-only, so the common framework should not need to know about it.
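
Roughly, the idea could look like the sketch below. `SketchConnectionType` is a simplified, hypothetical stand-in that keeps only the two hooks mentioned here, and the in-flight counter stands in for completions that have not been reaped yet.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical slice of a connection type: only the two
 * pending-data hooks are modeled. */
typedef struct {
    int (*has_pending_data)(void);
    int (*process_pending_data)(void);
} SketchConnectionType;

static int sketch_inflight_writes = 0; /* stands in for unreaped completions */

/* Called where a write would be queued to the ring. */
void sketchQueueWrite(void) {
    sketch_inflight_writes++;
}

int sketchUringHasPendingData(void) {
    return sketch_inflight_writes > 0;
}

/* Reaps everything in flight and returns how many items were processed. */
int sketchUringProcessPendingData(void) {
    int reaped = sketch_inflight_writes;
    sketch_inflight_writes = 0;
    return reaped;
}

const SketchConnectionType sketch_uring_socket = {
    .has_pending_data = sketchUringHasPendingData,
    .process_pending_data = sketchUringProcessPendingData,
};

/* Generic framework code: it walks the registered types and calls the
 * hooks, with no io_uring-specific knowledge. */
int sketchProcessPending(const SketchConnectionType *types[], size_t ntypes) {
    int total = 0;
    for (size_t i = 0; i < ntypes; i++) {
        const SketchConnectionType *ct = types[i];
        if (ct && ct->has_pending_data && ct->has_pending_data()) total += ct->process_pending_data();
    }
    return total;
}
```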

* point to the shared memory containing the io_uring queues.
* On failure -errno is returned. */
if (io_uring_queue_init_params(IO_URING_DEPTH, _io_uring, &params) < 0) return IO_URING_ERR;
return IO_URING_OK;
Contributor

Handle IO_URING_OK only inside io_uring.c; code outside it should not deal with io_uring-specific details any more. Please convert IO_URING_OK to C_OK, and likewise IO_URING_ERR to C_ERR.
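
Something along these lines, as a sketch only. The `sketch`-prefixed names and the `simulated_ret` parameter are hypothetical; the parameter stands in for the value the real setup call would return.

```c
#include <assert.h>

/* Generic status codes shared with the rest of the server. */
#define C_OK 0
#define C_ERR -1

/* Module-private codes: in this sketch they never leave the file. */
enum { SKETCH_IO_URING_OK = 0, SKETCH_IO_URING_ERR = -1 };

/* Hypothetical low-level setup; simulated_ret plays the role of the
 * liburing return value (negative on failure). */
static int sketchRingSetup(int simulated_ret) {
    return simulated_ret < 0 ? SKETCH_IO_URING_ERR : SKETCH_IO_URING_OK;
}

/* Public entry point: callers only ever see C_OK or C_ERR. */
int sketchInitIOUring(int simulated_ret) {
    return sketchRingSetup(simulated_ret) == SKETCH_IO_URING_OK ? C_OK : C_ERR;
}
```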

#define UNUSED(V) ((void)V)
#endif

int initIOUring(void) {
Contributor

This function is called if server.io_uring_enabled is true, which means the user specified io-uring-enabled yes. So I think it should log an error here and return an error, rather than fail silently.
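
For example, a sketch under the assumption that the setup call returns -errno on failure, as the code comment in the snippet above says. `sketchReportIOUringInit` is a hypothetical helper that formats the message into a buffer; the real code would pass the text to the server's logging function instead.

```c
#include <assert.h>
#include <errno.h> /* for errno constants such as ENOSYS used by callers */
#include <stdio.h>
#include <string.h>

#define C_OK 0
#define C_ERR -1

/* ret is the value returned by the ring setup call: 0 on success or
 * -errno on failure. On failure a human-readable message is written to
 * msg and C_ERR is returned; on success msg is left empty. */
int sketchReportIOUringInit(int ret, char *msg, size_t msglen) {
    if (ret < 0) {
        snprintf(msg, msglen, "Failed to initialize io_uring: %s", strerror(-ret));
        return C_ERR;
    }
    if (msglen > 0) msg[0] = '\0';
    return C_OK;
}
```

A user who sets io-uring-enabled yes on a kernel without io_uring support would then see the reason in the log instead of the feature quietly staying off.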

@pizhenwei
Contributor

There seem to be some similarities to IO threads. Is there any chance we can use IO uring for TLS too in the future? That would be a good thing. I like an incremental approach. We don't need a perfect solution immediately IMO.

@zuiderkwast
Agreed! An async connection framework would be a long-term topic.

6 participants