
Blog about io_thread performance contribution #98

Closed
wants to merge 1 commit into from

Conversation

touitou-dan
Contributor

Description

Adding a blog describing, at a high level, the performance contributions for Valkey 8.

Issues Resolved

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.

Member

@stockholmux stockholmux left a comment


There are a lot of interesting things in this blog post, but it needs a little more work before it's ready.

A few general comments:

  • The preferred format is one sentence per line (it makes reviews easier).
  • There are a few places where you talk about the AWS customer, but this post could zoom out a little to talk about how this benefits the project.

content/authors/dantouitou.md
# Each author corresponds to a biography file (more info later in this document)
authors= [ "dantouitou", "uriyagelnik"]
+++
## AWS to Contribute Efficiency Improvements for Valkey 8
Member

No need for this line. It makes a redundant heading.


From simple in-memory caching implementations to complex job queues, real-time collaboration, and leaderboards applications, we at AWS are continually amazed by how innovatively users employ Valkey.

Clearly, this is just the tip of the iceberg. As more use cases and industries want to benefit from the speed, low latency, and cost reduction advantages of in-memory processing as introduced by Valkey, we are fully committed to this vision.
Member

I'm not sure I understand what "this vision" means.

Member

Maybe just need to modify the wording to better indicate that the vision is performance.

@@ -0,0 +1,67 @@
+++
# `title` is how your post will be listed and what will appear at the top of the post
title= "AWS to Contribute Efficiency Improvements to Valkey 8 "
Member

'AWS to Contribute' indicates that it hasn't yet contributed this functionality. If that is the case, it would be better to release this blog when there is a contribution it can link to.

Member

We have the PR now here: valkey-io/valkey#758.

Member

I also don't think this is a great title. "AWS contributes significant performance improvement to Valkey 8" seems a lot more exciting.


### Our Commitment to Efficiency

One of our primary goals at AWS is to ensure our customers receive the most efficient services. Efficiency not only leads to lower costs, better latency and a greener environment, but also enhances resilience.
Member

"greener environment" is used here and in the summary. Can you back this up?


The main thread orchestrates all the jobs spawned to the io_threads, ensuring that no race conditions occur. Io_threads can be easily added and removed by the main thread based on the current load to ensure efficient utilization of the underlying hardware. Despite the dynamic nature of io_threads, the main thread attempts to maintain thread affinity, ensuring that the same io_thread will handle IO for the same client to improve memory access locality.

Before executing commands, the main thread performs a new procedure, prefetch-commands-keys, which aims to reduce the number of external memory accesses needed when executing the commands on the main dictionary. A detailed explanation of the technique used in that procedure will be described in our next blog.
Member

'A detailed explanation...' I would just take that sentence out. Someone interested in the subject will find it later.

Member

I would consider dropping this whole paragraph, since it's missing from the current set of PRs. We can produce blogs as necessary talking about our performance improvements.
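For readers who want a concrete picture of the prefetch idea in the quoted paragraph, here is a minimal C sketch. All names are invented stand-ins for the real dictionary code, and the actual prefetch-commands-keys procedure in the PR differs; the point is only the two-pass shape: issue prefetch hints for every key's bucket first, then do the lookups against (hopefully) warm cache lines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct entry { const char *key; void *val; struct entry *next; } entry;
typedef struct dict { entry **table; size_t mask; } dict; /* mask = size - 1 */

/* FNV-1a, standing in for the real hash function. */
static uint64_t hash_key(const char *k) {
    uint64_t h = 1469598103934665603ULL;
    while (*k) { h ^= (unsigned char)*k++; h *= 1099511628211ULL; }
    return h;
}

static entry *dict_lookup(dict *d, const char *k) {
    entry *e = d->table[hash_key(k) & d->mask];
    while (e && strcmp(e->key, k) != 0) e = e->next;
    return e;
}

/* Two-pass batched lookup: pass 1 issues prefetch hints so the cache
 * misses of the whole batch overlap; pass 2 performs the real lookups. */
void lookup_batch(dict *d, const char **keys, entry **out, size_t n) {
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&d->table[hash_key(keys[i]) & d->mask], 0, 1);
    for (size_t i = 0; i < n; i++)
        out[i] = dict_lookup(d, keys[i]);
}
```

Without the first pass, each lookup pays its cache miss serially; with it, the memory requests for the whole batch are in flight concurrently, which is how batching can hide most of the dictionary's memory latency.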



Socket polling system calls, such as epoll_wait, are expensive procedures. When executed solely by the main thread, epoll_wait consumes more than 20 percent of the time. Therefore, we decided to offload epoll_wait execution to the io_threads in the following way: to avoid race conditions, at any time, at most one thread, either an io_thread or the main thread, is executing epoll_wait. Io_threads never sleep on epoll, and whenever there are pending IO operations or commands to be executed, epoll_wait calls are scheduled to the io_threads by the main thread. In all other cases, the main thread executes the epoll_wait with the waiting time as in the original Valkey implementation
Member

'epoll_wait' should be in backticks for all mentions (maybe also epoll).



### Future Enhancements
Member

I think this section can be removed. What should the reader of this blog post do next? Can they read the PR? Can they download the development branch and try it out?

github: uriyage
---

Uri is a software engineer at AWS
Member

Uri needs a more complete bio if possible.

---
title: Uri Yagelnik
extra:
photo: '/assets/media/authors/uriyagelnik.png'
Member

A real photo would be preferred, but not required.

Member

@madolson madolson left a comment


Agree with a lot of what Kyle said; I added some more concrete suggestions in a bunch of sections.

![io_threads high level design](/assets/media/pictures/io_threads.png)

### High Level Design
The above diagram depicts the io_threads implementation from a high-level perspective. Io_threads are stateless worker threads that receive jobs to execute from the main thread. In Valkey 8, a job can involve reading and parsing a command from a client, writing back responses to the client, polling for IO events on TCP connections, or de-allocating memory. This leaves the main thread with more time to execute commands.
Member

Suggested change
The above diagram depicts the io_threads implementation from a high-level perspective. Io_threads are stateless worker threads that receive jobs to execute from the main thread. In Valkey 8, a job can involve reading and parsing a command from a client, writing back responses to the client, polling for IO events on TCP connections, or de-allocating memory. This leaves the main thread with more time to execute commands.
The above diagram depicts the high-level design of how IO threading processes work in Valkey 8.
IO threads are worker threads that receive jobs to execute from the main thread.
A job can involve reading and parsing a command from a client, writing back responses to the client, polling for IO events on TCP connections, or de-allocating memory.
While IO threads are busy handling IO, the main thread is able to spend more time executing commands.
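
As a rough mental model of that job handoff, consider this minimal C sketch. The type and function names are invented for illustration (the PR's actual structures differ): the main thread pushes small job descriptors onto a per-thread lock-free list and immediately returns to executing commands.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative job descriptor: what kind of I/O work to do, and for whom. */
typedef enum { JOB_READ, JOB_WRITE, JOB_POLL, JOB_FREE } job_type;

typedef struct job {
    job_type type;
    void *client;          /* connection this job operates on */
    struct job *next;
} job;

/* One queue per I/O thread; the main thread is the only producer. */
typedef struct io_thread {
    pthread_t tid;
    _Atomic(job *) head;   /* lock-free LIFO push is enough for a sketch */
} io_thread;

/* Main thread: hand a job to an I/O thread and return immediately,
 * staying free to keep executing commands. */
static void submit_job(io_thread *t, job *j) {
    j->next = atomic_load(&t->head);
    while (!atomic_compare_exchange_weak(&t->head, &j->next, j)) { }
}
```

The property worth noticing is that the main thread never blocks on I/O here; it only enqueues descriptors, which is what leaves it more time for command execution.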


The main thread orchestrates all the jobs spawned to the io_threads, ensuring that no race conditions occur. Io_threads can be easily added and removed by the main thread based on the current load to ensure efficient utilization of the underlying hardware. Despite the dynamic nature of io_threads, the main thread attempts to maintain thread affinity, ensuring that the same io_thread will handle IO for the same client to improve memory access locality.
Member

@madolson madolson Jul 8, 2024


Suggested change
The main thread orchestrates all the jobs spawned to the io_threads, ensuring that no race conditions occur. Io_threads can be easily added and removed by the main thread based on the current load to ensure efficient utilization of the underlying hardware. Despite the dynamic nature of io_threads, the main thread attempts to maintain thread affinity, ensuring that the same io_thread will handle IO for the same client to improve memory access locality.
The main thread orchestrates all the jobs spawned to the I/O threads, ensuring that no race conditions occur.
The number of active I/O threads can be changed by the main thread based on the current load to ensure efficient utilization of the underlying hardware.
Despite the dynamic nature of I/O threads, the main thread attempts to maintain thread affinity, ensuring that the same I/O thread will handle I/O for the same client to improve memory access locality.

Contributor


"IO threads" or "I/O threads"?

Wikipedia says

input/output (I/O, i/o, or informally io or IO)

Member

Sure, I'll update all my suggestions to I/O threads.
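
To make the affinity behavior described in the suggestion above concrete, one simple scheme (purely illustrative, not necessarily what the PR implements) is a stable mapping from client id to thread index:

```c
#include <stddef.h>
#include <stdint.h>

/* As long as num_active_threads is unchanged, a client is always served
 * by the same I/O thread, keeping its buffers warm in one thread's cache. */
static size_t thread_for_client(uint64_t client_id, size_t num_active_threads) {
    return (size_t)(client_id % num_active_threads);
}
```

Resizing the pool perturbs this mapping, which matches the blog's wording that the main thread only attempts to maintain affinity.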





Socket polling system calls, such as epoll_wait, are expensive procedures. When executed solely by the main thread, epoll_wait consumes more than 20 percent of the time. Therefore, we decided to offload epoll_wait execution to the io_threads in the following way: to avoid race conditions, at any time, at most one thread, either an io_thread or the main thread, is executing epoll_wait. Io_threads never sleep on epoll, and whenever there are pending IO operations or commands to be executed, epoll_wait calls are scheduled to the io_threads by the main thread. In all other cases, the main thread executes the epoll_wait with the waiting time as in the original Valkey implementation
Member

@madolson madolson Jul 8, 2024


Suggested change
Socket polling system calls, such as epoll_wait, are expensive procedures. When executed solely by the main thread, epoll_wait consumes more than 20 percent of the time. Therefore, we decided to offload epoll_wait execution to the io_threads in the following way: to avoid race conditions, at any time, at most one thread, either an io_thread or the main thread, is executing epoll_wait. Io_threads never sleep on epoll, and whenever there are pending IO operations or commands to be executed, epoll_wait calls are scheduled to the io_threads by the main thread. In all other cases, the main thread executes the epoll_wait with the waiting time as in the original Valkey implementation
Socket polling system calls, such as epoll_wait, are expensive procedures.
When executed solely by the main thread, epoll_wait consumes more than 20 percent of the CPU time of the process.
Therefore, we offload `epoll_wait` execution to the I/O thread when necessary by scheduling an epoll job from the main thread to an I/O thread.
To avoid race conditions, the main thread will no longer execute `epoll_wait` until the poll job has completed, ensuring that only one thread is executing the `epoll_wait` at a given time.
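
A toy version of the single-poller rule in this suggestion could look like the C sketch below. The flag and function names are invented, and the real coordination lives in the PR; the invariant shown is that only the main thread sets the in-flight flag, so at most one thread is ever inside `epoll_wait`, and the offloaded poll uses a zero timeout so I/O threads never sleep in epoll.

```c
#include <stdatomic.h>
#include <sys/epoll.h>

#define MAX_EVENTS 128

static atomic_bool poll_in_flight; /* set by main thread, cleared by I/O thread */

/* Runs on an I/O thread: non-blocking poll (timeout 0), so I/O threads
 * never sleep inside epoll. */
void io_thread_poll_job(int epfd, struct epoll_event *events) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, 0);
    /* ... hand the n ready events back to the main thread ... */
    (void)n;
    atomic_store(&poll_in_flight, false);
}

/* Runs on the main thread each event-loop iteration. */
void main_thread_poll(int epfd, struct epoll_event *events,
                      int timeout_ms, int have_pending_work) {
    if (have_pending_work && !atomic_exchange(&poll_in_flight, true)) {
        /* schedule io_thread_poll_job() on an I/O thread (queue not shown) */
    } else if (!atomic_load(&poll_in_flight)) {
        /* No poll job in flight: poll here, blocking with a timeout
         * exactly as the original single-threaded implementation did. */
        epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
    }
}
```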


### Performance Without Compromising Simplicity

Implementing multi-threading can be a complex task. Over the years, contributors have been careful to maintain Redis’ and now Valkey’s simplicity . This ensures an API that can continuously evolve without the need to use complex synchronization and avoid race conditions.
Member

Suggested change
Implementing multi-threading can be a complex task. Over the years, contributors have been careful to maintain Redis’ and now Valkey’s simplicity . This ensures an API that can continuously evolve without the need to use complex synchronization and avoid race conditions.
Implementing multi-threading can be a complex task.
Valkey strives to stay simple by executing as much code in a single thread as possible.
This ensures an API that can continuously evolve without the need to use complex synchronization and avoid race conditions.

+++
## AWS to Contribute Efficiency Improvements for Valkey 8

From simple in-memory caching implementations to complex job queues, real-time collaboration, and leaderboards applications, we at AWS are continually amazed by how innovatively users employ Valkey.
Member

Suggested change
From simple in-memory caching implementations to complex job queues, real-time collaboration, and leaderboards applications, we at AWS are continually amazed by how innovatively users employ Valkey.
From simple in-memory caching implementations to complex job queues, real-time collaboration, and leaderboard applications, we at AWS are continually amazed by how innovatively users employ Valkey.


Clearly, this is just the tip of the iceberg. As more use cases and industries want to benefit from the speed, low latency, and cost reduction advantages of in-memory processing as introduced by Valkey, we are fully committed to this vision.

### Our Commitment to Efficiency
Member

Suggested change
### Our Commitment to Efficiency
### Our Commitment to Performance Efficiency

We have two efficiency threads, memory density and performance; I think we should be clear about which one we're optimizing here.


We are excited about the new Linux Foundation sponsorship for Valkey and are taking a bigger step, contributing our major performance improvements and expertise. Starting with version 8, Valkey users will benefit from a breakthrough in performance, thanks to a new multi-threading implementation that can considerably boost performance on a wide range of hardware types.

For caching workloads, Valkey users will be able to increase maximum requests per second to over 1 million on multi-core machines such as AWS EC2 r7g.4xl
Member

We need to couch this statement as well, because in OSS we talk a lot about pipeline performance, and Valkey can already do 1M+ RPS per process with batching.

@uriyage uriyage mentioned this pull request Jul 10, 2024
@uriyage
Collaborator

uriyage commented Jul 10, 2024

I couldn't update the current PR due to permission issues, so I addressed the PR comments in a new PR: #102.


@uriyage uriyage closed this Jul 10, 2024
madolson added a commit that referenced this pull request Jul 31, 2024
### Description

Addressed PR comments for I/O threads blog post.
Previous PR: #98

### Issues Resolved
-

### Check List
- [x] Commits are signed per the DCO using `--signoff`

By submitting this pull request, I confirm that my contribution is made
under the terms of the BSD-3-Clause License.

---------

Signed-off-by: Dan Touitou <[email protected]>
Signed-off-by: Uri Yagelnik <[email protected]>
Co-authored-by: Dan Touitou <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
madolson added a commit that referenced this pull request Aug 5, 2024