
Blog about io_thread performance contribution #98

Closed
wants to merge 1 commit into from

Conversation

touitou-dan
Contributor

Description

Adding a blog describing, at a high level, the performance contributions for Valkey 8.

Issues Resolved

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.

Member

@stockholmux stockholmux left a comment


There are a lot of interesting things in this blog post, but it needs a little more work before it's ready.

A few general comments:

  • The preferred format is one sentence per line (it makes reviews easier).
  • There are a few places where you talk about the AWS customer, but this post could zoom out a little to talk about how this benefits the project.

content/authors/dantouitou.md
# Each author corresponds to a biography file (more info later in this document)
authors= [ "dantouitou", "uriyagelnik"]
+++
## AWS to Contribute Efficiency Improvements for Valkey 8
Member

No need for this line. It makes a redundant heading.


From simple in-memory caching implementations to complex job queues, real-time collaboration, and leaderboards applications, we at AWS are continually amazed by how innovatively users employ Valkey.

Clearly, this is just the tip of the iceberg. As more use cases and industries want to benefit from the speed, low latency, and cost reduction advantages of in-memory processing as introduced by Valkey, we are fully committed to this vision.
Member

I'm not sure I understand what "this vision" means.

Member

Maybe just need to modify the wording to better indicate that the vision is performance.

@@ -0,0 +1,67 @@
+++
# `title` is how your post will be listed and what will appear at the top of the post
title= "AWS to Contribute Efficiency Improvements to Valkey 8 "
Member

'AWS to Contribute' indicates that it hasn't yet contributed this functionality. If that is the case, it would be better to release this blog when there is a contribution it can link to.

Member

We have the PR now here: valkey-io/valkey#758.

Member

I also don't think this is a great title. "AWS contributes significant performance improvement to Valkey 8" seems a lot more exciting.


### Our Commitment to Efficiency

One of our primary goals at AWS is to ensure our customers receive the most efficient services. Efficiency not only leads to lower costs, better latency and a greener environment, but also enhances resilience.
Member

"greener environment" is used here and in the summary. Can you back this up?


The main thread orchestrates all the jobs spawned to the io_threads, ensuring that no race conditions occur. Io_threads can be easily added and removed by the main thread based on the current load to ensure efficient utilization of the underlying hardware. Despite the dynamic nature of io_threads, the main thread attempts to maintain thread affinity, ensuring that the same io_thread will handle IO for the same client to improve memory access locality.

Before executing commands, the main thread performs a new procedure, prefetch-commands-keys, which aims to reduce the number of external memory accesses needed when executing the commands on the main dictionary. A detailed explanation of the technique used in that procedure will be described in our next blog.
Member

'A detailed explanation...' I would just take that sentence out. Someone interested in the subject will find it later.

Member

I would consider dropping this whole paragraph, since it's missing from the current set of PRs. We can produce blogs as necessary talking about our performance improvements.
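For readers who want a concrete picture of the prefetch idea in the quoted paragraph, here is a minimal C sketch. All names are invented stand-ins for the real dictionary code, and the actual prefetch-commands-keys procedure in the PR differs; the point is only the two-pass shape: issue prefetch hints for every key's bucket first, then do the lookups against (hopefully) warm cache lines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct entry { const char *key; void *val; struct entry *next; } entry;
typedef struct dict { entry **table; size_t mask; } dict; /* mask = size - 1 */

/* FNV-1a, standing in for the real hash function. */
static uint64_t hash_key(const char *k) {
    uint64_t h = 1469598103934665603ULL;
    while (*k) { h ^= (unsigned char)*k++; h *= 1099511628211ULL; }
    return h;
}

static entry *dict_lookup(dict *d, const char *k) {
    entry *e = d->table[hash_key(k) & d->mask];
    while (e && strcmp(e->key, k) != 0) e = e->next;
    return e;
}

/* Two-pass batched lookup: pass 1 issues prefetch hints so the cache
 * misses of the whole batch overlap; pass 2 performs the real lookups. */
void lookup_batch(dict *d, const char **keys, entry **out, size_t n) {
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&d->table[hash_key(keys[i]) & d->mask], 0, 1);
    for (size_t i = 0; i < n; i++)
        out[i] = dict_lookup(d, keys[i]);
}
```

Without the first pass, each lookup pays its cache miss serially; with it, the memory requests for the whole batch are in flight concurrently, which is how batching can hide most of the dictionary's memory latency.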



Socket polling system calls, such as epoll_wait, are expensive procedures. When executed solely by the main thread, epoll_wait consumes more than 20 percent of the time. Therefore, we decided to offload epoll_wait execution to the io_threads in the following way: to avoid race conditions, at any time, at most one thread, either an io_thread or the main thread, is executing epoll_wait. Io_threads never sleep on epoll, and whenever there are pending IO operations or commands to be executed, epoll_wait calls are scheduled to the io_threads by the main thread. In all other cases, the main thread executes the epoll_wait with the waiting time as in the original Valkey implementation
Member

'epoll_wait' should be in backticks for all mentions (maybe also epoll).



### Future Enhancements
Member

I think this section can be removed. What should the reader of this blog post do next? Can they read the PR? Can they download the development branch and try it out?

github: uriyage
---

Uri is a software engineer at AWS
Member

Uri needs a more complete bio if possible.

---
title: Uri Yagelnik
extra:
photo: '/assets/media/authors/uriyagelnik.png'
Member

A real photo would be preferred, but not required.

Member

@madolson madolson left a comment


Agree with a lot of what Kyle said; I added some more concrete suggestions in a bunch of sections.

![io_threads high level design](/assets/media/pictures/io_threads.png)

### High Level Design
The above diagram depicts the io_threads implementation from a high-level perspective. Io_threads are stateless worker threads that receive jobs to execute from the main thread. In Valkey 8, a job can involve reading and parsing a command from a client, writing back responses to the client, polling for IO events on TCP connections, or de-allocating memory. This leaves the main thread with more time to execute commands.
Member

Suggested change
The above diagram depicts the io_threads implementation from a high-level perspective. Io_threads are stateless worker threads that receive jobs to execute from the main thread. In Valkey 8, a job can involve reading and parsing a command from a client, writing back responses to the client, polling for IO events on TCP connections, or de-allocating memory. This leaves the main thread with more time to execute commands.
The above diagram depicts the high-level design of how IO threading processes work in Valkey 8.
IO threads are worker threads that receive jobs to execute from the main thread.
A job can involve reading and parsing a command from a client, writing back responses to the client, polling for IO events on TCP connections, or de-allocating memory.
While IO threads are busy handling IO, the main thread is able to spend more time executing commands.
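
As a rough mental model of that job handoff, consider this minimal C sketch. The type and function names are invented for illustration (the PR's actual structures differ): the main thread pushes small job descriptors onto a per-thread lock-free list and immediately returns to executing commands.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative job descriptor: what kind of I/O work to do, and for whom. */
typedef enum { JOB_READ, JOB_WRITE, JOB_POLL, JOB_FREE } job_type;

typedef struct job {
    job_type type;
    void *client;          /* connection this job operates on */
    struct job *next;
} job;

/* One queue per I/O thread; the main thread is the only producer. */
typedef struct io_thread {
    pthread_t tid;
    _Atomic(job *) head;   /* lock-free LIFO push is enough for a sketch */
} io_thread;

/* Main thread: hand a job to an I/O thread and return immediately,
 * staying free to keep executing commands. */
static void submit_job(io_thread *t, job *j) {
    j->next = atomic_load(&t->head);
    while (!atomic_compare_exchange_weak(&t->head, &j->next, j)) { }
}
```

The property worth noticing is that the main thread never blocks on I/O here; it only enqueues descriptors, which is what leaves it more time for command execution.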


The main thread orchestrates all the jobs spawned to the io_threads, ensuring that no race conditions occur. Io_threads can be easily added and removed by the main thread based on the current load to ensure efficient utilization of the underlying hardware. Despite the dynamic nature of io_threads, the main thread attempts to maintain thread affinity, ensuring that the same io_thread will handle IO for the same client to improve memory access locality.
Member

@madolson madolson Jul 8, 2024


Suggested change
The main thread orchestrates all the jobs spawned to the io_threads, ensuring that no race conditions occur. Io_threads can be easily added and removed by the main thread based on the current load to ensure efficient utilization of the underlying hardware. Despite the dynamic nature of io_threads, the main thread attempts to maintain thread affinity, ensuring that the same io_thread will handle IO for the same client to improve memory access locality.
The main thread orchestrates all the jobs spawned to the I/O threads, ensuring that no race conditions occur.
The number of active I/O threads can be changed by the main thread based on the current load to ensure efficient utilization of the underlying hardware.
Despite the dynamic nature of I/O threads, the main thread attempts to maintain thread affinity, ensuring that the same I/O thread will handle I/O for the same client to improve memory access locality.

Contributor


"IO threads" or "I/O threads"?

Wikipedia says

input/output (I/O, i/o, or informally io or IO)

Member

Sure, I'll update all my suggestions to I/O threads.
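
To make the affinity behavior described in the suggestion above concrete, one simple scheme (purely illustrative, not necessarily what the PR implements) is a stable mapping from client id to thread index:

```c
#include <stddef.h>
#include <stdint.h>

/* As long as num_active_threads is unchanged, a client is always served
 * by the same I/O thread, keeping its buffers warm in one thread's cache. */
static size_t thread_for_client(uint64_t client_id, size_t num_active_threads) {
    return (size_t)(client_id % num_active_threads);
}
```

Resizing the pool perturbs this mapping, which matches the blog's wording that the main thread only attempts to maintain affinity.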





Socket polling system calls, such as epoll_wait, are expensive procedures. When executed solely by the main thread, epoll_wait consumes more than 20 percent of the time. Therefore, we decided to offload epoll_wait execution to the io_threads in the following way: to avoid race conditions, at any time, at most one thread, either an io_thread or the main thread, is executing epoll_wait. Io_threads never sleep on epoll, and whenever there are pending IO operations or commands to be executed, epoll_wait calls are scheduled to the io_threads by the main thread. In all other cases, the main thread executes the epoll_wait with the waiting time as in the original Valkey implementation
Member

@madolson madolson Jul 8, 2024


Suggested change
Socket polling system calls, such as epoll_wait, are expensive procedures. When executed solely by the main thread, epoll_wait consumes more than 20 percent of the time. Therefore, we decided to offload epoll_wait execution to the io_threads in the following way: to avoid race conditions, at any time, at most one thread, either an io_thread or the main thread, is executing epoll_wait. Io_threads never sleep on epoll, and whenever there are pending IO operations or commands to be executed, epoll_wait calls are scheduled to the io_threads by the main thread. In all other cases, the main thread executes the epoll_wait with the waiting time as in the original Valkey implementation
Socket polling system calls, such as epoll_wait, are expensive procedures.
When executed solely by the main thread, epoll_wait consumes more than 20 percent of the CPU time of the process.
Therefore, we offload `epoll_wait` execution to the I/O thread when necessary by scheduling an epoll job from the main thread to an I/O thread.
To avoid race conditions, the main thread will no longer execute `epoll_wait` until the poll job has completed, ensuring that only one thread is executing the `epoll_wait` at a given time.
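
A toy version of the single-poller rule in this suggestion could look like the C sketch below. The flag and function names are invented, and the real coordination lives in the PR; the invariant shown is that only the main thread sets the in-flight flag, so at most one thread is ever inside `epoll_wait`, and the offloaded poll uses a zero timeout so I/O threads never sleep in epoll.

```c
#include <stdatomic.h>
#include <sys/epoll.h>

#define MAX_EVENTS 128

static atomic_bool poll_in_flight; /* set by main thread, cleared by I/O thread */

/* Runs on an I/O thread: non-blocking poll (timeout 0), so I/O threads
 * never sleep inside epoll. */
void io_thread_poll_job(int epfd, struct epoll_event *events) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, 0);
    /* ... hand the n ready events back to the main thread ... */
    (void)n;
    atomic_store(&poll_in_flight, false);
}

/* Runs on the main thread each event-loop iteration. */
void main_thread_poll(int epfd, struct epoll_event *events,
                      int timeout_ms, int have_pending_work) {
    if (have_pending_work && !atomic_exchange(&poll_in_flight, true)) {
        /* schedule io_thread_poll_job() on an I/O thread (queue not shown) */
    } else if (!atomic_load(&poll_in_flight)) {
        /* No poll job in flight: poll here, blocking with a timeout
         * exactly as the original single-threaded implementation did. */
        epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
    }
}
```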


### Performance Without Compromising Simplicity

Implementing multi-threading can be a complex task. Over the years, contributors have been careful to maintain Redis’ and now Valkey’s simplicity . This ensures an API that can continuously evolve without the need to use complex synchronization and avoid race conditions.
Member

Suggested change
Implementing multi-threading can be a complex task. Over the years, contributors have been careful to maintain Redis’ and now Valkey’s simplicity . This ensures an API that can continuously evolve without the need to use complex synchronization and avoid race conditions.
Implementing multi-threading can be a complex task.
Valkey strives to stay simple by executing as much code in a single thread as possible.
This ensures an API that can continuously evolve without the need to use complex synchronization and avoid race conditions.

+++
## AWS to Contribute Efficiency Improvements for Valkey 8

From simple in-memory caching implementations to complex job queues, real-time collaboration, and leaderboards applications, we at AWS are continually amazed by how innovatively users employ Valkey.
Member

Suggested change
From simple in-memory caching implementations to complex job queues, real-time collaboration, and leaderboards applications, we at AWS are continually amazed by how innovatively users employ Valkey.
From simple in-memory caching implementations to complex job queues, real-time collaboration, and leaderboard applications, we at AWS are continually amazed by how innovatively users employ Valkey.


Clearly, this is just the tip of the iceberg. As more use cases and industries want to benefit from the speed, low latency, and cost reduction advantages of in-memory processing as introduced by Valkey, we are fully committed to this vision.

### Our Commitment to Efficiency
Member

Suggested change
### Our Commitment to Efficiency
### Our Commitment to Performance Efficiency

We have two efficiency threads, memory density and performance; I think we should be clear about which one we're optimizing here.


We are excited about the new Linux Foundation sponsorship for Valkey and are taking a bigger step, contributing our major performance improvements and expertise. Starting with version 8, Valkey users will benefit from a breakthrough in performance, thanks to a new multi-threading implementation that can considerably boost performance on a wide range of hardware types.

For caching workloads, Valkey users will be able to increase maximum requests per second to over 1 million on multi-core machines such as AWS EC2 r7g.4xl
Member

We need to couch this statement as well, because in OSS we talk a lot about pipeline performance, and Valkey can already do 1M+ RPS per process with batching.

@uriyage uriyage mentioned this pull request Jul 10, 2024
@uriyage
Collaborator

uriyage commented Jul 10, 2024

I couldn't update the current PR due to permission issues, so I addressed the PR comments in a new PR: #102.


@uriyage uriyage closed this Jul 10, 2024
madolson added a commit that referenced this pull request Jul 31, 2024
### Description

Addressed PR comments for I/O threads blog post.
Previous PR: #98

### Issues Resolved
-

### Check List
- [x] Commits are signed per the DCO using `--signoff`

By submitting this pull request, I confirm that my contribution is made
under the terms of the BSD-3-Clause License.

---------

Signed-off-by: Dan Touitou <[email protected]>
Signed-off-by: Uri Yagelnik <[email protected]>
Co-authored-by: Dan Touitou <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
madolson added a commit that referenced this pull request Aug 5, 2024