Introduce thread-local storage variable to update thread's own used_memory and sum when reading to reduce atomic contention. #674
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##           unstable      #674      +/-  ##
============================================
- Coverage      70.21%    70.11%    -0.11%
============================================
  Files            110       111        +1
  Lines          60100     60233      +133
============================================
+ Hits           42202     42231       +29
- Misses         17898     18002      +104
…hen update used_memory metrics. Signed-off-by: Lipeng Zhu <[email protected]>
This may be the better alternative, I believe. It achieves the same performance(?) and avoids the errors.
src/zmalloc.c
Outdated
    assert(total_active_threads < MAX_THREADS_NUM);
    /* Sum the per-thread counters; each slot has a single writer, so the
     * reader may see a slightly stale total but never a corrupted one. */
    size_t um = 0;
    for (int i = 0; i < total_active_threads; i++) {
        um += used_memory_thread[i];
    }
I am not sure I get this part. Without atomic operations on both readers and writers, I think we will get stale data?
@PingXie Yes, the worst case is when the read thread is reading while a write thread is writing to its thread-local storage variable, but the data will not be lost, just stale.
That makes sense. So we are essentially saying/assuming the staleness is (tightly) bounded, which I think is reasonable.
LGTM overall. Will sign off after you address my comments
Signed-off-by: Lipeng Zhu <[email protected]> Co-authored-by: Wangyang Guo <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
I think reducing atomic updates is great. I just think that we have a very big PR coming which will address similar issues and make lots of related changes, so I suggest we wait for it instead? (@uriyage, @madolson, @zuiderkwast FYI)
@ranshid For zmalloc? I'm not aware of any such PR. The Async IO threading PRs don't affect zmalloc AFAIK.
I think it might be easier to consider the atomic changes for threads after the Async IO PR (maybe?). All the parts about dynamically adding and removing threads MIGHT have an impact. @zuiderkwast, no concrete concern at this point.
The only overlap between these two PRs is that the CACHE_LINE_SIZE definition is moved to config.h, so it will be a small merge conflict but not a serious one. I don't think we should hold off. It slows down the project. Adjusting the loop to only include the active threads is a minor optimization we can do later.
As I said, I did not observe any specific issue, I just had a general feeling of dependency. I just do not want the extra work of adjusting the Async IO PR after merging this one, but maybe I am just paranoid. I also have some questions/suggestions about this PR:
We have a macro
:) The story is a bit long: the initial version of #308 is the same as this patch, and the perf boost listed in the top comment is based on that initial version. But after some discussions in #467, we thought we could be more aggressive, and #308 evolved into what it is now. However, @soloestoy had concerns about the delta proposal, so as discussed with @zuiderkwast, we fell back to the thread-local array proposal in this PR to let others vote. I will update the top comment once we have a decision.
Ditto.
Generally LGTM :)
I am sorry @lipzhu, I probably missed it while I was scanning the code. This works fine together with the assert.
By "thread-local array" you mean used_memory_threa? I did not see how this is a thread local, which is what I meant before by having multiple threads writing to the same memory area. Anyway I guess if you benchmarked it and observed a performance gain, no reason to hold you back. Thank you for this contribution! |
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
@PingXie I just retested this patch in my local environment and updated the top comment with performance boost info.
The new code LGTM, thanks. The comments and discussions were helpful.
Thanks @lipzhu! Curious: was memtier run on the same machine or over the network? Can you try a random-access, non-pipelined workload with an 8:2 or 7:3 get/set ratio? Please set the value size to 1024 and pre-load the keys so that there are no cache misses. I don't know what client count was used in your earlier tests, but hopefully it was significant enough to keep the server CPUs busy. At some point, we will need to get real with #368 and standardize on our performance benchmarks ...
@PingXie memtier and the server are on the same machine, over the loopback device.
I used the memtier commands below according to your requirements. I didn't observe an obvious performance boost (~1%); considering the variance, the performance should be the same. And from the baseline's profiling result, this result is sensible, because the zmalloc/zfree cycles ratio is low, so there is not much room for improvement in this test scenario.
We always make the server CPU utilization 100% when setting up benchmark clients.
Thanks for the additional context.
This is a very reasonable setup to eliminate the network variance but I think it does mean that the performance boost is amplified compared to a typical setup because network latency plays a significant role too.
I am not sure how to interpret this. I was under the impression that the zmalloc/zfree count would be proportional to the number of requests, which is both 1M in this test and your earlier tests. I now wonder if pipelining (with a queue depth of 10) further squeezes out other overhead (such as the
I don't have a high-speed network at hand right now, but I guess the behavior should be similar if the network is not a bottleneck. Maybe we can standardize the performance testing environment later.
@PingXie Yes, you are right. This kind of optimization strategy can be explained easily in the single-threaded (sequential) model. Pipelining reduces the read/write socket syscalls, so we have more CPU resources to handle the commands, which increases the number of zmalloc/zfree calls, and the total cycles ratio of those instructions increases correspondingly. Then our optimization of using
static _Atomic size_t used_memory = 0;

/* Register the thread index in start_routine. */
static inline void zmalloc_register_thread_index(void) {
    thread_index = atomic_fetch_add_explicit(&total_active_threads, 1, memory_order_relaxed);
}
Are you assuming threads are launched once and never terminated and relaunched? Because otherwise this can easily reach the max over time.
Good point. It's not very good to assume that.
I wonder if we can let this index overflow to zero. Worst case, if two threads use the same index, the implementation is still correct, right?
If we can do that, then zmalloc doesn't need to know the server's exact max number of threads. I prefer if we can avoid that coupling.
Good point. It's not very good to assume that.
+1.
I don't think the engine dynamically creates/destroys threads today. That said, agreed we could start doing that in the future, so there is indeed a risk. Moreover, I wonder if this assumption could be broken already by a module allocating/freeing memory from dynamically created and short-lived threads.
What about HelloBlock_ValkeyCommand [didn't look at its logic TBH]? In general, what about modules?
BTW, my own preference is to wrap the thread creation function and plug any initialization we want into it. I can imagine other usages for that. I guess you discussed this and I'm late to the party.
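A minimal sketch of that idea, assuming a hypothetical valkey_thread_create() wrapper and trampoline (names invented for illustration, not part of this PR): every thread the server spawns goes through one entry point, so per-thread setup such as registering a zmalloc thread index lives in a single place.

    #include <errno.h>
    #include <pthread.h>
    #include <stdlib.h>

    void zmalloc_register_thread_index(void); /* provided by zmalloc.c */

    typedef struct {
        void *(*start_routine)(void *);
        void *arg;
    } thread_args;

    /* Runs in the new thread: do the per-thread initialization, then call
     * the caller's real start routine. */
    static void *thread_trampoline(void *raw) {
        thread_args args = *(thread_args *)raw;
        free(raw);
        zmalloc_register_thread_index(); /* plug any per-thread setup in here */
        return args.start_routine(args.arg);
    }

    int valkey_thread_create(pthread_t *tid, const pthread_attr_t *attr,
                             void *(*start_routine)(void *), void *arg) {
        thread_args *args = malloc(sizeof(*args));
        if (args == NULL) return ENOMEM;
        args->start_routine = start_routine;
        args->arg = arg;
        return pthread_create(tid, attr, thread_trampoline, args);
    }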
OK, I understand now why two threads can't add to the same variable: add is not atomic. It is read-modify-write. A simple read or write of one word is atomic, so that's why we can avoid _Atomic when we have only one writer. (@lipzhu explained it to me in a chat.)
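A minimal sketch of that distinction (illustrative names only, not the PR's code): a counter with multiple writers needs an atomic read-modify-write, while a per-thread slot has a single writer, so a plain increment cannot lose updates and a concurrent reader risks at most a slightly stale value.

    #include <stdatomic.h>
    #include <stddef.h>

    #define SLOTS 4

    /* Multiple writers: the increment must be an atomic read-modify-write
     * (a `lock add` on x86), otherwise two interleaved "read, add, write
     * back" sequences can drop one of the updates. */
    static _Atomic size_t shared_counter;

    static void shared_add(size_t n) {
        atomic_fetch_add_explicit(&shared_counter, n, memory_order_relaxed);
    }

    /* Single writer per slot: each thread only updates its own entry, so a
     * plain add is enough; a reader summing the slots may see a total that
     * lags slightly behind, but no update is ever lost. */
    static __thread int my_slot; /* assigned once when the thread registers */
    static size_t per_thread_counter[SLOTS];

    static void per_thread_add(size_t n) {
        per_thread_counter[my_slot] += n; /* no lock prefix needed */
    }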
Another idea, a simple protection: if thread index > MAX_THREADS_NUM, just use one special shared atomic variable for those threads, something like _Atomic size_t used_memory_for_additional_threads. These last threads will not benefit from the optimization, but it's a protection. It's used only if some modules create a lot of threads.
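A sketch of how that protection might look in the allocation-stat update path (variable and function names are assumptions based on this thread, not necessarily the patch's final code; real code would also pad each slot to a cache line to avoid false sharing):

    #include <stdatomic.h>
    #include <stddef.h>

    #define MAX_THREADS_NUM 132

    static _Atomic int total_active_threads;
    static __thread int thread_index;

    static size_t used_memory_thread[MAX_THREADS_NUM];
    static _Atomic size_t used_memory_for_additional_threads;

    /* Called once per thread, before its first allocation. */
    static inline void zmalloc_register_thread_index(void) {
        thread_index = atomic_fetch_add_explicit(&total_active_threads, 1, memory_order_relaxed);
    }

    static inline void update_zmalloc_stat_alloc(size_t size) {
        if (thread_index < MAX_THREADS_NUM) {
            used_memory_thread[thread_index] += size; /* single writer: plain add */
        } else {
            /* Too many threads registered: fall back to the shared atomic. */
            atomic_fetch_add_explicit(&used_memory_for_additional_threads, size, memory_order_relaxed);
        }
    }

    static inline void update_zmalloc_stat_free(size_t size) {
        if (thread_index < MAX_THREADS_NUM) {
            used_memory_thread[thread_index] -= size;
        } else {
            atomic_fetch_sub_explicit(&used_memory_for_additional_threads, size, memory_order_relaxed);
        }
    }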
So in case a module or a future dynamic io-threads implementation spawns and kills threads all the time, it will converge to using the atomic? We will probably need to address this limitation in our future work then.
I would like to ask: why is it so bad to keep a "free-ids list" and use a mutex to guard it when we allocate a thread id? It is a once-in-a-thread-lifecycle operation, so I am not sure it would cause that much performance impact.
For the deregistration when a thread is destroyed, I would suggest thinking about using pthread_key_t so we can assign a destructor call to return the id.
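A minimal sketch of that suggestion, with invented helper names (not code from this PR): a mutex-guarded free-slot table for registration, and a pthread_key_t destructor that returns the slot when the thread exits.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_THREADS_NUM 132

    static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;
    static bool id_in_use[MAX_THREADS_NUM];
    static pthread_key_t id_key;
    static pthread_once_t id_key_once = PTHREAD_ONCE_INIT;
    static __thread int thread_index = -1;

    /* Runs automatically when a registered thread exits; returns its slot. */
    static void release_thread_index(void *arg) {
        int idx = (int)(intptr_t)arg - 1; /* stored as idx+1, see below */
        pthread_mutex_lock(&id_lock);
        id_in_use[idx] = false;
        pthread_mutex_unlock(&id_lock);
    }

    static void create_id_key(void) {
        pthread_key_create(&id_key, release_thread_index);
    }

    /* Once-per-thread-lifetime operation, so the mutex is not a hot path. */
    static void zmalloc_register_thread_index(void) {
        pthread_once(&id_key_once, create_id_key);
        pthread_mutex_lock(&id_lock);
        for (int i = 0; i < MAX_THREADS_NUM; i++) {
            if (!id_in_use[i]) {
                id_in_use[i] = true;
                thread_index = i;
                break;
            }
        }
        pthread_mutex_unlock(&id_lock);
        if (thread_index >= 0) {
            /* Store idx+1: the destructor only fires for non-NULL values. */
            pthread_setspecific(id_key, (void *)(intptr_t)(thread_index + 1));
        }
        /* If no slot was free, thread_index stays -1 and the caller could
         * fall back to the shared atomic counter. */
    }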
Yes, the module threads will use the atomic in that case, but the main thread and IO threads use their own variables if they're spawned first and stay alive. We can improve this when/if we see a problem with modules. (Lipeng's idea is to keep a bitmap to flag each entry in the array as used or free.)
I think this is fine. Server threads are created early and are long-lived. I don't think they will ever use the overflow atomic counter.
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
I'm happy now, with this protection against too many threads.
I just found some typos.
src/config.h
Outdated
#define IO_THREADS_MAX_NUM 128
#define MAX_THREADS_NUM (IO_THREADS_MAX_NUM + 3 + 1)
(Minor) CACHE_LINE_SIZE is good in this file, but IO_THREADS_MAX_NUM is a little misplaced here, IMHO.
Now that zmalloc has a protection against too many threads, can we just define a limit within zmalloc.c that is independent of IO threads? That's my preference, but I will not insist if others disagree.
Signed-off-by: Lipeng Zhu <[email protected]>
@lipzhu, it sounds like testing with threads is critical to validate the change. Can you share results when tested with io-threads?
@zvi-code I don't think it's critical for the threads feature; even with a single thread, the atomic contention still exists and instructions like
@zuiderkwast Thanks, updated the title and description, PTAL.
This looks great. Thanks @lipzhu!
Description
This patch tries to introduce a thread-local storage variable for each thread to update its own used_memory, and then sums them together when reading in zmalloc_used_memory. Then we can reduce unnecessary lock add contention on the atomic variable. We also add a protection: if too many threads are created and the total thread count is greater than 132, we fall back to atomic operations for threads with index >= 132.
Problem Statement
The zmalloc and zfree related functions update used_memory atomically for each operation, and they are called very frequently. In the memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml benchmark, the cycles ratio of zmalloc and zfree is high; they are thin wrappers around the underlying allocator library and should not take that many cycles. Most of the cycles are contributed by lock add and lock sub, which are expensive instructions. The profiling shows the metrics update comes mainly from the main thread, so using a TLS variable will remove a lot of the contention.
Performance Boost
Note: This optimization should benefit common benchmarks widely. I chose the 2 scenarios below to validate the performance boost in my local environment.
Test Env
Start Server