Introduce thread-local storage variable to update thread's own used_memory and sum when reading to reduce atomic contention. #674

Merged · 9 commits · Jul 2, 2024
11 changes: 11 additions & 0 deletions src/config.h
@@ -329,4 +329,15 @@ void setcpuaffinity(const char *cpulist);
#define HAVE_FADVISE
#endif

#define IO_THREADS_MAX_NUM 128
#define MAX_THREADS_NUM (IO_THREADS_MAX_NUM + 3 + 1)
Contributor:

(Minor) CACHE_LINE_SIZE is good in this file, but IO_THREADS_MAX_NUM is a little misplaced here, IMHO.

Now that zmalloc has a protection against too many threads, can we just define a limit within zmalloc.c that is independent of IO threads? That's my preference, but I will not insist if others disagree.
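A minimal sketch of that decoupling (the name and value here are illustrative, not from this PR):

/* Hypothetical zmalloc-internal cap, independent of IO_THREADS_MAX_NUM.
 * Any value comfortably above the number of long-lived server threads works. */
#define ZMALLOC_MAX_THREADS_NUM 132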


#ifndef CACHE_LINE_SIZE
#if defined(__aarch64__) && defined(__APPLE__)
#define CACHE_LINE_SIZE 128
#else
#define CACHE_LINE_SIZE 64
#endif
#endif

#endif
9 changes: 0 additions & 9 deletions src/networking.c
@@ -4204,15 +4204,6 @@ void processEventsWhileBlocked(void) {
* Threaded I/O
* ========================================================================== */

#define IO_THREADS_MAX_NUM 128
#ifndef CACHE_LINE_SIZE
#if defined(__aarch64__) && defined(__APPLE__)
#define CACHE_LINE_SIZE 128
#else
#define CACHE_LINE_SIZE 64
#endif
#endif

typedef struct __attribute__((aligned(CACHE_LINE_SIZE))) threads_pending {
_Atomic unsigned long value;
} threads_pending;
46 changes: 42 additions & 4 deletions src/zmalloc.c
@@ -88,10 +88,44 @@ void zlibc_free(void *ptr) {
#define dallocx(ptr, flags) je_dallocx(ptr, flags)
#endif

#define update_zmalloc_stat_alloc(__n) atomic_fetch_add_explicit(&used_memory, (__n), memory_order_relaxed)
#define update_zmalloc_stat_free(__n) atomic_fetch_sub_explicit(&used_memory, (__n), memory_order_relaxed)
#if __STDC_NO_THREADS__
#define thread_local __thread
#else
#include <threads.h>
#endif

/* A thread-local variable that keeps the current thread's index in the used_memory_thread array. */
static thread_local int thread_index = -1;
/* Each element of the used_memory_thread array should be written only by a single
 * thread, identified by the thread-local thread_index. However, while an element is
 * being written it may be read simultaneously by another thread, and on non-x86
 * architectures the reader could observe inconsistent memory. On ARM and PowerPC we
 * can solve this by keeping the memory aligned; on other architectures we fall back
 * to atomic operations to stay safe. */
#if defined(__i386__) || defined(__x86_64__) || defined(__amd64__) || defined(__POWERPC__) || defined(__arm__) || \
defined(__arm64__)
static __attribute__((aligned(sizeof(size_t)))) size_t used_memory_thread[MAX_THREADS_NUM];
#else
static _Atomic size_t used_memory_thread[MAX_THREADS_NUM];
#endif
static atomic_int total_active_threads;

static _Atomic size_t used_memory = 0;
/* Register the thread index in start_routine. */
static inline void zmalloc_register_thread_index(void) {
thread_index = atomic_fetch_add_explicit(&total_active_threads, 1, memory_order_relaxed);
Contributor:

Are you assuming threads are launched once and never terminated and relaunched? Because otherwise this can easily reach the max over time.

Contributor:

Good point. It's not very good to assume that.

I wonder if we can let this index overflow to zero. Worst case, if two threads use the same index, the implementation is still correct, right?

If we can do that, then zmalloc doesn't need to know the server's exact max number of threads. I'd prefer to avoid that coupling.

Member:

> Good point. It's not very good to assume that.

+1.

I don't think the engine dynamically creates/destroys threads today. That said, agreed that we could start doing that in the future, so there is indeed a risk. Moreover, I wonder if this assumption could already be broken by a module allocating/freeing memory from dynamically created, short-lived threads.

Contributor:

What about HelloBlock_ValkeyCommand (I didn't look at its logic, TBH)? In general, what about modules?

Contributor:

BTW, my own preference is to wrap the thread-create function and plug any initialization we want into it. I can imagine other uses for that. I guess you discussed this and I'm late to the party.

Contributor (zuiderkwast, Jun 28, 2024):

OK, I understand now why two threads can't add to the same variable: add is not atomic. It is a read-modify-write. A simple read or write of one word is atomic, which is why we can avoid _Atomic when we have only one writer. (@lipzhu explained this to me in a chat.)
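For readers following along, here is a minimal standalone sketch (not part of this PR) of the lost-update race that makes += unsafe with two writers: each += is a separate load, add, and store, so concurrent increments can overwrite each other.

#include <pthread.h>
#include <stdio.h>

static size_t counter = 0; /* plain, non-atomic */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) counter += 1; /* read-modify-write: racy */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%zu\n", counter); /* often prints less than 2000000 */
    return 0;
}

With a single writer per array slot, as in this PR, the read-modify-write stays confined to one thread, and a concurrent reader only needs an aligned word-sized load.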

Contributor:

Another idea, a simple protection: if the thread index exceeds MAX_THREADS_NUM, just use one special shared atomic variable for those threads, something like _Atomic size_t used_memory_for_additional_threads. These last threads will not benefit from the optimization, but it's a protection, used only if some modules create a lot of threads.
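A rough sketch of that protection, reusing the PR's thread_index and used_memory_thread names (the shared fallback counter is hypothetical):

static _Atomic size_t used_memory_for_additional_threads;

static inline void update_zmalloc_stat_alloc(size_t size) {
    if (unlikely(thread_index == -1)) zmalloc_register_thread_index();
    if (unlikely(thread_index >= MAX_THREADS_NUM)) {
        /* Overflow threads share one atomic counter: still correct, just slower. */
        atomic_fetch_add_explicit(&used_memory_for_additional_threads, size, memory_order_relaxed);
    } else {
        used_memory_thread[thread_index] += size;
    }
}

zmalloc_used_memory() would then add used_memory_for_additional_threads to the per-thread sum.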

Member:

So a module, or a future dynamic io-threads implementation that spawns and kills threads all the time, will converge to using the atomic? We will probably need to address this limitation in future work then.
I would also ask: why is it so bad to keep a free-IDs list and guard it with a mutex when we allocate a thread ID? It's a once-per-thread-lifecycle operation, so I'm not sure it would cause much performance impact.
For deregistration when a thread is destroyed, I would suggest using pthread_key_t so we can assign a destructor that returns the ID.

Contributor:

Yes, the module threads will use atomic in that case, but the main thread and IO threads use their own variable if they're spawned first and stay alive. We can improve this when/if we see a problem with modules. (Lipeng's idea is to keep a bitmap to flag each entry in the array as used or free.)
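For reference, that bitmap could look roughly like this (names and layout are illustrative, not from the PR); note that before freeing a slot, the departing thread's residual count would have to be folded into some shared counter so the total stays correct:

#include <pthread.h>

static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char slot_used[(MAX_THREADS_NUM + 7) / 8]; /* one bit per slot */

static int zmalloc_acquire_thread_index(void) {
    pthread_mutex_lock(&slot_lock);
    for (int i = 0; i < MAX_THREADS_NUM; i++) {
        if (!(slot_used[i / 8] & (1 << (i % 8)))) {
            slot_used[i / 8] |= (1 << (i % 8));
            pthread_mutex_unlock(&slot_lock);
            return i;
        }
    }
    pthread_mutex_unlock(&slot_lock);
    return -1; /* no free slot: caller can fall back to a shared atomic counter */
}

static void zmalloc_release_thread_index(int i) {
    pthread_mutex_lock(&slot_lock);
    slot_used[i / 8] &= ~(1 << (i % 8));
    pthread_mutex_unlock(&slot_lock);
}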

Member:

I think this is fine. Server threads are created early and are long-lived. I don't think they will ever use the overflow atomic counter.

/* TODO: Handle the case when we exceed MAX_THREADS_NUM (may rarely happen). */
assert(total_active_threads < MAX_THREADS_NUM);
}

static inline void update_zmalloc_stat_alloc(size_t size) {
if (unlikely(thread_index == -1)) zmalloc_register_thread_index();
used_memory_thread[thread_index] += size;
}

static inline void update_zmalloc_stat_free(size_t size) {
if (unlikely(thread_index == -1)) zmalloc_register_thread_index();
used_memory_thread[thread_index] -= size;
}

static void zmalloc_default_oom(size_t size) {
fprintf(stderr, "zmalloc: Out of memory trying to allocate %zu bytes\n", size);
@@ -415,7 +449,11 @@ char *zstrdup(const char *s) {
}

size_t zmalloc_used_memory(void) {
size_t um = atomic_load_explicit(&used_memory, memory_order_relaxed);
assert(total_active_threads < MAX_THREADS_NUM);
size_t um = 0;
for (int i = 0; i < total_active_threads; i++) {
um += used_memory_thread[i];
}
Member:

I am not sure I get this part. Without atomic operations on both readers and writers, I think we will get stale data?

Contributor (author):

@PingXie Yes, the worst case is that the reading thread reads while the writing thread is updating its thread-local variable, but the data is not lost, just stale.

Member:

That makes sense. So we are essentially saying/assuming the staleness is (tightly) bounded, which I think is reasonable.

return um;
}
