DAOS-16877 client: implement utilities for shared memory #15613
base: feature/dfs_dcache
Conversation
Features: shm
1. use tlsf as memory allocator
2. shared memory create/destroy
3. robust mutex in shared memory
4. hash table in shared memory
Required-githooks: true
Skipped-githooks: codespell
Signed-off-by: Lei Huang <[email protected]>
Ticket title is 'To implement node-wise caching with shared memory'
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15613/1/execution/node/1210/log
Features: shm Required-githooks: true Skipped-githooks: codespell Signed-off-by: Lei Huang <[email protected]>
…hm_mutex Features: shm Required-githooks: true Skipped-githooks: codespell Signed-off-by: Lei Huang <[email protected]>
Features: shm Required-githooks: true Skipped-githooks: codespell Signed-off-by: Lei Huang <[email protected]>
Some components (e.g., hash table record reference count) will be revised later when we add more use cases.
Currently we use the tlsf memory allocator. It will be replaced by our own allocator in the future.
src/tests/ftest/daos_test/shm.py
Outdated
job.assign_hosts(cmocka_utils.hosts)
job.assign_environment(daos_test_env)

cmocka_utils.run_cmocka_test(self, job)
Normally we run the cmocka tests, like daos_test, via DaosCoreBase.run_subtest(), which sets up additional environment variables and configures the dmg command. Do we need any of that here? Note: In its current form the run_subtest() method uses Orterun to run the daos_test command remotely.
As a requirement for adding this test we should also run it with the faults-enabled: false commit pragma to ensure that it will run when we attempt a release build.
@phender Thank you very much! I wrote shm.py and shm.yaml with dfuse.py and dfuse.yaml as templates. "shm_test" does not need to configure the dmg command. Additional environment variables might be added in future tests with "daos_test_env" here.
Thank you for your tip about using "faults-enabled: false". I will use it in the future. I previously used "Features: shm" to run the new test.
src/tests/ftest/daos_test/shm.yaml
Outdated
pool:
  scm_size: 1G
container:
  type: POSIX
  control_method: daos
This isn't used by the test. A typical cmocka test would use the pool entry information, but only when the test is run via DaosCoreBase.run_subtest().
The new test "shm_test" does not need scm_size and nvme_size information.
I think what @phender means is that since this test does not use DaosCoreBase.run_subtest(), the entire pool and container keys here are not used and can be removed.
... or if we use DaosCoreBase.run_subtest() we can keep it.
@daltonbohning @phender Thank you very much! I will try removing the pool and container keys locally to make sure it works. I thought they were required.
@daltonbohning @phender You are right. We can remove the pool and container keys in the yaml file as you suggested. I will update it in the next commit. Thank you!
I have not yet finished the review process, but I still have several concerns and questions regarding this PR.
src/tests/ftest/daos_test/shm.py
Outdated
@@ -0,0 +1,44 @@
"""
From my understanding this test is more of a unit test and thus should probably be run with the utils/run_utest.py Python script instead of by the functional test framework.
@phender and @daltonbohning, what is your opinion on this point?
In general, yes. If the same test can be run as a unit test (low cost, quick) instead of a functional test (higher cost, slower) then we should run it as a unit test.
@knard38 @daltonbohning Thank you very much! I will look into running this test as a unit test.
@wiliamhuang , If it can help, I recently added some unit tests in the following PR:
https://github.com/daos-stack/daos/pull/14713/files#diff-294ea4ccb7880cabe2a9a4ffadd3c709916da304a39a819c82f23bfe06197a61
Yes. The current tests are simple and could fit as unit tests. More complex tests will be added in the future. We can add shared-memory-related ftests later when needed.
@wiliamhuang , If it can help, I recently added some unit tests in the following PR: https://github.com/daos-stack/daos/pull/14713/files#diff-294ea4ccb7880cabe2a9a4ffadd3c709916da304a39a819c82f23bfe06197a61
@knard38 Thank you very much! It's very helpful.
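For context, a gurt-style cmocka unit test is just a small standalone binary; below is a minimal sketch of the shape such a test could take (the test and function names are hypothetical, not the actual shm_test contents):

/*
 * Illustrative sketch only: a cmocka unit test skeleton of the kind that
 * could run under utils/run_utest.py without a DAOS server.
 */
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <setjmp.h>
#include <cmocka.h>

static void
test_shm_create_destroy(void **state)
{
	(void)state;
	/* placeholder for shm create/alloc/free/destroy checks */
	assert_true(1);
}

int
main(void)
{
	const struct CMUnitTest tests[] = {
		cmocka_unit_test(test_shm_create_destroy),
	};

	return cmocka_run_group_tests(tests, NULL, NULL);
}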
src/gurt/shm_alloc.c
Outdated
	/* failed to open */
	if (shm_ht_fd == -1) {
		if (errno == ENOENT) {
			goto create_shm;
NIT: could it improve readability to put this code in a dedicated function instead of using a goto?
You are right. I will create a function for shm creation. Thank you!
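A sketch of what that refactor could look like, with hypothetical names: the creation path moves into its own helper, and the open path calls it on ENOENT instead of jumping to a goto label.

/* Illustrative sketch only; names and error handling are hypothetical. */
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int
shm_create(const char *name, size_t size)
{
	int fd;

	fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd == -1)
		return -1;
	if (ftruncate(fd, (off_t)size) == -1) {
		close(fd);
		shm_unlink(name);
		return -1;
	}
	return fd;
}

static int
shm_open_or_create(const char *name, size_t size)
{
	int fd;

	fd = shm_open(name, O_RDWR, 0600);
	if (fd == -1 && errno == ENOENT)
		fd = shm_create(name, size);
	return fd;
}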
src/include/gurt/shm_alloc.h
Outdated
	uint64_t shm_pool_size;
	/* reserved for future usage */
	char     reserved[256];
};
This should be needed if we want the mmapped address space to be well aligned. Suggested change:
-};
+} __attribute__((aligned(PAGE_SIZE)));
Thank you! The address returned by mmap() is always page aligned.
src/gurt/shm_alloc.c
Outdated
	shm_pool_size = shm_size / N_SHM_POOL;
	if (shm_pool_size % 4096)
		/* make shm_pool_size 4K aligned */
		shm_pool_size += (4096 - (shm_pool_size % 4096));
To support different architectures and page size configurations, the value 4096 should probably be defined with a macro such as PAGE_SIZE. Suggested change:
-shm_pool_size += (4096 - (shm_pool_size % 4096));
+shm_pool_size += (PAGE_SIZE - (shm_pool_size % PAGE_SIZE));
Thank you! I will update it as you suggested to make it more portable.
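A minimal sketch of the portable rounding, assuming a power-of-two page size queried at runtime via sysconf(); the helper names are illustrative:

#include <stddef.h>
#include <unistd.h>

/* round x up to a multiple of a; a must be a power of two */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((size_t)(a) - 1))

static size_t
shm_pool_size_aligned(size_t shm_size, size_t n_pool)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);

	return ALIGN_UP(shm_size / n_pool, page_size);
}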
	/* map existing shared memory */
	shm_addr = mmap(FIXED_SHM_ADDR, shm_size, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_FIXED, shm_ht_fd, 0);
Using a fixed memory location seems to be strongly discouraged by the man page. I am not sure I understand why it is needed, or how we can be sure that it will not overlap existing mappings.
I agree. The fixed memory location is a strong limitation. It comes from the memory allocator we use. We could eliminate this limitation later once we have our own memory allocator supporting shared memory management.
I also don't get why we use MAP_FIXED and the fixed address. Using MAP_FIXED can cause undefined behavior if the address is actually in use by something else, no?
Maybe I don't get the requirement for why you need this.
You are right.
The requirement to use the same fixed address across processes comes from the memory allocator we use. This is a quick and dirty way to let us use an existing memory allocator for now. In the future we need to implement our own memory allocator that natively supports shared memory management; then the limitation can be removed.
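For reference, a common way to drop the fixed-address requirement later is to store offsets from the mapping base instead of raw pointers inside the shared region; a minimal sketch with hypothetical names:

#include <stdint.h>

typedef uint64_t shm_off_t;	/* offset into the shared region; 0 == NULL */

static void *shm_base;		/* this process's base from mmap(NULL, ...) */

static inline void *
shm_off2ptr(shm_off_t off)
{
	return off == 0 ? NULL : (char *)shm_base + off;
}

static inline shm_off_t
shm_ptr2off(const void *ptr)
{
	return ptr == NULL ? 0 :
	       (shm_off_t)((const char *)ptr - (const char *)shm_base);
}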
src/gurt/shm_alloc.c
Outdated
	char daos_shm_file_name[128];

	sprintf(daos_shm_file_name, "/dev/shm/%s_%d", daos_shm_name, getuid());
	unlink(daos_shm_file_name);
NIT: from my understanding unlink() and shm_unlink() are equivalent here, but using shm_unlink() with the name used for shm_open() seems more understandable to me: it explicitly indicates that we are removing an object created with shm_open().
You are right. I will replace unlink() with shm_unlink(). Thank you!
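A sketch of the agreed change: the removal uses shm_unlink() on the same name passed to shm_open(), instead of unlink() on a hand-built /dev/shm path (shm_remove is a hypothetical helper name):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static void
shm_remove(const char *daos_shm_name)
{
	char name[128];

	/* the name only; the /dev/shm mapping is the implementation's job */
	snprintf(name, sizeof(name), "%s_%d", daos_shm_name, (int)getuid());
	shm_unlink(name);
}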
	uint64_t oldref;

	if (idx_small < 0) {
		tid = syscall(SYS_gettid);
Instead of calling syscall(SYS_gettid) (which needs to trap into the kernel), would a call to a userland function such as pthread_self() be better suited?
I have never used "pthread_self" before.
https://man7.org/linux/man-pages/man3/pthread_self.3.html
The thread ID returned by pthread_self() is not the same thing as
the kernel thread ID returned by a call to gettid.
This syscall will be called only once. I would not worry about the overhead.
I do not fully agree with you on this: this function will be called for each memory allocation, and thus its performance has some impact. However, this point is not a blocker for me.
I tested pthread_self() and syscall(SYS_gettid).
Hello from thread 1! tid_pthread_self = 140335788001024 SYS_gettid = 3023160
Hello from thread 2! tid_pthread_self = 140335779608320 SYS_gettid = 3023161
Hello from thread 3! tid_pthread_self = 140335771215616 SYS_gettid = 3023162
Hello from thread 4! tid_pthread_self = 140335762822912 SYS_gettid = 3023163
It looks like all the tid_pthread_self values are even numbers here; they do not look evenly distributed to me. syscall(SYS_gettid) is only called once per thread, not on every memory allocation.
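One way to make the once-per-thread cost explicit is to cache the kernel tid in a thread-local variable; a minimal sketch (shm_gettid is a hypothetical name; on glibc 2.30+ the gettid() wrapper could be used instead of the raw syscall):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static __thread pid_t cached_tid;

static pid_t
shm_gettid(void)
{
	/* trap into the kernel only on a thread's first call */
	if (cached_tid == 0)
		cached_tid = (pid_t)syscall(SYS_gettid);
	return cached_tid;
}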
	atomic_fetch_add_relaxed(&(d_shm_head->ref_count), -1);
	if (pid != pid_shm_creator)
		munmap(d_shm_head, d_shm_head->size);
From my understanding, all the processes should unmap and then close the file. The shared mmapped file content should be kept until unlink() or shm_unlink() is called. Thus, skipping the unmap and close in the process that created the shared memory file seems to be useless.
I am also concerned that it could be seen as a memory leak by valgrind or other memory checker tools.
Considering the cache in kernel space, I thought we might want to keep our cache persistent too. Otherwise, shared memory needs to be initialized again and again. Ideally, the space for caching would be freed after the content expires. We need our own shared memory allocator to dynamically expand/shrink the shared memory region. It will be a long way to get there.
Yes, I did have a small concern about whether valgrind can detect a memory leak in shared memory usage here. I could play with valgrind to find out with a simple test.
I am not sure that keeping a dangling mapping when the process dies will change anything for the kernel cache. From my understanding, what makes the difference is unlinking the file.
@mchaarawi do you have an opinion on this?
In one extreme case, the caching has no benefit at all if a user runs jobs serially and we destroy the cache once the application ends.
Maybe I am missing something, but if you remove the shared file, then keeping a dangling mapping will not help once the file is unlinked. On the other hand, if you do not unlink the file, the cache will still be available even if you do not keep a mapping at the end of your applications. From my understanding the life of the cache is managed by the kernel, and it will be removed when the file is unlinked.
However, it is perfectly possible that I am missing something obvious.
shm_destroy() is not called by regular applications; it is called only in shm_test. The file associated with the shared memory will not be unlinked.
I talked to Mohamad recently. He suggested that daos_agent initialize and destroy the shared memory region. We will update this later.
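A sketch of the lifecycle this thread converges on, with hypothetical names: every process unmaps and closes its own view, while only a designated owner (e.g., daos_agent, as suggested) ever unlinks the name; the kernel frees the content once the file is unlinked and the last mapping is gone.

#include <sys/mman.h>
#include <unistd.h>

static void
shm_detach(void *addr, size_t size, int fd)
{
	munmap(addr, size);	/* drops only this process's mapping */
	close(fd);		/* region persists; it is still linked */
}

static void
shm_destroy_region(const char *name)
{
	shm_unlink(name);	/* owner only: content released after last detach */
}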
	if (shm_pool_size % 4096)
		/* make shm_pool_size 4K aligned */
		shm_pool_size += (4096 - (shm_pool_size % 4096));
	shm_size = shm_pool_size * N_SHM_POOL + sizeof(struct d_shm_hdr);
In fact my remark on aligning struct d_shm_hdr was to make its sizeof a multiple of PAGE_SIZE; otherwise shm_size may not be a multiple of PAGE_SIZE, from my understanding. Then the shared pools (following the header) may not be aligned on PAGE_SIZE.
However, I have not checked that the aligned attribute properly changes the sizeof. I will check this asap.
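For what it's worth, the aligned attribute does round sizeof up, since sizeof is always a multiple of a type's alignment; a quick check one could compile (struct names are illustrative, PAGE_SIZE assumed to be 4096):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096

struct hdr_plain  { uint64_t shm_pool_size; char reserved[256]; };
struct hdr_padded { uint64_t shm_pool_size; char reserved[256]; }
	__attribute__((aligned(PAGE_SIZE)));

int
main(void)
{
	/* prints "264 4096" on a typical x86_64 build */
	printf("%zu %zu\n", sizeof(struct hdr_plain), sizeof(struct hdr_padded));
	return 0;
}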
"shm_size" does not have be a multiple of PAGE_SIZE. mmap() does not require size to be a multiple of PAGE_SIZE. The memory allocator will use some space in the pool too. I am not sure making shm_size a multiple of PAGE_SIZE will bring noticeable benefit. Maybe some performance tests could help to clarify later.
From what I know it is indeed always better to have aligned memory for performance. Moreover, from what I understand, the size really allocated by mmap() will be the same. The only difference is that the padding will be added at the end of the allocated memory instead of between struct d_shm_hdr and the first shm_pool.
In any case, this is not a blocker from my side.
Just some quick comments; I still need to review more closely and will do that soon.
	return 0;
}

rc = d_getenv_uint64_t("DAOS_SHM_SIZE", &shm_size);
We probably should have something different than an env variable to determine the size. I previously implemented a utility that grabs that from the agent; I can integrate that into the branch later and replace this.
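A sketch of the interim behavior under discussion, assuming d_getenv_uint64_t() leaves the output unchanged when the variable is absent (SHM_SIZE_DEFAULT is a hypothetical macro; this fragment assumes the surrounding file's includes):

/* start from a compile-time default, override from the environment */
uint64_t shm_size = SHM_SIZE_DEFAULT;

(void)d_getenv_uint64_t("DAOS_SHM_SIZE", &shm_size);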
	/* the shared memory only accessible for individual user for now */
	sprintf(daos_shm_name_buf, "%s_%d", daos_shm_name, getuid());
open_rw:
Would it work if we just use O_CREAT without O_EXCL, so you don't need the try-open-then-try-create semantics here?
I used "O_CREAT | O_EXCL" to make sure only one process will initialize shared memory. Without "O_EXCL" would allow more than one concurrent processes to initialize shared memory.
src/tests/suite/SConscript
Outdated
shm_test_env = base_env.Clone()
shm_test_env.compiler_setup()
shm_test_env.AppendUnique(LIBPATH=[Dir('../../gurt')])
shm_test_env.AppendUnique(LIBPATH=[Dir('../../common')])
shm_test_env.AppendUnique(LIBPATH=[Dir('../../cart')])
shmtest = shm_test_env.d_program(File("shm_test.c"), LIBS=['gurt', 'daos_common', 'cart',
                                                           'cmocka', 'rt', 'pthread'])
denv.Install('$PREFIX/bin/', shmtest)
I don't think you need a DAOS server to run those tests, right?
So adding those as unit tests in gurt will probably be more appropriate.
Right. The current tests are quite simple and a DAOS server is not needed. I will change the tests into unit tests. We will add ftests later when necessary. Thank you!
src/include/gurt/shm_alloc.h
Outdated
#ifndef __DAOS_SHM_ALLOC_H__
#define __DAOS_SHM_ALLOC_H__

#include <stdint.h>
It's probably better to create one header for this rather than multiple public headers that users of this module always need to include.
Thank you very much! Just fixed it. Now only necessary APIs are exposed in this header file.
Required-githooks: true Skipped-githooks: codespell Signed-off-by: Lei Huang <[email protected]>
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15613/6/testReport/
Required-githooks: true Skipped-githooks: codespell Signed-off-by: Lei Huang <[email protected]>
Required-githooks: true Skipped-githooks: codespell Signed-off-by: Lei Huang <[email protected]>
Required-githooks: true Skipped-githooks: codespell Signed-off-by: Lei Huang <[email protected]>
Required-githooks: true Signed-off-by: Lei Huang <[email protected]>
Required-githooks: true Signed-off-by: Lei Huang <[email protected]>
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15613/14/testReport/
Required-githooks: true Signed-off-by: Lei Huang <[email protected]>
From what I understand of DAOS best practice, an internal header should not be located in the src/include directory, so that it is not visible to the end user. But I could be wrong on this point.
ok. I will move it to src/gurt. Thank you!
	_Atomic int      ref_count;
	/* global counter used for round robin picking memory allocator for large memory request */
	_Atomic uint64_t large_mem_count;
	/* array of pointors to memory allocators */
NIT, suggested change:
-/* array of pointors to memory allocators */
+/* array of pointers to memory allocators */
A good catch. Thank you! Will fix it.
Instead of creating a new hash map, would it not be possible to update the current DAOS htable implementation with your new shared memory features?
I agree. At the beginning I was inclined to take the current DAOS htable implementation and modify it to fit shared memory. I decided to implement the hash table in shared memory from scratch after I realized many parts were not compatible.
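One illustration of the incompatibility, with hypothetical types: a process-local d_hash-style table chains records with raw pointers, which are meaningless in another process's address space, so a shared memory table has to chain with offsets and lock with process-shared robust mutexes instead.

#include <pthread.h>
#include <stdint.h>

/* process-local style: pointer-based chaining */
struct ht_rec_local {
	struct ht_rec_local *next;	/* valid only in this address space */
};

/* shared memory style: offset-based chaining */
struct ht_rec_shm {
	uint64_t next_off;		/* offset from the shm base; 0 == end */
};

struct ht_bucket_shm {
	pthread_mutex_t lock;		/* PTHREAD_PROCESS_SHARED + robust */
	uint64_t        head_off;	/* first record, as an offset */
};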
Fair enough, I will thus have a more in-depth look at this file.
Features: shm
Required-githooks: true
Skipped-githooks: codespell
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: