
Abnormal memory access latency when using multiple servers #8

charles-typ opened this issue May 20, 2021 · 2 comments

charles-typ commented May 20, 2021

Hi @ooibc88 @cac2003 @guowentian

I ran some performance benchmarks on GAM that yield unexpected latency numbers when I increase the number of servers, and I was hoping to get some insights from you. Details of the experimental setup, methodology, and results are below.

Experiment setup:

  1. Two servers, VM1 and VM2, each with 512MB of local memory, all of which is used as cache.
  2. One server, VM3, with all available DRAM (~10GB) used as local memory and no cache.

Therefore VM1 and VM2 fetch data from VM3 and keep it in their local caches.

Method:

I replayed several memory traces captured from different applications against GAM under the two scenarios listed below, and recorded the execution time for each (a rough sketch of the replay loop follows the scenarios). The memory footprint of the application (~1GB) is larger than the local cache size (512MB), so there are evictions along with invalidations. All memory accesses are 1 byte.

Scenario 1: Replay the memory traces with 10 threads on VM1, keeping VM2 idle.
Scenario 2: Replay the memory traces with 10 threads on VM1 and 10 threads on VM2; this means there are invalidations between the VMs due to shared memory accesses.
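
In both scenarios, each thread simply walks its slice of the captured trace and issues 1-byte GAM accesses. The sketch below is only an illustration: the GAlloc/GAddr names follow GAM's allocator interface (include/gallocator.h), but the exact signatures and the plain offset arithmetic on GAddr are assumptions rather than code copied from the repo.

```cpp
// Rough sketch of the per-thread replay loop (illustrative only).
#include <cstdint>
#include <vector>

#include "gallocator.h"   // GAM client interface (GAlloc, GAddr); assumed header

struct TraceEntry {
  uint64_t offset;   // byte offset of the access within the shared region
  bool is_write;     // whether the access is a write
};

// Replays one thread's slice of the captured trace with 1-byte accesses.
void ReplayThread(GAlloc* alloc, GAddr base,
                  const std::vector<TraceEntry>& trace) {
  char buf = 0;
  for (const TraceEntry& e : trace) {
    GAddr addr = base + e.offset;      // assumes offsets stay within one segment
    if (e.is_write)
      alloc->Write(addr, &buf, 1);     // asynchronous under PSO
    else
      alloc->Read(addr, &buf, 1);      // synchronous; this is what gets timed
  }
}
```

In Scenario 1 only VM1 runs this loop (10 threads); in Scenario 2 both VM1 and VM2 run it.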

Results:

I expected Scenario 2 to be slower due to more invalidations between VM1 and VM2, but found Scenario 2 was actually faster than Scenario 1.

To understand the results better, I profiled the memory access latency in GAM, separating local and remote memory accesses (shown in the table below). Latency was measured only for read operations, since write operations are always asynchronous under the PSO model.

             Local access latency (us)   Remote access latency (us)
Scenario 1   2.2                         299
Scenario 2   1.4                         84

Even though there are invalidations in Scenario 2, its remote access latency is lower than in Scenario 1, and there is also a slight speedup in local memory accesses.
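
For reference, the split above was obtained by timestamping each read, along the lines of the sketch below; the is_local predicate stands in for however the worker decides whether an address is homed locally, and all names here are illustrative rather than GAM's own.

```cpp
// Hypothetical instrumentation to split read latency into local vs. remote.
#include <chrono>
#include <cstdint>

using GAddr = std::uint64_t;   // stand-in for GAM's global address type

struct LatencyStats {
  std::uint64_t local_ns = 0,  local_cnt = 0;
  std::uint64_t remote_ns = 0, remote_cnt = 0;
};

template <typename ReadFn, typename IsLocalFn>
void TimedRead(GAddr addr, char* buf, ReadFn do_read, IsLocalFn is_local,
               LatencyStats& stats) {
  auto t0 = std::chrono::steady_clock::now();
  do_read(addr, buf, 1);                       // the actual 1-byte GAM read
  auto t1 = std::chrono::steady_clock::now();
  std::uint64_t ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
  if (is_local(addr)) { stats.local_ns += ns;  ++stats.local_cnt;  }
  else                { stats.remote_ns += ns; ++stats.remote_cnt; }
}
```

Dividing local_ns by local_cnt and remote_ns by remote_cnt per scenario yields per-scenario averages like those in the table above.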

Despite extensive profiling, I was unable to explain this strange behavior; is this expected? If so, why? Thank you for taking the time to read this issue --- I would really appreciate any help!

Second222None commented Jan 30, 2024

Hi @charles-typ , @guowentian , @ooibc88 , @cac2003
I am doing a similar thing and adapting GAM to run on RoCE. However, when I tried to run ./scripts/benchmark-all.sh with 3 VMs, I hit a segmentation fault.

Can you give some advice?
Thanks in advance!

...
cannot find the key for hash table widCliMap (key not found in table)
...
(gdb) bt
#0  0x00000000004503a2 in std::atomic_flag::test_and_set (__m=std::memory_order_acquire, this=0x1b399ff0)
    at /usr/include/c++/10.3.1/bits/atomic_base.h:202
#1  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::spinlock::lock (this=0x1b399ff0)
    at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:164
#2  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::lock_two (i2=6579, i1=34142, hp=<optimized out>, this=0x7ffff5ba3018)
    at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:784
#3  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::snapshot_and_lock_two (hv=<optimized out>, this=<optimized out>)
    at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:833
#4  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::find (val=<synthetic pointer>: <optimized out>, key=@0x7fffc670dd3c: 20,
    this=0x7ffff5ba3018) at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:473
#5  cuckoohash_map<unsigned int, Client*, CityHasher<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, Client*> >, 4ul>::find (key=@0x7fffc670dd3c: 20, this=0x7ffff5ba3018)
    at ../include/../lib/libcuckoo/src/cuckoohash_map.hh:483
#6  HashTable<unsigned int, Client*>::at (key=@0x7fffc670dd3c: 20, this=0x7ffff5ba3018) at ../include/hashtable.h:61
#7  Server::FindClient (this=0x7ffff5ba3010, qpn=<optimized out>) at server.cc:168
#8  0x000000000045051c in Server::ProcessRdmaRequest (this=0x7ffff5ba3010, wc=...) at server.cc:38
#9  0x0000000000422e47 in Worker::StartService (w=0x7ffff5ba3010) at worker.cc:186
#10 0x00007ffff7e64270 in ?? () from /usr/lib64/libstdc++.so.6
#11 0x00007ffff7b234ca in ?? () from /usr/lib64/libc.so.6
#12 0x00007ffff7ba5ec0 in ?? () from /usr/lib64/libc.so.6
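
From the backtrace, the fault is raised while Server::FindClient (server.cc:168) looks up the incoming QP number via HashTable::at, and the "cannot find the key" message suggests a completion arrives for a QPN (or worker id) that has no registered Client. A lookup that tolerates a missing key instead of crashing could look like the sketch below; it uses the two-argument cuckoohash_map::find visible in frames #4/#5, and the function and variable names are illustrative rather than the actual GAM code.

```cpp
// Sketch of a defensive client lookup (illustrative, not GAM's code).
// "Map" is whatever cuckoohash_map instantiation backs the qpn->Client map;
// its two-argument find() returns false when the key is absent.
#include <cstdio>

class Client;  // defined in GAM; opaque for this sketch

template <typename Map>
Client* FindClientSafe(Map& qpCliMap, unsigned int qpn) {
  Client* cli = nullptr;
  if (!qpCliMap.find(qpn, cli)) {   // false instead of throwing or crashing
    std::fprintf(stderr, "no Client registered for qpn %u\n", qpn);
    return nullptr;                 // caller must handle the missing client
  }
  return cli;
}
```

Note that this only avoids the crash path shown in the trace; it does not explain why the key is missing (or whether the map itself has been corrupted) in the first place.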

charles-typ (Author) commented

@Second222None

I suggest you open a separate issue for this. I'm unsure about your problem, but maybe you can refer to my forked repo. I made a significant number of fixes to get it working, so I hope the changes are helpful.
