-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segmentation fault when leases are enabled #18
Comments
First thing to do is to attach gdb to both processes and get a backtrace
plus variable dump of the segfault on both sides. That will give you a lot
more indication as to what's going on. You can use it to falsify your
hypothesis that the segfault occurs in acquire_lease.
…On Wed, Nov 17, 2021 at 3:11 PM Carson Molder ***@***.***> wrote:
We're trying to run Assise with leases enabled. We set the -DUSE_LEASE
flags in the KernFS and LibFS Makefiles respectively, and set up our file
system using emulated NVM as shown in the readme.
Unfortunately, it seems that there is a segmentation fault occuring in
iotest when LibFS acquires a lease. This only occurs when leases are
active: when they're disabled, the program runs normally. Here is the
output of the KernFS and LibFS on the iotest from the readme:
KernFS:
sudo ./run.sh kernfs
initialize file system
dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB
Reading root inode with inum: 1fetching node's IP address..
Process pid is 25584
ip address on interface 'lo' is 127.0.0.1
cluster settings:
--- node 0 - ip:127.0.0.1
MLFS cluster initialized
[Local-Server] Listening on port 12345 for connections. interrupt (^C) to exit.
Adding connection with sockfd: 0
Adding connection with sockfd: 1
Adding connection with sockfd: 2
RECV <-- MSG_INIT [pid 0]
RECV <-- MSG_INIT [pid 1]
RECV <-- MSG_INIT [pid 2]
[add_peer_socket():80] Peer connected (ip: 127.0.0.1, pid: 25597)
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x7f8baa40f000
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x7f8baa40f000
SEND --> MSG_SHM [paths: /shm_recv_0|/shm_send_0]
start shmem_poll_loop for sockfd 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x7f8baa40f000
SEND --> MSG_SHM [paths: /shm_recv_2|/shm_send_2]
start shmem_poll_loop for sockfd 2
SEND --> MSG_SHM [paths: /shm_recv_1|/shm_send_1]
start shmem_poll_loop for sockfd 1
[discard_leases():933] discarding all leases for peer ID = 1
Connection terminated [sockfd:0]
Connection terminated [sockfd:1]
Exit server_thread
Exit server_thread
./run.sh: line 15: 25584 Segmentation fault LD_LIBRARY_PATH=../build:../../libfs/lib/nvml/src/nondebug/ LD_PRELOAD=../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 MLFS_PROFILE=1 numactl -N0 -m0 $@
LibFS
sudo ./run.sh iotest sw 2G 4K 1
[tid:25597][device_init():50] dev id 1
dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB
[tid:25597][device_init():50] dev id 2
[tid:25597][device_init():50] dev id 3
[tid:25597][cache_init():293] allocating 262144 blocks for DRAM read cache
[tid:25597][read_superblock():504] [dev 1] superblock: size 883712 nblocks 856598 ninodes 300000 inodestart 2 bmap start 27087 datablock_start 27114
fetching node's IP address..
Process pid is 25597
ip address on interface 'lo' is 127.0.0.1
cluster settings:
[tid:25597][register_peer_log():245] assigning peer (ip: 127.0.0.1 pid: 0) to log id 0
--- node 0 - ip:127.0.0.1
Connecting to KernFS instance 0 [ip: 127.0.0.1]
[Local-Client] Creating connection (pid:25597, app_type:0, status:pending) to 127.0.0.1:12345 on sockfd 0
In thread
[Local-Client] Creating connection (pid:25597, app_type:1, status:pending) to 127.0.0.1:12345 on sockfd 1
In thread
[Local-Client] Creating connection (pid:25597, app_type:2, status:pending) to 127.0.0.1:12345 on sockfd 2
[tid:25597][init_rpc():148] awaiting remote KernFS connections
In thread
SEND --> MSG_INIT [pid 0|25597]
SEND --> MSG_INIT [pid 2|25597]
SEND --> MSG_INIT [pid 1|25597]
RECV <-- MSG_SHM [paths: /shm_recv_1|/shm_send_1]
RECV <-- MSG_SHM [paths: /shm_recv_0|/shm_send_0]
[tid:25599][add_peer_socket():63] found socket 1
[tid:25599][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 0)
[tid:25599][_find_peer():206] peer[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 1
RECV <-- MSG_SHM [paths: /shm_recv_2|/shm_send_2]
[tid:25598][add_peer_socket():63] found socket 0
[tid:25600][add_peer_socket():63] found socket 2
[tid:25598][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 1)
[tid:25598][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 0
[tid:25600][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 2)
[tid:25600][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 2
[tid:25597][rpc_bootstrap():909] peer send: |bootstrap |25597
[tid:25598][signal_callback():1357] received rpc with body: |bootstrap |1 on sockfd 0
[signal_callback():1370] Assigned LibFS ID=1
MLFS cluster initialized
[tid:25597][init_log():148] end of the log e7c00
init log dev 1 start_blk 916481 end 949248
[tid:25597][ialloc():647] get inode - inum 1
[tid:25597][init_fs():459] LibFS is initialized on dev 1
Total file size: 2147483648B
io size: 4096B
# of thread: 1
[tid:25597][mlfs_posix_mkdir():419] [POSIX] mkdir(/mlfs/)
[tid:25597][namex():274] namex: path /mlfs/, parent 1, name mlfs
[acquire_lease():430] LIBFS ID= 1 trying to acquire lease of type 2 for inum 1
./run.sh: line 5: 25597 Segmentation fault LD_PRELOAD=../build/libmlfs.so MLFS_PROFILE=1 ${@}
My best guess is that LibFS is having a segfault when it tries to access
the lease cache in acquire_lease()
<https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L432>.
The function calls lcache_find()
<https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L114>
and looks inside a hash table to find the lease. I have a feeling there is
something incorrect about our setup ***@***.*** on the UT
cluster), but it might not be since the code works fine with leases off. Do
you know where we could start to debug this? Thank you!
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#18>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABHQBMT7XMTQKQMHZCKUTI3UMQK6ZANCNFSM5IH767QQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
We're having trouble getting meaningful results from gdb - we've recompiled both sides with the -g flag, and are attaching gdb by pid after starting the process. However, our backtraces look something like
Is there some other way that we need to import symbols or run gdb? I have not been able to set breakpoints by name either. In the meantime, we tried some "printf debugging" to narrow down where segfault is coming from, and it appears to be coming from: https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L120 Specifically that the hash_handle is not defined. |
Waleed (cc'ed) might be able to help here.
I imagine lcache_find will return NULL and instead of segfaulting, libfs
should at least assert that the return value is non-NULL. The next step
would be to figure out why the return value is NULL and what should be done
in that case.
…-- Simon
On Thu, Nov 18, 2021 at 3:24 PM Jacqueline Deans ***@***.***> wrote:
We're having trouble getting meaningful results from gdb - we've
recompiled both sides with the -g flag, and are attaching gdb by pid after
starting the process. However, our backtraces look something like
Thread 1 "iotest" received signal SIGSEGV, Segmentation fault.
0x00007f7844fd662c in ?? ()
(gdb) backtrace
#0 0x00007f7844fd662c in ?? ()
#1 0x00007fff75f01904 in ?? ()
#2 0x00007fff75f01a40 in ?? ()
#3 0x5366922b64c30614 in ?? ()
#4 0x0000000000000000 in ?? ()
Is there some other way that we need to import symbols or run gdb? I have
not been able to set breakpoints by name either.
In the meantime, we tried some "printf debugging" to narrow down where
segfault is coming from, and it appears to be coming from:
https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L120
Specifically that the hash_handle is not defined.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#18 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABHQBMQZSE63MMIVU765YBTUMVVGFANCNFSM5IH767QQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
I believe I found the issue. There was a bug that resulted in the lease table not being properly initialized. I just pushed a fix for it. Please pull and let me know if you run into any other problems. |
Your fix works great. Thank you! |
We're trying to run Assise with leases enabled. We set the
-DUSE_LEASE
flags in the KernFS and LibFS Makefiles respectively, and set up our file system using emulated NVM as shown in the readme.Unfortunately, it seems that there is a segmentation fault occuring in iotest when LibFS acquires a lease. This only occurs when leases are active: when they're disabled, the program runs normally. Here is the output of the KernFS and LibFS on the iotest from the readme:
KernFS:
LibFS
My best guess is that LibFS is having a segfault when it tries to access the lease cache in acquire_lease(). The function calls lcache_find() and looks inside a hash table to find the lease. I have a feeling there is something incorrect about our setup (strata_access@kitten3-lom on the UT cluster), but it might not be since the code works fine with leases off. Do you know where we could start to debug this? Thank you!
The text was updated successfully, but these errors were encountered: