Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault when leases are enabled #18

Open
cmolder opened this issue Nov 17, 2021 · 5 comments
Open

Segmentation fault when leases are enabled #18

cmolder opened this issue Nov 17, 2021 · 5 comments

Comments

@cmolder
Copy link
Contributor

cmolder commented Nov 17, 2021

We're trying to run Assise with leases enabled. We set the -DUSE_LEASE flags in the KernFS and LibFS Makefiles respectively, and set up our file system using emulated NVM as shown in the readme.

Unfortunately, it seems that there is a segmentation fault occuring in iotest when LibFS acquires a lease. This only occurs when leases are active: when they're disabled, the program runs normally. Here is the output of the KernFS and LibFS on the iotest from the readme:

KernFS:

sudo ./run.sh kernfs
initialize file system
dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB
Reading root inode with inum: 1fetching node's IP address..
Process pid is 25584
ip address on interface 'lo' is 127.0.0.1
cluster settings:
--- node 0 - ip:127.0.0.1
MLFS cluster initialized
[Local-Server] Listening on port 12345 for connections. interrupt (^C) to exit.
Adding connection with sockfd: 0
Adding connection with sockfd: 1
Adding connection with sockfd: 2
RECV <-- MSG_INIT [pid 0]
RECV <-- MSG_INIT [pid 1]
RECV <-- MSG_INIT [pid 2]
[add_peer_socket():80] Peer connected (ip: 127.0.0.1, pid: 25597)
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x7f8baa40f000
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x7f8baa40f000
SEND --> MSG_SHM [paths: /shm_recv_0|/shm_send_0]
start shmem_poll_loop for sockfd 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x7f8baa40f000
SEND --> MSG_SHM [paths: /shm_recv_2|/shm_send_2]
start shmem_poll_loop for sockfd 2
SEND --> MSG_SHM [paths: /shm_recv_1|/shm_send_1]
start shmem_poll_loop for sockfd 1
[discard_leases():933] discarding all leases for peer ID = 1
Connection terminated [sockfd:0]
Connection terminated [sockfd:1]
Exit server_thread 
Exit server_thread 
./run.sh: line 15: 25584 Segmentation fault      LD_LIBRARY_PATH=../build:../../libfs/lib/nvml/src/nondebug/ LD_PRELOAD=../../libfs/lib/jemalloc-4.5.0/lib/libjemalloc.so.2 MLFS_PROFILE=1 numactl -N0 -m0 $@

LibFS

sudo ./run.sh iotest sw 2G 4K 1
[tid:25597][device_init():50] dev id 1
dev-dax engine is initialized: dev_path /dev/dax0.0 size 4096 MB
[tid:25597][device_init():50] dev id 2
[tid:25597][device_init():50] dev id 3
[tid:25597][cache_init():293] allocating 262144 blocks for DRAM read cache
[tid:25597][read_superblock():504] [dev 1] superblock: size 883712 nblocks 856598 ninodes 300000 inodestart 2 bmap start 27087 datablock_start 27114
fetching node's IP address..
Process pid is 25597
ip address on interface 'lo' is 127.0.0.1
cluster settings:
[tid:25597][register_peer_log():245] assigning peer (ip: 127.0.0.1 pid: 0) to log id 0
--- node 0 - ip:127.0.0.1
Connecting to KernFS instance 0 [ip: 127.0.0.1]
[Local-Client] Creating connection (pid:25597, app_type:0, status:pending) to 127.0.0.1:12345 on sockfd 0
In thread
[Local-Client] Creating connection (pid:25597, app_type:1, status:pending) to 127.0.0.1:12345 on sockfd 1
In thread
[Local-Client] Creating connection (pid:25597, app_type:2, status:pending) to 127.0.0.1:12345 on sockfd 2
[tid:25597][init_rpc():148] awaiting remote KernFS connections
In thread
SEND --> MSG_INIT [pid 0|25597]
SEND --> MSG_INIT [pid 2|25597]
SEND --> MSG_INIT [pid 1|25597]
RECV <-- MSG_SHM [paths: /shm_recv_1|/shm_send_1]
RECV <-- MSG_SHM [paths: /shm_recv_0|/shm_send_0]
[tid:25599][add_peer_socket():63] found socket 1
[tid:25599][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 0)
[tid:25599][_find_peer():206] peer[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:1 of type:1 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 1
RECV <-- MSG_SHM [paths: /shm_recv_2|/shm_send_2]
[tid:25598][add_peer_socket():63] found socket 0
[tid:25600][add_peer_socket():63] found socket 2
[tid:25598][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 1)
[tid:25598][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:0 of type:0 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 0
[tid:25600][_find_peer():176] trying to find peer with ip 127.0.0.1 and pid 0 (peer count: 1 | sock count: 2)
[tid:25600][_find_peer():191] sockfd[0]: ip 127.0.0.1 pid 0
[add_peer_socket():97] Established connection with 127.0.0.1 on sock:2 of type:2 and peer:0x565200a47d30
start shmem_poll_loop for sockfd 2
[tid:25597][rpc_bootstrap():909] peer send: |bootstrap |25597
[tid:25598][signal_callback():1357] received rpc with body: |bootstrap |1 on sockfd 0
[signal_callback():1370] Assigned LibFS ID=1
MLFS cluster initialized
[tid:25597][init_log():148] end of the log e7c00
init log dev 1 start_blk 916481 end 949248
[tid:25597][ialloc():647] get inode - inum 1
[tid:25597][init_fs():459] LibFS is initialized on dev 1
Total file size: 2147483648B
io size: 4096B
# of thread: 1
[tid:25597][mlfs_posix_mkdir():419] [POSIX] mkdir(/mlfs/)
[tid:25597][namex():274] namex: path /mlfs/, parent 1, name mlfs
[acquire_lease():430] LIBFS ID= 1 trying to acquire lease of type 2 for inum 1
./run.sh: line 5: 25597 Segmentation fault      LD_PRELOAD=../build/libmlfs.so MLFS_PROFILE=1 ${@}

My best guess is that LibFS is having a segfault when it tries to access the lease cache in acquire_lease(). The function calls lcache_find() and looks inside a hash table to find the lease. I have a feeling there is something incorrect about our setup (strata_access@kitten3-lom on the UT cluster), but it might not be since the code works fine with leases off. Do you know where we could start to debug this? Thank you!

@simpeter
Copy link

simpeter commented Nov 18, 2021 via email

@jdeans289
Copy link

We're having trouble getting meaningful results from gdb - we've recompiled both sides with the -g flag, and are attaching gdb by pid after starting the process. However, our backtraces look something like

Thread 1 "iotest" received signal SIGSEGV, Segmentation fault.
0x00007f7844fd662c in ?? ()
(gdb) backtrace
#0  0x00007f7844fd662c in ?? ()
#1  0x00007fff75f01904 in ?? ()
#2  0x00007fff75f01a40 in ?? ()
#3  0x5366922b64c30614 in ?? ()
#4  0x0000000000000000 in ?? ()

Is there some other way that we need to import symbols or run gdb? I have not been able to set breakpoints by name either.

In the meantime, we tried some "printf debugging" to narrow down where segfault is coming from, and it appears to be coming from:

https://github.com/ut-osa/assise/blob/master/libfs/src/experimental/leases.c#L120

Specifically that the hash_handle is not defined.

@simpeter
Copy link

simpeter commented Nov 18, 2021 via email

@wreda
Copy link
Contributor

wreda commented Nov 19, 2021

I believe I found the issue. There was a bug that resulted in the lease table not being properly initialized. I just pushed a fix for it. Please pull and let me know if you run into any other problems.

@cmolder
Copy link
Contributor Author

cmolder commented Nov 21, 2021

Your fix works great. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants