GW server logs notice on keep alive timeout and disconnects host from subsystem #161
From the logs we have, my guess is that the issue is something like a socket hang in the Gateway. We observe the following errors in the log:
Eventually the SPDK process is killed because it stops responding to the RPC keep-alive. That may indicate the same issue - a socket hang: "ERROR:control.server:spdk_get_version failed with:
Does SPDK create a core dump upon termination? It would be great to see the back-trace of the SPDK processes.
@leonidc Yes, it is reproducible; I was able to see this three times. The disconnection is issued against all 3 hosts that are connected to the 3 individual subsystems on this GW.
Tagging @sdpeters @trociny @PepperJo @idryomov for more insights
Depends how the gateway terminates it. If the problem can be detected (failing host connections or IOs) before the SPDK process is killed, you could run gcore on the SPDK process to get a core dump (or do it with gdb). I don't have any theories about what the problem is, but with a dump you could at least look for threads waiting for locks, etc. Is nvmf_tcp logging already enabled (I haven't looked at these logs yet so I don't know)? You can do that at some convenient time (before things start failing) with the log_set_flag JSON RPC command.
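As a rough sketch of the two steps suggested here - enabling nvmf_tcp logging ahead of time and grabbing a core of the still-running SPDK process once failures are detected - something like the following could be scripted; the rpc.py path, socket path, and output prefix are assumptions for illustration, not taken from this setup:

```python
import subprocess

# Paths below are assumptions; adjust to the actual install.
SPDK_RPC = "/usr/libexec/spdk/scripts/rpc.py"
SPDK_RPC_SOCK = "/var/tmp/spdk.sock"   # default SPDK RPC socket


def enable_nvmf_tcp_logging():
    """Turn on the nvmf_tcp log flag before things start failing."""
    subprocess.run(
        [SPDK_RPC, "-s", SPDK_RPC_SOCK, "log_set_flag", "nvmf_tcp"],
        check=True,
    )


def dump_spdk_core(spdk_pid: int, out_prefix: str = "/tmp/spdk-core"):
    """Capture a core of the still-running SPDK process with gcore
    (requires gdb to be installed); the process keeps running afterwards."""
    subprocess.run(["gcore", "-o", out_prefix, str(spdk_pid)], check=True)
```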
I see the discovery controller we're talking about here is the one in the SPDK gateway process. Though the SPDK discovery KATO issue linked above seems to have been fixed, it's worth asking whether this problem occurs when the host doesn't connect to that discovery controller (i.e. the subsystem ports are configured manually on the host instead). I say this because the POR is for hosts to use the separate cluster-wide discovery controller we'll include with the gateway rather than the discovery controllers in each gateway process, since the SPDK discovery controllers are only aware of the subsystems in that same SPDK process. Hosts need to discover all the subsystems and ports in all the Ceph NVMe-oF gateway nodes (see #190). I see these log messages from the SPDK process related to the discovery controller:
So if we needed to use this discovery controller, it seems it would have to be added to the TCP listener. Given its limitations, we should probably disable it instead.
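For the experiment of bypassing the per-gateway SPDK discovery controller, a minimal sketch of configuring the subsystem port manually on the host with nvme-cli follows; the address, service ID, and NQN are placeholders, not values from this setup:

```python
import subprocess

# Placeholder values for illustration only.
GW_ADDR = "10.0.0.1"
SUBSYS_NQN = "nqn.2016-06.io.spdk:cnode1"


def connect_without_discovery(addr: str = GW_ADDR, nqn: str = SUBSYS_NQN,
                              svcid: str = "4420") -> None:
    """Connect the host directly to a known subsystem port, skipping the
    SPDK discovery controller entirely (equivalent to running
    `nvme connect -t tcp -a <addr> -s <svcid> -n <nqn>` by hand)."""
    subprocess.run(
        ["nvme", "connect", "-t", "tcp", "-a", addr, "-s", svcid, "-n", nqn],
        check=True,
    )
```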
Modified server.py not to kill SPDK when it becomes unresponsive. In the captured back-trace, Thread 1 (Thread 0x7efe8dcf6700 (LWP 181668) "reactor_0") usually waits on epoll_wait (most of the time). But we need more dumps to analyze.
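Roughly the kind of server.py change described here - keep polling the RPC socket but leave a hung SPDK process alive so a core can still be collected - as a sketch; the function names and arguments are illustrative, not the actual ceph-nvmeof code:

```python
import logging
import time

log = logging.getLogger("control.server")


def keep_alive_loop(spdk_rpc_ping, spdk_process, interval=5.0,
                    kill_on_timeout=False):
    """Poll the SPDK RPC socket periodically.

    spdk_rpc_ping: callable that raises on RPC failure (e.g. a wrapper
                   around spdk_get_version).
    spdk_process:  subprocess.Popen handle of the SPDK target.
    """
    while spdk_process.poll() is None:
        try:
            spdk_rpc_ping()
        except Exception as err:
            log.error(f"spdk_get_version failed with: {err}")
            if kill_on_timeout:
                spdk_process.terminate()   # original behaviour
                return
            # Debug mode: leave the hung SPDK process running so that
            # gcore/gdb can still attach and collect a back-trace.
            log.warning("keep-alive failed; leaving SPDK running for debugging")
        time.sleep(interval)
```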
There's something funny about that stack trace line. There's no symbol named sock_group_impl_poll_coumax_eventsnt, but sock_group_impl_poll_count and max_events exist. Did gdb display it that way, or was this a copy and paste issue between your terminal and github?
I suspect that thread was not blocked in _sock_flush(), but you just caught that poller at that point when you stopped the process. Everything in _sock_flush() that might block (SSL_writev() or sendmsg()) would leave another stack frame. If you stop this with gdb again, it might be worth stepping through that thread for a while to see if it's looping endlessly over a (corrupted) TAILQ (unlikely), or just busily polling and not making any progress. If it ever returns, it's the latter.

Actually it probably makes more sense to first try this without using the SPDK discovery controller. If the problem still occurs, then it's likely a customer will experience it even when using the gateway's cluster discovery service. If not, then we can work around the issue for this test.
@sdpeters, sock_group_impl_poll_coumax_eventsnt - I checked now, and that is how it looks in our IBM internal box; it was pasted there by @rahullepakshi. He used the gdb command "thread apply all bt". It was the last shown thread's stack before gdb was disconnected from the process.
The issue seems to be fixed after bumping the SPDK version to 23.01.1 LTS. Closing.
Re-opening this issue, as it is seen again with SPDK v23.01.1. On a VM, the test is to create 256 namespaces in a single subsystem on the Gateway node and run IO in parallel for each bdev/namespace created. The failure is seen after 183 namespaces are created.
@caroav Can you provide me logs or any other information showing that this issue might be fixed in SPDK v23.01.1, as stated in #161 (comment)? I wanted to cross-check with what I have here when recreating at this version.
@leonidc does this failure look the same as the original failure of this issue?
@rahullepakshi, the errors look the same as the original failure. Can you please check /var/log/messages for messages like "heartbeat_check" from the OSDs and "oom_kill_process"?
@rahullepakshi I assume this is due to 1000s of threads being created, because currently we don't share the Ceph client instance in SPDK. This will make SPDK really slow and it might miss timeouts, etc.
In addition to @PepperJo's comments, we also asked @rahullepakshi to run with the TCP memory configuration we set in the container (transport_tcp_options = {"in_capsule_data_size" : 8192, "max_io_qpairs_per_ctrlr" : 7}).
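For reference, these knobs map onto parameters of SPDK's nvmf_create_transport RPC; a small sketch of how the quoted setting could be parsed and merged into that call's arguments (the RPC invocation itself is elided):

```python
import json

# The setting quoted above, as a raw config string.
transport_tcp_options = (
    '{"in_capsule_data_size" : 8192, "max_io_qpairs_per_ctrlr" : 7}'
)

# Merge the options into the argument set for nvmf_create_transport.
args = {"trtype": "TCP"}
args.update(json.loads(transport_tcp_options))
# args == {"trtype": "TCP", "in_capsule_data_size": 8192,
#          "max_io_qpairs_per_ctrlr": 7}
```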
@leonidc I do not see any "heartbeat_check" messages from the OSDs or any "oom_kill_process" messages in this case. A few observations:
@rahullepakshi it is my understanding now that with 256 namespaces it is working fine if the GW has 16 GB RAM. Please confirm. It fails with the 8 GB RAM configuration. That might be addressed, if needed, in another issue for Squid.
That's correct @caroav
With the upstream container build on VMs (logs will open only within the Red Hat network).
@rahullepakshi can you retest it with 0.0.4? We want to know if a node with 16 GB RAM can handle 400 namespaces with IOs. We believe it should be better in 0.0.4 because of PR #230.
OK, sure @caroav. I will plan for it this week if possible.
Just tested with a 4 GB RAM node, 0.0.4 and a recent Ceph build, but with no IOs, as the automation test runs fail due to an "nvme discover" command failure to discover subsystems from the client node. It fails at namespace 191. I will also update for the 16 GB RAM node once the issue is fixed.
This was fixed a long time ago. Discussed with @rahullepakshi and we agreed to close.
Not sure if this is a config miss or related to anything else, but does anyone have an idea about the below from the GW server logs?
This is seen around 60 minutes after discovery and connection from the Linux initiator/host. In this span I added 115 namespaces to this subsystem, ran some I/O from the host, and then left it idle for about 40 minutes, after which I could not list NVMe volumes on the initiator; probably because of the above logs it got disconnected.
Eventually the GW crashed, but I am unsure if this is the reason for it.
Complete logs -
A few references -
spdk/spdk#1089
linux-nvme/nvme-cli#624
Observed it again -
Test suite- https://github.com/rahullepakshi/cephci/blob/openstack_nvmf_scale/suites/quincy/nvmeof/ceph_nvmeof_openstack_scale.yaml
(I had to copy-paste to pad.ceph as I did not find another place to attach logs that can be viewed by all.)
https://pad.ceph.com/p/test_failure_log -> https://github.com/rahullepakshi/cephci/blob/openstack_nvmf_scale/suites/quincy/nvmeof/ceph_nvmeof_openstack_scale.yaml#L148-L180
GW crash log - https://pad.ceph.com/p/KATO_ceph_NVMeTCP_issue