ibv: propagate send/recv failure to user #24
Whoops, I misread the…
Currently working on a proper fix for this in my branch. It's mostly straightforward, except for the three-way-handshake rendezvous send/recv protocol, where I'm not sure what to do. If the RTR or RDMA send fails (when called during the progress call), e.g. due to send queue overflow, the failure can't be propagated to the user, since the user called progress, not send. The call could simply be retried, but my thought is that in some situations this could deadlock: the send queue overflows, progress() attempts the RTR or RDMA send, and the send queue can't be drained because the send completion queue is full, so we deadlock (we can't recursively call progress). That said, this situation seems unlikely to me; we create CQs with 16,384 entries, which should be more than enough (that's currently 16x what I use for the send queue length; I increased the send queue from 256 to 1024 while debugging this problem). I just don't know whether retrying in the progress call is "good" behavior. Thoughts? Pinging @danghvu @snirmarc @JiakunYan
You also run into a potential race when the RTR send fails at the client end (i.e. the caller of the recv). My first thought for an RTR send failure at the "client" would be to reinsert the recv as a Server entry; when the user retries the recv it should be re-matched and (hopefully) the send completed. However, if between the match occurring and the send failing a different thread posts a new recv, reinserting the recv as a Server entry would cause an erroneous match with this different Client recv. This could be worked around with a new "Reinsert" entry type (which would function the same as Server, but not attempt to match with any Client recvs); however, this would break the model by allowing more than a single type of entry in the hash table. I'm not sure if doing so would cause issues: in theory, the user should (at some point) retry the failed recv, which should match with the reinserted entry (or another recv with the same rank and tag should do so), but maybe there's some edge case I'm not seeing?
It turns out my initial thought here is incorrect; it appears that polling the completion queue is necessary to remove entries from the send queue. We need to either a) have a dedicated QP for internal sends on the progress thread, or b) count the number of outstanding sends and ensure the send queue always keeps some headroom free for internal sends. I'll see if this second solution can work, but I'm not a huge fan...
I'll note that… An alternative solution could be to split polling the recv and send CQs into separate functions; then, if the RTS or RDMA send fails, the send CQ can be polled and processed to clear entries in the send queue. Completing sends should never generate more communications. However, this does mean that…
We can create our own send queue to hold all the overflowed send requests and retry them later. We can also have a flag to indicate whether the IBV send queue has overflowed, and prevent new…
Thinking further, this is really a fundamental problem: we only get a chance to exert back pressure on the user at the start of…
Somewhat related to this issue, I've encountered a situation where the RTR can fail due to memory registration: prior to sending the RTR, the RDMA Write destination must be registered. If the registration cache is full and nothing can be evicted (all entries are in use for other RDMAs), then we segfault, since we don't check for errors. Even if we were to check for this condition, the correct behavior is again dubious. To be honest, I'm a bit confused why I'm hitting this condition. The cache, if I'm reading the code right, should have 8192 entries, but my send queue is length 1024, so it shouldn't be possible to have more than 1024 long sends from a single source to a single destination in flight at once. This is with only 2 processes... unless something is wrong with the eviction code? There are 10,000 "logically concurrent" sends, even though some are necessarily delayed by send queue credits; if entries aren't getting evicted when they're supposed to be, that could cause problems. Alternatively, RDMA Write completion at the origin does not imply completion at the target, so more RDMA Writes may be in flight than the length of the send queue implies. I don't think that's the case, though...
The ibv backend should propagate send/recv failures (i.e. retry errors) back to the user for back pressure. Current behavior is to not check for errors, i.e. ignore them; if `LCI_DEBUG` is enabled then errors are checked, but cannot be handled by the user.

In particular, I've encountered this when many sends are initiated in a short period of time, overwhelming the send queue (set statically at 256 entries in `server_ibv_helper.h`; see `qp_init_attr.cap.max_send_wr`). This results in `IBV_WC_RETRY_EXC_ERR` (`12`), "Transport Retry Counter Exceeded".

Correctly handling this may require some care to ensure we don't leak memory/packets or other resources; we need to return the packet to the pool on a send failure.