Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ibv: propagate send/recv failure to user #24

Open
omor1 opened this issue Jun 8, 2021 · 8 comments
Open

ibv: propagate send/recv failure to user #24

omor1 opened this issue Jun 8, 2021 · 8 comments

Comments

@omor1
Copy link
Member

omor1 commented Jun 8, 2021

The ibv backend should propagate send/recv failure i.e. retry errors back to the user for back-pressure. Current behavior is to not check for errors i.e. ignore them; if LCI_DEBUG is enabled then errors are checked, but cannot be handled by the user.

In particular, I've encountered this when many sends are initiated in a short period of time, overwhelming the send queue (set statically at 256 entries in server_ibv_helper.h; see qp_init_attr.cap.max_send_wr ). This results in IBV_WC_RETRY_EXC_ERR (12), "Transport Retry Counter Exceeded".

Correctly handling this may require some care to ensure we don't leak memory/packets or other resources; we need to return the packet to the pool on a send failure.

@omor1
Copy link
Member Author

omor1 commented Jun 9, 2021

Whoops, I misread the ibv_post_send; the error returned isn't an IBV_WC status, it's the value of errno i.e. a standard return value, in this case ENOMEM... which coincidentally has the integer value 12, the same as IBV_WC_RETRY_EXC_ERR.

@omor1
Copy link
Member Author

omor1 commented Jun 10, 2021

Currently working on a proper fix for this in my branch. It's mostly straightforward—except for the three-way handshake rendezvous send/recv protocol, where I'm not sure what to do. If the RTR or RDMA send fails (when called during the progress call), due to e.g. send queue overflow, this can't be propagated to the user—the user called progress, not send.

The call could simply be retried, but my thought is that in some situations this could result in a deadlock—send queue overflows, progress() attempts RTR or RDMA send, send queue can't be drained because send completion queue full => deadlock (can't recursively progress). That said, this situation seems unlikely to me; we create CQs with 16,384 entries, which should be more than enough (that's currently 16x what I use for the send queue length; I increased it from 256 to 1024 while debugging this problem).

I just don't know if retrying in the progress call is "good" behavior. Thoughts? Pinging @danghvu @snirmarc @JiakunYan

@omor1
Copy link
Member Author

omor1 commented Jun 10, 2021

You also run into a potential race when the RTR send fails at the client end (caller of recv versus during progress). An assumption of the hash table design is that all entries for a particular key ((rank << 32) | tag) are either "Client" entries or "Server" entries—all are either expected or unexpected. If a new data or RTS packet arrives, it is inserted as a "Server" entry; if it matches an existing "Client" entry, the insert fails and an RTR is sent. Conversely, when a new recv is posted, it is inserted as a new "Client" entry; if it matches an existing "Server" entry, the insert fails and an RTR is sent. In short, all entries for a key in the table must be of the same type, or they would have been matched and not inserted in the first place.

My first thought for an RTR send failure at the "client" would be to reinsert the recv as a Server entry; when the user retries the recv it should be re-matched and (hopefully) the send completed. However, if between the match occurring and the send failing a different thread posts a new recv, reinserting the recv as a Server entry would cause an erroneous match with this different Client recv. This could be worked around with a new "Reinsert" entry type (which would function the same as Server, but not attempt to match with any Client recvs; however, this would break the model by allowing more than a single type of entry in the hash table.

I'm not sure if doing so would cause issues—in theory, the user should (at some point) retry the failed recv which should match with the reinserted recv (or another recv with the same rank and tag should do so), but maybe there's some edge case I'm not seeing?

@omor1
Copy link
Member Author

omor1 commented Jun 10, 2021

The call could simply be retried, but my thought is that in some situations this could result in a deadlock—send queue overflows, progress() attempts RTR or RDMA send, send queue can't be drained because send completion queue full => deadlock (can't recursively progress). That said, this situation seems unlikely to me; we create CQs with 16,384 entries, which should be more than enough (that's currently 16x what I use for the send queue length; I increased it from 256 to 1024 while debugging this problem).

It turns out my initial thought here is incorrect; it appears that polling the completion queue is necessary to remove entries from the send queue. We need to either a) have a dedicated QP for internal sends on the progress thread or b) count number of sends and ensure that at least MAX_CQ = 16 entries are unused at all times i.e. prevent sends if outstanding_sends < MAX_CQ. The problem is that we need a separate virtual channel for these sends; this can either be a "real" channel i.e. QP or one we virtualize.

I'll see if this second solution can work, but I'm not a huge fan...

@omor1
Copy link
Member Author

omor1 commented Jun 10, 2021

I'll note that lc_server_progress for ibv polls the recv CQ before the send CQ; this can exacerbate the problem by preventing the progress function from even having the opportunity of completing sends.

An alternative solution could be to split polling recv and send CQs into separate functions; then if the RTS or RDMA send fails the send CQ can be polled and processed to clear entries in the send queue. Completing sends should never generate more communications. However, this does mean that lc_server_progress can take an arbitrary amount of time; must try send RTS/RDMA -> poll send CQ -> try send again -> poll send CQ, etc. until send succeeds. This breaks the LCI model so I disfavor it.

@JiakunYan
Copy link
Collaborator

We can create our own send queue to hold all the overflowed send requests and try them later. We can also have a flag to indicate whether the IBV send queue is overflowed, and prevent new sendd/recvd to be posted when the device is in the overflow status.

@omor1
Copy link
Member Author

omor1 commented Jun 11, 2021

Thinking further, this is really a fundamental problem that we only get a chance to exert back pressure on the user at the start of sendd (or in recvd when it finds a match in the table). The actual communication RDMA Write is started in the progress thread, as is the RTR if the receive is expected. If for whatever reason we are unable to start the communication at this point, we have no way of pushing back on the user; in fact, the user may unknowingly have already flooded us and doomed themself.

@omor1
Copy link
Member Author

omor1 commented Jun 19, 2021

Somewhat related to this issue, I've encountered a situation where the RTR can fail due to memory registration—prior to sending the RTR, the RDMA Write destination must be registered. If the registration cache is full and nothing can be evicted (all are in use for other RDMAs), then we segfault since we don't check for errors. Even if we were to check for this condition, the correct behavior is again dubious.

To be honest, I'm a bit confused why I'm hitting this condition. The cache, if I'm reading the code right, should have 8192 entries—but my send queue is length 1024, so it shouldn't be possible to have more than 1024 long sends from a single source to a single destination at once. This is with only 2 processes...

Unless something is wrong with the eviction code? There are 10,000 "logically concurrent" sends, even though some are necessarily delayed due to send queue credits, but if entries aren't getting evicted when they're supposed to be, that could cause problems.

Alternatively, RDMA Write completion at the origin does not imply completion at the target, so that more RMDA Writes are being done than the length of the send queue implies. I don't think that's the case though...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants