-
Notifications
You must be signed in to change notification settings - Fork 390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/verbs: Non blocking EP creation (asynchronous route resolution using event channel) #9543
Conversation
@j-xiong, @chien-intel it seems that you are the most active developers for the verbs provider :) |
prov/verbs/src/verbs_init.c
Outdated
"dst addr", (*rai)->ai_dst_addr); | ||
ret = -errno; | ||
goto err2; | ||
if (node) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible that src_addr and/or dest_addr are supplied directly in hints instead of node/service. I think we can check (*rai)->ai_dst_addr
instead.
BTW, there is a typo in the title of this commit message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested your suggestion but it doesn't work as expected: the problem is that (*rai)->ai_dst_addr
is set even if no hints are provided. I'd assume it's the behavior of rdma_getaddrinfo()
to set ai_dst_addr
when the node
is given.
If dest_addr
is supplied, I would expect the user application to give NULL for the node
argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this check work node || hints->dest_addr
? This matches with @j-xiong comment.
I believe we would have already exited the call if FI_SOURCE was set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shefty Thanks for the suggestion, it looks like it works. I'll update the PR with this change.
907274a
to
e0cb324
Compare
I added an extra patch to the series. I believe there are some core paths where connection establishment may fail and the provider doesn't report the failure to the EQ. |
@shijin-aws Sorry to ping you directly. Can you please take a look at the AWS CI results and tell me if the error relates to my changes? |
prov/verbs/src/verbs_ep.c
Outdated
if (wr->num_sge > vrb_gl_data.def_tx_iov_limit) | ||
return -FI_EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use def_rx_iov_limit instead of tx.
This check should be at the highest level and consistent across all post_recv calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, should be rx instead of tx.
Do you suggest to move this check to vrb_post_recv()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know what the future holds. Placing the check in vrb_post_recv would be most generic but with current code base the best place is vrb_msg_ep_recvmsg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vrb_post_recv()
is also called from vrb_dgram_ep_recvmsg()
.
I think it seems better to have the check directly in vrb_post_recv()
as it's more generic (as you said).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to run some tests with this PR before Thanksgiving break.
prov/verbs/src/verbs_ep.c
Outdated
VRB_WARN_ERRNO(FI_LOG_EP_CTRL, "rdma_create_qp"); | ||
return -errno; | ||
} | ||
if (ep->state == VRB_REQ_RCVD) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this change. In general I'm parnoid about state management. There is only one place where state is set to VRB_REQ_RCVD and as far as I can tell, the QP is already created at that point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vrb_eq_addr_resolved_event()
only creates the QP for the connection initiator. The target side needs to create the QP when the EP is enabled, after RDMA_CM_EVENT_CONNECT_REQUEST
is received (EP state is VRB_REQ_RCVD in this case).
To be honest, I don't like this check neither. Do you have a better suggestion? Is there a way to determine what side enables the EP, for example based on the content of ep->id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the check and the code now uses ep->id->verbs
to figure out what side (server
or client
) executes the function. I looked at rdma-core and it seems that ep->id->verbs
is set only when the addr or the route are resolved:
https://github.com/linux-rdma/rdma-core/blob/0cf342c301dfef1ba31c7ab763682ed0c59e7763/librdmacm/cma.c#L618
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id->verbs field is ibv_context and its creation/assignment is not related to addr/route resolution.
Please consider creating QP in RDMA_CM_EVENT_CONNECT_REQUEST handler see https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/man/rdma_cm.7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot move QP creation in RDMA_CM_EVENT_CONNECT_REQUEST
handler because rdma_create_qp()
requires to set the QP init attributes (ibv_qp_init_attr
), and these attributes are known only when the EP is enabled. The EP doesn't even exist when the event RDMA_CM_EVENT_CONNECT_REQUEST
is received.
id->verbs field is ibv_context and its creation/assignment is not related to addr/route resolution.
I agree with your statement, but in practice and according to the code, id->verbs is set when the addr/route is resolved.
An alternative solution to find out if the route was resolved would be to check if rdma_cm_id::route::num_paths
is > 0 or rdma_cm_id::route::path_rec
is != NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still running tests, trying out large MPI jobs. It is taking a bit of time.
8218aec
to
2c4800a
Compare
prov/verbs/src/verbs_init.c
Outdated
"dst addr", (*rai)->ai_dst_addr); | ||
ret = -errno; | ||
goto err2; | ||
if (node) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this check work node || hints->dest_addr
? This matches with @j-xiong comment.
I believe we would have already exited the call if FI_SOURCE was set.
@chien-intel @shefty @j-xiong While testing the changes, I found an issue when multiple verbs adapters are being used. The issue was found with RoCE but I believe Infiniband adapters are also impacted. When I think the bug is in
The problem is that The PR doesn't address this issue. I decided to work around the bug in the application: I skip all the info returned by |
I did notice some issues when I included nodes with multiple infiniband adapters on same logical subnet. I will create an issue for this when I have more info with easy reproducer. |
@chien-intel it's likely the same issue as the one I described above. |
@chien-intel Any update for this PR? |
prov/verbs/src/verbs_ep.c
Outdated
} | ||
/* Server-side QP creation, after RDMA_CM_EVENT_CONNECT_REQUEST | ||
* is recevied */ | ||
if (ep->id->verbs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still having a hard time with this one. Using internal variable from another library for this purpose is just a bad idea. Unfortunately, I do not have an alternative for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The verbs provider already has a couple of existing conditions based on id->verbs
. For example:
https://github.com/ofiwg/libfabric/blob/main/prov/verbs/src/verbs_info.c#L1316
https://github.com/ofiwg/libfabric/blob/main/prov/verbs/src/verbs_info.c#L1789
There is even this comment in the code:
/* Handle the case when rdma_cm doesn't fill src address even
* though it fills the destination address (presence of id->verbs
* corresponds to a valid dest addr) */
See https://github.com/ofiwg/libfabric/blob/main/prov/verbs/src/verbs_info.c#L1319C3-L1321C41
I'm not adding something new to the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have that kind of insight into libibverbs nor librdmacm and I certainly would not code that way. Just me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id->verbs is externally visible. It has to be that way to coordinate the use of librdmacm with libibverbs. struct rdma_cm_id has both visible and hidden fields. The hidden fields are not accessible to a user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct rdma_cm_id
is part of the public API of RDMA CM. Referencing its member is fine when needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is one thing to use the value, it is another to infer behavior. Can we be certain when id->verbs is set, it is always after RDMA_CM_EVENT_CONNECT_REQUEST is received on the server side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we rephrase the question to: Is id->verbs always set for an RDMA_CM_EVENT_CONNECT_REQUEST? The answer is yes. Connect requests are received over a specific device.
Chien's question is subtly different, however, and I think the answer to his question is no, which might be exposing an issue.
An rdma_cm_id is associated with a verbs device in 3 cases. On the client side, this is the purpose of rdma_resolve_addr(). Once that operation completes, we have a device. On the server side, all connection requests are associated with a device. The 3rd case, however, is through rdma_bind_addr().
The restructured connect() path calls enable() first, then rdma_resolve_addr(). If we're taking the client code path, then id->verbs is usually not set. However, if we went through rdma_bind_addr() prior to calling connect(), it may be. In that case, I think we would end up trying to create 2 QPs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the very, very late reply :)
I updated the code to check that ep->ibv_qp is NULL before creating the QP. The change is available here:
https://github.com/ofiwg/libfabric/pull/9543/files#diff-0c780e52c4b4880ab4a4012f359d1ae818bbdddcd650d9192f250f622e7bf927R1039
FYI, our product has been running with this PR for the last 2 months with no issue: plain Verbs (no RXM) and MSG endpoints. The only real issue we had was with the following patch that introduces EP states to the Verbs provider: #9175
The above PR was causing assertion failures during connection shutdown because the EP state was incorrect. We simply reverted this patch and everything works now.
Anyway, I'm still convinced this PR is important for upstream as some runtime models require all the calls to be non-blocking (I believe this is the case for DAOS too because of the SPDK model). Please review it when you can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good -- I didn't see any actual problems with the changes.
@@ -933,6 +934,12 @@ do { \ | |||
( wr->opcode == IBV_WR_SEND || wr->opcode == IBV_WR_SEND_WITH_IMM \ | |||
|| wr->opcode == IBV_WR_RDMA_WRITE_WITH_IMM ) | |||
|
|||
struct vrb_recv_wr { | |||
struct slist_entry entry; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ibv_recv_wr has a next pointer which can be used to create a single-linked list. Using that requires duplicating part of the slist implementation, so I'm not sure yet in the review if it's worth pursuing this change, as this is all non-critical path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need some time to think about it. I'm not sure the change is worth it.
int ret; | ||
|
||
if (wr->num_sge > vrb_gl_data.def_rx_iov_limit) | ||
return -FI_EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say just assert here, but can we get the limit that was associated with the ep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer not assert here: wr->num_sge
is set by the application when the receive buffer is posted: I think it's better to return -FI_EINVAL
and let the application deal with this error.
I slightly modified the code based on your suggestion to use the limit that is associated with the ep and I added an assertion failure to ensure that the EP limit is effectively lower than vrb_gl_data.def_rx_iov_limit
assert(ep->info_attr.rx_iov_limit <= vrb_gl_data.def_rx_iov_limit);
if (wr->num_sge > ep->info_attr.rx_iov_limit)
return -FI_EINVAL;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the assert is always true. The requested limit is subject to the device limits, not necessarily the default limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that the pool of WRs belongs to the progress object (shared between EPs of the same domain) and the max number of SGEs is set to vrb_gl_data.def_rx_iov_limit
. Also, note that a similar assert() was added by the PR at line 68 (same file).
One possible fix would be to detect the max rx_iov_limit set by the HCA and use this value when the pool of WRs is created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I believe ibv_post_recv() checks num_sge. So, I think we're okay moving this check into vrb_prepost_recv(), which is where we really need it. I agree if we can use the device limit when initializing the pool, that should cover (unlikely) edge cases. The check in vrb_prepost_recv() just need to verify that we don't overrun the local array. When the receives are actually posted to the device, vrb_post_recv_internal() -> ibv_post_recv() will check that the QP can handle it. A failure there is a user error, which at least will be handled (by tearing down the QP). Does this work for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, @shefty. The last revision addresses all your review comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the assert that's marked needs to be removed, but otherwise, this looks good. Thanks!
int ret; | ||
|
||
if (wr->num_sge > vrb_gl_data.def_rx_iov_limit) | ||
return -FI_EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the assert is always true. The requested limit is subject to the device limits, not necessarily the default limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shefty Thanks for the review! I did a few modifications related to the pool of WRs:
- The pool is now created with the max number of SGEs defined by
domain->info->rx_attr->iov_limit
instead of the default value set by the provider. - I removed the 2 asserts that checked that the number of SGEs is lower than
vrb_gl_data.def_rx_iov_limit
. The code assumes thatep->info_attr.rx_size
is always lower or equal todomain->info->rx_attr->iov_limit
.
int ret; | ||
|
||
if (wr->num_sge > vrb_gl_data.def_rx_iov_limit) | ||
return -FI_EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that the pool of WRs belongs to the progress object (shared between EPs of the same domain) and the max number of SGEs is set to vrb_gl_data.def_rx_iov_limit
. Also, note that a similar assert() was added by the PR at line 68 (same file).
One possible fix would be to detect the max rx_iov_limit set by the HCA and use this value when the pool of WRs is created.
Can someone please look at the CI logs and tell me if the errors are related to the PR? |
The CI job was cancelled in the middle probably due to infrastructure failure. |
@j-xiong I see that you have re-triggered the job. Thanks! |
It hit the same node failure. I am taking that node offline and will restart your job again. Sorry its taking so long |
No worry. I know from experience how difficult it is to keep a CI pipeline running :) |
@sydidelot thanks for your understanding. We had some updates to our controller's server that look like they may have waterfalled some issues to our server that we are fixing. |
The provider should always use the destination address given to fi_connect(). Signed-off-by: Sylvain Didelot <[email protected]>
…ided The call to rdma_resolve_addr() in vrb_get_rai_id() would block the caller for 2 seconds if the remote address is unreachable. Because this function is executed in the context of fi_getinfo(), we cannot convert it to a non-blocking call but we can mitigate the issue: we can assume that the caller is interested in the destination address only if the node or the dest address (in the hints) is provided. For some application, the destination address can be exchanged out-of-bound and does not required to be resolved by fi_getinfo(). Signed-off-by: Sylvain Didelot <[email protected]>
The patch changes the second argument of vrb_init_progress() to be a struct fi_info. The current struct ibv_context is not used and we will need to access to the struct fi_info associated to the domain in a next patch in the series. Signed-off-by: Sylvain Didelot <[email protected]>
We are going to make the EP creation completely non-blocking by using event channels. With this design, we can no longer create the QP at EP enable time: we need to wait for the RDMA_CM_EVENT_ADDR_RESOLVED event. The problem is that the app is allowed to post the receive buffers right after the EP is enabled. To fix the issue, we need to introduce a software receive queue to store the receive buffers that are posted by the application before the EP is created. The patch implements the code to store the recv work requests in a pool. Signed-off-by: Sylvain Didelot <[email protected]>
The patch implements the code to store the preposted receive buffers in a software queue if the QP is not available, i.e., ep->ibv_qp is NULL. The list of preposted receive buffers is flushed when the QP is created (change to come in a next patch) and the EP is closed. Signed-off-by: Sylvain Didelot <[email protected]>
With the upcoming changes, we will no longer know the sockaddr address family when fi_connect() is executed. This piece of information will be known only after when the route is resolved. The patch updates the code to always allocate the CM header but skip it when the route is resolved (if needed). Signed-off-by: Sylvain Didelot <[email protected]>
The patch implements the code to handle the EQ events when the route is resolved. This new code will be enabled in a next patch. Signed-off-by: Sylvain Didelot <[email protected]>
The patch implements the last changes to make the EP creation non-blocking. The address is now resolved asynchronously when connection is established, which does no longer block the caller. The connection initiator creates the QP when the route is resolved. As for the connection target, it creates the QP when the EP is enabled (same as it was before). Note that the call to rdma_resolve_addr() in vrb_msg_ep_connect() becomes asynchronous because the EP CM id is associated with an event channel when the endpoint is bound to an event queue, which happend before the EP is enabled. Signed-off-by: Sylvain Didelot <[email protected]>
Fix two code paths where some connection establishment may fail and we don't report the error to the EQ. Signed-off-by: Sylvain Didelot <[email protected]>
@zachdworkin I rebased the PR but it still fails. Please let me know when the CI infrastructure is fixed. |
@sydidelot Will do. Keep an eye on #9898 which should contain a fix for the ze-shm-v3 stage that failed in your PR. |
The CI error is a known issue that has been fixed by #9898. I am going to ignore that and merge this PR. |
The PR reworks the verbs provider to avoid blocking the caller in
rdma_resolve_route()
.The solution I implemented is based on the following TODO from
vrb_create_ep()
:Fixes #9494