Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/verbs: Non blocking EP creation (asynchronous route resolution using event channel) #9543

Merged
merged 9 commits into from
Mar 18, 2024

Conversation

sydidelot
Copy link
Member

@sydidelot sydidelot commented Nov 13, 2023

The PR reworks the verbs provider to avoid blocking the caller in rdma_resolve_route().
The solution I implemented is based on the following TODO from vrb_create_ep():

       /* TODO convert this call to non-blocking (use event channel) as well:
        * This may likely be needed for better scaling when running large
        * MPI jobs.
        * Making this non-blocking would mean we can't create QP at EP enable
        * time. We need to wait for RDMA_CM_EVENT_ADDR_RESOLVED event before
        * creating the QP using rdma_create_qp. It would also require a SW
        * receive queue to store recvs posted by app after enabling the EP.
        */

Fixes #9494

@sydidelot
Copy link
Member Author

@j-xiong, @chien-intel it seems that you are the most active developers for the verbs provider :)
May I request your feedback on this patch? Thank you!

"dst addr", (*rai)->ai_dst_addr);
ret = -errno;
goto err2;
if (node) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible that src_addr and/or dest_addr are supplied directly in hints instead of node/service. I think we can check (*rai)->ai_dst_addr instead.

BTW, there is a typo in the title of this commit message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested your suggestion but it doesn't work as expected: the problem is that (*rai)->ai_dst_addr is set even if no hints are provided. I'd assume it's the behavior of rdma_getaddrinfo() to set ai_dst_addr when the node is given.
If dest_addr is supplied, I would expect the user application to give NULL for the node argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this check work node || hints->dest_addr? This matches with @j-xiong comment.

I believe we would have already exited the call if FI_SOURCE was set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shefty Thanks for the suggestion, it looks like it works. I'll update the PR with this change.

prov/verbs/src/verbs_cq.c Show resolved Hide resolved
prov/verbs/src/verbs_ep.c Show resolved Hide resolved
prov/verbs/src/verbs_cm.c Show resolved Hide resolved
@sydidelot sydidelot force-pushed the 9494 branch 2 times, most recently from 907274a to e0cb324 Compare November 15, 2023 13:18
@sydidelot
Copy link
Member Author

I added an extra patch to the series. I believe there are some core paths where connection establishment may fail and the provider doesn't report the failure to the EQ.

@sydidelot
Copy link
Member Author

@shijin-aws Sorry to ping you directly. Can you please take a look at the AWS CI results and tell me if the error relates to my changes?
It's a pity that all CI pipelines are now private an there is no output when a test fail... UCX is so much better in this aspect.

Comment on lines 121 to 122
if (wr->num_sge > vrb_gl_data.def_tx_iov_limit)
return -FI_EINVAL;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use def_rx_iov_limit instead of tx.
This check should be at the highest level and consistent across all post_recv calls.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, should be rx instead of tx.
Do you suggest to move this check to vrb_post_recv()?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know what the future holds. Placing the check in vrb_post_recv would be most generic but with current code base the best place is vrb_msg_ep_recvmsg.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vrb_post_recv() is also called from vrb_dgram_ep_recvmsg().
I think it seems better to have the check directly in vrb_post_recv() as it's more generic (as you said).

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to run some tests with this PR before Thanksgiving break.

prov/verbs/src/verbs_ep.c Show resolved Hide resolved
VRB_WARN_ERRNO(FI_LOG_EP_CTRL, "rdma_create_qp");
return -errno;
}
if (ep->state == VRB_REQ_RCVD) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change. In general I'm parnoid about state management. There is only one place where state is set to VRB_REQ_RCVD and as far as I can tell, the QP is already created at that point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vrb_eq_addr_resolved_event() only creates the QP for the connection initiator. The target side needs to create the QP when the EP is enabled, after RDMA_CM_EVENT_CONNECT_REQUEST is received (EP state is VRB_REQ_RCVD in this case).
To be honest, I don't like this check neither. Do you have a better suggestion? Is there a way to determine what side enables the EP, for example based on the content of ep->id?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the check and the code now uses ep->id->verbs to figure out what side (server or client) executes the function. I looked at rdma-core and it seems that ep->id->verbs is set only when the addr or the route are resolved:
https://github.com/linux-rdma/rdma-core/blob/0cf342c301dfef1ba31c7ab763682ed0c59e7763/librdmacm/cma.c#L618

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id->verbs field is ibv_context and its creation/assignment is not related to addr/route resolution.

Please consider creating QP in RDMA_CM_EVENT_CONNECT_REQUEST handler see https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/man/rdma_cm.7

Copy link
Member Author

@sydidelot sydidelot Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot move QP creation in RDMA_CM_EVENT_CONNECT_REQUEST handler because rdma_create_qp() requires to set the QP init attributes (ibv_qp_init_attr), and these attributes are known only when the EP is enabled. The EP doesn't even exist when the event RDMA_CM_EVENT_CONNECT_REQUEST is received.

id->verbs field is ibv_context and its creation/assignment is not related to addr/route resolution.

I agree with your statement, but in practice and according to the code, id->verbs is set when the addr/route is resolved.

An alternative solution to find out if the route was resolved would be to check if rdma_cm_id::route::num_paths is > 0 or rdma_cm_id::route::path_rec is != NULL.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still running tests, trying out large MPI jobs. It is taking a bit of time.

@sydidelot sydidelot force-pushed the 9494 branch 3 times, most recently from 8218aec to 2c4800a Compare November 22, 2023 14:09
prov/verbs/src/verbs_cm.c Show resolved Hide resolved
"dst addr", (*rai)->ai_dst_addr);
ret = -errno;
goto err2;
if (node) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this check work node || hints->dest_addr? This matches with @j-xiong comment.

I believe we would have already exited the call if FI_SOURCE was set.

prov/verbs/src/verbs_cm.c Show resolved Hide resolved
@sydidelot
Copy link
Member Author

@chien-intel @shefty @j-xiong While testing the changes, I found an issue when multiple verbs adapters are being used. The issue was found with RoCE but I believe Infiniband adapters are also impacted.
Note: I am aware of the issues when running multiple local interfaces in the same network subnet. This a different issue here.

When fi_getinfo() is executed with hints->src_addr but without node or hints->dest_addr, the function may return domains that do not match the source address given in hints->src_addr. The bug doesn't reproduce without my changes.

I think the bug is in vrb_set_info_addrs(), which gets executed with the following call stack:

vrb_getinfo()
 vrb_get_match_infos()
  vrb_handle_sock_addr() /* id->verbs is NULL because `rdma_resolve_addr()` was not called */
   vrb_fill_addr() /* The condition rai->ai_src_addr && !ofi_is_loopback_addr(rai->ai_src_addr is true */
    vrb_set_info_addrs() 

The problem is that vrb_set_info_addrs() updates all the info structures with the same rdma_addrinfo() info (see function vrb_rai_to_fi()).
However, there is no guarantee that all the info structures listed there have the same source address as rai->ai_src_addr.
I believe we need to check if iter->src_addr matches rai->ai_src_addr in vrb_set_info_addrs() and remove the info from the list if so.

The PR doesn't address this issue. I decided to work around the bug in the application: I skip all the info returned by fi_getinfo() that does not match the source address provided in the hints.

@ghost
Copy link

ghost commented Dec 1, 2023

@chien-intel @shefty @j-xiong While testing the changes, I found an issue when multiple verbs adapters are being used. The issue was found with RoCE but I believe Infiniband adapters are also impacted. Note: I am aware of the issues when running multiple local interfaces in the same network subnet. This a different issue here.

I did notice some issues when I included nodes with multiple infiniband adapters on same logical subnet. I will create an issue for this when I have more info with easy reproducer.

@sydidelot
Copy link
Member Author

@chien-intel it's likely the same issue as the one I described above.
Do you see warning messages similar to this one in your logs when you open an endpoint (active side)?
verbs domain warn vrb_open_ep:1220 Invalid info->domain_attr->name: rocep65s0f0 and rocep193s0f0

@sydidelot
Copy link
Member Author

@chien-intel Any update for this PR?

prov/verbs/src/verbs_eq.c Show resolved Hide resolved
}
/* Server-side QP creation, after RDMA_CM_EVENT_CONNECT_REQUEST
* is recevied */
if (ep->id->verbs) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still having a hard time with this one. Using internal variable from another library for this purpose is just a bad idea. Unfortunately, I do not have an alternative for you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verbs provider already has a couple of existing conditions based on id->verbs. For example:
https://github.com/ofiwg/libfabric/blob/main/prov/verbs/src/verbs_info.c#L1316
https://github.com/ofiwg/libfabric/blob/main/prov/verbs/src/verbs_info.c#L1789

There is even this comment in the code:

	/* Handle the case when rdma_cm doesn't fill src address even
	 * though it fills the destination address (presence of id->verbs
	 * corresponds to a valid dest addr) */

See https://github.com/ofiwg/libfabric/blob/main/prov/verbs/src/verbs_info.c#L1319C3-L1321C41

I'm not adding something new to the code.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have that kind of insight into libibverbs nor librdmacm and I certainly would not code that way. Just me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id->verbs is externally visible. It has to be that way to coordinate the use of librdmacm with libibverbs. struct rdma_cm_id has both visible and hidden fields. The hidden fields are not accessible to a user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct rdma_cm_id is part of the public API of RDMA CM. Referencing its member is fine when needed.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is one thing to use the value, it is another to infer behavior. Can we be certain when id->verbs is set, it is always after RDMA_CM_EVENT_CONNECT_REQUEST is received on the server side?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we rephrase the question to: Is id->verbs always set for an RDMA_CM_EVENT_CONNECT_REQUEST? The answer is yes. Connect requests are received over a specific device.

Chien's question is subtly different, however, and I think the answer to his question is no, which might be exposing an issue.

An rdma_cm_id is associated with a verbs device in 3 cases. On the client side, this is the purpose of rdma_resolve_addr(). Once that operation completes, we have a device. On the server side, all connection requests are associated with a device. The 3rd case, however, is through rdma_bind_addr().

The restructured connect() path calls enable() first, then rdma_resolve_addr(). If we're taking the client code path, then id->verbs is usually not set. However, if we went through rdma_bind_addr() prior to calling connect(), it may be. In that case, I think we would end up trying to create 2 QPs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the very, very late reply :)

I updated the code to check that ep->ibv_qp is NULL before creating the QP. The change is available here:
https://github.com/ofiwg/libfabric/pull/9543/files#diff-0c780e52c4b4880ab4a4012f359d1ae818bbdddcd650d9192f250f622e7bf927R1039

FYI, our product has been running with this PR for the last 2 months with no issue: plain Verbs (no RXM) and MSG endpoints. The only real issue we had was with the following patch that introduces EP states to the Verbs provider: #9175
The above PR was causing assertion failures during connection shutdown because the EP state was incorrect. We simply reverted this patch and everything works now.

Anyway, I'm still convinced this PR is important for upstream as some runtime models require all the calls to be non-blocking (I believe this is the case for DAOS too because of the SPDK model). Please review it when you can.

Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- I didn't see any actual problems with the changes.

@@ -933,6 +934,12 @@ do { \
( wr->opcode == IBV_WR_SEND || wr->opcode == IBV_WR_SEND_WITH_IMM \
|| wr->opcode == IBV_WR_RDMA_WRITE_WITH_IMM )

struct vrb_recv_wr {
struct slist_entry entry;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ibv_recv_wr has a next pointer which can be used to create a single-linked list. Using that requires duplicating part of the slist implementation, so I'm not sure yet in the review if it's worth pursuing this change, as this is all non-critical path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need some time to think about it. I'm not sure the change is worth it.

int ret;

if (wr->num_sge > vrb_gl_data.def_rx_iov_limit)
return -FI_EINVAL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say just assert here, but can we get the limit that was associated with the ep?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not assert here: wr->num_sge is set by the application when the receive buffer is posted: I think it's better to return -FI_EINVAL and let the application deal with this error.

I slightly modified the code based on your suggestion to use the limit that is associated with the ep and I added an assertion failure to ensure that the EP limit is effectively lower than vrb_gl_data.def_rx_iov_limit

assert(ep->info_attr.rx_iov_limit <= vrb_gl_data.def_rx_iov_limit); 
if (wr->num_sge > ep->info_attr.rx_iov_limit)                       
    return -FI_EINVAL;                                              

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the assert is always true. The requested limit is subject to the device limits, not necessarily the default limit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the pool of WRs belongs to the progress object (shared between EPs of the same domain) and the max number of SGEs is set to vrb_gl_data.def_rx_iov_limit. Also, note that a similar assert() was added by the PR at line 68 (same file).
One possible fix would be to detect the max rx_iov_limit set by the HCA and use this value when the pool of WRs is created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I believe ibv_post_recv() checks num_sge. So, I think we're okay moving this check into vrb_prepost_recv(), which is where we really need it. I agree if we can use the device limit when initializing the pool, that should cover (unlikely) edge cases. The check in vrb_prepost_recv() just need to verify that we don't overrun the local array. When the receives are actually posted to the device, vrb_post_recv_internal() -> ibv_post_recv() will check that the QP can handle it. A failure there is a user error, which at least will be handled (by tearing down the QP). Does this work for you?

prov/verbs/src/verbs_eq.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_eq.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_eq.c Outdated Show resolved Hide resolved
Copy link
Member Author

@sydidelot sydidelot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, @shefty. The last revision addresses all your review comments.

prov/verbs/src/verbs_eq.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_eq.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_eq.c Outdated Show resolved Hide resolved
Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the assert that's marked needs to be removed, but otherwise, this looks good. Thanks!

int ret;

if (wr->num_sge > vrb_gl_data.def_rx_iov_limit)
return -FI_EINVAL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the assert is always true. The requested limit is subject to the device limits, not necessarily the default limit.

Copy link
Member Author

@sydidelot sydidelot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shefty Thanks for the review! I did a few modifications related to the pool of WRs:

  • The pool is now created with the max number of SGEs defined by domain->info->rx_attr->iov_limit instead of the default value set by the provider.
  • I removed the 2 asserts that checked that the number of SGEs is lower than vrb_gl_data.def_rx_iov_limit. The code assumes that ep->info_attr.rx_size is always lower or equal to domain->info->rx_attr->iov_limit.

int ret;

if (wr->num_sge > vrb_gl_data.def_rx_iov_limit)
return -FI_EINVAL;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the pool of WRs belongs to the progress object (shared between EPs of the same domain) and the max number of SGEs is set to vrb_gl_data.def_rx_iov_limit. Also, note that a similar assert() was added by the PR at line 68 (same file).
One possible fix would be to detect the max rx_iov_limit set by the HCA and use this value when the pool of WRs is created.

@sydidelot
Copy link
Member Author

Can someone please look at the CI logs and tell me if the errors are related to the PR?

@j-xiong
Copy link
Contributor

j-xiong commented Mar 15, 2024

The CI job was cancelled in the middle probably due to infrastructure failure.

@sydidelot
Copy link
Member Author

sydidelot commented Mar 15, 2024

The CI job was cancelled in the middle probably due to infrastructure failure.

@j-xiong I see that you have re-triggered the job. Thanks!

@zachdworkin
Copy link
Contributor

It hit the same node failure. I am taking that node offline and will restart your job again. Sorry its taking so long

@sydidelot
Copy link
Member Author

It hit the same node failure. I am taking that node offline and will restart your job again. Sorry its taking so long

No worry. I know from experience how difficult it is to keep a CI pipeline running :)

@zachdworkin
Copy link
Contributor

@sydidelot thanks for your understanding. We had some updates to our controller's server that look like they may have waterfalled some issues to our server that we are fixing.

The provider should always use the destination address given to
fi_connect().

Signed-off-by: Sylvain Didelot <[email protected]>
…ided

The call to rdma_resolve_addr() in vrb_get_rai_id() would block the
caller for 2 seconds if the remote address is unreachable. Because
this function is executed in the context of fi_getinfo(), we cannot
convert it to a non-blocking call but we can mitigate the issue: we
can assume that the caller is interested in the destination address
only if the node or the dest address (in the hints) is provided.
For some application, the destination address can be exchanged
out-of-bound and does not required to be resolved by fi_getinfo().

Signed-off-by: Sylvain Didelot <[email protected]>
The patch changes the second argument of vrb_init_progress() to be
a struct fi_info. The current struct ibv_context is not used and we
will need to access to the struct fi_info associated to the domain
in a next patch in the series.

Signed-off-by: Sylvain Didelot <[email protected]>
We are going to make the EP creation completely non-blocking by using
event channels. With this design, we can no longer create the QP at
EP  enable time: we need to wait for the RDMA_CM_EVENT_ADDR_RESOLVED event.

The problem is that the app is allowed to post the receive buffers right
after the EP is enabled. To fix the issue, we need to introduce a software
receive queue to store the receive buffers that are posted by the
application before the EP is created.

The patch implements the code to store the recv work requests in a pool.

Signed-off-by: Sylvain Didelot <[email protected]>
The patch implements the code to store the preposted receive buffers
in a software queue if the QP is not available, i.e., ep->ibv_qp is NULL.

The list of preposted receive buffers is flushed when the QP is created
(change to come in a next patch) and the EP is closed.

Signed-off-by: Sylvain Didelot <[email protected]>
With the upcoming changes, we will no longer know the sockaddr address
family when fi_connect() is executed. This piece of information will be
known only after when the route is resolved.

The patch updates the code to always allocate the CM header but skip
it when the route is resolved (if needed).

Signed-off-by: Sylvain Didelot <[email protected]>
The patch implements the code to handle the EQ events when the
route is resolved. This new code will be enabled in a next patch.

Signed-off-by: Sylvain Didelot <[email protected]>
The patch implements the last changes to make the EP creation
non-blocking.

The address is now resolved asynchronously when connection is established,
which does no longer block the caller. The connection initiator creates
the QP when the route is resolved. As for the connection target, it
creates the QP when the EP is enabled (same as it was before).

Note that the call to rdma_resolve_addr() in vrb_msg_ep_connect()
becomes asynchronous because the EP CM id is associated with an event
channel when the endpoint is bound to an event queue, which happend
before the EP is enabled.

Signed-off-by: Sylvain Didelot <[email protected]>
Fix two code paths where some connection establishment may fail
and we don't report the error to the EQ.

Signed-off-by: Sylvain Didelot <[email protected]>
@sydidelot
Copy link
Member Author

@zachdworkin I rebased the PR but it still fails. Please let me know when the CI infrastructure is fixed.

@zachdworkin
Copy link
Contributor

@sydidelot Will do. Keep an eye on #9898 which should contain a fix for the ze-shm-v3 stage that failed in your PR.

@j-xiong
Copy link
Contributor

j-xiong commented Mar 18, 2024

The CI error is a known issue that has been fixed by #9898. I am going to ignore that and merge this PR.

@j-xiong j-xiong merged commit 4860135 into ofiwg:main Mar 18, 2024
10 of 13 checks passed
@sydidelot sydidelot deleted the 9494 branch March 20, 2024 06:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prov/verbs: EP creation blocks during RDMA address resolution
4 participants