UCP/CORE: Increment completion count before calling uct_ep_flush #7332

Merged (1 commit) on Sep 3, 2021


dmitrygx (Member) commented Sep 2, 2021

What

Increment completion count before calling uct_ep_flush().

Why?

Fixes the following assertion failure in the UD endpoint flush, which expects the completion counter to be greater than 0 whenever a completion is passed.

[swx-ucx02:54317:0:54317]       ud_ep.c:1001 Assertion `comp->count > 0' failed
[1630582789.248645] [swx-ucx02:54303:0]          ucp_ep.c:1107 UCX  ERROR ep 0x7fba25d451c8: error 'Destination is unreachable' on NULL lane will not be handled since no error callback is installed
[1630582789.248588] [swx-ucx02:54317:0]          ucp_ep.c:1107 UCX  ERROR ep 0x7ff6b2f2c098: error 'Destination is unreachable' on NULL lane will not be handled since no error callback is installed

/labhome/dmitrygla/work_auto/ucx/src/uct/ib/ud/base/ud_ep.c: [ uct_ud_ep_flush_nolock() ]
      ...
      998      * released when the current sequence number is completed.
      999      */
     1000     if (comp != NULL) {
==>  1001         ucs_assert(comp->count > 0);
     1002
     1003         skb = ucs_mpool_get(&iface->tx.mp);
     1004         if (skb == NULL) {

==== backtrace (tid:  54317) ====
 0 0x0000000000166934 uct_ud_ep_flush_nolock()  /labhome/dmitrygla/work_auto/ucx/src/uct/ib/ud/base/ud_ep.c:1001
 1 0x0000000000166db7 uct_ud_ep_flush()  /labhome/dmitrygla/work_auto/ucx/src/uct/ib/ud/base/ud_ep.c:1061
 2 0x0000000000063769 uct_ep_flush()  /labhome/dmitrygla/work_auto/ucx/src/uct/api/uct.h:3050
 3 0x0000000000063899 ucp_worker_discard_uct_ep_progress()  /labhome/dmitrygla/work_auto/ucx/src/ucp/core/ucp_worker.c:2315
 4 0x000000000006811c ucp_worker_discard_tl_uct_ep()  /labhome/dmitrygla/work_auto/ucx/src/ucp/core/ucp_worker.c:3052
 5 0x00000000000683d2 ucp_worker_discard_uct_ep()  /labhome/dmitrygla/work_auto/ucx/src/ucp/core/ucp_worker.c:3110
 6 0x000000000015bce5 ucp_wireup_ep_discard_aux_ep()  /labhome/dmitrygla/work_auto/ucx/src/ucp/wireup/wireup_ep.c:348
 7 0x000000000006819b ucp_worker_discard_wireup_ep()  /labhome/dmitrygla/work_auto/ucx/src/ucp/core/ucp_worker.c:3064
 8 0x00000000000683aa ucp_worker_discard_uct_ep()  /labhome/dmitrygla/work_auto/ucx/src/ucp/core/ucp_worker.c:3101
 9 0x000000000003e474 ucp_ep_discard_lanes()  /labhome/dmitrygla/work_auto/ucx/src/ucp/core/ucp_ep.c:1317
10 0x000000000003cf30 ucp_ep_set_failed()  /labhome/dmitrygla/work_auto/ucx/src/ucp/core/ucp_ep.c:1084
11 0x000000000003963d ucp_ep_set_failed_progress()  /labhome/dmitrygla/work_auto/ucx/src/ucp/core/ucp_ep.c:272
12 0x00000000000564ed ucs_callbackq_slow_proxy()  /labhome/dmitrygla/work_auto/ucx/src/ucs/datastruct/callbackq.c:402
13 0x0000000000057f0d ucs_callbackq_dispatch()  /labhome/dmitrygla/work_auto/ucx/src/ucs/datastruct/callbackq.h:211
14 0x00000000000647e8 uct_worker_progress()  /labhome/dmitrygla/work_auto/ucx/src/uct/api/uct.h:2592
15 0x000000000003674c opal_progress()  /build-result/src/hpcx-gcc-redhat7.6/ompi-1454445/opal/runtime/opal_progress.c:231
16 0x000000000004d3c5 ompi_request_wait_completion()  /build-result/src/hpcx-gcc-redhat7.6/ompi-1454445/ompi/../ompi/request/request.h:437
17 0x000000000004d3c5 ompi_request_default_wait()  /build-result/src/hpcx-gcc-redhat7.6/ompi-1454445/ompi/request/req_wait.c:42
18 0x00000000000312e5 ompi_comm_set()  /build-result/src/hpcx-gcc-redhat7.6/ompi-1454445/ompi/communicator/comm.c:123
19 0x00000000000381f5 ompi_dpm_connect_accept()  /build-result/src/hpcx-gcc-redhat7.6/ompi-1454445/ompi/dpm/dpm.c:518
20 0x000000000003cf17 ompi_dpm_dyn_init()  /build-result/src/hpcx-gcc-redhat7.6/ompi-1454445/ompi/dpm/dpm.c:1063
21 0x00000000000b0eb2 ompi_mpi_init()  /build-result/src/hpcx-gcc-redhat7.6/ompi-1454445/ompi/runtime/ompi_mpi_init.c:972
22 0x000000000006bb3b PMPI_Init()  /build-result/src/hpcx-gcc-redhat7.6/ompi-1454445/ompi/mpi/c/profile/pinit.c:67
23 0x0000000000400a2e main()  /hpc/mtr_scrap/users/dmitrygla/mpi/mpi_comm_spawn_ep_reconfig_bug/worker.c:17
24 0x00000000000223d5 __libc_start_main()  ???:0
25 0x0000000000400939 _start()  ???:0
=================================
[swx-ucx02:54317:0:54317] Process frozen...

How?

  1. Increment the completion counter before calling uct_ep_flush().
  2. Decrement the completion counter if uct_ep_flush() returns UCS_OK.
  3. Remove the increment of the completion counter that was previously done when uct_ep_flush() returned UCS_INPROGRESS.

@dmitrygx dmitrygx force-pushed the topic/ucp/discard_flush branch from a21f072 to df16fe6 Compare September 2, 2021 13:45

out_comp_count_dec:
    --req->send.state.uct_comp.count;
    ucs_assert(req->send.state.uct_comp.count == 0);
Contributor

Why assert == 0? Are we sure it started at 0? Maybe assert >= 0?

Member Author

yes, it should be either 0 or 1

return UCS_OK;
} else if (status == UCS_ERR_NO_RESOURCE) {
return UCS_ERR_NO_RESOURCE;
goto out_comp_count_dec;
Contributor

We cannot touch the request after calling ucp_worker_discard_uct_ep_flush_comp(). Maybe:

if (status == UCS_INPROGRESS) {
    return status; // or goto out
}

--req->send.state.uct_comp.count;
if (status != UCS_ERR_NO_RESOURCE) {
    uct_completion_update_status(&req->send.state.uct_comp, status);
    ucp_worker_discard_uct_ep_flush_comp(&req->send.state.uct_comp);
    status = UCS_OK;
}

return status;

Member Author

good catch!
done

@dmitrygx dmitrygx force-pushed the topic/ucp/discard_flush branch 2 times, most recently from c952e33 to 38dbfcd Compare September 2, 2021 20:46
    --req->send.state.uct_comp.count;
    ucs_assert(req->send.state.uct_comp.count == 0);

    if (status != UCS_ERR_NO_RESOURCE) {
Contributor

A slightly better suggestion:

if (status == UCS_ERR_NO_RESOURCE) {
    return UCS_ERR_NO_RESOURCE;
}

uct_completion_update_status(&req->send.state.uct_comp, status);
ucp_worker_discard_uct_ep_flush_comp(&req->send.state.uct_comp);
return UCS_OK;

Member Author

thanks, done

@dmitrygx dmitrygx force-pushed the topic/ucp/discard_flush branch from 38dbfcd to a390687 Compare September 3, 2021 07:31
@changchengx
Contributor

Since the req->send.state.uct_comp.count = 0 change from #6933 has been ported to yosefe#223, #7332 also needs to be ported to yosefe#223 at the same time.
