Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/WIREUP: Don't discard CM lane #5882

Merged
merged 1 commit into from
Nov 9, 2020

Conversation

dmitrygx
Copy link
Member

@dmitrygx dmitrygx commented Nov 5, 2020

What

Don't discard CM lane.

Why ?

Don't schedule UCT CM lane to be discarded, since UCP EP will be destroyed due to peer failure and ucp_cm_disconnect_cb() could be invoked on async thread after UCP EP is destroyed and before UCT CM EP is destroyed from discarding functionality. So, UCP EP will passed as a corrupted argument to ucp_cm_disconnect_cb().
Fixes #5875

How ?

Add if check into the ucp_worker_iface_err_handle_progress() to understand whether the lane is CM or not:

  • if it is a CM lane, purge pending operations by calling uct_ep_pending_purge() and destroy UCT EP in-place by calling uct_ep_destroy()
  • if it is not a CM lane, schedule UCT EP to be discarded

@dmitrygx dmitrygx force-pushed the topic/ucp/do_not_discard_cm_lane branch from bfb2958 to 804c72e Compare November 5, 2020 15:48
@dmitrygx
Copy link
Member Author

dmitrygx commented Nov 5, 2020

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx
Copy link
Member Author

dmitrygx commented Nov 5, 2020

The failure is unrelated
Similar to #5840

@dmitrygx
Copy link
Member Author

dmitrygx commented Nov 6, 2020

@brminich @evgeny-leksikov @alinask could you review pls?

@dmitrygx
Copy link
Member Author

dmitrygx commented Nov 6, 2020

@yosefe could you review pls?

src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
@dmitrygx dmitrygx force-pushed the topic/ucp/do_not_discard_cm_lane branch from b3cd7e8 to 45d4775 Compare November 7, 2020 14:10
@yosefe
Copy link
Contributor

yosefe commented Nov 7, 2020

can this be related?

2020-11-07T14:45:30.4314575Z [     INFO ] Testing :::49294 Interface: eno1
2020-11-07T14:45:40.4414356Z /scrap/azure/agent-01/AZP_WORKSPACE/2/s/contrib/../test/gtest/uct/ib/test_sockaddr.cc:448: Failure
2020-11-07T14:45:40.4416016Z Value of: m_state & TEST_STATE_CONNECT_REQUESTED
2020-11-07T14:45:40.4417302Z   Actual: false
2020-11-07T14:45:40.4417884Z Expected: true
2020-11-07T14:45:49.7228411Z [  FAILED  ] sockaddr/test_uct_cm_sockaddr_multiple_cms.server_switch_cm/9, where GetParam() = sockaddr/eno1 (19299 ms)
2020-11-07T14:45:49.7229324Z [ RUN      ] sockaddr/test_uct_cm_sockaddr_multiple_cms.server_switch_cm/5 <sockaddr/ens4f1>

@dmitrygx
Copy link
Member Author

dmitrygx commented Nov 7, 2020

can this be related?

@yosefe thank you for pointing me out
Will investigate the failure

@dmitrygx
Copy link
Member Author

dmitrygx commented Nov 8, 2020

can this be related?

@yosefe thank you for pointing me out
Will investigate the failure

@yosefe I tried to reproduce this problem for a day. It did 1522554 iterations, but no luck to reproduce

@yosefe
Copy link
Contributor

yosefe commented Nov 8, 2020

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

all/test_ucp_sockaddr.concurrent_disconnect/6 corruption
5 participants