Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discard unused uct_ep when the pre-created uct_ep more than needed #223

Merged
merged 1 commit into from
Apr 6, 2022

Conversation

changchengx
Copy link

@changchengx changchengx commented Nov 16, 2021

What

keep NUM_PATHS compatible between server (e.g. UCX_IB_NUM_PATHS=1) and client (e.g. UCX_IB_NUM_PATHS=2)
This PR is mainly ported from:

  1. UCP/CORE: Implement flush+destroy for UCT EPs on UCP Worker openucx/ucx#5608
  2. UCP/WIREUP: Implement support for SHM/multi-lane in case of CM openucx/ucx#5696

src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
@changchengx changchengx force-pushed the smpath_int3 branch 3 times, most recently from 8a10332 to 95dc3d9 Compare November 19, 2021 03:50
@changchengx
Copy link
Author

changchengx commented Nov 19, 2021

  1. UCP: fail if the server cannot connect to all eps addrs from the client #31 is reverted
  2. There's no uct_ep resource leak as being reported in https://redmine.mellanox.com/issues/2284793 when verifying it on local two server.
  3. verify path compatible function on local two serves & every server has one NIC with two ports(HCA MT4125, 40Gbps x 2)
    image

@changchengx changchengx changed the title [WIP] implement NUM_PATHS compatible discard unused uct_ep when the pre-created uct_ep more than needed Nov 19, 2021
@changchengx
Copy link
Author

@yosefe
Merge all the ucp_ep_config_lanes_intersect related functions in 422a29c

@yosefe
Copy link
Owner

yosefe commented Jan 12, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@changchengx
Copy link
Author

changchengx commented Jan 13, 2022

How to run the below io_demo Test tag match on CX4 test case locally to check the problem?
https://dev.azure.com/ucfconsort/ucx-int3/_build/results?buildId=35022&view=logs&j=f6213e06-5fba-58e7-a041-fa7d17b490d4&t=0687ad30-25a0-56ac-90ea-d6a36c068c41&l=30

2022-01-12T11:54:39.4062959Z + python /hpc/noarch/git_projects/hpc-mtt-conf/scripts/iodemo_analyzer.py -d drop_35022/tag_cx4_rc --duration 600 -t 3 --allow_list integration3
2022-01-12T11:54:39.4523334Z ==============================
2022-01-12T11:54:39.4524537Z Client Errors:
2022-01-12T11:54:39.4525117Z ==============================
2022-01-12T11:54:39.4526337Z Log: drop_35022/tag_cx4_rc/iodemo_swx-rain02_client_00.log
2022-01-12T11:54:39.4527104Z Error iodemo analyzer: [1641988008.295210] [DEMO] timeout waiting for 1 replies on 1 connections
2022-01-12T11:54:39.4527679Z 
2022-01-12T11:54:39.4528179Z ==============================
2022-01-12T11:54:39.4528814Z Server Errors:
2022-01-12T11:54:39.4529401Z ==============================
2022-01-12T11:54:39.4530206Z Log: drop_35022/tag_cx4_rc/iodemo_swx-rain01_server_00.log
2022-01-12T11:54:39.4531328Z Error iodemo analyzer: [1641987928.295231] [swx-rain01:438277:0]         wireup.c:318  UCX  ERROR ep 0x7f2042f71000: no remote ep address for lane[0]->remote_lane[0]
2022-01-12T11:54:39.4532016Z 
2022-01-12T11:54:39.4633260Z ##[error]Bash exited with code '1'.
2022-01-12T11:54:39.4648668Z ##[section]Finishing: Analyze for tag_cx4_rc

@yosefe
Copy link
Owner

yosefe commented Jan 13, 2022

@changchengx
Copy link
Author

io_demo Test tag match on CX4

1. reserve 2 swx-ucx servers: e.g. swx-ucx03 & swx-ucx04

2. Login in swx-ucx03

3. Build ucx/integration3 with PR in swx-ucx03

4. Start network corrupter in background
$ #Confirm whether it works swx-ucx03 & swx-ucx04
$ #Start network corrupter
[swx-ucx03]$ buildlib/az-network-corrupter.sh initial_delay=20 cycles=100 downtime=5 uptime=180 interface=bond0

5. Run below command on swx-ucx03 server
[swx-ucx03]$ export UCX_NET_DEVICES=mlx5_bond_0:1
[swx-ucx03]$ export UCX_TLS=rc_x
[swx-ucx03]$ export UCX_RNDV_THRESH=4k
[swx-ucx03]$ export LD_LIBRARY_PATH=drop_35176/install/lib:
[swx-ucx03]$ LD_LIBRARY_PATH=drop_35176/install/lib:
[swx-ucx03]$ test/apps/iodemo/run_io_demo.sh -H swx-ucx03,swx-ucx04 --tasks-per-node 1 --duration 600 -v --num-clients 1 --num-servers 1 --map-by slot --log-dir drop_35176/tag_cx4_rc -i bond0 drop_35176/install/bin/io_demo -d 512:524288 -P 2 -o read,write -i 0 -w 16 -t 60

6. Run below command on swx-ucx03 server
[swx-ucx03] python /hpc/noarch/git_projects/hpc-mtt-conf/scripts/iodemo_analyzer.py -d drop_35176/tag_cx4_rc --duration 600 -t 3 --allow_list integration3

7. Kill the corrupter

8. restore interface
[swx-ucx03]$ drop_35176/buildlib/az-network-corrupter.sh reset=yes interface=bond0

take care of the direcotry

drop_35176/install/bin/io_demo
--log-dir drop_35176/tag_cx4_rc
-d drop_35176/tag_cx4_rc

@yosefe
Copy link
Owner

yosefe commented Jan 23, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@changchengx
Copy link
Author

@yosefe Is it possible to review this PR before fixing the CI self problem?

@changchengx
Copy link
Author

CI has been passed.

@yosefe
Copy link
Owner

yosefe commented Feb 10, 2022

@dmitrygx can you pls review?

Copy link

@dmitrygx dmitrygx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@changchengx did you use the latest code from master?
I see that some bugs came from the initial implementation.
So, I suggest you align this PR with the master branch.

src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.h Outdated Show resolved Hide resolved
@changchengx
Copy link
Author

@dmitrygx I've followed your 1st round CR comments to refine the implmentation. Please check it.

Copy link

@dmitrygx dmitrygx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we add ucp_worker_discard_uct_ep_cleanup to cleanup discarding hash in ucp_worker_destroy?

@changchengx
Copy link
Author

shall we add ucp_worker_discard_uct_ep_cleanup to cleanup discarding hash in ucp_worker_destroy?

As we've talked offline, I'll add warning if the worker->discard_uct_ep_hash isn't empty to minimize the change when doing ucp_worker_destroy

@changchengx
Copy link
Author

@yosefe Dmitry has approved this PR and he's in Hackathon these days.

@changchengx
Copy link
Author

@yosefe Do you have any comments about this PR?

src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.h Show resolved Hide resolved
src/ucp/wireup/wireup_cm.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup_cm.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup_cm.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
Copy link
Owner

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add CI to test this?
this branch has minimal ci infrastructure to run io_demo

src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
@changchengx
Copy link
Author

can we add CI to test this? this branch has minimal ci infrastructure to run io_demo

@yosefe Do you mean to add test into io-demo.yaml or MTT?

@yosefe
Copy link
Owner

yosefe commented Mar 29, 2022

2 short tests in io-demo.yaml , without interference, with different num_paths (2->1 and 1->2)

@changchengx
Copy link
Author

@yosefe Does azure CI stop working? There's no CI triggered here.
I've added two test cases to verify path compatible function.

@yosefe
Copy link
Owner

yosefe commented Mar 30, 2022

@changchengx
Copy link
Author

Two test cases have been added.
Pass all the CI test cases.

@yosefe
Copy link
Owner

yosefe commented Apr 6, 2022

@changchengx pls squash

buildlib/pr/io_demo/io-demo.yml Outdated Show resolved Hide resolved
buildlib/pr/io_demo/io-demo.yml Outdated Show resolved Hide resolved
@yosefe yosefe merged commit 1510f76 into yosefe:integration3 Apr 6, 2022
@changchengx changchengx deleted the smpath_int3 branch April 6, 2022 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants