ValueError: rank 1 not found in topology #2
Comments
Hi WUDAO, could you provide your experiment setup? For example, the parameters used in pretrain_gpt_distributed.sh, how many machines you use, and how many GPUs per machine?
Hi, the pretrain_gpt_distributed.sh is set up as follows:
#!/bin/bash
# Runs the "345M" parameter model
DATA_PATH='/data/wang/models/Sailing/examples/gpt2'
export WORKER_0_HOST=localhost MASTER_PORT=6002
GPUS_PER_NODE=$GPU_PER_WORKER
NNODES=$NUM_WORKER
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
base_dir=$(cd
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
ds_config='{
python3 -m torch.distributed.launch $DISTRIBUTED_ARGS
Hi WuDao, sorry, I am not able to reproduce your issue with the provided settings. I suggest the following two steps to rule out some possible issues:
Thank you very much for your response!
Hi WuDao, could you paste your full log here? Please paste it as text rather than as screenshots, so that I can search it.
When I run "bash examples/gpt/pretrain_gpt_distributed.sh", it reports the following information:
[image: screenshot]
and then reports this error:
[image: screenshot]
[image: screenshot]
[image: screenshot]
Following the errors, the problem seems to be located in topology.py, line 43, because when I print the variable self.mapping at topology.py, line 131, it is empty.
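To illustrate why an empty mapping would produce exactly this message, here is a minimal sketch. It is not the project's topology.py: the class ToyTopology, its axes/dims arguments, and the axis names "pipe" and "data" are all hypothetical, and it only assumes that ranks are looked up in a dictionary built from the grid dimensions. Under that assumption, one way the mapping could end up empty is a dimension of size zero (for example, from an integer division that rounds down to 0), but this has not been confirmed as the cause here.

```python
from itertools import product


class ToyTopology:
    """Hypothetical stand-in for a rank-to-coordinate topology (not the project's class)."""

    def __init__(self, axes, dims):
        self.axes = list(axes)
        self.dims = list(dims)
        # Enumerate every grid coordinate in row-major order and assign ranks 0..N-1.
        # If any dimension is 0, product() yields nothing and the mapping stays empty.
        self.mapping = {
            coord: rank
            for rank, coord in enumerate(product(*(range(d) for d in self.dims)))
        }

    def get_coord(self, rank):
        # Linear search for the coordinate that owns this rank.
        for coord, idx in self.mapping.items():
            if idx == rank:
                return dict(zip(self.axes, coord))
        raise ValueError(f"rank {rank} not found in topology.")


# A 2x2 grid has ranks 0..3, so rank 1 resolves normally.
healthy = ToyTopology(axes=["pipe", "data"], dims=[2, 2])
print(healthy.get_coord(1))

# Assumed failure mode: one dimension is 0, so no coordinates exist at all.
degenerate = ToyTopology(axes=["pipe", "data"], dims=[0, 2])
print(degenerate.mapping)
try:
    degenerate.get_coord(1)
except ValueError as err:
    print(err)
```

If the real code behaves similarly, printing the dimensions passed to the topology constructor, next to self.mapping, should show whether one of them is zero.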