ValueError: rank 1 not found in topology #2

Open
BAAI-WuDao opened this issue Feb 25, 2022 · 5 comments


BAAI-WuDao commented Feb 25, 2022

When I run bash `examples/gpt/pretrain_gpt_distributed.sh`, it prints the following startup information:

[screenshot of startup output]

and then reports this error:

[screenshots of the error traceback]

Following the traceback, the problem seems to be located in topology.py, line 43:

  for global_rank, coord in enumerate(cartesian_product(*ranges)):
      key = {axis: coord[self.axes.index(axis)] for axis in self.axes}
      key = self.ProcessCoord(**key)
      # for example, {ProcessCoord(row=0, col=1) : 1}
      self.mapping[key] = global_rank

When I print the variable self.mapping at topology.py, line 131, it is empty.
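
For reference, here is a minimal standalone sketch (my own illustration, not the repo's code; the class name and axis names are made up) of how that loop builds the mapping, and why a rank lookup fails when the topology describes fewer processes than were actually launched:

    # Toy reconstruction of the mapping-building loop in topology.py.
    from collections import namedtuple
    from itertools import product as cartesian_product

    class ToyTopology:
        def __init__(self, axes, dims):
            self.axes = axes                    # e.g. ['pipe', 'model', 'data']
            self.dims = dims                    # e.g. [2, 2, 1]
            self.ProcessCoord = namedtuple('ProcessCoord', axes)
            self.mapping = {}
            ranges = [range(d) for d in dims]
            for global_rank, coord in enumerate(cartesian_product(*ranges)):
                key = {axis: coord[self.axes.index(axis)] for axis in self.axes}
                key = self.ProcessCoord(**key)
                # for example, {ProcessCoord(pipe=0, model=1, data=0): 1}
                self.mapping[key] = global_rank

        def get_coord(self, rank):
            for coord, r in self.mapping.items():
                if r == rank:
                    return coord
            raise ValueError(f'rank {rank} not found in topology')

    topo = ToyTopology(['pipe', 'model', 'data'], [2, 2, 1])
    print(len(topo.mapping))   # 4 entries, so ranks 0..3 can be looked up
    print(topo.get_coord(1))   # ProcessCoord(pipe=0, model=1, data=0)
    # If any dimension were 0, the cartesian product would be empty, the mapping
    # would stay empty, and get_coord(1) would raise exactly
    # "ValueError: rank 1 not found in topology".

In other words, an empty self.mapping in this toy version means the constructor saw a zero-sized dimension, so it may be worth checking how the topology dimensions are derived from the launch parameters rather than the loop itself.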

BAAI-WuDao changed the title from "Bugs: ValueError: rank 1 not found in topology" to "ValueError: rank 1 not found in topology" on Feb 25, 2022
@gongwei-130
Collaborator

Hi WuDao, could you provide your experiment setup? For example, the parameters used in pretrain_gpt_distributed.sh, how many machines, and how many GPUs per machine?

@BAAI-WuDao
Author

Hi, pretrain_gpt_distributed.sh is set up as follows:

#! /bin/bash

# Runs the "345M" parameter model

DATA_PATH='/data/wang/models/Sailing/examples/gpt2'
CHECKPOINT_PATH='./'

export WORKER_0_HOST=localhost
export WORKER_0_PORT=6000
export NUM_WORKER=1
export WORKER_RANK=0
export GPU_PER_WORKER=4

MASTER_PORT=6002
MASTER_ADDR=$WORKER_0_HOST

GPUS_PER_NODE=$GPU_PER_WORKER

NNODES=$NUM_WORKER
NODE_RANK=$WORKER_RANK

WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

base_dir=$(cd $(dirname $0); pwd)
echo base_dir $base_dir

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

ds_config='{
  "train_micro_batch_size_per_gpu": 16,
  "train_batch_size": 16,
  "gradient_accumulation_steps": 2,
  "steps_per_print": 1,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "wall_clock_breakdown": true
}'

python3 -m torch.distributed.launch $DISTRIBUTED_ARGS \
  --no_python --use_env python3 \
  ${base_dir}/pretrain_gpt2.py \
  --model-parallel-size 2 \
  --num-stages 2 \
  --num-layers 24 \
  --hidden-size 1024 \
  --train-batch-size 64 \
  --gradient_accumulation_steps 16 \
  --num-attention-heads 16 \
  --batch-size 4 \
  --seq-length 1024 \
  --max-position-embeddings 1024 \
  --train-iters 500000 \
  --lr-decay-iters 450000 \
  --save $CHECKPOINT_PATH \
  --load $CHECKPOINT_PATH \
  --data-path $DATA_PATH/my-gpt2_text_document \
  --vocab-file $DATA_PATH/gpt2-vocab.json \
  --merge-file $DATA_PATH/gpt2-merges.txt \
  --data-impl mmap \
  --split 949,50,1 \
  --distributed-backend nccl \
  --lr 0.00025 \
  --lr-decay-style cosine \
  --min-lr 1.0e-5 \
  --weight-decay 1e-2 \
  --clip-grad 1.0 \
  --warmup .02 \
  --log-interval 1 \
  --save-interval 100000 \
  --vocab-size 145608 \
  --DDP-impl torch \
  --eod-mask-loss \
  --deepspeed-pipeline \
  --deepspeed \
  --config_param "$ds_config" \
  --fp16 \
  --partition_method "type:ParallelTransformerLayerPiped" \
  $@
set +x
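
As a quick sanity check (my own arithmetic, not something stated in the thread, and assuming the usual DeepSpeed-style pipe x model x data decomposition): the launcher starts GPUS_PER_NODE * NNODES = 4 processes, while --model-parallel-size 2 and --num-stages 2 account for 2 x 2 = 4 slots per data-parallel replica, so data parallelism should come out to 1. A small sketch of that check, with placeholder variable names:

    # Hedged sanity check; variable names are placeholders, not the repo's.
    world_size     = 4 * 1    # GPUS_PER_NODE * NNODES from the script above
    model_parallel = 2        # --model-parallel-size
    pipe_stages    = 2        # --num-stages

    slots = model_parallel * pipe_stages
    assert world_size % slots == 0, "world size must be divisible by model * pipe"
    data_parallel = world_size // slots
    print(data_parallel)      # 1 for this configuration
    # If the processes that actually start disagree with these numbers (e.g. fewer
    # visible GPUs than GPUS_PER_NODE, or an env var overriding nproc_per_node),
    # the topology ends up with fewer slots than ranks and the lookup for rank 1 fails.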

@gongwei-130
Collaborator

gongwei-130 commented Mar 1, 2022

Hi WuDao,

Sorry, I am not able to reproduce your issue with the provided settings. I suggest the following two steps to rule out some possible issues:

  1. Add the following to your pretrain_gpt_distributed.sh:
       export WORKER_0_HOST=127.0.0.1
       export DMLC_NODE_HOST=127.0.0.1
       export BYTEPS_WITH_UCX=0
       export DMLC_ENABLE_UCX=0
       export DMLC_ENABLE_RDMA=0
  2. Try setting up the environment with the following Dockerfile; I will push this Dockerfile into the repo soon.
FROM nvcr.io/nvidia/pytorch:21.05-py3 
RUN pip3 install boto3 regex tensorboardX==1.8 wheel pybind11 ninja psutil pyprof
RUN apt-get -yq autoremove --purge ibverbs-providers
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
     libibverbs-dev=28.0-1ubuntu1 libibverbs1=28.0-1ubuntu1

RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
        cmake \
        libopenmpi-dev \
        openmpi-bin \
        openssh-client \
        openssh-server \
        ibverbs-providers \
        libibverbs-dev=28.0-1ubuntu1 \
        librdmacm-dev \
        vim \
        iputils-ping \
        llvm-10-dev \
        iproute2 \
        unzip

RUN ln -s /usr/bin/aclocal-1.16 /usr/local/bin/aclocal-1.14
RUN ln -s /usr/bin/automake /usr/local/bin/automake-1.14

ENV LD_LIBRARY_PATH "/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
ENV BYTEPS_WITH_UCX 0

RUN pip3 install https://giant-model-package.tos-cn-beijing.volces.com/byteps-0.7.2-cp38-cp38-linux_x86_64.whl
WORKDIR /root


@BAAI-WuDao
Author

Thank you very much for your response!
I followed the steps, but the bug still exists.
Can you give an example of how I should set the value of `mapping` at line 136 in topology.py?

@gongwei-130
Collaborator

> Thank you very much for your response! I followed the steps, but the bug still exists. Can you give an example of how I should set the value of `mapping` at line 136 in topology.py?

Hi WuDao,

Could you paste your full log here as text rather than a screenshot, so that I can search it?
