ValueError: rank 1 not found in topology #2

Open
BAAI-WuDao opened this issue Feb 25, 2022 · 5 comments


BAAI-WuDao commented Feb 25, 2022

When I run bash `examples/gpt/pretrain_gpt_distributed.sh`, it prints the following startup information:

[screenshot of startup output]

and then reports this error:

[screenshots of the error traceback]

Following the traceback, the problem seems to be located in topology.py, line 43:

  for global_rank, coord in enumerate(cartesian_product(*ranges)):
      key = {axis: coord[self.axes.index(axis)] for axis in self.axes}
      key = self.ProcessCoord(**key)
      # for example, {ProcessCoord(row=0, col=1) : 1}
      self.mapping[key] = global_rank

When I print the variable self.mapping at topology.py, line 131, it is empty.
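
For reference, here is a minimal standalone sketch (my own illustration, not the repo's code; the class name and axis names are made up) of how that loop builds the mapping, and why a rank lookup fails when the topology describes fewer processes than were actually launched:

    # Toy reconstruction of the mapping-building loop in topology.py.
    from collections import namedtuple
    from itertools import product as cartesian_product

    class ToyTopology:
        def __init__(self, axes, dims):
            self.axes = axes                    # e.g. ['pipe', 'model', 'data']
            self.dims = dims                    # e.g. [2, 2, 1]
            self.ProcessCoord = namedtuple('ProcessCoord', axes)
            self.mapping = {}
            ranges = [range(d) for d in dims]
            for global_rank, coord in enumerate(cartesian_product(*ranges)):
                key = {axis: coord[self.axes.index(axis)] for axis in self.axes}
                key = self.ProcessCoord(**key)
                # for example, {ProcessCoord(pipe=0, model=1, data=0): 1}
                self.mapping[key] = global_rank

        def get_coord(self, rank):
            for coord, r in self.mapping.items():
                if r == rank:
                    return coord
            raise ValueError(f'rank {rank} not found in topology')

    topo = ToyTopology(['pipe', 'model', 'data'], [2, 2, 1])
    print(len(topo.mapping))   # 4 entries, so ranks 0..3 can be looked up
    print(topo.get_coord(1))   # ProcessCoord(pipe=0, model=1, data=0)
    # If any dimension were 0, the cartesian product would be empty, the mapping
    # would stay empty, and get_coord(1) would raise exactly
    # "ValueError: rank 1 not found in topology".

In other words, an empty self.mapping in this toy version means the constructor saw a zero-sized dimension, so it may be worth checking how the topology dimensions are derived from the launch parameters rather than the loop itself.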

BAAI-WuDao changed the title from "Bugs: ValueError: rank 1 not found in topology" to "ValueError: rank 1 not found in topology" on Feb 25, 2022
@gongwei-130
Collaborator

Hi WuDao, could you provide your experiment setup? For example, the parameters used in pretrain_gpt_distributed.sh, how many machines, and how many GPUs per machine?

@BAAI-WuDao
Author

Hi, pretrain_gpt_distributed.sh is set up as follows:

#! /bin/bash

# Runs the "345M" parameter model

DATA_PATH='/data/wang/models/Sailing/examples/gpt2'
CHECKPOINT_PATH='./'

export WORKER_0_HOST=localhost
export WORKER_0_PORT=6000
export NUM_WORKER=1
export WORKER_RANK=0
export GPU_PER_WORKER=4

MASTER_PORT=6002
MASTER_ADDR=$WORKER_0_HOST

GPUS_PER_NODE=$GPU_PER_WORKER

NNODES=$NUM_WORKER
NODE_RANK=$WORKER_RANK

WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

base_dir=$(cd $(dirname $0); pwd)
echo base_dir $base_dir

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

ds_config='{
  "train_micro_batch_size_per_gpu": 16,
  "train_batch_size": 16,
  "gradient_accumulation_steps": 2,
  "steps_per_print": 1,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "wall_clock_breakdown": true
}'

python3 -m torch.distributed.launch $DISTRIBUTED_ARGS \
  --no_python --use_env python3 \
  ${base_dir}/pretrain_gpt2.py \
  --model-parallel-size 2 \
  --num-stages 2 \
  --num-layers 24 \
  --hidden-size 1024 \
  --train-batch-size 64 \
  --gradient_accumulation_steps 16 \
  --num-attention-heads 16 \
  --batch-size 4 \
  --seq-length 1024 \
  --max-position-embeddings 1024 \
  --train-iters 500000 \
  --lr-decay-iters 450000 \
  --save $CHECKPOINT_PATH \
  --load $CHECKPOINT_PATH \
  --data-path $DATA_PATH/my-gpt2_text_document \
  --vocab-file $DATA_PATH/gpt2-vocab.json \
  --merge-file $DATA_PATH/gpt2-merges.txt \
  --data-impl mmap \
  --split 949,50,1 \
  --distributed-backend nccl \
  --lr 0.00025 \
  --lr-decay-style cosine \
  --min-lr 1.0e-5 \
  --weight-decay 1e-2 \
  --clip-grad 1.0 \
  --warmup .02 \
  --log-interval 1 \
  --save-interval 100000 \
  --vocab-size 145608 \
  --DDP-impl torch \
  --eod-mask-loss \
  --deepspeed-pipeline \
  --deepspeed \
  --config_param "$ds_config" \
  --fp16 \
  --partition_method "type:ParallelTransformerLayerPiped" \
  $@
set +x
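
As a quick sanity check (my own arithmetic, not something stated in the thread, and assuming the usual DeepSpeed-style pipe x model x data decomposition): the launcher starts GPUS_PER_NODE * NNODES = 4 processes, while --model-parallel-size 2 and --num-stages 2 account for 2 x 2 = 4 slots per data-parallel replica, so data parallelism should come out to 1. A small sketch of that check, with placeholder variable names:

    # Hedged sanity check; variable names are placeholders, not the repo's.
    world_size     = 4 * 1    # GPUS_PER_NODE * NNODES from the script above
    model_parallel = 2        # --model-parallel-size
    pipe_stages    = 2        # --num-stages

    slots = model_parallel * pipe_stages
    assert world_size % slots == 0, "world size must be divisible by model * pipe"
    data_parallel = world_size // slots
    print(data_parallel)      # 1 for this configuration
    # If the processes that actually start disagree with these numbers (e.g. fewer
    # visible GPUs than GPUS_PER_NODE, or an env var overriding nproc_per_node),
    # the topology ends up with fewer slots than ranks and the lookup for rank 1 fails.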

@gongwei-130
Collaborator

gongwei-130 commented Mar 1, 2022

Hi WuDao,

Sorry, I am not able to reproduce your issue with the provided settings. I suggest the following two steps to rule out some possible issues:

  1. Add the following to your pretrain_gpt_distributed.sh:
       export WORKER_0_HOST=127.0.0.1
       export DMLC_NODE_HOST=127.0.0.1
       export BYTEPS_WITH_UCX=0
       export DMLC_ENABLE_UCX=0
       export DMLC_ENABLE_RDMA=0
  2. Try setting up the environment with the following Dockerfile; I will push this Dockerfile into the repo soon.
FROM nvcr.io/nvidia/pytorch:21.05-py3 
RUN pip3 install boto3 regex tensorboardX==1.8 wheel pybind11 ninja psutil pyprof
RUN apt-get -yq autoremove --purge ibverbs-providers
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
     libibverbs-dev=28.0-1ubuntu1 libibverbs1=28.0-1ubuntu1

RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
        cmake \
        libopenmpi-dev \
        openmpi-bin \
        openssh-client \
        openssh-server \
        ibverbs-providers \
        libibverbs-dev=28.0-1ubuntu1 \
        librdmacm-dev \
        vim \
        iputils-ping \
        llvm-10-dev \
        iproute2 \
        unzip

RUN ln -s /usr/bin/aclocal-1.16 /usr/local/bin/aclocal-1.14
RUN ln -s /usr/bin/automake /usr/local/bin/automake-1.14

ENV LD_LIBRARY_PATH "/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
ENV BYTEPS_WITH_UCX 0

RUN pip3 install https://giant-model-package.tos-cn-beijing.volces.com/byteps-0.7.2-cp38-cp38-linux_x86_64.whl
WORKDIR /root


@BAAI-WuDao
Author

Thank you very much for your response!
I followed the steps, but the bug still exists.
Can you give an example of how I should set the value of `mapping` at line 136 in topology.py?

@gongwei-130
Collaborator

> Thank you very much for your response! I followed the steps, but the bug still exists. Can you give an example of how I should set the value of `mapping` at line 136 in topology.py?

Hi WuDao,

Could you paste your full log here as text rather than a screenshot, so that I can search it?
