
CUDA Error when run with multiple GPUs #454

Closed
YeDeming opened this issue Sep 30, 2020 · 4 comments

YeDeming commented Sep 30, 2020

Thanks for open-sourcing this great code!

I am trying to load a HuggingFace checkpoint and run the BingBertSquad example with the DeepSpeed transformer kernel.

The script:

#!/bin/bash

# $1: number of GPUs
# $2: Model File Address
# $3: BertSquad Data Directory Address
# $4: Output Directory Address

NGPU_PER_NODE=$1
MODEL_FILE=$2
SQUAD_DIR=$3
OUTPUT_DIR=$4
LR=${5:-0.00003}
SEED=${6:-12345}
MASTER_PORT=${7:-29500}
DROPOUT=${8:-0.1}
echo "lr is ${LR}"
echo "seed is $SEED"
echo "master port is $MASTER_PORT"
echo "dropout is ${DROPOUT}"

# Force deepspeed to run with only local node
NUM_NODES=1
HOSTFILE=/dev/null

NGPU=$((NGPU_PER_NODE*NUM_NODES))
EFFECTIVE_BATCH_SIZE=24
MAX_GPU_BATCH_SIZE=12
PER_GPU_BATCH_SIZE=$((EFFECTIVE_BATCH_SIZE/NGPU))
if [[ $PER_GPU_BATCH_SIZE -lt $MAX_GPU_BATCH_SIZE ]]; then
       GRAD_ACCUM_STEPS=1
else
       GRAD_ACCUM_STEPS=$((PER_GPU_BATCH_SIZE/MAX_GPU_BATCH_SIZE))
fi
JOB_NAME="deepspeed_${NGPU}GPUs_${EFFECTIVE_BATCH_SIZE}batch_size"
config_json=deepspeed_bsz24_config.json
run_cmd="deepspeed --num_nodes ${NUM_NODES} --num_gpus ${NGPU_PER_NODE} \
       --master_port=${MASTER_PORT} \
       --hostfile ${HOSTFILE} \
       nvidia_run_squad_deepspeed.py \
       --bert_model ../../bert-base-uncased \
       --do_train \
       --do_lower_case \
       --predict_batch_size 12 \
       --do_predict \
       --train_file $SQUAD_DIR/train-v1.1.json \
       --predict_file $SQUAD_DIR/dev-v1.1.json \
       --train_batch_size $PER_GPU_BATCH_SIZE \
       --learning_rate ${LR} \
       --num_train_epochs 2.0 \
       --max_seq_length 384 \
       --doc_stride 128 \
       --output_dir $OUTPUT_DIR \
       --job_name ${JOB_NAME} \
       --gradient_accumulation_steps ${GRAD_ACCUM_STEPS} \
       --deepspeed \
       --deepspeed_config ${config_json} \
       --dropout ${DROPOUT} \
       --model_file $MODEL_FILE \
       --seed ${SEED} \
       --ckpt_type HF \
       --origin_bert_config_file ../../bert-base-uncased/config.json \
       --deepspeed_transformer_kernel \
       --fp16
       "

echo ${run_cmd}
eval ${run_cmd}

I ran it in two environments:

(1) 1080 Ti with the provided Docker image
1 GPU with fp32 --> success
1 GPU with fp16 --> NaN
2 GPUs with fp32 --> error
(2) TITAN RTX, installed manually via install.sh
1 GPU with fp16 --> success
2 GPUs with fp16 --> error

The error on the RTX server is shown below (it is similar to the error on the 1080 Ti server):

!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)                                    
!!!! kernel execution error. (m: 2304, n: 4608, k: 768, error: 13)                                   
!!!! kernel execution error. (m: 384, n: 384, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 384, k: 384, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)
!!!! kernel execution error. (m: 2304, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 384, n: 384, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 384, k: 384, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1147, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 998, in main
    start_positions, end_positions)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 743, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1488, in forward
    output_all_encoded_layers=False)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 937, in forward
    checkpoint_activations=checkpoint_activations)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 572, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/ops/transformer/transformer.py", line 560, in forward
    self.config)
  File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/ops/transformer/transformer.py", line 213, in forward
    config.gelu_checkpoint)
RuntimeError: CUDA error: an illegal memory access was encountered
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f8d765e71e2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f8d76835f92 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f8d765d59cd in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x540ae2 (0x7f8dc216dae2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x540b86 (0x7f8dc216db86 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: /home/yedeming/.local/bin/python3() [0x54f226]
frame #6: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #7: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #8: /home/yedeming/.local/bin/python3() [0x572df1]
frame #9: /home/yedeming/.local/bin/python3() [0x54f202]
frame #10: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #11: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #12: /home/yedeming/.local/bin/python3() [0x572e67]
frame #13: /home/yedeming/.local/bin/python3() [0x54f202]
frame #14: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #15: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #16: /home/yedeming/.local/bin/python3() [0x572e67]
frame #17: /home/yedeming/.local/bin/python3() [0x54f202]
frame #18: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #19: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #20: /home/yedeming/.local/bin/python3() [0x572e67]
frame #21: /home/yedeming/.local/bin/python3() [0x54f202]
frame #22: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #23: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #24: /home/yedeming/.local/bin/python3() [0x572e67]
frame #25: /home/yedeming/.local/bin/python3() [0x54f202]
frame #26: /home/yedeming/.local/bin/python3() [0x588a98]
frame #27: /home/yedeming/.local/bin/python3() [0x5ad558]
frame #28: /home/yedeming/.local/bin/python3() [0x5ad56e]
frame #29: /home/yedeming/.local/bin/python3() [0x56b636]
frame #30: PyDict_SetItemString + 0x153 (0x570da3 in /home/yedeming/.local/bin/python3)
frame #31: PyImport_Cleanup + 0x76 (0x4f2ee6 in /home/yedeming/.local/bin/python3)
frame #32: Py_FinalizeEx + 0x5e (0x637f7e in /home/yedeming/.local/bin/python3)
frame #33: Py_Main + 0x395 (0x638fe5 in /home/yedeming/.local/bin/python3)
frame #34: main + 0xe0 (0x4b0dc0 in /home/yedeming/.local/bin/python3)
frame #35: __libc_start_main + 0xe7 (0x7f8ddfb0db97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5b26fa in /home/yedeming/.local/bin/python3)

Looking forward to your reply!

Best wishes,
Deming Ye

YeDeming commented Oct 1, 2020

I found that passing "args.local_rank if hasattr(args, 'local_rank') else -1" as the local_rank argument to DeepSpeedTransformerConfig in BingBertSquad solves this problem.
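
For reference, a minimal sketch of the change (the surrounding keyword values are placeholders, not necessarily the exact ones used in nvidia_modeling.py; only the local_rank line is the actual fix):

from deepspeed.ops.transformer import DeepSpeedTransformerConfig

# Sketch only: the batch/model values below are illustrative. The essential
# change is forwarding the launcher-assigned local_rank instead of letting it
# default to -1, which appears to make the kernel target the wrong device in
# the multi-GPU case.
cuda_config = DeepSpeedTransformerConfig(
    batch_size=args.train_batch_size,
    max_seq_length=args.max_seq_length,
    hidden_size=bert_config.hidden_size,
    heads=bert_config.num_attention_heads,
    attn_dropout_ratio=bert_config.attention_probs_dropout_prob,
    hidden_dropout_ratio=bert_config.hidden_dropout_prob,
    num_hidden_layers=bert_config.num_hidden_layers,
    initializer_range=bert_config.initializer_range,
    # the missing argument:
    local_rank=args.local_rank if hasattr(args, 'local_rank') else -1,
    seed=args.seed,
    fp16=args.fp16,
    pre_layer_norm=False)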

tjruwase commented Oct 1, 2020

@YeDeming thanks for using DeepSpeed. Sorry about this issue, but I am glad you found a resolution.

From your solution, it seems like args.local_rank == -1 in the 2GPU case. Can you please confirm that by logging the value of args.local_rank at startup?
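
For example (a one-line sketch; "args" here is whatever comes out of argument parsing in nvidia_run_squad_deepspeed.py):

# Print the rank each process sees right after parsing arguments.
print('local_rank at startup:', getattr(args, 'local_rank', 'not set'), flush=True)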

YeDeming commented Oct 1, 2020

It seems that you forgot to add "local_rank=..." in

https://github.com/microsoft/DeepSpeedExamples/blob/ba63ad0fa861d28b3b33bc2c20f702647403e258/BingBertSquad/turing/nvidia_modeling.py#L510-L522

The problem also appears in nvidia_modelingpreln.py.

RezaYazdaniAminabadi (Contributor) commented

Hi @YeDeming

Thanks for pointing this out. Yes, you are right, we need this argument passed to the kernel. It was not needed previously because the device was set before creating the model, but after refactoring the code we forgot to pass it to the kernel. I have made a PR to fix this: deepspeedai/DeepSpeedExamples#58

Thanks.
Reza

YeDeming closed this as completed Oct 7, 2020