
CUDA Error when run with multiple GPUs #454

Closed
YeDeming opened this issue Sep 30, 2020 · 4 comments

YeDeming commented Sep 30, 2020

Thanks for open-sourcing this great code!

I am trying to load a HuggingFace checkpoint and run the BingBertSquad example with the DeepSpeed transformer kernel.

The script:

#!/bin/bash

# $1: number of GPUs
# $2: Model File Address
# $3: BertSquad Data Directory Address
# $4: Output Directory Address

NGPU_PER_NODE=$1
MODEL_FILE=$2
SQUAD_DIR=$3
OUTPUT_DIR=$4
LR=${5:-0.00003}
SEED=${6:-12345}
MASTER_PORT=${7:-29500}
DROPOUT=${8:-0.1}
echo "lr is ${LR}"
echo "seed is $SEED"
echo "master port is $MASTER_PORT"
echo "dropout is ${DROPOUT}"

# Force deepspeed to run with only local node
NUM_NODES=1
HOSTFILE=/dev/null

NGPU=$((NGPU_PER_NODE*NUM_NODES))
EFFECTIVE_BATCH_SIZE=24
MAX_GPU_BATCH_SIZE=12
PER_GPU_BATCH_SIZE=$((EFFECTIVE_BATCH_SIZE/NGPU))
if [[ $PER_GPU_BATCH_SIZE -lt $MAX_GPU_BATCH_SIZE ]]; then
       GRAD_ACCUM_STEPS=1
else
       GRAD_ACCUM_STEPS=$((PER_GPU_BATCH_SIZE/MAX_GPU_BATCH_SIZE))
fi
JOB_NAME="deepspeed_${NGPU}GPUs_${EFFECTIVE_BATCH_SIZE}batch_size"
config_json=deepspeed_bsz24_config.json
run_cmd="deepspeed --num_nodes ${NUM_NODES} --num_gpus ${NGPU_PER_NODE} \
       --master_port=${MASTER_PORT} \
       --hostfile ${HOSTFILE} \
       nvidia_run_squad_deepspeed.py \
       --bert_model ../../bert-base-uncased \
       --do_train \
       --do_lower_case \
       --predict_batch_size 12 \
       --do_predict \
       --train_file $SQUAD_DIR/train-v1.1.json \
       --predict_file $SQUAD_DIR/dev-v1.1.json \
       --train_batch_size $PER_GPU_BATCH_SIZE \
       --learning_rate ${LR} \
       --num_train_epochs 2.0 \
       --max_seq_length 384 \
       --doc_stride 128 \
       --output_dir $OUTPUT_DIR \
       --job_name ${JOB_NAME} \
       --gradient_accumulation_steps ${GRAD_ACCUM_STEPS} \
       --deepspeed \
       --deepspeed_config ${config_json} \
       --dropout ${DROPOUT} \
       --model_file $MODEL_FILE \
       --seed ${SEED} \
       --ckpt_type HF \
       --origin_bert_config_file ../../bert-base-uncased/config.json \
       --deepspeed_transformer_kernel \
       --fp16
       "

echo ${run_cmd}
eval ${run_cmd}

I ran it in two environments:

(1) 1080 Ti with the provided Docker image
1 GPU with fp32 --> success
1 GPU with fp16 --> NaN
2 GPUs with fp32 --> error
(2) TITAN RTX, installed manually via install.sh
1 GPU with fp16 --> success
2 GPUs with fp16 --> error

The error on the RTX server is shown below (it is similar to the error on the 1080 Ti server):

!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)                                    
!!!! kernel execution error. (m: 2304, n: 4608, k: 768, error: 13)                                   
!!!! kernel execution error. (m: 384, n: 384, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 384, k: 384, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)
!!!! kernel execution error. (m: 2304, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 384, n: 384, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 384, k: 384, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1147, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 998, in main
    start_positions, end_positions)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 743, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1488, in forward
    output_all_encoded_layers=False)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 937, in forward
    checkpoint_activations=checkpoint_activations)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 572, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/ops/transformer/transformer.py", line 560, in forward
    self.config)
  File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/ops/transformer/transformer.py", line 213, in forward
    config.gelu_checkpoint)
RuntimeError: CUDA error: an illegal memory access was encountered
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f8d765e71e2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f8d76835f92 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f8d765d59cd in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x540ae2 (0x7f8dc216dae2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x540b86 (0x7f8dc216db86 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: /home/yedeming/.local/bin/python3() [0x54f226]
frame #6: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #7: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #8: /home/yedeming/.local/bin/python3() [0x572df1]
frame #9: /home/yedeming/.local/bin/python3() [0x54f202]
frame #10: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #11: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #12: /home/yedeming/.local/bin/python3() [0x572e67]
frame #13: /home/yedeming/.local/bin/python3() [0x54f202]
frame #14: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #15: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #16: /home/yedeming/.local/bin/python3() [0x572e67]
frame #17: /home/yedeming/.local/bin/python3() [0x54f202]
frame #18: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #19: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #20: /home/yedeming/.local/bin/python3() [0x572e67]
frame #21: /home/yedeming/.local/bin/python3() [0x54f202]
frame #22: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #23: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #24: /home/yedeming/.local/bin/python3() [0x572e67]
frame #25: /home/yedeming/.local/bin/python3() [0x54f202]
frame #26: /home/yedeming/.local/bin/python3() [0x588a98]
frame #27: /home/yedeming/.local/bin/python3() [0x5ad558]
frame #28: /home/yedeming/.local/bin/python3() [0x5ad56e]
frame #29: /home/yedeming/.local/bin/python3() [0x56b636]
frame #30: PyDict_SetItemString + 0x153 (0x570da3 in /home/yedeming/.local/bin/python3)
frame #31: PyImport_Cleanup + 0x76 (0x4f2ee6 in /home/yedeming/.local/bin/python3)
frame #32: Py_FinalizeEx + 0x5e (0x637f7e in /home/yedeming/.local/bin/python3)
frame #33: Py_Main + 0x395 (0x638fe5 in /home/yedeming/.local/bin/python3)
frame #34: main + 0xe0 (0x4b0dc0 in /home/yedeming/.local/bin/python3)
frame #35: __libc_start_main + 0xe7 (0x7f8ddfb0db97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5b26fa in /home/yedeming/.local/bin/python3)

Looking forward to your reply!

Best wishes,
Deming Ye

YeDeming commented Oct 1, 2020

I found that passing "args.local_rank if hasattr(args, 'local_rank') else -1" as the local_rank argument to DeepSpeedTransformerConfig in BingBertSquad solves this problem.
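
For reference, a minimal sketch of the change (the surrounding keyword values are placeholders, not necessarily the exact ones used in nvidia_modeling.py; only the local_rank line is the actual fix):

from deepspeed.ops.transformer import DeepSpeedTransformerConfig

# Sketch only: the batch/model values below are illustrative. The essential
# change is forwarding the launcher-assigned local_rank instead of letting it
# default to -1, which appears to make the kernel target the wrong device in
# the multi-GPU case.
cuda_config = DeepSpeedTransformerConfig(
    batch_size=args.train_batch_size,
    max_seq_length=args.max_seq_length,
    hidden_size=bert_config.hidden_size,
    heads=bert_config.num_attention_heads,
    attn_dropout_ratio=bert_config.attention_probs_dropout_prob,
    hidden_dropout_ratio=bert_config.hidden_dropout_prob,
    num_hidden_layers=bert_config.num_hidden_layers,
    initializer_range=bert_config.initializer_range,
    # the missing argument:
    local_rank=args.local_rank if hasattr(args, 'local_rank') else -1,
    seed=args.seed,
    fp16=args.fp16,
    pre_layer_norm=False)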

tjruwase commented Oct 1, 2020

@YeDeming thanks for using DeepSpeed. Sorry about this issue, but I am glad you found a resolution.

From your solution, it seems like args.local_rank == -1 in the 2GPU case. Can you please confirm that by logging the value of args.local_rank at startup?
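
For example (a one-line sketch; "args" here is whatever comes out of argument parsing in nvidia_run_squad_deepspeed.py):

# Print the rank each process sees right after parsing arguments.
print('local_rank at startup:', getattr(args, 'local_rank', 'not set'), flush=True)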

YeDeming commented Oct 1, 2020

It seems that you forgot to add "local_rank=..." in

https://github.com/microsoft/DeepSpeedExamples/blob/ba63ad0fa861d28b3b33bc2c20f702647403e258/BingBertSquad/turing/nvidia_modeling.py#L510-L522

The problem also appears in nvidia_modelingpreln.py.

RezaYazdaniAminabadi (Contributor) commented

Hi @YeDeming

Thanks for pointing this out. Yes, you are right, we need this argument passed to the kernel. It was not needed previously because the device was set before creating the model, but after refactoring the code we forgot to pass it to the kernel. I have made a PR to fix this: deepspeedai/DeepSpeedExamples#58

Thanks.
Reza

YeDeming closed this as completed Oct 7, 2020