'CUDA error: an illegal memory access was encountered' in forward #308
Comments
@gongwei-130 Thanks for reporting this issue.
Sure. The error changes after I removed
Hi @gongwei-130, Thanks for trying out DeepSpeed. I wonder which versions of CUDA and PyTorch you are using here. Also, which GPU architecture are you using for your training? Best regards,
Hi Reza, my CUDA version is 10.2 and my PyTorch version is 1.6.0. The GPU I have is a Tesla V100-SXM2 32G. The full training log and the result of this test are as follows.
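As an aside, version strings like the ones reported above can be compared programmatically when checking compatibility. A minimal stdlib-only sketch (the minimum versions used here are illustrative assumptions, not requirements stated in this thread):

```python
def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '1.6.0' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

# Versions reported in this issue.
reported = {"pytorch": "1.6.0", "cuda": "10.2"}

# Hypothetical minimums, purely for illustration.
assert parse_version(reported["pytorch"]) >= (1, 2, 0)
assert parse_version(reported["cuda"]) >= (10, 0)
print("reported versions meet the illustrative minimums")
```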
Hi @gongwei-130, Thanks for running the test. So, it seems there is nothing wrong with the forward of the transformer kernels. Best regards,
Done. $ pytest tests/unit/test_dynamic_loss_scale.py::test_unfused_no_overflow -sv
Hi @gongwei-130 Sorry for the delayed response. Best regards,
Hi, I'm running into the following error when attempting to train BERT with ds_train_bert_bsz64k_seq128_m.sh. I printed out all tensor shapes in the batch and they look fine; I used train_micro_batch_size_per_gpu=8 and train_batch_size=64 since I have 8 cards.
The error occurs during the forward pass of the first training step.
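For reference, DeepSpeed expects the effective batch size to be consistent across these settings: train_batch_size must equal train_micro_batch_size_per_gpu times gradient_accumulation_steps times the number of GPUs. A minimal sketch of that arithmetic for the configuration reported above (gradient_accumulation_steps=1 is an assumption; the thread does not state it):

```python
# Values from the report above.
train_micro_batch_size_per_gpu = 8
num_gpus = 8                      # "8 cards"
train_batch_size = 64
gradient_accumulation_steps = 1   # assumed; not stated in the thread

# DeepSpeed's consistency rule:
#   train_batch_size == micro_batch * grad_accum_steps * num_gpus
effective = train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
assert effective == train_batch_size, (
    f"inconsistent batch config: {effective} != {train_batch_size}"
)
print("batch configuration is consistent")
```

Since 8 * 1 * 8 = 64, the reported settings are internally consistent, which is why the reporter notes the tensor shapes look fine.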