
FusedAdam fails to load on A100 #44

Open
giter000 opened this issue Nov 8, 2022 · 1 comment

giter000 commented Nov 8, 2022

Hi, I am trying to run the code on 2x NVIDIA A100-PCIE-40GB cards, using the provided Docker image directly. However, loading FusedAdam always fails with the error below. Reinstalling apex did not help, and I have not found a fix yet:

Total train epochs 10 | Total train iters 286497 |
building Enc-Dec model ...

number of parameters on model parallel rank 1: 5543798784
number of parameters on model parallel rank 0: 5543798784
Traceback (most recent call last):
File "/mnt/finetune_cpm2.py", line 808, in <module>
main()
File "/mnt/finetune_cpm2.py", line 791, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size, ds_config, prompt_config)
File "/mnt/utils.py", line 213, in setup_model_and_optimizer
optimizer = get_optimizer(model, args, prompt_config)
File "/mnt/utils.py", line 163, in get_optimizer
optimizer = Adam(param_groups,
File "/opt/conda/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 79, in __init__
raise RuntimeError('apex.optimizers.FusedAdam requires cuda extensions')
RuntimeError: apex.optimizers.FusedAdam requires cuda extensions

Is it possible to run this on 2x NVIDIA A100-PCIE-40GB cards? Does the apex environment in the image need any adjustment? Thanks.
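For context, apex raises this error when its compiled CUDA extension modules cannot be imported. A quick way to check whether they are present (a sketch; the extension module names `amp_C` and `fused_adam_cuda` are assumptions and vary across apex versions):

```python
import importlib.util

# FusedAdam needs apex's compiled CUDA extensions. If these modules are
# missing, apex was installed without the --cuda_ext build flag.
for mod in ("amp_C", "fused_adam_cuda"):  # names vary by apex version
    found = importlib.util.find_spec(mod) is not None
    print(f"{mod}: {'found' if found else 'missing'}")
```

If the modules are missing, rebuilding apex from source with the CUDA extensions enabled (`pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./`, the command documented in the apex README) usually resolves it; the build must also run against the same CUDA version your torch was compiled with.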

t1101675 (Contributor) commented

This looks like a CUDA configuration problem. Check whether the current environment can actually use CUDA, and whether a normal torch training run works.
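A minimal sanity check along those lines (a sketch, assuming only that torch is installed; it runs on CPU if no GPU is visible):

```python
import torch

# Basic CUDA sanity check: if these report False/0/None, the container's
# CUDA setup (not apex itself) is the underlying problem.
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
print("Torch built against CUDA:", torch.version.cuda)

# A tiny forward/backward pass, on GPU when available.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 4, device=device, requires_grad=True)
loss = (x @ x).sum()
loss.backward()
print("Backward OK, grad norm:", x.grad.norm().item())
```

If `torch.cuda.is_available()` is False inside the container on an A100 host, the driver/toolkit mismatch needs fixing before reinstalling apex will help.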
