Single-machine multi-GPU training problem #15
Hi,
Environment / training dataset / training config file and launch command (the PYTHONPATH="$(dirname $0)/..":$PYTHONPATH command).
Single-machine single-GPU training has no problems. I will try the mini dataset and see whether multi-GPU reports the same error.
OK. I don't have a multi-GPU machine at hand right now; once the multi-GPU resources arrive this week I'll fix it. It shouldn't be a big problem.
Hi, @Chenhuaizuo
Config files with the *_mini.py suffix are intended for the following scenario: the input is the mini dataset, used for single-GPU inference/deployment testing. If you want to test the training pipeline on the trainval dataset with a 4-GPU launch, I recommend using the following command and config file directly: clear && bash script/dist_train.sh dataset/config/sparse4d_temporal_r50_1x4_bs22_256x704.py
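For reference, here is a minimal sketch of the distributed initialization that a launcher like script/dist_train.sh typically triggers in the training entry point. The environment-variable handling assumes a torchrun / torch.distributed.launch style launcher; the actual contents of dist_train.sh and script/train.py may differ, so treat the function and variable names below as illustrative.

```python
# Illustrative sketch only: how a training entry point typically picks up the
# per-process rank/world size that a torchrun-style launcher (as dist_train.sh
# scripts commonly wrap) exports for each of the 4 spawned processes.
import os

import torch
import torch.distributed as dist


def init_distributed():
    # torchrun / torch.distributed.launch set these for every process it spawns.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Bind this process to its own GPU before creating the NCCL process group.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return rank, local_rank, world_size
```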
Hello. Single-machine single-GPU 3090, nuScenes v1.0-trainval dataset: I started training today, and so far (Iter [33955/1637100]) the NaN-loss issue has not reproduced.
However, with distributed training on four GPUs, the following error is raised at iter [95/660]:
Traceback (most recent call last):
  File "script/train.py", line 144, in <module>
    main()
  File "script/train.py", line 131, in main
    train_api(
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/tool/trainer/train_sdk.py", line 119, in train_api
    runner.run(data_loaders, cfg["workflow"])
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/tool/runner/iter_based_runner.py", line 109, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/tool/runner/iter_based_runner.py", line 36, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/tool/utils/distributed.py", line 50, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/modules/cnn/base_detector.py", line 251, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/modules/cnn/base_detector.py", line 209, in _parse_losses
    assert log_var_length == len(log_vars) * dist.get_world_size(), (
AssertionError: loss log variables are different across GPUs!
rank 2 len(log_vars): 25 keys: loss_cls_0,loss_box_0,loss_cns_0,loss_yns_0,loss_cls_1,loss_box_1,loss_cns_1,loss_yns_1,loss_cls_2,loss_box_2,loss_cns_2,loss_yns_2,loss_cls_3,loss_box_3,loss_cns_3,loss_yns_3,loss_cls_4,loss_box_4,loss_cns_4,loss_yns_4,loss_cls_5,loss_box_5,loss_cns_5,loss_yns_5,loss_dense_depth
How should I resolve this?
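For context on what the assertion is checking, below is a minimal sketch of the cross-rank consistency check that produces this error, reconstructed from the assert quoted in the traceback (the common mmdet-style _parse_losses pattern). It is an illustration, not the repository's exact code: each rank reports how many loss/log keys it produced, the counts are summed across ranks, and training aborts if any rank produced a different set of keys for that iteration.

```python
# Sketch of the check behind "loss log variables are different across GPUs!"
# (mirrors the assert shown in the traceback; names are illustrative).
import torch
import torch.distributed as dist


def check_log_vars_consistent(log_vars: dict, device: torch.device) -> None:
    if not (dist.is_available() and dist.is_initialized()):
        return
    # Each rank contributes its own key count; all_reduce sums them over ranks.
    log_var_length = torch.tensor(len(log_vars), device=device)
    dist.all_reduce(log_var_length)
    message = (
        f"rank {dist.get_rank()} len(log_vars): {len(log_vars)} "
        f"keys: {','.join(log_vars.keys())}"
    )
    # If every rank produced the same keys, the summed count equals
    # len(log_vars) * world_size; otherwise at least one rank is missing
    # (or has extra) loss terms for this iteration.
    assert log_var_length == len(log_vars) * dist.get_world_size(), (
        "loss log variables are different across GPUs!\n" + message
    )
```

In practice this usually means at least one rank computed a different set of loss terms for that iteration; comparing the key lists printed by each rank (here rank 2 reports 25 keys, including loss_dense_depth) can show which term is present on some ranks but missing on others.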