
Single-node multi-GPU training issue #15

Open
Chenhuaizuo opened this issue Dec 24, 2024 · 8 comments
@Chenhuaizuo

When running distributed training on four GPUs, the following error is raised at iter [95/660]:
Traceback (most recent call last):
  File "script/train.py", line 144, in <module>
    main()
  File "script/train.py", line 131, in main
    train_api(
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/tool/trainer/train_sdk.py", line 119, in train_api
    runner.run(data_loaders, cfg["workflow"])
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/tool/runner/iter_based_runner.py", line 109, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/tool/runner/iter_based_runner.py", line 36, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/tool/utils/distributed.py", line 50, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/modules/cnn/base_detector.py", line 251, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/home/hp/cyw/06 SparseE2E/SparseEnd2End-main/SparseEnd2End-main/modules/cnn/base_detector.py", line 209, in _parse_losses
    assert log_var_length == len(log_vars) * dist.get_world_size(), (
AssertionError: loss log variables are different across GPUs!
rank 2 len(log_vars): 25 keys: loss_cls_0,loss_box_0,loss_cns_0,loss_yns_0,loss_cls_1,loss_box_1,loss_cns_1,loss_yns_1,loss_cls_2,loss_box_2,loss_cns_2,loss_yns_2,loss_cls_3,loss_box_3,loss_cns_3,loss_yns_3,loss_cls_4,loss_box_4,loss_cns_4,loss_yns_4,loss_cls_5,loss_box_5,loss_cns_5,loss_yns_5,loss_dense_depth
How can I solve this?
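(For context: the assertion in `_parse_losses` is a cross-rank consistency check. Each rank counts its loss keys, the counts are all-reduced, and the sum must equal `len(log_vars) * world_size`. The sketch below is inferred from the traceback and is not necessarily the project's exact code; the helper name `check_log_vars_consistent` is illustrative.)

```python
# Minimal sketch of the cross-rank check behind the assertion above
# (inferred from the traceback; not necessarily the repo's exact code).
import torch
import torch.distributed as dist

def check_log_vars_consistent(log_vars: dict) -> None:
    """log_vars maps loss names (e.g. loss_cls_0) to scalar tensors on this rank."""
    if not (dist.is_available() and dist.is_initialized()):
        return
    # Each rank contributes its number of loss keys; after all_reduce the sum
    # must equal len(log_vars) * world_size. If one rank produces a different
    # set of loss terms, the check fails with the error seen above.
    log_var_length = torch.tensor(len(log_vars), device="cuda")
    dist.all_reduce(log_var_length)
    message = (f"rank {dist.get_rank()} len(log_vars): {len(log_vars)} "
               f"keys: {','.join(log_vars.keys())}")
    assert log_var_length == len(log_vars) * dist.get_world_size(), (
        "loss log variables are different across GPUs!\n" + message
    )
```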

@ThomasVonWu
Owner

ThomasVonWu commented Dec 24, 2024

Hi,
What environment are you using, and are you training on the mini dataset? Also, which training config file and launch command did you use? I'll see if I can reproduce it.

@ThomasVonWu ThomasVonWu self-assigned this Dec 24, 2024
@Chenhuaizuo
Author

Environment
Linux: Ubuntu 20.04.6
Python: 3.8.20
torch: 1.13.0
torchaudio: 0.13.0
torchvision: 0.14.0

Training dataset
Not the mini dataset; I am using nuScenes v1.0-trainval, selected by setting version="v1.0-trainval" in the config file dataset/config/sparse4d_temporal_r50_1x1_bs1_256x704_mini.py.

Training config file and launch command
Config file: dataset/config/sparse4d_temporal_r50_1x1_bs1_256x704_mini.py
Single-node multi-GPU distributed training script: script/dist_train.sh
```bash
#!/usr/bin/env bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
CONFIG=$1
GPUS=$2
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

# Launch one training process per GPU via torch.distributed.launch.
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py \
    $CONFIG \
    --launcher pytorch ${@:3}
```

Command
export PYTHONPATH=$PYTHONPATH:./
unset LD_LIBRARY_PATH
bash script/dist_train.sh dataset/config/sparse4d_temporal_r50_1x1_bs1_256x704_mini.py 4

Single-node single-GPU training works fine. I will try the mini dataset and see whether multi-GPU training reports the same error.

@Chenhuaizuo
Author

I have verified that single-node multi-GPU training on the mini dataset works.
The current situation: single-GPU and multi-GPU training on the mini dataset both work, and single-GPU training on the trainval dataset works, but multi-GPU training on the trainval dataset fails with AssertionError: loss log variables are different across GPUs!

@ThomasVonWu
Owner

OK. I don't have a multi-GPU machine at hand right now; once I get multi-GPU resources this week I'll look into a fix. It shouldn't be a big problem.

@yangwj2023

yangwj2023 commented Dec 25, 2024

Hello, I also ran into a problem a while ago: during single-node multi-GPU training the loss suddenly became NaN or 0. Please keep an eye on this in your next test as well, thanks. In addition, my single-node single-GPU training run was interrupted during testing, and I don't currently have a machine to repeat it; I will arrange a machine to retest in the next few days. GPU: RTX 3090, 24 GB. The logged issue is shown in the screenshot below:
[Screenshot: single-node multi-GPU training issue log]

@ThomasVonWu
Owner

ThomasVonWu commented Dec 25, 2024

I have verified that single-node multi-GPU training on the mini dataset works.
The current situation: single-GPU and multi-GPU training on the mini dataset both work, and single-GPU training on the trainval dataset works, but multi-GPU training on the trainval dataset fails with AssertionError: loss log variables are different across GPUs!

Hi, @Chenhuaizuo
I have roughly reproduced your problem. Did you also update the following in the mini config file?

num_gpus = 1 -> 4

Config files with the *_mini.py suffix are intended for mini-dataset input and single-GPU inference/deployment testing.

If you want to test the training pipeline on the trainval dataset with 4 GPUs, I suggest launching directly with the following command and config file:

clear && bash script/dist_train.sh dataset/config/sparse4d_temporal_r50_1x4_bs22_256x704.py
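(For illustration only: a small startup guard can catch this kind of config/launch mismatch early. `num_gpus` is the config field mentioned above; the helper `check_num_gpus` and everything else here are assumptions, not part of the repo.)

```python
# Illustrative startup guard (assumption, not repo code): verify that the
# config's num_gpus matches the number of processes actually launched, so a
# mismatch fails immediately instead of surfacing later during training.
import torch.distributed as dist

def check_num_gpus(cfg: dict) -> None:
    if not (dist.is_available() and dist.is_initialized()):
        return
    world_size = dist.get_world_size()
    cfg_num_gpus = cfg.get("num_gpus", 1)
    assert cfg_num_gpus == world_size, (
        f"config num_gpus={cfg_num_gpus} but world_size={world_size}; "
        "update num_gpus in the config to match --nproc_per_node."
    )
```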

@ThomasVonWu
Owner

Hello, I also ran into a problem a while ago: during single-node multi-GPU training the loss suddenly became NaN or 0. Please keep an eye on this in your next test as well, thanks. In addition, my single-node single-GPU training run was interrupted during testing, and I don't currently have a machine to repeat it; I will arrange a machine to retest in the next few days. GPU: RTX 3090, 24 GB. The logged issue is shown in the screenshot: [Screenshot: single-node multi-GPU training issue log]

Regarding the "loss is NaN or Inf" problem:
I haven't run experiments with single-node multi-GPU training on 3090s. I suggest adjusting the learning rate according to the batch size and number of GPUs; start with a smaller lr and, if training converges, increase it step by step.
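(A common heuristic for this adjustment, sketched with hypothetical numbers rather than values from the repo: scale the lr linearly with the total batch size, then apply a safety factor and only raise it once training stays stable. `scaled_lr` and the example values are illustrative.)

```python
# Linear-scaling heuristic for the learning rate (sketch with hypothetical
# numbers, not values from the repo's configs).
def scaled_lr(base_lr: float, base_total_bs: int,
              per_gpu_bs: int, num_gpus: int, safety: float = 0.5) -> float:
    """Scale lr with total batch size, then shrink it for the first runs."""
    total_bs = per_gpu_bs * num_gpus
    return base_lr * (total_bs / base_total_bs) * safety

# Hypothetical example: reference run used lr=2e-4 at total batch size 8;
# now training on 4 GPUs with batch size 1 per GPU.
print(scaled_lr(2e-4, base_total_bs=8, per_gpu_bs=1, num_gpus=4))  # 5e-05
```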

@yangwj2023

Regarding the "loss is NaN or Inf" problem:
I haven't run experiments with single-node multi-GPU training on 3090s. I suggest adjusting the learning rate according to the batch size and number of GPUs; start with a smaller lr and, if training converges, increase it step by step.

Hello. On a single 3090 (single node, single GPU) with the nuScenes v1.0-trainval dataset, training started today and so far (Iter [33955/1637100]) the loss-NaN issue has not reappeared.
In the next few days I will also try training with the parameter adjustments described above.
