ppv4: fine-tuning the detection model with multi-machine multi-GPU training, only one GPU runs validation during the eval stage #12213
Replies: 6 comments 1 reply
-
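My data is high-resolution. By my count, more than half of the images are 3000+, and without cropping it is very easy to run out of GPU memory. In the config file I set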
-
Hello, have you already managed to get v4 training running? May I ask:
-
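I have not run into this problem. It looks like a data dimension issue. First use the test_reader function in train.py to run through your dataset and check whether the data itself has problems, then fix whatever it turns up.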
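To make that kind of data check concrete, here is a minimal sketch of a standalone label checker. It is not the `test_reader` function itself. It assumes the detection label format of one image per line, `<image path>\t<JSON list>`, where each entry carries `transcription` and `points`; the name `check_det_label_line` is made up for illustration.

```python
import json


def check_det_label_line(line):
    """Return a list of problems found in one detection label line.

    Assumed format (illustrative): '<image path>\\t<JSON list of boxes>',
    each box being {"transcription": str, "points": [[x, y], ...]}.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        return ["expected '<image path>\\t<json annotations>'"]
    try:
        annos = json.loads(parts[1])
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(annos, list):
        return ["annotations must be a JSON list"]

    problems = []
    for i, anno in enumerate(annos):
        if not isinstance(anno, dict):
            problems.append(f"box {i}: not an object")
            continue
        points = anno.get("points")
        if not isinstance(points, list) or len(points) < 4:
            problems.append(f"box {i}: needs at least 4 points")
            continue
        for p in points:
            ok = (
                isinstance(p, (list, tuple))
                and len(p) == 2
                and all(isinstance(v, (int, float)) for v in p)
            )
            if not ok:
                problems.append(f"box {i}: malformed point {p!r}")
                break
        if "transcription" not in anno:
            problems.append(f"box {i}: missing 'transcription'")
    return problems
```

Running it over every line of the label file and printing the non-empty results narrows things down to the offending images before any GPU time is spent.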
-
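Hello, how did you get multi-machine multi-GPU training running? I have two machines, running in Docker with --network=host, with passwordless SSH set up in both directions, and I am training SVTR recognition, but I cannot get it to run no matter what I try.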
-
Please provide the following information to quickly locate the problem
System Environment:
Version: Paddle: 2.5 PaddleOCR:
Related components:
Command Code:
Complete Error Message:
![screenshot-20230821-190811](https://private-user-images.githubusercontent.com/28285213/262006303-f6f5f02d-c56f-4250-b99d-bb82be1982d9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3OTIzMzUsIm5iZiI6MTczOTc5MjAzNSwicGF0aCI6Ii8yODI4NTIxMy8yNjIwMDYzMDMtZjZmNWYwMmQtYzU2Zi00MjUwLWI5OWQtYmI4MmJlMTk4MmQ5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE3VDExMzM1NVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNjMDVhNGY2YjJjNDYyOTlkOGVmNDUwOGZkN2Q0ODgwYjE2ZDRkNTYxN2RkZjVjODg5ZTQ1MGVkYTU5NmEzZWQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.Ef7abJwGnSwUYRUND8G8wx0WYpr3us1_m-0J3jJh7ps)
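For validation I use batch_size_per_card: 1. With multi-machine multi-GPU training, I found that only one GPU runs validation while the other GPUs still hold their loaded training data, which wastes a lot of GPU memory. Could there be a mechanism that lets multiple GPUs validate together and does not keep training data loaded during validation? GPU usage is as follows:
In addition, GPU memory usage keeps growing with each eval run, and at some point the GPU runs out of memory. My images are fairly large; anything larger than 2000 gets cropped, but out-of-memory errors still occur. When I train with r50vd and crop only above 4000, the GPU does not run out of memory, and I do not know why.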
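For reference, the "all GPUs validate together" behavior asked for here can be sketched as below. This is only an illustration of the general idea, not how PaddleOCR's eval is implemented: each rank evaluates its own slice of the validation set, and the raw counts are summed before the metrics are computed. `shard_indices` and `merge_counts` are hypothetical helper names, and in a real job the per-rank counts would be exchanged with a collective operation such as an all-gather.

```python
def shard_indices(num_samples, rank, world_size):
    """Indices of the validation samples that a given rank should evaluate."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Strided split: rank r takes samples r, r + world_size, r + 2*world_size, ...
    return list(range(rank, num_samples, world_size))


def merge_counts(per_rank_counts):
    """Combine per-rank (hits, num_pred, num_gt) tuples into global metrics.

    Summing raw counts first keeps the result identical to a single-GPU run;
    averaging per-rank precision/recall would not when shards differ in size.
    """
    hits = sum(c[0] for c in per_rank_counts)
    num_pred = sum(c[1] for c in per_rank_counts)
    num_gt = sum(c[2] for c in per_rank_counts)
    precision = hits / num_pred if num_pred else 0.0
    recall = hits / num_gt if num_gt else 0.0
    denom = precision + recall
    hmean = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "hmean": hmean}
```

With 10 validation images on 4 GPUs, rank 0 would take images 0, 4, and 8, and rank 3 would take images 3 and 7, so no image is evaluated twice.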