Training according to the training scripts, but I cannot get the results of the paper #23
Comments
Dear author, I trained according to train_ms_detr_300.sh. Could you please tell me the training settings, environment, etc. in detail? I would be very grateful to you.
Hi, the train_ms_detr_300.sh script is trained on 8 * V100 GPUs. Did you modify the batch size per GPU? I have not run the model on a single GPU, so I do not know whether changing the global batch size will affect the final results. Here is the training log of MS-DETR run with train_ms_detr_300.sh; your loss, grad_norm, and first-epoch results look abnormal. Could you provide more information about your training script, or run the Deformable-DETR baseline without MS-DETR to verify the influence of single-process training?
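For reference, a minimal sketch of the Deformable-DETR baseline run suggested above, on a single GPU without the MS-DETR flags. It assumes the standard Deformable-DETR arguments --batch_size and --lr (the defaults are typically 2 images per GPU and 2e-4 for a 16-image global batch) and applies the common linear learning-rate scaling rule; the exact values are illustrative, not taken from this repository:
# Hypothetical single-process baseline run (no run_dist_launch.sh, no MS-DETR flags).
# --batch_size and --lr are standard Deformable-DETR arguments; the learning rate
# below is scaled linearly for a 2-image global batch (2e-4 * 2/16 = 2.5e-5).
python -u main.py \
    --output_dir $EXP_DIR \
    --with_box_refine \
    --two_stage \
    --dim_feedforward 2048 \
    --epochs 12 \
    --lr_drop 11 \
    --coco_path=$coco_path \
    --num_queries 300 \
    --batch_size 2 \
    --lr 2.5e-5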
Hi, I really appreciate your reply! I trained the model with the script posted below, and then the strange results shown above appeared. I do not know why, because I configured everything according to the training log of the paper; apart from using only a single GPU, all other parameter settings are the same. Thanks for the reminder! I have only one RTX 4090, so I will run the model on a single RTX 4090 to verify the influence of single-process training. Again, thank you very much!
The training script is as follows:
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 python -u main.py \
    --output_dir $EXP_DIR \
    --with_box_refine \
    --two_stage \
    --dim_feedforward 2048 \
    --epochs 12 \
    --lr_drop 11 \
    --coco_path=$coco_path \
    --num_queries 300 \
    --use_ms_detr \
    --use_aux_ffn \
    --cls_loss_coef 1 \
    --o2m_cls_loss_coef 2
Supplement: I trained on one RTX 4090, without distributed training.
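Note that GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 in the posted script would typically spawn eight training processes. A sketch of a genuine single-process launch with the same arguments, assuming main.py can be run directly without the distributed launcher (as in Deformable-DETR):
# Sketch: same arguments as the posted script, launched as a single process
# so that only the one RTX 4090 is used and no distributed setup is involved.
python -u main.py \
    --output_dir $EXP_DIR \
    --with_box_refine \
    --two_stage \
    --dim_feedforward 2048 \
    --epochs 12 \
    --lr_drop 11 \
    --coco_path=$coco_path \
    --num_queries 300 \
    --use_ms_detr \
    --use_aux_ffn \
    --cls_loss_coef 1 \
    --o2m_cls_loss_coef 2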
The training log and the results from the training run are attached:
log.txt
traing_model_results.txt