The default training script of DLRM v2 does not reach the reported AUC. #634

Kevin0624 · 2023-04-10T02:36:05Z

Hi team,

I ran the default training script with the following changes, based on the results table:
1. GLOBAL_BATCH_SIZE=16384
2. WORLD_SIZE=4 (4 A100 40GB GPUs)

P.S. I did not use the unique flags, because I could not find the corresponding argument in dlrm_main.py.

The training result is:
test AUC: 79.86% (target: 80.30%)

Does anyone have an idea why?

erichan1 · 2023-04-28T22:24:20Z

cc @janekl Any thoughts on this?

janekl · 2023-05-04T11:32:30Z

Hello, I first have two questions that will hopefully help us figure this out more effectively:

1. Could you share the exact command you tried?
2. What are "global_batch_size" and "opt_base_learning_rate" in the logs produced?

I expect that (batch size, learning rate) = (16384, 0.004) should work reasonably well and stably. But bear in mind that results may vary from run to run -- as the model is initialized randomly -- so it's best to run it several times.

Also, note that the threshold is 0.80275, not 0.803.
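As a rough illustration of judging several runs against that threshold rather than a single one, here is a minimal sketch. The helper name `summarize_runs` is hypothetical, and every AUC value in the example other than the 79.86% reported above is a placeholder, not a measured result. The reported figure is also a test-set AUC and appears here only as sample input.

```python
from statistics import mean

AUC_THRESHOLD = 0.80275  # threshold quoted in this thread


def summarize_runs(aucs, threshold=AUC_THRESHOLD):
    """Summarize final AUC values from several independent runs.

    Hypothetical helper: returns the number of runs, how many reached
    the threshold, and the mean and best AUC.
    """
    if not aucs:
        raise ValueError("need at least one run")
    passed = sum(1 for auc in aucs if auc >= threshold)
    return {
        "runs": len(aucs),
        "passed": passed,
        "mean": mean(aucs),
        "best": max(aucs),
    }


# 0.7986 is the value reported in this issue; the other two values are
# placeholders standing in for additional runs with different seeds.
example = summarize_runs([0.7986, 0.8010, 0.8030])
print(example)
```

A single run below the threshold says little on its own; the fraction of runs that pass is the more useful signal.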

Finally, for MLPerf you should look at the "eval_accuracy" logs for the validation set, not the test set (it is better not to use the --evaluate_on_training_end flag at all, to avoid confusion here).
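To pull these values out of a training log, something like the following sketch may help. It assumes the log follows the usual MLPerf logging convention, where each record is a line containing `:::MLLOG` followed by a JSON object with `key` and `value` fields. That format is an assumption here, so check it against your own output. The function names and sample lines are illustrative only, and the sample values are placeholders.

```python
import json

MLLOG_PREFIX = ":::MLLOG"  # assumed marker for MLPerf log records


def parse_mllog(lines):
    """Yield (key, value) pairs from lines that carry an MLLOG JSON payload."""
    for line in lines:
        idx = line.find(MLLOG_PREFIX)
        if idx == -1:
            continue  # ordinary stdout line, not a log record
        try:
            record = json.loads(line[idx + len(MLLOG_PREFIX):])
        except json.JSONDecodeError:
            continue  # malformed or truncated record
        if isinstance(record, dict) and "key" in record:
            yield record["key"], record.get("value")


def summarize(lines, threshold=0.80275):
    """Collect the two hyperparameters and the best validation eval_accuracy."""
    hparams = {}
    evals = []
    for key, value in parse_mllog(lines):
        if key in ("global_batch_size", "opt_base_learning_rate"):
            hparams[key] = value
        elif key == "eval_accuracy":
            evals.append(value)
    best = max(evals) if evals else None
    return hparams, best, best is not None and best >= threshold


# Placeholder log lines; real records carry more fields (time_ms, metadata, ...).
sample = [
    ':::MLLOG {"key": "global_batch_size", "value": 16384}',
    ':::MLLOG {"key": "opt_base_learning_rate", "value": 0.004}',
    ':::MLLOG {"key": "eval_accuracy", "value": 0.7991}',
    "unrelated stdout line",
    ':::MLLOG {"key": "eval_accuracy", "value": 0.8030}',
]
hparams, best_auc, reached = summarize(sample)
print(hparams, best_auc, reached)
```

With a real log, `summarize(open(path))` would work the same way, since a file object iterates over its lines.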

kkkparty · 2024-04-23T12:00:11Z

> Hi Teams,
>
> I have run the default training script with the following changes based on the results table 1. GLOBAL_BATCH_SIZE=16384 2. WORLD_SIZE=4 ( 4 A100 40GB GPUs)
>
> P.S. I did not use the unique flags. Because I could not find the corresponding argument in dlrm_main.py.
>
> the training result is: test AUC: 79.86% ( target: 80.30%)
>
> Is there anyone have any idea?

Did you load the dense part? I ran into a case where only the sparse weights were there, but no dense ones. Would you be willing to show me how to check the dense weights?

ShriyaPalsamudram · 2024-08-01T14:32:59Z

@Kevin0624 has this been resolved?
Closing, as it has been more than a year since the last activity.

ShriyaPalsamudram closed this as completed Aug 1, 2024
