The default training script of DLRM v2 does not reach the reported AUC. #634

Closed
Kevin0624 opened this issue Apr 10, 2023 · 4 comments

Comments

@Kevin0624

Hi team,

I ran the default training script with the following changes, based on the results table:
1. GLOBAL_BATCH_SIZE=16384
2. WORLD_SIZE=4 (4 A100 40GB GPUs)

P.S. I did not use the unique flags because I could not find the corresponding argument in dlrm_main.py.

The training result is:
test AUC: 79.86% (target: 80.30%)

Does anyone have an idea why?

@erichan1

cc @janekl Any thoughts on this?

@janekl
Contributor

janekl commented May 4, 2023

Hello, I have two questions first that should help us figure this out more effectively:

  1. Could you share the exact command you tried?
  2. What are "global_batch_size" and "opt_base_learning_rate" in the logs produced?

I expect that (batch size, learning rate) = (16384, 0.004) should work reasonably well and stably. But bear in mind that results may vary from run to run -- as the model is initialized randomly -- so it's best to run it several times.

Also, note that the threshold is 0.80275, not 0.803.
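To make the comparison against the threshold concrete, here is a minimal sketch that summarizes one or more runs against 0.80275. The helper name `summarize_runs` is hypothetical and not part of the reference implementation; the only data point used is the 79.86% figure reported in this issue, which was measured on the test set rather than the validation set.

```python
from statistics import mean

TARGET_AUC = 0.80275  # threshold quoted in this thread (not 0.803)


def summarize_runs(aucs, target=TARGET_AUC):
    """Summarize final AUC values from several runs against the target.

    Hypothetical helper for illustration only.
    """
    if not aucs:
        raise ValueError("need at least one run")
    return {
        "mean_auc": mean(aucs),
        "runs_reaching_target": sum(a >= target for a in aucs),
        "gaps_to_target": [target - a for a in aucs],
    }


# The single run reported in this issue: 79.86% AUC.
summary = summarize_runs([0.7986])
print(summary)
```

With more runs, pass all of their final AUC values in the list; a single run says little about run-to-run variation.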

Finally, for MLPerf you should look at the "eval_accuracy" logs for the validation set, not the test set (it is better not to use the --evaluate_on_training_end flag at all, to avoid confusion here).
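A rough sketch of pulling these keys out of a log, assuming the run emits MLPerf-style lines in which a `:::MLLOG` prefix is followed by a JSON object with `key` and `value` fields. That line format is an assumption about your log output, the function name `extract_keys` is hypothetical, and the sample lines below are made up for illustration using the (16384, 0.004) pairing mentioned above.

```python
import json

MLLOG_PREFIX = ":::MLLOG"  # assumed prefix of MLPerf-style log lines


def extract_keys(lines, keys):
    """Collect the logged values for the given keys, in order of appearance."""
    found = {k: [] for k in keys}
    for line in lines:
        idx = line.find(MLLOG_PREFIX)
        if idx < 0:
            continue  # not an MLPerf-style log line
        try:
            record = json.loads(line[idx + len(MLLOG_PREFIX):])
        except json.JSONDecodeError:
            continue  # skip malformed or truncated lines
        if isinstance(record, dict) and record.get("key") in found:
            found[record["key"]].append(record.get("value"))
    return found


# Illustrative sample lines, not real output from this issue.
sample = [
    ':::MLLOG {"key": "global_batch_size", "value": 16384}',
    ':::MLLOG {"key": "opt_base_learning_rate", "value": 0.004}',
    "some unrelated training output",
]
result = extract_keys(
    sample, ["global_batch_size", "opt_base_learning_rate", "eval_accuracy"]
)
print(result)
```

On a real log, read the lines from the file instead of `sample`; the `eval_accuracy` list would then hold the validation AUC at each evaluation point.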

@kkkparty

> Hi team,
>
> I ran the default training script with the following changes, based on the results table:
> 1. GLOBAL_BATCH_SIZE=16384
> 2. WORLD_SIZE=4 (4 A100 40GB GPUs)
>
> P.S. I did not use the unique flags because I could not find the corresponding argument in dlrm_main.py.
>
> The training result is: test AUC: 79.86% (target: 80.30%)
>
> Does anyone have an idea why?

Did you load the dense part? I ended up with only the sparse weights and no dense ones. Would you be willing to show me how to check the dense weights?

@ShriyaPalsamudram
Contributor

@Kevin0624 has this been resolved?
Closing as it has been more than a year since the last activity
