The default training script of DLRM v2 does not reach the reported AUC. #634
Comments
cc @janekl Any thoughts on this?
Hello, first I have two questions that will hopefully help us figure this out more effectively:
I expect that (batch size, learning rate) = (16384, 0.004) should work reasonably well and stably. But bear in mind that results may vary from run to run -- the model is initialized randomly -- so it is best to run it several times. Also, note that the threshold is 0.80275, not 0.803. Finally, for MLPerf you should look at the "eval_accuracy" logs for the validation set, not the test set (it is better just not to use it).
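For reference, below is a minimal sketch of how one might collect the validation "eval_accuracy" values from several runs and compare them against the 0.80275 threshold. It assumes MLPerf-style ":::MLLOG" JSON log lines and a hypothetical run_*/train.log directory layout, so adjust both to your setup:

```python
# Sketch: scan MLPerf-style logs from several runs and report the best
# validation AUC ("eval_accuracy") per run against the 0.80275 threshold.
# The ":::MLLOG" prefix and the run_*/train.log layout are assumptions here.
import glob
import json

THRESHOLD = 0.80275

for log_path in sorted(glob.glob("run_*/train.log")):  # hypothetical layout
    aucs = []
    with open(log_path) as f:
        for line in f:
            if ":::MLLOG" not in line:
                continue
            record = json.loads(line.split(":::MLLOG", 1)[1])
            if record.get("key") == "eval_accuracy":
                aucs.append(record["value"])
    if aucs:
        best = max(aucs)
        status = "PASS" if best >= THRESHOLD else "below threshold"
        print(f"{log_path}: best validation AUC = {best:.5f} ({status})")
```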
Did you load the dense part? I ended up with only the sparse weights loaded and none of the dense ones. Would you mind showing me how to check the dense weights?
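One rough way to check this (a sketch, not the repo's own tooling) is to load the checkpoint on CPU and look at which parameter tensors it actually contains, and whether the dense ones look trained. The "dense_arch"/"over_arch" substrings follow the TorchRec DLRM module layout, and the checkpoint path and dict structure below are assumptions:

```python
# Sketch: inspect a checkpoint to see whether dense (MLP) tensors are present
# alongside the sparse embedding tables. Path, dict layout, and the
# "dense_arch"/"over_arch"/"embedding" name patterns are assumptions.
import torch

checkpoint = torch.load("checkpoint.pt", map_location="cpu")  # hypothetical path
state_dict = checkpoint.get("model", checkpoint)

dense_keys = [k for k in state_dict if "dense_arch" in k or "over_arch" in k]
sparse_keys = [k for k in state_dict if "embedding" in k]
print(f"{len(dense_keys)} dense tensors, {len(sparse_keys)} sparse tensors found")

for k in dense_keys[:5]:
    t = state_dict[k]
    # An all-zero or default-initialized tensor suggests the dense part
    # was never actually loaded or updated.
    print(k, tuple(t.shape), f"abs mean = {float(t.abs().mean()):.6f}")
```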
@Kevin0624 has this been resolved?
Hi team,
I have run the default training script with the following changes, based on the results table:
1. GLOBAL_BATCH_SIZE=16384
2. WORLD_SIZE=4 (4× A100 40GB GPUs)
P.S. I did not use the unique flags, because I could not find the corresponding arguments in dlrm_main.py.
The training result is:
test AUC: 79.86% (target: 80.30%)
Does anyone have any idea?