Inconsistent results with CPU and GPU configs on the dataset ogbl-ppa #82
Comments
Interesting observations.
1.) I am surprised to hear that GPU memory was exceeded for this dataset. It should easily fit in GPU memory, given that the dataset only has 500,000 or so nodes. I've run datasets that are an order of magnitude larger on a single GPU. This may indicate a memory leak somewhere. At what point in training did the system fail, and do you have a stack trace?
2.) The MRR for both configurations looks quite weird: the CPU one is obviously low and the GPU one is quite high. One difference between the two configurations is that the CPU config uses async training and the GPU config uses sync training. So my guess is that async training is preventing model convergence in the CPU case. You can turn on sync training in the config.
3.) The GPU MRR is suspiciously high. This may be due to the evaluation configuration, which only samples 1000 nodes as negatives (500 uniformly and 500 by degree). You can try running filtered MRR (which will use all nodes to produce negatives) by changing the evaluation settings to:
If the hits@100 is inconsistent with the leaderboard results for this dataset, that would indicate a bug somewhere and I can investigate further.
4.) The configuration for this dataset is not tuned for it. The hyperparameters were chosen based on what worked well for the datasets in our paper (fb15k, livejournal, twitter and freebase86m). You will probably need to tune hyperparameters to get good model performance.
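To illustrate point 3, here is a minimal sketch of why ranking against a small sample of negatives can inflate MRR compared with ranking against all nodes. It is not Marius's evaluator: the scores are random toy values, the helper names are hypothetical, and the filtering of known true edges is left out.

```python
import random

def rank_of_positive(pos_score, neg_scores):
    """1-based rank of the positive edge among the negative candidates."""
    return 1 + sum(1 for s in neg_scores if s > pos_score)

def mrr(ranks):
    """Mean reciprocal rank over a list of 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of ranks that fall within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical toy setup: one positive edge scored against every candidate node.
rng = random.Random(0)
num_nodes = 10_000
all_neg_scores = [rng.random() for _ in range(num_nodes)]
pos_score = 0.95  # assumed score of the true edge

# Rank against all nodes versus rank against a small sample of them.
full_rank = rank_of_positive(pos_score, all_neg_scores)
sampled_neg_scores = rng.sample(all_neg_scores, 100)
sampled_rank = rank_of_positive(pos_score, sampled_neg_scores)

print("rank against all nodes:", full_rank)
print("rank against 100 sampled nodes:", sampled_rank)
```

Because the sampled negatives are a subset of all candidates, the sampled rank can never be worse than the full rank, and it is capped at the sample size plus one. The same model can therefore report a much higher MRR under sampled evaluation.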
The OOM error is triggered during inference. I attached the trace log below:
[info] [12/13/21 16:51:20.207] ################ Finished training epoch 1 ################
Ah, this is with the filtered evaluation settings I sent above? I was hoping it wouldn't OOM. That evaluation scenario is pretty memory intensive, since it uses all 500,000 nodes as negatives to compute the MRR. You can try decreasing the evaluation batch size, but that will make the evaluation process quite slow. If you want to compare against the OGB leaderboards, I think the best option is to export the trained embeddings from Marius and evaluate them using the OGB evaluators. May I ask what your intent is with training on this dataset? I can provide better recommendations and system configuration if I know what your end goal is.
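For reference, the export-and-evaluate route could look roughly like the sketch below. It reimplements a threshold-style Hits@K in plain Python, in the way I understand the OGB link-prediction evaluator computes it: a positive edge counts as a hit if it scores above the K-th highest negative edge. The dot-product scoring, the in-memory `embeddings` dict and the toy edges are all assumptions for illustration; the real Marius model may use a different score function, and leaderboard numbers should come from the official `ogb` evaluator.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def score_edges(embeddings, edges):
    """Score each (src, dst) edge; dot-product scoring is an assumption."""
    return [dot(embeddings[src], embeddings[dst]) for src, dst in edges]

def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positives scored above the k-th highest negative."""
    if len(neg_scores) < k:
        # Fewer than k negatives: every positive trivially counts as a hit.
        return 1.0
    kth_neg = sorted(neg_scores, reverse=True)[k - 1]
    return sum(1 for s in pos_scores if s > kth_neg) / len(pos_scores)

# Hypothetical exported embeddings (node id -> vector) and evaluation edges.
embeddings = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [0.1, 0.9],
}
pos_edges = [(0, 1), (2, 3)]
neg_edges = [(0, 2), (1, 3), (0, 3)]

pos_scores = score_edges(embeddings, pos_edges)
neg_scores = score_edges(embeddings, neg_edges)
print("Hits@1:", hits_at_k(pos_scores, neg_scores, 1))
```

Unlike per-edge ranking against all nodes, this metric compares each positive score against one shared set of negative edge scores, so it avoids the memory cost of using every node as a negative for every evaluation edge.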
Describe the bug
I got really weird evaluation results on the ogbl-ppa dataset with the CPU and GPU configs. I also had to change the memory setting to HostDevice for the GPU version because of its overwhelming GPU memory consumption (I thought the code could run within 16 GB, but it eventually exceeded 24 GB).
To Reproduce
Steps to reproduce the behavior:
Run the marius script with the configs ogbl_ppa_cpu.ini and ogbl_ppa_gpu.ini. The logs for epoch 3 of each run are below, first for the CPU config (MRR 0.008) and then for the GPU config (MRR 0.991):
[2021-12-12 02:47:01.554] [info] [trainer.cpp:68] ################ Starting training epoch 3 ################
[2021-12-12 02:49:36.904] [info] [trainer.cpp:94] Total Edges Processed: 44586862, Percent Complete: 0.100
[2021-12-12 02:52:19.113] [info] [trainer.cpp:94] Total Edges Processed: 46709862, Percent Complete: 0.200
[2021-12-12 02:55:00.754] [info] [trainer.cpp:94] Total Edges Processed: 48832862, Percent Complete: 0.300
[2021-12-12 02:57:44.074] [info] [trainer.cpp:94] Total Edges Processed: 50955862, Percent Complete: 0.400
[2021-12-12 03:00:25.467] [info] [trainer.cpp:94] Total Edges Processed: 53078862, Percent Complete: 0.500
[2021-12-12 03:03:09.531] [info] [trainer.cpp:94] Total Edges Processed: 55201862, Percent Complete: 0.600
[2021-12-12 03:06:03.269] [info] [trainer.cpp:94] Total Edges Processed: 57324862, Percent Complete: 0.700
[2021-12-12 03:08:51.169] [info] [trainer.cpp:94] Total Edges Processed: 59447862, Percent Complete: 0.800
[2021-12-12 03:11:32.560] [info] [trainer.cpp:94] Total Edges Processed: 61570862, Percent Complete: 0.900
[2021-12-12 03:14:13.438] [info] [trainer.cpp:94] Total Edges Processed: 63693862, Percent Complete: 1.000
[2021-12-12 03:14:13.558] [info] [trainer.cpp:99] ################ Finished training epoch 3 ################
[2021-12-12 03:14:13.558] [info] [trainer.cpp:104] Epoch Runtime (Before shuffle/sync): 1632004ms
[2021-12-12 03:14:13.558] [info] [trainer.cpp:105] Edges per Second (Before shuffle/sync): 13009.73
[2021-12-12 03:14:14.870] [info] [dataset.cpp:761] Edges Shuffled
[2021-12-12 03:14:14.870] [info] [trainer.cpp:113] Epoch Runtime (Including shuffle/sync): 1633315ms
[2021-12-12 03:14:14.870] [info] [trainer.cpp:114] Edges per Second (Including shuffle/sync): 12999.288
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:95] Num Eval Edges: 6062562
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:96] Num Eval Batches: 0
[2021-12-12 03:14:37.284] [info] [evaluator.cpp:97] Auc: 0.508, Avg Ranks: 490.966, MRR: 0.008, Hits@1: 0.006, Hits@5: 0.007, Hits@10: 0.007, Hits@20: 0.008, Hits@50: 0.008, Hits@100: 0.009
[2021-12-13 01:53:58.848] [info] [trainer.cpp:68] ################ Starting training epoch 3 ################
[2021-12-13 01:54:03.413] [info] [trainer.cpp:94] Total Edges Processed: 44583862, Percent Complete: 0.100
[2021-12-13 01:54:07.270] [info] [trainer.cpp:94] Total Edges Processed: 46703862, Percent Complete: 0.200
[2021-12-13 01:54:11.005] [info] [trainer.cpp:94] Total Edges Processed: 48823862, Percent Complete: 0.299
[2021-12-13 01:54:15.259] [info] [trainer.cpp:94] Total Edges Processed: 50943862, Percent Complete: 0.399
[2021-12-13 01:54:19.315] [info] [trainer.cpp:94] Total Edges Processed: 53063862, Percent Complete: 0.499
[2021-12-13 01:54:23.355] [info] [trainer.cpp:94] Total Edges Processed: 55183862, Percent Complete: 0.599
[2021-12-13 01:54:27.633] [info] [trainer.cpp:94] Total Edges Processed: 57303862, Percent Complete: 0.699
[2021-12-13 01:54:31.465] [info] [trainer.cpp:94] Total Edges Processed: 59423862, Percent Complete: 0.798
[2021-12-13 01:54:35.505] [info] [trainer.cpp:94] Total Edges Processed: 61543862, Percent Complete: 0.898
[2021-12-13 01:54:39.482] [info] [trainer.cpp:94] Total Edges Processed: 63663862, Percent Complete: 0.998
[2021-12-13 01:54:39.547] [info] [trainer.cpp:99] ################ Finished training epoch 3 ################
[2021-12-13 01:54:39.547] [info] [trainer.cpp:104] Epoch Runtime (Before shuffle/sync): 40698ms
[2021-12-13 01:54:39.547] [info] [trainer.cpp:105] Edges per Second (Before shuffle/sync): 521694.72
[2021-12-13 01:54:40.847] [info] [dataset.cpp:761] Edges Shuffled
[2021-12-13 01:54:40.847] [info] [trainer.cpp:113] Epoch Runtime (Including shuffle/sync): 41998ms
[2021-12-13 01:54:40.847] [info] [trainer.cpp:114] Edges per Second (Including shuffle/sync): 505546.25
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:95] Num Eval Edges: 6062562
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:96] Num Eval Batches: 0
[2021-12-13 01:54:58.952] [info] [evaluator.cpp:97] Auc: 0.992, Avg Ranks: 2.925, MRR: 0.991, Hits@1: 0.990, Hits@5: 0.991, Hits@10: 0.991, Hits@20: 0.992, Hits@50: 0.993, Hits@100: 0.995
Environment
List your operating system, and dependency versions
Python 3.7.10
pytorch 1.7.1 (py3.7_cuda10.1.243_cudnn7.6.3_0)
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
cmake version 3.16.3
GNU Make 4.2.1