Low evaluation accuracy on the ogbn-arxiv example in the doc #153

shengzeang · 2024-01-03T08:31:01Z

Describe the bug

After successfully building Marius, I tried to reproduce the [ogbn-arxiv example](https://marius-project.org/marius/examples/config/nc_ogbn_arxiv.html) in the docs by following the instructions. Neither marius_preprocess nor marius_train reports any error. However, the evaluation accuracy is low compared to the results in the docs, and it barely improves during training: validation accuracy fluctuates around 53% and test accuracy around 50%. The training configuration file is the same as the one given in the docs example.

My training output at epochs 1, 5, and 10:

[01/03/24 07:10:34.239] ################ Starting training epoch 1 ################
[01/03/24 07:10:36.136] Nodes processed: [10000/90941], 11.00%
[01/03/24 07:10:36.683] Nodes processed: [20000/90941], 21.99%
[01/03/24 07:10:38.389] Nodes processed: [30000/90941], 32.99%
[01/03/24 07:10:39.993] Nodes processed: [40000/90941], 43.98%
[01/03/24 07:10:40.599] Nodes processed: [50000/90941], 54.98%
[01/03/24 07:10:41.634] Nodes processed: [60000/90941], 65.98%
[01/03/24 07:10:42.896] Nodes processed: [70000/90941], 76.97%
[01/03/24 07:10:43.368] Nodes processed: [80000/90941], 87.97%
[01/03/24 07:10:44.518] Nodes processed: [90000/90941], 98.97%
[01/03/24 07:10:44.680] Nodes processed: [90941/90941], 100.00%
[01/03/24 07:10:44.680] ################ Finished training epoch 1 ################
[01/03/24 07:10:44.680] Epoch Runtime: 10441ms
[01/03/24 07:10:44.680] Nodes per Second: 8709.989
[01/03/24 07:10:44.680] Evaluating validation set
[01/03/24 07:10:46.114] 
=================================
Node Classification: 29799 nodes evaluated
Accuracy: 49.934562%
=================================
[01/03/24 07:10:46.114] Evaluating test set
[01/03/24 07:10:50.094] 
=================================
Node Classification: 48603 nodes evaluated
Accuracy: 47.850956%

[01/03/24 07:11:33.628] ################ Starting training epoch 5 ################
[01/03/24 07:11:34.587] Nodes processed: [10000/90941], 11.00%
[01/03/24 07:11:36.210] Nodes processed: [20000/90941], 21.99%
[01/03/24 07:11:37.551] Nodes processed: [30000/90941], 32.99%
[01/03/24 07:11:38.041] Nodes processed: [40000/90941], 43.98%
[01/03/24 07:11:38.623] Nodes processed: [50000/90941], 54.98%
[01/03/24 07:11:39.214] Nodes processed: [60000/90941], 65.98%
[01/03/24 07:11:39.721] Nodes processed: [70000/90941], 76.97%
[01/03/24 07:11:40.329] Nodes processed: [80000/90941], 87.97%
[01/03/24 07:11:40.892] Nodes processed: [90000/90941], 98.97%
[01/03/24 07:11:40.986] Nodes processed: [90941/90941], 100.00%
[01/03/24 07:11:40.986] ################ Finished training epoch 5 ################
[01/03/24 07:11:40.986] Epoch Runtime: 7357ms
[01/03/24 07:11:40.986] Nodes per Second: 12361.153
[01/03/24 07:11:40.986] Evaluating validation set
[01/03/24 07:11:42.016] 
=================================
Node Classification: 29799 nodes evaluated
Accuracy: 53.384342%
=================================
[01/03/24 07:11:42.016] Evaluating test set
[01/03/24 07:11:43.721] 
=================================
Node Classification: 48603 nodes evaluated
Accuracy: 50.550378%

[01/03/24 07:12:15.394] ################ Starting training epoch 10 ################
[01/03/24 07:12:15.930] Nodes processed: [10000/90941], 11.00%
[01/03/24 07:12:16.525] Nodes processed: [20000/90941], 21.99%
[01/03/24 07:12:17.155] Nodes processed: [30000/90941], 32.99%
[01/03/24 07:12:17.642] Nodes processed: [40000/90941], 43.98%
[01/03/24 07:12:18.262] Nodes processed: [50000/90941], 54.98%
[01/03/24 07:12:18.858] Nodes processed: [60000/90941], 65.98%
[01/03/24 07:12:19.383] Nodes processed: [70000/90941], 76.97%
[01/03/24 07:12:19.978] Nodes processed: [80000/90941], 87.97%
[01/03/24 07:12:20.599] Nodes processed: [90000/90941], 98.97%
[01/03/24 07:12:20.644] Nodes processed: [90941/90941], 100.00%
[01/03/24 07:12:20.644] ################ Finished training epoch 10 ################
[01/03/24 07:12:20.644] Epoch Runtime: 5249ms
[01/03/24 07:12:20.644] Nodes per Second: 17325.395
[01/03/24 07:12:20.644] Evaluating validation set
[01/03/24 07:12:21.676] 
=================================
Node Classification: 29799 nodes evaluated
Accuracy: 52.917883%
=================================
[01/03/24 07:12:21.676] Evaluating test set
[01/03/24 07:12:23.386] 
=================================
Node Classification: 48603 nodes evaluated
Accuracy: 50.303479%
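
To make the trend across epochs easier to compare, the accuracy lines can be pulled out of the log with a short script. This is only a minimal sketch that assumes the log format shown above; the function name `parse_accuracies` and the regular expressions are my own and are not part of Marius.

```python
import re

# Patterns match the log lines shown above; they are assumptions about the
# format, not an official Marius log specification.
EPOCH_RE = re.compile(r"Starting training epoch (\d+)")
SPLIT_RE = re.compile(r"Evaluating (validation|test) set")
ACC_RE = re.compile(r"Accuracy: ([0-9.]+)%")


def parse_accuracies(log_text):
    """Return {epoch: {"validation": acc, "test": acc}} parsed from a training log."""
    results = {}
    epoch = None   # epoch currently being read
    split = None   # "validation" or "test", set by the most recent "Evaluating" line
    for line in log_text.splitlines():
        match = EPOCH_RE.search(line)
        if match:
            epoch = int(match.group(1))
            results[epoch] = {}
            split = None
            continue
        match = SPLIT_RE.search(line)
        if match:
            split = match.group(1)
            continue
        match = ACC_RE.search(line)
        if match and epoch is not None and split is not None:
            results[epoch][split] = float(match.group(1))
            split = None  # wait for the next "Evaluating ..." line
    return results
```

Calling `parse_accuracies(open("train.log").read())` (where `train.log` is a hypothetical file holding the saved output) gives one entry per epoch, which makes it easy to see whether accuracy is still moving or has plateaued.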

I wonder what could have gone wrong in the process. The build succeeded, and no errors were reported during preprocessing or training. My environment is listed below.

I'd be glad for any help! Thank you!

Environment
Results of running marius_env_info:

cmake:
  version: 3.28.1
cpu_info:
  num_cpus: 96
  total_memory: 375GB
cuda:
  version: '11.7'
gpu_info:
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
marius:
  bindings_installed: true
  install_path: /usr/local/lib/python3.8/dist-packages/marius
  version: 0.0.2
openmp:
  version: '201511'
operating_system:
  platform: Linux-3.10.107-1-tlinux2-0054-x86_64-with-glibc2.29
pybind:
  PYBIND11_BUILD_ABI: _cxxabi1011
  PYBIND11_COMPILER_TYPE: _gcc
  PYBIND11_STDLIB: _libstdcpp
python:
  deps:
    numpy_version: 1.24.4
    omegaconf_version: 2.3.0
    pandas_version: 2.0.3
    pip_version: 20.0.2
    pyspark_version: N/A
    pytest_version: N/A
    torch_version: !!python/object/new:torch.torch_version.TorchVersion
      - 2.0.1+cu117
    tox_version: N/A
  version: "3.8.10 (default, Nov 22 2023, 10:22:35) \n[GCC 9.4.0]"
pytorch:
  install_path: /usr/local/lib/python3.8/dist-packages/torch
  version: !!python/object/new:torch.torch_version.TorchVersion
    - 2.0.1+cu117


shengzeang added the bug label (Something isn't working) on Jan 3, 2024
