Low evaluation accuracy on the ogbn-arxiv example in the doc #153

Open
shengzeang opened this issue Jan 3, 2024 · 0 comments
Labels
bug (Something isn't working)

Describe the bug

After successfully building Marius, I tried to reproduce the [ogbn-arxiv example](https://marius-project.org/marius/examples/config/nc_ogbn_arxiv.html) from the documentation by following its instructions. Neither marius_preprocess nor marius_train reported any errors. However, the evaluation accuracy is low compared to the results shown in the documentation, and it barely improves during training: validation accuracy fluctuates around 53% and test accuracy around 50%. The training configuration file is identical to the one given in the documentation example.

Training output at epochs 1, 5, and 10:

[01/03/24 07:10:34.239] ################ Starting training epoch 1 ################
[01/03/24 07:10:36.136] Nodes processed: [10000/90941], 11.00%
[01/03/24 07:10:36.683] Nodes processed: [20000/90941], 21.99%
[01/03/24 07:10:38.389] Nodes processed: [30000/90941], 32.99%
[01/03/24 07:10:39.993] Nodes processed: [40000/90941], 43.98%
[01/03/24 07:10:40.599] Nodes processed: [50000/90941], 54.98%
[01/03/24 07:10:41.634] Nodes processed: [60000/90941], 65.98%
[01/03/24 07:10:42.896] Nodes processed: [70000/90941], 76.97%
[01/03/24 07:10:43.368] Nodes processed: [80000/90941], 87.97%
[01/03/24 07:10:44.518] Nodes processed: [90000/90941], 98.97%
[01/03/24 07:10:44.680] Nodes processed: [90941/90941], 100.00%
[01/03/24 07:10:44.680] ################ Finished training epoch 1 ################
[01/03/24 07:10:44.680] Epoch Runtime: 10441ms
[01/03/24 07:10:44.680] Nodes per Second: 8709.989
[01/03/24 07:10:44.680] Evaluating validation set
[01/03/24 07:10:46.114] 
=================================
Node Classification: 29799 nodes evaluated
Accuracy: 49.934562%
=================================
[01/03/24 07:10:46.114] Evaluating test set
[01/03/24 07:10:50.094] 
=================================
Node Classification: 48603 nodes evaluated
Accuracy: 47.850956%
[01/03/24 07:11:33.628] ################ Starting training epoch 5 ################
[01/03/24 07:11:34.587] Nodes processed: [10000/90941], 11.00%
[01/03/24 07:11:36.210] Nodes processed: [20000/90941], 21.99%
[01/03/24 07:11:37.551] Nodes processed: [30000/90941], 32.99%
[01/03/24 07:11:38.041] Nodes processed: [40000/90941], 43.98%
[01/03/24 07:11:38.623] Nodes processed: [50000/90941], 54.98%
[01/03/24 07:11:39.214] Nodes processed: [60000/90941], 65.98%
[01/03/24 07:11:39.721] Nodes processed: [70000/90941], 76.97%
[01/03/24 07:11:40.329] Nodes processed: [80000/90941], 87.97%
[01/03/24 07:11:40.892] Nodes processed: [90000/90941], 98.97%
[01/03/24 07:11:40.986] Nodes processed: [90941/90941], 100.00%
[01/03/24 07:11:40.986] ################ Finished training epoch 5 ################
[01/03/24 07:11:40.986] Epoch Runtime: 7357ms
[01/03/24 07:11:40.986] Nodes per Second: 12361.153
[01/03/24 07:11:40.986] Evaluating validation set
[01/03/24 07:11:42.016] 
=================================
Node Classification: 29799 nodes evaluated
Accuracy: 53.384342%
=================================
[01/03/24 07:11:42.016] Evaluating test set
[01/03/24 07:11:43.721] 
=================================
Node Classification: 48603 nodes evaluated
Accuracy: 50.550378%
[01/03/24 07:12:15.394] ################ Starting training epoch 10 ################
[01/03/24 07:12:15.930] Nodes processed: [10000/90941], 11.00%
[01/03/24 07:12:16.525] Nodes processed: [20000/90941], 21.99%
[01/03/24 07:12:17.155] Nodes processed: [30000/90941], 32.99%
[01/03/24 07:12:17.642] Nodes processed: [40000/90941], 43.98%
[01/03/24 07:12:18.262] Nodes processed: [50000/90941], 54.98%
[01/03/24 07:12:18.858] Nodes processed: [60000/90941], 65.98%
[01/03/24 07:12:19.383] Nodes processed: [70000/90941], 76.97%
[01/03/24 07:12:19.978] Nodes processed: [80000/90941], 87.97%
[01/03/24 07:12:20.599] Nodes processed: [90000/90941], 98.97%
[01/03/24 07:12:20.644] Nodes processed: [90941/90941], 100.00%
[01/03/24 07:12:20.644] ################ Finished training epoch 10 ################
[01/03/24 07:12:20.644] Epoch Runtime: 5249ms
[01/03/24 07:12:20.644] Nodes per Second: 17325.395
[01/03/24 07:12:20.644] Evaluating validation set
[01/03/24 07:12:21.676] 
=================================
Node Classification: 29799 nodes evaluated
Accuracy: 52.917883%
=================================
[01/03/24 07:12:21.676] Evaluating test set
[01/03/24 07:12:23.386] 
=================================
Node Classification: 48603 nodes evaluated
Accuracy: 50.303479%
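
To make the plateau easier to see, here is a minimal sketch that pulls the per-epoch accuracies out of a log in the format shown above. It is not part of Marius; `parse_accuracies` and the abbreviated `LOG` string are hypothetical helpers written for this illustration, and the parser assumes each `Accuracy:` line follows an `Evaluating ... set` line, as it does in the output above.

```python
import re

# Abbreviated copy of the log above: only the lines the parser relies on.
LOG = """\
[01/03/24 07:10:44.680] ################ Finished training epoch 1 ################
[01/03/24 07:10:44.680] Evaluating validation set
Accuracy: 49.934562%
[01/03/24 07:10:46.114] Evaluating test set
Accuracy: 47.850956%
[01/03/24 07:11:40.986] ################ Finished training epoch 5 ################
[01/03/24 07:11:40.986] Evaluating validation set
Accuracy: 53.384342%
[01/03/24 07:11:42.016] Evaluating test set
Accuracy: 50.550378%
[01/03/24 07:12:20.644] ################ Finished training epoch 10 ################
[01/03/24 07:12:20.644] Evaluating validation set
Accuracy: 52.917883%
[01/03/24 07:12:21.676] Evaluating test set
Accuracy: 50.303479%
"""


def parse_accuracies(log_text):
    """Return {epoch: {"validation": acc, "test": acc}} from a training log."""
    results = {}
    epoch = None   # epoch whose evaluation block we are currently inside
    split = None   # "validation" or "test", set by the last "Evaluating" line
    for line in log_text.splitlines():
        m = re.search(r"Finished training epoch (\d+)", line)
        if m:
            epoch = int(m.group(1))
            results[epoch] = {}
            continue
        m = re.search(r"Evaluating (validation|test) set", line)
        if m:
            split = m.group(1)
            continue
        m = re.match(r"\s*Accuracy: ([\d.]+)%", line)
        if m and epoch is not None and split is not None:
            results[epoch][split] = float(m.group(1))
    return results


acc = parse_accuracies(LOG)
for ep in sorted(acc):
    print(f"epoch {ep:2d}: val={acc[ep]['validation']:.2f}%  test={acc[ep]['test']:.2f}%")
```

On the three epochs quoted here, validation accuracy moves by less than 3.5 percentage points between epoch 1 and epoch 5 and then drops slightly by epoch 10, which is the stagnation described above.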

I wonder what could have gone wrong in this process. The build succeeded, and no errors were reported during preprocessing or training. My environment is listed below.
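
One thing the logs do seem to rule out is a wrong node split. As a small sanity check, the sketch below adds up the node counts reported in the training and evaluation output and compares the total against the ogbn-arxiv node count. The value 169343 is an assumption taken from the OGB dataset description, not from the Marius output, and the constant names are made up for this example.

```python
# Node counts as reported in the training/evaluation logs above.
TRAIN_NODES = 90941
VALID_NODES = 29799
TEST_NODES = 48603

# Assumption: total node count of ogbn-arxiv as listed in the OGB dataset description.
OGBN_ARXIV_NUM_NODES = 169343

total = TRAIN_NODES + VALID_NODES + TEST_NODES
split_matches = total == OGBN_ARXIV_NUM_NODES

print(f"train + valid + test = {total}")
print(f"matches expected node count: {split_matches}")
for name, count in [("train", TRAIN_NODES), ("valid", VALID_NODES), ("test", TEST_NODES)]:
    print(f"{name}: {count / total:.1%} of nodes")
```

If the totals agree, the preprocessing step at least produced train, validation, and test sets of the expected sizes, so the low accuracy is more likely related to features, labels, or training than to the split sizes themselves.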

I'd be glad for any help! Thank you!

Environment
Results of running marius_env_info:

cmake:
  version: 3.28.1
cpu_info:
  num_cpus: 96
  total_memory: 375GB
cuda:
  version: '11.7'
gpu_info:
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
  - memory: 32GB
    name: Tesla V100-SXM2-32GB
marius:
  bindings_installed: true
  install_path: /usr/local/lib/python3.8/dist-packages/marius
  version: 0.0.2
openmp:
  version: '201511'
operating_system:
  platform: Linux-3.10.107-1-tlinux2-0054-x86_64-with-glibc2.29
pybind:
  PYBIND11_BUILD_ABI: _cxxabi1011
  PYBIND11_COMPILER_TYPE: _gcc
  PYBIND11_STDLIB: _libstdcpp
python:
  deps:
    numpy_version: 1.24.4
    omegaconf_version: 2.3.0
    pandas_version: 2.0.3
    pip_version: 20.0.2
    pyspark_version: N/A
    pytest_version: N/A
    torch_version: !!python/object/new:torch.torch_version.TorchVersion
      - 2.0.1+cu117
    tox_version: N/A
  version: "3.8.10 (default, Nov 22 2023, 10:22:35) \n[GCC 9.4.0]"
pytorch:
  install_path: /usr/local/lib/python3.8/dist-packages/torch
  version: !!python/object/new:torch.torch_version.TorchVersion
    - 2.0.1+cu117
@shengzeang added the bug label on Jan 3, 2024