
Run disk version of MariusGNN failed #143

Open
JIESUN233 opened this issue Sep 12, 2023 · 2 comments
Labels
question Further information is requested

Comments

@JIESUN233

Hi, I'd like to run the disk version of MariusGNN. I found that when I set the feature storage type to PARTITION_BUFFER, I hit a segmentation fault:
[screenshot: segmentation fault error]
(If I set the storage type to HOST_MEMORY, the training procedure runs successfully.)
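
To make the failing vs. working toggle concrete, here is a minimal sketch of the relevant setting; the surrounding keys are assumed from the Marius storage config format shown later in this thread:

  features:
    type: PARTITION_BUFFER   # disk-based feature storage: the setting that segfaults
  # vs.
  features:
    type: HOST_MEMORY        # in-memory feature storage: the setting that trains successfully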

Specifically, I downloaded the master branch of Marius, built a Docker image, and ran the experiments inside the container. I followed these instructions to install Marius:
[screenshot: installation instructions]

Here is my training config file:
[screenshot: training config file]

And this is how I generated the dataset:
[screenshot: dataset generation command]
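
The exact command is in the screenshot above; for context, a hypothetical preprocessing call of the following shape would produce a partitioned dataset matching the configs in this thread (the dataset name, output directory, and partition count are assumptions on my part):

  # hypothetical example; the actual command was in the screenshot
  marius_preprocess --dataset ogbn_products --output_dir products_example/ --num_partitions 32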

This is the dataset directory:
[screenshot: dataset directory listing]

I would really appreciate it if you could help resolve this issue!

@JIESUN233 JIESUN233 added the question Further information is requested label Sep 12, 2023
@rogerwaleffe
Collaborator

Hi there. Thanks for your question. It's not immediately obvious to me why this isn't working, but it may be because you are trying to put the edges in HOST_MEMORY. Can you try the following for your storage config:

  device_type: cuda
  dataset_dir: products_example/
  edges:
    type: FLAT_FILE
  nodes:
    type: HOST_MEMORY
  features:
    type: PARTITION_BUFFER
    options:
      num_partitions: 32
      buffer_capacity: 5
      prefetching: true
      fine_to_coarse_ratio: 1
      num_cache_partitions: 0
      node_partition_ordering: DISPERSED
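
A brief note on what these options control, as I understand them (my reading of the disk-based training design; treat the comments as interpretation, not authoritative documentation):

  options:
    num_partitions: 32                  # node features are split into this many partitions on disk
    buffer_capacity: 5                  # number of partitions kept resident in the in-memory buffer
    prefetching: true                   # asynchronously load upcoming partitions during training
    fine_to_coarse_ratio: 1             # ratio of fine to coarse partitions in the two-level partitioning scheme
    num_cache_partitions: 0             # partitions pinned in the buffer across swaps
    node_partition_ordering: DISPERSED  # policy for the order in which partitions are processed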

@rogerwaleffe
Collaborator

I looked into this issue a bit more and it seems there were some bugs in the code that appeared very infrequently, but more often when running disk-based training. I have fixed those issues in this PR (#147) and merged the changes into main.
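
To pick up the fix, you will need to update to the latest main and reinstall. A minimal sketch, assuming a pip-based install from a cloned checkout as in the Marius README (your Docker setup may require rebuilding the image instead):

  cd marius             # existing checkout; path assumed
  git pull origin main  # pull the merged fix
  pip3 install .        # rebuild and reinstall Marius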

Can you try running your config again?

With the updates, I did not have any issues running the following.

Preprocessing command:
marius_preprocess --dataset ogbn_arxiv --output_dir datasets/ogbn_arxiv --num_partitions 32

Config:

model:
  learning_task: NODE_CLASSIFICATION
  encoder:
    use_incoming_nbrs: true
    use_outgoing_nbrs: true
    train_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
        use_hashmap_sets: true
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5
    eval_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
        use_hashmap_sets: true
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5
    layers:
      - - type: FEATURE
          output_dim: 128
          bias: false
          activation: NONE
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          init:
            type: GLOROT_NORMAL
          input_dim: 128
          output_dim: 128
          bias: true
          bias_init:
            type: ZEROS
          activation: RELU
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          init:
            type: GLOROT_NORMAL
          input_dim: 128
          output_dim: 128
          bias: true
          bias_init:
            type: ZEROS
          activation: RELU
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          init:
            type: GLOROT_NORMAL
          input_dim: 128
          output_dim: 40
          bias: true
          bias_init:
            type: ZEROS
          activation: NONE
  decoder:
    type: NODE
  loss:
    type: CROSS_ENTROPY
    options:
      reduction: MEAN
  dense_optimizer:
    type: ADAM
    options:
      learning_rate: 0.003
storage:
  device_type: cuda
  dataset:
    dataset_dir: datasets/ogbn_arxiv/
    num_edges: 1166243
    num_nodes: 169343
    num_relations: 1
    num_train: 90941
    num_valid: 29799
    num_test: 48603
    feature_dim: 128
    num_classes: 40
  edges:
    type: FLAT_FILE
  nodes:
    type: HOST_MEMORY
  features:
    type: PARTITION_BUFFER
    options:
      num_partitions: 32
      buffer_capacity: 3
      prefetching: true
      fine_to_coarse_ratio: 1
      num_cache_partitions: 0
      node_partition_ordering: DISPERSED
  prefetch: true
  shuffle_input: true
  full_graph_evaluation: true
  train_edges_pre_sorted: false
training:
  batch_size: 1000
  num_epochs: 5
  pipeline:
    sync: true
  epochs_per_shuffle: 1
  logs_per_epoch: 10
evaluation:
  batch_size: 1000
  pipeline:
    sync: true
  epochs_per_eval: 1
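
With the config saved to a file (the name below is just an example), training is launched by pointing marius_train at it:

  # file name assumed; use whatever you saved the config as
  marius_train disk_config.yaml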
