
add remaining main claim experiments
rogerwaleffe authored and committed Aug 25, 2022
1 parent e9b9584 commit 4250e29
Showing 42 changed files with 2,468 additions and 6 deletions.
35 changes: 33 additions & 2 deletions README.md
@@ -273,7 +273,10 @@ provided to the `run_experiment.py` script with any additional desired arguments
| freebase86m_gs | P3.8xLarge | Table 4 | C2 | 30 hours; $350 | Freebase86M epoch time and accuracy for all three systems with graph data stored in CPU memory | - |
| freebase86m_gs_disk_acc | P3.8xLarge | Table 4 | C2 | 4 hours; $50 | Freebase86M disk-based training accuracy for MariusGNN | See disk-based training note below |
| freebase86m_gs_disk_time | P3.2xLarge | Table 4 | C2 | 3 hours; $10 | Freebase86M disk-based training epoch time for MariusGNN | See disk-based training note below |
| | | | | | | |
| training_trace | P3.8xLarge | Table 6 | C3 | 6 hours; $75 | Breakdown of timing operations during training on Papers100M for MariusGNN, DGL, and PyG during in-memory training | See sampling note below |
| freebase86m_beta_battles | P3.8xLarge | Table 7 | C4 | 37 hours; $450 | Freebase86M results in Table 7 for in-memory training, COMET, and BETA using DistMult, GraphSage, and GAT models | See disk-based training microbenchmark note below |

[comment]: <> (| | | | | | | |)

Notes:
1. **Disk-based training**: For disk-based training system comparisons, we report runtime using the smaller P3.2xLarge
@@ -285,13 +288,41 @@ the P3.2xLarge machine without evaluation and then export the final embeddings t
evaluation (although this would prevent access to the per-epoch validation set metrics).


2. **Disk-based training microbenchmarks**: Unlike the system comparisons, the disk-based training
microbenchmarks (e.g., Table 7 and Figure 8) use a single machine, for simplicity, rather than a separate
machine for measuring accuracy and throughput. This machine has sufficient memory for full graph evaluation,
but during disk-based training with COMET or BETA the full graph is loaded into memory only during evaluation:
training proceeds using the partition buffer and partition replacement policy, so only a fraction of the graph
is in memory at any given time (see the config excerpt below). Using a single machine reduces the number of
experiments and machines that need to be managed. Further, while the throughput numbers for COMET/BETA reported
by this method may not match the throughput these methods would achieve on a machine without sufficient memory
to store the full graph (e.g., a P3.2xLarge), they are sufficient for comparing the two methods, as the numbers
for both COMET and BETA were generated on the same hardware.
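
   For reference, the `dm_beta.yaml` and `dm_comet.yaml` configs added in this commit realize this setup through
   the partition buffer settings and the `full_graph_evaluation` flag. A trimmed excerpt from `dm_beta.yaml`
   (the comments are illustrative annotations added here, not part of the config file):

   ```yaml
   storage:
     embeddings:
       type: PARTITION_BUFFER            # embeddings kept on disk in partitions during training
       options:
         num_partitions: 16
         buffer_capacity: 4              # only 4 of the 16 partitions are in memory at a time
         edge_bucket_ordering: OLD_BETA  # BETA ordering (dm_comet.yaml uses TWO_LEVEL_BETA)
     full_graph_evaluation: true         # the full graph is loaded into memory only for evaluation
   ```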


3. **Sampling**: In Table 6 we report CPU sampling time as the total time required to sample multi-hop
neighborhoods. This includes 1) identifying the multi-hop neighborhood and then 2) loading the features for the
unique nodes in that neighborhood into the mini batch (to prepare the mini batch for transfer to the GPU). The
`training_trace` experiment attempts to measure these two steps separately and outputs the results as "sampling"
and "loading" times; however, this separation is only possible for MariusGNN (due to the dataloaders in DGL and
PyG). Thus, in Table 6 we report the sum of the "sampling" and "loading" outputs as the CPU Sampling Time for
MariusGNN, and report the "loading" output (which already includes "sampling") for DGL and PyG, as sketched below.
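
   A minimal sketch of how these outputs combine into the Table 6 entries. The key names mirror the description
   above, but the dictionary structure and values are placeholders, not the exact `training_trace` output format:

   ```python
   # Placeholder values for illustration only; see the actual training_trace output for the real format.
   trace_outputs = {
       "MariusGNN": {"sampling": 1.0, "loading": 2.0},  # the two steps are measured separately
       "DGL":       {"loading": 3.0},                   # "loading" already includes sampling time
       "PyG":       {"loading": 4.0},
   }

   for system, times in trace_outputs.items():
       # Table 6 CPU Sampling Time: sampling + loading for MariusGNN, loading alone for DGL/PyG.
       cpu_sampling_time = times.get("sampling", 0.0) + times["loading"]
       print(f"{system}: CPU Sampling Time = {cpu_sampling_time:.1f}")
   ```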


4. We report validation set accuracy in the paper as test sets are not expected to be publicly available for all
datasets.

[comment]: <> (2. For multi-layer GNNs &#40;on Papers100M and Mag240M, extra eval&#41;, )

[comment]: <> (only include optimal configs &#40;not all the hyperparameter tuning&#41;, multi gpu training bs)

[comment]: <> (how paper cost numbers are calculated)

[comment]: <> (disk based training comment for microbenchmarks)

[comment]: <> (how to parse the train_trace results)


## Hit An Issue? ##
If you have hit an issue with the system, the scripts, or the results, please let us know
79 changes: 79 additions & 0 deletions experiment_manager/disk/configs/freebase86m/dm_beta.yaml
@@ -0,0 +1,79 @@
model:
  learning_task: LINK_PREDICTION
  embeddings:
    dimension: 100
    init:
      type: NORMAL
      options:
        mean: 0
        std: 0.001
  decoder:
    type: DISTMULT
    options:
      input_dim: 100
      inverse_edges: true
  optimizer:
    type: ADAGRAD
    options:
      learning_rate: 0.1
  loss:
    type: SOFTMAX
    options:
      reduction: SUM
storage:
  device_type: cuda
  dataset:
    base_directory: datasets/freebase86m_beta_battles/
    num_edges: 304727650
    num_train: 304727650
    num_nodes: 86054151
    num_relations: 14824
    num_valid: 16929318
    num_test: 16929308
  edges:
    type: FLAT_FILE
  embeddings:
    type: PARTITION_BUFFER
    options:
      num_partitions: 16
      buffer_capacity: 4
      prefetching: true
      fine_to_coarse_ratio: 1
      num_cache_partitions: 0
      edge_bucket_ordering: OLD_BETA
      randomly_assign_edge_buckets: false
  prefetch: true
  shuffle_input: true
  full_graph_evaluation: true
training:
  batch_size: 50000
  negative_sampling:
    num_chunks: 10
    negatives_per_positive: 500
    degree_fraction: 0.5
    filtered: false
  num_epochs: 10
  pipeline:
    sync: true
    # staleness_bound: 32
    # batch_host_queue_size: 16
    # batch_device_queue_size: 16
    # gradients_device_queue_size: 16
    # gradients_host_queue_size: 16
    # batch_loader_threads: 8
    # batch_transfer_threads: 4
    # compute_threads: 1
    # gradient_transfer_threads: 4
    # gradient_update_threads: 8
  epochs_per_shuffle: 1
  logs_per_epoch: 10
evaluation:
  batch_size: 10000
  negative_sampling:
    num_chunks: 1
    negatives_per_positive: 2000
    degree_fraction: 0.5
    filtered: false
  pipeline:
    sync: true
  epochs_per_eval: 1
79 changes: 79 additions & 0 deletions experiment_manager/disk/configs/freebase86m/dm_comet.yaml
@@ -0,0 +1,79 @@
model:
  learning_task: LINK_PREDICTION
  embeddings:
    dimension: 100
    init:
      type: NORMAL
      options:
        mean: 0
        std: 0.001
  decoder:
    type: DISTMULT
    options:
      input_dim: 100
      inverse_edges: true
  optimizer:
    type: ADAGRAD
    options:
      learning_rate: 0.1
  loss:
    type: SOFTMAX
    options:
      reduction: SUM
storage:
  device_type: cuda
  dataset:
    base_directory: datasets/freebase86m_beta_battles/
    num_edges: 304727650
    num_train: 304727650
    num_nodes: 86054151
    num_relations: 14824
    num_valid: 16929318
    num_test: 16929308
  edges:
    type: FLAT_FILE
  embeddings:
    type: PARTITION_BUFFER
    options:
      num_partitions: 1024
      buffer_capacity: 256
      prefetching: true
      fine_to_coarse_ratio: 128
      num_cache_partitions: 0
      edge_bucket_ordering: TWO_LEVEL_BETA
      randomly_assign_edge_buckets: true
  prefetch: true
  shuffle_input: true
  full_graph_evaluation: true
training:
  batch_size: 50000
  negative_sampling:
    num_chunks: 10
    negatives_per_positive: 500
    degree_fraction: 0.5
    filtered: false
  num_epochs: 10
  pipeline:
    sync: true
    # staleness_bound: 32
    # batch_host_queue_size: 16
    # batch_device_queue_size: 16
    # gradients_device_queue_size: 16
    # gradients_host_queue_size: 16
    # batch_loader_threads: 8
    # batch_transfer_threads: 4
    # compute_threads: 1
    # gradient_transfer_threads: 4
    # gradient_update_threads: 8
  epochs_per_shuffle: 1
  logs_per_epoch: 10
evaluation:
  batch_size: 10000
  negative_sampling:
    num_chunks: 1
    negatives_per_positive: 2000
    degree_fraction: 0.5
    filtered: false
  pipeline:
    sync: true
  epochs_per_eval: 1
71 changes: 71 additions & 0 deletions experiment_manager/disk/configs/freebase86m/dm_mem.yaml
@@ -0,0 +1,71 @@
model:
  learning_task: LINK_PREDICTION
  embeddings:
    dimension: 100
    init:
      type: NORMAL
      options:
        mean: 0
        std: 0.001
  decoder:
    type: DISTMULT
    options:
      input_dim: 100
      inverse_edges: true
  optimizer:
    type: ADAGRAD
    options:
      learning_rate: 0.1
  loss:
    type: SOFTMAX
    options:
      reduction: SUM
storage:
  device_type: cuda
  dataset:
    base_directory: datasets/freebase86m_beta_battles/
    num_edges: 304727650
    num_train: 304727650
    num_nodes: 86054151
    num_relations: 14824
    num_valid: 16929318
    num_test: 16929308
  edges:
    type: HOST_MEMORY
  embeddings:
    type: HOST_MEMORY
  prefetch: true
  shuffle_input: true
  full_graph_evaluation: true
training:
  batch_size: 50000
  negative_sampling:
    num_chunks: 10
    negatives_per_positive: 500
    degree_fraction: 0.5
    filtered: false
  num_epochs: 10
  pipeline:
    sync: true
    # staleness_bound: 32
    # batch_host_queue_size: 16
    # batch_device_queue_size: 16
    # gradients_device_queue_size: 16
    # gradients_host_queue_size: 16
    # batch_loader_threads: 8
    # batch_transfer_threads: 4
    # compute_threads: 1
    # gradient_transfer_threads: 4
    # gradient_update_threads: 8
  epochs_per_shuffle: 1
  logs_per_epoch: 10
evaluation:
  batch_size: 10000
  negative_sampling:
    num_chunks: 1
    negatives_per_positive: 2000
    degree_fraction: 0.5
    filtered: false
  pipeline:
    sync: true
  epochs_per_eval: 1
