Auto retrieve dataset stats from dataset.yaml (#92)

marius-team · Apr 25, 2022 · 49669c0
1 parent 55b2d97
commit 49669c0

Showing 40 changed files with 243 additions and 295 deletions.
diff --git a/docs/config_interface/configuration.rst b/docs/config_interface/configuration.rst
@@ -112,13 +112,7 @@ Node embedings are optimized by the sparse optimizer.
    storage:
      device_type: cpu
      dataset:
-       base_directory: /home/data/datasets/fb15k_237/
-       num_edges: 272115
-       num_train: 272115
-       num_nodes: 14541
-       num_relations: 237
-       num_valid: 17535
-       num_test: 20466
+       dataset_dir: /home/data/datasets/fb15k_237/
      edges:
        type: DEVICE_MEMORY
        options:
@@ -228,15 +222,7 @@ classification. The loss function being used is Cross Entropy with sum as the re
    storage:
      device_type: cuda
      dataset:
-       base_directory: /home/data/datasets/ogbn_arxiv/
-       num_edges: 1166243
-       num_nodes: 169343
-       num_relations: 1
-       num_train: 90941
-       num_valid: 29799
-       num_test: 48603
-       node_feature_dim: 128
-       num_classes: 40
+       dataset_dir: /home/data/datasets/ogbn_arxiv/
      edges:
        type: DEVICE_MEMORY
      nodes:
@@ -251,8 +237,7 @@ classification. The loss function being used is Cross Entropy with sum as the re
      shuffle_input: true
      full_graph_evaluation: true
 
-The storage configuration here is very similar to the one shown above in Link Prediction. `num_classes` states the number of output 
-class labels. 
+The storage configuration here is very similar to the one shown above in Link Prediction.
 
 3. Configure training and evaluation
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

diff --git a/docs/config_interface/full_schema.rst b/docs/config_interface/full_schema.rst
@@ -60,13 +60,7 @@ in the encoder phase which is directly fed to the `DISTMULT` decoder. Both embed
      full_graph_evaluation: true
      device_type: cpu
      dataset:
-       base_directory: /home/data/datasets/fb15k_237/
-       num_edges: 272115
-       num_train: 272115
-       num_nodes: 14541
-       num_relations: 237
-       num_valid: 17535
-       num_test: 20466
+       dataset_dir: /home/data/datasets/fb15k_237/
      edges:
        type: DEVICE_MEMORY
        options:
@@ -1002,7 +996,7 @@ Storage Configuration
      - No
    * - model_dir
      - String
-     - Saves the model parameters in the given directory. If not specified, stores in `model_x` directory within the `base_directory` where x changes incrementally from 0 - 10. A maximum of 11 models are stored when `model_dir` is not specified, post which the contents in `model_10/` directory are overwritten with the latest parameters.
+     - Saves the model parameters in the given directory. If not specified, stores in `model_x` directory within the `dataset_dir` where x changes incrementally from 0 - 10. A maximum of 11 models are stored when `model_dir` is not specified, post which the contents in `model_10/` directory are overwritten with the latest parameters.
      - No
 
 Below is a storage configuration that contains the path to the pre-processed data and specifies storage backends to be used for edges, features 
@@ -1013,11 +1007,7 @@ and embeddings.
    storage:
      device_type: cpu
      dataset:
-       base_directory: /home/data/datasets/fb15k_237/
-       num_edges: 272115
-       num_nodes: 14541
-       num_relations: 237
-       num_train: 272115
+       dataset_dir: /home/data/datasets/fb15k_237/
      edges:
        type: DEVICE_MEMORY
        options:
@@ -1052,61 +1042,51 @@ Dataset Configuration
      - Type
      - Description
      - Required
-   * - base_directory
+   * - dataset_dir
      - String
      - Directory containing the preprocessed dataset. Also used to store model parameters and the embedding table.
      - Yes
    * - num_edges
      - Int
      - Number of edges in the input graph. If link prediction, this should be set to the number of training edges.
-     - Yes
+     - No
    * - num_nodes
      - Int
      - Number of nodes in the input graph.
-     - Yes
+     - No
    * - num_relations
      - Int
      - Number of relations (edge-types) in the input graph. (Default 1)
      - No
    * - num_train
      - Int
      - Number of training examples. In link prediction the examples are edges, in node classification they are nodes.
-     - Yes
+     - No
    * - num_valid
      - Int
      - Number of validation examples. If not given, no validation will be performed
      - No
    * - num_test
      - Int
      - Number of test examples. If not given, only training will occur.
-     - Evaluation
+     - No (Evaluation)
    * - node_feature_dim
      - Int
      - Dimension of the node features, if any.
      - No
    * - num_classes
      - Int
      - Number of class labels.
-     - Node classification
+     - No (Node classification)
 
-For Marius in-built datasets, the below numbers can be retirieved from output of `marius_preprocess`. For custom user datasets, the 
-numbers are expected to match against the processed dataset values. Below is the cofiguration for the `fb15k_237` dataset. 
+For Marius built-in datasets, the statistics above are retrieved from the output of `marius_preprocess`. For custom user datasets, a
+file with these dataset statistics should be present in the `dataset_dir`. Below is the configuration for the `fb15k_237` dataset.
 
 .. code-block:: yaml 
 
    storage:
      dataset:
-       base_directory: /home/data/datasets/fb15k_237/
-       num_edges: 272115
-       num_nodes: 14541
-       num_relations: 237
-       num_train: 272115
-       num_valid: 17535
-       num_test: 20466
-       node_feature_dim: -1
-       rel_feature_dim: -1
-       num_classes: -1
-       initialized: true
+       dataset_dir: /home/data/datasets/fb15k_237/
 
 
 Storage Backend Configuration
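
Note on the behaviour documented in this file: the statistics are now optional in the config and are expected to come from a file in `dataset_dir`. The loader itself is not part of the hunks shown here, so the following is only a minimal sketch of the idea, not the Marius implementation. The helper names (`read_flat_yaml`, `resolve_dataset_config`) are hypothetical, the parser handles only flat `key: value` lines like the `dataset.yaml` example in these docs (a real loader would use a proper YAML parser), and letting explicitly configured values take precedence over the file is an assumption.

```python
import tempfile
from pathlib import Path

# Statistic fields listed in the Dataset Configuration table above.
STAT_KEYS = (
    "num_edges", "num_nodes", "num_relations",
    "num_train", "num_valid", "num_test",
    "node_feature_dim", "num_classes",
)


def read_flat_yaml(path):
    """Parse flat `key: value` lines only (no nesting, lists, or quoting)."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        values[key.strip()] = value.strip()
    return values


def resolve_dataset_config(dataset_cfg):
    """Fill statistics missing from a `storage.dataset` mapping.

    Values are read from <dataset_dir>/dataset.yaml. Keys already present
    in the config are kept as-is (an assumption, see the note above).
    """
    dataset_dir = Path(dataset_cfg["dataset_dir"])
    on_disk = read_flat_yaml(dataset_dir / "dataset.yaml")
    resolved = dict(dataset_cfg)
    for key in STAT_KEYS:
        if key not in resolved and key in on_disk:
            resolved[key] = int(on_disk[key])
    return resolved


# Usage: a config that names only the directory picks up the stats on disk.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "dataset.yaml").write_text(
        "dataset_dir: /home/data/datasets/fb15k_237/\n"
        "num_edges: 272115\n"
        "num_nodes: 14541\n"
        "num_relations: 237\n"
    )
    resolved = resolve_dataset_config({"dataset_dir": tmp})
    print(resolved["num_nodes"])
```

With a lookup like this, the one-line `dataset` section in the examples is enough as long as `dataset.yaml` is present in `dataset_dir`.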

diff --git a/docs/config_interface/samples.rst b/docs/config_interface/samples.rst
@@ -186,11 +186,7 @@ GPU Memory
    storage:
      device_type: cuda
      dataset:
-       base_directory: /home/data/datasets/fb15k_237/
-       num_edges: 272115
-       num_train: 272115
-       num_nodes: 14541
-       num_relations: 237
+       dataset_dir: /home/data/datasets/fb15k_237/
      edges:
        type: DEVICE_MEMORY
        options:
@@ -210,11 +206,7 @@ Mixed CPU-GPU
    storage:
      device_type: cuda
      dataset:
-       base_directory: /home/data/datasets/fb15k_237/
-       num_edges: 272115
-       num_train: 272115
-       num_nodes: 14541
-       num_relations: 237
+       dataset_dir: /home/data/datasets/fb15k_237/
      edges:
        type: HOST_MEMORY
        options:
@@ -234,11 +226,7 @@ Disk-Based
    storage:
      device_type: cuda
      dataset:
-       base_directory: /home/data/datasets/fb15k_237/
-       num_edges: 272115
-       num_train: 272115
-       num_nodes: 14541
-       num_relations: 237
+       dataset_dir: /home/data/datasets/fb15k_237/
      edges:
        type: FLAT_FILE
        options:
@@ -260,11 +248,7 @@ as follows
    storage:
      device_type: cuda
      dataset:
-       base_directory: /home/data/datasets/fb15k_237_partitioned/
-       num_edges: 272115
-       num_train: 272115
-       num_nodes: 14541
-       num_relations: 237
+       dataset_dir: /home/data/datasets/fb15k_237_partitioned/
      edges:
        type: FLAT_FILE
        options:
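
Note for users with existing configs: every sample above makes the same change, renaming `base_directory` to `dataset_dir` and dropping the inline statistics. A small helper can apply that change to an already-parsed config. This is a hedged sketch only: `migrate_dataset_section` is a hypothetical name, not a Marius tool, and removing the statistics assumes a `dataset.yaml` with the same values exists in the dataset directory. The schema table still lists these fields as optional, so keeping them is also possible.

```python
import copy

# Fields removed from the documented examples in the "-" lines of this commit.
REMOVED_KEYS = (
    "num_edges", "num_nodes", "num_relations",
    "num_train", "num_valid", "num_test",
    "node_feature_dim", "rel_feature_dim", "num_classes", "initialized",
)


def migrate_dataset_section(config):
    """Return a copy of `config` with `storage.dataset` in the new style."""
    new_config = copy.deepcopy(config)
    dataset = new_config.get("storage", {}).get("dataset")
    if not dataset:
        return new_config
    if "base_directory" in dataset:
        old_path = dataset.pop("base_directory")
        # Keep an existing dataset_dir if one was already set.
        dataset.setdefault("dataset_dir", old_path)
    for key in REMOVED_KEYS:
        dataset.pop(key, None)
    return new_config


# Usage with the old GPU Memory sample, expressed as a Python dict.
old_config = {
    "storage": {
        "device_type": "cuda",
        "dataset": {
            "base_directory": "/home/data/datasets/fb15k_237/",
            "num_edges": 272115,
            "num_train": 272115,
            "num_nodes": 14541,
            "num_relations": 237,
        },
        "edges": {"type": "DEVICE_MEMORY"},
    }
}
new_config = migrate_dataset_section(old_config)
print(new_config["storage"]["dataset"])
```

The input is deep-copied so the original mapping stays untouched, which makes it easy to compare the old and new sections before writing anything back to disk.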

diff --git a/docs/examples/config/lp_custom.rst b/docs/examples/config/lp_custom.rst
@@ -82,7 +82,7 @@ Let's check what is inside the generated ``dataset.yaml`` file:
 .. code-block:: bash
 
    $ cat datasets/custom_lp_example/dataset.yaml
-    base_directory: /marius-internal/datasets/custom_lp_example/
+    dataset_dir: /marius-internal/datasets/custom_lp_example/
     num_edges: 932994
     num_nodes: 169343
     num_relations: 1
@@ -111,7 +111,7 @@ The configuration file contains information including but not limited to the inp
 
 For the full configuration schema, please refer to ``docs/config_interface``. 
 
-An example YAML configuration file for the OGBN_Arxiv dataset (link prediction model with DistMult) is given in ``examples/configuration/custom_lp.yaml``. Note that the ``base_directory`` is set to the preprocessing output directory, in our example, ``datasets/custom_lp_example/``.
+An example YAML configuration file for the OGBN_Arxiv dataset (link prediction model with DistMult) is given in ``examples/configuration/custom_lp.yaml``. Note that the ``dataset_dir`` is set to the preprocessing output directory, in our example, ``datasets/custom_lp_example/``.
 
 Let's create the same YAML configuration file for the OGBN_Arxiv dataset from scratch. We follow the structure of the configuration file and create each of the four sections one by one. In a YAML file, indentation is used to denote nesting and all parameters are in the format of key-value pairs. 
 
@@ -151,22 +151,16 @@ Let's create the same YAML configuration file for the OGBN_Arxiv dataset from sc
         evaluation:
           # omit
       
-#. Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, the ``base_directory`` is set to ``datasets/custom_lp_example/``, which is the preprocessing output directory. To populate the ``num_edges``, ``num_train``,..., ``num_test`` fields, we simply copy the input dataset statistics obtained from ``datasets/custom_lp_example/dataset.yaml`` and fill in each of their values.
+#. Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, the ``dataset_dir`` is set to ``datasets/custom_lp_example/``, which is the preprocessing output directory.
 
     .. code-block:: yaml
     
         model:
           # omit
         storage:
           device_type: cuda
-          dataset: # copy values from "datasets/custom_lp_example/dataset.yaml"
-            base_directory: /marius-internal/datasets/custom_lp_example/
-            num_edges: 932994
-            num_nodes: 169343
-            num_relations: 1
-            num_train: 932994
-            num_valid: 116624
-            num_test: 116625
+          dataset:
+            dataset_dir: /marius-internal/datasets/custom_lp_example/
           edges:
             type: DEVICE_MEMORY
           embeddings:

diff --git a/docs/examples/config/lp_fb15k237.rst b/docs/examples/config/lp_fb15k237.rst
@@ -55,7 +55,7 @@ Let's check what is inside the generated ``dataset.yaml`` file:
 .. code-block:: bash
 
    $ cat datasets/fb15k_237_example/dataset.yaml
-   base_directory: /marius-internal/datasets/fb15k_237_example/
+   dataset_dir: /marius-internal/datasets/fb15k_237_example/
    num_edges: 272115
    num_nodes: 14541
    num_relations: 237
@@ -84,7 +84,7 @@ The configuration file contains information including but not limited to the inp
 
 For the full configuration schema, please refer to ``docs/config_interface``.
 
-An example YAML configuration file for the FB15K_237 dataset is given in ``examples/configuration/fb15k_237.yaml``. Note that the ``base_directory`` is set to the preprocessing output directory, in our example, ``datasets/fb15k_237_example/``.
+An example YAML configuration file for the FB15K_237 dataset is given in ``examples/configuration/fb15k_237.yaml``. Note that the ``dataset_dir`` is set to the preprocessing output directory, in our example, ``datasets/fb15k_237_example/``.
 
 Let's create the same YAML configuration file for the FB15K_237 dataset from scratch. We follow the structure of the configuration file and create each of the four sections one by one. In a YAML file, indentation is used to denote nesting and all parameters are in the format of key-value pairs. 
 
@@ -121,22 +121,16 @@ Let's create the same YAML configuration file for the FB15K_237 dataset from scr
         evaluation:
           # omit
       
-#. Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, the ``base_directory`` is set to ``datasets/fb15k_237_example/``, which is the preprocessing output directory. To populate the ``num_edges``, ``num_train``,..., ``num_test`` fields, we simply copy the input dataset statistics obtained from ``datasets/fb15k_237_example/dataset.yaml`` and fill in each of their values.
+#. Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, the ``dataset_dir`` is set to ``datasets/fb15k_237_example/``, which is the preprocessing output directory.
 
     .. code-block:: yaml
     
         model:
           # omit
         storage:
           device_type: cuda
-          dataset: # copy values from "datasets/fb15k_237_example/dataset.yaml"
-            base_directory: datasets/fb15k_237_example/
-            num_edges: 272115
-            num_train: 272115
-            num_nodes: 14541
-            num_relations: 237
-            num_valid: 17535
-            num_test: 20466
+          dataset:
+            dataset_dir: datasets/fb15k_237_example/
           edges:
             type: DEVICE_MEMORY
           embeddings:

diff --git a/docs/examples/config/lp_paleobiology.rst b/docs/examples/config/lp_paleobiology.rst
@@ -72,7 +72,7 @@ Before we can run Marius we need to specify our model hyperparameters and path t
 
 ``training.num_epochs``: The number of epochs to train for. Adjust higher or lower based on the desired accuracy of the embeddings. Our default is ``100`` epochs.
 
-``storage.dataset.base_directory``: We assume we are in the link prediction example directory ``~/marius-examples/link-predict-example/`` when running the training process, so our default relative path is ``dataset/``. Change if running from another directory.
+``storage.dataset.dataset_dir``: We assume we are in the link prediction example directory ``~/marius-examples/link-predict-example/`` when running the training process, so our default relative path is ``dataset/``. Change if running from another directory.
 
 ``paleo_config.yaml``
 
@@ -95,13 +95,7 @@ Before we can run Marius we need to specify our model hyperparameters and path t
         storage:
             device_type: cuda
             dataset:
-                base_directory: dataset/
-                num_edges: 96522
-                num_train: 96522
-                num_nodes: 14752
-                num_relations: 5
-                num_valid: 5362
-                num_test: 5363
+                dataset_dir: dataset/
             edges:
                 type: DEVICE_MEMORY
             embeddings:
@@ -162,7 +156,7 @@ The output should appear similar to::
     Hits@100: 0.459437
     =================================
 
-After this has finished, our output will be in our ``[base_directory]`` (using the provided config, this will be ``dataset/``.
+After this has finished, our output will be in our ``[dataset_dir]`` (using the provided config, this will be ``dataset/``).
 
 Here are the files that were created in training:
 Let's check again what was added in the ``datasets/custom_lp_example/`` directory. For clarity, we only list the files that were created in training. Notice that several files have been created, including the trained model, the embedding table, a full configuration file, and output logs: