Auto retrieve dataset stats from dataset.yaml (#92)
basavaraj29 authored Apr 25, 2022
1 parent 55b2d97 commit 49669c0
Showing 40 changed files with 243 additions and 295 deletions.
21 changes: 3 additions & 18 deletions docs/config_interface/configuration.rst
@@ -112,13 +112,7 @@ Node embedings are optimized by the sparse optimizer.
storage:
device_type: cpu
dataset:
base_directory: /home/data/datasets/fb15k_237/
num_edges: 272115
num_train: 272115
num_nodes: 14541
num_relations: 237
num_valid: 17535
num_test: 20466
dataset_dir: /home/data/datasets/fb15k_237/
edges:
type: DEVICE_MEMORY
options:
@@ -228,15 +222,7 @@ classification. The loss function being used is Cross Entropy with sum as the re
storage:
device_type: cuda
dataset:
base_directory: /home/data/datasets/ogbn_arxiv/
num_edges: 1166243
num_nodes: 169343
num_relations: 1
num_train: 90941
num_valid: 29799
num_test: 48603
node_feature_dim: 128
num_classes: 40
dataset_dir: /home/data/datasets/ogbn_arxiv/
edges:
type: DEVICE_MEMORY
nodes:
@@ -251,8 +237,7 @@ classification. The loss function being used is Cross Entropy with sum as the re
shuffle_input: true
full_graph_evaluation: true
The storage configuration here is very similar to the one shown above in Link Prediction. `num_classes` states the number of output
class labels.
The storage configuration here is very similar to the one shown above in Link Prediction.

3. Configure training and evaluation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
44 changes: 12 additions & 32 deletions docs/config_interface/full_schema.rst
@@ -60,13 +60,7 @@ in the encoder phase which is directly fed to the `DISTMULT` decoder. Both embed
full_graph_evaluation: true
device_type: cpu
dataset:
base_directory: /home/data/datasets/fb15k_237/
num_edges: 272115
num_train: 272115
num_nodes: 14541
num_relations: 237
num_valid: 17535
num_test: 20466
dataset_dir: /home/data/datasets/fb15k_237/
edges:
type: DEVICE_MEMORY
options:
@@ -1002,7 +996,7 @@ Storage Configuration
- No
* - model_dir
- String
- Saves the model parameters in the given directory. If not specified, stores in `model_x` directory within the `base_directory` where x changes incrementally from 0 - 10. A maximum of 11 models are stored when `model_dir` is not specified, post which the contents in `model_10/` directory are overwritten with the latest parameters.
- Saves the model parameters in the given directory. If not specified, stores in `model_x` directory within the `dataset_dir` where x changes incrementally from 0 - 10. A maximum of 11 models are stored when `model_dir` is not specified, post which the contents in `model_10/` directory are overwritten with the latest parameters.
- No
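
The ``model_dir`` fallback described in the row above can be sketched in Python. This is only an illustration under stated assumptions: the helper name ``default_model_dir`` is hypothetical, and the rule used here (take the first ``model_x`` that does not exist yet, otherwise reuse ``model_10``) is one plausible reading of the description, not necessarily how Marius implements it.

```python
import os
import tempfile

MAX_MODEL_INDEX = 10  # model_0 .. model_10, i.e. at most 11 directories


def default_model_dir(dataset_dir: str) -> str:
    """Hypothetical helper: pick the directory used when model_dir is unset."""
    for x in range(MAX_MODEL_INDEX + 1):
        candidate = os.path.join(dataset_dir, f"model_{x}")
        if not os.path.exists(candidate):
            return candidate
    # All 11 slots already exist: the contents of model_10 are overwritten.
    return os.path.join(dataset_dir, f"model_{MAX_MODEL_INDEX}")


if __name__ == "__main__":
    # Usage: an empty dataset directory yields its first slot, model_0.
    with tempfile.TemporaryDirectory() as dataset_dir:
        print(default_model_dir(dataset_dir))
```

Setting ``model_dir`` explicitly avoids relying on this numbering altogether.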

Below is a storage configuration that contains the path to the pre-processed data and specifies storage backends to be used for edges, features
@@ -1013,11 +1007,7 @@ and embeddings.
storage:
device_type: cpu
dataset:
base_directory: /home/data/datasets/fb15k_237/
num_edges: 272115
num_nodes: 14541
num_relations: 237
num_train: 272115
dataset_dir: /home/data/datasets/fb15k_237/
edges:
type: DEVICE_MEMORY
options:
@@ -1052,61 +1042,51 @@ Dataset Configuration
- Type
- Description
- Required
* - base_directory
* - dataset_dir
- String
- Directory containing the preprocessed dataset. Also used to store model parameters and embedding table.
- Yes
* - num_edges
- Int
- Number of edges in the input graph. If link prediction, this should be set to the number of training edges.
- Yes
- No
* - num_nodes
- Int
- Number of nodes in the input graph.
- Yes
- No
* - num_relations
- Int
- Number of relations (edge-types) in the input graph. (Default 1)
- No
* - num_train
- Int
- Number of training examples. In link prediction the examples are edges, in node classification they are nodes.
- Yes
- No
* - num_valid
- Int
- Number of validation examples. If not given, no validation will be performed
- No
* - num_test
- Int
- Number of test examples. If not given, only training will occur.
- Evaluation
- No (Evaluation)
* - node_feature_dim
- Int
- Dimension of the node features, if any.
- No
* - num_classes
- Int
- Number of class labels.
- Node classification
- No (Node classification)

For Marius in-built datasets, the below numbers can be retrieved from output of `marius_preprocess`. For custom user datasets, the
numbers are expected to match against the processed dataset values. Below is the configuration for the `fb15k_237` dataset.
For Marius in-built datasets, the below numbers are retrieved from output of `marius_preprocess`. For custom user datasets, a
file with the dataset statistics mentioned above should be present in the `dataset_dir`. Below is the configuration for the `fb15k_237` dataset.

.. code-block:: yaml
storage:
dataset:
base_directory: /home/data/datasets/fb15k_237/
num_edges: 272115
num_nodes: 14541
num_relations: 237
num_train: 272115
num_valid: 17535
num_test: 20466
node_feature_dim: -1
rel_feature_dim: -1
num_classes: -1
initialized: true
dataset_dir: /home/data/datasets/fb15k_237/
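
The idea behind this change, reading the statistics from a file in ``dataset_dir`` instead of repeating them in the training configuration, can be sketched as follows. This is a hedged illustration, not the Marius implementation: the function name ``load_dataset_stats`` is hypothetical, the file is assumed to be named ``dataset.yaml`` (as in the commit title), and the parser handles only flat ``key: value`` lines so that the sketch needs no third-party YAML library.

```python
import os
import tempfile


def load_dataset_stats(dataset_dir: str) -> dict:
    """Hypothetical helper: read flat `key: value` pairs from dataset.yaml."""
    stats = {"dataset_dir": dataset_dir}
    path = os.path.join(dataset_dir, "dataset.yaml")
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and blanks
            if not line or ":" not in line:
                continue
            key, value = (part.strip() for part in line.split(":", 1))
            if key == "dataset_dir":
                continue  # keep the path given in the training configuration
            # Statistics such as num_edges are integers; keep anything else as text.
            try:
                stats[key] = int(value)
            except ValueError:
                stats[key] = value
    return stats


if __name__ == "__main__":
    # Usage: write a small stats file (fb15k_237 values from above) and load it.
    with tempfile.TemporaryDirectory() as dataset_dir:
        with open(os.path.join(dataset_dir, "dataset.yaml"), "w") as f:
            f.write("num_edges: 272115\nnum_nodes: 14541\nnum_relations: 237\n")
        print(load_dataset_stats(dataset_dir))
```

With something along these lines, the training configuration only has to name ``dataset_dir``; the counts travel with the preprocessed data.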
Storage Backend Configuration
24 changes: 4 additions & 20 deletions docs/config_interface/samples.rst
@@ -186,11 +186,7 @@ GPU Memory
storage:
device_type: cuda
dataset:
base_directory: /home/data/datasets/fb15k_237/
num_edges: 272115
num_train: 272115
num_nodes: 14541
num_relations: 237
dataset_dir: /home/data/datasets/fb15k_237/
edges:
type: DEVICE_MEMORY
options:
@@ -210,11 +206,7 @@ Mixed CPU-GPU
storage:
device_type: cuda
dataset:
base_directory: /home/data/datasets/fb15k_237/
num_edges: 272115
num_train: 272115
num_nodes: 14541
num_relations: 237
dataset_dir: /home/data/datasets/fb15k_237/
edges:
type: HOST_MEMORY
options:
@@ -234,11 +226,7 @@ Disk-Based
storage:
device_type: cuda
dataset:
base_directory: /home/data/datasets/fb15k_237/
num_edges: 272115
num_train: 272115
num_nodes: 14541
num_relations: 237
dataset_dir: /home/data/datasets/fb15k_237/
edges:
type: FLAT_FILE
options:
@@ -260,11 +248,7 @@ as follows
storage:
device_type: cuda
dataset:
base_directory: /home/data/datasets/fb15k_237_partitioned/
num_edges: 272115
num_train: 272115
num_nodes: 14541
num_relations: 237
dataset_dir: /home/data/datasets/fb15k_237_partitioned/
edges:
type: FLAT_FILE
options:
16 changes: 5 additions & 11 deletions docs/examples/config/lp_custom.rst
@@ -82,7 +82,7 @@ Let's check what is inside the generated ``dataset.yaml`` file:
.. code-block:: bash
$ cat datasets/custom_lp_example/dataset.yaml
base_directory: /marius-internal/datasets/custom_lp_example/
dataset_dir: /marius-internal/datasets/custom_lp_example/
num_edges: 932994
num_nodes: 169343
num_relations: 1
@@ -111,7 +111,7 @@ The configuration file contains information including but not limited to the inp

For the full configuration schema, please refer to ``docs/config_interface``.

An example YAML configuration file for the OGBN_Arxiv dataset (link prediction model with DistMult) is given in ``examples/configuration/custom_lp.yaml``. Note that the ``base_directory`` is set to the preprocessing output directory, in our example, ``datasets/custom_lp_example/``.
An example YAML configuration file for the OGBN_Arxiv dataset (link prediction model with DistMult) is given in ``examples/configuration/custom_lp.yaml``. Note that the ``dataset_dir`` is set to the preprocessing output directory, in our example, ``datasets/custom_lp_example/``.

Let's create the same YAML configuration file for the OGBN_Arxiv dataset from scratch. We follow the structure of the configuration file and create each of the four sections one by one. In a YAML file, indentation is used to denote nesting and all parameters are in the format of key-value pairs.

@@ -151,22 +151,16 @@ Let's create the same YAML configuration file for the OGBN_Arxiv dataset from sc
evaluation:
# omit
#. Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, the ``base_directory`` is set to ``datasets/custom_lp_example/``, which is the preprocessing output directory. To populate the ``num_edges``, ``num_train``,..., ``num_test`` fields, we simply copy the input dataset statistics obtained from ``datasets/custom_lp_example/dataset.yaml`` and fill in each of their values.
#. Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, the ``dataset_dir`` is set to ``datasets/custom_lp_example/``, which is the preprocessing output directory.

.. code-block:: yaml
model:
# omit
storage:
device_type: cuda
dataset: # copy values from "datasets/custom_lp_example/dataset.yaml"
base_directory: /marius-internal/datasets/custom_lp_example/
num_edges: 932994
num_nodes: 169343
num_relations: 1
num_train: 932994
num_valid: 116624
num_test: 116625
dataset:
dataset_dir: /marius-internal/datasets/custom_lp_example/
edges:
type: DEVICE_MEMORY
embeddings:
16 changes: 5 additions & 11 deletions docs/examples/config/lp_fb15k237.rst
@@ -55,7 +55,7 @@ Let's check what is inside the generated ``dataset.yaml`` file:
.. code-block:: bash
$ cat datasets/fb15k_237_example/dataset.yaml
base_directory: /marius-internal/datasets/fb15k_237_example/
dataset_dir: /marius-internal/datasets/fb15k_237_example/
num_edges: 272115
num_nodes: 14541
num_relations: 237
@@ -84,7 +84,7 @@ The configuration file contains information including but not limited to the inp

For the full configuration schema, please refer to ``docs/config_interface``.

An example YAML configuration file for the FB15K_237 dataset is given in ``examples/configuration/fb15k_237.yaml``. Note that the ``base_directory`` is set to the preprocessing output directory, in our example, ``datasets/fb15k_237_example/``.
An example YAML configuration file for the FB15K_237 dataset is given in ``examples/configuration/fb15k_237.yaml``. Note that the ``dataset_dir`` is set to the preprocessing output directory, in our example, ``datasets/fb15k_237_example/``.

Let's create the same YAML configuration file for the FB15K_237 dataset from scratch. We follow the structure of the configuration file and create each of the four sections one by one. In a YAML file, indentation is used to denote nesting and all parameters are in the format of key-value pairs.

@@ -121,22 +121,16 @@ Let's create the same YAML configuration file for the FB15K_237 dataset from scr
evaluation:
# omit
#. Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, the ``base_directory`` is set to ``datasets/fb15k_237_example/``, which is the preprocessing output directory. To populate the ``num_edges``, ``num_train``,..., ``num_test`` fields, we simply copy the input dataset statistics obtained from ``datasets/fb15k_237_example/dataset.yaml`` and fill in each of their values.
#. Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, the ``dataset_dir`` is set to ``datasets/fb15k_237_example/``, which is the preprocessing output directory.

.. code-block:: yaml
model:
# omit
storage:
device_type: cuda
dataset: # copy values from "datasets/fb15k_237_example/dataset.yaml"
base_directory: datasets/fb15k_237_example/
num_edges: 272115
num_train: 272115
num_nodes: 14541
num_relations: 237
num_valid: 17535
num_test: 20466
dataset:
dataset_dir: datasets/fb15k_237_example/
edges:
type: DEVICE_MEMORY
embeddings:
12 changes: 3 additions & 9 deletions docs/examples/config/lp_paleobiology.rst
@@ -72,7 +72,7 @@ Before we can run Marius we need to specify our model hyperparameters and path t

``training.num_epochs``: The number of epochs to train for. Adjust higher or lower based on the desired accuracy of the embeddings. Our default is ``100`` epochs.

``storage.dataset.base_directory``: We assume we are in the link prediction example directory ``~/marius-examples/link-predict-example/`` when running the training process, so our default relative path is ``dataset/``. Change if running from another directory.
``storage.dataset.dataset_dir``: We assume we are in the link prediction example directory ``~/marius-examples/link-predict-example/`` when running the training process, so our default relative path is ``dataset/``. Change if running from another directory.

``paleo_config.yaml``

@@ -95,13 +95,7 @@ Before we can run Marius we need to specify our model hyperparameters and path t
storage:
device_type: cuda
dataset:
base_directory: dataset/
num_edges: 96522
num_train: 96522
num_nodes: 14752
num_relations: 5
num_valid: 5362
num_test: 5363
dataset_dir: dataset/
edges:
type: DEVICE_MEMORY
embeddings:
@@ -162,7 +156,7 @@ The output should appear similar to::
Hits@100: 0.459437
=================================

After this has finished, our output will be in our ``[base_directory]`` (using the provided config, this will be ``dataset/``).
After this has finished, our output will be in our ``[dataset_dir]`` (using the provided config, this will be ``dataset/``).

Here are the files that were created in training:
Let's check again what was added in the ``datasets/custom_lp_example/`` directory. For clarity, we only list the files that were created in training. Notice that several files have been created, including the trained model, the embedding table, a full configuration file, and output logs: