Add comments for preprocess module #65

Closed · wants to merge 18 commits
127 changes: 69 additions & 58 deletions docs/user_guide/command_line_interface.rst
@@ -31,45 +31,49 @@ The available options:

::

usage: preprocess [-h] [--files files [files ...]] [--dataset dataset] [--num_partitions num_partitions] [--overwrite]
[--generate_config [generate_config]] [--format format] [--delim delim] [--dtype dtype] [--not_remap_ids]
[--dataset_split dataset_split dataset_split] [--start_col start_col] [--num_line_skip num_line_skip]
output_directory

Preprocess Datasets

positional arguments:
output_directory Directory to put graph data

optional arguments:
-h, --help show this help message and exit
--files files [files ...]
Files containing custom dataset
--dataset dataset Supported dataset to preprocess
--num_partitions num_partitions
Number of partitions to split the edges into
--overwrite Overwrites the output_directory if this is set. Otherwise, files with the same names will be treated as the data for the current dataset.
--generate_config [generate_config], -gc [generate_config]
Generates a single-GPU/multi-CPU/multi-GPU training configuration file by default.
Valid options (default to GPU): [GPU, CPU, multi-GPU]
--format format Format of data, e.g. srd
--delim delim, -d delim
Specifies the delimiter
--dtype dtype Indicates the numpy.dtype
--not_remap_ids If set, will not remap ids
--dataset_split dataset_split dataset_split, -ds dataset_split dataset_split
Split dataset into specified fractions
--start_col start_col, -sc start_col
Indicates the column index to start from
--num_line_skip num_line_skip, -nls num_line_skip
Indicates number of lines to skip from the beginning

Specify certain config (optional): [--<section>.<key>=<value>]
usage: preprocess [-h] [--download_directory download_directory] [--input_files input_files [input_files ...] |
--dataset dataset] [--num_partitions num_partitions]
[--generate_template_config [generate_template_config]] [--format format] [--delim delim]
[--remap_id_dtype remap_id_dtype] [--not_remap_ids] [--dataset_split dataset_split dataset_split]
[--start_col start_col] [--num_line_skip num_line_skip]
output_directory

Preprocess Datasets

positional arguments:
output_directory Directory to put preprocessed graph data.

optional arguments:
-h, --help show this help message and exit
--download_directory download_directory
Directory to put downloaded data files for supported datasets.
--input_files input_files [input_files ...]
Input files of custom dataset
--dataset dataset Name of supported dataset to preprocess
--num_partitions num_partitions
Number of partitions to split the edges into
Review comment: is this true? I thought we are splitting the nodes into partitions; @JasonMoho? If it is the nodes, please say num_node_partitions.

Reply (Collaborator): Yeah, let's change this to num_node_partitions @AnzeXie

--generate_template_config [generate_template_config], -gtc [generate_template_config]
Generates a single-GPU training configuration file which contains parameters with default values.
Valid options (default to GPU): [GPU, CPU, multi-GPU]
--format format Specifies the sequence of source, destination (and relation) in input data files, e.g. srd
--delim delim, -d delim
Specifies the delimiter between source, (relation,) destination strings in input data files.
--remap_id_dtype remap_id_dtype
Indicates the data format to store the remapped IDs.
--not_remap_ids If set, will not remap ids
--dataset_split dataset_split dataset_split, -ds dataset_split dataset_split
Split dataset into specified fractions
--start_col start_col, -sc start_col
Indicates the column index to start parsing source/destination nodes (or relation).
--num_line_skip num_line_skip, -nls num_line_skip
Indicates number of lines/rows to skip from the beginning of the file.

Specify certain config (optional): [--<section>.<key>=<value>]

output_directory
++++++++++++++++
``<output_directory>`` is a **required** argument for ``marius_preprocess``.
It is the directory where all the files created by ``marius_preprocess`` will be stored.
It is the directory where all the preprocessed files created by ``marius_preprocess`` will be stored.
``marius_preprocess`` will create this directory if it does not exist.
``marius_preprocess`` outputs the following files to ``<output_directory>``.
For the preprocessing of supported datasets, ``<output_directory>`` also includes
@@ -105,24 +109,32 @@ The source, relation and destination of edge ``i`` can be retrieved from
files by reading 3 4-byte integers (or 8-byte integers if using int64 data type for storage)
at the offset in the file ``i * 3 * 4`` (or ``i * 3 * 8`` when using int64).
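
For illustration, here is a minimal Python sketch of this random access, assuming
``int32`` storage and a hypothetical edge file name (the actual file names depend
on the dataset)::

    import numpy as np

    def read_edge(path, i, dtype=np.int32):
        """Return the (source, relation, destination) triple of edge i."""
        width = np.dtype(dtype).itemsize  # 4 bytes for int32, 8 for int64
        with open(path, "rb") as f:
            f.seek(i * 3 * width)  # byte offset of edge i
            return np.frombuffer(f.read(3 * width), dtype=dtype)

    src, rel, dst = read_edge("train_edges.pt", 7)  # hypothetical file name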

\-\-files <files ...>
\-\-download_directory <download_directory>
+++++++++++++++++++++++++++++++++++++++++++
``--download_directory`` is an **optional** argument for ``marius_preprocess``.
It is the directory where ``marius_preprocess`` puts all downloaded files for
:ref:`built-in datasets`. The default value of this argument is ``download_dir``.

\-\-input_files <files ...>
+++++++++++++++++++++++++++
``--files`` is an **optional** argument for ``marius_preprocess``.
``--input_files`` is an **optional** argument for ``marius_preprocess``.
It should be a list of files containing the custom dataset. It should not be used
at the same time as ``--dataset``.
at the same time as ``--dataset``. The input dataset files should
be in columnar format, where each edge occupies its own row and is composed of
a source node, a destination node (and a relation) separated by a delimiter.
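
For instance, a tab-delimited input file with one edge per row might look like
the following (hypothetical) example::

    Alice   _knows  Bob
    Bob     _knows  Carol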

For example, the following command preprocesses the custom dataset composed of ``custom_train.csv``,
``custom_valid.csv`` and ``custom_test.csv`` and stores the preprocessed output in the directory ``output_dir``.

::

marius_preprocess output_dir --files custom_train.csv custom_valid.csv custom_test.csv
marius_preprocess output_dir --input_files custom_train.csv custom_valid.csv custom_test.csv

\-\-dataset <dataset>
+++++++++++++++++++++
``--dataset`` is an **optional** argument for ``marius_preprocess``.
It should be the name of a Marius-supported dataset.
It should not be used at the same time as ``--files``.
It should not be used at the same time as ``--input_files``.
To see which datasets are supported by Marius, check out the
:ref:`dataset` table.

@@ -135,40 +147,38 @@ The default value for ``<num_partitions>`` is one.
\-\-overwrite
+++++++++++++
``--overwrite`` is an **optional** argument for ``marius_preprocess``. If this option is set, then
the ``<output_directory>`` will be overwritten. Otherwise, ``marius_preprocess`` will treat the files
in ``<output_directory>`` with the same file names as the latest files for the current run. When switching
from one dataset to another, the converted data files of the previous dataset in the same ``<output_directory>``
may be treated as the already-preprocessed data files for the current dataset if this option is not set.
the ``<output_directory>`` and ``<download_directory>`` will be removed before preprocessing starts
so that files left over from previous runs do not interfere with the current run.

\-\-generate_config <device>, \-gc <device>
+++++++++++++++++++++++++++++++++++++++++++
``--generate_config <device>, -gc <device>`` is an **optional** argument for ``marius_preprocess``.
\-\-generate_template_config <device>, \-gtc <device>
++++++++++++++++++++++++++++++++++++++++++++++++++++
``--generate_template_config <device>, -gtc <device>`` is an **optional** argument for ``marius_preprocess``.
If this option is set, ``marius_preprocess`` will generate a Marius configuration
file in the ``<output_directory>`` with all configuration parameters set to the recommended defaults if not
explicitly defined.

The generated Marius configuration is for a single-GPU setting by default if ``<device>`` is not set.
If another device, such as ``CPU`` or ``multi-GPU``, is required, users can append the option after
``--generate_config``, e.g. ``--generate_config CPU``.
``--generate_template_config``, e.g. ``--generate_template_config CPU``.

For example, the following command will set ``general.device=CPU`` in the Marius
configuration file generated for the dataset WordNet18 (``wn18_cpu.ini``).

::

marius_preprocess ./output_dir --dataset wn18 --generate_config CPU
marius_preprocess ./output_dir --dataset wn18 --generate_template_config CPU

\-\-<section>.<key>=<value>
+++++++++++++++++++++++++++
``--<section>.<key>=<value>`` is an **optional** argument for ``marius_preprocess``.
When ``--generate_config <device>`` is set, ``--<section>.<key>=<value>`` can be used
When ``--generate_template_config <device>`` is set, ``--<section>.<key>=<value>`` can be used
to change the value of a certain option in the generated Marius configuration file.
For example, the following command will set ``model.embedding_size=256`` and ``training.num_epochs=100``
in the Marius configuration file generated for the custom dataset composed of ``custom_dataset.csv`` (``custom_gpu.ini``).

::

marius_preprocess ./output_dir --files custom_dataset.csv --generate_config --model.embedding_size=256 --training.num_epochs=100
marius_preprocess ./output_dir --input_files custom_dataset.csv --generate_template_config --model.embedding_size=256 --training.num_epochs=100

\-\-format <format>
+++++++++++++++++++
@@ -182,7 +192,7 @@ storing edges in the sequence of source node, relation and destination node.

::

marius_preprocess ./output_dir --files custom_dataset.csv --format srd
marius_preprocess ./output_dir --input_files custom_dataset.csv --format srd

\-\-delim <delim>, \-d <delim>
+++++++++++++++++++++++++++++
@@ -191,9 +201,9 @@ storing edges in the sequence of source node, relation and destination node.
If ``<delim>`` is not set, ``marius_preprocess`` will use Python's ``csv.Sniffer`` to detect a delimiter.
The delimiter is printed to the terminal so users can verify it.
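
The detection is analogous to the following sketch built on the standard-library
``csv`` module (a simplified illustration, not the exact implementation)::

    import csv

    # Sniff the delimiter from the first line of the input file.
    with open("custom_dataset.csv") as f:
        sample = f.readline()
    delim = csv.Sniffer().sniff(sample).delimiter
    print(f"Detected delimiter: ~{delim}~")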

\-\-dtype <dtype>
+++++++++++++++++
``--dtype <dtype>`` is an **optional** argument for ``marius_preprocess``.
\-\-remap_id_dtype <dtype>
++++++++++++++++++++++++++
``--remap_id_dtype <dtype>`` is an **optional** argument for ``marius_preprocess``.
It defines the data type used to store each remapped node ID and relation ID. The currently supported
formats are ``int32`` and ``int64``.
When storing in ``int32``, each remapped ID will be a 4-byte integer.
@@ -207,6 +217,7 @@ The default ``<dtype>`` is set to ``int32``.
\-\-not_remap_ids
+++++++++++++++++
``--not_remap_ids`` is an **optional** argument for ``marius_preprocess``.
During preprocessing, nodes and relations are all mapped to numerical IDs.
If this option is set, the remapped IDs of nodes and relations will follow
the order in which they are read in from the original dataset.
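
The effect of this option can be illustrated with a short sketch (a simplified
illustration, not the exact implementation)::

    # With --not_remap_ids, each node/relation keeps an ID determined by its
    # read-in order: the first distinct name seen gets ID 0, and so on.
    ids = {}

    def remap(name):
        if name not in ids:
            ids[name] = len(ids)  # next ID in read-in order
        return ids[name]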

@@ -224,7 +235,7 @@ validation, and test sets with a corresponding proportion of 0.9, 0.05, and 0.05.

::

marius_preprocess ./output_dir --files custom_dataset.csv --dataset_split 0.05 0.05
marius_preprocess ./output_dir --input_files custom_dataset.csv --dataset_split 0.05 0.05

\-\-start_col <start_col>
+++++++++++++++++++++++++
@@ -258,7 +269,7 @@ The available options:
::

usage: config_generator [-h] [--data_directory data_directory] [--dataset dataset | --stats num_nodes num_edge_types num_train num_valid num_test]
[--device [generate_config]]
[--device [device]]
output_directory

Generate configs
@@ -276,7 +287,7 @@ The available options:
--stats num_nodes num_edge_types num_train num_valid num_test, -s num_nodes num_edge_types num_train num_valid num_test
Dataset statistics
Enter in order of num_nodes, num_edge_types, num_train num_valid, num_test
--device [generate_config], -dev [generate_config]
--device [device], -dev [device]
Generates configs for a single-GPU/multi-CPU/multi-GPU training configuration file by default.
Valid options (default to GPU): [GPU, CPU, multi-GPU]

17 changes: 10 additions & 7 deletions docs/user_guide/preprocess.rst
@@ -160,6 +160,8 @@ The second approach can be done in the following steps:

The names of the output files can be anything, as long as the path options are set in the configuration file.

.. _built-in datasets:

Built-in datasets
----------------------------------------------------------

Expand Down Expand Up @@ -202,14 +204,15 @@ For example, preprocessing the wn18 dataset produces the following output
::

user@ubuntu: marius_preprocess output_dir/ --dataset wn18
Downloading fetch.phpmedia=en:wordnet-mlj12.tar.gz to output_dir/fetch.phpmedia=en:wordnet-mlj12.tar.gz
wn18
Downloading fetch.phpmedia=en:wordnet-mlj12.tar.gz to download_dir/fetch.phpmedia=en:wordnet-mlj12.tar.gz
Extracting
Extraction completed
Detected delimiter: ~ ~
Reading in output_dir/wordnet-mlj12-train.txt 1/3
Reading in output_dir/wordnet-mlj12-valid.txt 2/3
Reading in output_dir/wordnet-mlj12-test.txt 3/3
Number of instance per file: [141442, 5000, 5000]
Reading in download_dir/wordnet-mlj12-train.txt 1/3
Reading in download_dir/wordnet-mlj12-valid.txt 2/3
Reading in download_dir/wordnet-mlj12-test.txt 3/3
Number of instance per file:[141442, 5000, 5000]
Number of nodes: 40943
Number of edges: 151442
Number of relations: 18
Expand All @@ -218,13 +221,13 @@ For example, preprocessing the wn18 dataset produces the following output
Generating configuration files
------------------------------

The ``marius_preprocess`` tool can generate a training configuration file for the input dataset using the argument ``--generate_config <device>``, where ``<device>`` is CPU for CPU-based processing and GPU for GPU-based processing.
The ``marius_preprocess`` tool can generate a training configuration file for the input dataset using the argument ``--generate_template_config <device>``, where ``<device>`` is CPU for CPU-based processing and GPU for GPU-based processing.

Specific configuration options can be set by passing ``--<section>.<key>=<value>`` to the command for each option. E.g.

::

marius_preprocess output_dir/ --dataset wn18 --generate_config CPU --model.embedding_size=256 --training.num_epochs=100
marius_preprocess output_dir/ --dataset wn18 --generate_template_config CPU --model.embedding_size=256 --training.num_epochs=100

This will preprocess the wn18 dataset and will generate a configuration file with the following options set:

2 changes: 1 addition & 1 deletion src/python/tools/config_generator.py
@@ -143,7 +143,7 @@ def set_args():
nargs=5, help='Dataset statistics\n' +
'Enter in order of num_nodes, num_relations, num_train' +
' num_valid, num_test')
parser.add_argument('--device', '-dev', metavar='generate_config',
parser.add_argument('--device', '-dev', metavar='device',
choices=["GPU", "CPU", "multi-GPU"],
nargs='?', default='GPU',
help=('Generates configs for a single-GPU/multi-CPU' +