Add comments for preprocess module #65

Closed · wants to merge 18 commits
127 changes: 69 additions & 58 deletions docs/user_guide/command_line_interface.rst
@@ -31,45 +31,49 @@ The available options:

::

usage: preprocess [-h] [--files files [files ...]] [--dataset dataset] [--num_partitions num_partitions] [--overwrite]
[--generate_config [generate_config]] [--format format] [--delim delim] [--dtype dtype] [--not_remap_ids]
[--dataset_split dataset_split dataset_split] [--start_col start_col] [--num_line_skip num_line_skip]
output_directory

Preprocess Datasets

positional arguments:
output_directory Directory to put graph data

optional arguments:
-h, --help show this help message and exit
--files files [files ...]
Files containing custom dataset
--dataset dataset Supported dataset to preprocess
--num_partitions num_partitions
Number of partitions to split the edges into
--overwrite Overwrites the output_directory if this is set. Otherwise, files with the same names will be treated as the data for the current dataset.
--generate_config [generate_config], -gc [generate_config]
Generates a single-GPU/multi-CPU/multi-GPU training configuration file by default.
Valid options (default to GPU): [GPU, CPU, multi-GPU]
--format format Format of data, e.g. srd
--delim delim, -d delim
Specifies the delimiter
--dtype dtype Indicates the numpy.dtype
--not_remap_ids If set, will not remap ids
--dataset_split dataset_split dataset_split, -ds dataset_split dataset_split
Split dataset into specified fractions
--start_col start_col, -sc start_col
Indicates the column index to start from
--num_line_skip num_line_skip, -nls num_line_skip
Indicates number of lines to skip from the beginning

Specify certain config (optional): [--<section>.<key>=<value>]
usage: preprocess [-h] [--download_directory download_directory] [--input_files input_files [input_files ...] |
--dataset dataset] [--num_partitions num_partitions]
[--generate_template_config [generate_template_config]] [--format format] [--delim delim]
[--remap_id_dtype remap_id_dtype] [--not_remap_ids] [--dataset_split dataset_split dataset_split]
[--start_col start_col] [--num_line_skip num_line_skip]
output_directory

Preprocess Datasets

positional arguments:
output_directory Directory to put preprocessed graph data.

optional arguments:
-h, --help show this help message and exit
--download_directory download_directory
Directory to put downloaded data files for supported datasets.
--input_files input_files [input_files ...]
Input files of custom dataset
--dataset dataset Name of supported dataset to preprocess
--num_partitions num_partitions
Number of partitions to split the edges into
Review comment: is this true? I thought we are splitting the nodes into partitions; @JasonMoho? If it is the nodes, please say num_node_partitions.

Reply (Collaborator): Yeah, let's change this to num_node_partitions @AnzeXie

--generate_template_config [generate_template_config], -gtc [generate_template_config]
Generates a single-GPU training configuration file which contains parameters with default values.
Valid options (default to GPU): [GPU, CPU, multi-GPU]
--format format Specifies the sequence of source, destination (and relation) in input data files, e.g. srd
--delim delim, -d delim
Specifies the delimiter between source, (relation,) destination strings in input data files.
--remap_id_dtype remap_id_dtype
Indicates the data format to store the remapped IDs.
--not_remap_ids If set, will not remap ids
--dataset_split dataset_split dataset_split, -ds dataset_split dataset_split
Split dataset into specified fractions
--start_col start_col, -sc start_col
Indicates the column index to start parsing source/destination nodes (or relation).
--num_line_skip num_line_skip, -nls num_line_skip
Indicates number of lines/rows to skip from the beginning of the file.

Specify certain config (optional): [--<section>.<key>=<value>]

output_directory
++++++++++++++++
``<output_directory>`` is a **required** argument for ``marius_preprocess``.
It is the directory where all the files created by ``marius_preprocess`` will be stored.
It is the directory where all the preprocessed files created by ``marius_preprocess`` will be stored.
``marius_preprocess`` will create this directory if it does not exist.
``marius_preprocess`` outputs the following files to ``<output_directory>``.
For the preprocessing of supported datasets, ``<output_directory>`` also includes
@@ -105,24 +109,32 @@ The source, relation and destination of edge ``i`` can be retrieved from
files by reading 3 4-byte integers (or 8-byte integers if using int64 data type for storage)
at the offset in the file ``i * 3 * 4`` (or ``i * 3 * 8`` when using int64).
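
For illustration, here is a minimal Python sketch of this random access, assuming
``int32`` storage and a hypothetical edge file name (the actual file names depend
on the dataset)::

    import numpy as np

    def read_edge(path, i, dtype=np.int32):
        """Return the (source, relation, destination) triple of edge i."""
        width = np.dtype(dtype).itemsize  # 4 bytes for int32, 8 for int64
        with open(path, "rb") as f:
            f.seek(i * 3 * width)  # byte offset of edge i
            return np.frombuffer(f.read(3 * width), dtype=dtype)

    src, rel, dst = read_edge("train_edges.pt", 7)  # hypothetical file name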

\-\-files <files ...>
\-\-download_directory <download_directory>
+++++++++++++++++++++++++++++++++++++++++++
``--download_directory`` is an **optional** argument for ``marius_preprocess``.
It is the directory where ``marius_preprocess`` puts all downloaded files for
:ref:`built-in datasets`. The default value of this argument is ``download_dir``.

\-\-input_files <files ...>
+++++++++++++++++++++++++++
``--files`` is an **optional** argument for ``marius_preprocess``.
``--input_files`` is an **optional** argument for ``marius_preprocess``.
It should be a list of files containing the custom dataset. It should not be used
at the same time as ``--dataset``.
at the same time as ``--dataset``. The input dataset files should
be in columnar format, where each edge occupies its own row and is composed of
a source node, a destination node (and a relation) separated by a delimiter.
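
For instance, a tab-delimited input file with one edge per row might look like
the following (hypothetical) example::

    Alice   _knows  Bob
    Bob     _knows  Carol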

For example, the following command preprocesses the custom dataset composed of ``custom_train.csv``,
``custom_valid.csv`` and ``custom_test.csv`` and stores the preprocessed output in the directory ``output_dir``.

::

marius_preprocess output_dir --files custom_train.csv custom_valid.csv custom_test.csv
marius_preprocess output_dir --input_files custom_train.csv custom_valid.csv custom_test.csv

\-\-dataset <dataset>
+++++++++++++++++++++
``--dataset`` is an **optional** argument for ``marius_preprocess``.
It should be the name of a Marius-supported dataset.
It should not be used at the same time as ``--files``.
It should not be used at the same time as ``--input_files``.
To see which datasets are supported by Marius, check out the
:ref:`dataset` table.

@@ -135,40 +147,38 @@ The default value for ``<num_partitions>`` is one.
\-\-overwrite
+++++++++++++
``--overwrite`` is an **optional** argument for ``marius_preprocess``. If this option is set, then
the ``<output_directory>`` will be overwritten. Otherwise, ``marius_preprocess`` will treat the files
in ``<output_directory>`` with the same file names as the latest files for the current run. When switching
from one dataset to another, the converted data files of the previous dataset in the same ``<output_directory>``
may be treated as the already-preprocessed data files for the current dataset if this option is not set.
the ``<output_directory>`` and ``<download_directory>`` will be removed before preprocessing starts
so that files left over from previous runs do not interfere with the current run.

\-\-generate_config <device>, \-gc <device>
+++++++++++++++++++++++++++++++++++++++++++
``--generate_config <device>, -gc <device>`` is an **optional** argument for ``marius_preprocess``.
\-\-generate_template_config <device>, \-gtc <device>
++++++++++++++++++++++++++++++++++++++++++++++++++++
``--generate_template_config <device>, -gtc <device>`` is an **optional** argument for ``marius_preprocess``.
If this option is set, ``marius_preprocess`` will generate a Marius configuration
file in the ``<output_directory>`` with all configuration parameters set to the recommended defaults if not
explicitly defined.

The generated Marius configuration is for a single-GPU setting by default if ``<device>`` is not set.
If another device, such as ``CPU`` or ``multi-GPU``, is required, users can append the option after
``--generate_config``, e.g. ``--generate_config CPU``.
``--generate_template_config``, e.g. ``--generate_template_config CPU``.

For example, the following command will set ``general.device=CPU`` in the Marius
configuration file generated for the dataset WordNet18 (``wn18_cpu.ini``).

::

marius_preprocess ./output_dir --dataset wn18 --generate_config CPU
marius_preprocess ./output_dir --dataset wn18 --generate_template_config CPU

\-\-<section>.<key>=<value>
+++++++++++++++++++++++++++
``--<section>.<key>=<value>`` is an **optional** argument for ``marius_preprocess``.
When ``--generate_config <device>`` is set, ``--<section>.<key>=<value>`` can be used
When ``--generate_template_config <device>`` is set, ``--<section>.<key>=<value>`` can be used
to change the value of a certain option in the generated Marius configuration file.
For example, the following command will set ``model.embedding_size=256`` and ``training.num_epochs=100``
in the Marius configuration file generated for the custom dataset composed of ``custom_dataset.csv`` (``custom_gpu.ini``).

::

marius_preprocess ./output_dir --files custom_dataset.csv --generate_config --model.embedding_size=256 --training.num_epochs=100
marius_preprocess ./output_dir --input_files custom_dataset.csv --generate_template_config --model.embedding_size=256 --training.num_epochs=100

\-\-format <format>
+++++++++++++++++++
@@ -182,7 +192,7 @@ storing edges in the sequence of source node, relation and destination node.

::

marius_preprocess ./output_dir --files custom_dataset.csv --format srd
marius_preprocess ./output_dir --input_files custom_dataset.csv --format srd

\-\-delim <delim>, \-d <delim>
+++++++++++++++++++++++++++++
@@ -191,9 +201,9 @@ storing edges in the sequence of source node, relation and destination node.
If ``<delim>`` is not set, ``marius_preprocess`` will use Python's ``csv.Sniffer`` to detect a delimiter.
The delimiter is printed to the terminal so users can verify it.
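
The detection is analogous to the following sketch built on the standard-library
``csv`` module (a simplified illustration, not the exact implementation)::

    import csv

    # Sniff the delimiter from the first line of the input file.
    with open("custom_dataset.csv") as f:
        sample = f.readline()
    delim = csv.Sniffer().sniff(sample).delimiter
    print(f"Detected delimiter: ~{delim}~")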

\-\-dtype <dtype>
+++++++++++++++++
``--dtype <dtype>`` is an **optional** argument for ``marius_preprocess``.
\-\-remap_id_dtype <dtype>
++++++++++++++++++++++++++
``--remap_id_dtype <dtype>`` is an **optional** argument for ``marius_preprocess``.
It defines the data type used to store each remapped node ID and relation ID. The currently supported
formats are ``int32`` and ``int64``.
When storing in ``int32``, each remapped ID will be a 4-byte integer.
@@ -207,6 +217,7 @@ The default ``<dtype>`` is set to ``int32``.
\-\-not_remap_ids
+++++++++++++++++
``--not_remap_ids`` is an **optional** argument for ``marius_preprocess``.
During preprocessing, nodes and relations are all mapped to numerical IDs.
If this option is set, the remapped IDs of nodes and relations will follow
the order in which they are read in from the original dataset.
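
The effect of this option can be illustrated with a short sketch (a simplified
illustration, not the exact implementation)::

    # With --not_remap_ids, each node/relation keeps an ID determined by its
    # read-in order: the first distinct name seen gets ID 0, and so on.
    ids = {}

    def remap(name):
        if name not in ids:
            ids[name] = len(ids)  # next ID in read-in order
        return ids[name]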

@@ -224,7 +235,7 @@ validation, and test sets with a corresponding proportion of 0.9, 0.05, and 0.05.

::

marius_preprocess ./output_dir --files custom_dataset.csv --dataset_split 0.05 0.05
marius_preprocess ./output_dir --input_files custom_dataset.csv --dataset_split 0.05 0.05

\-\-start_col <start_col>
+++++++++++++++++++++++++
@@ -258,7 +269,7 @@ The available options:
::

usage: config_generator [-h] [--data_directory data_directory] [--dataset dataset | --stats num_nodes num_edge_types num_train num_valid num_test]
[--device [generate_config]]
[--device [device]]
output_directory

Generate configs
@@ -276,7 +287,7 @@ The available options:
--stats num_nodes num_edge_types num_train num_valid num_test, -s num_nodes num_edge_types num_train num_valid num_test
Dataset statistics
Enter in order of num_nodes, num_edge_types, num_train num_valid, num_test
--device [generate_config], -dev [generate_config]
--device [device], -dev [device]
Generates configs for a single-GPU/multi-CPU/multi-GPU training configuration file by default.
Valid options (default to GPU): [GPU, CPU, multi-GPU]

17 changes: 10 additions & 7 deletions docs/user_guide/preprocess.rst
@@ -160,6 +160,8 @@ The second approach can be done in the following steps:

The names of the output files can be anything, as long as the path options are set in the configuration file.

.. _built-in datasets:

Built-in datasets
----------------------------------------------------------

Expand Down Expand Up @@ -202,14 +204,15 @@ For example, preprocessing the wn18 dataset produces the following output
::

user@ubuntu: marius_preprocess output_dir/ --dataset wn18
Downloading fetch.phpmedia=en:wordnet-mlj12.tar.gz to output_dir/fetch.phpmedia=en:wordnet-mlj12.tar.gz
wn18
Downloading fetch.phpmedia=en:wordnet-mlj12.tar.gz to download_dir/fetch.phpmedia=en:wordnet-mlj12.tar.gz
Extracting
Extraction completed
Detected delimiter: ~ ~
Reading in output_dir/wordnet-mlj12-train.txt 1/3
Reading in output_dir/wordnet-mlj12-valid.txt 2/3
Reading in output_dir/wordnet-mlj12-test.txt 3/3
Number of instance per file: [141442, 5000, 5000]
Reading in download_dir/wordnet-mlj12-train.txt 1/3
Reading in download_dir/wordnet-mlj12-valid.txt 2/3
Reading in download_dir/wordnet-mlj12-test.txt 3/3
Number of instance per file:[141442, 5000, 5000]
Number of nodes: 40943
Number of edges: 151442
Number of relations: 18
Expand All @@ -218,13 +221,13 @@ For example, preprocessing the wn18 dataset produces the following output
Generating configuration files
------------------------------

The ``marius_preprocess`` tool can generate a training configuration file for the input dataset using the argument ``--generate_config <device>``, where ``<device>`` is CPU for CPU-based processing and GPU for GPU-based processing.
The ``marius_preprocess`` tool can generate a training configuration file for the input dataset using the argument ``--generate_template_config <device>``, where ``<device>`` is CPU for CPU-based processing and GPU for GPU-based processing.

Specific configuration options can be set by passing ``--<section>.<key>=<value>`` to the command for each option. E.g.

::

marius_preprocess output_dir/ --dataset wn18 --generate_config CPU --model.embedding_size=256 --training.num_epochs=100
marius_preprocess output_dir/ --dataset wn18 --generate_template_config CPU --model.embedding_size=256 --training.num_epochs=100

This will preprocess the wn18 dataset and will generate a configuration file with the following options set:

2 changes: 1 addition & 1 deletion src/python/tools/config_generator.py
@@ -143,7 +143,7 @@ def set_args():
nargs=5, help='Dataset statistics\n' +
'Enter in order of num_nodes, num_relations, num_train' +
' num_valid, num_test')
parser.add_argument('--device', '-dev', metavar='generate_config',
parser.add_argument('--device', '-dev', metavar='device',
choices=["GPU", "CPU", "multi-GPU"],
nargs='?', default='GPU',
help=('Generates configs for a single-GPU/multi-CPU' +