Add comments for preprocess module #65

Closed · wants to merge 18 commits
85 changes: 46 additions & 39 deletions docs/user_guide/command_line_interface.rst
@@ -31,45 +31,48 @@ The available options:

::

usage: preprocess [-h] [--files files [files ...]] [--dataset dataset] [--num_partitions num_partitions] [--overwrite]
[--generate_config [generate_config]] [--format format] [--delim delim] [--dtype dtype] [--not_remap_ids]
[--dataset_split dataset_split dataset_split] [--start_col start_col] [--num_line_skip num_line_skip]
output_directory

Preprocess Datasets

positional arguments:
output_directory Directory to put graph data

optional arguments:
-h, --help show this help message and exit
--files files [files ...]
Files containing custom dataset
--dataset dataset Supported dataset to preprocess
--num_partitions num_partitions
Number of partitions to split the edges into
--overwrite Overwrites the output_directory if this is set. Otherwise, files with the same names will be treated as the data for the current dataset.
--generate_config [generate_config], -gc [generate_config]
Generates a single-GPU/multi-CPU/multi-GPU training configuration file by default.
Valid options (default to GPU): [GPU, CPU, multi-GPU]
--format format Format of data, e.g. srd
--delim delim, -d delim
Specifies the delimiter
--dtype dtype Indicates the numpy.dtype
--not_remap_ids If set, will not remap ids
--dataset_split dataset_split dataset_split, -ds dataset_split dataset_split
Split dataset into specified fractions
--start_col start_col, -sc start_col
Indicates the column index to start from
--num_line_skip num_line_skip, -nls num_line_skip
Indicates number of lines to skip from the beginning

Specify certain config (optional): [--<section>.<key>=<value>]
usage: preprocess [-h] [--download_directory download_directory] [--files files [files ...] | --dataset dataset] [--num_partitions num_partitions]

Isn't preprocess supposed to be used for general datasets? Why did we call it download_directory? We should have self-explanatory flags like input_data_directory, output_data_directory. Also, what is files? These names need to change.

@AnzeXie (Collaborator, Author) · Sep 29, 2021

Yes, preprocess can be used for general datasets.
This download_directory is only used by preprocess to store downloaded files for supported datasets. When users set the dataset option to a supported dataset name, preprocess downloads the dataset files into this directory. This downloading behavior only happens for supported datasets; when preprocessing custom datasets, this directory is not used. I think it would be inaccurate to call such a directory input_data_directory, since users do not use it for inputting general dataset files.
When preprocessing general datasets, users are expected to pass the paths to all input files through the files option. I should use a more self-explanatory flag for this option, such as input_data_path.


We should split operations such as download from the preprocess util. The functionality of preprocess should be limited to reading an input graph with a well-specified schema and returning the necessary output files that Marius requires to perform training. As simple as that. Please move download into a separate utility in the Marius system.

Collaborator

Let's not include this in the current PR.

[--overwrite] [--generate_config [generate_config]] [--format format] [--delim delim] [--dtype dtype] [--not_remap_ids]
[--dataset_split dataset_split dataset_split] [--start_col start_col] [--num_line_skip num_line_skip]
output_directory

Preprocess Datasets

positional arguments:
output_directory Directory to put preprocessed graph data.

optional arguments:
-h, --help show this help message and exit
--download_directory download_directory
Directory to put downloaded data files for supported datasets.
--files files [files ...]
Files containing custom dataset
--dataset dataset Supported dataset to preprocess
--num_partitions num_partitions
Number of partitions to split the edges into

Is this true? I thought we are splitting the nodes into partitions, @JasonMoho? If it is the nodes, please say num_node_partitions.

Collaborator

Yeah let's change this to num_node_partitions @AnzeXie

--overwrite Removes the output_directory and download_directory if this is set.
Otherwise, files with the same names from a previous run may interfere with files of the current run.
--generate_config [generate_config], -gc [generate_config]
Generates a single-GPU training configuration file by default.
Valid options (default to GPU): [GPU, CPU, multi-GPU]
--format format Format of data, e.g. srd
--delim delim, -d delim
Specifies the delimiter
--dtype dtype Indicates the numpy.dtype
--not_remap_ids If set, will not remap ids
--dataset_split dataset_split dataset_split, -ds dataset_split dataset_split
Split dataset into specified fractions
--start_col start_col, -sc start_col
Indicates the column index to start from
--num_line_skip num_line_skip, -nls num_line_skip
Indicates number of lines to skip from the beginning

Specify certain config (optional): [--<section>.<key>=<value>]

output_directory
++++++++++++++++
``<output_directory>`` is a **required** argument for ``marius_preprocess``.
It is the directory where all the preprocessed files created by ``marius_preprocess`` will be stored.
``marius_preprocess`` will create this directory if it does not exist.
``marius_preprocess`` outputs the following files to ``<output_directory>``.
For the preprocessing of supported datasets, ``<output_directory>`` also includes
@@ -105,6 +108,12 @@ The source, relation and destination of edge ``i`` can be retrieved from
files by reading 3 4-byte integers (or 8-byte integers if using int64 data type for storage)
at the offset in the file ``i * 3 * 4`` (or ``i * 3 * 8`` when using int64).
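The offset arithmetic above can be sketched with the standard library alone. This is an illustration of the layout described in this section, not Marius code; the file name ``train_edges.bin`` and the little-endian byte order are assumptions:

```python
# Sketch of the offset arithmetic described above: edge i is stored as
# 3 consecutive integers, so its byte offset is i * 3 * itemsize.
import struct

def read_edge(path, i, int64=False):
    """Return (source, relation, destination) of edge i."""
    size = 8 if int64 else 4                 # bytes per integer
    fmt = "<3q" if int64 else "<3i"          # 3 little-endian ints
    with open(path, "rb") as f:
        f.seek(i * 3 * size)                 # offset of edge i
        return struct.unpack(fmt, f.read(3 * size))

# Example: write two int32 edges, then read the second one back.
with open("train_edges.bin", "wb") as f:
    f.write(struct.pack("<3i", 0, 1, 2))
    f.write(struct.pack("<3i", 3, 4, 5))
print(read_edge("train_edges.bin", 1))       # (3, 4, 5)
```

Passing ``int64=True`` switches to 8-byte integers, matching the int64 storage case mentioned above.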

\-\-download_directory <download_directory>
+++++++++++++++++++++++++++++++++++++++++++
``--download_directory`` is an **optional** argument for ``marius_preprocess``.
It is the directory where ``marius_preprocess`` puts all downloaded files for
:ref:`built-in datasets`. The default value of this argument is ``download_dir``.

\-\-files <files ...>
+++++++++++++++++++++
``--files`` is an **optional** argument for ``marius_preprocess``.
@@ -135,10 +144,8 @@ The default value for ``<num_partitions>`` is one.
\-\-overwrite
+++++++++++++
``--overwrite`` is an **optional** argument for ``marius_preprocess``. If this option is set,
the ``<output_directory>`` and ``<download_directory>`` will be removed before preprocessing starts
to prevent files left over from previous runs from interfering with files of the current run.
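The overwrite semantics amount to deleting both directories up front. A minimal sketch of the equivalent logic, as an illustration only (the function name and directory handling are not the actual ``marius_preprocess`` implementation):

```python
# Sketch of --overwrite semantics: remove both directories before
# preprocessing so stale files from prior runs cannot interfere.
import shutil
import tempfile
from pathlib import Path

def clear_if_overwrite(output_directory, download_directory, overwrite):
    for d in (output_directory, download_directory):
        if overwrite and Path(d).exists():
            shutil.rmtree(d)                 # drop stale files from prior runs
        Path(d).mkdir(parents=True, exist_ok=True)

# Example with throwaway directories:
base = Path(tempfile.mkdtemp())
out, dl = base / "output_dir", base / "download_dir"
out.mkdir(); (out / "stale.bin").write_text("old data")
clear_if_overwrite(out, dl, overwrite=True)
print(list(out.iterdir()))                   # stale file is gone
```

Without ``overwrite=True``, the stale file would survive, which is exactly the interference the flag is meant to prevent.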

\-\-generate_config <device>, \-gc <device>
+++++++++++++++++++++++++++++++++++++++++++
13 changes: 8 additions & 5 deletions docs/user_guide/preprocess.rst
@@ -160,6 +160,8 @@ The second approach can be done in the following steps:

The names of the output files can be anything, as long as the path options are set in the configuration file.

.. _built-in datasets:

Built-in datasets
----------------------------------------------------------

@@ -202,14 +204,15 @@ For example, preprocessing the wn18 dataset produces the following output
::

user@ubuntu: marius_preprocess output_dir/ --dataset wn18
wn18
Downloading fetch.phpmedia=en:wordnet-mlj12.tar.gz to download_dir/fetch.phpmedia=en:wordnet-mlj12.tar.gz
Extracting
Extraction completed
Detected delimiter: ~ ~
Reading in download_dir/wordnet-mlj12-train.txt 1/3
Reading in download_dir/wordnet-mlj12-valid.txt 2/3
Reading in download_dir/wordnet-mlj12-test.txt 3/3
Number of instance per file: [141442, 5000, 5000]
Number of nodes: 40943
Number of edges: 151442
Number of relations: 18
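As a quick sanity check on the output above, the per-file instance counts should sum to the reported total number of edges:

```python
# Sanity check: the train/valid/test counts reported by marius_preprocess
# for wn18 should add up to the reported total edge count.
splits = [141442, 5000, 5000]   # instances per file (train, valid, test)
total_edges = 151442            # "Number of edges" in the output above
assert sum(splits) == total_edges
print(sum(splits))              # 151442
```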