Add comments for preprocess module #65
Changes from 9 commits
d6e30df
ac57577
1124bbf
e9314bc
7a4dde4
3dcd491
a0a3cdd
a498f80
373c2e1
32014e3
699f6c3
5c133d1
52ae299
6e8c132
ba8407a
bbe2760
668be8b
65ffdde
|
@@ -31,45 +31,48 @@ The available options: | |
|
||
:: | ||
|
||
usage: preprocess [-h] [--files files [files ...]] [--dataset dataset] [--num_partitions num_partitions] [--overwrite] | ||
[--generate_config [generate_config]] [--format format] [--delim delim] [--dtype dtype] [--not_remap_ids] | ||
[--dataset_split dataset_split dataset_split] [--start_col start_col] [--num_line_skip num_line_skip] | ||
output_directory | ||
|
||
Preprocess Datasets | ||
|
||
positional arguments: | ||
output_directory Directory to put graph data | ||
|
||
optional arguments: | ||
-h, --help show this help message and exit | ||
--files files [files ...] | ||
Files containing custom dataset | ||
--dataset dataset Supported dataset to preprocess | ||
--num_partitions num_partitions | ||
Number of partitions to split the edges into | ||
--overwrite Overwrites the output_directory if this is set. Otherwise, files with the same names will be treated as the data for the current dataset. | ||
--generate_config [generate_config], -gc [generate_config] | ||
Generates a single-GPU/multi-CPU/multi-GPU training configuration file by default. | ||
Valid options (default to GPU): [GPU, CPU, multi-GPU] | ||
--format format Format of data, eg. srd | ||
--delim delim, -d delim | ||
Specifies the delimiter | ||
--dtype dtype Indicates the numpy.dtype | ||
--not_remap_ids If set, will not remap ids | ||
--dataset_split dataset_split dataset_split, -ds dataset_split dataset_split | ||
Split dataset into specified fractions | ||
--start_col start_col, -sc start_col | ||
Indicates the column index to start from | ||
--num_line_skip num_line_skip, -nls num_line_skip | ||
Indicates number of lines to skip from the beginning | ||
|
||
Specify certain config (optional): [--<section>.<key>=<value>] | ||
usage: preprocess [-h] [--download_directory download_directory] [--files files [files ...] | --dataset dataset] [--num_partitions num_partitions] | ||
[--overwrite] [--generate_config [generate_config]] [--format format] [--delim delim] [--dtype dtype] [--not_remap_ids] | ||
[--dataset_split dataset_split dataset_split] [--start_col start_col] [--num_line_skip num_line_skip] | ||
output_directory | ||
|
||
Preprocess Datasets | ||
|
||
positional arguments: | ||
output_directory Directory to put preprocessed graph data. | ||
|
||
optional arguments: | ||
-h, --help show this help message and exit | ||
--download_directory download_directory | ||
Directory to put downloaded data files for supported datasets. | ||
--files files [files ...] | ||
Files containing custom dataset | ||
--dataset dataset Supported dataset to preprocess | ||
--num_partitions num_partitions | ||
Number of partitions to split the edges into | ||
Is this true? I thought we were splitting the nodes into partitions; @JasonMoho? If it is the nodes, please say num_node_partitions.

Yeah, let's change this to num_node_partitions @AnzeXie
||
--overwrite Removes the output_directory and download_directory if this is set. | ||
Otherwise, files with the same names from a previous run may interfere with files of the current run. | ||
--generate_config [generate_config], -gc [generate_config] | ||
Generates a single-GPU training configuration file by default. | ||
Valid options (default to GPU): [GPU, CPU, multi-GPU] | ||
--format format Format of data, eg. srd | ||
--delim delim, -d delim | ||
Specifies the delimiter | ||
--dtype dtype Indicates the numpy.dtype | ||
--not_remap_ids If set, will not remap ids | ||
--dataset_split dataset_split dataset_split, -ds dataset_split dataset_split | ||
Split dataset into specified fractions | ||
--start_col start_col, -sc start_col | ||
Indicates the column index to start from | ||
--num_line_skip num_line_skip, -nls num_line_skip | ||
Indicates number of lines to skip from the beginning | ||
|
||
Specify certain config (optional): [--<section>.<key>=<value>] | ||
|
||
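The revised usage string writes ``[--files files [files ...] | --dataset dataset]``, which is how ``argparse`` renders a mutually exclusive group. A minimal sketch of a parser that would produce that shape is shown here. The function name ``build_parser`` and the placeholder dataset name are hypothetical, only a few of the options are reproduced, and this is not the actual Marius source.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring part of the usage string above."""
    parser = argparse.ArgumentParser(
        prog="preprocess", description="Preprocess Datasets"
    )
    parser.add_argument(
        "output_directory", help="Directory to put preprocessed graph data."
    )
    parser.add_argument(
        "--download_directory",
        metavar="download_directory",
        default="download_dir",
        help="Directory to put downloaded data files for supported datasets.",
    )
    # --files and --dataset cannot be combined, hence the "|" in the usage.
    source = parser.add_mutually_exclusive_group()
    source.add_argument(
        "--files", metavar="files", nargs="+",
        help="Files containing custom dataset",
    )
    source.add_argument(
        "--dataset", metavar="dataset",
        help="Supported dataset to preprocess",
    )
    parser.add_argument("--overwrite", action="store_true")
    return parser


# "some_dataset" is a placeholder, not a real supported dataset name.
args = build_parser().parse_args(["out_dir", "--dataset", "some_dataset"])
```

With a group like this, passing both ``--files`` and ``--dataset`` makes ``argparse`` exit with a usage error instead of silently preferring one of them.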
output_directory | ||
++++++++++++++++ | ||
``<output_directory>`` is a **required** argument for ``marius_preprocess``. | ||
It is the directory where all the files created by ``marius_preprocess`` will be stored. | ||
It is the directory where all the preprocessed files created by ``marius_preprocess`` will be stored. | ||
``marius_preprocess`` will create this directory if it does not exist. | ||
``marius_preprocess`` outputs the following files to ``<output_directory>``. | ||
For the preprocessing of supported datasets, ``<output_directory>`` also includes | ||
|
@@ -105,6 +108,12 @@ The source, relation and destination of edge ``i`` can be retrieved from | |
files by reading 3 4-byte integers (or 8-byte integers if using int64 data type for storage) | ||
at the offset in the file ``i * 3 * 4`` (or ``i * 3 * 8`` when using int64). | ||
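To illustrate the offset arithmetic described above, here is a small sketch that reads edge ``i`` from a flat binary edge file. The helper name ``read_edge``, the sample file, and the choice of native byte order are assumptions made for this example; they are not taken from the Marius code.

```python
import os
import struct
import tempfile


def read_edge(path, i, use_int64=False):
    """Return the (source, relation, destination) triple of edge i."""
    # Three integers per edge: 4 bytes each for int32, 8 bytes each for int64.
    # "=" selects native byte order with standard sizes (an assumption here).
    fmt = "=3q" if use_int64 else "=3i"
    edge_size = struct.calcsize(fmt)  # 3 * 4 = 12 bytes, or 3 * 8 = 24 bytes
    with open(path, "rb") as f:
        f.seek(i * edge_size)  # i * 3 * 4, or i * 3 * 8 when using int64
        return struct.unpack(fmt, f.read(edge_size))


# Write a tiny, made-up edge list so the sketch can be run end to end.
sample_edges = [(0, 0, 1), (1, 2, 3), (4, 1, 0)]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_edges.bin")
with open(sample_path, "wb") as f:
    for edge in sample_edges:
        f.write(struct.pack("=3i", *edge))

second_edge = read_edge(sample_path, 1)
```

Because every edge occupies the same number of bytes, a single ``seek`` is enough to reach any edge without scanning the file.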
|
||
\-\-download_directory <download_directory> | ||
+++++++++++++++++++++++++++++++++++++++++++ | ||
``--download_directory`` is an **optional** argument for ``marius_preprocess``. | ||
It is the directory where ``marius_preprocess`` puts all downloaded files for | ||
:ref:`built-in datasets`. The default value of this argument is ``download_dir``. | ||
|
||
\-\-files <files ...> | ||
+++++++++++++++++++++ | ||
``--files`` is an **optional** argument for ``marius_preprocess``. | ||
|
@@ -135,10 +144,8 @@ The default value for ``<num_partitions>`` is one. | |
\-\-overwrite | ||
+++++++++++++ | ||
``--overwrite`` is an **optional** argument for ``marius_preprocess``. If this option is set, then | ||
the ``<output_directory>`` will be overwritten. Otherwise, ``marius_preprocess`` will treat the files | ||
in ``<output_directory>`` with the same file names as the latest files for current run. When switching | ||
from one dataset to another one, the converted data files of the previous dataset in same ``<output_directory>`` | ||
may be treated as the already-preprocessed data files for the current dataset if this option is not set. | ||
the ``<output_directory>`` and ``<download_directory>`` will be removed before the preprocessing starts | ||
to prevent files left over from previous runs from interfering with files from the current run. | ||
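The behavior described for ``--overwrite`` could look roughly like the sketch below. The helper name ``reset_directories`` and the temporary paths are hypothetical, and the real ``marius_preprocess`` implementation may handle this differently.

```python
import os
import shutil
import tempfile


def reset_directories(output_directory, download_directory, overwrite):
    """Remove both directories when overwrite is set, then ensure output exists."""
    if overwrite:
        for directory in (output_directory, download_directory):
            # ignore_errors avoids failing when a directory is already absent.
            shutil.rmtree(directory, ignore_errors=True)
    os.makedirs(output_directory, exist_ok=True)


# Simulate a previous run that left a stale file behind.
root = tempfile.mkdtemp()
out_dir = os.path.join(root, "out")
dl_dir = os.path.join(root, "download_dir")
os.makedirs(out_dir)
os.makedirs(dl_dir)
stale_file = os.path.join(out_dir, "stale.bin")
open(stale_file, "wb").close()

reset_directories(out_dir, dl_dir, overwrite=True)
```

Clearing both directories up front means a run for a new dataset cannot pick up converted files from an earlier dataset by accident.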
|
||
\-\-generate_config <device>, \-gc <device> | ||
+++++++++++++++++++++++++++++++++++++++++++ | ||
|
Isn't `preprocess` supposed to be used for general datasets? Why did we call it `download_directory`? We should have self-explanatory flags like `input_data_directory` and `output_data_directory`. Also, what is `files`? These names need to change.

Yes, `preprocess` can be used for general datasets.

This `download_directory` is only used by `preprocess` to store downloaded files for supported datasets. When users set the option `dataset` to a supported dataset name, `preprocess` downloads the dataset files into this directory if users specify it. This downloading behavior only happens for supported datasets. When preprocessing custom datasets, this directory is not used. I think it might be inaccurate to call such a directory `input_data_directory`, since users do not use it for inputting general dataset files.

When preprocessing general datasets, users are expected to input the path to all input files through the option `files`. I should use a more self-explanatory flag for this option, such as `input_data_path`.

We should split operations such as download from the `preprocess` util. The functionality of `preprocess` should be limited to reading an input graph of a well-specified schema and returning the necessary output files that Marius requires to perform training. As simple as that. Please separate `download` into a separate utility in the Marius system.

Let's not include this in the current PR.