Add comments for preprocess module #65
Conversation
Force-pushed from 4b7db33 to ac57577.
cc @thodrek
- My comments on live_journal apply to all data sets. Please address them for live_journal and copy the changes over to all data sets.
- Please update the initial PR description to indicate which documentation convention you are using. Is it PEP 484?
- Please update the title of the PR and prepend the token [WIP] to indicate that the PR is Work In Progress, since the documentation of the second file is missing.
@@ -23,6 +29,17 @@
def live_journal(output_dir, num_partitions=1, split=(.05, .05)):
    """Preprocesses the dataset live_journal.
Please expand on what the output of preprocess is with respect to the files created, i.e., which files will be present in the directory.
I don't think this is needed in every single dataset preprocessing function.
I'd remove all these comments and just add a note in each function pointing to the general parser for more details on the output.
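A minimal sketch of what that cross-reference could look like, assuming the general parser lives in `csv_converter.py` (the wording is illustrative, not the final docstring):

```python
def live_journal(output_dir, num_partitions=1, split=(.05, .05)):
    """Preprocesses the dataset live_journal.

    See the general parser in csv_converter.py for details on the
    output files written to output_dir.
    """
```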
Force-pushed from 54207de to a0a3cdd.
Please answer all posted questions on the input flags before I proceed with the review.
Indicates number of lines to skip from the beginning
Specify certain config (optional): [--<section>.<key>=<value>]
usage: preprocess [-h] [--download_directory download_directory] [--files files [files ...] | --dataset dataset] [--num_partitions num_partitions]
Isn't `preprocess` supposed to be used for general datasets? Why did we call it `download_directory`? We should have self-explanatory flags like `input_data_directory` and `output_data_directory`. Also, what is `files`? These names need to change.
Yes, `preprocess` can be used for general datasets.

`download_directory` is only used by `preprocess` to store downloaded files for supported datasets. When users set the option `dataset` to a supported dataset name, `preprocess` downloads the dataset files into this directory if they specify one. This downloading behavior only happens for supported datasets; when preprocessing custom datasets, this directory is not used. I think it would be inaccurate to call such a directory `input_data_directory`, since users do not use it for inputting general dataset files.

When preprocessing general datasets, users are expected to pass the paths to all input files through the option `files`. I should use a more self-explanatory flag for this option, such as `input_data_path`.
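For reference, a minimal argparse sketch of the flag semantics described above, matching the usage string quoted earlier. The flag names and the "download_dir" default come from this PR; everything else is illustrative:

```python
import argparse

parser = argparse.ArgumentParser(prog="preprocess")
# Only used for supported datasets: downloaded files land here.
# Ignored when preprocessing a custom dataset.
parser.add_argument("--download_directory", default="download_dir",
                    help="Directory for files downloaded for supported datasets.")
# A dataset is given either as explicit input files (custom) or by
# name (supported), hence the mutually exclusive group in the usage string.
group = parser.add_mutually_exclusive_group()
group.add_argument("--files", nargs="+",
                   help="Paths to the input files of a custom dataset.")
group.add_argument("--dataset",
                   help="Name of a supported dataset to download.")
parser.add_argument("--num_partitions", type=int, default=1)

args = parser.parse_args()
```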
We should split operations such as download from the `preprocess` util. The functionality of `preprocess` should be limited to reading an input graph with a well-specified schema and returning the necessary output files that Marius requires to perform training. As simple as that. Please separate `download` into a separate utility in the Marius system.
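A rough sketch of the proposed split; the function names and signatures here are hypothetical, not existing Marius code:

```python
# download.py -- hypothetical standalone utility: fetch a supported
# dataset into a local directory and do nothing else.
def download(dataset_name, download_directory):
    """Downloads the files of a supported dataset.

    Returns the paths of the downloaded files, which can then be fed
    to preprocess like any custom dataset.
    """

# preprocess.py -- limited to converting an input graph with a known
# schema into the tensor files Marius needs for training.
def preprocess(input_files, output_directory, num_partitions=1,
               split=(.05, .05)):
    """Reads input edges and writes train/valid/test tensor files."""
```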
Let's not include this in the current PR.
and/or valid_edges.pt and test_edges.pt depend on dataset_split.
The following files are created by this function:
    train_edges.pt: Dump of tensor memory for edges in the training set.
    valid_edges.pt: Dump of tensor memroy for edges in the validation set.
memroy->memory
Describe the pull request
This pull request adds docstrings to `preprocess.py` and `csv_converter.py` to enhance the readability of the code. The docstrings follow PEP 257 and the Google Python Style Guide (https://github.com/google/styleguide/blob/gh-pages/pyguide.md#s3.8.1-comments-in-doc-strings).
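For illustration, a short docstring in that convention; the function and its arguments are hypothetical, chosen only to show the Args/Returns sections:

```python
def split_edges(edges, split=(.05, .05)):
    """Splits an edge list into train, validation, and test sets.

    Args:
        edges: A sequence of (source, relation, destination) triples.
        split: Fractions of edges held out for validation and test.

    Returns:
        A tuple (train, valid, test) of edge lists.
    """
    n = len(edges)
    num_valid, num_test = int(n * split[0]), int(n * split[1])
    train = edges[:n - num_valid - num_test]
    valid = edges[n - num_valid - num_test:n - num_test]
    test = edges[n - num_test:]
    return train, valid, test
```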
This pull request also adds a `download_directory` option that lets users choose the directory in which downloaded dataset files are stored. By default, the value of this option is "download_dir". Related documents are also updated.

How was this tested?
Docs and comments were inspected manually.
The `download_directory` option is tested by adding a new test in `test_csv_preprocessor.py`.