
Add comments for preprocess module #65

Closed
wants to merge 18 commits from Add_preprocess_comments

Conversation

@AnzeXie (Collaborator) commented Aug 25, 2021

Describe the pull request

This pull request adds docstrings to preprocess.py and csv_converter.py to improve the readability of the code.
The docstrings follow PEP 257 and the Google Python Style Guide (https://github.com/google/styleguide/blob/gh-pages/pyguide.md#s3.8.1-comments-in-doc-strings).

This pull request also adds a download_directory option that lets users choose the directory in which downloaded dataset files are stored. By default, the value of this option is "download_dir". The related documentation is also updated.
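For illustration, a docstring in this style might look like the following (a minimal sketch built on the live_journal signature from the diff below; the argument descriptions are assumptions, not the exact text added in this PR):

    def live_journal(output_dir, num_partitions=1, split=(.05, .05)):
        """Preprocesses the dataset live_journal.

        Args:
            output_dir: Directory where the preprocessed files are written.
            num_partitions: Number of partitions the nodes are split into.
            split: Fractions of the edges held out for validation and testing.
        """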

How was this tested?
Docs and comments were inspected manually.
The download_directory option is tested by a new test added in test_csv_preprocessor.py.

@AnzeXie force-pushed the Add_preprocess_comments branch from 4b7db33 to ac57577 on August 25, 2021 05:38
@shivaram

cc @thodrek

@thodrek left a comment

  • My comments on live journal apply to all data sets. Please address them for Live Journal and copy the changes over to all data sets.
  • Please update the initial PR description to indicate which documentation convention you are using. Is it PEP 484?
  • Please update the title of the PR and prepend the token [WIP] to indicate that the PR is Work In Progress, since the documentation of the second file is missing.

@@ -23,6 +29,17 @@


def live_journal(output_dir, num_partitions=1, split=(.05, .05)):
"""Preprocesses the dataset live_journal.

Please expand on the output of preprocess with respect to the files created, i.e., which files will be present in the directory.

@JasonMoho (Collaborator) commented Oct 6, 2021

I don't think this is needed in every single dataset preprocessing function.
I'd remove all these comments and instead add a note in each function pointing to the general parser for details on the output.
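
That suggestion might look like the following (a sketch; the wording is hypothetical, and it assumes the general parser lives in csv_converter.py, as referenced in the PR description):

    def live_journal(output_dir, num_partitions=1, split=(.05, .05)):
        """Preprocesses the dataset live_journal.

        See the general parser in csv_converter.py for details on the
        files produced in output_dir.
        """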

[Seven further review threads on src/python/tools/preprocess.py, since resolved.]
@AnzeXie changed the title Add comments for preprocess module → Add comments for preprocess module [WIP] on Aug 25, 2021
@AnzeXie changed the title Add comments for preprocess module [WIP] → Add comments for preprocess module on Aug 26, 2021
@AnzeXie force-pushed the Add_preprocess_comments branch from 54207de to a0a3cdd on August 26, 2021 05:16
@AnzeXie requested a review from thodrek on August 26, 2021 07:34

@thodrek left a comment

Please answer all the posted questions about the input flags before I proceed with the review.

Indicates number of lines to skip from the beginning

Specify certain config (optional): [--<section>.<key>=<value>]
usage: preprocess [-h] [--download_directory download_directory] [--files files [files ...] | --dataset dataset] [--num_partitions num_partitions]

Isn't preprocess supposed to be used for general datasets? Why did we call it download_directory? We should have self-explanatory flags like input_data_directory and output_data_directory. Also, what is files? These names need to change.

@AnzeXie (Collaborator, Author) commented Sep 29, 2021

Yes, preprocess can be used for general datasets.
The download_directory is only used by preprocess to store the downloaded files of supported datasets: when users set the dataset option to a supported dataset name, preprocess downloads the dataset files into this directory. This downloading behavior only happens for supported datasets; when preprocessing custom datasets, this directory is not used. I think it would be inaccurate to call such a directory input_data_directory, since users do not use it to supply general dataset files.
When preprocessing general datasets, users are expected to pass the paths of all input files through the files option. I should use a more self-explanatory flag for this option, such as input_data_path. Both modes are illustrated below.
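
Illustrative invocations of the two modes, built from the usage string quoted above (the file names are placeholders; any positional arguments are cut off in the quoted usage line, so they are omitted here as well):

    # Supported dataset: preprocess downloads the files into download_dir first.
    preprocess --dataset live_journal --download_directory download_dir

    # Custom dataset: the input files are passed explicitly.
    preprocess --files train.csv valid.csv test.csv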

We should split operations such as download from the preprocess utility. The functionality of preprocess should be limited to reading an input graph with a well-specified schema and producing the output files that Marius requires to perform training. As simple as that. Please separate download into a separate utility in the Marius system.
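
A minimal sketch of that separation, under the assumption that each supported dataset maps to a fixed source URL (download_dataset and DATASET_URLS are hypothetical names, not existing code in the repository):

    import os
    import urllib.request

    # Hypothetical mapping from supported dataset names to source URLs.
    DATASET_URLS = {
        "live_journal": "https://example.com/live_journal.tar.gz",  # placeholder URL
    }

    def download_dataset(dataset_name, download_dir="download_dir"):
        """Downloads the files of a supported dataset into download_dir."""
        os.makedirs(download_dir, exist_ok=True)
        url = DATASET_URLS[dataset_name]
        destination = os.path.join(download_dir, os.path.basename(url))
        urllib.request.urlretrieve(url, destination)
        return destination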

Collaborator

Let's not include this in the current PR.

[Nine further review threads on docs/user_guide/command_line_interface.rst, since resolved.]
and/or valid_edges.pt and test_edges.pt depend on dataset_split.
The following files are created by this function:
train_edges.pt: Dump of tensor memory for edges in the training set.
valid_edges.pt: Dump of tensor memroy for edges in the validation set.
Collaborator

memroy->memory

@JasonMoho closed this on Apr 9, 2022