
Add comments for preprocess module #65

Closed
wants to merge 18 commits from Add_preprocess_comments

Conversation

@AnzeXie (Collaborator) commented Aug 25, 2021

Describe the pull request

This pull request adds docstrings to preprocess.py and csv_converter.py to improve the readability of the code.
The docstrings follow PEP 257 and the Google Python Style Guide (https://github.com/google/styleguide/blob/gh-pages/pyguide.md#s3.8.1-comments-in-doc-strings).

This pull request also adds a download_directory option that lets users choose the directory in which downloaded dataset files are stored. By default, the value of this option is "download_dir". The related documentation is also updated.
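For illustration, a docstring in this style might look like the following (a minimal sketch built on the live_journal signature from the diff below; the argument descriptions are assumptions, not the exact text added in this PR):

    def live_journal(output_dir, num_partitions=1, split=(.05, .05)):
        """Preprocesses the dataset live_journal.

        Args:
            output_dir: Directory where the preprocessed files are written.
            num_partitions: Number of partitions the nodes are split into.
            split: Fractions of the edges held out for validation and testing.
        """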

How was this tested?
Docs and comments were inspected manually.
The download_directory option is tested by a new test added in test_csv_preprocessor.py.

@AnzeXie force-pushed the Add_preprocess_comments branch from 4b7db33 to ac57577 on August 25, 2021 05:38
@shivaram

cc @thodrek

@thodrek left a comment

  • My comments on live journal apply to all data sets. Please address them for Live Journal and copy the changes over to all data sets.
  • Please update the initial PR description to indicate which documentation convention you are using. Is it PEP 484?
  • Please update the title of the PR and prepend the token [WIP] to indicate that the PR is Work In Progress, since the documentation of the second file is missing.

@@ -23,6 +29,17 @@


def live_journal(output_dir, num_partitions=1, split=(.05, .05)):
"""Preprocesses the dataset live_journal.

Please expand on the output of preprocess with respect to the files created, i.e., which files will be present in the directory.

@JasonMoho (Collaborator) commented Oct 6, 2021

I don't think this is needed in every single dataset preprocessing function.
I'd remove all these comments and instead add a note in each function pointing to the general parser for details on the output.
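
That suggestion might look like the following (a sketch; the wording is hypothetical, and it assumes the general parser lives in csv_converter.py, as referenced in the PR description):

    def live_journal(output_dir, num_partitions=1, split=(.05, .05)):
        """Preprocesses the dataset live_journal.

        See the general parser in csv_converter.py for details on the
        files produced in output_dir.
        """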

[Seven further review threads on src/python/tools/preprocess.py, since resolved.]
@AnzeXie changed the title Add comments for preprocess module → Add comments for preprocess module [WIP] on Aug 25, 2021
@AnzeXie changed the title Add comments for preprocess module [WIP] → Add comments for preprocess module on Aug 26, 2021
@AnzeXie force-pushed the Add_preprocess_comments branch from 54207de to a0a3cdd on August 26, 2021 05:16
@AnzeXie requested a review from thodrek on August 26, 2021 07:34

@thodrek left a comment

Please answer all the posted questions about the input flags before I proceed with the review.

Indicates number of lines to skip from the beginning

Specify certain config (optional): [--<section>.<key>=<value>]
usage: preprocess [-h] [--download_directory download_directory] [--files files [files ...] | --dataset dataset] [--num_partitions num_partitions]

Isn't preprocess supposed to be used for general datasets? Why did we call it download_directory? We should have self-explanatory flags like input_data_directory and output_data_directory. Also, what is files? These names need to change.

@AnzeXie (Collaborator, Author) commented Sep 29, 2021

Yes, preprocess can be used for general datasets.
The download_directory is only used by preprocess to store the downloaded files of supported datasets: when users set the dataset option to a supported dataset name, preprocess downloads the dataset files into this directory. This downloading behavior only happens for supported datasets; when preprocessing custom datasets, this directory is not used. I think it would be inaccurate to call such a directory input_data_directory, since users do not use it to supply general dataset files.
When preprocessing general datasets, users are expected to pass the paths of all input files through the files option. I should use a more self-explanatory flag for this option, such as input_data_path. Both modes are illustrated below.
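
Illustrative invocations of the two modes, built from the usage string quoted above (the file names are placeholders; any positional arguments are cut off in the quoted usage line, so they are omitted here as well):

    # Supported dataset: preprocess downloads the files into download_dir first.
    preprocess --dataset live_journal --download_directory download_dir

    # Custom dataset: the input files are passed explicitly.
    preprocess --files train.csv valid.csv test.csv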

We should split operations such as download from the preprocess utility. The functionality of preprocess should be limited to reading an input graph with a well-specified schema and producing the output files that Marius requires to perform training. As simple as that. Please separate download into a separate utility in the Marius system.
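
A minimal sketch of that separation, under the assumption that each supported dataset maps to a fixed source URL (download_dataset and DATASET_URLS are hypothetical names, not existing code in the repository):

    import os
    import urllib.request

    # Hypothetical mapping from supported dataset names to source URLs.
    DATASET_URLS = {
        "live_journal": "https://example.com/live_journal.tar.gz",  # placeholder URL
    }

    def download_dataset(dataset_name, download_dir="download_dir"):
        """Downloads the files of a supported dataset into download_dir."""
        os.makedirs(download_dir, exist_ok=True)
        url = DATASET_URLS[dataset_name]
        destination = os.path.join(download_dir, os.path.basename(url))
        urllib.request.urlretrieve(url, destination)
        return destination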

Collaborator

Let's not include this in the current PR.

[Nine further review threads on docs/user_guide/command_line_interface.rst, since resolved.]
and/or valid_edges.pt and test_edges.pt depend on dataset_split.
The following files are created by this function:
train_edges.pt: Dump of tensor memory for edges in the training set.
valid_edges.pt: Dump of tensor memroy for edges in the validation set.
Collaborator

memroy->memory

@JasonMoho closed this on Apr 9, 2022