initial documentation for spark preprocessor s3 support
basavaraj29 committed Oct 12, 2022
1 parent 872cbf7 commit 712459c
Showing 1 changed file with 13 additions and 0 deletions: docs/preprocess_datasets/command_line.rst
@@ -77,6 +77,19 @@ Datasets in delimited file formats such as CSVs can be preprocessed with ``marius_preprocess``

See this `example <custom_dataset_example_>`_.

Custom datasets stored in S3 can also be preprocessed using the Spark mode of ``marius_preprocess``. If the supplied
edge paths start with ``s3a://``, the Spark preprocessor reads the files from S3 and writes the processed
output both locally and to the given S3 bucket (read from an environment variable).

@shivaram (Oct 13, 2022):

Can the inputs and outputs be in different buckets?

@basavaraj29 (author, Oct 13, 2022):

Yes, the input paths are read from --edges; they can exist in some other bucket too.

The ``S3_BUCKET``, ``AWS_ACCESS_KEY_ID``, and ``AWS_SECRET_ACCESS_KEY`` environment variables must be set for this
to work.

.. code-block:: bash

    $ export S3_BUCKET=<bucket to which the preprocessed files will be written>
    $ export AWS_ACCESS_KEY_ID=<...>
    $ export AWS_SECRET_ACCESS_KEY=<...>
    $ marius_preprocess --edges s3a://fb15k237/train.txt s3a://fb15k237/valid.txt s3a://fb15k237/test.txt \
        --output_directory datasets/custom_spark_s3/ --spark

@shivaram (Oct 13, 2022), on the S3_BUCKET export above:

Why do we need this? Can we just have --output_directory be "s3a://fb15k237/datasets/custom_spark_s3"?

@basavaraj29 (author, Oct 13, 2022):

The Spark S3 writer implementation is as follows:

  1. Once the preprocessing is done, it writes multiple CSV files.
  2. We combine them into a single CSV file using cat.
  3. For binary output mode, we read the combined CSV in 10^8-byte chunks using pandas and write out the binary file.

If we want to write directly to S3, we'll need to do the above using s3fs. We can do that, but I'm wondering if we would be better off with a single file-upload call instead, given that we have minimal local storage for writing the preprocessed data.
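
A rough sketch of steps 2 and 3 described in the comment above, for illustration only: the file names (``spark_output/part-*.csv``, ``edges_combined.csv``, ``edges.bin``), the integer (src, rel, dst) edge format, and the use of a row-based pandas ``chunksize`` in place of the byte-based chunking mentioned above are all assumptions; this is not the actual ``marius_preprocess`` code.

.. code-block:: python

    # Illustrative sketch only (not the marius_preprocess implementation).
    # Step 2: concatenate the CSV part files Spark wrote (the `cat` step).
    # Step 3: stream the combined CSV through pandas in chunks and append
    #         each chunk to a single binary file of int32 (src, rel, dst) rows.
    import glob
    import shutil

    import numpy as np
    import pandas as pd

    # Combine Spark's part files into one local CSV.
    with open("edges_combined.csv", "wb") as combined:
        for part in sorted(glob.glob("spark_output/part-*.csv")):
            with open(part, "rb") as f:
                shutil.copyfileobj(f, combined)

    # Convert the combined CSV to a single binary file, chunk by chunk.
    with open("edges.bin", "wb") as out:
        # chunksize counts rows; it stands in for the byte-based chunking
        # described above.
        for chunk in pd.read_csv("edges_combined.csv", header=None, chunksize=1_000_000):
            chunk.to_numpy(dtype=np.int32).tofile(out)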

@shivaram (Oct 16, 2022):

I briefly chatted with @JasonMoho about this. I think a better thing to do here is to not combine things into one CSV but to store each partition as a different object on S3. I think that will scale better (there are limits to object size on S3) and also make it possible to run preprocessing in environments like AWS EMR where no local disk might exist.
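
A minimal sketch of the alternative suggested above, assuming the ``s3a`` connector (``hadoop-aws``) and credentials are configured: Spark writes each partition of the preprocessed DataFrame as its own object under an S3 prefix, so no local combine step is needed. The bucket names, output prefix, and separator here are hypothetical.

.. code-block:: python

    # Illustrative sketch only; bucket names, prefix, and separator are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("preprocess_s3_sketch").getOrCreate()

    # Stand-in for the DataFrame the Spark preprocessor produces.
    edges_df = spark.read.csv("s3a://fb15k237/train.txt", sep="\t")

    # Each partition is written as a separate object under the output prefix,
    # instead of being combined into one local CSV first.
    edges_df.write.mode("overwrite").csv("s3a://my-output-bucket/datasets/custom_spark_s3/edges/")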

Usage
-----------------------
