
Spark preprocessor now works with s3 #118

Open: wants to merge 2 commits into main

Conversation

basavaraj29 (Collaborator)

Expects the environment variables S3_BUCKET, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY to be set.

marius_preprocess --edges s3a://fb15k237/train.txt s3a://fb15k237/valid.txt s3a://fb15k237/test.txt --output_directory /home/data/datasets/fb15k_237/ --spark

Writes the preprocessed output to the local directory as well as to S3. The local files can be deleted afterwards, but I'm keeping them for now.
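A minimal sketch of how the three required environment variables might be validated and turned into Hadoop s3a settings for the Spark session (the helper name is hypothetical and not part of this PR; the fs.s3a.* keys are standard Hadoop S3A options):

```python
import os


def s3_config_from_env(environ=os.environ):
    """Collect the S3 settings this PR expects from environment variables.

    Returns a dict of Hadoop s3a options suitable for
    spark.conf.set(...); raises if any variable is missing.
    (Illustrative helper, not code from the PR.)
    """
    required = ["S3_BUCKET", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise EnvironmentError(f"missing required env variables: {missing}")
    return {
        "fs.s3a.access.key": environ["AWS_ACCESS_KEY_ID"],
        "fs.s3a.secret.key": environ["AWS_SECRET_ACCESS_KEY"],
    }
```

Failing fast on a missing variable gives a clearer error than letting Spark fail later with an S3 authentication exception.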

@basavaraj29 basavaraj29 requested a review from JasonMoho October 7, 2022 04:29
super().__init__()

self.spark = spark
self.output_dir = output_dir
self.output_to_s3 = output_to_s3
if self.output_to_s3:
    self.s3_bucket = os.getenv("S3_BUCKET")
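Given self.s3_bucket and self.output_dir, the S3 write destination might be derived along these lines (a sketch only; this helper is hypothetical and does not appear in the PR):

```python
import os
import posixpath


def s3a_output_path(s3_bucket, output_dir, filename):
    # Mirror the last component of the local output directory under the
    # configured bucket, using an s3a:// URI so Spark's Hadoop S3
    # connector can write to it directly.
    key = posixpath.join(os.path.basename(output_dir.rstrip("/")), filename)
    return f"s3a://{s3_bucket}/{key}"
```

For example, with S3_BUCKET=fb15k237 and the output directory from the PR description, an output file would land under s3a://fb15k237/fb15k_237/.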

Can we write a README, or add documentation that lists all the environment variables a user is required to set?

Collaborator Author

Sure, I'll check in a README too.
