
Spark preprocessor now works with s3 #118

Open: wants to merge 2 commits into main

Conversation

basavaraj29 (Collaborator)

Expects the environment variables S3_BUCKET, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY to be set.

marius_preprocess --edges s3a://fb15k237/train.txt s3a://fb15k237/valid.txt s3a://fb15k237/test.txt --output_directory /home/data/datasets/fb15k_237/ --spark

Writes the preprocessed output to the local directory as well as to S3. The local files can be deleted afterwards, but I'm keeping them for now.
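A minimal sketch of how the three required environment variables might be validated and turned into Hadoop s3a settings for the Spark session (the helper name is hypothetical and not part of this PR; the fs.s3a.* keys are standard Hadoop S3A options):

```python
import os


def s3_config_from_env(environ=os.environ):
    """Collect the S3 settings this PR expects from environment variables.

    Returns a dict of Hadoop s3a options suitable for
    spark.conf.set(...); raises if any variable is missing.
    (Illustrative helper, not code from the PR.)
    """
    required = ["S3_BUCKET", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise EnvironmentError(f"missing required env variables: {missing}")
    return {
        "fs.s3a.access.key": environ["AWS_ACCESS_KEY_ID"],
        "fs.s3a.secret.key": environ["AWS_SECRET_ACCESS_KEY"],
    }
```

Failing fast on a missing variable gives a clearer error than letting Spark fail later with an S3 authentication exception.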

@basavaraj29 basavaraj29 requested a review from JasonMoho October 7, 2022 04:29
super().__init__()

self.spark = spark
self.output_dir = output_dir
self.output_to_s3 = output_to_s3
if self.output_to_s3:
    self.s3_bucket = os.getenv("S3_BUCKET")
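Given self.s3_bucket and self.output_dir, the S3 write destination might be derived along these lines (a sketch only; this helper is hypothetical and does not appear in the PR):

```python
import os
import posixpath


def s3a_output_path(s3_bucket, output_dir, filename):
    # Mirror the last component of the local output directory under the
    # configured bucket, using an s3a:// URI so Spark's Hadoop S3
    # connector can write to it directly.
    key = posixpath.join(os.path.basename(output_dir.rstrip("/")), filename)
    return f"s3a://{s3_bucket}/{key}"
```

For example, with S3_BUCKET=fb15k237 and the output directory from the PR description, an output file would land under s3a://fb15k237/fb15k_237/.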

Can we write a README, or add documentation that lists all the environment variables a user is required to set?

Collaborator Author

Sure, I'll check in a README too.
