Pyspark preprocessor outputs to s3 #124

Open
wants to merge 2 commits into
base: spark-preprocessor-optimizations

basavaraj29
Collaborator

The preprocessor now writes the processed edge and node data to S3, but the output is split across many files, which still need to be combined.

The following call fails when the files are small:

s3_obj.merge(output_filename, files_list)

It throws the error EntityTooSmall.
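For context, S3 rejects a multipart upload with EntityTooSmall when any part other than the last is smaller than 5 MiB, which would explain the failure if the merge copies each input file as one part. Below is a minimal sketch of one possible workaround, not the code in this PR: check the sizes first, and fall back to streaming the files into a single object when they are too small. The names `can_merge_server_side` and `concat_files` are hypothetical helpers, and `open_fn` is assumed to be any `open`-like callable (for example `s3_obj.open`, if `s3_obj` is an `s3fs.S3FileSystem`).

```python
import shutil
from typing import BinaryIO, Callable, Sequence

# S3 multipart uploads require every part except the last to be at least 5 MiB.
MIN_PART_SIZE = 5 * 1024 * 1024


def can_merge_server_side(sizes: Sequence[int], min_part: int = MIN_PART_SIZE) -> bool:
    """Return True if every file except the last meets the minimum part size.

    Hypothetical helper: assumes the merge uploads each file as one part,
    in the order given.
    """
    return all(size >= min_part for size in sizes[:-1])


def concat_files(
    open_fn: Callable[[str, str], BinaryIO],
    output_path: str,
    input_paths: Sequence[str],
    chunk_size: int = 1 << 20,
) -> None:
    """Fallback for small files: stream each input into one output, in order.

    Hypothetical helper: the bytes are copied as-is, so any per-file CSV
    header would be repeated in the output.
    """
    with open_fn(output_path, "wb") as out:
        for path in input_paths:
            with open_fn(path, "rb") as src:
                shutil.copyfileobj(src, out, chunk_size)
```

With this in place, the call above could be guarded by `can_merge_server_side(sizes)`, where `sizes` holds the byte size of each entry in `files_list`, and `concat_files(s3_obj.open, output_filename, files_list)` would be used otherwise. The fallback moves the data through the client, so it only makes sense when the files really are small.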

Once we have a single file, we can look into converting it to binary. Alternatively, we can define a custom writer that outputs the binary format directly, without the intermediate CSV files.

@basavaraj29 basavaraj29 changed the base branch from main to spark-preprocessor-optimizations November 22, 2022 21:19