[VL] Add support to write parquet files to GCS #3978

Merged
merged 1 commit into from
Jan 4, 2024

Conversation

@tigrux tigrux commented Dec 8, 2023

This change adds support for writing Parquet files to GCS.
It is based on the existing support for writing to S3.

Fixes #3976
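
As a rough illustration of what supporting a second object store alongside S3 involves, the sketch below classifies a write path by its URI scheme. The `SinkKind` enum, the `classifyPath` helper, and the `s3://`, `s3a://`, and `gs://` prefixes are assumptions made for illustration; they are not taken from this PR's code.

```cpp
#include <string>

// Hypothetical storage kinds; the names are illustrative only.
enum class SinkKind { kLocal, kS3, kGcs };

// Returns true if `path` begins with `prefix`.
inline bool startsWith(const std::string& path, const std::string& prefix) {
  return path.compare(0, prefix.size(), prefix) == 0;
}

// Classifies a write path by URI scheme. The prefixes are assumptions.
inline SinkKind classifyPath(const std::string& path) {
  if (startsWith(path, "s3://") || startsWith(path, "s3a://")) {
    return SinkKind::kS3;
  }
  if (startsWith(path, "gs://")) {
    return SinkKind::kGcs;
  }
  return SinkKind::kLocal;  // Anything else is treated as a local path.
}
```

With this helper, `classifyPath("gs://bucket/out.parquet")` yields `SinkKind::kGcs`, and a caller could pick the matching sink and memory pool from that result.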

github-actions bot commented Dec 8, 2023

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

     std::shared_ptr<arrow::Schema> schema)
     : Datasource(filePath, schema),
       filePath_(filePath),
       schema_(schema),
       pool_(std::move(veloxPool)),
-      s3SinkPool_(std::move(s3SinkPool)) {}
+      s3SinkPool_(std::move(s3SinkPool)),
+      gcsSinkPool_(std::move(gcsSinkPool)) {}
Contributor

Let's extend the class into VeloxParquetDatasourceS3, VeloxParquetDatasourceGCS and VeloxParquetDatasourceABFS, instead of putting all of them in the same file.

Contributor Author

I agree that the class needs a refactor, but that is outside the scope of this change.
I propose doing the refactor after this change lands.

Contributor

@FelixYBW / @tigrux, I need to add native support for the ABFS write path, so I opened a refactoring PR, #5486, to address this comment.
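
The per-backend split discussed in this thread could look roughly like the sketch below: a shared base class with one subclass per object store. All class and method names here are hypothetical stand-ins and do not reflect the actual interfaces in Gluten or in #5486.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for the shared datasource base class.
class ParquetDatasourceBase {
 public:
  explicit ParquetDatasourceBase(std::string filePath)
      : filePath_(std::move(filePath)) {}
  virtual ~ParquetDatasourceBase() = default;

  // Each backend reports which kind of sink it would open.
  virtual std::string sinkName() const = 0;

  const std::string& filePath() const { return filePath_; }

 private:
  std::string filePath_;
};

// One subclass per object store, each of which could live in its own file.
class ParquetDatasourceS3 : public ParquetDatasourceBase {
 public:
  using ParquetDatasourceBase::ParquetDatasourceBase;
  std::string sinkName() const override { return "s3"; }
};

class ParquetDatasourceGcs : public ParquetDatasourceBase {
 public:
  using ParquetDatasourceBase::ParquetDatasourceBase;
  std::string sinkName() const override { return "gcs"; }
};

class ParquetDatasourceAbfs : public ParquetDatasourceBase {
 public:
  using ParquetDatasourceBase::ParquetDatasourceBase;
  std::string sinkName() const override { return "abfs"; }
};
```

Keeping the backend-specific behavior behind a virtual method means adding a new store would not require touching the existing subclasses.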

@@ -56,6 +56,16 @@ void VeloxParquetDatasource::init(const std::unordered_map<std::string, std::str
#else
throw std::runtime_error(
"The write path is S3 path but the S3 haven't been enabled when writing parquet data in velox runtime!");
#endif
Contributor

Same here. Let's move the logic into an independent file and compile that file depending on the compile flag, so that we avoid using ifdef ENABLE_S3/GCS etc.

Contributor Author

I agree, but please read my previous answer.
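
One way to realize the suggestion in this thread without preprocessor guards is a small registry: each backend lives in its own source file, registers itself at static-initialization time, and is added to the build only when the corresponding flag is set. The sketch below is an assumption-laden illustration; `registerSink`, `openSink`, and the string-returning factories are hypothetical and are not Gluten or Velox APIs.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// A factory returns a description of the sink it would open (illustrative only).
using SinkFactory = std::function<std::string(const std::string& path)>;

// Function-local static avoids static-initialization-order problems.
inline std::map<std::string, SinkFactory>& sinkRegistry() {
  static std::map<std::string, SinkFactory> registry;
  return registry;
}

inline bool registerSink(const std::string& scheme, SinkFactory factory) {
  sinkRegistry()[scheme] = std::move(factory);
  return true;
}

// In a real layout this would sit in a GCS-only source file that the build
// system compiles only when GCS support is enabled.
static const bool kGcsRegistered = registerSink(
    "gs", [](const std::string& path) { return "gcs sink for " + path; });

// Looks up the factory by URI scheme; throws if no backend was compiled in.
inline std::string openSink(const std::string& path) {
  const auto pos = path.find("://");
  const std::string scheme =
      pos == std::string::npos ? "file" : path.substr(0, pos);
  const auto it = sinkRegistry().find(scheme);
  if (it == sinkRegistry().end()) {
    throw std::runtime_error("No sink registered for scheme '" + scheme + "'");
  }
  return it->second(path);
}
```

In this sketch, a path whose backend was not compiled in fails at lookup time with a runtime error, which plays the role of the `#else` branch in the guarded version.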

@tigrux tigrux force-pushed the velox-gcs-write branch 4 times, most recently from 36d5b1f to cf3f3a5 Compare December 11, 2023 05:40
@zhouyuan zhouyuan left a comment

👍

@zhouyuan zhouyuan merged commit 6f92ac8 into apache:main Jan 4, 2024
17 checks passed
@GlutenPerfBot

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

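| query | log/native_3978_time.csv | log/native_master_01_03_2024_38de952fe_time.csv | difference | percentage |
|-------|--------------------------|--------------------------------------------------|------------|------------|
| q1 | 32.64 | 32.78 | 0.142 | 100.43% |
| q2 | 25.05 | 23.75 | -1.305 | 94.79% |
| q3 | 36.17 | 37.47 | 1.304 | 103.61% |
| q4 | 38.78 | 40.67 | 1.881 | 104.85% |
| q5 | 72.03 | 73.42 | 1.388 | 101.93% |
| q6 | 5.29 | 6.78 | 1.497 | 128.32% |
| q7 | 86.38 | 85.99 | -0.389 | 99.55% |
| q8 | 85.73 | 88.32 | 2.591 | 103.02% |
| q9 | 126.57 | 126.96 | 0.390 | 100.31% |
| q10 | 43.39 | 44.95 | 1.566 | 103.61% |
| q11 | 20.38 | 19.97 | -0.403 | 98.02% |
| q12 | 25.29 | 28.06 | 2.773 | 110.97% |
| q13 | 46.13 | 46.25 | 0.119 | 100.26% |
| q14 | 19.21 | 17.13 | -2.080 | 89.17% |
| q15 | 28.56 | 28.28 | -0.274 | 99.04% |
| q16 | 15.40 | 15.84 | 0.439 | 102.85% |
| q17 | 103.85 | 102.58 | -1.272 | 98.78% |
| q18 | 150.87 | 148.96 | -1.902 | 98.74% |
| q19 | 14.09 | 14.29 | 0.205 | 101.45% |
| q20 | 27.73 | 28.11 | 0.384 | 101.38% |
| q21 | 225.54 | 232.01 | 6.468 | 102.87% |
| q22 | 14.04 | 13.97 | -0.069 | 99.51% |
| total | 1243.10 | 1256.55 | 13.452 | 101.08% |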

Successfully merging this pull request may close these issues.

[VL] Velox+GCS connector cannot be used to write parquet files