Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-5414] [VL] Support arrow csv option and schema #5850

Merged
merged 19 commits into from
Jun 3, 2024

Conversation

jinchengchenghh
Copy link
Contributor

@jinchengchenghh jinchengchenghh commented May 23, 2024

Support basic option now, will support more options after arrow patch merged.
apache/arrow#41646
Before this patch, if the required schema is different with file schema, csv read will fallback.
And changed to use index in file instead of check the file column name considering case sensitive.
Add a new common test function when the rule applies to Logical plan.

Compile arrow with version 15.0.0-gluten, upgrade arrow-dataset and arrow-c-data version from 15.0.0 to 15.0.0-gluten

Copy link

#5414

Copy link

Run Gluten Clickhouse CI

8 similar comments
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

9 similar comments
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

@jinchengchenghh
Copy link
Contributor Author

Can you help review this PR and rerun the failed test? Thanks! @zhztheplayer

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits. Thanks!

key: cache-velox-build-${{ hashFiles('./cache-key') }}
- name: Build Gluten Velox third party
if: ${{ steps.cache.outputs.cache-hit != 'true' }}
run: |
source dev/ci-velox-buildstatic.sh
- uses: actions/upload-artifact@v2
- name: 'Upload Artifact Native'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why quoting the name? Since we didn't do it on other names.

Comment on lines +62 to +64
path: |
./cpp/build/releases/
~/.m2/repository/org/apache/arrow/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we modify velox_docker_cache.yml as well? Did you check that file already?

Comment on lines +1 to +4
Name;Language
Juno;Java
Peter;Python
Celin;C++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The placing folder can be renamed from resources/datasource/csv to resource/arrow-datasource/csv or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The csv file is not relevant to arrow datasource, it is just used to test Arrow.
It is just a normal CSV file, so I place it here.

Copy link

Run Gluten Clickhouse CI

zhztheplayer
zhztheplayer previously approved these changes May 29, 2024
@jinchengchenghh
Copy link
Contributor Author

jinchengchenghh commented May 31, 2024

Can you help merge this one? Thanks! @zhztheplayer

Copy link

github-actions bot commented Jun 1, 2024

Run Gluten Clickhouse CI

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@zhztheplayer zhztheplayer merged commit ad817ed into apache:main Jun 3, 2024
40 checks passed
@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_5850_time.csv log/native_master_06_02_2024_26ff58d3b_time.csv difference percentage
q1 33.45 33.06 -0.393 98.83%
q2 23.59 26.68 3.092 113.11%
q3 36.86 36.70 -0.152 99.59%
q4 30.73 35.50 4.771 115.52%
q5 68.58 69.83 1.252 101.83%
q6 7.13 5.99 -1.144 83.97%
q7 80.03 81.70 1.677 102.10%
q8 84.86 84.47 -0.388 99.54%
q9 120.61 119.12 -1.482 98.77%
q10 47.37 45.39 -1.982 95.82%
q11 20.14 20.18 0.042 100.21%
q12 25.06 25.30 0.241 100.96%
q13 37.33 37.41 0.075 100.20%
q14 18.89 20.07 1.177 106.23%
q15 31.14 30.21 -0.923 97.04%
q16 13.95 14.14 0.192 101.37%
q17 101.75 101.99 0.247 100.24%
q18 147.57 147.16 -0.405 99.73%
q19 13.44 17.14 3.701 127.53%
q20 28.93 26.85 -2.087 92.79%
q21 259.41 259.27 -0.141 99.95%
q22 12.33 12.39 0.057 100.46%
total 1243.15 1250.57 7.427 100.60%

Comment on lines +288 to +289
mvn clean install -am \
-DskipTests -Drat.skip -Dmaven.gitcommitid.skip -Dcheckstyle.skip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need compile whole project? it's a time consuming command...
I wonder which depends module can not be compiled by -am options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrow-bom module includes many modules, so we need to compile whole project

WangGuangxin added a commit to WangGuangxin/gluten that referenced this pull request Jun 11, 2024
WangGuangxin added a commit to WangGuangxin/gluten that referenced this pull request Jun 15, 2024
WangGuangxin added a commit to WangGuangxin/gluten that referenced this pull request Jun 18, 2024
WangGuangxin added a commit to WangGuangxin/gluten that referenced this pull request Jun 19, 2024
WangGuangxin added a commit to WangGuangxin/gluten that referenced this pull request Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants