-
Notifications
You must be signed in to change notification settings - Fork 448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-4398][IT] Add Golden Files for TPC-H Spark32 + Gluten Execution Plan #4432
Conversation
Run Gluten Clickhouse CI |
6 similar comments
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
@zwangsheng Thanks in advance. Would you like to introduce a bit about the term "golden file"? Thanks. If I understand it correctly, when execution plan for a query gets changed, developer should regenerate the files using If yes, I am worried that this way puts extra burden to developers since compiling After all, would you like to share a bit about the background why you wanted to add this check to Gluten (and its CI)? Thanks. |
Thanks for noticing this feature. @zhztheplayer
Base on Spark SQL Module has PlanStabilitySuite to sense how code changes (from PR) change the execution plan. Gluten is closely related to spark execution plans, and
Yes, if devs make same change to related code, the Golden Files should be re-generate by
At present, it seems that the process of user contribution PR will be affected, and maybe we can make some changes based on this PR to reduce the compilation cost of the user side(in the process of CI detection, we can upload the changed Golden File to the log store section of Github Action, and users can download it instead of compiling it locally, that seems to allay the concerns you mentioned) But in general, we still should have such a Golden File mechanism in case we miss the relevant changes. Thanks again, if have new ideas to help us check spark execution plan changes, be open to them. |
Run Gluten Clickhouse CI |
3 similar comments
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Thanks for the detailed explanation @zwangsheng.
I still feel the procedure is probably a little too complicated for developers. Gluten has in-place TPC-H unit test code here So if possible I might be inclined to have the golden checking integrated with current unit tests rather than integration testing. Then developers could avoid re-building the whole project to change golden files when plan gets changed. I understand it might require for a bunch of changes to this PR but let's see if it is worth it. I was considering bring some A simple comparison between using it and ut: it: ut (probably): I assume the ut solution should take much shorter time than using it? |
Thanks, i will give a try to move those golden files test to unit test. |
Thank you! |
Run Gluten Clickhouse CI |
7 similar comments
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
3973536
to
b648a09
Compare
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
if [ -z "$GITHUB_RUN_ID" ] | ||
then | ||
echo "Unable to parse GITHUB_RUN_ID." | ||
exit 1 | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check $GITHUB_JOB
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. Thanks @zwangsheng !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems overkill to generate golden files for each type
- gluten-bhj-vanilla-be
- vanilla-bhj-gluten-be
- partitioned
- v1
- v1-bhj
- v2
- v2-bhj
Can we just maintain v1
and v1-bhj
?
Reducing type seems feasible, but considering Spark retains v2 related, should we also retain v2 related? |
I think no body would use v2 code path to read parquet.. it is mainly used for data lake. |
Another question: After this patch, is there any best practice to re-generate all the golden files through a single command?
The example is good but it seems developer should fix the files one by one? |
Nope, github action will cache those actual golden files in tgz mode, so users can download and replace all in one click. |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
lgtm, please add a step by step docs to help contributors check and update golden files. |
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Hi @ulysses-you @zhztheplayer Golden files related introduction was added in |
Actual Plan path: /tmp/tpch-approved-plan/v2-bhj/spark322/5.txt | ||
Golden Plan path: /opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-approved-plan/v2-bhj/spark322/5.txt (VeloxTPCHSuite.scala:101) | ||
``` | ||
For developers to update the golden plan, you can find the actual plan in Github CI Artifacts or in local `/tmp/` directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a .png
to make it clear ?
Run Gluten Clickhouse CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, cc @zhztheplayer @PHILO-HE if you have other comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution!
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
|
What changes were proposed in this pull request?
Add tpc-h golden file for spark32 and check whether PR changed Spark Execution Plan
Continue worker with #4399
Users can download and replace the local corresponding file for quick correction.
Golden files located at
${GLUTEN_REPO_HOME}/backends-velox/src/test/resources/tpch-approved-plan/${TPC_H_TEST_SUB_TYPE}/${SPARK_VERSION}/${QUERY_ID}.txt
,for example:
${GLUTEN_REPO_HOME}/backends-velox/src/test/resources/tpch-approved-plan/gluten-bhj-vanilla-be/spark322/1.txt
.And when user run unit test in local, them can find actual golden file in tmp dir, find this with unit test output log.
How was this patch tested?
Add Unit Test for Spark32 to check golden files