Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE] Only materialize subquery before doing transform #5862

Merged
merged 6 commits into from
May 28, 2024

Conversation

ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented May 24, 2024

What changes were proposed in this pull request?

We transform subquery(e.g., dpp) during columanr rules which is not actually been executed, so we should not materialize subquery when replacing expression as it is not in concurrent. This pr wraps doTransform with transform to always do materialize subquery before doTransform, so that the subquries can be submitted in concurrent.

How was this patch tested?

Pass CI (also no regression)

Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Copy link

Run Gluten Clickhouse CI

@ulysses-you
Copy link
Contributor Author

/Benchmark Velox TPCDS

1 similar comment
@ulysses-you
Copy link
Contributor Author

/Benchmark Velox TPCDS

Copy link

Run Gluten Clickhouse CI

@ulysses-you ulysses-you marked this pull request as draft May 24, 2024 07:38
Copy link

Run Gluten Clickhouse CI

Comment on lines 33 to 34
// the first column in first row from `query`.
val rows = query.plan.executeCollect()
if (rows.length > 1) {
throw new IllegalStateException(
s"more than one row returned by a subquery used as an expression:\n${query.plan}")
}
val result: AnyRef = if (rows.length == 1) {
assert(
rows(0).numFields == 1,
s"Expects 1 field, but got ${rows(0).numFields}; something went wrong in analysis")
rows(0).get(0, query.dataType)
} else {
// If there is no rows returned, the result should be null.
null
}
val result = query.eval(InternalRow.empty)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this pr, we do not need to execute subquery manually so the exception behavior is same with vanilla Spark. Note that, this code change is just for simplify. The subquery has already been materialized before doing transform.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ulysses-you, it would be nice to add these comment in the code.

@ulysses-you ulysses-you marked this pull request as ready for review May 27, 2024 00:59
@ulysses-you
Copy link
Contributor Author

cc @zhztheplayer @PHILO-HE @zzcclp thank you

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work! Please reword PR description to help reader figure out the motivation.

Can we use "transform" to replace "executeTransform"? I guess there is some convention: the concrete implementation of doXXX is generally called inside XXX.

@ulysses-you
Copy link
Contributor Author

@PHILO-HE yes, I agree the convention in general.. The reason using executeTransform follows the Spark naming, like executeColumnar -> doColumnar, executeBroadcast -> doBroadcast. Please let me know what you prefer to.

@PHILO-HE
Copy link
Contributor

@PHILO-HE yes, I agree the convention in general.. The reason using executeTransform follows the Spark naming, like executeColumnar -> doColumnar, executeBroadcast -> doBroadcast. Please let me know what you prefer to.

I note Spark's calling path is executeColumnar -> doExecuteColumnar. So it may be better to make Gluten's calling path become transform -> doTransform for better readability.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L222

Copy link

Run Gluten Clickhouse CI

@ulysses-you
Copy link
Contributor Author

@PHILO-HE I see, addressed

Copy link

Run Gluten Clickhouse CI

PHILO-HE
PHILO-HE previously approved these changes May 27, 2024
Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one trivial comment.
cc @zhichao, @zhztheplayer, please review further.

Comment on lines 33 to 34
// the first column in first row from `query`.
val rows = query.plan.executeCollect()
if (rows.length > 1) {
throw new IllegalStateException(
s"more than one row returned by a subquery used as an expression:\n${query.plan}")
}
val result: AnyRef = if (rows.length == 1) {
assert(
rows(0).numFields == 1,
s"Expects 1 field, but got ${rows(0).numFields}; something went wrong in analysis")
rows(0).get(0, query.dataType)
} else {
// If there is no rows returned, the result should be null.
null
}
val result = query.eval(InternalRow.empty)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ulysses-you, it would be nice to add these comment in the code.

Copy link

Run Gluten Clickhouse CI

@ulysses-you
Copy link
Contributor Author

/Benchmark Velox TPCDS

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ulysses-you
Copy link
Contributor Author

@PHILO-HE @zhztheplayer I'm not sure the TPCDS benchmark has triggered and run successfully, can you help check the internal state? thank you!

Copy link
Contributor

@zzcclp zzcclp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhztheplayer
Copy link
Member

/Benchmark Velox

@PHILO-HE
Copy link
Contributor

@PHILO-HE @zhztheplayer I'm not sure the TPCDS benchmark has triggered and run successfully, can you help check the internal state? thank you!

It's triggered, but not executed successfully due to download failure for some deps. Just fixed. It's running now.

@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCDS SF2000 with Velox backend, for reference only ====

query log/native_5862_time.csv log/native_master_05_27_2024_efd6f31fb4_time.csv difference percentage
q1 14.97 15.37 0.400 102.67%
q2 14.14 15.78 1.638 111.58%
q3 5.19 4.39 -0.796 84.65%
q4 63.54 64.02 0.487 100.77%
q5 7.79 8.19 0.398 105.11%
q6 3.47 2.67 -0.795 77.07%
q7 4.13 6.21 2.075 150.18%
q8 5.61 3.35 -2.257 59.77%
q9 17.55 18.08 0.531 103.03%
q10 10.10 10.51 0.407 104.03%
q11 35.09 35.26 0.171 100.49%
q12 1.41 1.51 0.097 106.85%
q13 5.57 5.60 0.026 100.46%
q14a 41.81 44.21 2.402 105.75%
q14b 39.94 41.63 1.696 104.25%
q15 2.60 3.81 1.207 146.40%
q16 40.63 39.72 -0.916 97.75%
q17 5.46 5.72 0.256 104.69%
q18 6.21 9.08 2.864 146.09%
q19 3.14 3.49 0.347 111.03%
q20 1.33 1.45 0.117 108.79%
q21 1.05 1.20 0.151 114.38%
q22 9.59 7.98 -1.612 83.19%
q23a 82.52 81.64 -0.873 98.94%
q23b 101.74 99.94 -1.796 98.23%
q24a 81.61 72.11 -9.503 88.36%
q24b 70.73 79.45 8.716 112.32%
q25 5.39 4.56 -0.834 84.54%
q26 4.98 10.31 5.333 207.14%
q27 2.95 2.66 -0.286 90.30%
q28 20.27 22.02 1.747 108.62%
q29 6.77 7.00 0.231 103.40%
q30 4.05 4.12 0.064 101.58%
q31 6.13 5.95 -0.179 97.09%
q32 1.05 1.24 0.194 118.51%
q33 6.89 4.75 -2.146 68.87%
q34 4.97 4.66 -0.309 93.79%
q35 6.64 7.93 1.289 119.43%
q36 3.28 3.18 -0.100 96.95%
q37 4.05 3.93 -0.117 97.12%
q38 11.99 11.84 -0.151 98.74%
q39a 3.30 3.35 0.041 101.25%
q39b 3.01 2.79 -0.218 92.76%
q40 3.67 3.83 0.163 104.44%
q41 0.57 0.56 -0.013 97.71%
q42 0.81 0.94 0.136 116.84%
q43 4.58 3.55 -1.036 77.40%
q44 7.45 7.08 -0.368 95.06%
q45 3.41 3.51 0.097 102.86%
q46 7.91 3.03 -4.885 38.25%
q47 14.91 14.84 -0.071 99.53%
q48 4.43 4.40 -0.034 99.22%
q49 7.34 7.41 0.069 100.94%
q50 21.21 24.44 3.231 115.23%
q51 8.43 8.72 0.294 103.49%
q52 0.93 0.99 0.053 105.66%
q53 4.13 1.79 -2.342 43.30%
q54 3.42 3.30 -0.117 96.57%
q55 1.02 0.94 -0.078 92.40%
q56 4.63 4.38 -0.243 94.76%
q57 8.94 8.52 -0.427 95.22%
q58 5.36 2.71 -2.647 50.63%
q59 14.30 16.41 2.107 114.73%
q60 4.78 5.10 0.320 106.69%
q61 5.37 5.28 -0.091 98.30%
q62 3.91 4.25 0.343 108.77%
q63 1.80 1.85 0.056 103.10%
q64 50.48 49.92 -0.565 98.88%
q65 13.63 13.56 -0.071 99.48%
q66 3.07 3.01 -0.061 98.01%
q67 362.16 354.16 -8.001 97.79%
q68 3.88 3.50 -0.384 90.12%
q69 6.20 6.78 0.584 109.42%
q70 8.18 8.76 0.571 106.98%
q71 2.25 2.27 0.018 100.78%
q72 190.11 191.86 1.745 100.92%
q73 2.26 2.29 0.025 101.12%
q74 23.35 21.60 -1.758 92.47%
q75 23.64 23.74 0.101 100.43%
q76 7.41 7.03 -0.383 94.83%
q77 2.06 1.78 -0.278 86.50%
q78 38.04 39.90 1.854 104.87%
q79 3.46 3.47 0.008 100.23%
q80 10.82 10.58 -0.247 97.72%
q81 4.49 4.53 0.039 100.88%
q82 6.43 8.65 2.216 134.45%
q83 1.52 1.46 -0.060 96.07%
q84 2.99 2.93 -0.063 97.89%
q85 6.95 7.13 0.185 102.67%
q86 4.97 3.19 -1.771 64.34%
q87 12.20 12.28 0.079 100.65%
q88 16.46 16.77 0.312 101.89%
q89 2.58 2.65 0.066 102.54%
q90 4.36 3.40 -0.960 77.99%
q91 4.30 2.57 -1.730 59.81%
q92 1.12 1.14 0.026 102.30%
q93 30.27 31.14 0.870 102.87%
q94 21.16 21.35 0.193 100.91%
q9 81.41 81.54 0.130 100.16%
q5 2.43 6.35 3.922 261.69%
q96 12.64 11.98 -0.659 94.79%
q97 1.91 1.82 -0.091 95.24%
q98 9.22 9.80 0.581 106.30%
q99 9.22 9.80 0.581 106.30%
total 1894.45 1895.41 0.962 100.05%

@ulysses-you ulysses-you merged commit 7616803 into apache:main May 28, 2024
40 checks passed
@ulysses-you ulysses-you deleted the clean branch May 28, 2024 07:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants