Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-4713][CORE] Fix invalid children caused by std::move in RowVectorStream #4753

Merged
merged 1 commit into from
Feb 23, 2024

Conversation

liujiayi771
Copy link
Contributor

@liujiayi771 liujiayi771 commented Feb 23, 2024

What changes were proposed in this pull request?

In #4713, we discovered that executing q23b on large scale TPCDS dataset would result in a core dump. After investigation, we concluded that an optimization introduced in #4628, which removed some unnecessary post-projects, led to the occurrence of the core dump. However, we believe that this is only a symptom and not the root cause of the problem. In #4726, we restored the post-project and continued to investigate the reason behind the core dump that was triggered by the absence of the post-project.

We ultimately identified the root cause of the issue to be that the next method of RowVectorStream moves away the children of RowVector, leaving the RowVector's children in an indeterminate state. Since the aggregation operation reuses the output RowVector, accessing a child that has been moved away during reuse can lead to a core dump.

  1. Why doesn't a core dump occur when there is a post-project?
    FilterProject operator in Velox does not reuse the output RowVector, it always creates a new RowVector to return.

  2. Why doesn't the bug get triggered on small datasets?
    The bug is only triggered when there is a reuse scenario occurring with multiple batches being outputted by the aggregation. With small data, the aggregation would only output a single batch, hence no issue arises. However, if the batch size is reduced, this bug can also be triggered.

  3. Does the bug occur when the aggregation outputs multiple batches?
    The bug is triggered when the downstream of the aggregation is a RowVectorStream operator.

After the fix, we can revert the optimization of removing unnecessary post-projects introduced in #4628.

How was this patch tested?

Testing 1T TPCDS q23b.

Copy link

#4713

Copy link

Run Gluten Clickhouse CI

@liujiayi771
Copy link
Contributor Author

cc @zhouyuan, @rui-mo, @ulysses-you, thanks.

Copy link
Contributor

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Verified on internal TPC-DS test.

@rui-mo rui-mo changed the title [GLUTEN-4713][CORE] Fix RowVectorStream not move the children of RowVector [GLUTEN-4713][CORE] Fix invalid children caused by std::move in RowVectorStream Feb 23, 2024
Copy link
Contributor

@ulysses-you ulysses-you left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thank you

@ulysses-you
Copy link
Contributor

/Benchmark Velox

@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_4753_time.csv log/native_master_02_23_2024_9ceade94a_time.csv difference percentage
q1 33.78 34.69 0.909 102.69%
q2 22.67 24.33 1.668 107.36%
q3 37.82 38.86 1.041 102.75%
q4 37.78 36.08 -1.697 95.51%
q5 70.88 72.24 1.362 101.92%
q6 7.42 7.31 -0.110 98.52%
q7 83.65 84.62 0.973 101.16%
q8 85.91 85.39 -0.511 99.41%
q9 121.99 124.53 2.545 102.09%
q10 45.19 44.23 -0.963 97.87%
q11 20.41 19.94 -0.468 97.71%
q12 28.69 26.67 -2.017 92.97%
q13 45.48 44.33 -1.160 97.45%
q14 17.98 20.85 2.868 115.95%
q15 27.56 29.04 1.475 105.35%
q16 14.30 14.19 -0.115 99.20%
q17 102.03 102.20 0.172 100.17%
q18 149.50 150.35 0.850 100.57%
q19 15.50 14.06 -1.441 90.70%
q20 27.07 26.19 -0.879 96.75%
q21 224.87 226.58 1.706 100.76%
q22 13.86 13.90 0.044 100.32%
total 1234.32 1240.58 6.254 100.51%

@ulysses-you ulysses-you merged commit bff3e59 into apache:main Feb 23, 2024
21 checks passed
@liujiayi771 liujiayi771 deleted the value-stream-move branch February 23, 2024 06:25
@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_4753_time.csv log/native_master_02_23_2024_9ceade94a_time.csv difference percentage
q1 32.45 34.69 2.240 106.90%
q2 25.46 24.33 -1.131 95.56%
q3 37.69 38.86 1.171 103.11%
q4 38.84 36.08 -2.760 92.89%
q5 70.38 72.24 1.867 102.65%
q6 7.04 7.31 0.267 103.79%
q7 87.27 84.62 -2.656 96.96%
q8 85.60 85.39 -0.202 99.76%
q9 122.22 124.53 2.306 101.89%
q10 43.20 44.23 1.022 102.37%
q11 20.14 19.94 -0.204 98.99%
q12 26.50 26.67 0.169 100.64%
q13 45.40 44.33 -1.074 97.63%
q14 17.85 20.85 3.004 116.83%
q15 29.93 29.04 -0.893 97.02%
q16 13.91 14.19 0.273 101.97%
q17 102.99 102.20 -0.789 99.23%
q18 150.78 150.35 -0.425 99.72%
q19 13.77 14.06 0.288 102.09%
q20 28.13 26.19 -1.941 93.10%
q21 227.60 226.58 -1.025 99.55%
q22 13.66 13.90 0.235 101.72%
total 1240.83 1240.58 -0.259 99.98%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants