
[VL] TPCDS Performance drop after new operator "VeloxAppendBatches" #6694

Open
Surbhi-Vijay opened this issue Aug 2, 2024 · 16 comments
Labels: bug (Something isn't working), triage

@Surbhi-Vijay
Contributor

Backend

VL (Velox)

Bug description

We have observed a performance drop in TPCDS runs after patch #6009.

Top regressing queries

| QueryId | New runtime | Previous runtime |
|---------|-------------|------------------|
| query64 | 50712 | 22841 |
| query24a | 44883 | 27452 |
| query24b | 45003 | 28742 |

When we disabled the feature with `spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle=false`, we saw the same runtime as in previous runs.

We are using an Azure cluster and reading data from a remote storage account. The regression shows up in VeloxAppendBatches, which in some instances takes a lot of time.

Below are the plan snippets from query64

[screenshot: query64 plan snippet]

[screenshot: query64 plan snippet]

Spark version

Spark-3.4.x

Spark configurations

No response

System information

No response

Relevant logs

No response

@Surbhi-Vijay Surbhi-Vijay added bug Something isn't working triage labels Aug 2, 2024
@Surbhi-Vijay
Contributor Author

cc @zhztheplayer @zhli1142015

@zhztheplayer
Member

zhztheplayer commented Aug 2, 2024

Thank you for reporting.

IIUC, the operator itself doesn't seem to be what slows down your query? The slowest tasks took just 438 ms and 166 ms.

Would you like to share more of the DAG comparisons? In particular, could you check the shuffle write time as well?

@FelixYBW
Contributor

FelixYBW commented Aug 2, 2024

So Q64 elapsed time went from 22841 to 50712. The append operator's input averages 4.15 rows per batch and its output 3070 rows per batch, which benefited performance in our tests.

The operator's overhead is a sequential memcpy, which by itself definitely can't cause the 2x increase in elapsed time. There must be some side effect causing this.
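For intuition, the coalescing step can be sketched roughly like this (a minimal Python illustration, not Velox's actual VeloxAppendBatches implementation; the function name and target size are made up):

```python
# Minimal sketch of batch coalescing before shuffle (illustrative only;
# not the VeloxAppendBatches implementation).

def coalesce_batches(batches, target_rows=3000):
    """Merge a stream of small row batches into batches of up to
    `target_rows` rows. The cost is a sequential copy of every row,
    analogous to the memcpy overhead described above."""
    buffer = []
    for batch in batches:
        buffer.extend(batch)  # sequential copy of the batch's rows
        while len(buffer) >= target_rows:
            yield buffer[:target_rows]
            buffer = buffer[target_rows:]
    if buffer:
        yield buffer  # flush the tail

# Example: 2000 tiny 4-row batches become three large batches.
tiny = [[i] * 4 for i in range(2000)]  # 8000 rows total
sizes = [len(b) for b in coalesce_batches(tiny)]
print(sizes)  # [3000, 3000, 2000]
```

The copy is linear in the number of rows, which is why a 2x elapsed-time regression is unlikely to come from the operator itself.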

@zhli1142015 Do you still have the tool I shared? Can you get the chart of each stage in traceview? Let's see which stage caused the issue and reproduce that stage natively.

[screenshot: per-stage trace chart]

@zhztheplayer zhztheplayer changed the title TPCDS Performance drop after new operator "VeloxAppendBatches" [VL] TPCDS Performance drop after new operator "VeloxAppendBatches" Aug 5, 2024
@Surbhi-Vijay
Contributor Author

Thanks @FelixYBW for explaining. I am trying to come up with a minimal query that showcases the impact. If I can't, I will post a detailed analysis of Q64 and Q24.

@Surbhi-Vijay
Contributor Author

We investigated the regressing queries. The regression is not directly caused by "VeloxAppendBatches" but rather by plan changes that occur when the reported data size is reduced by this feature.

Analysis for query24b
When the coalesceBatch feature is enabled, the reported data size gets reduced, which prompted AQE to change the join type from SHJ to BHJ.

Left Side => With CoalesceBatch enabled
Right Side => With CoalesceBatch disabled

[screenshot: plan comparison]

Here, ColumnarBroadcastExchange takes an additional 8 s after the ColumnarExchange.

[screenshot]

Elsewhere, we observe that the build side of one of the shuffle hash joins changed for the same reason.
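The plan flip described above can be sketched as follows (illustrative Python only, not Spark's actual planner code; the threshold mirrors `spark.sql.autoBroadcastJoinThreshold`, whose default is 10 MB):

```python
# Illustrative sketch: AQE picks a broadcast hash join (BHJ) when the
# build side's reported data size falls under the broadcast threshold,
# otherwise it keeps a shuffled hash join (SHJ). Not Spark's actual code.

AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, Spark's default

def choose_join(build_side_bytes, threshold=AUTO_BROADCAST_THRESHOLD):
    return "BHJ" if build_side_bytes <= threshold else "SHJ"

# Data sizes reported later in this thread for the same exchange:
print(choose_join(int(58.7 * 1024 * 1024)))  # coalescing off -> SHJ
print(choose_join(int(5.2 * 1024 * 1024)))   # coalescing on  -> BHJ
```

So any feature that changes only the "data size" metric, not the data itself, can still flip the physical plan once the metric crosses the threshold.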

@FelixYBW
Contributor

FelixYBW commented Aug 9, 2024

@zhztheplayer @marin-ma Why did the patch cause the plan to change? It looks like a bug. Do we use batch counts instead of row counts when creating the plan?

@DamonZhao-sfu

> So Q64 elapsed time went from 22841 to 50712. The append operator's input averages 4.15 rows per batch and its output 3070 rows per batch, which benefited performance in our tests.
>
> The operator's overhead is a sequential memcpy, which by itself definitely can't cause the 2x increase in elapsed time. There must be some side effect causing this.
>
> @zhli1142015 Do you still have the tool I shared? Can you get the chart of each stage in traceview? Let's see which stage caused the issue and reproduce that stage natively.
>
> [screenshot: per-stage trace chart]

Hi! Could you share the profiling tool in this image? Thanks!

@zhztheplayer
Member

@Surbhi-Vijay

Was there a large difference in shuffle write size?

@marin-ma
Contributor

@Surbhi-Vijay Could you share the metrics details of ColumnarExchange with "VeloxAppendBatches" enabled/disabled?

@Surbhi-Vijay
Contributor Author

> @Surbhi-Vijay Could you share the metrics details of ColumnarExchange with "VeloxAppendBatches" enabled/disabled?

The "ColumnarExchange" below is for the join (store_sales join customer) that converted from SHJ to BHJ in q24b when VeloxAppendBatches is enabled.

Left Side => With CoalesceBatch enabled
Right Side => With CoalesceBatch disabled
[screenshot: ColumnarExchange metrics comparison]

@Surbhi-Vijay
Contributor Author

> Was there a large difference in shuffle write size?

It is reporting the same numbers; there does not seem to be any difference in shuffle write size.

[screenshot]

@marin-ma
Contributor

@Surbhi-Vijay The "data size" metric changed from 58.7M to 5.2M. This could cause a plan change since the Join operation relies on this value to decide whether to use BHJ. However, a 10x reduction in "data size" seems unreasonable to me.

Could you also share the Spark configurations? I've compared TPCDS q24b with VeloxResizeBatches enabled/disabled, but I don't see such a stage producing a different data size.

@FelixYBW
Contributor

@Surbhi-Vijay Any update? It doesn't make sense that the merge-batch operator impacts the shuffle data size.

@Surbhi-Vijay
Contributor Author

@FelixYBW @marin-ma I see this behavior of reduced data size wherever VeloxAppendBatches is applied.

All other metrics (apart from data size) are almost the same, except #batches and #rows/batch, which are expected to change.

The shuffle stage also shows almost exactly the same metrics. At this point, I suspect there may be a bug in how the data size is populated when this feature is enabled.
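One hypothetical mechanism that would produce this pattern (purely an illustration of the suspicion above, not Gluten's actual metric code; the overhead constant and function are invented): if the reported data size included a fixed per-batch overhead, coalescing millions of tiny batches into a few large ones would shrink the metric even though the row payload is identical.

```python
# Hypothetical illustration (not Gluten's code): a "data size" metric that
# adds a fixed per-batch overhead shrinks after coalescing, even though
# the number of rows and the payload bytes are unchanged.

PER_BATCH_OVERHEAD = 64  # assumed fixed metadata bytes counted per batch

def reported_data_size(num_batches, rows_per_batch, bytes_per_row=8):
    payload = num_batches * rows_per_batch * bytes_per_row
    return payload + num_batches * PER_BATCH_OVERHEAD

# The same 8M rows, before and after coalescing:
before = reported_data_size(num_batches=2_000_000, rows_per_batch=4)
after = reported_data_size(num_batches=2_000, rows_per_batch=4_000)
print(before, after)  # 192000000 64128000 -- a ~3x gap from overhead alone
```

This is only one plausible shape of such a bug; pinpointing the real cause requires checking where the metric is populated in the shuffle writer.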

@JunhyungSong

Do we have a solution for this? Does #6670 solve this issue?

@zhztheplayer
Member

> Do we have a solution for this? Does #6670 solve this issue?

I think so. Would you like to help try it out? If it works, we can close this issue.

cc @Surbhi-Vijay
