[VL] Optimize the performance of hash based shuffle by accumulating batches #5951
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on GitHub Issues? https://github.com/apache/incubator-gluten/issues Then could you also rename the commit message and pull request title in the following format?
Thanks @XinShuoWang. Please fix the code style so that we can run the TPCH benchmark :)
Is this part already included in the PR?
Force-pushed from 742ecb7 to f669b6a
Thank you for the improvement. The ideal case for the current split function is that the input batch size is as large as possible while all columns still fit into the L2 cache. Once the column data can't fit into L2, split performance drops dramatically. On the other hand, if the batch size is too small, the overhead of preparing the split can't be ignored, so we should have room to improve in this case. Also, in the current implementation we must flatten the RowVector before splitting; the flatten itself has overhead, which hasn't been analyzed.
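A minimal sketch of the sizing trade-off described in this comment, not Gluten's actual logic: pick the largest row count whose flattened columns still fit in L2. The cache-size constant and the `bytesPerRow` parameter are assumptions for illustration only.

```cpp
// Hypothetical sketch: choose a split batch size bounded by L2 capacity.
#include <algorithm>
#include <cstdint>

constexpr int64_t kL2CacheBytes = 1 << 20;  // assumed 1 MiB L2 cache

int64_t pickSplitBatchRows(int64_t bytesPerRow, int64_t maxRows) {
  // Larger batches amortize the per-split preparation cost, but once the
  // flattened columns exceed L2 the split slows down sharply, so cap rows.
  int64_t rowsFittingL2 = kL2CacheBytes / std::max<int64_t>(bytesPerRow, 1);
  return std::max<int64_t>(1, std::min(maxRows, rowsFittingL2));
}
```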
@FelixYBW I added a log at the above location and found that the interval between …
Force-pushed from 34f87f0 to 9ffbccb
Force-pushed from 9ffbccb to 398a42b
Oops. I missed this PR before opening this for a similar purpose...
The change makes sense to me. I think it's reasonable to merge this and use #6009 as a follow-up, which adds an individual Spark operator controlling this behavior so it can be reused by other operators in the future (say, joins or aggs) through some kind of strategy. Let me know if you have any thoughts. @XinShuoWang @FelixYBW @marin-ma
@zhztheplayer @Yohahaha Can you help me fix the failing CI? The error message is …
I think it's Uniffle's error, and it has been fixed on the main branch; try rebasing.
Velox has a feature to merge small vectors for the output of aggs, filters, and joins; see https://github.com/facebookincubator/velox/pull/7899/files, but that PR is not merged to master. Shall we continue the PR and get it merged to master?
Sounds great to me that someone could take on the remaining work of facebookincubator/velox#7899. This batch-resizing work doesn't actually have to live in individual operators, as long as the logic can be reused by both Velox and the Gluten shuffle, and it would be enough to expose some extra metrics showing the resizing time from the relevant operators. Before that, we can still use the operator to do optimizations on the Gluten side.
What changes were proposed in this pull request?
I used perf to observe the benchmark and found that the most time-consuming functions were `splitFixedWidthValueBuffer` and `splitBinaryType`. However, current columnar computing engines (such as StarRocks) implement the `split` function with the same idea of trading random-read memory overhead for sequential-write memory overhead, so I think there is not much room to optimize the `split` function itself.

I found that when ShuffleBatchSize is increased, performance improves significantly. I think the performance benefits mainly come from the following aspects:
1. It makes full use of sequential memory writes in the split stage. When PartitionNum is 10000 and ShuffleBatchSize is 4096 (the default value in the benchmark), each partition is allocated at most 1 row of data (measured by logging statistics in the benchmark), which clearly cannot exploit sequential memory writes.
2. It reduces the number of function calls and the number of memory allocations.
Therefore, I implemented this PR to cache the data to be shuffled, which optimizes ShuffleWrite performance. For specific test data, please refer to the screenshot below.
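A minimal sketch of the accumulate-then-split idea described above, not the PR's actual implementation: small input batches are buffered until a row target is reached, then split once as a larger batch. `Batch` (a single int64_t column), `AccumulatingSplitter`, and the injected `split` callback are hypothetical stand-ins for the real Velox/Gluten types.

```cpp
// Toy sketch: buffer small batches and split them as one larger batch,
// so each partition receives longer sequential write runs per split call.
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Batch = std::vector<int64_t>;  // one column of rows, for illustration

class AccumulatingSplitter {
 public:
  AccumulatingSplitter(size_t targetRows, std::function<void(const Batch&)> split)
      : targetRows_(targetRows), split_(std::move(split)) {}

  void add(const Batch& in) {
    // Accumulate rows instead of splitting each small batch immediately.
    buffered_.insert(buffered_.end(), in.begin(), in.end());
    if (buffered_.size() >= targetRows_) flush();
  }

  void flush() {
    if (buffered_.empty()) return;
    split_(buffered_);  // one large split call instead of many small ones
    buffered_.clear();
  }

 private:
  size_t targetRows_;
  std::function<void(const Batch&)> split_;
  Batch buffered_;
};
```

Fewer, larger split calls also amortize the per-call preparation and allocation overhead mentioned in point 2 above.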
I think this PR could also decide whether to cache data based on memory usage, thereby avoiding ShuffleWrite OOM problems.
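Extending the sketch above, a hedged illustration of that memory-aware control; `bufferedBytes_` and `memoryBudgetBytes_` are hypothetical fields tracking the accumulated buffer size and a configured budget.

```cpp
// Hypothetical extension of AccumulatingSplitter: flush when either the
// row target or a memory budget is reached, so accumulation itself can
// never drive ShuffleWrite out of memory. (Fragment; assumes the fields
// named above have been added to the class.)
void AccumulatingSplitter::maybeFlush() {
  if (buffered_.size() >= targetRows_ || bufferedBytes_ >= memoryBudgetBytes_) {
    flush();
  }
}
```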
How was this patch tested?
Command: (benchmark command screenshot)
Before optimization: (benchmark results screenshot)
After optimization: (benchmark results screenshot)