[VL] Result mismatch in CollectList when partial sort is involved #8184
Comments
@NEUpanning, in Gluten the sort agg is replaced with a hash agg, which may make the preceding sort operator unnecessary. Gluten therefore removes the sort operator unless it is needed to satisfy an ordering requirement.
Thank you for the details. I believe the relevant sort-removal code is now here on the main branch. I think we could preserve the sort as long as the vanilla Spark plan has it, with a hash agg. This requires a fix.
@zhztheplayer I see this feature is implemented in the Gluten 1.2 branch, but the main branch doesn't include it for some reason. For this issue, the CollectList function is replaced by the VeloxCollectList function in the logical optimization phase. Here is the Spark plan:
This leads to Spark using SortAggregateExec.
Backend
VL (Velox)
Bug description
Describe the issue
Reproducing SQL:
Results:
The vanilla result is deterministic and values_list is sorted by the value column:
The Gluten result is non-deterministic and values_list is not sorted, e.g.:
The Gluten physical plan:
Even though the collect_list function is non-deterministic, as stated in the documentation, some ETL tasks in our production environment depend on this behavior in vanilla Spark.
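To illustrate why the output order depends on the sort, here is a minimal plain-Python sketch. It is not Gluten or Spark code: the `collect_list` helper and the partition contents are hypothetical, and it only assumes that an aggregate appends values to each group in the order the rows arrive.

```python
from collections import defaultdict


def collect_list(rows):
    """Group (key, value) rows, appending values in arrival order.

    Hypothetical stand-in for an aggregate that keeps per-group input order.
    """
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)


# Hypothetical contents of one partition; not taken from the issue.
partition = [("a", 3), ("b", 2), ("a", 1), ("b", 5), ("a", 2)]

# Partial sort kept: rows reach the aggregate ordered by value.
with_sort = collect_list(sorted(partition, key=lambda row: row[1]))

# Partial sort removed: rows reach the aggregate in their stored order.
without_sort = collect_list(partition)

print(with_sort)
print(without_sort)
```

In this sketch the lists come out sorted only when the rows are sorted before aggregation, which is the behavior the ETL tasks mentioned above rely on.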
Root cause for this issue
We can see from the Gluten plan that the Sort operator has been removed. This change appears to be due to this code snippet: code link.
I'm wondering why the partial sort added by SQL 'sort by' needs to be removed for SortAggregateExec. Would it be possible to retain the partial sort operator to resolve this issue?
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response