-
Notifications
You must be signed in to change notification settings - Fork 339
SQL Performance Enhancement:Segmentation based Grouping & Aggregation
The essay analyzes the fundamental reason behind slow SQL statements and provides the solution to make improvements through code samples. Looking ${article} for details.
Suppose we divide data table T by field x according to set X={X1<X2<…<Xn}and perform a specific aggregate operation by segment number. The beginning segment (the 0th one) meets the condition x<X1, the ith (0<i<n) segment meets the condition represented by the left-closed and right-open interval [Xi<x<Xi+1), and the nth segment satisfies the condition x>=Xn. Certain segments are defined by the right-closed and left-open intervals.
Usually, the SQL for performing segmentation and summarization uses CASE WHEN statement:
select xgroup,sum(x),avg(x) from
(select case
when x< X<sub>1</sub> then 0
when x< X<sub>2</sub> then 1
…
else n
end as xgroup,
x
from T)
group by xgroup
The CASE WHEN statement checks conditions one by one in turn and stops if the current record meets the condition; and moves on to the next condition if the current one does not satisfy it. The x field of each record needs to be matched multiple times to finally decide which segment it belongs to. The number of matches is one at least and n-1 at most. The average number of matches of the x field is n/2 and the degree of complexity is O(n). When the future number of segments is big and the total volume of data is large, the overall performance of this algorithm is bad.
When X1,X2,…,Xn is dynamically changed (a parameter passed in at the front-end, for instance), it is hard to express it in SQL, let alone performance optimization.
Since set X is ordered, we can use binary search to speed up the segmentation process. The degree of complexity is O(log2n). The algorithm is much faster than the order-based segmentation when n is relatively large. Yet, it can be slower when n is small because of its complexity. Usually when n is greater or equal to 8, the binary search algorithm is really efficient. There are generally several scores of members in set X. Even it is unordered, we can first perform an in-memory sorting and the overall performance is almost not affected.
Moreover, the binary search algorithm lets set X to be easily passed in as a parameter for dynamic segmentation.
Define parameter argX first to pass in the sequence of segments, like [0,100,230,450,480,566,798…], specified by X. Sort the sequence parameter if it is unordered.
A | |
---|---|
1 | =arg_X.sort() |
2 | =file("T.ctx").open().cursor(x) |
3 | =A2.groups(A1.pseg(x):xgroup;sum(x),avg(x)) |
A1: Pass in the defined parameter and sort it out.
A2: Open the specified composite table and create a cursor based on its x field.
A3: Since the number of segments is small, we use the small result set method to perform grouping and sum and calculate the average on x field. The pseg function is used as the grouping expression where the binary search is employed to calculate the segment number in which x field falls in A1.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code