-
Notifications
You must be signed in to change notification settings - Fork 332
Restore the original meaning of grouping operations
Grouping is a common operation in databases, but not everyone can deeply understand it.
The essence of grouping operation is to partition a set into several subsets according to certain rules, which means that the return value should be a set composed of sets. However, people generally do not care much about the member sets that make up this set (i.e. grouped subsets), but are more interested in the aggregated values of these subsets. Therefore, grouping operation is often accompanied by further aggregate calculation of subsets.
SQL is designed in this way. When there is a GROUP BY clause, the SELECT section can only write aggregation expressions apart from grouping fields. Of course, another reason is the discreteness problem of SQL, which cannot return a set of sets and can only force aggregation operations.
Over time, people may think that grouping always needs to be combined with subsequent aggregation operations, and forget that grouping and aggregation are actually two independent steps.
However, there are still times when we are more interested in these grouped subsets rather than aggregated values.
For example, if we want to find out which employees in the company have the same birthdays as other employees, a simple idea is to group employees by birthday, then find subsets of groups with more than 1 members, and then make a union. In this case, we are not only interested in the aggregation value (the number of members in the grouped subset), but also more interested in the grouped subsets themselves.
This operation can be quite verbose when written in SQL, requiring subqueries and traversing the original set twice.
SELECT * FROM employee WHERE birthday IN
( SELECT birthday FROM employee GROUP BY birthday HAVING COUNT(*)>1 )
By the way, we assume that the birthdate field is the birthday here, in fact, our daily meaning of birthday does not have a year, while the birthdate field in the data table usually has one. At this point, it is necessary to convert the birthday to month and day before doing GROUP and WHERE. However, due to the incomplete set-orientation of SQL, it is difficult to write IN operations involving two members. This birthday needs to be rewritten to look like month(birthday())*100+day(birthday), i.e., assembled into one expression to be judged by IN, which is very cumbersome.
SPL with discreteness has more thorough set-oriented features, and can provide a set of sets as a data type, thus can preserve the grouped subsets. This way, grouping and aggregation can be restored to two steps, and the above operation can be clearly written:
employee.group(month(birthday),day(birthday)).select(~.len()>1).conj()
Group by month/day of birthday, filter out subsets of groups with more than 1 member, and then get the union. Grouping by multiple keys is a natural thing, and there is no need to combine them into one expression and then use IN.
In fact, this statement still needs to traverse the data twice, but it is only a count and does not require comparison like SQL, and the performance is much better.
Taking a step back, even if we are only interested in aggregated values, we may need to keep these grouped subsets for reuse and calculate multiple aggregated values, rather than discarding them after completing one aggregation and re grouping them the next time we calculate. Grouping is a costly operation, and nowadays the HASH method is commonly used to implement grouping. Calculating and comparing HASH values is much more complex than simply traversing.
For example, we calculate the number of people in each department, and then calculate the average age of people in departments with more than 10 people. This needs to be written in two statements in SQL, as the latter requires a HAVING condition:
SELECT department,COUNT(*) FROM employee GROUP BY department
SELECT department,AVERAGE(age) FROM employee GROUP BY department HAVING COUNT(*)>=10
The GROUP action here needs to be executed twice.
If the grouped subsets can be kept, only one grouping is needed, and there is no need to repeat the grouping in SPL:
g=employee.group(department)
g.new(~.department,~.len())
g.select(~.len()>=10).new(~.department,~.avg(age))
It is also possible that we are only interested in a certain aggregate value, but this aggregate value is difficult to calculate and cannot be simply calculated using SUM/COUNT. It requires programming, and in this case, we also need to retain the grouped subsets, which is difficult to implement in SQL. We will talk about this when discussing aggregation operations later on.
The result of grouping is a set of sets, which is still a set, so obviously further grouping can be done.
g1=employee.group(year(birthday)) // Group by year of birth
g2=g1.group(year(birthday)%100\10) // Group all grouped subsets by decade
g3=g1.(~.group(month(birthday))) // Group each grouped subset by month of birth
The last two steps will result in a set of sets of sets, and situations of three layers or deeper are rarely encountered in real business, but can be used to understand the thinking mode of sets and the essence of grouping operations.
Let’s talk a bit more. SQL has specially designed the HAVING keyword for filtering the result set after GROUP, which makes many beginners confused about its understanding and application. In fact, HAVING is conceptually redundant, and it is no different from WHERE. It is simply because SQL cannot retain subsets after grouping, and grouping and aggregation need to be written in one statement, and it also needs to distinguish from WHERE, and then “HAVING” is created. If it is possible to retain the grouped subsets and perform step-by-step calculation, “HAVING” is not necessary. There is not and no need for this thing in SPL.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code