-
Notifications
You must be signed in to change notification settings - Fork 332
Non equivalence grouping
We have studied the essence of grouping operations, i.e., partitioning a set into several subsets according to certain rules. However, the focus of the discussion is to restore the steps of grouping operations, and there is no discussion of partition rules. In the examples, certain grouping key values are used to define partition rules, which is the method used in SQL.
We refer to this partition method as equivalence grouping.
Equivalence grouping is equivalent to defining an equivalence relationship on a set: members (records) with equal key values are considered equivalent. The equivalence relationship is a mathematical concept that will not be elaborated here. It can be proven that any equivalence relationship can completely partition the original set into several subsets, and the members of each subset are equivalent to each other.
Complete partitioning has two properties:
- No empty subset.
- Any member of the original set belongs to and only belongs to one subset.
When examining equivalence grouping, we will find that the results of this grouping can accurately satisfy the characteristics of complete partitioning.
If there are equivalence grouping and complete partitioning, are there also non equivalence grouping and incomplete partitioning? Is there any other way to generate complete partitioning? Do these have business meaning?
The answer is yes.
For example, we need to count the number of male and female employees. We can write it as follows:
SELECT gender,COUNT(*) FROM employee GROUP BY gender
But if all the employees in the company are male or female, the calculation result will only be one row, which may not be what we want.
To solve this problem, we can design a grouping scheme: first list a set, then compare the key value of the set members to be grouped with the criterion set members, and divide them into one subset if they are the same. Finally, the number of subsets is the same as the number of members of the criterion set. This type of grouping is called alignment grouping.
SPL provides alignment grouping operation, and to count the number of male and female employees can be written as follows:
a=[Male,Female] // Criterion set
g=employee.align(a,gender) // Use function align to implement alignment grouping and split the set
g.new(a(#),~.len()) // Use the grouped subsets to calculate aggregate value
It can be imagined that this type of alignment grouping is very common in daily statistics, such as by region or department. The benchmark set can be listed in advance, and we often require the result set to appear in the order of the criterion set. Equivalence grouping cannot guarantee this order, and sorting is necessary. When sorting, this criterion set should still be provided because the original set member attributes do not have this information.
Alignment grouping may result in empty subset, and it cannot guarantee that any member of the original set will be put into one subset (such as some unimportant members not being included in the criterion set). However, alignment grouping can ensure that each member only appears in at most one subset.
We can also generalize alignment grouping to more general enumeration grouping.
Enumeration grouping refers to specifying a set of conditions in advance, using the set members to be grouped as parameters to calculate this batch of conditions, and those who meet the condition are divided into a subset corresponding to the condition, and the subset in the grouping result corresponds one-to-one with the pre specified condition.
For example, grouping employees by age range to count the number can be written as follows in SPL:
a=[?<=30,?<=40,?>40] // Use ? to indicate the parameter to be passed in
g=employee.enum(a,age) // Use function enum to implement enumeration grouping and split the set
....
Obviously, enumeration grouping is also not uncommon in daily business.
Enumeration grouping and alignment grouping are similar in that they both require a criterion set to be listed first. In fact, alignment grouping is a special type of enumeration grouping. However, the difference is that enumeration grouping may create subsets with duplicate members, which is known as repeatable grouping.
a=[?<=30,?>20 && ?<=40,?>50] // Conditions overlap
g=employee.enum@r(a,age)
Repeatable grouping is relatively rare in practical business, but understanding it can also help to understand the essence of grouping operations again.
On the surface, there may seem to be a significant difference between alignment grouping, enumeration grouping, and GROUP BY of SQL, but once you understand the essence of grouping operations, you will understand that they are actually the same thing: partitioning a set into several subsets. Only the methods of partitioning are different.
There are other grouping methods that do not rely entirely on member attributes, but they are still a method of "partitioning a set into subsets", which we will discuss later.
Another question is, SQL only provides equivalence grouping, will it not be enough? How does SQL solve the problems of alignment grouping and enumeration grouping?
In fact, SQL has complete computing capability, and both types of non equivalence grouping mentioned above can be converted to equivalence grouping, but it is much troublesome.
For alignment grouping, we can do a LEFT JOIN on the criterion set and the set to be grouped and then perform a GROUP BY on this result set to implement the effect of alignment grouping. Be sure to use Left JOIN, as using JOIN may result in losing empty subsets, while using Full JOIN may add members outside the benchmark set.
The enumeration grouping is also similar, but the statement will be more complex, as we have to design the conditions of JOIN based on the enumeration conditions, and it is difficult to provide a universal writing method.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code