-
Notifications
You must be signed in to change notification settings - Fork 335
SPL:grouping & aggregation
Grouping operation is to assign the members of a set with the same attributes to the same group. For example, employee tables are grouped by department, and each group of employees has the same department; sales tables can be grouped by sales year, and each group has sales records of the same year, and so on.
Sometimes we need to calculate the data by different groups, and it is when grouping & aggregation is needed. Grouping & aggregation means grouping the data according to certain regulations, and then performing aggregation operation for each group.
[e.g. 1] Query the total sales of each month in 2014 according to the sales table. Some of the data are as follows:
ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |
---|---|---|---|---|---|
10400 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |
10401 | HANAR | 2014/01/01 | 1 | 17 | 3868.6 |
10402 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |
10403 | ERNSH | 2014/01/03 | 4 | 42 | 1005.9 |
10404 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |
… | … | … | … | … | … |
The A.groups() function is provided in SPL for grouping & aggregation.
The SPL script looks like this:
A | |
---|---|
1 | =T("Sales.csv").select(year(ORDERDATE)==2014) |
2 | =A1.groups(month(ORDERDATE):MONTH; sum(AMOUNT):AMOUNT) |
A1: import sales table and select the records of 2014.
A2: use the A.groups() function to group data by month and aggregate the total sales of each month.
[e.g. 2] Query the departments with an average age of less than 45 in New York State according to the employee table. Some of the data are as follows:
ID | NAME | BIRTHDAY | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | 1974/11/20 | California | R&D | 7000 |
2 | Ashley | 1980/07/19 | New York | Finance | 11000 |
3 | Rachel | 1970/12/17 | New Mexico | Sales | 9000 |
4 | Emily | 1985/03/07 | Texas | HR | 7000 |
5 | Ashley | 1975/05/13 | Texas | R&D | 16000 |
… | … | … | … | … | … |
The SPL script looks like this:
A | |
---|---|
1 | =T("Employee.csv").select(STATE=="New York") |
2 | =A1.groups(DEPT; avg(age(BIRTHDAY)):AVGAGE) |
3 | =A2.select(AVGAGE < 45) |
A1: import the employee table and select the records of New York State.
A2: use the A.groups() function to calculate the average age by department.
A3: select the records with an average age of less than 45 in the result set after grouping & aggregation.
In SPL, the A.select() function can be used to select data both before and after grouping.
In SPL, the A.groups() function not only supports aggregation functions supported by SQL, such as sum, count, avg, max, min,etc., but also supports some of the more common ones that are not supported by the Group BY statement of SQL: top (first Ns/last Ns) , iterate (iteration function) , icount (count distinct value) , median (median, some database support) , and so on.
[e.g. 3] According to the sales table, summarize the times that customers are ranked as the top three sales of each month in 2014. Some of the data are as follows:
ID | CUSTOMERID | ORDERDATE | SELLERID | PRODUCTID | AMOUNT |
---|---|---|---|---|---|
10400 | EASTC | 2014/01/01 | 1 | 27 | 3063.0 |
10401 | HANAR | 2014/01/01 | 1 | 17 | 3868.6 |
10402 | ERNSH | 2014/01/02 | 8 | 70 | 2713.5 |
10403 | ERNSH | 2014/01/03 | 4 | 42 | 1005.9 |
10404 | MAGAA | 2014/01/03 | 2 | 74 | 1675.0 |
… | … | … | … | … | … |
The SPL script looks like this:
A | |
---|---|
1 | =T("Sales.csv").select(year(ORDERDATE)==2014) |
2 | =A1.groups(month(ORDERDATE):MONTH; top(-3;AMOUNT):TOP3) |
3 | =A2.conj(TOP3).groups(CUSTOMERID; count(~):COUNT) |
A1: import the sales table and select the records of 2014.
A2: use the A.groups() function to aggregate the top three by month.
A3: after concatenating the top three records of each month, use the A.groups() function to count the times for each customer.
[e.g. 4] Summarize the number of students who get the median scores or fail the exam (less than 60 marks) of each class in every subject. Some of the data are as follows:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
1 | 1 | English | 84 |
1 | 1 | Math | 77 |
1 | 1 | PE | 69 |
1 | 2 | English | 81 |
1 | 2 | Math | 80 |
… | … | … | … |
The solution of is:
A | |
---|---|
1 | =T("Scores.csv") |
2 | =A1.groups(CLASS,SUBJECT; median(,SCORE):MEDIAN_SCORE, count(SCORE<60):FAIL_COUNT) |
A1: import the score table.
A2: use the A.groups() function to aggregate the number of median and failed students by class and subject.
In SPL, the count() function can count the number of failed students directly by using the parameter “SCORE<60” without selecting the failed records in advance. This makes it easy to calculate both the median and the number.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code