Skip to content

SPL:grouping & aggregation

esProcSPL edited this page Jul 22, 2024 · 1 revision

Grouping operation is to assign the members of a set with the same attributes to the same group. For example, employee tables are grouped by department, and each group of employees has the same department; sales tables can be grouped by sales year, and each group has sales records of the same year, and so on.

Sometimes we need to calculate the data by different groups, and it is when grouping & aggregation is needed. Grouping & aggregation means grouping the data according to certain regulations, and then performing aggregation operation for each group.

[e.g. 1] Query the total sales of each month in 2014 according to the sales table. Some of the data are as follows:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
10400 EASTC 2014/01/01 1 27 3063.0
10401 HANAR 2014/01/01 1 17 3868.6
10402 ERNSH 2014/01/02 8 70 2713.5
10403 ERNSH 2014/01/03 4 42 1005.9
10404 MAGAA 2014/01/03 2 74 1675.0

The A.groups() function is provided in SPL for grouping & aggregation.

The SPL script looks like this:

A
1 =T("Sales.csv").select(year(ORDERDATE)==2014)
2 =A1.groups(month(ORDERDATE):MONTH; sum(AMOUNT):AMOUNT)

A1: import sales table and select the records of 2014.

A2: use the A.groups() function to group data by month and aggregate the total sales of each month.

[e.g. 2] Query the departments with an average age of less than 45 in New York State according to the employee table. Some of the data are as follows:

ID NAME BIRTHDAY STATE DEPT SALARY
1 Rebecca 1974/11/20 California R&D 7000
2 Ashley 1980/07/19 New York Finance 11000
3 Rachel 1970/12/17 New Mexico Sales 9000
4 Emily 1985/03/07 Texas HR 7000
5 Ashley 1975/05/13 Texas R&D 16000

The SPL script looks like this:

A
1 =T("Employee.csv").select(STATE=="New York")
2 =A1.groups(DEPT; avg(age(BIRTHDAY)):AVGAGE)
3 =A2.select(AVGAGE < 45)

A1: import the employee table and select the records of New York State.

A2: use the A.groups() function to calculate the average age by department.

A3: select the records with an average age of less than 45 in the result set after grouping & aggregation.

In SPL, the A.select() function can be used to select data both before and after grouping.

In SPL, the A.groups() function not only supports aggregation functions supported by SQL, such as sum, count, avg, max, min,etc., but also supports some of the more common ones that are not supported by the Group BY statement of SQL: top (first Ns/last Ns) , iterate (iteration function) , icount (count distinct value) , median (median, some database support) , and so on.

[e.g. 3] According to the sales table, summarize the times that customers are ranked as the top three sales of each month in 2014. Some of the data are as follows:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
10400 EASTC 2014/01/01 1 27 3063.0
10401 HANAR 2014/01/01 1 17 3868.6
10402 ERNSH 2014/01/02 8 70 2713.5
10403 ERNSH 2014/01/03 4 42 1005.9
10404 MAGAA 2014/01/03 2 74 1675.0

The SPL script looks like this:

A
1 =T("Sales.csv").select(year(ORDERDATE)==2014)
2 =A1.groups(month(ORDERDATE):MONTH; top(-3;AMOUNT):TOP3)
3 =A2.conj(TOP3).groups(CUSTOMERID; count(~):COUNT)

A1: import the sales table and select the records of 2014.

A2: use the A.groups() function to aggregate the top three by month.

A3: after concatenating the top three records of each month, use the A.groups() function to count the times for each customer.

[e.g. 4] Summarize the number of students who get the median scores or fail the exam (less than 60 marks) of each class in every subject. Some of the data are as follows:

CLASS STUDENTID SUBJECT SCORE
1 1 English 84
1 1 Math 77
1 1 PE 69
1 2 English 81
1 2 Math 80

The solution of is:

A
1 =T("Scores.csv")
2 =A1.groups(CLASS,SUBJECT; median(,SCORE):MEDIAN_SCORE, count(SCORE<60):FAIL_COUNT)

A1: import the score table.

A2: use the A.groups() function to aggregate the number of median and failed students by class and subject.

In SPL, the count() function can count the number of failed students directly by using the parameter “SCORE<60” without selecting the failed records in advance. This makes it easy to calculate both the median and the number.

Sales.csv

Employee.csv

Scores.csv

Clone this wiki locally