SPL：grouping & aggregation

Grouping operation is to assign the members of a set with the same attributes to the same group. For example, employee tables are grouped by department, and each group of employees has the same department; sales tables can be grouped by sales year, and each group has sales records of the same year, and so on.

Sometimes we need to calculate the data by different groups, and it is when grouping & aggregation is needed. Grouping & aggregation means grouping the data according to certain regulations, and then performing aggregation operation for each group.

[e.g. 1] Query the total sales of each month in 2014 according to the sales table. Some of the data are as follows:

ID	CUSTOMERID	ORDERDATE	SELLERID	PRODUCTID	AMOUNT
10400	EASTC	2014/01/01	1	27	3063.0
10401	HANAR	2014/01/01	1	17	3868.6
10402	ERNSH	2014/01/02	8	70	2713.5
10403	ERNSH	2014/01/03	4	42	1005.9
10404	MAGAA	2014/01/03	2	74	1675.0
…	…	…	…	…	…

The A.groups() function is provided in SPL for grouping & aggregation.

The SPL script looks like this:

	A
1	=T("Sales.csv").select(year(ORDERDATE)==2014)
2	=A1.groups(month(ORDERDATE):MONTH; sum(AMOUNT):AMOUNT)

A1: import sales table and select the records of 2014.

A2: use the A.groups() function to group data by month and aggregate the total sales of each month.

[e.g. 2] Query the departments with an average age of less than 45 in New York State according to the employee table. Some of the data are as follows:

ID	NAME	BIRTHDAY	STATE	DEPT	SALARY
1	Rebecca	1974/11/20	California	R&D	7000
2	Ashley	1980/07/19	New York	Finance	11000
3	Rachel	1970/12/17	New Mexico	Sales	9000
4	Emily	1985/03/07	Texas	HR	7000
5	Ashley	1975/05/13	Texas	R&D	16000
…	…	…	…	…	…

The SPL script looks like this:

	A
1	=T("Employee.csv").select(STATE=="New York")
2	=A1.groups(DEPT; avg(age(BIRTHDAY)):AVGAGE)
3	=A2.select(AVGAGE < 45)

A1: import the employee table and select the records of New York State.

A2: use the A.groups() function to calculate the average age by department.

A3: select the records with an average age of less than 45 in the result set after grouping & aggregation.

In SPL, the A.select() function can be used to select data both before and after grouping.

In SPL, the A.groups() function not only supports aggregation functions supported by SQL, such as sum, count, avg, max, min,etc., but also supports some of the more common ones that are not supported by the Group BY statement of SQL: top (first Ns/last Ns) , iterate (iteration function) , icount (count distinct value) , median (median, some database support) , and so on.

[e.g. 3] According to the sales table, summarize the times that customers are ranked as the top three sales of each month in 2014. Some of the data are as follows:

ID	CUSTOMERID	ORDERDATE	SELLERID	PRODUCTID	AMOUNT
10400	EASTC	2014/01/01	1	27	3063.0
10401	HANAR	2014/01/01	1	17	3868.6
10402	ERNSH	2014/01/02	8	70	2713.5
10403	ERNSH	2014/01/03	4	42	1005.9
10404	MAGAA	2014/01/03	2	74	1675.0
…	…	…	…	…	…

The SPL script looks like this:

	A
1	=T("Sales.csv").select(year(ORDERDATE)==2014)
2	=A1.groups(month(ORDERDATE):MONTH; top(-3;AMOUNT):TOP3)
3	=A2.conj(TOP3).groups(CUSTOMERID; count(~):COUNT)

A1: import the sales table and select the records of 2014.

A2: use the A.groups() function to aggregate the top three by month.

A3: after concatenating the top three records of each month, use the A.groups() function to count the times for each customer.

[e.g. 4] Summarize the number of students who get the median scores or fail the exam (less than 60 marks) of each class in every subject. Some of the data are as follows:

CLASS	STUDENTID	SUBJECT	SCORE
1	1	English	84
1	1	Math	77
1	1	PE	69
1	2	English	81
1	2	Math	80
…	…	…	…

The solution of is:

	A
1	=T("Scores.csv")
2	=A1.groups(CLASS,SUBJECT; median(,SCORE):MEDIAN_SCORE, count(SCORE<60):FAIL_COUNT)

A1: import the score table.

A2: use the A.groups() function to aggregate the number of median and failed students by class and subject.

In SPL, the count() function can count the number of failed students directly by using the parameter “SCORE<60” without selecting the failed records in advance. This makes it easy to calculate both the median and the number.

Sales.csv

Employee.csv

Scores.csv

SPL Resource： SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SPL：grouping & aggregation

Clone this wiki locally