-
Notifications
You must be signed in to change notification settings - Fork 332
Non basic Aggregation Examples
An aggregate operation summarizes data and returns the result. An aggregation is always preceded by a grouping operation. Common aggregate calculations include sum, max, min, count and logic operations. In this article, you’ll find illustrations of aggregate problems and their simple and efficient solutions with esProc scripts. Looking ${article} for details.
【Example 1】 Based on the GDP table, calculate the GDP per capita for direct-controlled municipalities, first-tier cities and second-tier cities respectively. Below is part of the source table:
ID | City | GDP | Population |
---|---|---|---|
1 | Shanghai | 32679 | 2418 |
2 | Beijing | 30320 | 2171 |
3 | Shenzhen | 24691 | 1253 |
4 | Guangzhou | 23000 | 1450 |
5 | Chongqing | 20363 | 3372 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from GDP") | / Query GDP table |
3 | [["Beijing","Shanghai","Tianjin","Chongqing"].pos(?)>0, ["Beijing","Shanghai","Guangzhou","Shenzhen"].pos(?)>0, ["Chengdu","Hangzhou","Chongqing","Wuhan","Xian","Suzhou", "Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao", "Shenyang","Ningbo","Kunming"].pos(?)>0] |
/ Enumerate direct-controlled cities, first-tier cities and second-tier cities respectively |
4 | =A2.enum@r(A3,City) | / Group records in GDP table according to the enumerated sequences of cities |
5 | =A4.new(A3(#):Area,~.sum(GDP)/~.sum(Population)*10000:CapitaGDP) | / Calculate GDP per capita in each group, during which sum() function is used to calculate sum |
A5’s result:
Area | CapitaGDP |
---|---|
["Beijing","Shanghai","Tianjin","Chongqing"].pos(?)>0 | 107345.03 |
["Beijing","Shanghai","Guangzhou","Shenzhen"].pos(?)>0 | 151796.49 |
["Chengdu","Hangzhou","Chongqing","Wuhan","Xi’an","Suzhou","Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao","Shenyang","Ningbo","Kunming"].pos(?)>0 | 106040.57 |
【Example 2】 The following is part of the orders table. We want to merge records of customer ANATR that have overlapping time periods (between order date and finish date).
OrderID | Customer | SellerId | OrderDate | FinishDate |
---|---|---|---|---|
10308 | ANATR | 7 | 2012/09/18 | 2012/10/16 |
10309 | ANATR | 3 | 2012/09/19 | 2012/10/17 |
10625 | ANATR | 3 | 2013/08/08 | 2013/09/05 |
10702 | ANATR | 1 | 2013/10/13 | 2013/11/24 |
10759 | ANATR | 3 | 2013/11/28 | 2013/12/26 |
… | … | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to data source |
2 | =A1.query("select * from Orders where Customer='ANATR'order by OrderDate") | / Get orders records of customer ANATR and sort them by order date |
3 | =A2.group@i(OrderDate>max(FinishDate[,-1])) | / Group the selected orders records, during which a new group is created when the order date of the current record is later than all previous finish dates |
4 | =A3.new(Customer,~.min(OrderDate):OrderDate,~.max(FinishDate):FinishDate) | / For each group, min() function gets the earliest order date and max() function gets the latest finish date, and use them respectively as the new order date and the new finish date |
A4’s result:
Customer | OrderDate | FinishDate |
---|---|---|
ANATR | 2012/09/18 | 2012/10/17 |
ANATR | 2013/08/08 | 2013/09/05 |
ANATR | 2013/10/13 | 2013/11/24 |
ANATR | 2013/11/28 | 2013/12/29 |
… | … | … |
【Example 4】 The following is part of the scores table. We want to, for each subject, find the number of students in class one who fail in this subject.
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
Class one | 1 | English | 84 |
Class one | 1 | Math | 77 |
Class one | 1 | PE | 69 |
Class one | 2 | English | 81 |
Class one | 2 | Math | 80 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from Scores where CLASS='Class one'") | / Get scores records of students in class one |
3 | =A2.groups(SUBJECT; count(SCORE< 60):FailCount) | / Perform grouping & aggregation, during which count() function calculates the number of students who fail in the subject |
A3’s result:
SUBJECT | FailCount |
---|---|
English | 2 |
Math | 0 |
PE | 2 |
【Example 5】 Based on the following primary school online-learning terminal tables (as shown by pic 2), we want to find if all students use mobile phones to learn. Pic 1 is the directory where the tables for classes of all grades are stored.
ID | STUDENT_NAME | TERMINAL |
---|---|---|
1 | Rebecca Moore | Phone |
2 | Ashley Wilson | Phone,PC,Pad |
3 | Rachel Johnson | Phone,PC,Pad |
4 | Emily Smith | Phone,Pad |
5 | Ashley Smith | Phone,PC |
6 | Matthew Johnson | Phone |
7 | Alexis Smith | Phone,PC |
8 | Megan Wilson | Phone,PC,Pad |
… | … | … |
【SPL script】
A | B | C | |
---|---|---|---|
1 | =directory@ps("D:/Primary School") | / Traverse the target directory recursively to list all files | |
2 | for A1 | =file(A2).xlsimport@t() | / Import the Excel files of all classes circularly |
3 | =B2.([TERMINAL,"Phone"].ifn().split@c().pos("Phone") > 0) | @ | |
4 | =B3.cand() | / A.cand() function checks whether B3’s members are all True |
A4’s result:
Value |
---|
false |
【Example 6】 The following is part of the sales data. We want to find if there is at least one month in 2014 when customer RATTC’s amount ranks in top 3.
OrderID | Customer | SellerId | OrderDate | Amount |
---|---|---|---|---|
10400 | EASTC | 1 | 2014/01/01 | 3063.0 |
10401 | HANAR | 1 | 2014/01/01 | 3868.6 |
10402 | ERNSH | 8 | 2014/01/02 | 2713.5 |
10403 | ERNSH | 4 | 2014/01/03 | 1005.9 |
10404 | MAGAA | 2 | 2014/01/03 | 1675.0 |
… | … | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db").query("select * from sales") | / Connect to the data source to query sales table |
2 | =A1.select(year(OrderDate)==2014) | / Get records of 2014 |
3 | =A2.group(month(OrderDate)) | / Group records of 2014 by months |
4 | =A3.(~.groups(Customer; sum(Amount):Amount)) | / Group records in each group by customers and calculate each customer’s total amount |
5 | =A4.new(~.top(-3; Amount):Top3) | / Loop through records of each month to get customers whose total amounts rank in top 3 |
6 | =A5.(Top3.(Customer).pos("RATTC")>0) | / Among each month’s top 3, check whether customer RATTC is included |
7 | =A6.cor() | / A.cor() function checks if there is a True in A6’s members |
A7’s result:
Value |
---|
false |
Find more examples in SPL CookBook.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code