-
Notifications
You must be signed in to change notification settings - Fork 332
“Top N” Queries on Structured Data
A “top N” query gets the first or bottom N values, the records, or their positions, from a data set. Sometimes we use such a query after a grouping operation to get values, specific information, or positions from a subset. This article will tell you how to handle different types of “top N” scenarios through examples and offers sample scripts in esProc SPL. Looking ${article} for details.
Getting the maximum or minimum value can be regarded as a special top N scenario where N is 1. It’s so common that I single it out for special illustration. There are many such computing tasks, such as getting the highest math score in class one, finding the age of the youngest employee, and so on. On certain occasions, instead of the specific values we are more concerned with their positions. Often this happens in inter-row calculations. One example is to find how much the sales amount in the company’s best month goes up compared with the previous month. To get this done we need to first get the sequence number of the record holding the month with the highest sales amount, then the sales amount in the previous month, and compare them. Other times we want to get certain information in the record containing the maximum or minimum value. Examples include getting the name of the student who has the highest math score in class one, finding which department the youngest employee belongs to in a company, etc. Now we’ll look at how to deal with the three scenarios of “getting maximum/minimum”. The following table records information of NASDAQ Composite:
Date | Open | Close | Volume |
---|---|---|---|
2019/01/02 | 6506.910156 | 6665.939941 | 2261800000 |
2019/01/03 | 6584.77002 | 6463.5 | 2607290000 |
2019/01/04 | 6567.140137 | 6738.859863 | 2579550000 |
2019/01/07 | 6757.529785 | 6823.470215 | 2507550000 |
2019/01/08 | 6893.439941 | 6897.0 | 2380290000 |
… | … | … | … |
【Example 1】 Get the highest closing price in the NASDAQ in the year of 2019. 【SPL script】
A | B | |
---|---|---|
1 | =file("IXIC.txt").import@t() | / Import IXIC data |
2 | =A1.select(year(Date)==2019) | / Get data of 2019 |
3 | =A2.max(Close) | / A.max() function gets the highest closing price |
To get the lowest closing price in the NASDAQ in the year of 2019:
A | B | |
---|---|---|
3 | =A2.min(Close) | / A.min()function gets the lowest closing price |
【Example 2】 Calculate the growth rate of the trading date with the highest NASDAQ closing price in 2019 compared with the closing price in the previous day. 【SPL script】
A | B | |
---|---|---|
1 | =file("IXIC.txt").import@t() | / Import IXIC data |
2 | =A1.select(year(Date)==2019).sort(Date) | / Get data of 2019 and sort it by Date |
3 | =A2.pmax(Close) | / A.pmax() gets the sequence number of the record holding the highest closing price |
4 | =A2.calc(A3,Close/Close[-1]-1) | / Calculate the growth rate by dividing the highest closing price by the closing price of the previous day |
There could be more than one record having the maximum value. To return the sequence numbers of all eligible records, you can use @a option in A.pmax() function:
A | B | |
---|---|---|
3 | =A2.pmax@a(Close) | / Get sequence numbers of all records holding the highest closing price |
You can use @z option in A.pmax() function to locate the record(s) from backwards to forwards:
A | B | |
---|---|---|
3 | =A2.pmax@z(Close) | / Get the sequence number of the record holding the highest closing price |
【Example 3】 Get the date in the record holding the highest NASDAQ closing price in the year of 2019. 【SPL script】
A | B | |
---|---|---|
1 | =file("IXIC.txt").import@t() | / Import IXIC data |
2 | =A1.select(year(Date)==2019) | / Get data of 2019 |
3 | =A2.maxp(Close) | / A.maxp() gets the record having the highest closing price |
4 | =A3.Date | / Get the date from A3’s record |
A.minp() function is used to get the record(s) containing the minimum value:
A | B | |
---|---|---|
3 | =A2.minp(Close) | / A.minp() gets the record having the lowest closing price |
Both A.maxp() and A.minp() functions can work with @a option and @z option to achieve specific computing goals. Here we won’t give examples.
There are same three scenarios of getting the top/bottom N. Use the same NASDAQ composite data to illustrate them:
Date | Open | Close | Volume |
---|---|---|---|
2019/01/02 | 6506.910156 | 6665.939941 | 2261800000 |
2019/01/03 | 6584.77002 | 6463.5 | 2607290000 |
2019/01/04 | 6567.140137 | 6738.859863 | 2579550000 |
2019/01/07 | 6757.529785 | 6823.470215 | 2507550000 |
2019/01/08 | 6893.439941 | 6897.0 | 2380290000 |
… | … | … | … |
【Example 4】 Get the top 3 NASDAQ volumes in the year of 2019. 【SPL script】
A | B | |
---|---|---|
1 | =file("IXIC.txt").import@t() | / Import IXIC data |
2 | =A1.select(year(Date)==2019) | / Get data of 2019 |
3 | =A2.top(-3, Volume) | / A.top(n,x) function gets the top 3 volumes |
To get the bottom 4 NASDAQ volumes in the year of 2019:
A | B | |
---|---|---|
3 | =A2.top(4, Volume) | / A.top(n,x) function gets the bottom 4 volumes |
【Example 5】 Calculate the growth rate of each of the trading dates having the 3 highest NASDAQ closing prices in 2019 compared with their closing price in the prior day. 【SPL script】
A | B | |
---|---|---|
1 | =file("IXIC.txt").import@t() | / Import IXIC data |
2 | =A1.select(year(Date)==2019).sort(Date) | / Get data of 2019 and sort it by Date |
3 | =A2.ptop(-3, Close) | /A.ptop(n,x) function gets the sequence numbers of the highest 3 closing prices |
4 | =A3.run( |
/ Calculate the growth rate circularly by dividing the current volume by volume of the prior day |
【Example 6】 Get the records of trading dates with 5 lowest volumes in NASDAQ 2019. 【SPL script】
A | B | |
---|---|---|
1 | =file("IXIC.txt").import@t() | / Import IXIC data |
2 | =A1.select(year(Date)==2019).sort(Date) | / Get data of 2019 |
3 | =A2.top(5; Close) | / A.top(n; x) function gets the records of 5 trading dates with the lowest volumes |
Getting the maximum/minimum value and the top/bottom N values from each subset after a data set is grouped are also common computing goals, like finding the 5 top-selling items for each month and getting the customers whose total order amounts rank in top 3 for each year. Now let’s look at how to handle top N queries on subsets.
【Example 7】 Based on the score table, get the highest math score in each class. Below is part of the source table:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
1 | 1 | English | 95 |
1 | 1 | Math | 90 |
1 | 1 | PE | 80 |
1 | 2 | English | 75 |
1 | 2 | Math | 84 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =file("Score.txt").import@t() | / Import Score table |
2 | =A1.select(Subject:"Math") | / Get records of math subject |
3 | =A2.groups(Class; max(Score):BestScore) | / Group the selected records by classes and use max() function to get the highest math score in each class |
A top N query can be regarded as a kind of aggregate operation on each subset after a data set is grouped. Here we look at how to get top N values and records respectively. 【Example 8】 Query the highest two math scores in each class. Below is part of the score table:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
1 | 1 | English | 95 |
1 | 1 | Math | 90 |
1 | 1 | PE | 80 |
1 | 2 | English | 75 |
1 | 2 | Math | 84 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =file("Score.txt").import@t() | / Import Score table |
2 | =A1.select(Subject:"Math") | / Get records of math subject |
3 | =A2.group(Class; ~.top(-2, Score):top2) | / Group the selected records by classes and use A.top() function to get the highest two math scores in each class |
4 | =A3.new(Class, top2(1):First, top2(2):Second) | / Create the result set where column 1 contains classes, column 2 holds the highest score and column 3 has the second highest score |
【Example 9】 For each subject in each class, query the information of students whose scores rank in top 3. Below is part of the score table:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
1 | 1 | English | 95 |
1 | 1 | Math | 90 |
1 | 1 | PE | 80 |
1 | 2 | English | 75 |
1 | 2 | Math | 84 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =file("Score.txt").import@t() | / Import Score table |
2 | =A1.group(Class,Subject;~.top(-3;Score):top3) | / Group records by classes and subjects and get the highest two scores in each group |
3 | =A2.conj(top3) | / Concatenate records with the highest 2 scores from all groups |
etting top N values or records in a cumulative way won’t generate the subsets after the data set is grouped. This method is used to handle scenarios when data volume is huge. There are same two scenarios here – top N values and top N records. 【Example 10】 Get the hire dates of the two employees in each department who have the longest hire durations. Below is part of the employee table:
EID | NAME | DEPT | EntryDate |
---|---|---|---|
1 | Rebecca | R&D | 2005/03/11 |
2 | Ashley | Finance | 2008/03/16 |
3 | Rachel | Sales | 2010/12/01 |
4 | Emily | HR | 2006/08/15 |
5 | Ryan | R&D | 2004/07/30 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =file("Employee.txt").cursor@t() | / Create cursor of the *Employee *table |
2 | =A1.groups(Department; top(2,EntryDate):Top2) | / Group the cursor by departments and get the earliest two hire dates in each group |
3 | =A2.news(Top2;Department, ~:EntryDate) | / Create a new table where column 1 is Department and column 2 is EntryDate |
【Example 11】 Get the information of employees whose salaries rank in top 3 in each department. Below is part of the employee table:
EID | NAME | DEPT | SALARY |
---|---|---|---|
1 | Rebecca | R&D | 7000 |
2 | Ashley | Finance | 11000 |
3 | Rachel | Sales | 9000 |
4 | Emily | HR | 7000 |
5 | Ryan | R&D | 13000 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =file("Employee.txt").cursor@t() | / Create cursor of the *Employee *table |
2 | =A1.groups(Department; top(-3;Salary):Top3) | / Group the cursor by departments and get the records having a Salary value that ranks in top 3 in each group |
3 | =A2.conj(Top3) | / Concatenate eligible records from all groups |
Find more examples in SPL CookBook.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code