-
Notifications
You must be signed in to change notification settings - Fork 332
Locate Operation on Ordered Sets
Sometimes it’s a record’s position in an ordered sets that we are more concerned with in order to accomplish the computing goal. For instance, we want to find which row the order whose ID is 752084 settles in the orders table, or which trading date when SSE Composite Index has the highest closing price in the year of 2019. Here we have the efficient and fast ways to locate records and provide sample scripts in esProc SPL.
To locate a member in another ordered set, we compare the member to each member of the target set to find its position.
【Example 1】 In the Teachers.txt below, column 1 contains professors’ names, column 2 has subjects, and other columns contain course code (null means an absence of code). Here’s part of the source table:
Teachers.txt
Petitti | Matematica | mif | mig | vif | vig | null | null | … |
---|---|---|---|---|---|---|---|---|
Canales | Apesca | luc | lud | mac | mad | mic | mid | … |
Lucero | NavegacionI | lub | luc | lud | lue | mab | mac | … |
Bergamaschi | TecPesc | lua | luf | maa | maf | mia | mif | … |
… | … | … | … | … | … | … | … | … |
Here’s the Courses table:
Monday | Tuesday | Wednesday | Thursday | Friday |
---|---|---|---|---|
lua | maa | mia | jua | via |
lub | mab | mib | jub | vib |
luc | mac | mic | juc | vic |
lud | mad | mid | jud | vid |
lue | mae | mie | jue | vie |
luf | maf | mif | juf | vif |
lug | mag | mig | jug | vig |
【SPL script】
A | B | |
---|---|---|
1 | =file("Teachers.txt").import() | / Import data from Teachers.txt |
2 | =A1.new(#1:professor, |
/ Create a new table based on the Teachers table, where column 1 contains professors’ names and column 2 is lists of courses |
3 | =file("Courses.txt").import@t().conj(~.array()) | / Import data from Courses.txt, and concatenate all courses |
4 | =A3.(A2.select(codeArray.pos(A3.~)).(professor)) | / Loop through each of A3’s courses, use pos() function to locate the current course in A2’s course list, and get the eligible professor for this course |
5 | =create(Monday,Tuesday,Wednesday,Thursday,Friday).record(A4.(~.concat@c())) | / Create a Monday-to -Friday timetable and populate professors to it |
A5’s result:
Monday | Tuesday | Wednesday | Thursday | Friday |
---|---|---|---|---|
Bergamaschi,Puebla | Bergamaschi,Pue… | Bergamaschi,Puebla | Bergamaschi,Pue… | Bergamaschi,Puebla |
Lucero,Puebla,Lu… | Lucero,Mazza,Pu… | Lucero,Puebla,Chi… | Lucero,Mazza,Pe… | Lucero,Puebla,Vel… |
Canales,Lucero,P… | Canales,Lucero,M… | Canales,Lucero,P… | Canales,Lucero,M… | Lucero,Velasco,Lu… |
… | … | … | … | … |
A member could appear more than once when locating it in an ordered set. For instance, we may find that there are more than one student who get a 100 score for math and that more than one employee who has the name Ashley.
【Example 2】 Based on the following associated Sales table and Customer table, find the customers who didn’t place an order in the year of 2014. The following picture shows their relationship:
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | /Connect to database |
2 | =A1.query("select * from Sales where year(OrderDate)=2014") | / Get records of 2014 from Sales table |
3 | =A1.query("select * from Customer") | / Query Customer table |
4 | =A3.(ID).sort() | / List customer IDs and sort them |
5 | =A2.align(A4.len(), A4.pos@b(CustomerID)) | / Group sales records of 2014 by A4’s customer IDs, during which pos function is used to locate the corresponding customer ID in A4’s list; since customer IDs are ordered, we use @b option to perform a fast binary search |
6 | =A3(A5.pos@a(null)) | / pos() function works with @ option to get all records of customers who didn’t place an order (A5’s value is null); by default the function returns only the first-found eligible record |
A6’s result:
ID | Name | State | … |
---|---|---|---|
ALFKI | CMA-CGM | Texas | … |
CENTC | Nedlloyd | Florida | … |
To locate the maximum/minimum value is to get the sequence number(s) of the records containing the maximum or minimum value.
【Example 3】 Based on the stock transaction table, calculate the growth rate of the date when the SSE Composite Index has its higher closing price compared with the prior day. Below is part of the source table:
Date | Open | Close | Amount |
---|---|---|---|
2019/12/31 | 3036.3858 | 3050.124 | 2.27E11 |
2019/12/30 | 2998.1689 | 3040.0239 | 2.67E11 |
2019/12/27 | 3006.8517 | 3005.0355 | 2.58E11 |
2019/12/26 | 2981.2485 | 3007.3546 | 1.96E11 |
2019/12/25 | 2980.4276 | 2981.8805 | 1.9E11 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =file("000001.csv").import@ct() | / Import the source data file |
2 | =A1.sort(Date) | / Sort data by date |
3 | =A2.pmax(Close) | / Use A.pmax() function to get the sequence number of the record having the highest closing price |
4 | =A2.calc(A3,Close/Close[-1]-1) | / Calculate the growth rate by dividing the current closing price by the previous closing price |
A4’s result:
Member |
---|
0.010276967857506536 |
Similarly, we can use A.pmin() function to get the sequence number of the record containing the minimum value:
A | B | |
---|---|---|
3 | =A3.pmin(Close) | / Use A.pmin() function to get the sequence number of the record having the lowest closing price |
There could be more than one record containing the maximum value. To return sequence numbers of all eligible records, just use @a option with A.pmax() function:
A | B | |
---|---|---|
3 | =A2.pmax@a(Close) | / A.pmin() function works with @a option to get sequence numbers of all records having the highest closing price |
To locate a record from back to front, A.pmax() function has @z option:
A | B | |
---|---|---|
3 | =A2.pmax@z(Close) | / A.pmax() function works with @z option to get sequence number of the record with the highest closing price |
A conditional locate operation calculates a specific condition on each member of an ordered set and returns the sequence numbers of members that generates a "true", such as getting sequence numbers of employees who are above 50 years old from the employee table, finding sequence number of math scores that are over 90 from the score table, and so on.
【Example 4】 Based on the stock transaction table, calculate the growth rate of each of the transaction dates when the closing prices increases by over 3% compared with the previous transaction date. Below is part of the source table:
Date | Open | Close | Amount |
---|---|---|---|
2019/12/31 | 3036.3858 | 3050.124 | 2.27E11 |
2019/12/30 | 2998.1689 | 3040.0239 | 2.67E11 |
2019/12/27 | 3006.8517 | 3005.0355 | 2.58E11 |
2019/12/26 | 2981.2485 | 3007.3546 | 1.96E11 |
2019/12/25 | 2980.4276 | 2981.8805 | 1.9E11 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =file("000001.csv").import@ct() | / Import the source data file |
2 | =A1.select(year(Date)==2019).sort(Date) | / Get stock records of 2019 and sort them by date |
3 | =A2.pselect@a(Close/Close[-1]>1.03) | / A.pselect () function works with @a option to get sequence numbers of all records where closing prices increases by over 3% |
4 | =A3.new(A2( |
/ Calculate growth rate of each day circularly |
A3’s result:
Member |
---|
161 |
187 |
211 |
A4’s result:
Date | Amount increase |
---|---|
2019/02/25 | 1.758490566037736 |
2019/03/29 | 1.3344827586206895 |
2019/05/10 | 1.3908629441624365 |
So we can see that the transaction amount of each of the three days increases greatly compared with the prior day.
Besides locating a member in ordered set according to a specific value or expression, sometimes we perform the location by specified interval for further grouping and aggregation. [0,18,35,60] is a set of ages that represents teenagers, young adults, middle-aged and older adults in order. The interval number where age 20 falls in is 2, which means they are a young adult. Here we define a set of salaries [0,8000,15000,30000] and try to find the interval number corresponding to the salary 25000, the result is 3.
【Example 5】 Based on the following EMPLOYEE table, count the employees in different salary ranges: below 8000, between 8000 and 120000, and above 12000. Here’s part of the source table:
ID | NAME | BIRTHDAY | SALARY |
---|---|---|---|
1 | Rebecca | 1974-11-20 | 7000 |
2 | Ashley | 1980-07-19 | 11000 |
3 | Rachel | 1970-12-17 | 9000 |
4 | Emily | 1985-03-07 | 7000 |
5 | Ashley | 1975-05-13 | 16000 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from EMPLOYEE") | / Query EMPLOYEE table |
3 | [0,8000,12000] | / Define salary intervals |
4 | =A2.align@a(A3.len(),A3.pseg(SALARY)) | / A.pseg(y) function gets the interval number that a salary value corresponds to |
5 | =A4.new(A3 (#):SALARY,~.count():COUNT) | / Count the employees in each interval |
A5’s result:
SALARY | COUNT |
---|---|
0 | 308 |
8000 | 153 |
12000 | 39 |
At some occasions we need to first calculate the interval values and then find the number of intervals in a set corresponding to a member.
【Example 6】 Based on the following EMPLOYEE table, calculate the average salary for employees of different hire durations: below 10 years, between 10 to 20 years and above 20 years. Below is part of the source table:
ID | NAME | HIREDATE | SALARY |
---|---|---|---|
1 | Rebecca | 2005-03-11 | 7000 |
2 | Ashley | 2008-03-16 | 11000 |
3 | Rachel | 2010-12-01 | 9000 |
4 | Emily | 2006-08-15 | 7000 |
5 | Ashley | 2004-07-30 | 16000 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from EMPLOYEE") | /Query EMPLOYEE table |
3 | [0,10,20] | / Define intervals of hire durations |
4 | =A2.align@a(A3.len(),A3.pseg(year(now())-~,year(HIREDATE))) | / A.pseg(y) function gets the interval number that a hire date corresponds to |
5 | =A4.new(A3(#):EntryYears,~.avg(SALARY):AvgSalary) | / Calculate average salary in each group |
A5’s result:
EntryYears | AvgSalary |
---|---|
0 | 6807.69 |
10 | 7417.78 |
20 | 7324.32 |
Sorting is the rearrangement of members in a set of a certain order by a key word using a specific method. It aims to sort out the currently “unordered” set of records. But, the original order is not always useless or meaningless. For instance, a result orders table after query is arranged by datetime, but we need to sort it out by sales amount. Yet the datetime will be useless in subsequent calculations, so we also want to retain the original order while ordering the records by the specific condition.
【Example 7】 Based on the following EMPLOYEE table, get records of the thee eldest employees and arrange them by hire dates in temporal order. Below is part of the source table:
ID | NAME | BIRTHDAY | HIREDATE |
---|---|---|---|
1 | Rebecca | 1974-11-20 | 2005-03-11 |
2 | Ashley | 1980-07-19 | 2008-03-16 |
3 | Rachel | 1970-12-17 | 2010-12-01 |
4 | Emily | 1985-03-07 | 2006-08-15 |
5 | Ashley | 1975-05-13 | 2004-07-30 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from EMPLOYEE order by HIREDATE") | / Query EMPLOYEE table and sort it by hire date |
3 | =A2.psort(BIRTHDAY) | / A.psort() function gets the sequence numbers of employees’ birthdays before they are sorted |
4 | =A2(A3.to(3).sort()) | / Get the first three EMPLOYEE records according to the sequence number of birthdays |
A4’s result:
ID | NAME | BIRTHDAY | HIREDATE |
---|---|---|---|
296 | Olivia | 1968-11-05 | 2006-11-01 |
440 | Nicholas | 1968-11-24 | 2008-07-01 |
444 | Alexis | 1968-11-12 | 2010-12-01 |
Apart from locating individual members, we can locate a subset of set at a time. For instance, the result of locating the set [c,d,a] in [a,b,c,d,e] is [3,4,1], and the locating set [c,f] in [a,b,c,d,e] returns a null because the larger set contains doesn’t contain f.
【Example 8】 Based on the *PostRecord *table, group records by labels and count the frequencies of each label. Below is part of the source table:
ID | TITLE | Author | Label |
---|---|---|---|
1 | Easy analysis of Excel | 2 | Excel,ETL,Import,Export |
2 | Early commute: Easy to pivot excel | 3 | Excel,Pivot,Python |
3 | Initial experience of SPL | 1 | Basics,Introduction |
4 | Talking about set and reference | 4 | Set,Reference,Dispersed,SQL |
5 | Early commute: Better weapon than Python | 4 | Python,Contrast,Install |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from PostRecord") | / Query *PostRecord *table |
3 | =A2.conj(Label.split(",")).id() | / Split each Label value by comma and concatenate all values to get the unique labels |
4 | =A2.align@ar(A3.len(),A3.pos(Label.split(","))) | / A.pos() function finds the set of sequence numbers of all labels and then align@r() function groups PostRecord records by those sequence numbers |
5 | =A4.new(A3(#):Label,~.count():Count).sort@z(Count) | / Count the frequencies of each label and sort the result in descending order |
A5’s result:
Label | Count |
---|---|
SPL | 7 |
SQL | 6 |
Basics | 5 |
… | … |
In certain scenarios, it’s not the sequence numbers of set B’s members in set A that we are concerned with, but that whether set A includes all members of set B.
【Example 9】 The following table records official languages of different countries. Now we want to find countries that use both Chinese and English as their official languages. Below is part of the source data:
Country | Language |
---|---|
China | Chinese |
UK | English |
Singapore | English |
Singapore | Malay |
Singapore | Chinese |
Singapore | Tamil |
Malaysia | Malay |
… | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from Language") | / Query Language table |
3 | =A2.group(Country) | / Group records by country |
4 | =A3.select(~.(Language).contain("Chinese","English")) | / A.contain() function checks whether the current country uses both Chinese and English as their official languages |
5 | =A4.(Country) | / Get the eligible country |
A5’s result:
Member |
---|
Singapore |
We have a special method for finding sequence numbers of members among primary key values.
【Exmaple 10】 Below are the associated Product table and Category table, find product records whose category IDs are not included in the Category table. Here’s their relationship:
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | /Connect to database |
2 | =A1.query("select * from Category").keys(ID) | / Query Category table and set ID as the primary key |
3 | =A1.query("select * from Product") | /Query Product table |
4 | =A3.select(A2.pfind(CategoryID)==0) | / A.pfind() function gets sequence numbers of records from Category table whose primary key values are the same as CategoryID values in* Product* table; the result of 0 means that the current category ID doesn’t exist in Category table. And then get product records whose category IDs are not included by Category table |
A4’s result:
ID | Name | CategoryID | … |
---|---|---|---|
12 | German cheese | … | |
26 | Spun sugar | 9 | … |
Finding sequence numbers of top/bottom N records is a common computing goal, such as finding students whose math scores rank in top 3 in the class, getting 5 employees who have the shortest hire durations. Let’s look at how to get sequence numbers of the top/bottom N members.
【Example 11】 The following table stores SSE Composite Index transaction information. We want to find the growth rate of each of the 3 days with the highest closing prices in 2019, compared with the prior day. Below is part of the source data:
Date | Open | Close | Amount |
---|---|---|---|
2019/12/31 | 3036.3858 | 3050.124 | 2.27E11 |
2019/12/30 | 2998.1689 | 3040.0239 | 2.67E11 |
2019/12/27 | 3006.8517 | 3005.0355 | 2.58E11 |
2019/12/26 | 2981.2485 | 3007.3546 | 1.96E11 |
2019/12/25 | 2980.4276 | 2981.8805 | 1.9E11 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =file("000001.csv").import@ct() | /Import the source data file |
2 | =A1.select(year(Date)==2019) | / Get stock records of 2019 |
3 | =A2.ptop(-3, Close) | / A.ptop() gets the sequence numbers of 3 records when the closing prices stay the highest; -3 means getting top 3 in descending order, and a positive integer will get the specific top N in ascending order |
4 | =A3.run( |
/ Calculate growth rate circularly by dividing the current amount by that of the prior day |
A3’s result:
Member |
---|
154 |
156 |
157 |
A4’s result:
Member |
---|
-0.0278 |
-0.0139 |
0.0112 |
There are more examples in SPL CookBook.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code