-
Notifications
You must be signed in to change notification settings - Fork 335
The Select Operation on Structured Data
A select operation gets one or more members from a set according to a specific condition. It is similar to a locate operation. But the latter is concerned with the positions of certain members while the select operation wants the detailed information of a member record. Getting the incomes of the employees whose hire durations are over 10 years and finding names of customers whose order amounts are above 10,000 are among the many examples of select operations. But to perform selections conveniently and efficiently, you’d better read this article carefully where various scenarios are listed and sample scripts in esProc SPL are provided.
【Example 1】 List European and African cities with a population of more than 2 million and their populations in column groups (each column group is ordered by population in descending order). Below is part of the world’s urban population table:
Continent | Country | City | Population |
---|---|---|---|
Africa | Egypt | Cairo | 6789479 |
Asia | China | Shanghai | 24240000 |
Europe | Britain | London | 7285000 |
… | … | … | … |
The expected result:
Europe City | Population | Africa City | Population |
---|---|---|---|
Moscow | 8389200 | Cairo | 6789479 |
London | 7285000 | Kinshasa | 5064000 |
St Petersburg | 4694000 | Alexandria | 3328196 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db").query("select * from World where Continent in('Europe','Africa') and Population >= 2000000") | / Connect to database and get records of European and African cities having a population of over 2 million |
2 | =A1.select(Continent:"Europe") | / A.select()function gets records of Europe |
3 | =A1.select(Continent:"Africa") | /A.select()function gets records of Africa |
4 | =create('Europe City',Population,'Africa City', Population) | / Create an empty table according to the target structure |
5 | =A4.paste(A2.(City),A2.(Population),A3.(City),A3.(Population)) | / A.paste()function pastes values to corresponding columns |
A4’s result:
Europe City | Population | Africa City | Population |
---|---|---|---|
Moscow | 8389200 | Cairo | 6789479 |
London | 7285000 | Kinshasa | 5064000 |
St Petersburg | 4694000 | Alexandria | 3328196 |
… | … | … | … |
【Example 2】 Get full names and salaries of employees in R&D department in the New York state. Below is part of the source data:
ID | NAME | SURNAME | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | Moore | California | R&D | 7000 |
2 | Ashley | Wilson | New York | Finance | 11000 |
3 | Rachel | Johnson | New Mexico | Sales | 9000 |
4 | Emily | Smith | Texas | HR | 7000 |
5 | Ashley | Smith | Texas | R&D | 16000 |
… | … | … | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to the data source |
2 | =A1.query("select * from Employee") | / Import Employee table |
3 | =A2.select(STATE=="New York"&&DEPT=="R&D") | / A.select() function gets records of employees of R&D department in New York; the double ampersand sign ("&&") is used to connect two conditions that need to be met at the same time |
4 | =A3.new(NAME+""+SURNAME:FULLNAME, SALARY) | / Use the plus sign (+) to connect the name string and surname string into a full name |
A4’s result:
FULLNAME | SALARY |
---|---|
Matthew Johnson | 6000 |
Lauren Thomas | 12000 |
Brooke Williams | 12000 |
【Example 3】 Find the number of employees who are below 30 or over 50 in each department. Below is part of the source data:
ID | NAME | BIRTHDAY | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | 1974/11/20 | California | R&D | 7000 |
2 | Ashley | 1980/07/19 | New York | Finance | 11000 |
3 | Rachel | 1970/12/17 | New Mexico | Sales | 9000 |
4 | Emily | 1985/03/07 | Texas | HR | 7000 |
5 | Ashley | 1975/05/13 | Texas | R&D | 16000 |
… | … | … | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to the data source |
2 | =A1.query("select * from Employee") | / Import Employee table |
3 | =A2.select((age=age(BIRTHDAY), age<30 | |
4 | =A3.groups(DEPT; count(~):Count) | / Group the selected records and count the employees |
A4’s result:
DEPT | Count |
---|---|
4 | 9 |
5 | 5 |
6 | 3 |
7 | 6 |
8 | 1 |
【Example 4】 Based on the scores table, get the student ID with the lowest math score. Below is part of the source table:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
Class one | 1 | English | 84 |
Class one | 1 | Math | 77 |
Class one | 1 | PE | 69 |
Class one | 2 | English | 81 |
Class one | 2 | Math | 80 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to the database |
2 | =A1.query("select * from Scores where SUBJECT='Math'and CLASS='Class one'") | / Get math records for class one |
3 | =A2.minp(SCORE) | /A.minp() function gets the record having the lowest score |
4 | =A3.STUDENTID | / Get the student ID from A3’s record |
There could be more than one record that contains the minimum value. To return all eligible records, you can use @a option in A.minp() function:
A | B | |
---|---|---|
3 | =A2.minp@a(SCORE) | / A.minp() function works with @a option to get all records containing the lowest score |
4 | =A3.(STUDENTID) | / Get student IDs from A3’s records |
A3’s result:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
Class one | 5 | Math | 60 |
Class one | 14 | Math | 60 |
A4’s result:
Member |
---|
5 |
14 |
【Example 5】 Based on the Olympics medal table, get all information of the country that holds the top place for the longest time in terms of total medals. Below is part of the source table:
Game | Nation | Gold | Silver | Copper |
---|---|---|---|---|
30 | USA | 46 | 29 | 29 |
30 | China | 38 | 27 | 23 |
30 | UK | 29 | 17 | 19 |
30 | Russia | 24 | 26 | 32 |
30 | Korea | 13 | 8 | 7 |
… | … | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =file("Olympic.csv").import@cqt() | / Import the Olympic medal tally file |
2 | =A1.sort@z(Game, 1000000*Gold+1000*Silver+Copper) | / Sort records by games and the totals |
3 | =A2.group@o1(Game) | / Get the one record for each game, which is the first record because records are already sorted in descending order |
4 | =A3.group@o(Nation) | / Group the selected records by nations in their original order |
5 | =A4.maxp(~.len()) | / A.maxp() selects the group having the most records, which are the country that holds the top place in a row for the longest time |
A5’s result:
Game | Nation | Gold | Silver | Copper |
---|---|---|---|---|
10 | USA | 41 | 32 | 30 |
9 | USA | 22 | 18 | 16 |
8 | USA | 45 | 27 | 27 |
7 | USA | 41 | 27 | 28 |
Sometimes we need to first find the sequence number of the range where a value belongs to and then get its corresponding members in a set. Examples include returning the assessment (Excellent, Good, Average, Failed) according to a student’s score, and finding a household’s income range (Below or near poverty level, Low income, middle class, Upper middle class) according to its annually income.
【Example 6】 Based on the score table, find the number of students in Excellent, Pass, and Failed ranges respectively for the English subject. Below is part of the source data:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
Class one | 1 | English | 84 |
Class one | 1 | Math | 77 |
Class one | 1 | PE | 69 |
Class one | 2 | English | 81 |
Class one | 2 | Math | 80 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db").query("select * from Scores where SUBJECT='English'") | / Connect to the database and get records of English subject |
2 | =create(Assessment,Score).record(["fail",0,"pass",60,"excellent",90]) | / Create a lookup table of score ranges and assessments |
3 | =A1.derive(A2.segp(Score,SCORE).Assessment:Assessment) | / A.segp() function gets assessment for each score according to the sequence number of the range where the score falls in |
4 | =A3.groups(Assessment;count(1):Count) | / Group A3’s records by assessment and count students in each group |
A4’s result:
Assessment | Count |
---|---|
excellent | 6 |
Fail | 4 |
Pass | 18 |
【Example 7】 Based on the score table, for each subject in every class, find the top 2 students’ IDs in terms of score. Below is part of the source data:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
Class one | 1 | English | 84 |
Class one | 1 | Math | 77 |
Class one | 1 | PE | 69 |
Class one | 2 | English | 81 |
Class one | 2 | Math | 80 |
… | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to the database |
2 | =A1.query("select * from Scores") | / Query Scores table |
3 | =A2.group(CLASS,SUBJECT;~.top(-2;SCORE):TOP2) | / A.top() function gets the records of top 2 students for each subject in each class; -2 means getting two records in descending order |
4 | =A3.conj(TOP2) | / Concatenate all top 2 records together |
A4’s result:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
Class one | 4 | English | 96 |
Class one | 9 | English | 93 |
Class one | 13 | Math | 97 |
Class one | 10 | Math | 97 |
… | … | … | … |
It’s common to locate records according to primary key values, such as finding employee records by employee IDs, querying detailed orders data by orders IDs, etc.
【Example 8】 Based on the associated Course table and SelectCourse table, list a table of course selection information where each course occupies a column. Below is part of the Course table:
ID | NAME | TEACHERID |
---|---|---|
1 | Environmental protection and … | 5 |
2 | Mental health of College Students | 1 |
3 | Computer language Matlab | 8 |
… | … | … |
Here is part of the SelectCourse table:
ID | STUDENT_NAME | COURSE |
---|---|---|
1 | Rebecca | 2,7 |
2 | Ashley | 1,8 |
3 | Rachel | 2,7,10 |
… | … | … |
The expected result:
ID | STUDENT_NAME | COURSE1 | COURSE2 | COURSE3 | … |
---|---|---|---|---|---|
1 | Rebecca | Mental health of College Students | Into Shakespeare | … | |
2 | Ashley | Environmental protection and … | Modern economics | … | |
3 | Rachel | Mental health of College Students | Into Shakespeare | Music appreciation | … |
… | … | … | … | … | … |
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | /Connect to the database |
2 | =A1.query("select * from Course").keys(ID) | / Query Course table and set ID as the primary key |
3 | =A1.query("select * from SelectCourse") | / Query SelectCourse table |
4 | =A3.run(COURSE=COURSE.split@cp()) | / Split each Course value in SelectCourse table by comma and reassign it to the field |
5 | =A4.max(COURSE.len()) | / Get the largest number of selected courses |
6 | =create(ID,STUDENT_NAME, ${A5.("COURSE"+string(~)).concat@c()}) | / Create an empty table where the number of course columns is the number A5 gets |
7 | >A4.run(A6.record([ID,STUDENT_NAME] | COURSE.(A2.find(~).Name))) |
A6’s result:
ID | STUDENT_NAME | COURSE1 | COURSE2 | COURSE3 |
---|---|---|---|---|
1 | Rebecca | Mental health of College Students | Into Shakespeare | |
2 | Ashley | Environmental protection and … | Modern economics | |
3 | Rachel | Mental health of College Students | Into Shakespeare | Music appreciation |
… | … | … | … | … |
Find more examples in SPL CookBook.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code