-
Notifications
You must be signed in to change notification settings - Fork 332
SPL:selecting data by segment
Selecting data by segment refers to finding the data in a group according to the specified interval. For example, return to assessments (excellent, good, medium, poor) according to the test results; return to the levels of assets (poor, well-off, middle-class, rich, etc.) according to the annual income of the family.
We first define an interval sequence for the segment. For example, we define the age sequence of [0,18,45,60], which represents minors, youth, middle-ages, and old-ages respectively. Next by calculating the sequence number of the value of the age field in the sequence of the age range, we can get what age range the person is in.
The A.pseg() function is provided in SPL, which returns the segment number of a member in sequence,and the default interval is left-open and right-closed. For example, the 17-year-old belongs to the age range of [0,18), so the sequence number returned by the A.pseg() function is 1; whereas the 18-year-old belongs to the age range of [18,45), then the sequence number returned by the A.pseg() function is 2. The A.pseg() function provides the @r option to use left-open and right-closed intervals when calculating the segment numbers. Take the same case for an example, 18-year-old is assigned to the interval of (0,18] by A.pseg() function,so the sequence number returned is 1.
In the case of employees, some of the data are as follows:
ID | NAME | ENTRYDATE | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | 2005/03/11 | California | R&D | 7000 |
2 | Ashley | 2008/03/16 | New York | Finance | 11000 |
3 | Rachel | 2010/12/01 | New Mexico | Sales | 9000 |
4 | Emily | 2006/08/15 | Texas | HR | 7000 |
5 | Ashley | 2004/07/30 | Texas | R&D | 16000 |
… | … | … | … | … | … |
[e.g. 1] Divide the employees into groups of less than 8,000, form 8,000 to 12,000 and more than 12,000 according to their salaries, and count the number of employees in each group.
The SPL script looks like this:
A | |
---|---|
1 | =T("Employee.csv") |
2 | [0,8000,12000] |
3 | =A1.align@a(A2.len(),A2.pseg(SALARY)) |
4 | =A3.new(A2(#):SALARY,~.count():COUNT) |
A1: import employee table.
A2: define the intervals of salaries.
A3: use the A.pseg() function to get the segment numbers of the salaries, and then group them according to the numbers
A4: calculate the number of people in each group.
[e.g. 2] Divide the employees into three segments in terms of their working years, 1 to 10 years, 10 to 20 years and 20 years respectively, and count the average salary of each group.
The SPL script looks like this:
A | |
---|---|
1 | =T("Employee.csv") |
2 | [0,10,20] |
3 | =A1.align@a(A2.len(),A2.pseg@r(interval@y(ENTRYDATE,now()))) |
4 | =A3.new(A2(#):EntryYears,~.avg(SALARY):AvgSalary) |
A1: import employee table.
A2: define the intervals of working years.
A3: use the A.pseg() function to get the segment numbers of the working years, where the @r option represents the left-open and right-closed interval. And then group them by the numbers.
A4: calculate the average salary of each group.
Sometimes we need to get the corresponding member in the set by calculating the sequence number of the field value (or expression) in the segment.
The A.segp() function is provided in SPL, which returns the corresponding member in the sequence based on the segment number.
[e.g. 3] According to the score table, count the number of excellent, pass and fail students in the English subject. Some of the data are as follows:
Class | StudentID | Subject | Score |
---|---|---|---|
1 | 1 | English | 95 |
1 | 1 | Math | 90 |
1 | 1 | PE | 80 |
1 | 2 | English | 75 |
1 | 2 | Math | 84 |
… | … | … | … |
The SPL script looks like this:
A | |
---|---|
1 | =T("Scores.csv").select(SUBJECT:"English") |
2 | =create(Assessment,Score).record(["Fail",0,"Pass",60,"Excellent",90]) |
3 | =A1.derive(A2.segp(Score,SCORE).Assessment:Assessment) |
4 | =A3.groups(Assessment;count(1):Count) |
A1: import the score table and select the data of the English subject.
A2: create a check table of score and assessment:
Assessment | Score |
---|---|
Fail | 0 |
Pass | 60 |
Excellent | 90 |
A3: use the A.segp() function to obtain the corresponding assessment according to the segment numbers of the score in the check table.
A4: group and calculate the number by assessment.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code