Skip to content

SPL:selecting data by segment

esProcSPL edited this page Jul 22, 2024 · 1 revision

Selecting data by segment refers to finding the data in a group according to the specified interval. For example, return to assessments (excellent, good, medium, poor) according to the test results; return to the levels of assets (poor, well-off, middle-class, rich, etc.) according to the annual income of the family.

1. The segment number of a member in sequence

We first define an interval sequence for the segment. For example, we define the age sequence of [0,18,45,60], which represents minors, youth, middle-ages, and old-ages respectively. Next by calculating the sequence number of the value of the age field in the sequence of the age range, we can get what age range the person is in.

The A.pseg() function is provided in SPL, which returns the segment number of a member in sequence,and the default interval is left-open and right-closed. For example, the 17-year-old belongs to the age range of [0,18), so the sequence number returned by the A.pseg() function is 1; whereas the 18-year-old belongs to the age range of [18,45), then the sequence number returned by the A.pseg() function is 2. The A.pseg() function provides the @r option to use left-open and right-closed intervals when calculating the segment numbers. Take the same case for an example, 18-year-old is assigned to the interval of (0,18] by A.pseg() function,so the sequence number returned is 1.

In the case of employees, some of the data are as follows:

ID NAME ENTRYDATE STATE DEPT SALARY
1 Rebecca 2005/03/11 California R&D 7000
2 Ashley 2008/03/16 New York Finance 11000
3 Rachel 2010/12/01 New Mexico Sales 9000
4 Emily 2006/08/15 Texas HR 7000
5 Ashley 2004/07/30 Texas R&D 16000

[e.g. 1] Divide the employees into groups of less than 8,000, form 8,000 to 12,000 and more than 12,000 according to their salaries, and count the number of employees in each group.

The SPL script looks like this:

A
1 =T("Employee.csv")
2 [0,8000,12000]
3 =A1.align@a(A2.len(),A2.pseg(SALARY))
4 =A3.new(A2(#):SALARY,~.count():COUNT)

A1: import employee table.

A2: define the intervals of salaries.

A3: use the A.pseg() function to get the segment numbers of the salaries, and then group them according to the numbers

A4: calculate the number of people in each group.

[e.g. 2] Divide the employees into three segments in terms of their working years, 1 to 10 years, 10 to 20 years and 20 years respectively, and count the average salary of each group.

The SPL script looks like this:

A
1 =T("Employee.csv")
2 [0,10,20]
3 =A1.align@a(A2.len(),A2.pseg@r(interval@y(ENTRYDATE,now())))
4 =A3.new(A2(#):EntryYears,~.avg(SALARY):AvgSalary)

A1: import employee table.

A2: define the intervals of working years.

A3: use the A.pseg() function to get the segment numbers of the working years, where the @r option represents the left-open and right-closed interval. And then group them by the numbers.

A4: calculate the average salary of each group.

2. Returning the corresponding member of a sequence based on the segment number

Sometimes we need to get the corresponding member in the set by calculating the sequence number of the field value (or expression) in the segment.

The A.segp() function is provided in SPL, which returns the corresponding member in the sequence based on the segment number.

[e.g. 3] According to the score table, count the number of excellent, pass and fail students in the English subject. Some of the data are as follows:

Class StudentID Subject Score
1 1 English 95
1 1 Math 90
1 1 PE 80
1 2 English 75
1 2 Math 84

The SPL script looks like this:

A
1 =T("Scores.csv").select(SUBJECT:"English")
2 =create(Assessment,Score).record(["Fail",0,"Pass",60,"Excellent",90])
3 =A1.derive(A2.segp(Score,SCORE).Assessment:Assessment)
4 =A3.groups(Assessment;count(1):Count)

A1: import the score table and select the data of the English subject.

A2: create a check table of score and assessment:

Assessment Score
Fail 0
Pass 60
Excellent 90

A3: use the A.segp() function to obtain the corresponding assessment according to the segment numbers of the score in the check table.

A4: group and calculate the number by assessment.

Employee.csv

Scores.csv

Clone this wiki locally