Skip to content

Comparison of SQL and SPL:Non equi grouping

esProcSPL edited this page Apr 28, 2024 · 1 revision

The equi-grouping operations divides a data table according to same field values or expression results, while there are certain common grouping operations use special grouping conditions. Grouping customers according to the order of Beijing, Shanghai, Guangzhou and Shenzhen and calculate total sales amount in each group, and divide families into wealthy, well-off, comfortable, and poor according to the yearly income. This essay will explain the solutions and basic principles of SQL and SPL, the two commonly used programming languages, in dealing with this type of grouping scenarios, and find the simpler and more efficient way through sample programs. Looking ${article} for details.


A grouping operation is called equi-grouping if the grouping condition is same values or expression results. Equi-grouping operations are the most seen and they divide a data set completely. Their features are as follows:

(1) No empty subset in the grouping result set;

(2) Each member in the original set belongs to one and only one subset.

Does there exist the non-equi grouping and incomplete division? The answer is yes. Their features are as follows:

(1) The grouping result set contains at least one empty subset;

(2) The grouping result set does not contain all members of the original set;

(3) One member may belong to more than one subset.

Ⅰ. Alignment grouping

An alignment grouping compares a specific field or expression of each member in the to-be-grouped set with every member in a predefined base set and puts members matching same member in the base set in the same group. The number of groups in the result set is the same as that of members in the base set. The alignment grouping may generate empty group or leaves one or more members in outside of any group.

【Example 1】According to the Course table and Select_Course table below, find the unselected courses according to the order in Course table.

COURSE:

ID NAME TEACHERID
1 Environmental protection and sustainable development 5
2 Mental health of College Students 1
3 Matlab 8
4 Electromechanical basic practice 7
5 Introduction to modern life science 3

SELECT_COURSE:

ID COURSEID STUDENTID
1 6 59
2 6 43
3 5 52
4 5 44
5 5 37

SQL solution:

SQL does not support the alignment grouping. We get the distinct selected courses from the Select_Course table and then select records that do not contain these selected courses. Below are the SQL queries:

   select * 
   from COURSE
   where 
      ID not in
         (select DISTINCT COURSEID
         from SELECT_COURSE)
   order by rownum

SPL solution:

SPL offers A.align() function to perform alignment grouping.

A
1 =T("Course.csv")
2 =T("SelectCourse.csv")
3 =A2.align(A1:ID, COURSEID)
4 =A1(A3.pos@a(null))

A1: Import Course table.

A2: Import SelectCourse table.

A3: Use A.align() function to group SelectCourse table according to Course table’s ID field.

A4: Get the courses that no students select from Course table.

【Example 2】According to DEPARTMENT table and EMPLOYEE table below, calculate the number of employees in each department according to the order of DEPARTMENT table. Below is part of the source tables:

DEPARTMENT:

DEPT MANAGER
Administration 1
Finance 4
HR 5
Marketing 6
Production 7

EMPLOYEE:

ID NAME BIRTHDAY STATE DEPT SALARY
1 Rebecca 1974/11/20 California R&D 7000
2 Ashley 1980/07/19 New York Finance 11000
3 Rachel 1970/12/17 New Mexico Sales 9000
4 Emily 1985/03/07 Texas HR 7000
5 Ashley 1975/05/13 Texas R&D 16000

SQL solution:

For the alignment grouping operation, each group will contain all matching members in the to-be-grouped set. Since SQL does not directly support this type of grouping, we group EMPLOYEE table by department and count employees in each department, perform a left join between DEPARTMENT table and the result set of grouping and summarizing EMPLOYEE table, and query each department to get the number of employees. Note that the SQL JOIN operation between two result sets does not ensure that records are arranged in a specific order, so we need to sort them by row numbers in DEPARTMENT table; otherwise, the query result may not be ordered in the same way as DEPARTMENT table. Below are the SQL queries:

   select 
      t1.DEPT, t2.COUNT
   from 
      (select DEPT,rownum ordernum from DEPARTMENT) t1
   left join 
      (select 
         DEPT, count(*) COUNT 
      from EMPLOYEE 
      group by DEPT) t2
   on t1.DEPT=t2.DEPT
   order by ordernum

SPL solution:

SPL offers A.align() function to handle the alignment grouping. It works with @a option to return all matching members for each group.

A
1 =T("Department.csv")
2 =T("Employee.csv")
3 =A2.align@a(A1:DEPT, DEPT)
4 =A3.new(DEPT, ~.len():COUNT)

A1: Import Department table.

A2: Import Employee table.

A3: Group Employee table according to the order in Department table and return all matching members for each group.

A4: Count employees in each department.

【Example 3】According to Employee table (Below is a part of it), calculate the average salary for each state of [California, Texas, New York, Florida] and for all the other states as a whole, which are classified as “Other”.

ID NAME BIRTHDAY STATE DEPT SALARY
1 Rebecca 1974/11/20 California R&D 7000
2 Ashley 1980/07/19 New York Finance 11000
3 Rachel 1970/12/17 New Mexico Sales 9000
4 Emily 1985/03/07 Texas HR 7000
5 Ashley 1975/05/13 Texas R&D 16000

SQL solution:

In this case, the non-matching members are required to put into a new group. SQL does not have a method to do this. First, we define a set of the four specified state groups and the “Other” group, and then check which group each employee record belongs to using CASE WHEN statement. Below are SQL queries:

   with cte1(ID,STATE) as 
      (select 1,'California' from DUAL 
         UNION ALL select 2,'Texas' from DUAL
         UNION ALL select 3,'New York' from DUAL
         UNION ALL select 4,'Florida' from DUAL
         UNION ALL select 5,'OTHER' from DUAL)
   select 
      t1.STATE, t2.AVG_SALARY 
   from cte1 t1
   left join 
      (select 
         STATE,avg(SALARY) AVG_SALARY 
      from
         ( select 
            CASE WHEN 
            STATE IN ('California','Texas','New York','Florida') 
            THEN STATE 
            ELSE 'OTHER' END STATE, 
            SALARY 
         from EMPLOYEE)
      group by STATE) t2
   on t1.STATE=t2.STATE
   order by t1.ID

SPL solution:

SPL has A.align() function to do this. It uses @n option to put the non-matching members into a new group.

A
1 =T("Employee.csv")
2 [California,Texas,New York,Florida]
3 =A1.align@an(A2,STATE)
4 =A3.new(if (#>A2.len(),"OTHER",STATE):STATE,~.avg(SALARY):AVG_SALARY)

A1: Import Employee table.

A2: Define a constant set of regions.

A3: A.align() groups Employee table according to A2’s set of states. @a option is used to return all matching members for each group; @n option enables putting the non-matching members into a new group.

A4: Name the new group “OTHER” and calculate the average salary of each group.

SQL uses JOION to achieve an alignment grouping because it does not have a method to do it directly. As a SQL result set is unordered, we record the order of the records in the original table through their original row numbers. This makes SQL solution to an alignment grouping problem complicated. SPL offers A.align() function for specifically handling alignment grouping operations, which features concise syntax and efficient execution.

Ⅱ. Enumeration grouping

An enumeration grouping prespecifies a set of enumerated conditions, computes each conditional expression using every member of the to-be-grouped set as the parameter, and puts members that make the same condition true into the corresponding subset. The subsets in the result set correspond to the prespecified conditions one by one.

【Example 4】According to the following China urban population table, divide cities into groups and count the number in each group. Megacity: above 2 million population; Large city: between 1 to 2 million population; Major city: between 0.5 to 1 million population; Other cities. Below is part of the source table:

ID CITY POPULATION PROVINCE
1 Shanghai 12286274 Shanghai
2 Beijing 9931140 Beijing
3 Chongqing 7421420 Chongqing
4 Guangzhou 7240465 Guangdong
5 HongKong 7010000 HongKong Special Administrative Region

SQL solution:

SQL lacks a direct method of handling enumeration grouping operations. We segment the source table using CASE WHEN statement, then group data by comparing it with the segmentation values and perform aggregation. CASE WHEN statement put each record into the first matching grouping only. Below are SQL queries:

   with cte1(ID,CITY_TYPE) as 
      (select 1,'Megacity ' from DUAL 
      UNION ALL select 2,'Large city ' from DUAL
      UNION ALL select 3,'Major city ' from DUAL
      UNION ALL select 4,'Other cities ' from DUAL)
   select 
      t1.CITY_TYPE, nvl(t2.COUNT,0) COUNT
   from cte1 t1
   left join 
      (select CITY_TYPE, count(*) COUNT from 
         (select 
         case when POPULATION>2000000 then ' Megacity '
         when POPULATION>1000000 then ' Large city '
         when POPULATION>500000 then ' Major city '
         when POPULATION<=500000 then ' Other cities '
         else 'OTHER' end CITY_TYPE
      from URBAN_POPULATION)
      group by CITY_TYPE) t2
   on t1.CITY_TYPE=t2.CITY_TYPE
   order by t1.ID

SPL solution:

SPL provides A.enum() function to perform grouping according to the enumerated conditions.

A
1 =T("UrbanPopulation.txt")
2 [?>2000000,?>1000000,?>500000,?<=500000]
3 =A1.enum(A2,POPULATION)
4 [Megacity, Large city, Major city, Other cities]
5 =A3.new(A4(#):CITY_TYPE,~.count():COUNT)

A1: Import UrbanPopulation table.

A2: Define grouping conditions.

A3: A.enum() function performs enumeration grouping on A1’s table according to A1’s conditions and, by default, puts each record in the first matching grouping only.

A4: Define a name for each group.

A5: Count cities in each group according to the enumeration grouping result.

【Example 5】Divide the following employee table into three groups according to age groups <35, <45 and other ages. Below is part of the source table:

ID NAME BIRTHDAY STATE DEPT SALARY
1 Rebecca 1974/11/20 California R&D 7000
2 Ashley 1980/07/19 New York Finance 11000
3 Rachel 1970/12/17 New Mexico Sales 9000
4 Emily 1985/03/07 Texas HR 7000
5 Ashley 1975/05/13 Texas R&D 16000

SQL solution:

It is required, in this case, that the non-matching records be put into a new group. We use CASE WHEN … ELSE statement to define an “Others” group to achieve this. Below are SQL queries:

   with cte1(ID,AGE_AREA) as 
      (select 1,'Below 35 ' from DUAL 
      UNION ALL select 2,'Below 45 ' from DUAL
      UNION ALL select 3,'Others ' from DUAL)
   select 
      t1.AGE_AREA, nvl(t2.AVG_SALARY,0) AVG_SALARY
   from cte1 t1
   left join 
   (select 
      AGE_AREA, avg(SALARY) AVG_SALARY 
   from 
      (select 
         case when TRUNC(months_between(sysdate, BIRTHDAY)/12)<35 
         then ' Below 35 '
         when TRUNC(months_between(sysdate, BIRTHDAY)/12)<45 
         then ' Below 45 '
         else ' Others ' end AGE_AREA,
         SALARY
      from EMPLOYEE)
   group by AGE_AREA) t2
   on t1.AGE_AREA=t2.AGE_AREA
   order by t1.ID

SPL solution:

A.enum() function uses @n option to put the non-matching records into a new group.

A
1 =T("Employee.csv")
2 [?<35,?<45]
3 =A1.enum@n(A2, age(BIRTHDAY))
4 [Below 35,Below 45, Others]
5 =A3.new(A4(#):AGE_AREA,~.avg(SALARY):AVG_SALARY)

A1: Import Employee table.

A2: Define grouping conditions.

A3: A.enum() function performs enumeration grouping by age groups and works with @n option to put non-matching records into a new group.

A4: Define a name for each group.

A5: Calculate the average salary in each group according to A3’s result.

【Example 6】According to the following cities’ GDB table, calculate per capita GDP for the direct-controlled municipalities, first-tier cities and second-tier cities. Groups may contain common members as Beijing is both a direct-controlled city and a first-tier city. Below is part of the source table:

ID CITY GDP POPULATION
1 Shanghai 32679 2418
2 Beijing 30320 2171
3 Shenzhen 24691 1253
4 Guangzhou 23000 1450
5 Chongqing 20363 3372

SQL solution:

In this case, one record could be put into more than one group. CASE WHEN statement can only put a record into the first matching group and does not support re-putting it to another group, so we need to calculate per capita GDP for each group and then concatenate the results. Below are SQL queries:

   select 
      'Direct-controlled municipality ' AREA,
      SUM(GDP*1000)/SUM(POPULATION) CAPITA_GDP 
   from CITY_GDP 
   where 
      CITY in ('Beijing','Shanghai','Tianjin','Chongqing')
   UNION ALL
   select 
      'First-tier city ' AREA,
      SUM(GDP*1000)/SUM(POPULATION) CAPITA_GDP 
   from CITY_GDP 
   where 
      CITY in ('Beijing','Shanghai','Guangzhou','Shenzhen')
   UNION ALL
   select 
      'Second-tier city ' AREA,
      SUM(GDP*1000)/SUM(POPULATION) CAPITA_GDP 
   from CITY_GDP 
   where 
      CITY in ('Chengdu','Hangzhou','Chongqing','Wuhan','Xian','Suzhou','Tianjin','Nanjing','Changsha','Zhengzhou','Dongguan','Qingdao','Shenyang','Ningbo','Kunming')

In the SQL query, groups are respectively queried and summarized, and their results are concatenated one by one. Each group requires a piece of SQL code. When there are many groups, it’s complicated to write and maintain the SQL query.

SPL solution:

A.enum() function uses @r function to check whether all members in a group match a condition.

A
1 =T("CityGDP.txt")
2 [["Beijing","Shanghai","Tianjin","Chongqing"].pos(?)>0,["Beijing","Shanghai","Guangzhou","Shenzhen"].pos(?)>0,["Chengdu","Hangzhou","Chongqing","Wuhan","Xian","Suzhou","Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao","Shenyang","Ningbo","Kunming"].pos(?)>0]
3 =A1.enum@r(A2,CITY)
4 [direct-controlled municipality, first-tier city, second-tier city]
5 =A3.new(A4(#):AREA,.sum(GDP)/.sum(POPULATION)*10000:CAPITA_GDP)

A1: Import CityGDP table.

A2: Enumerate conditions for direct-controlled city, first-tier city and second-tier ciy.

A3: A.enum() groups A1’s table according to the enumerated conditions and works with @r option to check whether all members in a group match a condition.

A4: Define a name for each group.

A5: Calculate per capita GDP in each group according to A3’s result.

SPL needs just one option to handle the re-grouping of a member in an enumeration grouping operation. No matter how many groups there are, users only need to maintain the enumerated conditions, instead of modifying the code for grouping & aggregation.

Summary

SQL is not fit to handle both alignment grouping and enumeration grouping. There are two causes for this. One is SQL does not provide a special statement or function to handle them, the other is that SQL’s theoretical foundation, the relational algebra, is based on unordered sets, which makes it hard to handle computations that involve the fixed order. SPL, however, is based on ordered sets, and is thus naturally good at handling order-based computations. Besides, it offers functions to specifically deal with the two types of grouping operations.

When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.

The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which makes the marriage of both Java and SQL advantages. A grouping operation in SPL will thus become simple and easy.

Clone this wiki locally