-
Notifications
You must be signed in to change notification settings - Fork 332
Simple SQL Statements Only Exist in Coursebooks and Training Courses
The sample SQL statements in coursebooks are usually simple and easy to understand. They even read like English sentences, giving the impression that SQL is rather simple and easy to learn.
Actually, such a SQL statement consisting of only a few lines of code can only be found in coursebooks and training courses. In real-world businesses, the amount of SQL code is measured by KB instead of the number of lines. One SQL statement having several hundred lines of code and N layers of nested subqueries often reaches 3KB to 5KB in size. Such SQL statements are not easy to learn at all but rather a nightmare even to professional programmers.
The KB-level size itself isn’t a big problem. A long SQL statement is inevitable if the computing goal is truly complicated. The Python or Java counterpart could be even longer. Yet, unlike statements written in other languages, a long SQL statement is not only long but difficult to write and understand. Moreover, the difficulty isn’t proportional to the degree of task complexity. Except for the simplest scenarios, the degree of SQL code complexity increases sharply even if the task complexity only rises slightly. Programmers need to have very strong ability in order to accomplish the coding, and those computing scenarios often appear in programming interviews.
Why does SQL generate so long and difficult statements?
We mentioned one reason before. The SQL syntax is like English grammar, but the programming language isn’t a procedural type. It puts too many actions in one statement, and this increases the difficulty of working out a computing logic out of thin air.
But we find that the code is still difficult to understand when the computing task becomes slightly complicated, even after SQL adds the CTE syntax that enables procedural programming.
This is because SQL lacks many expressive abilities. As a result, programmers can only write code in a roundabout way rather than according to the natural way of thinking.
Let’s look at a simple scenario.
T is a simplified sales performance table. It has three fields: sales (seller), product and amount (sales amount). We want to find sales whose amounts of both ACs and TVs rank in top 10.
The task is simple and we can design the computing process in a natural way:
1.Sort records of the table by AC’s amount and get the top 10;
2.Sort records of the table by TV’s amount and get the top 10;
3.Perform intersection on the two result sets to get what we want.
We can write the SQL code with CTE syntax in the following way:
with A as (select top 10 sales from T where product='AC' order by amount desc),
B as (select top10 sales from T where product='TV' order by amount desc)
select * from A intersect B
The statement is not very short but is logically clear.
Now we make the task a little more complicated: find the sales whose amounts of all products rank in top10. According to logic in the previous task, we can code the computing process as follows:
1. List all products;
2. Find sales whose amounts rank in top 10 for each product and save them separately;
3. Get intersection of the top10 sets.
Unfortunately, the CTE syntax only supports expressing a definite number of intermediate results. For this task, we do not know the total number of products in advance, that is, the number of WITH clauses is indefinite. So, we are unable to code the current logic with CTE.
Well, let’s switch to a different logic:
1.Group records of the table by product, sort each group, and get top10 in each group;
2.Get intersection of these top10 sets.
To achieve this, we need to save the grouping result obtained in step 1. The intermediate result is a table, where there is a field for storing the top10 sales in the corresponding group members. This means that its values are sets. SQL lacks the set data type and still cannot code the logic.
Let’s try another solution. Group records by product and find the frequency of each seller in the top10s of all groups; if the number of appearances is equal to the number of products, the sales is in the top10 in terms of amount for all products.
select sales from (
select sales from (
select sales, rank() over (partition by product order by amount desc ) ranking
from T ) where ranking <=10 )
group by sales having count(*)=(select count(distinct product) from T)
Now we are finally able to write down the code with the help of window functions. But the logic is too roundabout; how many can work it out and put it down on paper?
SQL cannot describe the first two, simple logics; we can only adopt the third, more roundabout one. The reason behind this is one of SQL’s major shortcomings: incomplete set-lization.
SQL has the concept of sets, but it does not offer sets as a basic data type. It does not allow field values to be sets and, except for tables, it does not provide any other set data type. As a result, both the logic and the code for implementing the set computations are very roundabout.
In the above code, we use the keyword top. In fact, there isn’t such a keyword in the relational algebra. It isn’t the standard SQL syntax.
But how should we find the top10 sales without using keyword top?
The general logic is like this: get the number of members where the amounts are greater than the current one, take it as the ranking, and get members whose rankings are not greater than 10.
select sales from (
select A.sales sales, A.product product,
(select count(*)+1 from T
where A.product=product and A.amount<=amount) ranking
from T A )where product='AC' and ranking<=10
Note that the subquery cannot be written stepwise using the CTE syntax because it uses information in the main query as parameter.
We can also use join to write the query so that we can handle the computation step by step using the CTE syntax:
select sales from (
select A.sales sales, A.product product, count(*)+1 ranking from T A, T B
where A.sales=B.sales and A.product=B.product and A.amount<=B.amount
group by A.sales,A.product )
where product='AC' and ranking<=10
No matter how we implement the query, the logic is always too circuitous. Even a professional programmer needs to think for a while only to get the top10 job done.
This is due to SQL’s another major shortcoming: lack of support for orderliness. SQL inherits the mathematics’ unordered sets. It is rather difficult to take care of computations related to the order – we all know how common they are (link ratio calculation, YOY calculation, getting top 20%, getting rankings, to name a few).
The SQL2003 standard offered some order-related computing ability. This alleviates the difficulty for SQL of coping with order-based computations to some extent. Here is the SQL code to get the top10:
select sales from (
select sales, rank() over (partition by product order by amount desc ) ranking
from T )
where ranking <=10
The subquery is still there.
SQL’s unordered sets base hasn’t been fundamentally changed by window functions. Many order-based computations are still left difficult to tackle, such as counting the longest consecutively rising dates for a certain stock, which we often use as an example scenario:
select max(ContinuousDays) from (
select count(*) ContinuousDays from (
select sum(UpDownTag) over (order by TradeDate) NoRisingDays from (
select TradeDate,case when Price>lag(price) over ( order by TradeDate) then 0 else 1 end UpDownTag from Stock ))
group by NoRisingDays )
The natural way of thinking is this: sort records by date and perform count – add 1 when the price rises and reset the number as 0 when the price drops – and get the largest count. However, the logic cannot be expressed in SQL. With the language, we can only write the query as nested.
The computing task was a real recruitment exam problem, but only 20% applicants were able to work it out.
SQL’s shortcomings are clear even in such a simple example. The language has more weaknesses, which we won’t discuss in this article.
In a word, SQL cannot express logics according to the natural way of thinking. It usually takes a roundabout route and produces lengthy and difficult to understand statements.
Real-world computing tasks are more complicated than these examples. There are various difficulties – big and small – during the process of handling the computations. A roundabout way here and a few more lines there, and it is not strange that a SQL statement for a little complicated task finally has hundreds of lines of multilayer nested code and that two months later even the authors themselves cannot understand the code.
Indeed, SQL isn’t simple at all.
Here comes the commercial.
What can we do about difficult to write SQL statements? Use esProc SPL!
esProc SPL is open-source software written in Java. Download esProc SPL HERE.
SPL adds discreteness on the base of SQL’s existing set-lization and achieves complete set-lization and the ability to exploit data order. We can write code for the second example according to the natural way of thinking in SPL:
Find records of sales whose amounts for all products rank in top10, group them by product, get top10 members in each group, and then perform intersection:
T.group(product).(~.top(10;-amount)).isect()
SPL supports the use of a set of sets. The “getting top” computation is treated as an ordinary aggregation. Based on these concepts, it is easy to implement natural logics.
To count the longest consecutively rising dates for a stock, SPL just writes the code in the natural way of thinking:
cnt=0
Stock.sort(TradeDate).max(cnt=if(Price>Price[-1],cnt+1,0))
SPL boasts powerful order-based computing ability. It is effortless to implement the same logic expressed in the above SQL statement:
Stock.sort(TradeDate).group@i(Price<Price[-1]).max(~.len())
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code