Is SPL more difficult or easier than SQL?
SPL, a technology designed specifically for processing structured and semi-structured data, is in practice often several to hundreds of times faster than SQL, with much shorter code; its advantages are especially obvious for complex calculations. These advantages attract strong interest from users, but many worry about how hard SPL is to master. After all, SPL's concepts and syntax differ considerably from SQL's, and users have to rethink some concepts and learn new syntax, which makes them hesitate.
So, how difficult is it to get started with SPL? Let's begin with SQL to discuss the problem.
SQL is the most widely used query language for structured data, and common query calculations are very simple to implement in it. For example, grouping and aggregation can be done with a simple GROUP BY statement. Compared with other high-level languages (the same query would take dozens of lines of Java), SQL could hardly be simpler. Moreover, SQL syntax is designed to resemble English, and a query statement reads like an English sentence, which greatly lowers the barrier to use.
However, the simplicity of SQL is limited to simple query tasks. Once the query requirement becomes slightly more complex, SQL is no longer easy to use. Simple three-to-five-line queries exist only in textbooks; real business is much more complex.
Let's take a common example: calculate the maximum number of consecutive days on which a stock keeps rising.
This calculation is not difficult and can be performed in a natural way: first, sort the records by trading day, then set up a counter and compare the records one by one; if the stock price rises, increment the counter by 1, otherwise reset it to 0. Finally, take the maximum value of the counter.
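To see how directly this idea maps onto a procedural language, here is a minimal Java sketch of it; the array, class, and method names are hypothetical and assume the closing prices have already been sorted by trading day and loaded into memory:

public class RisingStreak {
    // closingPrices: closing prices already sorted by trade date (hypothetical input)
    static int maxConsecutiveRisingDays(double[] closingPrices) {
        int counter = 0;  // length of the rising run ending at the current day
        int max = 0;      // longest rising run seen so far
        for (int i = 1; i < closingPrices.length; i++) {
            if (closingPrices[i] > closingPrices[i - 1]) {
                counter++;        // price rose: extend the current run
            } else {
                counter = 0;      // price did not rise: reset the counter
            }
            if (counter > max) {
                max = counter;    // keep the maximum value of the counter
            }
        }
        return max;
    }
}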
Unfortunately, SQL cannot directly describe this procedural logic (unless a stored procedure is used), so the calculation has to be implemented in another way:
select max(consecutive_day)
from (select count(*) consecutive_day
      from (select sum(rise_mark) over(order by trade_date) days_no_gain
            from (select trade_date,
                         case when closing_price>lag(closing_price) over(order by trade_date)
                              then 0 else 1 end rise_mark
                  from stock_price))
      group by days_no_gain)
As you can see, this code groups the transaction records, putting consecutively rising records into the same group, so the goal can be reached by counting the members of the largest group. Both grouping and counting are operations SQL supports, but SQL supports only equivalence grouping and does not support ordered grouping by the original order of the data. SQL therefore has to use a sub-query and a window function to create a grouping mark, giving consecutively rising records the same mark value; only then can equivalence grouping be performed and the expected result obtained. The drawback is that it takes real effort to understand this way of coding. Moreover, this code relies on the window functions introduced in the SQL:2003 standard, which can directly compare today's price with yesterday's, making the mark relatively convenient to calculate, yet several levels of nesting are still needed. Under the earlier SQL-92 standard, the consecutive-rise calculation itself would be hard to express at all, and the whole code would be many times more complex.
Having read and understood this SQL code, you will find that implementing this kind of calculation in SQL is not easy. SQL's lack of support for procedural computing and ordered computing (window functions provide only limited support) makes an originally simple problem very hard to solve.
Besides the lack of ordered computing capability, SQL has other deficiencies: it does not support discrete records, its set orientation is not thorough, and it lacks an object reference mechanism. Any of these makes coding difficult. When the path from thinking of a solution (natural thinking) to implementing it (finishing the code) becomes very tortuous and takes great effort, development difficulty rises sharply. In practice we often see SQL statements hundreds of lines long; such code is usually the product of this tortuous process, its development cycle is measured in weeks or even months, and the development cost is extremely high. Moreover, even after the code is written, there is the embarrassing situation that the author himself can no longer understand it a month or two later, which makes both maintenance and handover costly.
Complex code means low development efficiency and high development cost. Worse, complex code usually comes with poor performance: even after it is written, it does not run fast.
Let's look at another common example: take the top 10 out of 100 million rows of data. The SQL code is not complicated:
SELECT TOP 10 x FROM T ORDER BY x DESC
This query uses ORDER BY. Strictly following this logic means sorting the full data set, and sorting big data is very slow. If memory is insufficient, buffered data must be written to external storage, and the repeated disk reads and writes make performance drop sharply.
We know that this calculation does not require a full sort at all. It is enough to keep a small set holding the 10 largest values seen so far: traverse the data once, discarding smaller values and keeping larger ones, and the values left at the end are the result we want. Such a calculation needs very little memory and no repeated reading and writing of external storage. Unfortunately, this algorithm cannot be described in SQL.
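As a sketch of this single-pass idea in an ordinary procedural language (it cannot be written in SQL, as noted above), a small min-heap can hold the 10 largest values seen so far; the plain array here is a hypothetical stand-in for the real data source:

import java.util.PriorityQueue;

public class Top10 {
    // Returns the 10 largest values after one traversal; no full sort, only O(10) extra memory
    static double[] top10(double[] data) {
        PriorityQueue<Double> kept = new PriorityQueue<>();  // min-heap of the values kept so far
        for (double v : data) {
            if (kept.size() < 10) {
                kept.offer(v);               // fewer than 10 kept: just keep the value
            } else if (v > kept.peek()) {
                kept.poll();                 // discard the smallest of the kept values
                kept.offer(v);               // keep the larger one
            }
        }
        double[] result = new double[kept.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = kept.poll();         // output in ascending order
        }
        return result;
    }
}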
Fortunately, although SQL syntax cannot express this algorithm, the problem can be worked around in practice: many database engines automatically optimize such queries to avoid overly inefficient algorithms. However, this automatic optimization only works for simple queries.
Now let's take a slightly more complicated TopN calculation: find the top 10 in each group. The SQL code (already a little troublesome, as you can see) is:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn
FROM Orders )
WHERE rn<=10
This code implements the calculation in two steps: first use a window function to create an in-group sequence number (in-group sorting), then filter the qualifying records with a sub-query. Because SQL's set orientation is not thorough, the calculation has to be done indirectly through partitioning, sorting and a sub-query, which makes the code a bit complicated. Moreover, most database optimizers get confused here: they cannot see the real intent of the statement and simply execute the sorting written in it (the ORDER BY keyword is still there), so performance drops sharply.
Relying solely on the database's automatic optimization is unreliable. You also need to understand the execution plan and rewrite the statement. However, sometimes the statement cannot be rewritten successfully because the necessary operations are missing, and you have to write a UDF and do the calculation yourself, which is difficult and troublesome. Even a UDF may not help, because the storage scheme cannot be changed. To ensure performance, you often end up writing peripheral code in Java/C++ yourself, which makes the complexity very high and drives up the development cost dramatically.
Many tasks that could be coded in a normal way have to be implemented in a roundabout way once SQL is used, resulting in long code and poor performance. Such SQL code is often hard even for its authors to understand, let alone for the database's automatic optimization engine. And when running speed slows down, more hardware is needed to compensate, which raises hardware cost. In the end, both development and hardware costs increase.
In fact, the industry has already realized that SQL has limitations with complex problems and that a mature, easy-to-use data warehouse should not offer SQL alone. Some data warehouses have begun to introduce Python and Scala, or to use technologies like MapReduce, to address the problem, but so far the results are unsatisfactory. MapReduce performs poorly, consumes a great deal of hardware resources, is very cumbersome to code in, and still cannot express many calculations easily. Python's Pandas is more capable in terms of logic, but it is not well designed in the details: there is much duplicated functionality and inconsistent style, it is still not easy to describe complex logic with it, and it is hard to achieve high performance because it lacks big-data computing power and corresponding storage mechanisms. As for Scala, its DataFrame object is inconvenient to use, ordered operations are not well supported, and the large amount of record copying during computation leads to poorer performance; to some extent it can even be called a regression.
It is easy to understand why these problems arise. It is like a tall building on an unstable foundation: no matter how the upper floors are repaired, it will not help. The only way to solve the problems completely is to use a brand-new tool.
SPL is exactly that tool.
Instead of continuing with SQL's relational algebra, SPL invents a new theoretical basis, the discrete dataset, and builds a new language on it (equivalent to pulling down the tall building of SQL and rebuilding it). SPL supports procedural computing and provides a variety of computing mechanisms, such as ordered computing, so its way of implementing algorithms is quite different from SQL's.
Let's use SPL to solve the problem above, calculating the maximum number of consecutive days a stock keeps rising:
| | A |
| --- | --- |
| 1 | =stock_price.sort(trade_date) |
| 2 | =0 |
| 3 | =A1.max(A2=if(closing_price>closing_price[-1],A2+1,0)) |
As you can see, this code basically follows the natural way of thinking to solve the problem: sort, compare ([-1] means taking the previous day's value), and obtain the result we want. The code is simple and concise.
Even implementing the same logic as the SQL statement is simple in SPL:
stock_price.sort(trade_date).group@i(closing_price<closing_price[-1]).max(~.len())
Although the calculation idea is exactly the same as in the previous SQL code, it is much easier to express in SPL; there is no need for a tortuous workaround, because SPL directly supports ordered grouping.
Concise syntax dramatically improves development efficiency and reduces development cost accordingly. It also improves computing performance. Take the TopN calculation mentioned earlier as an example:
| | A | |
| --- | --- | --- |
| 1 | =file("data.ctx").create().cursor() | |
| 2 | =A1.groups(;top(10,amount)) | Get orders whose amount ranks in the top 10 |
| 3 | =A1.groups(area;top(10,amount)) | Get orders whose amount ranks in the top 10 in each area |
SPL regards the TopN operation as a kind of aggregation, just like SUM and COUNT, except that its return value is a set. In this way, a high-complexity full-sort operation is transformed into a low-complexity aggregation operation, and the range of application is extended as well.
In these statements there is no sort-related keyword, so no full sort is triggered and no data is exchanged with the hard disk even when the data volume is large. The syntax for calculating TopN over the whole set and within each group is basically the same, and both achieve high performance. SPL offers many more high-performance algorithms, such as ordered grouping, position index, parallel computing and ordered merge, all of which greatly improve computing performance.
To explain why SPL is simple and efficient, we can look at an analogy:
When solving the arithmetic problem 1+2+3+...+100, ordinary people add the numbers one by one, while the clever Gauss computes 50*101 and gets it done in one step. With the new operation of multiplication, both the description of the solution (concise code) and the execution of the calculation (efficient running) improve greatly, making the task much easier.
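For completeness, Gauss's shortcut is just the pairing trick of ordinary arithmetic (nothing SPL-specific):

$$1 + 2 + \cdots + 100 = (1+100) + (2+99) + \cdots + (50+51) = 50 \times 101 = 5050$$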
Therefore, we can say that SQL (relational algebra), born fifty years ago, is like an arithmetic system that has only addition, so cumbersome code and low performance are inevitable; SPL (the discrete dataset) is the arithmetic system that also has multiplication, so concise code and high efficiency come naturally.
Some people may ask: multiplication is indeed simpler, but the prerequisite is being as clever as Gauss, and not everyone is. Does that mean only clever programmers can use SPL, making it actually harder to use?
This should be answered from two aspects.
First, some solutions can be thought of but cannot be written down. For example, the ordered grouping and the full-sort-free TopN mentioned above cannot be coded in SQL, so you are forced back to the cumbersome “addition”. SPL, in contrast, provides many “multiplications”; with them you can code whatever solution you think of, and the coding is easier.
Second, there are indeed solutions we cannot think of ourselves, since we are not as clever as Gauss. But now that someone has already thought of them, all we need to do is learn them. For example, once we learn the clever way to calculate 1+2+…+100, we can also calculate 2+4+…+500. The number of common solutions is not large, and they can be mastered with some practice. Admittedly, mastering them is not effortless and requires exercises; with enough practice, these solutions become your “natural thinking”, and then it is easy to accomplish your calculation tasks.
In fact, real business is full of scenarios that are hard to handle in SQL. Below are some examples that defeat SQL.
Complicated ordered computing: analysis of user behavior through conversion funnel
After users log in to an e-commerce website/app, multiple operation events occur, such as page browsing, searching, adding to cart, placing an order and paying. These events are ordered in time, and some users are lost after each event. Funnel conversion analysis usually counts the number of users of each event first and then performs further calculations on the counts, such as the conversion rate. The analysis is valid only when the events occur within a specified time period and in the specified order, which makes this a typical complicated multi-step ordered computation, and it is very difficult to implement in SQL.
Multi-step big data batch job
The amount of data involved in an off-line batch job is huge (sometimes the full business data), the calculation logic is complex and must be carried out in multiple ordered steps, and the job usually has to finish within a specified time window; otherwise the business is affected and accidents may occur.
Since SQL can hardly perform such calculations directly, stored procedures are usually used, and when the calculation is complex, cursors have to be used to fetch the data. This approach is not only inefficient but also unable to run in parallel, resulting in low efficiency and high resource consumption. In addition, stored-procedure code often runs to dozens of steps and thousands of lines and has to repeatedly store intermediate results, which brings an extremely high IO cost; as a result, it frequently happens that the batch job cannot finish within its time window.
Multi-index calculation on big data, with repeated use of detail data and many associations
Index calculation is common in industries like finance and telecom. As both the data volume and the number of indexes grow, it becomes impossible to compute them in real time with SQL, because the calculation uses the detail data many times and repeatedly traverses large tables, and also involves mixed operations such as large-table association, conditional filtering, grouping and aggregation, and distinct counting, often under high concurrency. As a result, a pre-processing approach has to be adopted, which cannot meet changeable real-time query needs.
Space does not allow very long code here, so let's take the e-commerce funnel as an example. The SQL code looks like this:
with e1 as (
    select uid, 1 as step1, min(etime) as t1
    from event
    where etime >= to_date('2021-01-10') and etime < to_date('2021-01-25')
        and eventtype = 'eventtype1' and …
    group by 1),
e2 as (
    select uid, 1 as step2, min(e1.t1) as t1, min(e2.etime) as t2
    from event as e2
    inner join e1 on e2.uid = e1.uid
    where e2.etime >= to_date('2021-01-10') and e2.etime < to_date('2021-01-25')
        and e2.etime > t1 and e2.etime < t1 + 7
        and eventtype = 'eventtype2' and …
    group by 1),
e3 as (
    select uid, 1 as step3, min(e2.t1) as t1, min(e3.etime) as t3
    from event as e3
    inner join e2 on e3.uid = e2.uid
    where e3.etime >= to_date('2021-01-10') and e3.etime < to_date('2021-01-25')
        and e3.etime > t2 and e3.etime < t1 + 7
        and eventtype = 'eventtype3' and …
    group by 1)
select
    sum(step1) as step1,
    sum(step2) as step2,
    sum(step3) as step3
from e1
    left join e2 on e1.uid = e2.uid
    left join e3 on e2.uid = e3.uid
Since SQL lacks ordered computing capability and its set orientation is not thorough, multiple sub-queries and repeated JOINs are needed, which makes the code hard both to write and to understand, and the performance is poor. Moreover, the code depends on the number of funnel steps: every additional step requires one more sub-query, which is cumbersome. Even so, not every database can execute this statement.
SPL code for the same calculation task:
| | A |
| --- | --- |
| 1 | =["etype1","etype2","etype3"] |
| 2 | =file("event.ctx").open() |
| 3 | =A2.cursor(uid,etime,etype;etime>=date("2021-01-10") && etime<date("2021-01-25") && A1.contain(etype) && …) |
| 4 | =A3.group(uid).(~.sort(etime)) |
| 5 | =A4.new(~.select@1(etype==A1(1)):first,~:all).select(first) |
| 6 | =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime, null)))) |
| 7 | =A6.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3) |
This code follows natural thinking: group by uid, loop through each group, and check whether a corresponding record (time) exists for each event type in the list. The only special part is that the first event is handled separately; once it is found, it serves as the input parameter for the second event, and the 2nd to the Nth events are then handled in the same way (which can be expressed as generic code). Finally, count by user group.
This SPL solution is basically the same as natural thinking: using characteristics such as orderliness and set-oriented grouping, the calculation is done in 7 simple steps, so it is very simple. In addition, the code can handle a funnel with any number of steps. Since the calculation traverses the data only once and does not interact with external storage, the performance is also higher.
Of course, to get the desired results with SPL (a programming language), you still need some knowledge of the functions and algorithms SPL provides; only then can you choose the right one among many functions, and this is where SPL beginners get confused. What SPL offers is a “toolbox” containing a variety of tools; which tool to use depends entirely on the actual need. However, once you have mastered the tools in the box, you can easily solve whatever practical problem you encounter later, and even modify an existing solution (for performance optimization) with ease. SQL, in contrast, provides very few tools, so sometimes even when you think of a good method you do not know where to start and have to implement it in a tortuous way, which is not only difficult but also slow.
Of course, there is more to learn with SPL, so in a sense SPL is a little more “difficult”. It is like solving word problems: primary-school students use only the four arithmetic operations, which seems easy, while middle-school students have to learn the concept of equations, which requires more knowledge. In practice, however, primary-school students have to devise a specific solution for each specific problem, which is often quite hard, whereas middle-school students only need to set up the equation according to fixed rules. Which one do you think is easier?
Naturally, mastering equations takes a learning process. If you know nothing about equations, you will have no idea what to do, and because it is unfamiliar you will find it difficult; the same goes for SPL. Comparing SPL with Java, you will find SPL much less difficult to learn; after all, Java concepts such as object orientation and reflection are themselves quite complex. A programmer who can learn Java will have no trouble with SPL; it is just a matter of getting used to it, so do not approach it with preconceptions.
In addition, for very complex scenarios that demand extreme performance, some knowledge of complex algorithms is required, and the learning difficulty rises. In such cases you can consult SPL experts and develop a solution together. In fact, the key to these complex scenarios is the algorithm rather than the language itself: whatever technology is used, the algorithmic problems still have to be solved. Because of SQL's deficiencies in set orientation, discreteness, orderliness and so on, it is extremely difficult, and sometimes impossible, to solve them in SQL, while SPL can express this kind of computation relatively easily.
At this point, we can draw a conclusion: SQL is easy only in simple scenarios. When facing complex business logic, SQL is hard to code and slow to run because of the “tortuous” process, yet such complex business accounts for the majority of real applications (the Pareto principle). SPL makes these complex scenarios easy to implement by providing simpler and more efficient means. Let us emphasize it again: the key to handling complex scenarios is the algorithm. But thinking of an algorithm is not enough; the ability to implement it is also required, and the implementation should be easy (SQL does poorly here). Fortunately, SPL makes this possible.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code