Writing simply and running fast are the same thing, so why can't SQL run fast?
We have discussed the principles behind coding difficulty and complexity; now let's focus on performance, since running speed is obviously important.
We know that software cannot change the speed of hardware: the CPU and hard disk run at their fixed speeds. What we can do is design low-complexity algorithms that require less computation, so the computer executes fewer operations and naturally runs faster. If a task was originally going to take 100 million calculations and a good algorithm reduces the load to 1 million, a 100-fold speedup is not surprising. However, merely conceiving the algorithm is not enough; it must actually be written in some programming language, or the computer cannot execute it.
If the programming language is not powerful enough, however, the algorithm may be impossible to express, and then there is nothing to do but put up with the slowness.
SQL often runs into exactly this problem. A simple example illustrates it: find the top 10 out of 100 million rows. Writing this in SQL is not complicated:
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC
There is an “ORDER BY” in this statement, and the corresponding execution logic is a full sort of the 100 million rows. Sorting is a very slow operation: it traverses the data multiple times (with complexity N*logN), and if the data is too large to fit in memory, external storage has to be used for buffering, causing performance to drop sharply. If the logic described by this SQL statement is followed strictly, the operation can never run fast.
In fact, we know that getting just the top 10 does not require sorting all the data. It is enough to maintain a small set holding the 10 largest values seen so far and keep updating it during a single traversal. The complexity is N*log10, and since the log of 100 million is about 8 times the log of 10, the computational load is roughly 8 times smaller. Moreover, this algorithm needs no external buffering.
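As a rough illustration of this idea (a minimal sketch, not how any particular database implements it), here is a single-pass top-N in Java using a small min-heap; the input array stands in for the Amount column:

```java
import java.util.PriorityQueue;

public class TopN {
    // Single-pass top-N: keep a min-heap of the 10 largest amounts seen so far.
    // Complexity is O(N*log10), and no external buffering is needed.
    public static double[] top10(double[] amounts) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap
        for (double a : amounts) {
            if (heap.size() < 10) {
                heap.offer(a);
            } else if (a > heap.peek()) {
                heap.poll();   // evict the current smallest of the top 10
                heap.offer(a);
            }
        }
        // Returns the top 10 in no particular order
        return heap.stream().mapToDouble(Double::doubleValue).toArray();
    }
}
```

The heap never holds more than 10 elements, so memory use stays constant no matter how large N is.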
Unfortunately, SQL cannot describe such a calculation process; one can only write the statement shown above and rely on the database to optimize it. Fortunately, almost all databases do optimize this statement and are not foolish enough to perform a full sort, so it runs fast.
But what happens when the situation gets a bit more complicated? For example, if we change the requirement to the top 10 within each group, the SQL statement looks like this:
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn
    FROM Orders ) t
WHERE rn <= 10
This is very different from the earlier statement that gets the top 10 from the entire set, a difference caused by the lack of discreteness we mentioned earlier. You first have to use a window function to generate an in-group sequence number, and then filter the qualifying records with a subquery, which is rather “convoluted”.
In any case, “ORDER BY” still appears here, so the execution logic still calls for sorting. In actual tests on Oracle with the same volume of data, we found that computing the in-group top 10 is dozens of times slower than computing the top 10 of the whole set above, even though in principle it should be only slightly slower for adding a grouping operation. Quite possibly Oracle really does sort, perhaps even with external storage (we cannot be certain without reading Oracle's source code); the optimization engine simply faints in this slightly more complex situation and can only execute the logic literally as written in the SQL, so performance drops sharply.
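The single-pass algorithm extends naturally to grouping: keep one small heap per group key. A minimal sketch of the idea, with hypothetical Area and Amount inputs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class GroupedTopN {
    // In-group top-N in one pass: one small min-heap per group key.
    // Still a single traversal; no full sort, no external buffering.
    public static Map<String, PriorityQueue<Double>> top10ByArea(
            String[] areas, double[] amounts) {
        Map<String, PriorityQueue<Double>> heaps = new HashMap<>();
        for (int i = 0; i < amounts.length; i++) {
            PriorityQueue<Double> heap =
                heaps.computeIfAbsent(areas[i], k -> new PriorityQueue<>());
            if (heap.size() < 10) {
                heap.offer(amounts[i]);
            } else if (amounts[i] > heap.peek()) {
                heap.poll();   // evict the smallest of this group's top 10
                heap.offer(amounts[i]);
            }
        }
        return heaps;
    }
}
```

The grouped version differs from the whole-set version only by the map lookup, which is exactly why it should be only slightly slower, not dozens of times slower.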
Of course, databases may evolve further one day and learn to optimize this case too (in fact, some already do). But there are always more complex situations, and database optimization engines evolve very slowly.
Consider a more complex requirement: funnel analysis in an e-commerce system, which computes the user churn rate after each step. The following three-step funnel comes from a real business scenario, and the SQL is written as follows:
WITH e1 AS (
SELECT userid, visittime AS step1_time, MIN(sessionid) AS sessionid, 1 AS step1
FROM defined_events e1 JOIN eventgroup ON eventgroup.id = e1.eventgroup
WHERE visittime >= DATE_ADD(arg_date,INTERVAL -14 day) AND visittime < arg_date AND eventgroup.name='SiteVisit'
GROUP BY userid,visittime
), e2 AS (
SELECT e2.userid, MIN(e2.sessionid) AS sessionid, 1 AS step2, MIN(visittime) AS step2_time, MIN(e1.step1_time) AS step1_time
FROM defined_events e2 JOIN e1 ON e1.sessionid = e2.sessionid AND visittime > step1_time JOIN eventgroup ON eventgroup.id = e2.eventgroup
WHERE visittime < DATE_ADD(step1_time ,INTERVAL +1 day) AND eventgroup.name = 'ProductDetailPage'
GROUP BY e2.userid
), e3 AS (
SELECT e3.userid, MIN(e3.sessionid) AS sessionid, 1 AS step3, MIN(visittime) AS step3_time, MIN(e2.step1_time) AS step1_time
FROM defined_events e3 JOIN e2 ON e2.sessionid = e3.sessionid AND visittime > step2_time
JOIN eventgroup ON eventgroup.id = e3.eventgroup
WHERE visittime < DATE_ADD(step1_time ,INTERVAL +1 day) AND (eventgroup.name = 'OrderConfirmationType1')
GROUP BY e3.userid
)
SELECT s.devicetype AS devicetype,
COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL THEN funnel_conversions.step1_userid ELSE NULL END) AS step1_count,
COUNT(DISTINCT CASE WHEN funnel_conversions.step2 IS NOT NULL THEN funnel_conversions.step2_userid ELSE NULL END) AS step2_count,
COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL THEN funnel_conversions.step3_userid ELSE NULL END) AS step3_count,
COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL THEN funnel_conversions.step3_userid ELSE NULL END)
/ COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL THEN funnel_conversions.step1_userid ELSE NULL END) AS step3_rate
FROM (
SELECT e1.step1_time AS step1_time, e1.userid AS userid, e1.userid AS step1_userid, e2.userid AS step2_userid,e3.userid AS step3_userid,
e1.sessionid AS step1_sessionid, step1, step2, step3
FROM e1 LEFT JOIN e2 ON e1.userid=e2.userid LEFT JOIN e3 ON e2.userid=e3.userid ) funnel_conversions
LEFT JOIN sessions s ON funnel_conversions.step1_sessionid = s.id
GROUP BY s.devicetype
This SQL is very convoluted and hard to understand; we show it only to give readers a feel for it. And this is just three steps: every additional step requires yet another subquery, to the point where the statement becomes too long to list here. For SQL this complex, it is hard to see any way to optimize it. In an actual run on a Medium (4-node) Snowflake cluster, this statement failed to return a result within 3 minutes, and the user had to give up.
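Yet the underlying algorithm is not inherently complicated: collect each user's events, then walk through them once in time order, recording when each successive step is first reached within the time window. The sketch below is a heavily simplified illustration of that idea (it ignores the session matching and earliest-time selection that the real computation needs; the Event shape is hypothetical):

```java
import java.util.List;

public class Funnel {
    // Hypothetical event shape: step is 1, 2 or 3; time is in milliseconds.
    record Event(int step, long time) {}

    // One pass over one user's time-ordered events. Each step counts only if
    // it comes after the previous step and within `window` ms of step 1.
    static int reachedSteps(List<Event> events, long window) {
        long step1Time = 0;
        long prevTime = 0;
        int reached = 0;
        for (Event e : events) {
            if (reached == 0) {
                if (e.step() == 1) {
                    step1Time = e.time();
                    prevTime = e.time();
                    reached = 1;
                }
            } else if (e.step() == reached + 1
                    && e.time() > prevTime
                    && e.time() < step1Time + window) {
                prevTime = e.time();
                reached++;
                if (reached == 3) break;   // funnel complete for this user
            }
        }
        return reached;   // 0..3: how far this user got in the funnel
    }
}
```

Counting how many users reach each step, grouped by device type, then yields the step counts and conversion rate in a single traversal of the data. The difficulty is not inventing this algorithm but expressing it, which SQL cannot do.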
So we cannot, and should not, rely on database optimization; we still have to write high-performance algorithms ourselves.
Then what kind of programming language can be used to write the algorithms described above?
Many: C++, Java, and so on. In theory, SQL stored procedures could do it too, but in practice the result may still be too slow owing to the lack of discreteness. For the TopN problem, for example, SQL would need temporary tables to hold the small 10-member sets, which is both hard to write and very slow.
Then why not abandon SQL and just write it in C++, Java, and so on?
This brings us back to the set orientation discussed earlier. These languages lack set orientation and are cumbersome to write in. For in-group TopN, once you account for various data types, multiple grouping keys, and possibly a computed expression as the comparison value, even a few thousand lines of code may not be enough. Funnel analysis, a complex operation that also has to deal with big data storage, can easily run to tens of thousands of lines. The development cost is extremely high, and future maintenance is exhausting.
So in practice this approach is not workable.
This is why being able to write simple code matters so much: shorter code means less workload, and easier code means more programmers can write it. From this perspective, writing simply and running fast are the same thing: running fast means having a programming language in which high-performance algorithms are easy to write, and therefore practical to adopt.
By this standard, perhaps only esProc SPL qualifies: it offers both set orientation and discreteness, plus a considerable library of built-in high-performance algorithms, so code can be written simply and run fast.
The top 10 problem written in SPL is as follows:
Orders.groups(;top(10;-Amount))
Orders.groups(Area;top(10;-Amount))
SPL treats TopN as an aggregation, so there is no sorting keyword in these statements: no large sort is performed, and the fast algorithm described earlier is used instead. Moreover, the in-group top-10 is written almost identically to the whole-set top-10, differing only in the added grouping key. This, too, is the result of supporting discreteness on top of set orientation.
The funnel analysis is written in SPL as follows:
| | A |
|---|---|
| 1 | =now() |
| 2 | =eventgroup=file("eventgroup.btx").import@b() |
| 3 | =devicetype=file("devicetype.btx").import@b() |
| 4 | =long(elapse(arg_date,-14)) |
| 5 | =long(arg_date) |
| 6 | =long(arg_date+1) |
| 7 | =A2.(case(NAME,"SiteVisit":1,"ProductDetailPage":2,"OrderConfirmationType1":3;null)) |
| 8 | =file("defined_events.ctx").open() |
| 9 | =A8.cursor@m(USERID,SESSIONID,VISITTIME,EVENTGROUPNO;VISITTIME>=A4 && VISITTIME<A6,EVENTGROUPNO:A7:#) |
| 10 | =sessions=file("sessions.ctx").open().cursor@m(USERID,ID,DEVICETYPENO;;A9) |
| 11 | =A9.joinx@m(USERID:SESSIONID,A10:USERID:ID,DEVICETYPENO) |
| 12 | =A11.group(USERID) |
| 13 | =A12.new(~.align@a(3,EVENTGROUPNO):e,e(1).select(VISITTIME<A5).group@u1(VISITTIME):e1,e(2).group@o(SESSIONID):e2,e(3):e3) |
| 14 | =A13.run(e=join@m(e1:e1,SESSIONID;e2:e2,SESSIONID).select(e2=e2.select(VISITTIME>e1.VISITTIME && VISITTIME<e1.VISITTIME+86400000).min(VISITTIME) ) ) |
| 15 | =A14.run(e0=e1.id(DEVICETYPENO),e1=e.min(e1.VISITTIME),e2=e.min(e2),e=e.min(e1.SESSIONID),e3=e3.select(SESSIONID==e && VISITTIME>e2 && VISITTIME<e1+86400000).min(VISITTIME),e=e0) |
| 16 | =A15.news(e;~:DEVICETYPE,e2,e3) |
| 17 | =A16.groups(DEVICETYPE;count(1):STEP1_COUNT,count(e2):STEP2_COUNT,count(e3):STEP3_COUNT,null:STEP3_RATE) |
| 18 | =A17.run(DEVICETYPE=devicetype.m(DEVICETYPE).DEVICETYPE,STEP3_RATE=STEP3_COUNT/STEP1_COUNT) |
| 19 | =interval@s(A1,now()) |
This code is shorter and more flexible than the earlier SQL, and it stays essentially the same even when the funnel has more steps. In actual tests, it returned a result on a single server within 10 seconds.
With the set orientation and discreteness of SPL, it is possible to write simply and run fast.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code