SPL computing performance test series: in-group accumulation
This task originates from the calculation of the age (storage time) of products in the ERP system of a large electrical appliance enterprise.
The original data consist of the following pair of tables in a primary-sub relationship:
Store:
Field name | Data type | Field description |
---|---|---|
store | String | Store ID |
product | Integer | Product ID |
total | Integer | Current stock |
Inbound:
Field name | Data type | Field description |
---|---|---|
store | String | Store ID |
product | Integer | Product ID |
indate | Date | Inbound date |
num | Integer | Inbound number |
Now we need to calculate the age (the number of days from the inbound date to an end date) of the products received within a time period (usually 90 days) before a specified end date:
Inbound_age:
Field name | Data type | Field description |
---|---|---|
store | String | Store ID |
product | Integer | Product ID |
indate | Date | Inbound date |
num | Integer | Inbound number |
age | Integer | Storage time |
There are four details to know about the original data and the computing process (taking a 90-day time period as an example):
- Products are shipped out in chronological order of receipt; that is, the earlier a batch is received, the earlier it is shipped out;
- When the total inbound number in the last 90 days is less than the actual stock, the age of the extra products is categorized as “over 90 days”;
- When the total inbound number in the last 90 days exceeds the actual stock, it means some of the products received in the past 90 days have already been shipped out; their records should be removed and their age is no longer calculated;
- If the number of a certain product put into the store on a certain day is m, and n of them have been shipped out, then the number recorded in Inbound_age should be m-n.
Now let’s show the calculation details with actual data:
Store:
store | product | total |
---|---|---|
store_id_000001 | 1 | 700 |
store_id_000002 | 1 | 1346 |
store_id_000003 | 1 | 850 |
Inbound:
store | product | indate | num |
---|---|---|---|
store_id_000001 | 1 | 2016-09-07 | 100 |
store_id_000001 | 1 | 2016-08-13 | 200 |
store_id_000003 | 1 | 2016-09-23 | 300 |
store_id_000003 | 1 | 2016-09-15 | 200 |
store_id_000003 | 1 | 2016-08-25 | 400 |
store_id_000003 | 1 | 2016-08-07 | 100 |
The end date is set as 2016-09-30 and the time period is 90 days. The calculated Inbound_age:
store | product | indate | num | age |
---|---|---|---|---|
store_id_000001 | 1 | 2016-09-07 | 100 | 23 |
store_id_000001 | 1 | 2016-08-13 | 200 | 48 |
store_id_000001 | 1 | 2016-07-01 | 400 | 91 |
store_id_000003 | 1 | 2016-09-23 | 300 | 7 |
store_id_000003 | 1 | 2016-09-15 | 200 | 15 |
store_id_000003 | 1 | 2016-08-25 | 350 | 36 |
In this table, there is a record whose indate is 2016-07-01, and the recorded age is 91, indicating that there are products with an age of over 90 days in current stock.
Let's take a look at the calculation process of two groups of data:
1. For the product with store=store_id_000001 and product=1, the stock is 700 and there are 2 inbound records:
Inbound date: 2016-09-07, age: 23 days, inbound number: 100; remaining stock: 700 - 100 = 600
Inbound date: 2016-08-13, age: 48 days, inbound number: 200; remaining stock: 600 - 200 = 400
The remaining 400 products in stock were received before 2016-07-01, so their age is over 90 days.
2. For the product with store=store_id_000003 and product=1, the stock is 850 and there are 4 inbound records:
Inbound date: 2016-09-23, age: 7 days, inbound number: 300; remaining stock: 850 - 300 = 550
Inbound date: 2016-09-15, age: 15 days, inbound number: 200; remaining stock: 550 - 200 = 350
Inbound date: 2016-08-25, age: 36 days, inbound number: 400, of which 50 have been shipped out; the remaining 350 are counted
The 100 products received on 2016-08-07 have all been shipped out, so their age is no longer calculated.
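To make these rules concrete, here is a minimal plain-Python sketch (our own illustration, not the test code; the function and variable names are assumptions) that reproduces the store_id_000003 group from the table above:

```python
# Plain-Python sketch of the four rules above, applied to the store_id_000003 group:
# stock 850, end date 2016-09-30, and 91 used to represent "over 90 days".
from datetime import date

def inbound_age(stock, inbounds, end_date, period_start, over_age):
    """inbounds: (indate, num) pairs sorted by inbound date, newest first."""
    result, accu = [], 0
    for indate, num in inbounds:
        if accu >= stock:                      # newer batches already cover the stock:
            break                              # this batch was fully shipped out
        kept = min(num, stock - accu)          # part of this batch may have been shipped out
        result.append((indate, kept, (end_date - indate).days))
        accu += num
    if accu < stock:                           # stock not covered by the window: "over 90 days"
        result.append((period_start, stock - accu, over_age))
    return result

rows = [(date(2016, 9, 23), 300), (date(2016, 9, 15), 200),
        (date(2016, 8, 25), 400), (date(2016, 8, 7), 100)]
print(inbound_age(850, rows, date(2016, 9, 30), date(2016, 7, 1), 91))
# -> nums 300, 200, 350 with ages 7, 15, 36, matching the table above
```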
We select two relational databases to compare with SPL:
1. Oracle, which is often used as a benchmark in database performance tests; Oracle 19 is tested.
2. Starrocks, which claims to be a faster OLAP database; Starrocks 2.5.2 is tested.
On the SPL side, SPL Community Edition (version 20230528) is tested.
Test environment: one physical server with the following configuration:
2 × Intel 3014 CPUs, main frequency 1.7GHz, 12 cores in total, 64GB memory, SSD (solid state drive)
To examine the sensitivity of these technologies to CPU and memory, we install two virtual machine environments on the physical server and test them separately. Virtual machine configurations:
- VM1: 8 CPUs, 32G memory
- VM2: 4 CPUs, 16G memory
For Starrocks, at least two nodes, a BE and an FE, need to be installed. The BE, which undertakes the computing tasks, is installed on the VM, while the FE, which undertakes the management tasks, is installed on the physical machine so that it does not affect the test results. SPL and Oracle only need to be installed on the VM.
Use generateData.splx to generate the store and inbound tables:
A | B | C | D | E | |
---|---|---|---|---|---|
1 | 1000 | 100 | 1500 | ||
2 | 200000000 | 20 | 60 | 10 | 100 |
3 | =date(2015,10,1) | =date(2016,9,30) | |||
4 | |||||
5 | =create(store,product,total) | ||||
6 | =file("e:/store.ctx").create@y(#store,#product,total) | ||||
7 | =create(store,product,indate,num) | ||||
8 | =file("e:/inbound.ctx").create@y(#store,#product,#indate,num) | ||||
9 | |||||
10 | func genOne (store,product) | ||||
11 | ="store_id_"+mid(string(store+1000000),2) | ||||
12 | >A5.insert(0,B11,product,B1+rand(C1-B1)) | ||||
13 | =B2+rand(C2-B2) | =to(1000).(rand(interval(B3,D3))).id@u().m(to(B13)).sort() | |||
14 | for B13 | =elapse(B3,C13(B14)) | =D2+rand(E2-D2) | ||
15 | >A7.insert(0,B11,product,C14,D14) | ||||
16 | >C3 = C3+1 | ||||
17 | |||||
18 | for 1000000 | for A1 | =func(genOne,A18,B18) | ||
19 | if (C3 >= A2) | >A6.append(A5.cursor()), A8.append(A7.cursor()) | |||
20 | break A18 | ||||
21 | >A6.append(A5.cursor()) | >A5.reset() | |||
22 | >A8.append(A7.cursor()) | >A7.reset() | |||
23 | >A6.close() | >A8.close() | |||
24 | |||||
25 | =file("e:/store.csv").export@c(file("e:/store.ctx").open().cursor()) | ||||
26 | =file("e:/inbound.csv").export@c(file("e:/inbound.ctx").open().cursor()) |
The first three lines define the data scale and the value ranges of the fields in the two tables:
A1 defines the range of product IDs; the current stock is generated randomly in the range B1~C1;
A2 defines the total number of records in inbound;
The number of inbound records for each product of each store is generated randomly between B2~C2, and the inbound number of each record is generated randomly between D2~E2;
The third line defines the start and end dates; the inbound date is generated randomly between them (B3~D3);
A5 and A7 define the data structures of the two tables; A6 and A8 create the files (composite tables) that store them;
A10 defines the genOne function, which takes the values of the two primary key fields (store and product) and generates one primary table (store) record together with the sub-table (inbound) records associated with it (a plain-Python sketch of this per-group logic is shown below);
Line 18 performs a cross combination of the two primary key fields and calls genOne to generate each group of primary-sub table data;
A25 and A26 convert the two tables from composite table format (ctx) to text format (csv).
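For readers unfamiliar with SPL, the following plain-Python sketch (the names and structure are our own assumptions, not the actual generateData.splx code) mirrors the per-group generation logic described above, using the parameter values from the first three lines:

```python
# Plain-Python sketch of the per-group data generation (illustrative only):
# one store record plus a random number of inbound records with unique random dates.
import random
from datetime import date, timedelta

START, END = date(2015, 10, 1), date(2016, 9, 30)

def gen_one(store_id, product_id):
    store_rec = (store_id, product_id, random.randint(100, 1500))   # current stock, B1~C1
    n = random.randint(20, 60)                                      # inbound records, B2~C2
    days = random.sample(range((END - START).days + 1), n)          # unique inbound dates
    inbound = [(store_id, product_id, START + timedelta(days=d), random.randint(10, 100))
               for d in sorted(days)]                               # inbound number, D2~E2
    return store_rec, inbound

store_rec, inbound = gen_one("store_id_000001", 1)
print(store_rec, len(inbound))
```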
The actual scale of the generated test data is as follows:
Store: 5.06 million records, size of composite table: 33M, size of text file (csv): 123M;
Inbound: 200 million records. When the time period is 1 year, the size of composite table is 604M, and the size of text file (csv) is 6.8G;
When calculating the age of three months of data (2016-07-01~2016-09-30), the result set has 45 million records;
SPL calculates the age based on the composite tables. The data in text format are imported into Oracle and Starrocks, which calculate the age in SQL. The Oracle SQL:
CREATE TABLE inbound_age AS
WITH t1 AS (
SELECT d.store,d.product,
d.indate,
d.num,
SUM(d.num) OVER (PARTITION BY d.store,d.product ORDER BY d.indate DESC) AS accu_num,
ROW_NUMBER()OVER(PARTITION BY d.store,d.product ORDER BY d.indate DESC) rn,
c.total
FROM inbound d JOIN store c ON d.store = c.store AND d.product = c.product
WHERE d.indate>TO_DATE('2016-07-01','YYYY-MM-DD') AND d.indate<=TO_DATE('2016-09-30','YYYY-MM-DD')
),
t11 AS (
SELECT t1.* FROM
t1 JOIN (
SELECT store,product,MAX(rn) max_rn
FROM t1 WHERE accu_num - num<total GROUP BY store,product
)t2
ON t1.store = t2.store AND t1.product = t2.product AND t1.rn = t2.max_rn
),
t_result AS (
SELECT t.*,
CASE WHEN over90=0
THEN ROUND(TO_NUMBER(TO_DATE('2016-09-30','YYYY-MM-DD') - indate))
ELSE 91 END AS age
FROM (
SELECT store,product,indate,
total-accu_num AS num,1 AS over90
FROM t11 WHERE total>accu_num
UNION
SELECT store,product,indate,
CASE WHEN total<accu_num
THEN num-(accu_num-total) ELSE num END AS num,
0 AS over90
FROM t1 WHERE accu_num - num<total
)t ORDER BY store,product,indate DESC
)
SELECT /*+ PARALLEL (t_result,8) */ * FROM t_result
This code uses the CREATE TABLE AS SELECT ... statement to store the calculation result into inbound_age. The calculation process is somewhat complex, so WITH subqueries are used to define three intermediate tables. The final main query sets the parallel number to 8 (since VM2 has a 4-core CPU, the parallel number is changed to 4 in that test).
After calculating the age of three months of inbound data, the result set t_result contains 45 million records, and writing such a large result set back to hard disk is time-consuming. To isolate the time spent on the write-back, the SQL is also run in a modified form whose last line only counts the records in the result set, which almost eliminates the write-back:
…
SELECT /*+ PARALLEL (t_result,8) */ COUNT(*) cnt FROM t_result
The full inbound data cover a whole year, with a total of 200 million records. The above test calculates only the data of the past 3 months (approximately 50 million records), producing a result set of 45 million records.
If we change the time period to the full year (2015-10-01 ~ 2016-09-30), all 200 million records are included in the calculation and the result set grows to 77 million records. The date-related parameters in the SQL code are changed accordingly:
…
WHERE d.indate>TO_DATE('2015-10-01','YYYY-MM-DD') AND d.indate<=TO_DATE('2016-09-30','YYYY-MM-DD')
…
ELSE 365 END AS age
…
Here the ELSE branch indicates that the age is one year or more.
The corresponding SQL for Starrocks:
CREATE TABLE inbound_age AS
WITH t1 AS (
SELECT d.store,d.product,
d.indate,
d.num,
SUM(d.num) OVER (PARTITION BY d.store,d.product ORDER BY d.indate DESC) AS accu_num,
ROW_NUMBER()OVER(PARTITION BY d.store,d.product ORDER BY d.indate DESC) rn,
c.total
FROM inbound d JOIN store c ON d.store = c.store AND d.product = c.product
WHERE d.indate> '2016-07-01' AND d.indate<= '2016-09-30'
),
t11 AS (
SELECT t1.* FROM
t1 JOIN (
SELECT store,product,MAX(rn) max_rn
FROM t1 WHERE accu_num - num<total GROUP BY store,product
)t2
ON t1.store = t2.store AND t1.product = t2.product AND t1.rn = t2.max_rn
),
t_result AS (
SELECT t.*,
CASE WHEN over90=0
THEN DATEDIFF('2016-09-30', indate)
ELSE 91 END AS age
FROM (
SELECT store,product, indate,
total-accu_num AS num,1 AS over90
FROM t11 WHERE total>accu_num
UNION
SELECT store,product,indate,
CASE WHEN total<accu_num
THEN num-(accu_num-total) ELSE num END AS num,
0 AS over90
FROM t1 WHERE accu_num - num<total
) t ORDER BY store,product,indate DESC
)
SELECT * FROM t_result
There is no need to set the parallel number for Starrocks, as it controls parallelism automatically.
As with Oracle, the last line of the SQL code is modified to only count the records:
…
SELECT COUNT(*) cnt FROM t_result
For the full-year test, modify the date-related parameters in the SQL code:
…
WHERE d.indate>'2015-10-01' AND d.indate<='2016-09-30'
…
ELSE 365 END AS age
…
Again, the ELSE branch indicates that the age is one year or more.
The SPL script that calculates the age:
A | B | |
---|---|---|
1 | =now() | |
2 | =date(2016,7,1) | |
3 | =date(2016,9,30) | |
4 | =file("e:/store.ctx").open().cursor@m(;;8) | |
5 | =file("e:/inbound.ctx").open().cursor(;indate>A2&&indate<=A3;A4) | |
6 | =A5.joinx@m(store:product,A4:store:product,total,0:age, 0:accu) | |
7 | =A6.group@o(store,product).(~.rvs()) | |
8 | =A7.(~.run(cum(num;store,product):accu,num-if(accu>total,(accu-total),0):num)) | |
9 | =A8.(last=~.m(-1),~=if(last.accu>=last.total,~.select(num>0),~.derive@o().insert(0,last.store,last.product,A2,last.total-last.accu,last.total,0))) | |
10 | =A9.(~.run(interval(indate,A3):age)) | |
11 | =A8.conj().new(store,product,indate,num,age) | |
12 | fork A11 | =file("e:/age_"/#A12/".btx").export@b(A12) |
13 | =interval@ms(A1,now()) |
The data of the two tables are stored in composite tables ordered by the primary keys (store and product). A4 and A5 open the composite tables as cursors, so data loading and the subsequent calculation proceed simultaneously. A4 specifies the parallel number as 8 (since VM2 has a 4-core CPU, the parallel number should be changed to 4 in that test). The cursor in A5 loads only the three months of data through the filter condition, and its segmentation is synchronized with the cursor in A4.
A6: perform an order-based merge join on the two cursors to attach the total field of store to the inbound records (a conceptual sketch of this merge join is given after this list);
A7: form the groups for the age calculation and reverse each group so that it is ordered by inbound date in descending order;
A8: calculate the in-group accumulated number and the effective number of products still in stock;
A9: based on the comparison between the accumulated number and the stock, remove the redundant inbound records and append the over-90-days record (its inbound date is set to 2016-07-01 and its calculated age is 91 days, representing products stored for more than 90 days);
A10: calculate the age;
A11: concatenate the small result sets of each group with the conj() function, and generate the final large result set with the new() function.
Line 12 uses the parallel code block fork to save the parallel multicursor as multiple bin files, which remain in order. The calculations before line 12 are all defined on cursors and are not executed immediately; the whole chain of calculations is triggered only when the result is exported in line 12.
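As a rough illustration of the order-based merge join performed in A6, here is a conceptual plain-Python sketch (our own naming and structure, not how esProc implements it). Because both cursors are sorted by store and product, the total field can be attached to the inbound rows in a single sequential pass, without building a hash table:

```python
# Conceptual sketch of an order-based merge join of two streams that are both
# sorted by (store, product); each side is read exactly once, so memory use stays flat.
def merge_join(inbound_rows, store_rows):
    """inbound_rows: (store, product, indate, num); store_rows: (store, product, total)."""
    store_iter = iter(store_rows)
    cur = next(store_iter, None)
    for store, product, indate, num in inbound_rows:
        # advance the store stream until its key catches up with the inbound key
        while cur is not None and (cur[0], cur[1]) < (store, product):
            cur = next(store_iter, None)
        total = cur[2] if cur and (cur[0], cur[1]) == (store, product) else None
        yield store, product, indate, num, total
```

The in-group accumulation in A8 then only needs a running sum within each (store, product) group, which is what gives this test its name.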
As with Oracle and Starrocks, to skip exporting the detailed data and only count the records, remove the fork code block in line 12 (A12/B12) and change A12 to count the result cursor with skip():
A | |
---|---|
12 | =A11.skip() |
For the full-year test, modify the start date in A2:
A | |
---|---|
2 | =date(2015,10,1) |
Test results (unit: seconds):
  | SPL (VM1) | Oracle (VM1) | Starrocks (VM1) | SPL (VM2) | Oracle (VM2) | Starrocks (VM2) |
---|---|---|---|---|---|---|
90-day data, store result | 10 | 181 | 92 | 22 | 252 | Out of memory (OOM) |
90-day data, count only | 9 | 92 | 63 | 15 | 167 | OOM |
Full data, store result | 30 | 451 | OOM | 48 | 800 | OOM |
Full data, count only | 23 | 295 | OOM | 43 | 601 | OOM |
1. SPL can adopt higher-performance algorithms, such as the order-based merge join, to avoid costly association, while SQL can only use inefficient JOIN operations whose performance depends heavily on the optimizer. For complex operations like the one in this test, the optimizer usually fails. In similar scenarios, SPL runs more than 10 times faster than Oracle SQL.
2. The result set is relatively large, so writing it back to hard disk takes considerable time. Since SPL saves the result set as bin files, the write-back is fast and has little impact on performance, whereas the performance of Oracle and Starrocks decreases significantly.
3. As a professional OLAP database, Starrocks adopts CPU optimization technology and performs better than Oracle. However, Starrocks still has to use inefficient algorithms, so its computing performance remains far behind SPL. Moreover, the CPU optimization technology relies heavily on in-memory computing and cannot function normally when the memory capacity is small or the data amount is large.