-
Notifications
You must be signed in to change notification settings - Fork 335
Performance Optimization Skill:First half Ordered Sorting
When performing sorting operation on the data set, the following situation sometimes may occur: when the data set T is already ordered by field “a” but not by field “b”, we want to sort T by fields “a”, “b”, and this is what we call the first-half ordered sorting (“a” is ordered). For this case, we come up with an optimization technique to speed up the sorting: first retrieve a group of records with the same “a” value from T and sort them by field “b”; then get next group of records with the same “a” value in turn and do the sorting by “b”; and continue the above actions until all the records in T are retrieved and sorted. With this method, we don’t need to sort the records in T all at once, instead, only a group of them is retrieved each time, which only requires the memory to be able to hold a small group of records.
However, it’s a pity that SQL doesn’t support this algorithm, and it always requires the full sorting on all records. While SPL is in support of this algorithm, so we’ll test it and compare it with the sorting in Oracle.
The test computer has two Intel2670 CPUs, 2.6G frequency, 16 cores in total, 64G memory and an SSD hard disk, where a 16-core virtual machine with 8G memory is set for testing.
On the virtual machine, we create data table salesman1 which consists of two fields, area (strings) and salesman (strings), and generate 400 million rows of records sorted by field “area” in ascending order. The field “area” has 2,000 distinct values with each “area” value corresponding to 200,000 “salesman” values. Then we import the data table into Oracle database and use it to generate an esProc SPL composite table for testing.
Then we create another same-structure data table salesman2 for big data tests. This table has 2 billion rows of records and 4,000 distinct values in “area” field where each value corresponds to 500,000 “salesman” values.
All the tests is required to sort the data table by both “area” and “salesman”.
SQL query statement:
select area, salesman from salesman1 order by area, salesman
This simple SQL statement is enough for the sorting. However, it may take extremely long to output all the sorting result, so we rewrite the SQL statement to output only the row in the middle position, i.e., the row with 200 million as its row number, for the purpose of decreasing the output amount. In this case, only the time of sorting process is counted:
select area, salesman from (
select area, salesman, rownum rn from (
select area, salesman from salesman1 order by area, salesman
)
) where rn=200000000;
In addition, the no-business significant query is only written to force database to perform the full sorting and avoid the output time.
SPL script:
A | |
---|---|
1 | =now() |
2 | =file("/home/ctx/salesman1.ctx").open().cursor(area,salesman) |
3 | =A2.group@qs(area;salesman) |
4 | =A3.skip(199999999) |
5 | =A3.fetch(1) |
6 | =interval@s(A1,now()) |
The @s option in group@qs is used to sort the data set only rather than grouping it; the @q option means the data set is already ordered by the sorting expression before semicolon (“area”) and enables the first-half ordered sorting with the expression after semicolon (“salesman”).
SQL query statement:
select area, salesman from (
select area, salesman, rownum rn from (
select area, salesman from salesman2 order by area, salesman
)
) where rn=1000000000;
The query outputs the row with 1 billion as its row number.
SPL script:
A | |
---|---|
1 | =now() |
2 | =file("/home/ctx/salesman2.ctx").open().cursor(area,salesman) |
3 | =A2.group@qs(area;salesman) |
4 | =A3.skip(999999999) |
5 | =A3.fetch(1) |
6 | =interval@s(A1,now()) |
Test results (Unit: second):
Data size | 400 million rows of records | 2 billion rows of records |
---|---|---|
Oracle | 326 | 2556 |
SPL | 186 | 1266 |
According to the test results, the execution time of the SPL first-half ordered sorting algorithm is 60% of that of full sorting in Oracle when there are 400 million rows of data and only 50% of that with 2 billion rows of data, which shows better performance. In conclusion, the larger the data size is, the more effective the first-half ordered sorting algorithm is in enhancing the performance.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code