-
Notifications
You must be signed in to change notification settings - Fork 332
SPL:Parallel data retrieval skill
Java applications must retrieve data from databases through JDBC. Sometimes we find that the database is not heavily burdened and SQL is quite simple, but the speed of retrieving is still slow when the data is big enough. This is due to the poor performance of JDBC itself. SPL provides parallel executing, which can take advantage of multi-CPUs to retrieve data at the same time, improving the efficiency of retrieving data.
There is a CUSTOMER table in the ORACLE database with 15 million rows of data records in eight fields, where C_CUSTKEY is the natural numbers starting from 1.
The SPL script for parallel retrieval of data is as follows:
A | B | |
---|---|---|
1 | =now() | //record the starting time |
2 | =connect("oracle").query@1x("SELECT COUNT(*) FROM CUSTOMER") | |
3 | >n=12 | //set the number of parallel based on the physical CPU cores of the computer |
4 | =n.(range(1,A2+1,~:n)) | //segment by the total number of records and the number of parallel, and record the beginning numbers of this segment and the next segment |
5 | fork A4 | =connect("oracle") |
6 | =B5.query@x("SELECT * FROM CUSTOMER WHERE C_CUSTKEY>=? AND C_CUSTKEY<?",A5(1),A5(2)) | |
7 | =A5.conj() | //concatenate the retrieval results |
8 | =interval@s(A1,now()) | //calculate the execution time |
In parallel retrieval, the source data is divided into several relatively average segments. In this case, C_CUSTKEY is natural numbers starting from 1, thus we can first find the total number of records (A2) and then divide it averagely into n segments (A4). Parallel operation of A5 is then performed (notice: each thread must connect to the database individually, otherwise the database will execute multiple queries on the same connection in a serial manner and the speed will not be quickened), and take the segments as parameters in order to place C_CUSTKEY in the segment to execute the SQL query. Finally, concatenate the retrieval results of each thread as the final result.
Sometimes it cannot be segmented like this, and it may be very slow to calculate count(*) first (if a WHERE condition exists), resulting in a longer total time.
In practice, other appropriate methods can be taken to set WHERE conditions in order to obtain a relatively average segment. If C_CUSTKEY is natural numbers, whether it starts with 1 or not, or whether it is continuous or not, we can segment them by the remainder of the parallel number, with remainders such as 0, 1, 2, ..., n-1 being one segment each, and the script is as follows:
A | B | |
---|---|---|
1 | =now() | //record the starting time |
2 | >n=12 | //set the number of parallel based on the physical CPU cores of the computer |
3 | fork to(n) | =connect("oracle") |
4 | =B3.query@x("SELECT * FROM CUSTOMER WHERE MOD(C_CUSTKEY,?)=?", n, A3-1) | |
5 | =A3.conj() | //concatenate the retrieval results |
6 | =interval@s(A1,now()) | //calculate the execution time |
The number of parallel threads in the fork is not necessarily equal to the number of physical CPU cores in the computer, which can be larger. The number of concurrent threads does not exceed the number of cores, and excessive threads can only be executed when some threads finish executing and free a core. When the segment data are not evenly distributed, some segment data can be executed quickly due to their small data, so that the waiting threads can execute as early as possible to achieve load balance.
Parallel retrieval skills can also be used to execute multiple SQL statements to retrieve data. For example, retrieve data of five tables from the ORACLE database, and the script that does not retrieve data in parallel is as follows:
A | B | |
---|---|---|
1 | SELECT * FROM SUPPLIER | |
2 | SELECT * FROM PART | |
3 | SELECT * FROM CUSTOMER | |
4 | SELECT * FROM PARTSUPP | |
5 | SELECT * FROM ORDERS | |
6 | =now() | //record the starting time |
7 | =connect("oracle") | //connect to the database |
8 | =[A1:A5].(A7.query(~)) | //execute each SQL in turn |
9 | >A7.close() | //close the database connection |
10 | =interval@s(A6,now()) | //calculate the execution time |
Instead, retrieve data in parallel. The script at the beginning of the 7th row reads as follows:
A | B | |
---|---|---|
7 | fork [A1:A5] | =connect("oracle") |
8 | =B7.query@x(A7) | |
9 | =interval@s(A6,now()) | //calculate the execution time |
Although parallel retrieving in multiple tables can not guarantee the even distribution of data, the performance of data retrieval can still be effectively improved.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code