Skip to content

SPL:Parallel data retrieval skill

esProcSPL edited this page Jun 14, 2024 · 1 revision

Java applications must retrieve data from databases through JDBC. Sometimes we find that the database is not heavily burdened and SQL is quite simple, but the speed of retrieving is still slow when the data is big enough. This is due to the poor performance of JDBC itself. SPL provides parallel executing, which can take advantage of multi-CPUs to retrieve data at the same time, improving the efficiency of retrieving data.

1. parallel retrieving data in single table

There is a CUSTOMER table in the ORACLE database with 15 million rows of data records in eight fields, where C_CUSTKEY is the natural numbers starting from 1.

The SPL script for parallel retrieval of data is as follows:

A B
1 =now() //record the starting time
2 =connect("oracle").query@1x("SELECT COUNT(*) FROM CUSTOMER")
3 >n=12 //set the number of parallel based on the physical CPU cores of the computer
4 =n.(range(1,A2+1,~:n)) //segment by the total number of records and the number of parallel, and record the beginning numbers of this segment and the next segment
5 fork A4 =connect("oracle")
6 =B5.query@x("SELECT * FROM CUSTOMER WHERE C_CUSTKEY>=? AND C_CUSTKEY<?",A5(1),A5(2))
7 =A5.conj() //concatenate the retrieval results
8 =interval@s(A1,now()) //calculate the execution time

In parallel retrieval, the source data is divided into several relatively average segments. In this case, C_CUSTKEY is natural numbers starting from 1, thus we can first find the total number of records (A2) and then divide it averagely into n segments (A4). Parallel operation of A5 is then performed (notice: each thread must connect to the database individually, otherwise the database will execute multiple queries on the same connection in a serial manner and the speed will not be quickened), and take the segments as parameters in order to place C_CUSTKEY in the segment to execute the SQL query. Finally, concatenate the retrieval results of each thread as the final result.

Sometimes it cannot be segmented like this, and it may be very slow to calculate count(*) first (if a WHERE condition exists), resulting in a longer total time.

In practice, other appropriate methods can be taken to set WHERE conditions in order to obtain a relatively average segment. If C_CUSTKEY is natural numbers, whether it starts with 1 or not, or whether it is continuous or not, we can segment them by the remainder of the parallel number, with remainders such as 0, 1, 2, ..., n-1 being one segment each, and the script is as follows:

A B
1 =now() //record the starting time
2 >n=12 //set the number of parallel based on the physical CPU cores of the computer
3 fork to(n) =connect("oracle")
4 =B3.query@x("SELECT * FROM CUSTOMER WHERE MOD(C_CUSTKEY,?)=?", n, A3-1)
5 =A3.conj() //concatenate the retrieval results
6 =interval@s(A1,now()) //calculate the execution time

The number of parallel threads in the fork is not necessarily equal to the number of physical CPU cores in the computer, which can be larger. The number of concurrent threads does not exceed the number of cores, and excessive threads can only be executed when some threads finish executing and free a core. When the segment data are not evenly distributed, some segment data can be executed quickly due to their small data, so that the waiting threads can execute as early as possible to achieve load balance.

2. parallel retrieving data in multiple tables

Parallel retrieval skills can also be used to execute multiple SQL statements to retrieve data. For example, retrieve data of five tables from the ORACLE database, and the script that does not retrieve data in parallel is as follows:

A B
1 SELECT * FROM SUPPLIER
2 SELECT * FROM PART
3 SELECT * FROM CUSTOMER
4 SELECT * FROM PARTSUPP
5 SELECT * FROM ORDERS
6 =now() //record the starting time
7 =connect("oracle") //connect to the database
8 =[A1:A5].(A7.query(~)) //execute each SQL in turn
9 >A7.close() //close the database connection
10 =interval@s(A6,now()) //calculate the execution time

Instead, retrieve data in parallel. The script at the beginning of the 7th row reads as follows:

A B
7 fork [A1:A5] =connect("oracle")
8 =B7.query@x(A7)
9 =interval@s(A6,now()) //calculate the execution time

Although parallel retrieving in multiple tables can not guarantee the even distribution of data, the performance of data retrieval can still be effectively improved.

Clone this wiki locally