Skip to content

Performance optimization skill:Pre Joining

esProcSPL edited this page Aug 19, 2024 · 1 revision

I Problem introduction & solving

The JOIN performance in SQL is a long-standing problem, especially when there are a lot of tables to be joined, the performance plummets dramatically.

SQL adopts the HASH partition approach to handle JOINs, which first calculates the HASH values of the join keys and then traverses and compares the records with the same HASH values from two tables. The same computation steps are necessary for each JOIN.

If the data volume is small compared to the size of memory, we can load the data into memory in advance and create the association using the in-memory pointer mechanism. Specifically, this approach calculates HASH values and makes comparisons while loading data, and saves the associations with pointers. The subsequent calculations can directly refer to the associated records, which spares the trouble of calculating and comparing HASH values and improves the performance.

The problem is that SQL doesn’t support the pointer type of data to implement this optimization idea. Even if the data can be wholly loaded into the memory, we still can’t take advantage of the pre-association approach, which is also the same drawback of most SQL-based relational databases. On the contrary, SPL can implement the optimization thanks to its support in pointer data type.

In the following part, we’ll test SQL’s performing differences in achieving two-table join and multi-table join and then do the same calculations with SPL using the pre-association technique to make a comparison.

II Test environment

Eight data tables, 50G data in total (to be small enough to fit into the memory), have been generated according to TPCH standards. The structure of TPCH data table is vastly described online, which will not be elaborated here.

The server for testing has two Intel 2670 CPUs, 2.6G frequency, 16 cores in total, 128G memory and an SSD hard disk.

The lineitem table consists of too large data to be loaded in memory, so we create an orderdetail table of the same structure with appropriate data volume that can be loaded in memory. Hereinafter we will use this table for testing.

The following tests are calculated in single-thread without the help of multi-core to make the difference clearer.

III SQL query test

The Oracle database is used here as a representative for the SQL test, which queries the orderdetail table to get the total revenue of parts order for every year.

1. Two-table join

SQL statements:

select
       l_year,
       sum(volume) as revenue
from
       (
              select
                     extract(year from l_shipdate) as l_year,
                     (l_extendedprice * (1 - l_discount) ) as volume
              from
                     orderdetail,
                     part
              where
                     p_partkey = l_partkey
                     and length(p_type)>2
       ) shipping
group by
       l_year
order by
       l_year;

2. Six-table join

SQL statements:

select
       l_year,
       sum(volume) as revenue
from
       (
              select
                     extract(year from l_shipdate) as l_year,
                     (l_extendedprice * (1 - l_discount) ) as volume
              from
                     supplier,
                     orderdetail,
                     orders,
                     customer,
                     part,
                     nation n1,
                     nation n2
              where
                     s_suppkey = l_suppkey
                     and p_partkey = l_partkey
                     and o_orderkey = l_orderkey
                     and c_custkey = o_custkey
                     and s_nationkey = n1.n_nationkey
                     and c_nationkey = n2.n_nationkey
                     and length(p_type) > 2
                     and n1.n_name is not null
                     and n2.n_name is not null
                     and s_suppkey > 0
       ) shipping
group by
       l_year
order by
       l_year;

3. Test results

Two-table join Six-table join
Query time (s) 26 167

Both query statements contain nested queries. The auto-optimized Oracle query performs even better than a query without nested queries (The latter may calculate “group by” and “select” repeatedly).

Both results are the best-performing among multiple executions. We found that the first Oracle query execution is always the slowest, which means that the database will cache all the data into the memory (Oracle has a big cache pool) when the memory is sufficient to hold all the data. We select the fastest execution to get the computing time as pure as possible by getting rid of the external storage retrieval time.

By defining the filtering condition as always true, that is, no record is filtered out, we perform the queries over the whole orderdetail table, thus both tests have equal computing amount.

According to the test results, the six-table join is 167/26=6.4 times slower than the two-table join. Besides the external storage retrieval time, the performance degradation originates from the time-consuming table joins and simple comparisons between join key values.

In conclusion, the JOIN performance in SQL is markedly bad.

IV SPL pre-association test

1. Pre-association

SPL script:

A
1 >env(region, file(path+"region.ctx").open().memory().keys@i(R_REGIONKEY))
2 >env(nation, file(path+"nation.ctx").open().memory().keys@i(N_NATIONKEY))
3 >env(supplier, file(path+"supplier.ctx").open().memory().keys@i(S_SUPPKEY))
4 >env(customer, file(path+"customer.ctx").open().memory().keys@i(C_CUSTKEY))
5 >env(part, file(path+"part.ctx").open().memory().keys@i(P_PARTKEY))
6 >env(orders,file(path+"orders.ctx").open().memory().keys@i(O_ORDERKEY))
7 >env(orderdetail,file(path+"orderdetail.ctx").open().memory())
8 >nation.switch(N_REGIONKEY,region)
9 >customer.switch(C_NATIONKEY,nation)
10 >supplier.switch(S_NATIONKEY,nation)
11 >orders.switch(O_CUSTKEY,customer)
12 >orderdetail.switch(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier)

The first 7 lines read in seven composite tables respectively into memory to generate in-memory tables and set them as global variables. The last 5 lines establish association among the seven tables. The pre-association script will be executed at the start of the SPL server to make preparations for subsequent query.

Below is the structure of pre-associated table objects in memory, taking the orderdetail table for example:

Here only the first pre-associated record of the orderdetail table are shown in the picture, and other records are similar. Limited to the width of the page, only some fields are listed in each table.

2. Two-table join

SPL script:

A
1 =orderdetail.select(len(L_PARTKEY.P_TYPE)>2)
2 =A1.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)

3. Six-table join

SPL script:

A
1 =orderdetail.select(len(L_PARTKEY.P_TYPE)>2 && L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_SUPPKEY>0 )
2 =A1.groups(year(L_SHIPDATE):l_year;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)

The SPL script is quite simple after the tables are pre-associated. We can directly refer a field of the associated table as the sub-attribute of the referencing attribute, which makes the code easier to understand.

4. Test results

Two-table join Six-table join
Query time (s) 28 56

The six-table join query is only 2 times slower than the two-table join query due to the added computing time (spent in referring the associated field). Yet we don’t spend time on the join thanks to the pre-association.

V Summary

Test results:

Query time (s) Two-table join Six-table join Performance decline (times)
SQL 26 167 6.4
SPL pre-association 28 56 2

A six-table SQL join is 6.4 times slower than a two-table SQL join, which means that a SQL join consumes a lot of CPU, resulting in obviously bad performance. With the pre-association method, SPL six-table join is only 2 times slower, and the performance degradation is kept in a very limited range.

On the premise that the memory is sufficient enough to load all the data (the application scenario for in-memory databases), pre-association is an effective technique to improve joins query performance when there are a lot of tables involved. SPL can use the method to make a difference while relational databases, including in-memory databases, cannot adopt the optimization skill.

Clone this wiki locally