-
Notifications
You must be signed in to change notification settings - Fork 332
JOINs in SQL(JOIN Simplification and Acceleration Series 1)
JOINs are long-standing SQL headaches. As the number of tables to be joined increases, coding becomes error prone. The complicated statements for achieving JOINs make associative queries a constantly BI software weakness. Almost no BI software can provide solutions for smooth multi-table associative queries. It is also hard for SQL to optimize and increase JOIN performance when the number of tables involved increases slightly or a large volume of data is involved.
Our JOIN operations series will take a deep dive to examine the SQL pain and propose a way to simplify the JOIN syntax and enhance performance.
First let’s look at how SQL defines the JOIN operation.
The SQL JOIN definition is simple. It is the Cartesian product between two sets (tables) filtered by a specific condition. The syntax is A JOIN B ON …. In principle, the result set of performing a Cartesian product should be composed of 2-tuples whose members come from both sets. As a SQL set is a table whose members are always the records made up of fields, and as SQL does not support using a generic data type to describe a 2-tuple whose members are records, the language simply joins up fields of records from both tables to form a new set of records to return. Rather than the multiplication (Cartesian product), that is the original meaning of the name JOIN, which joins up fields of two records. But whether members of the Cartesian product are considered 2-tuples or records generated by combining fields from both tables will not affect our analysis.
The SQL JOIN definition does not stipulate the filtering condition form. Theoretically, an operation is considered a logical JOIN as long as its result set is a subset of the Cartesian product of the two original sets. Suppose there are set A={1,2} and set B={1,2,3}, the result set of A JOIN B ON A=B is {(1,1),(2,2)} and the result set of A JOIN B ON A<B is {(1,2),(1,3),(2,3)}. We call a join where the filtering condition consists of one or more equations the equi-join, and one which isn’t an equi-join the non-equi-join. For the above two join operations, the first is an equi-join and the second is a non-equi-join.
An equi-join between two data tables may have a filtering condition that contains multiple equations concatenated by AND. The syntax is A JOIN B ON A.ai=B.bi AND ..., where ai and bi are respectively fields of table A and table B. According to experience, most of the JOINs in real-world business situations are equi-joins. The non-equi-joins are rare. And on many occasions, a non-equi-join can be handled by being converted into an equi-join, so our analysis focuses on equi-joins and uses tables and records instead of sets and members in the following examples.
According to the different rules of handling null values, there are three types of equi-joins, INNER JOIN, which is the strictest equi-join, LEFT JOIN, and FULL JOIN (RIGHT JOIN can be understood as the reversed form of LEFT JOIN and thus will not be taken as a separate type). Based on the number of corresponding records (which are the 2-tuples satisfying a specific filtering condition) in two tables respectively, there are also four types of join relationships – one-to-one, one-to-many, many-to-one, and many-to-many. As these terms are always covered by SQL and database documentations, we will skip them here.
Let’s move on to take a look at JOIN implementations.
A simple way that is easy to think of is to perform hardcoded traversal without distinguishing the equi-join from the non-equi-join. Suppose table A has n records and table B has m records, the complexity of hardcoded traversal for computing A JOIN B ON A.a=B.b is n*m, that is, a total number of n*m filtering condition computations are needed.
Obviously, it is a slow algorithm. Yet that is what some reporting tools that support multiple or diverse data sources use to achieve a join. In those reporting tools, the join relationship (that is, the filtering condition in a JOIN) between data sets is split and scattered into the cell formulas and becomes invisible. Only traversal can be used to compute the joining expressions.
A fully developed database does not choose to be slow. Generally, hash join algorithm is used to achieve an equi-join. The algorithm divides the records of both associative tables into groups according to their joining keys’ hash values (the joining keys are fields on both sides of the filtering condition equation, which are A.a and B.b). Each group contains records where the key fields have same hash values. Suppose the hash value range is 1…k, it will split table A and table B respectively into k subsets, which are A1,…,Ak and B1,…,Bk. The hash value of the joining key a in records of Ai is i and The hash value of the joining key b in records of Bi is also i. We just need to traverse Ai and Bi to join up records. Since each hash value corresponds to a different field value, it is impossible to correspond a record of Ai to one of Bj if i!=j. Suppose the number of records in Ai is ni and that in Bi is mi, a total number of SUM(ni*mi) is needed to compute the filtering condition. In an ideal case when ni=n/k and mi=m/k, the overall degree of complexity will be reduced to 1/k of that of the original hardcoded traversal. This is far more efficient.
So, to speed up a join between multiple data sources in a reporting tool, it would be the best to do the join at the data preparation phrase, otherwise performance will be reduced sharply if the involved data is relatively large.
But the hash function cannot ensure an even split each time. Sometimes a rather large group appears, and there is just small effect for performance optimization. And it would be better to use a relatively simple hash function, otherwise it will take longer to calculate hash values.
When the data involved cannot fit into the memory, the database will adopt hash partition method, which is the extension of hash join algorithm. The hash partition method traverses table A and table B to divide records into multiple subsets according to their joining keys’ hash values and buffer them to an external storage, which are called heaps. Then we can perform an in-memory JOIN operation on corresponding heaps. The join will occur between corresponding heaps because different hash values correspond to different key values. The method thus transforms a JOIN between large data sets into one between pairs of smaller data sets.
It is probable that the hash function generates a particularly large heap that cannot be wholly loaded into the memory. In that case a second hash partition is needed, and a new hash function is used to hash partition the particularly large heap using the hash partition method. In view of this, multiple buffers may happen during the execution of a JOIN operation on the external storage and the computing performance is not so manageable.
Similar process for implementing a JOIN in a distributed system. The system transmits records among nodes according to joining keys’ hash values, which is called Shuffle action, and then performs JOIN on single nodes. When there are many nodes, the network transmission delay will compromise a part of the performance gain obtained through multi-node task sharing. Therefore, there is limit to the number of nodes for a distributed database system performing JOIN. When the limit is reached, more nodes will not bring better performance.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code