Skip to content

Performance optimization case course:TPCH Q20

esProcSPL edited this page Dec 3, 2024 · 1 revision
    select
        s_name,s_address
    from
        supplier,nation
    where
        s_suppkey in (
            select
                ps_suppkey
            from
                partsupp
            where
                ps_partkey in (
                    select
                        p_partkey
                    from
                        part
                    where
                        p_name like 'bisque%'
                )
                and ps_availqty > (
                    select
                        0.5 * sum(l_quantity)
                    from
                        Lineitem
                    where
                        l_partkey = ps_partkey
                        and l_suppkey = ps_suppkey
                        and l_shipdate >= date '1995-01-01'
                        and l_shipdate < date '1995-01-01' + interval '1' year
                )
        )
        and s_nationkey = n_nationkey
        and n_name = 'CHINA'
    order by
        s_name;

The main query is simple, but the nested filtering condition is complex.

The filtering condition is mainly for the S_SUPPKEY field of the supplier table. Since this table is relatively small, it is easy to get S_NAME and S_ADDRESS as long as the SUPPKEY set meeting the condition is calculated. Calculating the SUPPKEY set in steps can make the problem-solving idea clearer.

The original SQL statement is to filter supplier with partsupp. Now we modify it to filter partsupp with supplier, the obtained PS _ SUPPKEY set is the same. The partsupp is relatively large, so regarding supplier and part as dimension tables to filter the fact table partsupp is more conducive to improving performance. Calculation steps:

  1. Filter the nation table according to the condition, and then filter the supplier table with the result.

  2. Filter the part table according to the condition.

  3. Take the results of steps 1 and 2 as foreign key tables to match and filter partsupp table.

These three steps can be viewed as a view v1, expressed in SQL as follows:

    select
    	ps_suppkey
    from
    	partsupp,
    	(
    		select
    			p_partkey
    		from
    			part
    		where
    			p_name like 'bisque%'
    	),
    	(
    		select
    			s_suppkey
    		from
    			supplier,nation
    		where
    			s_nationkey = n_nationkey
    			and n_name = 'CHINA'
    	)
    where
    	ps_partkey=p_partkey
    	and ps_suppkey=s_suppkey

With v1, the original SQL statement can be modified as:

    select
        s_name,s_address
    from
        supplier,
    	(
    	    select
               distinct ps_suppkey
           from
                v1
           where
                ps_availqty > (
                    select
                        0.5 * sum(l_quantity)
                    from
                        lineitem
                    where
                        l_partkey = ps_partkey
                        and l_suppkey = ps_suppkey
                        and l_shipdate >= date '1995-01-01'
                        and l_shipdate < date '1995-01-01' + interval '1' year
                )
    	)  q_suppkey
    where s_suppkey=ps_suppkey;

Observe the subquery q_suppkey, where the calculationsof v1 and lineitem fall under the situation of associating with the primary table according to the equivalence condition. We can use the method explained in previous articles to change the calculations to a JOIN computation between v1 and lineitem. After modification, the SQL statement can be rewritten as:

    select
        s_name,s_address
    from
        supplier,
    	(
    		select
    			distinct ps_suppkey
    		from
    			v1,
    			(select
    				l_partkey,
    				l_suppkey,
    				0.5 * sum(l_quantity) quantity
    			from
    				lineitem
    			where
    				l_shipdate >= date '1995-01-01'
    				and l_shipdate < date '1995-01-01' + interval '1' year
    			group by
    				l_partkey,
    				l_suppkey
    			)
    		where
    			ps_partkey = l_partkey
    			and ps_suppkey = l_suppkey
    			and ps_availqty > quantity
    	)  q_suppkey
    where s_suppkey=ps_suppkey;

Next, we need to continue to modify the subquery q_suppkey. Since the lineitem table is very large, it is necessary to first associate it with v1 to filter out records that do not satisfy the condition, then perform grouping and aggregation, and finally filter out the same L_SUPPKEY set. Rewrite the SQL statement as:

    select
        s_name,s_address
    from
        supplier,
    	(
    		select
    			distinct l_suppkey
    		from
    			(select
    				l_partkey,
    				l_suppkey,
    				ps_availqty,
    				0.5 * sum(l_quantity) quantity
    			from
    				lineitem,v1
    			where
    				ps_partkey = l_partkey
    				and ps_suppkey = l_suppkey
    				and l_shipdate >= date '1995-01-01'
    				and l_shipdate < date '1995-01-01' + interval '1' year
    			group by
    				l_partkey,
    				l_suppkey,
    				ps_availqty
    			)
    		where
    			ps_availqty > quantity
    	) suppkey
    where s_suppkey=l_suppkey;

1. Data storage

There is no special requirement for tables involved in calculation, store them in order by primary key.

Continue to use lineitem.ctx from Q3, part.ctx, supplier.ctx, partsupp.ctx and nation.btx from Q2. Copy them to the main directory of this query.

2. General method

Calculation code:

A
1 =now()
2 1995-1-1
3 =elapse@y(A2,1)
4 >partname="bisque"
5 >nationname="CHINA"
6 =file("nation.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY
7 =file("part.ctx").open().cursor@m(P_PARTKEY;pos@h(P_NAME,partname)).fetch().keys@im(P_PARTKEY)
8 =file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A6).fetch().keys@im(S_SUPPKEY)
9 =file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A7.find(PS_PARTKEY),A8.find(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)
10 =file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A2 && L_SHIPDATE<A3)
11 =A10.join@i(L_PARTKEY:L_SUPPKEY,A9:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)
12 =A11.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)
13 =A12.select(PS_AVAILQTY*2>quantity).id(L_SUPPKEY)
14 =A8.join@im(S_SUPPKEY,A13:~)
15 =A14.new(S_NAME,S_ADDRESS).sort(S_NAME)
16 =interval@ms(A1,now())

A9 is equivalent to calculating the view v1 mentioned above.

A8 retrieves S_NAME and S_ADDRESS while retrieving S_SUPPKEY from the supplier table, which avoids reading them again in A14.

A13 is equivalent to calculating the subquery q_suppkey.

A12 uses the redundant grouping key introduced in previous articles, and PS_AVAILQTY is no longer used as a grouping field, which improves performance.

A13 changes sum(L_QUANTITY)*0.5 to PS_AVAILQTY*2. Both L_QUANTITY and PS_AVAILQTY are integers, which changes decimal multiplication into integer multiplication.

A14 uses join@im, where @i option deletes non-matching records and @m option enables using order-based merge to speed up computation when both S_SUPPKEY and A13 are ordered.

Test result:

Test items Execution time (seconds)
General method 15

3. Data conversion

For this query, we need to use two optimization methods mentioned in previous articles: dimension table primary key sequence-numberization and date-integer conversion. Copy part_2.ctx, supplier_2.ctx, partsupp_2.ctx and nation_2.btx from Q2 and lineitem_14_4.ctx from Q14 to the main directory of this query.

Calculation code:

A B
1 =now()
2 1995-1-1
3 =days@o(elapse@y(A2,1))
4 =days@o(A2)
5 >partname="bisque"
6 >nationname="CHINA"
7 =file("nation_2.btx").import@b().(N_NAME==nationname)
8 =file("part_2.ctx").open() =A8.cursor().skip().(false)
9 =A8.cursor@m(P_PARTKEY;pos@h(P_NAME,partname)).fetch().run(B8(P_PARTKEY)=true)
10 =file("supplier_2.ctx").open() =A10.cursor().skip().(null)
11 =A10.cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS;A7(S_NATIONKEY)).fetch() =A11.run(B10(S_SUPPKEY)=~)
12 =file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;B8(PS_PARTKEY) && B10(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)
13 =file("lineitem_14_4.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3)
14 =A13.join@i(L_PARTKEY:L_SUPPKEY,A12:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)
15 =A14.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)
16 =A15.select(PS_AVAILQTY*2>quantity).id(L_SUPPKEY)
17 =B10(A16)
18 =A17.new(S_NAME,S_ADDRESS).sort(S_NAME)
19 =interval@ms(A1,now())

A7, B8, B10 and B11 are all aligned sequences. The value of A10 is null or a supplier record, which enables the direct use of the record in A17, without having to read it again.

Test result:

Test items Execution time (seconds)
General method 15
Data conversion 10

4. Column-wise computing

Calculation code:

A B
1 =now()
2 1995-1-1
3 =days@o(elapse@y(A2,1))
4 =days@o(A2)
5 >partname="bisque"
6 >nationname="CHINA"
7 =file("nation_2.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY
8 =file("part_2.ctx").open() =A8.cursor@m().skip().(false)
9 =A8.cursor@mv(P_PARTKEY;pos@h(P_NAME,partname)).fetch().run(B8(P_PARTKEY)=true)
10 =file("supplier_2.ctx").open() =A10.cursor@m().skip().(null)
11 =A10.cursor@mv(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A7).fetch() =A11.run(B10(S_SUPPKEY)=~)
12 =file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;B8(PS_PARTKEY) && B10(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)
13 =file("lineitem_14_4.ctx").open().cursor@mv(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3)
14 =A13.join@i(L_PARTKEY:L_SUPPKEY,A12:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)
15 =A14.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)
16 =A15.select@v(PS_AVAILQTY*2>quantity).id(L_SUPPKEY)
17 =B10(A16)
18 =A17.new(S_NAME,S_ADDRESS).sort(S_NAME)
19 =interval@ms(A1,now())

Test result:

Test items Execution time (seconds)
General method 15
Data conversion 10
Column-wise computing 6
Clone this wiki locally