Changes quicksort and quickselect to use template based sorting networks #61

sterrettm2 · 2023-08-16T19:40:24Z

This changes quicksort and quickselect to use generic template-based sorting networks instead of the current implementations. It also changes the 32-bit qsort and qselect to use a 256-element sorting network instead of a 128-element one. Here is a performance comparison for these changes:

Comparing avx512qsort (from ./builddir-main/benchexe) to avx512qsort (from ./builddir-generic_nets/benchexe)
Benchmark                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------
[avx512qsort vs. avx512qsort]/random_5k/uint64_t                   -0.0360         -0.0359         33438         32234         33444         32244
[avx512qsort vs. avx512qsort]/random_100k/uint64_t                 -0.0115         -0.0115        865275        855309        865193        855264
[avx512qsort vs. avx512qsort]/random_1m/uint64_t                   -0.0173         -0.0173      10741680      10555924      10741331      10555114
[avx512qsort vs. avx512qsort]/random_10m/uint64_t                  -0.0118         -0.0118     136559623     134951610     136546646     134936074
[avx512qsort vs. avx512qsort]/sorted_10k/uint64_t                  -0.0315         -0.0315         69900         67696         69903         67704
[avx512qsort vs. avx512qsort]/constant_10k/uint64_t                +0.0015         +0.0021          3690          3695          3692          3700
[avx512qsort vs. avx512qsort]/reverse_10k/uint64_t                 -0.0333         -0.0334         69063         66761         69075         66766
[avx512qsort vs. avx512qsort]/random_5k/int64_t                    -0.0338         -0.0337         33341         32214         33347         32224
[avx512qsort vs. avx512qsort]/random_100k/int64_t                  -0.0128         -0.0128        867071        855931        866977        855898
[avx512qsort vs. avx512qsort]/random_1m/int64_t                    -0.0103         -0.0104      10717308      10606445      10716676      10605129
[avx512qsort vs. avx512qsort]/random_10m/int64_t                   -0.0124         -0.0124     136539834     134844137     136526382     134831179
[avx512qsort vs. avx512qsort]/sorted_10k/int64_t                   -0.0317         -0.0316         69799         67589         69799         67596
[avx512qsort vs. avx512qsort]/constant_10k/int64_t                 +0.0025         +0.0025          3707          3716          3709          3718
[avx512qsort vs. avx512qsort]/reverse_10k/int64_t                  -0.0303         -0.0302         68871         66783         68873         66791
[avx512qsort vs. avx512qsort]/random_5k/uint32_t                   -0.0812         -0.0813         15791         14509         15776         14493
[avx512qsort vs. avx512qsort]/random_100k/uint32_t                 -0.1246         -0.1246        503111        440404        503087        440393
[avx512qsort vs. avx512qsort]/random_1m/uint32_t                   -0.0940         -0.0940       6041983       5474206       6041463       5473690
[avx512qsort vs. avx512qsort]/random_10m/uint32_t                  -0.0774         -0.0774      74048966      68317504      74044358      68311628
[avx512qsort vs. avx512qsort]/sorted_10k/uint32_t                  -0.0383         -0.0383         30645         29471         30653         29479
[avx512qsort vs. avx512qsort]/constant_10k/uint32_t                -0.0107         -0.0082          2451          2425          2448          2428
[avx512qsort vs. avx512qsort]/reverse_10k/uint32_t                 -0.0535         -0.0541         31913         30207         31924         30198
[avx512qsort vs. avx512qsort]/random_5k/int32_t                    -0.0865         -0.0855         15724         14364         15709         14366
[avx512qsort vs. avx512qsort]/random_100k/int32_t                  -0.1134         -0.1133        499691        443046        499676        443072
[avx512qsort vs. avx512qsort]/random_1m/int32_t                    -0.0993         -0.0993       6024619       5426142       6024042       5425605
[avx512qsort vs. avx512qsort]/random_10m/int32_t                   -0.0771         -0.0771      73932950      68230166      73924441      68223964
[avx512qsort vs. avx512qsort]/sorted_10k/int32_t                   -0.0335         -0.0328         30363         29346         30367         29372
[avx512qsort vs. avx512qsort]/constant_10k/int32_t                 +0.0043         +0.0069          2440          2451          2437          2453
[avx512qsort vs. avx512qsort]/reverse_10k/int32_t                  -0.0322         -0.0317         31479         30464         31484         30487
[avx512qsort vs. avx512qsort]/random_5k/uint16_t                   +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_100k/uint16_t                 +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_1m/uint16_t                   +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_10m/uint16_t                  +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/sorted_10k/uint16_t                  +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/constant_10k/uint16_t                +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/reverse_10k/uint16_t                 +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_5k/int16_t                    +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_100k/int16_t                  +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_1m/int16_t                    +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_10m/int16_t                   +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/sorted_10k/int16_t                   +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/constant_10k/int16_t                 +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/reverse_10k/int16_t                  +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_5k/float                      -0.1183         -0.1183         19512         17203         19498         17191
[avx512qsort vs. avx512qsort]/random_100k/float                    -0.1615         -0.1615        601574        504390        601528        504374
[avx512qsort vs. avx512qsort]/random_1m/float                      -0.1295         -0.1296       7125608       6202775       7124888       6201848
[avx512qsort vs. avx512qsort]/random_10m/float                     -0.1162         -0.1162      86397107      76361225      86391399      76354809
[avx512qsort vs. avx512qsort]/sorted_10k/float                     -0.0819         -0.0818         37950         34843         37957         34853
[avx512qsort vs. avx512qsort]/constant_10k/float                   +0.0039         +0.0046          3106          3118          3111          3125
[avx512qsort vs. avx512qsort]/reverse_10k/float                    -0.0920         -0.0919         39663         36013         39671         36024
[avx512qsort vs. avx512qsort]/random_5k/double                     +0.0193         +0.0195         25980         26481         25988         26494
[avx512qsort vs. avx512qsort]/random_100k/double                   +0.0269         +0.0271        733494        753256        733402        753283
[avx512qsort vs. avx512qsort]/random_1m/double                     +0.0127         +0.0126       9667365       9789783       9666505       9787822
[avx512qsort vs. avx512qsort]/random_10m/double                    +0.0041         +0.0041     127170883     127694855     127164371     127680360
[avx512qsort vs. avx512qsort]/sorted_10k/double                    +0.0202         +0.0203         55805         56932         55808         56941
[avx512qsort vs. avx512qsort]/constant_10k/double                  +0.0015         +0.0023          5146          5154          5148          5160
[avx512qsort vs. avx512qsort]/reverse_10k/double                   +0.0189         +0.0190         55185         56226         55188         56235
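
For readers new to the idea, here is a minimal scalar sketch of a template-based sorting network. It is not the code in this PR, which works on SIMD registers. The names `compare_exchange` and `network_sort` are hypothetical, and the network shown is a plain odd-even transposition network chosen for brevity. The point it illustrates is that the size is a template parameter, so the comparator sequence is fixed at compile time and does not depend on the data.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Compare-exchange: after the call, lo <= hi.
template <typename T>
inline void compare_exchange(T &lo, T &hi)
{
    if (hi < lo) { std::swap(lo, hi); }
}

// Hypothetical scalar sorting network (odd-even transposition) whose size N
// is known at compile time. Which pairs are compared depends only on N,
// never on the values being sorted.
template <typename T, std::size_t N>
void network_sort(std::array<T, N> &a)
{
    // N rounds of alternating even/odd neighbour comparisons sort N elements.
    for (std::size_t round = 0; round < N; ++round) {
        for (std::size_t i = round % 2; i + 1 < N; i += 2) {
            compare_exchange(a[i], a[i + 1]);
        }
    }
}
```

Because `N` is a template parameter, one definition can serve every element type and every network size, which is the kind of code sharing a generic network is after.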

r-devulap · 2023-08-17T21:31:37Z

We will want to use sort_n<vtype, 512> for 16-bit sorting. It improves performance by quite a bit:

Comparing avx512qsort.*random_.*int16_t (from ./builddir-main/benchexe) to avx512qsort.*random_.*int16_t (from ./builddir-PR_61/benchexe)
Benchmark                                                                           Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                +0.0205         +0.0199           357           365           360           367
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.3711         -0.3697           801           504           803           506
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.4204         -0.4185          1464           848          1465           852
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2388         -0.2376          2783          2118          2782          2121
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2608         -0.2604         17290         12781         17269         12772
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.3195         -0.3194        536458        365037        535734        364602
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2632         -0.2631       6326398       4661572       6316169       4654128
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.4735         -0.4737     104542766      55037495     104368373      54933646
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                +0.0156         +0.0158           357           362           360           365
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.3587         -0.3579           790           506           792           508
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.4097         -0.4085          1440           850          1442           853
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2353         -0.2331          2773          2121          2773          2127
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.1860         -0.1857         15881         12928         15866         12920
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.3037         -0.3035        539217        375481        538469        375041
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2882         -0.2883       6707845       4774356       6695880       4765406
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.4776         -0.4778     107951777      56389958     107773684      56281560

r-devulap

Looks awesome!

r-devulap · 2023-08-18T16:18:41Z

src/xss-network-qsort.hpp

+        int64_t num_to_write
+                = std::min((int64_t)std::max(0, N - i * vtype::numlanes),
+                           (int64_t)vtype::numlanes);
+        typename vtype::opmask_t load_mask = ((0x1ull << num_to_write) - 0x1ull)


Use vtype::get_partial_loadmask instead. See

x86-simd-sort/src/avx512-32bit-qsort.hpp

Line 259 in 0890de5

static opmask_t get_partial_loadmask(int size)

You also won't need num_to_write to be an int64_t.

You might need to update that function: it won't work when num_to_write == numlanes.
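
As a rough sketch of the full-width concern, the helper below builds a partial load mask that stays well defined when the requested size equals the lane count. `partial_loadmask_sketch` is a hypothetical name and a plain `uint64_t` stands in for `opmask_t`; this is not the library's `get_partial_loadmask`. It assumes the problem is the usual one in C++: shifting an integer by its full bit width (for example a 32-bit value by 32) is undefined behavior, so `(1 << size) - 1` cannot be relied on when `size` reaches that width.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns a mask with the low `size` bits set,
// clamped to the range [0, numlanes].
template <int numlanes>
inline uint64_t partial_loadmask_sketch(int64_t size)
{
    static_assert(numlanes > 0 && numlanes <= 64, "mask must fit in 64 bits");
    if (size <= 0) { return 0; }
    if (size >= numlanes) {
        // Full mask, built without ever shifting by the full width (64).
        return ~0ull >> (64 - numlanes);
    }
    // Here 0 < size < numlanes <= 64, so the shift amount is at most 63.
    return (1ull << size) - 1ull;
}
```

The clamping also covers what the `std::min`/`std::max` pair in the diff does, so the caller could pass `N - i * numlanes` directly under these assumptions.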

r-devulap · 2023-08-18T16:26:50Z

src/xss-network-qsort.hpp

+        int64_t num_to_write
+                = std::min((int64_t)std::max(0, N - i * vtype::numlanes),
+                           (int64_t)vtype::numlanes);
+        typename vtype::opmask_t load_mask = ((0x1ull << num_to_write) - 0x1ull)


Ditto. Use vtype::get_partial_loadmask.

r-devulap · 2023-08-18T17:34:12Z

src/avx512-common-qsort.h

@@ -738,4 +819,37 @@ inline void avx512_partial_qsort_fp16(uint16_t *arr,
    avx512_qselect_fp16(arr, k - 1, arrsize, hasnan);
    avx512_qsort_fp16(arr, k - 1);
 }
+
+template <typename vtype, typename type_t>
+X86_SIMD_SORT_INLINE type_t get_pivot_64bit(type_t *arr,


Let's move these template specializations to their respective files.

r-devulap · 2023-08-18T17:35:05Z

src/avx512-common-qsort.h

+                                            const int64_t right);
+
+template <typename vtype, typename type_t>
+X86_SIMD_SORT_INLINE type_t get_pivot(type_t *arr,


We probably don't need this once you move the specialized functions to their own files.

r-devulap

Minor changes, otherwise looks good.

r-devulap · 2023-09-07T04:45:01Z

numpy/numpy#24498 was merged. Could you rebase with main and update? Also, use X86_SIMD_SORT_UNROLL_LOOP instead of #pragma GCC unroll. See

x86-simd-sort/src/avx512-common-qsort.h

Line 98 in b9f9340

#define X86_SIMD_SORT_UNROLL_LOOP(num) PRAGMA(GCC unroll num)
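
For context, a macro like this can be built on the standard `_Pragma` operator so that the unroll hint disappears on compilers that do not understand it. The sketch below is an assumption about how such a wrapper might look, not the definition in the repository: `SKETCH_PRAGMA`, `SKETCH_UNROLL_LOOP`, and `sum_lanes` are hypothetical names, and the guard relies on `#pragma GCC unroll` being available from GCC 8 onward.

```cpp
#include <cassert>
#include <cstddef>

// Stringify the argument and hand it to the standard _Pragma operator.
#define SKETCH_PRAGMA(x) _Pragma(#x)

// Emit the GCC unroll hint only where it is known to be accepted; elsewhere
// the macro expands to nothing, so the loop still compiles unchanged.
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 8)
#define SKETCH_UNROLL_LOOP(num) SKETCH_PRAGMA(GCC unroll num)
#else
#define SKETCH_UNROLL_LOOP(num)
#endif

// The hint must sit directly in front of the loop it applies to.
inline long sum_lanes(const int *v, std::size_t n)
{
    long total = 0;
    SKETCH_UNROLL_LOOP(4)
    for (std::size_t i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}
```

Routing every unroll hint through one macro means a compiler-specific adjustment only has to be made in a single place.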

r-devulap

LGTM. One last fix and it will be good to go.

r-devulap

Awesome! Thanks @sterrettm2 :)

sterrettm2 force-pushed the generic_nets branch from 6830beb to 70424a6 Compare August 16, 2023 22:59

r-devulap reviewed Aug 18, 2023

r-devulap requested changes Aug 22, 2023

r-devulap mentioned this pull request Sep 5, 2023

MAINT: Re-write 16-bit qsort dispatch numpy/numpy#24498

Merged

sterrettm2 force-pushed the generic_nets branch 2 times, most recently from d5d5a90 to 4ff3e1d Compare September 7, 2023 17:01

sterrettm2 added 8 commits September 7, 2023 10:44

Changed quicksort and quickselect to use template based sorting networks

bf9b413

Templatizes qsort_ more to reduce code duplication

c869e20

Renamed zmm_t and ymm_t to be more generic; other smaller changes

8d1eb31

Removes xss namespace and changes to scalar pivot selection

b050ce5

Cleaned up pivot selection logic and changed to storing store/load masks; also a few smaller changes

6967e21

Fixed minor error

d9a8723

Changed pivot code back to previous logic for performance reasons

09fce7a

Changed to using new unroll macro

70733ff

sterrettm2 force-pushed the generic_nets branch from 4ff3e1d to 70733ff Compare September 7, 2023 18:00

r-devulap requested changes Sep 7, 2023

Small cleanup to pivot selection code

cea0a72

r-devulap approved these changes Sep 7, 2023

r-devulap merged commit 1b39637 into intel:main Sep 7, 2023
4 checks passed

zbjornson mentioned this pull request Sep 15, 2023

Recent changes don't compile with MSVC, cause warnings in GCC #75

Closed

r-devulap mentioned this pull request Sep 19, 2023

Bug fix in avx512_qselect_fp16 #80

Merged
