Changes quicksort and quickselect to use template based sorting networks #61

Merged
merged 9 commits into intel:main from generic_nets
Sep 7, 2023

Conversation

sterrettm2
Contributor

This changes quicksort and quickselect to use generic, template-based sorting networks instead of the current implementations. It also changes the 32-bit qsort and qselect to use a 256-element sorting network instead of a 128-element one. Here is a performance comparison for these changes:

Comparing avx512qsort (from ./builddir-main/benchexe) to avx512qsort (from ./builddir-generic_nets/benchexe)
Benchmark                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------
[avx512qsort vs. avx512qsort]/random_5k/uint64_t                   -0.0360         -0.0359         33438         32234         33444         32244
[avx512qsort vs. avx512qsort]/random_100k/uint64_t                 -0.0115         -0.0115        865275        855309        865193        855264
[avx512qsort vs. avx512qsort]/random_1m/uint64_t                   -0.0173         -0.0173      10741680      10555924      10741331      10555114
[avx512qsort vs. avx512qsort]/random_10m/uint64_t                  -0.0118         -0.0118     136559623     134951610     136546646     134936074
[avx512qsort vs. avx512qsort]/sorted_10k/uint64_t                  -0.0315         -0.0315         69900         67696         69903         67704
[avx512qsort vs. avx512qsort]/constant_10k/uint64_t                +0.0015         +0.0021          3690          3695          3692          3700
[avx512qsort vs. avx512qsort]/reverse_10k/uint64_t                 -0.0333         -0.0334         69063         66761         69075         66766
[avx512qsort vs. avx512qsort]/random_5k/int64_t                    -0.0338         -0.0337         33341         32214         33347         32224
[avx512qsort vs. avx512qsort]/random_100k/int64_t                  -0.0128         -0.0128        867071        855931        866977        855898
[avx512qsort vs. avx512qsort]/random_1m/int64_t                    -0.0103         -0.0104      10717308      10606445      10716676      10605129
[avx512qsort vs. avx512qsort]/random_10m/int64_t                   -0.0124         -0.0124     136539834     134844137     136526382     134831179
[avx512qsort vs. avx512qsort]/sorted_10k/int64_t                   -0.0317         -0.0316         69799         67589         69799         67596
[avx512qsort vs. avx512qsort]/constant_10k/int64_t                 +0.0025         +0.0025          3707          3716          3709          3718
[avx512qsort vs. avx512qsort]/reverse_10k/int64_t                  -0.0303         -0.0302         68871         66783         68873         66791
[avx512qsort vs. avx512qsort]/random_5k/uint32_t                   -0.0812         -0.0813         15791         14509         15776         14493
[avx512qsort vs. avx512qsort]/random_100k/uint32_t                 -0.1246         -0.1246        503111        440404        503087        440393
[avx512qsort vs. avx512qsort]/random_1m/uint32_t                   -0.0940         -0.0940       6041983       5474206       6041463       5473690
[avx512qsort vs. avx512qsort]/random_10m/uint32_t                  -0.0774         -0.0774      74048966      68317504      74044358      68311628
[avx512qsort vs. avx512qsort]/sorted_10k/uint32_t                  -0.0383         -0.0383         30645         29471         30653         29479
[avx512qsort vs. avx512qsort]/constant_10k/uint32_t                -0.0107         -0.0082          2451          2425          2448          2428
[avx512qsort vs. avx512qsort]/reverse_10k/uint32_t                 -0.0535         -0.0541         31913         30207         31924         30198
[avx512qsort vs. avx512qsort]/random_5k/int32_t                    -0.0865         -0.0855         15724         14364         15709         14366
[avx512qsort vs. avx512qsort]/random_100k/int32_t                  -0.1134         -0.1133        499691        443046        499676        443072
[avx512qsort vs. avx512qsort]/random_1m/int32_t                    -0.0993         -0.0993       6024619       5426142       6024042       5425605
[avx512qsort vs. avx512qsort]/random_10m/int32_t                   -0.0771         -0.0771      73932950      68230166      73924441      68223964
[avx512qsort vs. avx512qsort]/sorted_10k/int32_t                   -0.0335         -0.0328         30363         29346         30367         29372
[avx512qsort vs. avx512qsort]/constant_10k/int32_t                 +0.0043         +0.0069          2440          2451          2437          2453
[avx512qsort vs. avx512qsort]/reverse_10k/int32_t                  -0.0322         -0.0317         31479         30464         31484         30487
[avx512qsort vs. avx512qsort]/random_5k/uint16_t                   +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_100k/uint16_t                 +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_1m/uint16_t                   +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_10m/uint16_t                  +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/sorted_10k/uint16_t                  +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/constant_10k/uint16_t                +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/reverse_10k/uint16_t                 +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_5k/int16_t                    +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_100k/int16_t                  +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_1m/int16_t                    +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_10m/int16_t                   +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/sorted_10k/int16_t                   +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/constant_10k/int16_t                 +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/reverse_10k/int16_t                  +0.0000         +0.0000             0             0             0             0
[avx512qsort vs. avx512qsort]/random_5k/float                      -0.1183         -0.1183         19512         17203         19498         17191
[avx512qsort vs. avx512qsort]/random_100k/float                    -0.1615         -0.1615        601574        504390        601528        504374
[avx512qsort vs. avx512qsort]/random_1m/float                      -0.1295         -0.1296       7125608       6202775       7124888       6201848
[avx512qsort vs. avx512qsort]/random_10m/float                     -0.1162         -0.1162      86397107      76361225      86391399      76354809
[avx512qsort vs. avx512qsort]/sorted_10k/float                     -0.0819         -0.0818         37950         34843         37957         34853
[avx512qsort vs. avx512qsort]/constant_10k/float                   +0.0039         +0.0046          3106          3118          3111          3125
[avx512qsort vs. avx512qsort]/reverse_10k/float                    -0.0920         -0.0919         39663         36013         39671         36024
[avx512qsort vs. avx512qsort]/random_5k/double                     +0.0193         +0.0195         25980         26481         25988         26494
[avx512qsort vs. avx512qsort]/random_100k/double                   +0.0269         +0.0271        733494        753256        733402        753283
[avx512qsort vs. avx512qsort]/random_1m/double                     +0.0127         +0.0126       9667365       9789783       9666505       9787822
[avx512qsort vs. avx512qsort]/random_10m/double                    +0.0041         +0.0041     127170883     127694855     127164371     127680360
[avx512qsort vs. avx512qsort]/sorted_10k/double                    +0.0202         +0.0203         55805         56932         55808         56941
[avx512qsort vs. avx512qsort]/constant_10k/double                  +0.0015         +0.0023          5146          5154          5148          5160
[avx512qsort vs. avx512qsort]/reverse_10k/double                   +0.0189         +0.0190         55185         56226         55188         56235
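
As a rough illustration of what "template-based sorting network" means in the description above, here is a minimal scalar sketch. It is not the PR's AVX-512 code: `bitonic_sort_n` and `bitonic_merge_n` are hypothetical names, the network shown is a plain bitonic sorter, and it assumes the element count is a power of two known at compile time.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Merge a bitonic run of N elements into sorted order (N is a power of two).
// The recursion is resolved at compile time, so the positions that get
// compared are fixed and do not depend on the data.
template <std::size_t N, bool Ascending, typename T>
void bitonic_merge_n(T *a)
{
    if constexpr (N > 1) {
        constexpr std::size_t half = N / 2;
        for (std::size_t i = 0; i < half; ++i) {
            // Compare-exchange the pair (i, i + half) in the requested direction.
            const bool out_of_order = Ascending ? (a[i + half] < a[i])
                                                : (a[i] < a[i + half]);
            if (out_of_order) std::swap(a[i], a[i + half]);
        }
        bitonic_merge_n<half, Ascending>(a);
        bitonic_merge_n<half, Ascending>(a + half);
    }
}

// Sort N elements: sort the halves in opposite directions to form a bitonic
// run, then merge. The network size N is a template parameter (hypothetical
// interface, chosen only for this sketch).
template <std::size_t N, bool Ascending = true, typename T>
void bitonic_sort_n(T *a)
{
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
    if constexpr (N > 1) {
        constexpr std::size_t half = N / 2;
        bitonic_sort_n<half, true>(a);
        bitonic_sort_n<half, false>(a + half);
        bitonic_merge_n<N, Ascending>(a);
    }
}

// Usage: sort a fixed-size array with an 8-element network.
inline std::array<int, 8> example_sorted()
{
    std::array<int, 8> v {5, 1, 7, 3, 8, 2, 6, 4};
    bitonic_sort_n<8>(v.data());
    assert(std::is_sorted(v.begin(), v.end()));
    return v;
}
```

Making the size a template parameter means one generic definition can be instantiated for different network sizes, rather than maintaining a hand-written network per size. A SIMD version would operate on whole vector registers instead of single elements.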

@r-devulap
Contributor

We will want to use sort_n<vtype, 512> for 16-bit sorting. It improves performance by quite a bit (up to about 48% faster on the random int16_t benchmarks below):

Comparing avx512qsort.*random_.*int16_t (from ./builddir-main/benchexe) to avx512qsort.*random_.*int16_t (from ./builddir-PR_61/benchexe)
Benchmark                                                                           Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                +0.0205         +0.0199           357           365           360           367
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.3711         -0.3697           801           504           803           506
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.4204         -0.4185          1464           848          1465           852
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2388         -0.2376          2783          2118          2782          2121
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2608         -0.2604         17290         12781         17269         12772
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.3195         -0.3194        536458        365037        535734        364602
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2632         -0.2631       6326398       4661572       6316169       4654128
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.4735         -0.4737     104542766      55037495     104368373      54933646
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                +0.0156         +0.0158           357           362           360           365
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.3587         -0.3579           790           506           792           508
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.4097         -0.4085          1440           850          1442           853
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2353         -0.2331          2773          2121          2773          2127
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.1860         -0.1857         15881         12928         15866         12920
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.3037         -0.3035        539217        375481        538469        375041
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.2882         -0.2883       6707845       4774356       6695880       4765406
[avx512qsort.*random_.*int16_t vs. avx512qsort.*random_.*int16_t]                -0.4776         -0.4778     107951777      56389958     107773684      56281560
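
For reading the comparison tables: the Time and CPU columns appear to be the relative change from old to new, so negative values mean the new build is faster. A small sketch of that arithmetic, assuming the formula (new - old) / |old| (the helper name `relative_change` is made up for this example, and the printed old/new values are rounded, so results only match to about three decimals):

```cpp
#include <cassert>
#include <cmath>

// Relative change from old_value to new_value; negative means "new is smaller".
inline double relative_change(double old_value, double new_value)
{
    return (new_value - old_value) / std::fabs(old_value);
}

inline void example_check()
{
    // Final row of the table above: Time Old = 107951777, Time New = 56389958,
    // reported Time change = -0.4776.
    const double change = relative_change(107951777.0, 56389958.0);
    assert(std::fabs(change - (-0.4776)) < 5e-4);
}
```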

@r-devulap r-devulap left a comment

Looks awesome!

src/xss-network-qsort.hpp
int64_t num_to_write
= std::min((int64_t)std::max(0, N - i * vtype::numlanes),
(int64_t)vtype::numlanes);
typename vtype::opmask_t load_mask = ((0x1ull << num_to_write) - 0x1ull)

Use vtype::get_partial_loadmask instead. See:

static opmask_t get_partial_loadmask(int size)

With that, num_to_write no longer needs to be an int64_t.

You might need to update that function: it won't work when num_to_write == numlanes.
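
A sketch of one way the full-width case could be handled. This assumes, without having the function body in front of us, that the problem is a shift by the full width of the mask type, which is undefined behavior in C++ (for example, a 32-bit value shifted by 32 when there are 32 lanes). `partial_mask32` and `lanes_to_load` are hypothetical names, not the library's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Mask with the low `count` bits set, valid for 0 <= count <= 32.
// The count == 32 case is handled separately because shifting a 32-bit
// value by 32 is undefined behavior.
inline std::uint32_t partial_mask32(int count)
{
    if (count <= 0) return 0u;
    if (count >= 32) return 0xFFFFFFFFu;
    return (std::uint32_t {1} << count) - 1u;
}

// Number of valid lanes in the i-th vector when n elements are split into
// vectors of `numlanes` lanes each (same clamp as the snippet under review,
// but in plain int).
inline int lanes_to_load(int n, int i, int numlanes)
{
    return std::min(std::max(0, n - i * numlanes), numlanes);
}

inline void example_masks()
{
    // 70 elements in 32-lane vectors: full, full, then 6 valid lanes.
    assert(partial_mask32(lanes_to_load(70, 0, 32)) == 0xFFFFFFFFu);
    assert(partial_mask32(lanes_to_load(70, 2, 32)) == 0x3Fu);
}
```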

int64_t num_to_write
= std::min((int64_t)std::max(0, N - i * vtype::numlanes),
(int64_t)vtype::numlanes);
typename vtype::opmask_t load_mask = ((0x1ull << num_to_write) - 0x1ull)

Ditto. Use vtype::get_partial_loadmask.

src/xss-network-qsort.hpp (outdated)
src/xss-network-qsort.hpp (outdated)
src/xss-network-qsort.hpp (outdated)
@@ -738,4 +819,37 @@ inline void avx512_partial_qsort_fp16(uint16_t *arr,
avx512_qselect_fp16(arr, k - 1, arrsize, hasnan);
avx512_qsort_fp16(arr, k - 1);
}

template <typename vtype, typename type_t>
X86_SIMD_SORT_INLINE type_t get_pivot_64bit(type_t *arr,

Let's move these template specializations to their respective files.

const int64_t right);

template <typename vtype, typename type_t>
X86_SIMD_SORT_INLINE type_t get_pivot(type_t *arr,

We probably don't need this once you move the specialized functions to their own files.

@r-devulap r-devulap left a comment

Minor changes, otherwise looks good.

src/avx512-16bit-qsort.hpp (outdated)
src/avx512-64bit-common.h (outdated)
src/avx512-common-qsort.h
src/avx512-common-qsort.h (outdated)
src/avx512-common-qsort.h (outdated)
src/xss-network-qsort.hpp (outdated)
src/xss-network-qsort.hpp (outdated)
src/xss-network-qsort.hpp (outdated)
@r-devulap

numpy/numpy#24498 was merged. Could you rebase onto main and update? Also, use X86_SIMD_SORT_UNROLL_LOOP instead of #pragma GCC unroll. See:

#define X86_SIMD_SORT_UNROLL_LOOP(num) PRAGMA(GCC unroll num)
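
A minimal sketch of how such a wrapper macro can work. The `SKETCH_` names, the definition of the pragma helper, and the compiler guard are all assumptions for illustration and are not taken from the library; only the `GCC unroll num` form comes from the line quoted above.

```cpp
#include <cassert>

// Assumed helper (not shown in the thread): stringify the tokens for _Pragma,
// since a #pragma directive cannot be produced by a macro directly.
#define SKETCH_PRAGMA(x) _Pragma(#x)

// Assumed guard: only emit the pragma on GCC 8 or newer, and expand to
// nothing elsewhere so other compilers do not warn about an unknown pragma.
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 8)
#define SKETCH_UNROLL_LOOP(num) SKETCH_PRAGMA(GCC unroll num)
#else
#define SKETCH_UNROLL_LOOP(num)
#endif

// The macro goes directly in front of the loop it applies to.
inline int sum8(const int *v)
{
    int total = 0;
    SKETCH_UNROLL_LOOP(8)
    for (int i = 0; i < 8; ++i) {
        total += v[i];
    }
    return total;
}

inline void example_sum()
{
    const int v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    assert(sum8(v) == 36);
}
```

Routing the pragma through a macro keeps the compiler-specific spelling in one place, so call sites stay the same if another compiler needs a different hint or none at all.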

@sterrettm2 sterrettm2 force-pushed the generic_nets branch 2 times, most recently from d5d5a90 to 4ff3e1d Compare September 7, 2023 17:01
@r-devulap r-devulap left a comment

LGTM. One last fix and it will be good to go.

src/avx512-common-qsort.h (outdated)
src/avx512-common-qsort.h (outdated)
src/avx512-common-qsort.h (outdated)
src/avx512-common-qsort.h (outdated)
@r-devulap r-devulap left a comment

Awesome! Thanks @sterrettm2 :)

@r-devulap r-devulap merged commit 1b39637 into intel:main Sep 7, 2023
4 checks passed