Compare with built-in sorting implementation? #1

mzy2240 · 2023-03-18T00:58:45Z

Thank you for the effort to integrate Intel's x86-simd-sort into Julia! I am wondering have you compared the performance with the default Julia implementation. Do you think it has the potential to be used to globally replace the Julia's sorting implementation in AVX512-enabled computers?

mosullivan93 · 2023-03-18T01:44:16Z

Thanks for your interest in the project, Zeyu.

I haven't done a formal benchmark, but I can give you a brief overview from my memory of the anecdotal comparisons I've performed. I ran most experiments on the Intel nodes of QUT's supercomputer which sport Xeon Gold 6140 CPU (note these don't support the instructions required for the 16bit sorting algorithms provided by x86-simd-sort, and there is no fallback so check compatibility before you load the module).

I've discussed the x86-simd-sort project with some folks on the HoJ Discord and I was told that Julia v1.9 has a lot of improvements to sorting which meant using AVX512 only produced about a 2x speedup. However, on my own (and non-exhaustively), I have compared with Julia v1.8.3, and found between 2-6x speedup (specifically for vectors of type Float32 and Float64) on the workloads needed for my research. Also be aware that when I tested Float16 using my university provided desktop (i7-11700K), the performance was more on par with the existing Julia sort. I believe that the original author of the upstream x86-simd-sort is working on improving the half precision performance, though.

One task I do often is partially sorting each column in a matrix. I found the speedup depends on both the number of rows and the number of threads available (I would normally sort in parallel, but the relative speedup goes down as the number of threads used increases). As single precision is enough for my task I am quite satisfied to use Float32s and AVX512 to dramatically improve single-core performance (which helps boost my chances on the HPC scheduler!).

I have planned to do some formal benchmarking, specifically including Julia v1.9, but that will have to wait for the time being as other priorities are abound. As for whether this could become the global implementation, I'm not sure. I would like to explore whether we could reproduce this algorithm in (almost) pure Julia using its SIMD capabilities and I also have a hunch that you could implement an efficient AVX512 based MergeSort algorithm but I don't have the time to find out right now. Whether this particular package could benefit more people (as it's designed to be loaded and redirect all sort calls to the accelerated methods), I would need some help building and packaging the jll module before that could become viable, right now I just ]dev ... the package and go make the library.

mzy2240 · 2023-03-18T02:40:24Z

I was told that Julia v1.9 has a lot of improvements to sorting which meant using AVX512 only produced about a 2x speedup

That sounds amazing! Just the theoretical speedup of AVX512 over AVX2! Wow!

I have compared with Julia v1.8.3, and found between 2-6x speedup (specifically for vectors of type Float32 and Float64) on the workloads needed for my research

Even for small arrays? Like length <= 20 or 50? I can see the pointer of the array is passed to x86-simd-sort, but I am not sure whether the array would be copied and thus a non-neglectable overhead was introduced for such small arrays.

I have planned to do some formal benchmarking, specifically including Julia v1.9

That will be more than awesome!

I would need some help building and packaging the jll module before that could become viable, right now I just ]dev ... the package and go make the library

Yeah packaging the jll is a must to release such a package. It is also the reason that stops me to publish some wrapper packages for some great C/C++ library lol.

mosullivan93 · 2023-03-18T12:23:10Z

Even for small arrays? Like length <= 20 or 50? I can see the pointer of the array is passed to x86-simd-sort, but I am not sure whether the array would be copied and thus a non-neglectable overhead was introduced for such small arrays.

I was testing using arrays of length 500. The x86-simd-sort algorithms will use bitonic mergesort when the region to be sorted is less than 128 elements, so such cases will need particular care in considering their performance. Its speedup otherwise comes from using the AVX512 vector operations for comparison while partitioning after selecting a pivot. The C++ functions mutate the data in place, but I'm not sure what Julia does when using ccall.

mzy2240 · 2023-06-24T01:44:27Z

Irrelevant but do you mind to bump the x86-simd-sort version to v2.0? I hope it would be a LTS : )

mosullivan93 · 2023-06-28T15:57:43Z

No worries, I've just pushed a few changes that includes updating to the current HEAD of the upstream repo.

By the way, I pinged the developer that made significant contributions to sorting in Julia v1.9 to ask about her benchmarking methodology. I haven't had time to give it much attention but I'll let you know when there's a proper comparison.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compare with built-in sorting implementation? #1

Compare with built-in sorting implementation? #1

mzy2240 commented Mar 18, 2023 •

edited

Loading

mosullivan93 commented Mar 18, 2023

mzy2240 commented Mar 18, 2023 •

edited

Loading

mosullivan93 commented Mar 18, 2023

mzy2240 commented Jun 24, 2023

mosullivan93 commented Jun 28, 2023

Compare with built-in sorting implementation? #1

Compare with built-in sorting implementation? #1

Comments

mzy2240 commented Mar 18, 2023 • edited Loading

mosullivan93 commented Mar 18, 2023

mzy2240 commented Mar 18, 2023 • edited Loading

mosullivan93 commented Mar 18, 2023

mzy2240 commented Jun 24, 2023

mosullivan93 commented Jun 28, 2023

mzy2240 commented Mar 18, 2023 •

edited

Loading

mzy2240 commented Mar 18, 2023 •

edited

Loading