-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compare with built-in sorting implementation? #1
Comments
Thanks for your interest in the project, Zeyu. I haven't done a formal benchmark, but I can give you a brief overview from my memory of the anecdotal comparisons I've performed. I ran most experiments on the Intel nodes of QUT's supercomputer which sport Xeon Gold 6140 CPU (note these don't support the instructions required for the 16bit sorting algorithms provided by x86-simd-sort, and there is no fallback so check compatibility before you load the module). I've discussed the x86-simd-sort project with some folks on the HoJ Discord and I was told that Julia v1.9 has a lot of improvements to sorting which meant using AVX512 only produced about a 2x speedup. However, on my own (and non-exhaustively), I have compared with Julia v1.8.3, and found between 2-6x speedup (specifically for vectors of type Float32 and Float64) on the workloads needed for my research. Also be aware that when I tested Float16 using my university provided desktop (i7-11700K), the performance was more on par with the existing Julia sort. I believe that the original author of the upstream x86-simd-sort is working on improving the half precision performance, though. One task I do often is partially sorting each column in a matrix. I found the speedup depends on both the number of rows and the number of threads available (I would normally sort in parallel, but the relative speedup goes down as the number of threads used increases). As single precision is enough for my task I am quite satisfied to use Float32s and AVX512 to dramatically improve single-core performance (which helps boost my chances on the HPC scheduler!). I have planned to do some formal benchmarking, specifically including Julia v1.9, but that will have to wait for the time being as other priorities are abound. As for whether this could become the global implementation, I'm not sure. I would like to explore whether we could reproduce this algorithm in (almost) pure Julia using its SIMD capabilities and I also have a hunch that you could implement an efficient AVX512 based MergeSort algorithm but I don't have the time to find out right now. Whether this particular package could benefit more people (as it's designed to be loaded and redirect all sort calls to the accelerated methods), I would need some help building and packaging the jll module before that could become viable, right now I just |
That sounds amazing! Just the theoretical speedup of AVX512 over AVX2! Wow!
Even for small arrays? Like length <= 20 or 50? I can see the pointer of the array is passed to x86-simd-sort, but I am not sure whether the array would be copied and thus a non-neglectable overhead was introduced for such small arrays.
That will be more than awesome!
Yeah packaging the jll is a must to release such a package. It is also the reason that stops me to publish some wrapper packages for some great C/C++ library lol. |
I was testing using arrays of length 500. The x86-simd-sort algorithms will use bitonic mergesort when the region to be sorted is less than 128 elements, so such cases will need particular care in considering their performance. Its speedup otherwise comes from using the AVX512 vector operations for comparison while partitioning after selecting a pivot. The C++ functions mutate the data in place, but I'm not sure what Julia does when using |
Irrelevant but do you mind to bump the x86-simd-sort version to v2.0? I hope it would be a LTS : ) |
No worries, I've just pushed a few changes that includes updating to the current HEAD of the upstream repo. By the way, I pinged the developer that made significant contributions to sorting in Julia v1.9 to ask about her benchmarking methodology. I haven't had time to give it much attention but I'll let you know when there's a proper comparison. |
Thank you for the effort to integrate Intel's x86-simd-sort into Julia! I am wondering have you compared the performance with the default Julia implementation. Do you think it has the potential to be used to globally replace the Julia's sorting implementation in AVX512-enabled computers?
The text was updated successfully, but these errors were encountered: