Speed up the "slow" (not compiled with numba.jit) _fisher_jenks_means by 4 to 5 times. Saves 3 minutes classifying 8000 data points. #201
base: main
Conversation
…. Remove unused HAS_NUMPY check
This reverts commit 34c7a25.
thanks for this :) we'll take a look. (funny timing, actually, because i have some brand new materials that use jenks (for the first time in ages), albeit in a Python 3.12 environment where numba isn't yet available, and the slowdown is very noticeable on a medium-sized dataset)
You're welcome. I didn't realise there were more tests than I spotted at first glance (I've delved into those and fixed them, by the way). Commit 0b48854 makes self.bins a numpy array again, so this PR no longer changes the return type or breaks anyone's code.
@JamesParrott make sure the files you are pushing to this PR are formatted and linted properly. Currently, pre-commit is failing due to this.
As I said above, this PR is for info only. But sure, no problem. Will do once the currently running actions stop, the tests pass, and I have time. Is that why the actions runners are taking so long to complete, or is that just the Windows and MacOS ones?
exactly |
Hi everyone,
Thanks for a great project, and for all your work. I really like mapclassify. This PR is for information only, to highlight an interesting way to improve performance for users not using Numba. Using Numba should be preferred - the version with it is blazingly fast. Given how many other dependencies are already required, I don't know why Numba is optional, even if llvmlite is very big.
I would say Numba should be included by default.
Otherwise, if there's a willingness to relax and generalise the input types and the instance variables created by MapClassifier (to lists of numbers, instead of only numpy arrays) - technically a breaking change - then this PR is a good UX improvement for casual mapclassify users. I don't know if something else that takes even longer sits on the critical path for them instead, whether typical users of the default mapclassify without Numba have much more powerful machines than me, or whether they all classify much, much larger data sets than I've tested. But I myself would always prefer a library that gets the answer 3 minutes faster.
From running `mapclassify\tests\time_fisher_jenkss.py`.

Background:
I've made a fork to target IronPython in Grasshopper, in which Numpy and Numba are not directly available. This got me wondering how much performance its users would be missing out on. Compared to Numba users, they might have to wait 30 seconds more for 3000 data points. But compared to non-Numba users, the answer, to my surprise, is: not a lot. In fact, compared to your own users who haven't installed Numba, my fork that doesn't use Numpy at all is significantly faster than the slow version of `_fisher_jenks_means` without Numba (the one someone saw fit to raise a warning about in FisherJenks).
Basically, creating and iterating over numpy arrays is not free - there's real overhead. That overhead is well worth it for large matrix multiplications and other vectorisable calculations in scientific computing on modern hardware. But on my laptop at least, running Python 3.11, for an implementation of an algorithm like the original Fisher-Jenks, which must iterate over every element in sequence anyway, numpy arrays are slower than plain Python iteration over nested lists.
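A minimal timing sketch of the claim above - not the mapclassify benchmark itself, just an illustrative element-wise loop (the size `n` and repeat count are arbitrary assumptions); actual numbers will vary by machine and Python version:

```python
import timeit

import numpy as np

n = 200
# The same n x n matrix as a nested Python list and as a numpy array.
nested_list = [[float(i * n + j) for j in range(n)] for i in range(n)]
array_2d = np.array(nested_list)


def sum_list(m):
    # Plain Python iteration over nested lists.
    total = 0.0
    for row in m:
        for x in row:
            total += x
    return total


def sum_array(a):
    # Element-wise numpy indexing in a Python loop (no vectorisation).
    total = 0.0
    rows, cols = a.shape
    for i in range(rows):
        for j in range(cols):
            total += a[i, j]
    return total


t_list = timeit.timeit(lambda: sum_list(nested_list), number=5)
t_array = timeit.timeit(lambda: sum_array(array_2d), number=5)
print(f"nested list: {t_list:.4f}s, numpy element access: {t_array:.4f}s")
```

The point is that per-element array access pays the boxing/unboxing overhead on every lookup, which vectorised numpy operations amortise but a sequential dynamic-programming loop cannot.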
I don't know exactly how Numpy implements indexing in C, but when I tried to implement `__getitem__` on tuples in pure Python, there was an even larger performance hit. So the only changes I've made in my alternative function are to switch `[i, j]` to the admittedly uglier `[i][j]`, to generalise a numpy-specific type reference, and to pass in a class for numpy (`np`) that defines `int`, `float`, etc. Hopefully the original code in `_fisher_jenks_means` is otherwise unaltered.

Fixes #118 if Numba is unavailable
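A tiny sketch of the indexing change described above, using hypothetical helpers (not the actual mapclassify code): chained `[i][j]` indexing works on both numpy arrays and plain nested lists, whereas numpy's tuple indexing `[i, j]` raises a `TypeError` on a list:

```python
import numpy as np


def get_numpy_style(mat, i, j):
    # Tuple indexing: only valid for numpy arrays.
    return mat[i, j]


def get_generic(mat, i, j):
    # Chained indexing: valid for numpy arrays AND nested lists.
    return mat[i][j]


nested = [[1.0, 2.0], [3.0, 4.0]]
arr = np.array(nested)

print(get_generic(nested, 1, 0))  # works on a nested list
print(get_generic(arr, 1, 0))     # also works on an array
# get_numpy_style(nested, 1, 0) would raise TypeError on a list
```

This is why swapping the subscript style is enough to let the same algorithm body run unchanged over either container type.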