Speed with large arrays #7
Thank you for your feedback and measurements, and sorry for the late answer. Indeed, by its nature, the Fisher-Jenks classification algorithm is much more complex than the quantile and equal-interval classifications: it involves creating two large matrices of size (number of values × number of desired classes) as well as nested loops to compute the variance and determine the best class boundaries.
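To make that concrete, here is a minimal pure-Python sketch of the classic O(n² × k) dynamic program usually attributed to Fisher and Jenks. This is an illustration of the technique, not jenkspy's actual compiled implementation, and the function name is made up:

```python
def jenks_breaks_sketch(values, n_classes):
    """Illustrative Fisher-Jenks natural breaks (slow, pure Python)."""
    data = sorted(values)
    n = len(data)
    # The two large matrices mentioned above, sized (n + 1) x (n_classes + 1):
    # lower[l][j] holds the optimal lower limit of class j ending at data[l-1],
    # best[l][j] holds the minimal accumulated within-class variance so far.
    lower = [[0] * (n_classes + 1) for _ in range(n + 1)]
    best = [[float("inf")] * (n_classes + 1) for _ in range(n + 1)]
    for j in range(1, n_classes + 1):
        lower[1][j] = 1
        best[1][j] = 0.0
    for l in range(2, n + 1):
        s = s2 = w = 0.0
        variance = 0.0
        for m in range(1, l + 1):          # nested loops over all class extents
            lcl = l - m + 1                # candidate lower class limit
            val = data[lcl - 1]
            w += 1.0
            s += val
            s2 += val * val
            variance = s2 - s * s / w      # within-class sum of squared deviations
            if lcl != 1:
                for j in range(2, n_classes + 1):
                    cand = variance + best[lcl - 1][j - 1]
                    if best[l][j] >= cand:
                        lower[l][j] = lcl
                        best[l][j] = cand
        lower[l][1] = 1
        best[l][1] = variance
    # Walk back through `lower` to recover the break values.
    breaks = [0.0] * (n_classes + 1)
    breaks[0], breaks[n_classes] = data[0], data[-1]
    k = n
    for j in range(n_classes, 1, -1):
        breaks[j - 1] = data[lower[k][j] - 2]
        k = lower[k][j] - 1
    return breaks
```

The two matrices and the triple loop are what make the cost grow quadratically with the number of values, which is why sampling pays off so quickly.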
As a comparison, here are the timings (in seconds) that I obtained. Of course these timings have to be taken with a grain of salt (they depend on my machine, on the number of requested classes, etc.), but they give you an idea, and they seem to be consistent with yours. I am preparing a notebook comparing the performance and accuracy of various implementations; I will post the link here if you are interested. Your idea of sampling the data to be discretized is probably the right one if you want to use the Fisher-Jenks algorithm. This is, for example, what is done by software such as QGIS Desktop and the classInt R library, which both use (unless otherwise specified) a sample of 3000 values. With jenkspy, on a reasonably recent computer, you can easily use a sample of 5000 or 10000 values while keeping the execution time below one second.
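A minimal sketch of that sampling approach (the helper name, default sample size, and min/max handling here are assumptions for illustration, not part of jenkspy's API):

```python
import numpy as np
import jenkspy

def sampled_jenks_breaks(values, n_classes, sample_size=5000, seed=None):
    """Classify a random sample when the input exceeds sample_size."""
    values = np.asarray(values)
    if values.size <= sample_size:
        return jenkspy.jenks_breaks(values.tolist(), n_classes)
    rng = np.random.default_rng(seed)
    sample = rng.choice(values, size=sample_size, replace=False)
    # Include the true min and max so the outer bounds span the full range.
    sample = np.concatenate([sample, [values.min(), values.max()]])
    return jenkspy.jenks_breaks(sample.tolist(), n_classes)
```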
@mthh, thanks for the thorough answer. Yes, I'd be interested in seeing your comparison notebook once you've finished it. For my purposes, sampling the dataset gives me a good approximation of the classification I want.
I couldn't say for sure about ArcGIS, but I know that QGIS, for example, doesn't take the whole dataset but only a sample of the values when you ask it for a Jenks classification (you can see it in the code here: https://github.com/qgis/QGIS/blob/master/src/core/classification/qgsclassificationjenks.cpp#L80C8-L80C8). Regarding jenkspy, I think it's up to the user to implement this logic upstream if they wish.
While searching the Web, I found a mention that ArcGIS also seems to take a sample (of 10,000 values) for natural-breaks classification with Jenks: https://gis.stackexchange.com/a/84243.
Thanks for this package, @mthh. This is more of a question than an issue. I'm using your code in a raster GIS context, trying to get natural breaks for large rasters (> 1,000,000 cells). I've got equal-interval and quantile classification modes built in for comparison. I'm curious whether you've ever done speed comparisons on large datasets and whether you'd advocate sampling to reduce classification times.
Here's a sample of what I'm seeing:
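Roughly, a comparison along these lines (a sketch; the array size, distribution, and class count are illustrative assumptions, not the original code):

```python
import time
import numpy as np
import jenkspy

# Illustrative data; the real use case is a raster with > 1,000,000 cells.
data = np.random.default_rng(0).normal(100.0, 15.0, 50_000)
n_classes = 5

t0 = time.perf_counter()
equal_interval = np.linspace(data.min(), data.max(), n_classes + 1)
t1 = time.perf_counter()
quantile = np.percentile(data, np.linspace(0, 100, n_classes + 1))
t2 = time.perf_counter()
natural = jenkspy.jenks_breaks(data.tolist(), n_classes)
t3 = time.perf_counter()

print(f"equal interval: {t1 - t0:.4f} s")
print(f"quantile:       {t2 - t1:.4f} s")
print(f"jenks:          {t3 - t2:.1f} s")   # orders of magnitude slower
```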
When timing these, Jenks is slower than I would expect, but I admit I haven't studied the algorithm carefully enough to know whether this is expected.
Sampling the original array obviously helps speed, but the bin boundaries aren't exact:
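For instance, a hedged illustration (sizes and data assumed, continuing the sketch above):

```python
import numpy as np
import jenkspy

rng = np.random.default_rng(0)
data = rng.normal(100.0, 15.0, 50_000)

# Classify a 5,000-value sample instead of the full array.
sample = rng.choice(data, size=5_000, replace=False)

full_breaks = jenkspy.jenks_breaks(data.tolist(), 5)      # slow but exact
approx_breaks = jenkspy.jenks_breaks(sample.tolist(), 5)  # fast, approximate
print(full_breaks)
print(approx_breaks)  # close to, but not exactly, the full-array breaks
```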
Thanks for any pointers you may have on how to use jenkspy more efficiently.