[WIP] Area-weighted interpolation in Dask #180
Conversation
sweet, I'll take a look. A couple of quick reactions:

- it would be useful to see whether this can provide a boost to the existing functionality we have for vectorizing rasters
- (need to add dask/daskgpd to the ci environment files to get the tests to execute)
The notebook already has a bit on this. With the baseline data (in the small hundreds of geometries) it's clearly overkill but, as soon as you move to a few thousand, it seems worthwhile. The main catch with all this is that dask is just a way of making better use of the hardware you have. The example there runs on a beefy machine with 16 cores; if you have a machine with two cores you won't see much benefit other than out-of-core support (i.e., when your data don't fit in memory).
I'm not sure what you mean by "heavy"? I think they're mostly Python, but I'm not sure. Install is straightforward on conda/mamba. I've added them to the CI environment files.
I can certainly add another illustration of how this can be used, but I'm not sure it'd add much? This PR really doesn't add new functionality, it just scales up what's already there.
It’s slightly different. We could think of a way of vectorizing pixels and doing a spatial dissolve with dask. I don’t know if that’d be faster (it'd be at least parallel/out-of-core), but it’s definitely different code (though similar philosophy), so I'd be tempted to leave that for a different PR, and perhaps create an issue to remember this option in case we have bandwidth (or need) in the future to explore it.
UPDATE - I think I've fixed the bug where sometimes computation would bomb on the final step.
What I'd like to do is give users a sense for when they should reach for this versus the usual function. For that, it would be useful to have a more concrete example, maybe with an actual use case. I think it would be a real boon to folks if they understood what scale of dataset requires this function versus the normal one, and what actual questions you'd go about asking at that scale. The NYC taxi points demo Rocklin cooked up for the original dask-geopandas example was a nice illustration of this. I have no doubts this is a useful addition to the codebase, but I just want to make sure we fully communicate that benefit to the users.
the notebook mentions that the reported timings leave out the spatial shuffle, so it would be nice to see some examples that do include the time required for shuffling, and a sense for a real dataset that requires this kind of processing
yep, that was the question. Let's move these to function-level imports, like we treat numba in libpysal and bokeh in splot (or h3 here in tobler). We can leave them in the conda-forge recipe so folks have access by default, but move them out of module-level imports so they're not required.
What I mean is, one of the things the library struggles with is user documentation, and demonstrating to folks how to actually apply our tools. The current notebook is really focused on showing that the dask implementation matches the existing implementation, and that when you 40x a single dataset, the dask scaling kicks in as it should. That's a really nice resource for us developers, but it's still kinda opaque for users trying to figure out which tool to lift off the shelf. I can imagine lots of folks wondering if their data is "big enough" for this to matter, so in an ideal world, it would be nice to give them some orientation. Obviously we can't test datasets at a ton of different scales, but often a motivating use case is more insightful than an abstract one (and it sounded like maybe you had a land-use one on hand?). So that's what I was curious about. I definitely wouldn't say this is a blocker, and I'm happy to merge even if we don't build out more examples, but I wanted to raise the issue while there's some energy behind this.
cool, that makes sense. Was just curious if this might be useful as a drop-in scaler for some other parts of the package.
Too many threads here, all worth keeping track of:
update: @darribas and @knaaptime had a call on 8/15 to discuss expansion and integration of dask-based interpolation. Summarizing: we decided this PR likely has the logic/bones necessary to implement intensive/extensive interpolation (in addition to categorical), but initial tests don't pass. We considered refactoring slightly so that the distributed task works only on building the correspondence table (not the interpolation/multiplication through the table), but in that case the final interpolation/aggregation after coming back from the workers seems very expensive. We've decided to move forward with the option for categorical variables using dask, with the idea that intensive/extensive should be on the menu and may be a good task for someone to pick up in the near future.
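To make the distinction concrete, here is a minimal sketch of why the post-worker aggregation is cheap for categorical variables but needs a reweighting pass for intensive ones. The function and column names are made up for illustration; this is not the PR's actual code.

```python
import pandas as pd

def combine_categorical(chunks):
    # Each chunk reports, per target geometry, the intersection area
    # covered by each category. Areas are additive across chunks, so a
    # plain groupby-sum is the entire final step.
    return pd.concat(chunks).groupby("target_id").sum()

def combine_intensive(chunks):
    # Intensive variables need weights that sum to one per target, so each
    # chunk must ship back both the area-weighted sum of values and the
    # total intersection area; the ratio can only be taken after all
    # chunks for a target are collected, which is the expensive part.
    totals = pd.concat(chunks).groupby("target_id").sum()
    return totals["weighted_value"] / totals["intersection_area"]
```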
```python
import dask_geopandas
import numpy as np
from dask.base import tokenize
from dask.highlevelgraph import HighLevelGraph
```
we need to move these into a try/except inside the function (like with h3fy), otherwise they will force dask to be installed on import
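A minimal sketch of what that could look like (the signature and error message are illustrative, not the PR's actual code):

```python
def area_interpolate_dask(source_dgdf, target_dgdf, id_col="id", categorical_variables=None):
    # Lazy imports: `import tobler` keeps working without dask installed,
    # and the requirement only surfaces when this function is called.
    try:
        import dask_geopandas
        from dask.base import tokenize
        from dask.highlevelgraph import HighLevelGraph
    except ImportError as err:
        raise ImportError(
            "area_interpolate_dask requires `dask` and `dask_geopandas`. "
            "Install them with, e.g., "
            "`conda install -c conda-forge dask dask-geopandas`."
        ) from err
    ...
```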
(Levi also has a more clever way of doing this with decorators)
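Presumably something along these lines (a hypothetical sketch, not Levi's actual decorator):

```python
import importlib
from functools import wraps

def requires(*packages):
    # Hypothetical decorator: defer optional imports to call time and fail
    # with a clear message naming the missing dependency.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for name in packages:
                try:
                    importlib.import_module(name)
                except ImportError as err:
                    raise ImportError(
                        f"`{func.__name__}` requires the optional "
                        f"dependency `{name}`."
                    ) from err
            return func(*args, **kwargs)
        return wrapper
    return decorator

@requires("dask", "dask_geopandas")
def area_interpolate_dask(*args, **kwargs):
    ...
```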
```python
    dask_geopandas.from_geopandas(sac2, npartitions=2)
    .spatial_shuffle(by='hilbert', shuffle='tasks')
)
area = area_interpolate_dask.area_interpolate_dask(
```
ah. The issue with the tests is that this is a function, not a method. Do `area_interpolate_dask()`, not `area_interpolate_dask.area_interpolate_dask()`.
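For illustration, the fixed call would look something like this (the import path, variable names, and arguments here are my guesses, not taken from the PR):

```python
from tobler.area_weighted.area_interpolate_dask import area_interpolate_dask

area = area_interpolate_dask(
    dsac1,  # source dask-geopandas GeoDataFrame (made-up name)
    dsac2,  # target dask-geopandas GeoDataFrame (made-up name)
    id_col="id",
    categorical_variables=["land_use"],  # illustrative column list
)
```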
Thanks! I've changed it, and I'll see how the CI takes it!
Codecov Report
```
@@            Coverage Diff             @@
##             main     #180      +/-   ##
==========================================
+ Coverage   51.39%   54.52%   +3.12%
==========================================
  Files          14       15       +1
  Lines         786      840      +54
==========================================
+ Hits          404      458      +54
  Misses        382      382
```
Yahoo! tests passing!
This PR adds functionality to run area-weighted interpolation (AWI) through Dask. The code heavily leverages `dask-geopandas` (learning a lot from their distributed spatial join) and makes it possible to run AWI out-of-core (i.e., on larger-than-memory datasets), distributed (i.e., across different cores or machines), and in parallel (with a significantly more efficient algorithm than the current multi-core implementation). A notebook (for now in the repository root; if this gets merged, it should be moved to `notebooks`) is provided that demonstrates how the method works, its correctness, and its performance.

I'd also like to use this PR conversation to discuss whether you think this is the right place to contribute, and whether you'd modify anything about the API or the location of the code.
NOTE: currently, only interpolation of categorical variables is implemented. My preference, unless someone can provide an implementation for extensive/intensive variables quickly, would be that this does not block merging (and possibly cutting a small release). Categoricals are important, for example, to interpolate rasters (e.g., land use), and having the functionality out in the wild would help it get tested. I need to think about (and would welcome thoughts on!) how intensive/extensive should be done in this framework. I think it needs a second step of reweighting/interpolating when collecting chunks from the workers. For categorical, this step is simply adding up all the values for a given target geometry, but this is not the case for intensive/extensive.
Before merging, the following (at least) needs to be ticked off:

- [ ] Move the notebook to the `notebooks` folder

I'm happy to push through as the discussion evolves to add the remaining bits. I'd love to see this merged.