Add optional caching to AreaDefinition.get_area_slices #553
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main     #553      +/-   ##
==========================================
+ Coverage   94.11%   94.13%   +0.02%
==========================================
  Files          82       84       +2
  Lines       13078    13188     +110
==========================================
+ Hits        12308    12415     +107
- Misses        770      773       +3
Nice refactoring! Just a couple of comments, but it looks good otherwise.
> small for many cached results.
>
> When setting this as an environment variable, this should be set with the
> string equivalent of the Python boolean values ``="True"`` or ``="False"``.
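As an illustration of that note, here is a minimal sketch of enabling the option both ways; the setting name `cache_geometry_slices` and the `PYRESAMPLE_CACHE_GEOMETRY_SLICES` variable are assumptions for this sketch, not names confirmed in this hunk:

```python
# Sketch only: the setting/env-var names below are assumptions based on
# this PR's discussion, not confirmed pyresample names.
import os

# As an environment variable, the boolean must be the *string* "True"/"False":
os.environ["PYRESAMPLE_CACHE_GEOMETRY_SLICES"] = "True"

# Or from Python, via pyresample's donfig-based configuration object:
import pyresample

with pyresample.config.set(cache_geometry_slices=True):
    ...  # calls to AreaDefinition.get_area_slices would be cached here
```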
I think we need to add a sentence or two on what kind of performance improvement we can expect and in which situations.
Oh, your comment reminds me: do you have a go-to benchmark for gradient search that you can point me to, or run yourself with this PR and this caching enabled? I think gradient search is the only other part of pyresample that uses the area slices directly, and I don't want to make it unnecessarily slow. That said, if it caches the slices for an area -> area resampling (the only thing gradient search supports right now), then it'd probably make all future operations fast, especially since, iirc, Satpy doesn't do reduce_data for gradient search.
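For anyone wanting to measure this, a hypothetical micro-benchmark of just the slicing step discussed here; the areas below are placeholders, not an actual test case from this thread:

```python
# Hypothetical micro-benchmark of the slicing step (not a full gradient
# search run); the two areas are arbitrary placeholder definitions.
import time

from pyresample import create_area_def

# A large source grid and a smaller, partially overlapping target grid.
src = create_area_def("src", "EPSG:4326", shape=(2000, 4000),
                      area_extent=(-180, -90, 180, 90))
dst = create_area_def("dst", "EPSG:3857", shape=(512, 512),
                      area_extent=(-2e6, -2e6, 2e6, 2e6))

start = time.perf_counter()
x_slice, y_slice = src.get_area_slices(dst)
elapsed = time.perf_counter() - start
print(f"get_area_slices: {elapsed:.3f}s -> x={x_slice}, y={y_slice}")
```

Running it twice (with and without the caching enabled) should show whether the second call drops to near zero.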
OK, I added more information, but I got a little wordy, so let me know what you think.
@pnuu is using gradient search a lot; maybe he can help here?
Looks good with the explanation.
I don't have my usual test script at hand, but I can test this on Monday. If I remember...
Timings using Satpy: [timing results attached as images]
As a comparison, with Dask graphs: [graph images attached]
Thanks @pnuu, this looks good!
Thanks for testing, @pnuu!
While profiling some Satpy computations (ABI full disk -> nearest neighbor resampling), I noticed that a decent amount of time at the beginning of processing was spent outside of dask computations and was using a single core. After some print-statement debugging, I discovered it was Satpy's `reduce_data` functionality and the `AreaDefinition.get_area_slices` method that were taking the most time. The majority of that time is spent in the polygon intersection operation that determines whether, and where, the two areas intersect.

This PR adds a decorator and a couple of configuration settings for caching the results of `AreaDefinition.get_area_slices` to on-disk JSON files. In my test case, `get_area_slices` took roughly 10-12 seconds per area definition pair. I was using two resolutions of ABI data and one target area, so that was ~22 seconds total. With this caching enabled, that time basically disappears.

This PR is only a proof of concept at this point and I will continue to improve it. I just wanted to get the initial commits up on GitHub for others to see.
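For readers unfamiliar with the pattern, here is a simplified sketch of the on-disk JSON caching idea described above. It is not the actual decorator added by this PR; the function name, cache directory, and key scheme are illustrative assumptions:

```python
# Simplified sketch of caching function results to on-disk JSON files;
# NOT the decorator from this PR. Names and paths are illustrative.
import hashlib
import json
import os
from functools import wraps


def cache_slices_to_json(cache_dir="~/.cache/pyresample"):
    """Cache a function's result on disk as JSON, keyed on its arguments.

    Assumes the result is JSON-serializable; real slice objects would
    first need converting to (start, stop) pairs and back.
    """
    cache_dir = os.path.expanduser(cache_dir)

    def decorator(func):
        @wraps(func)
        def wrapper(*args):
            # Hash the repr of the arguments (e.g. the two area
            # definitions) into a stable cache filename.
            key = hashlib.sha1(repr(args).encode("utf-8")).hexdigest()
            path = os.path.join(cache_dir, f"{func.__name__}_{key}.json")
            if os.path.isfile(path):
                with open(path) as cache_file:
                    return json.load(cache_file)
            result = func(*args)
            os.makedirs(cache_dir, exist_ok=True)
            with open(path, "w") as cache_file:
                json.dump(result, cache_file)
            return result
        return wrapper
    return decorator
```

JSON is a sensible fit here since slice bounds are just small integers, and the files stay human-inspectable and tiny even for many cached results.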
`git diff origin/main **/*py | flake8 --diff`