Add optional caching to AreaDefinition.get_area_slices #553

Merged (10 commits) on Nov 20, 2023

djhoese (Member) commented Nov 12, 2023

While profiling some Satpy computations (ABI full disk -> nearest neighbor resampling), I noticed that a decent amount of time at the beginning of processing was spent outside of dask computations, using a single core. After some print-statement debugging, I discovered that Satpy's reduce_data functionality and AreaDefinition.get_area_slices were taking the most time. The majority of that time is spent in the polygon intersection operation that determines whether, and where, the two areas intersect.

This PR adds a decorator and a couple of configuration settings for caching the results of AreaDefinition.get_area_slices in on-disk JSON files. In my test case, get_area_slices took about 10-12 seconds per area definition pair. I was using 2 resolutions of ABI data and one target area, so that was ~22 seconds in total. With this caching enabled, that time basically disappears.
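
To illustrate the general idea of caching a function's results in on-disk JSON files, here is a minimal sketch. It is not the decorator added by this PR: the names `cache_to_json` and `slow_slices`, the hash-based file naming, and the returned dictionary are all hypothetical, and it assumes the result is JSON-serializable.

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path


def cache_to_json(cache_dir, enabled=True):
    """Cache a function's JSON-serializable result, one file per argument set."""
    cache_dir = Path(cache_dir)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not enabled:
                return func(*args, **kwargs)
            # Hypothetical cache key: hash of the function name and argument reprs.
            key_src = repr((func.__name__, args, sorted(kwargs.items())))
            key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
            cache_file = cache_dir / f"{func.__name__}_{key}.json"
            if cache_file.is_file():
                # Cache hit: skip the expensive computation entirely.
                return json.loads(cache_file.read_text())
            result = func(*args, **kwargs)
            cache_dir.mkdir(parents=True, exist_ok=True)
            cache_file.write_text(json.dumps(result))
            return result

        return wrapper

    return decorator


calls = []  # records every time the "expensive" function body actually runs


@cache_to_json(tempfile.mkdtemp())
def slow_slices(src_name, dst_name):
    """Stand-in for an expensive slice computation (illustrative values only)."""
    calls.append((src_name, dst_name))
    return {"x": [0, 100], "y": [20, 80]}


first = slow_slices("abi_1km", "target")
second = slow_slices("abi_1km", "target")  # served from the JSON file
```

In this sketch the second call reads the stored file instead of recomputing, which is the effect described above for repeated area definition pairs.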

This PR is only a proof of concept at this point and I will continue to improve it. I just wanted to get the initial commits up on GitHub for others to see.

  • Closes #xxxx
  • Tests added
  • Tests passed
  • Passes git diff origin/main **/*py | flake8 --diff
  • Fully documented

djhoese added the enhancement and performance (improves speed or decreases memory consumption, but does not otherwise change functionality) labels Nov 12, 2023
djhoese self-assigned this Nov 12, 2023
djhoese changed the title from "Add cache directory and cache geometry slices configuration options" to "Add optional caching to AreaDefinition.get_area_slices" Nov 12, 2023
djhoese requested a review from mraspaud November 17, 2023 03:41
djhoese marked this pull request as ready for review November 17, 2023 03:41

codecov bot commented Nov 17, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (6a8afc0) 94.11% compared to head (b0a2579) 94.13%.

| Files | Patch % | Lines |
|---|---|---|
| pyresample/future/geometry/_subset.py | 90.00% | 8 Missing ⚠️ |
| pyresample/_caching.py | 96.10% | 3 Missing ⚠️ |

@@            Coverage Diff             @@
##             main     #553      +/-   ##
==========================================
+ Coverage   94.11%   94.13%   +0.02%     
==========================================
  Files          82       84       +2     
  Lines       13078    13188     +110     
==========================================
+ Hits        12308    12415     +107     
- Misses        770      773       +3     

| Flag | Coverage Δ |
|---|---|
| unittests | 94.13% <94.71%> (+0.02%) ⬆️ |

coveralls commented Nov 17, 2023

coverage: 93.72% (+0.03%) from 93.69%
when pulling b0a2579 on djhoese:cache-area-slices
into 6a8afc0 on pytroll:main.

mraspaud (Member) left a comment

Nice with the refactoring! Just a couple of comments, but looks good otherwise.

small for many cached results.

When setting this as an environment variable, this should be set with the
string equivalent of the Python boolean values ``="True"`` or ``="False"``.
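
As a sketch of why the string form of a Python boolean matters, the snippet below parses an environment variable with `ast.literal_eval`. This assumes the configuration layer evaluates environment values as Python literals; the variable name `PYRESAMPLE_CACHE_GEOMETRY_SLICES` and the helper `read_bool_env` are assumptions made for illustration, not names taken from this excerpt.

```python
import ast
import os

# Assumed environment variable name, for illustration only.
ENV_NAME = "PYRESAMPLE_CACHE_GEOMETRY_SLICES"


def read_bool_env(name, default=False):
    """Parse an environment variable holding "True" or "False" (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # literal_eval accepts Python literals only: "True" parses, "true" raises ValueError.
    value = ast.literal_eval(raw)
    if not isinstance(value, bool):
        raise ValueError(f"{name} must be 'True' or 'False', got {raw!r}")
    return value


os.environ[ENV_NAME] = "True"
enabled = read_bool_env(ENV_NAME)

os.environ[ENV_NAME] = "False"
disabled = read_bool_env(ENV_NAME)
```

Under this assumption, a lowercase value such as `"true"` is not a Python literal and would be rejected rather than silently treated as enabled.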

Member:

I think we need to add a sentence or two on what kind of improvement we can expect on performance and in which situation

Member Author:

Oh, your comment is reminding me: do you have a go-to benchmark for the gradient search that you can share with me, or run yourself against this PR with this caching enabled? I think the gradient search is the only other part of pyresample that uses the area slices directly and I don't want to make it unnecessarily slow. That said, if it caches the slices for an area -> area resampling (the only thing gradient search supports right now) then it'd probably make all future operations fast. Especially since iirc satpy doesn't do reduce_data for gradient search.

Member Author:

Ok I added more information, but I got a little wordy so let me know what you think.

Member:

@pnuu is using gradient search a lot, maybe he can help here?

Member:

Looks good with the explanation.

Member:

I don't have my usual test script at hand, but can test this on Monday. If I remember...

docs/source/howtos/configuration.rst (outdated, resolved)
pyresample/_caching.py (outdated, resolved)

pnuu (Member) commented Nov 20, 2023

Timings using Satpy main and this PR, using gradient_search to load, resample, and save 10 composites (some Day/Night, some normal) from FCI L1c data to a 3868 x 3918 EPSG:3035 area.

reduce_data=True:

  • no caching: 73.1 s (Dask graph A)
  • caching, first run: 72.3 s
  • caching, 2nd run: 57.7 s (Dask graph B)

As a comparison, with reduce_data=False it takes ~59.7 s (C) to run the same script.
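
The relative savings implied by these timings can be worked out with a short sketch. The numbers are the ones reported above; the dictionary labels and variable names are only illustrative.

```python
# Wall-clock timings in seconds, as reported in this comment.
timings = {
    "no caching": 73.1,
    "caching, first run": 72.3,
    "caching, 2nd run": 57.7,
    "reduce_data=False": 59.7,
}

baseline = timings["no caching"]

# For each configuration: (seconds saved, percent saved) relative to no caching.
savings = {
    label: (baseline - seconds, 100.0 * (baseline - seconds) / baseline)
    for label, seconds in timings.items()
}

for label, (saved_s, saved_pct) in savings.items():
    print(f"{label}: {saved_s:.1f} s saved ({saved_pct:.1f}%)")
```

Read this way, the second cached run is about a fifth faster than the uncached run with reduce_data=True, and slightly faster than running with reduce_data=False.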

Dask graphs:

  • A: [screenshot, Bokeh plot, 2023-11-20 08:36:33]
  • B: [screenshot, Bokeh plot, 2023-11-20 08:36:49]
  • C: [screenshot, Bokeh plot, 2023-11-20 08:37:04]

mraspaud (Member) commented

Thanks @pnuu, this looks good!
So I'm merging this.

mraspaud merged commit d8f45cd into pytroll:main Nov 20, 2023
21 checks passed
djhoese deleted the cache-area-slices branch November 20, 2023 15:25

djhoese (Member, Author) commented Nov 20, 2023

Thanks for testing @pnuu!
