Cache NUMBA kernels between CI runs #279

Merged 50 commits on Jan 31, 2024

Conversation

sjperkins
Member

Closes #278

@JSKenyon
Collaborator

This looks awesome! I will probably move this to the dev branch before merging.

@sjperkins
Member Author

> This looks awesome! I will probably move this to the dev branch before merging.

Cool. Need to prod it a bit to see if it works.

    - name: Cache Numba Kernels
      uses: actions/cache@v3
      with:
        key: numba-cache-${{ steps.numba-cache-key.outputs.date }}
Member Author

Constructing the key out of the date may be overkill. I suspect we could just use numba-cache and it would propagate and be updated between runs.

I guess the downside is that it might accumulate a bunch of crufty old kernels. AFAICT there's a 10 GB cache limit per repo, and cache entries that go unused for a week are evicted, so it may not be a big deal.

@bennahugo suggested we add the numba version to the cache key. I wonder if numba is clever enough to trigger recompiles on new numba versions.

The Python version may also be relevant, given that a codex `__pycache__` directory looks as follows:

__init__.cpython-310.pyc
__init__.cpython-39.pyc
bda_avg.cpython-310.pyc
bda_avg.cpython-39.pyc
bda_avg.row_average-23.py310.1.nbc
bda_avg.row_average-23.py310.2.nbc
bda_avg.row_average-23.py310.nbi
bda_avg.row_average-23.py39.1.nbc
bda_avg.row_average-23.py39.2.nbc
bda_avg.row_average-23.py39.nbi
bda_avg.row_chan_average-313.py310.1.nbc
bda_avg.row_chan_average-313.py310.2.nbc
bda_avg.row_chan_average-313.py310.3.nbc
bda_avg.row_chan_average-313.py310.4.nbc
bda_avg.row_chan_average-313.py310.5.nbc
bda_avg.row_chan_average-313.py310.nbi
bda_avg.row_chan_average-313.py39.1.nbc
bda_avg.row_chan_average-313.py39.2.nbc
bda_avg.row_chan_average-313.py39.3.nbc
bda_avg.row_chan_average-313.py39.4.nbc
bda_avg.row_chan_average-313.py39.nbi
bda_mapping.bda_mapper-341.py310.1.nbc
bda_mapping.bda_mapper-341.py310.nbi
bda_mapping.bda_mapper-341.py39.1.nbc
bda_mapping.bda_mapper-341.py39.2.nbc
bda_mapping.bda_mapper-341.py39.nbi
bda_mapping.cpython-310.pyc
bda_mapping.cpython-39.pyc
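
A cache key that folds in both the Python and Numba versions, as discussed above, could be assembled along these lines. This is only a minimal sketch and not what the workflow in this PR does: `python_tag` and `build_cache_key` are hypothetical helper names, and the Numba version string in the example is a placeholder.

```python
import sys


def python_tag(version_info=sys.version_info):
    """Return a tag such as 'py310', like the tag seen in the cache file names."""
    return f"py{version_info[0]}{version_info[1]}"


def build_cache_key(numba_version, py_tag, date=None, prefix="numba-cache"):
    """Join the parts that should invalidate the cache into one key string."""
    parts = [prefix, py_tag, f"numba{numba_version}"]
    if date is not None:
        # An optional date suffix makes the key unique per run or per day.
        parts.append(date)
    return "-".join(parts)


if __name__ == "__main__":
    # In CI the Numba version could come from importlib.metadata.version("numba");
    # a placeholder is used here so the sketch runs without Numba installed.
    print(build_cache_key("0.58.1", python_tag()))
```

Keeping the Python tag in the key means the `py39` and `py310` kernels listed above would end up in separate cache entries rather than sharing one.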

@sjperkins
Member Author

Hmmm, I was fairly sure I mkdir'd that

@sjperkins
Member Author

So the kernel caching does not seem to be improving the test suite run time, even though kernel caches are created: https://github.com/ratt-ru/QuartiCal/actions/caches. This would also seem to suggest NUMBA_CACHE_DIR is respected.

@JSKenyon
Collaborator

> So the kernel caching does not seem to be improving the test suite run time, even though kernel caches are created: https://github.com/ratt-ru/QuartiCal/actions/caches. This would also seem to suggest NUMBA_CACHE_DIR is respected.

Is respected, or isn't? Might need to rerun the tests a few times - I have muddied the waters by merging in main. I do think that there is probably something which can be done - will take a closer look at the end of the week.

@sjperkins
Member Author

sjperkins commented Jun 20, 2023

> > So the kernel caching does not seem to be improving the test suite run time, even though kernel caches are created: https://github.com/ratt-ru/QuartiCal/actions/caches. This would also seem to suggest NUMBA_CACHE_DIR is respected.
>
> Is respected, or isn't?

I think it is respected -- The caches are about 11MB.
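
One way to double-check that kernels really land in `NUMBA_CACHE_DIR` is to count the index (`.nbi`) and data (`.nbc`) files under it and add up their sizes. The helper below is a rough sketch using only the standard library; `summarise_numba_cache` is a hypothetical name and is not part of Numba or of this PR.

```python
import os
from pathlib import Path


def summarise_numba_cache(cache_dir):
    """Count Numba index (.nbi) and data (.nbc) files and sum their sizes in bytes."""
    counts = {".nbi": 0, ".nbc": 0}
    total_bytes = 0
    for path in Path(cache_dir).rglob("*"):
        if path.is_file() and path.suffix in counts:
            counts[path.suffix] += 1
            total_bytes += path.stat().st_size
    return counts, total_bytes


if __name__ == "__main__":
    cache_dir = os.environ.get("NUMBA_CACHE_DIR")
    if cache_dir is None:
        print("NUMBA_CACHE_DIR is not set; nothing to inspect")
    else:
        counts, total_bytes = summarise_numba_cache(cache_dir)
        print(f"{counts['.nbi']} index files, {counts['.nbc']} data files, "
              f"{total_bytes / 1e6:.1f} MB in {cache_dir}")
```

Running this as a CI step after the test suite would show whether the directory is populated, independently of what the cache action reports.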

Another thought occurred: the cached kernel modification times are probably earlier than those of the checked-out Python code -- this might trigger recompilation: https://numba.readthedocs.io/en/stable/developer/caching.html

> The cache is invalidated when the corresponding source file is modified.

Edit: Referenced the main article on caching, rather than the cuda article.
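
The invalidation rule quoted above can be illustrated without Numba. The sketch below assumes, as a simplification, that a cache records a stamp made of the source file's modification time and size and treats its entries as stale once that stamp changes. `source_stamp` and `is_stale` are hypothetical names, not Numba API. A fresh checkout in CI gives every file a new modification time, which is what the second check simulates.

```python
import os
import tempfile


def source_stamp(path):
    """Return (modification time in ns, size in bytes) for a source file."""
    st = os.stat(path)
    return st.st_mtime_ns, st.st_size


def is_stale(recorded_stamp, path):
    """Treat a cached entry as stale if the source stamp no longer matches."""
    return recorded_stamp != source_stamp(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "kernels.py")
        with open(src, "w") as f:
            f.write("def f(x):\n    return x + 1\n")

        recorded = source_stamp(src)
        print("unchanged file stale?", is_stale(recorded, src))

        # Simulate a fresh checkout: identical contents, newer modification time.
        later = recorded[0] + 60 * 10**9
        os.utime(src, ns=(later, later))
        print("after checkout stale?", is_stale(recorded, src))

        # Restoring the original modification time makes the stamps match again.
        os.utime(src, ns=(recorded[0], recorded[0]))
        print("after restoring mtime stale?", is_stale(recorded, src))
```

Under this assumption, identical source contents are not enough for a cache hit; the modification time has to match as well.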

@sjperkins
Member Author

> Another thought occurred: the cached kernel modification times are probably earlier than those of the checked-out Python code -- this might trigger recompilation: https://numba.readthedocs.io/en/stable/developer/caching.html
>
> > The cache is invalidated when the corresponding source file is modified.

Unfortunately it looks like it is the case that the timestamp is only the input to the cache key (at least as of Aug 22): https://numba.discourse.group/t/cache-behaviour/1520

> I would like to propose to move away from invalidating the cache index based on the timestamp of the file, and use only the code+closure signature of the function itself. Would anyone see a problem with that? When I say code+closure signature I mean the exact same information that is used to select the overload from within the cache.

So this approach doesn't seem viable.

@JSKenyon
Collaborator

> So this approach doesn't seem viable.

Ah, unfortunate. Perhaps there will be progress upstream at some point.

JSKenyon and others added 10 commits January 26, 2024 10:58
… required. (#285)

* Fix version drift.

* Bump to 0.2.0

* Use nearest-neighbour interpolation for points requiring extrapolation.
* Fix version drift.

* Bump to 0.2.0

* Inspect envvar for scheduler address when one isn't specified.

* Encode environment variable as ascii.

* Simplify.
* Fix version drift.

* Bump to 0.2.0

* Initial commit of basic plotting functionality.

* Change naming convention.

* Improve transform argument.

* Simplify transform selection.

* Add rudimentary time and frequency selection.

* Checkpoint plotter changes. Can now handle scans and spws, but is very slow.

* More work on plotter - can now plot datasets in parallel.

* Some tidying.

* Slightly improve plot speed. Dominant cost is still saving the figures.

* Commit some minor changes which speed up figure saving.

* Lots of tiny fixes.

* Tiny cosmetic changes.

* Add custom tick formatter so that plots are the same size regardless.

* Add matplotlib dependency.

* Rework construction of plotting dictionary. Add a few utility functions which will likely be useful in other places in QC.

* Rename variable to avoid confusion.

* Fix bug affecting recursive grouping.

* Avoid copies in grouping code.

* Checkpoint work on extending functionality.

* Make plotter more powerful. Add colourization option. Begin simplifying interface.

* Allow user specification of colourmap.

* Add plotsize parameter.
* Fix version drift.

* Bump to 0.2.0

* Fix #293.
* Fix version drift.

* Bump to 0.2.0

* Add optional label and single field selection to backup app

* remove item instead of pop@index

* do not .remove() from xds_list

* Simplify using some existing functionality.

---------

Co-authored-by: JSKenyon <[email protected]>
Co-authored-by: landmanbester <[email protected]>
* Fix version drift.

* Bump to 0.2.0

* Setting MAD threshold to zero will disable flagging on a given statistic.
* Fix version drift.

* Bump to 0.2.0

* Disable flagging based on off-diagonal correlations in the mad flagger by default. This should make the mad flagger less aggressive on data with unmodelled polarised emission.
* Fix version drift.

* Bump to 0.2.0

* Fix a bug affecting the use of non-standard columns in data column input.
* assign to ms to avoid over-writing metadata in restore app

* zip datasets in enumerate

* add comment to document failure case

* use backup_column_name in restore app

* Apply OCD.

---------

Co-authored-by: landmanbester <[email protected]>
Co-authored-by: JSKenyon <[email protected]>
* Fix version drift.

* Bump to 0.2.0

* Make summary correctly report FIELD_ID and SOURCE_ID.
@JSKenyon JSKenyon changed the base branch from main to v0.2.1-dev January 29, 2024 12:24
Base automatically changed from v0.2.1-dev to main January 30, 2024 06:46
@JSKenyon JSKenyon changed the base branch from main to v0.2.2-dev January 31, 2024 09:24
@JSKenyon JSKenyon merged commit 572cdfb into v0.2.2-dev Jan 31, 2024
6 checks passed
@JSKenyon JSKenyon deleted the cache-numba-kernels branch January 31, 2024 11:51
JSKenyon added a commit that referenced this pull request Feb 2, 2024
* Cache NUMBA kernels between CI runs (#279)

* Cache NUMBA kernels between CI runs

* Use actions/cache@v3

* Cache per python version

* runner.tmp -> runner.temp

* Debugging

* Fix

* Run entire test suite

* timestamp needed otherwise cache hit occurs and cache not updated

* Fix output

* Add revert_me.txt

* Use nearest-neighbour interpolation in regions where extrapolation is required. (#285)

* Fix version drift.

* Bump to 0.2.0

* Use nearest-neighbour interpolation for points requiring extrapolation.

* Utilise environment variable when dask.address is unset. (#288)

* Fix version drift.

* Bump to 0.2.0

* Inspect envvar for scheduler address when one isn't specified.

* Encode environment variable as ascii.

* Simplify.

* Add plotting functionality (#290)

* Fix version drift.

* Bump to 0.2.0

* Initial commit of basic plotting functionality.

* Change naming convention.

* Improve transform argument.

* Simplify transform selection.

* Add rudimentary time and frequency selection.

* Checkpoint plotter changes. Can now handle scans and spws, but is very slow.

* More work on plotter - can now plot datasets in parallel.

* Some tidying.

* Slightly improve plot speed. Dominant cost is still saving the figures.

* Commit some minor changes which speed up figure saving.

* Lots of tiny fixes.

* Tiny cosmetic changes.

* Add custom tick formatter so that plots are the same size regardless.

* Add matplotlib dependency.

* Rework construction of plotting dictionary. Add a few utility functions which will likely be useful in other places in QC.

* Rename variable to avoid confusion.

* Fix bug affecting recursive grouping.

* Avoid copies in grouping code.

* Checkpoint work on extending functionality.

* Make plotter more powerful. Add colourization option. Begin simplifying interface.

* Allow user specification of colourmap.

* Add plotsize parameter.

* Fix #293 - OOB access caused by `output.subtract_directions`  (#294)

* Fix version drift.

* Bump to 0.2.0

* Fix #293.

* Namedbackups (#296)

* Fix version drift.

* Bump to 0.2.0

* Add optional label and single field selection to backup app

* remove item instead of pop@index

* do not .remove() from xds_list

* Simplify using some existing functionality.

---------

Co-authored-by: JSKenyon <[email protected]>
Co-authored-by: landmanbester <[email protected]>

* Selectively disable MAD flagging criteria (#298)

* Fix version drift.

* Bump to 0.2.0

* Setting MAD threshold to zero will disable flagging on a given statistic.

* Disable mad flagging on off-diagonals by default (#300)

* Fix version drift.

* Bump to 0.2.0

* Disable flagging based on off-diagonal correlations in the mad flagger by default. This should make the mad flagger less aggressive on data with unmodelled polarised emission.

* Fix bug affecting non-standard columns in `input_ms.data_column` (#301)

* Fix version drift.

* Bump to 0.2.0

* Fix a bug affecting the use of non-standard columns in data column input.

* Don't allow restore app to overwrite metadata (#307)

* assign to ms to avoid over-writing metadata in restore app

* zip datasets in enumerate

* add comment to document failure case

* use backup_column_name in restore app

* Apply OCD.

---------

Co-authored-by: landmanbester <[email protected]>
Co-authored-by: JSKenyon <[email protected]>

* Fix for summary reporting SOURCE_ID as FIELD_ID (#309)

* Fix version drift.

* Bump to 0.2.0

* Make summary correctly report FIELD_ID and SOURCE_ID.

* Fix receptor summary (#310)

* Fix version drift.

* Bump to 0.2.0

* Fix incorrect assumption that FEED substable will always have 2 receptors.

* Fix similar problem affecting parallactic angle construction.

* Update missing column selection for compatibility with upstream changes.

* Fix xarray dims (#318)

* Fix version drift.

* Bump to 0.2.0

* Move all usage of xds.dims[dim] to xds.sizes[dim] in preparation for change of return type in xds.dims.

* Fixes for changes relating to Numba error types. (#319)

* Move now-deprecated graph metrics function into the scheduler plugin code. (#320)

* Make small changes to enable 3.11 compatibility. Requires changes in stimela + a release. (#321)

* Restringify keys in scheduler plugin. (#322)

* Attempt very dodgy solution to caching problem.

* Look for code in the correct place.

* Update pyproject.toml. Add poetry.lock. Update docs. (#323)

* Drop 3.8. Commit poetry lock file.

* Update stimela requirement.

* Update docs.

* Set min and max versions in pyproject.toml.

* Remove python3.8 from test matrix.

* Some debugging.

* Fix unsaved file.

* More debugging.

* Temporarily make test suite much smaller.

* Fix path.

* Actually fix path.

* Attempt at safer caching.

* More fiddling with paths.

* Fix bad tabbing.

* Try to find out where things are failing.

* More fiddling.

* More fiddling.

* More fiddling.

* Try restore time action.

* Tidy up caching approach. Use action. Restore matrix and test everything.

* Remove tmp file.

* Reword CI step name.

---------

Co-authored-by: JSKenyon <[email protected]>
Co-authored-by: Landman Bester <[email protected]>
Co-authored-by: JSKenyon <[email protected]>
Co-authored-by: landmanbester <[email protected]>

* Bump dask-ms and codex-africanus dependencies. Update lock.

---------

Co-authored-by: Simon Perkins <[email protected]>
Co-authored-by: Landman Bester <[email protected]>
Co-authored-by: landmanbester <[email protected]>
Development

Successfully merging this pull request may close these issues.

Suggestion: Use github actions to cache numba kernels
3 participants