More elaborate mechanism to detect code changes? #99
Hi @codethief, thanks for asking this. By default, redun does use a conservative way of hashing code, which is to just hash the source string of the task definition. That does mean that changes to constants defined outside the task will not get detected as changes that would bust the cache. For common constants like that, one idiom is to make sure they are passed through to the task as default argument values:

```python
MY_CONSTANT = "value"

@task
def foo(x, my_constant=MY_CONSTANT):
    # Now you can use x and my_constant.
    ...
```

Changing `"value"` to something else like `"value2"` will cause an argument hash change and a proper re-execution of the task.

More complex tracing or static analysis is not done because the results can often be surprising or unpredictable to users, such as a change to code imported from another module or library causing a cache bust. There might be a way to define it where you can have your cake and eat it too, but thus far we've stuck to a conservative approach. We discuss some of the related details in the docs: https://insitro.github.io/redun/tasks.html#task-hashing

However, if you would like to implement a more complex hashing approach, especially for a build tool system, perhaps you might offer your users a task decorator that wraps redun's:

```python
from redun import task

def build_step(**options):
    def deco(func):
        version = calc_complex_hash(func)  # Here, you implement something more sophisticated.
        return task(version=version, **options)(func)  # Use redun's manual versioning to influence the task hash.
    return deco

@build_step()
def my_build_step(x, y):
    ...
```

Let me know if that helps. We have implemented several wrapper task decorators internally for situations where we want to customize the task definition process.
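For illustration, `calc_complex_hash` above is left entirely to the implementer; it is a hypothetical helper, not part of redun. One minimal sketch, assuming you only want to catch changes to module-level functions that the task calls directly, is to hash the task's compiled code together with the code of those referenced functions:

```python
import hashlib
import inspect


def calc_complex_hash(func):
    """Hypothetical sketch: hash a function's bytecode and constants,
    plus those of any module-level functions it references by name.
    (Bytecode differs across Python versions, so this hash is only
    stable within one interpreter version.)"""
    hasher = hashlib.sha256()

    def feed(f):
        hasher.update(f.__code__.co_code)
        hasher.update(repr(f.__code__.co_consts).encode())

    feed(func)
    for name in func.__code__.co_names:
        ref = func.__globals__.get(name)
        if inspect.isfunction(ref):
            feed(ref)
    return hasher.hexdigest()
```

This only looks one level deep and misses closures, methods, and imported helpers, but it shows the general shape: fold whatever extra state you care about into a deterministic digest and pass it as `version=`.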
For cases like this, we do support an opt-in approach:

```python
def double(x):
    # A normal function
    return 2 * x

@task(hash_includes=[double])
def my_task(x):
    return double(x) + 2
```
Hi @mattrasmus, thanks so much for the elaborate response!

To be honest, I find such complex tracing to be the more conservative option, given that it matches at least my expectations of how caching should behave more closely. Then again, you seem to have put a lot of thought into this, and you surely have more experience with what users expect and don't expect.

In fact, now that I'm thinking about it, I can imagine one situation where some "complex static analysis" could lead to pitfalls (maybe you had something similar in mind?): if task A depends on task B and B is changed by the user, naive static analysis would likely invalidate the cache for A, too. However, it's probably desirable for the cache for task A not to get invalidated automatically, at least as long as B still returns the same value. Conceptually, the return value of B should just be another input parameter to A whose hash gets incorporated into the cache key.

Anyway, the idea to use a custom decorator is very neat, as it will give me the option to experiment a bit. I'll report back!
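The intuition that B's return value should act as just another input to A's cache key can be made concrete with a toy memoizer. This is purely illustrative (not how redun is implemented): the cache key is built from the function's name and a hash of its already-evaluated inputs, so editing B's source only invalidates A when B's *value* actually changes.

```python
import hashlib
import json

cache = {}

def result_keyed_call(func, *args):
    # Key the cache on the function name plus a hash of its evaluated
    # inputs. Changing task_b's source only busts task_a's cache entry
    # if task_b's return value changes.
    digest = hashlib.sha256(json.dumps(list(args), sort_keys=True).encode()).hexdigest()
    key = (func.__name__, digest)
    if key not in cache:
        cache[key] = func(*args)
    return cache[key]

def task_b():
    return 10  # editing this body only matters if the value changes

def task_a(b_value):
    return b_value * 2

# task_a's cache key incorporates task_b's result, not task_b's source:
result = result_keyed_call(task_a, result_keyed_call(task_b))
```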
If you're interested in understanding more about the subtleties of caching for tasks that call other tasks, check out what we discuss in the implementation notes related to single and ultimate reductions: https://insitro.github.io/redun/implementation/graph_reduction_caching.html I'm also interested to hear how the build system goes! I've always imagined redun might help in such a case.
Consider this slightly modified version of 01_hello_world:
If I change the constant `planet` to another value like `"Mars"`, redun will not bust the cache and will instead still yield the output `"Hello, World!"` for the `main` task. This is of course not entirely surprising given how redun detects code changes, but I'd still say it represents a rather big footgun for people who are used to writing "normal" Python code.[0] Similar situations arise when people use regular (non-`@task`) Python functions in their code.

In light of this, would it be feasible to offer an alternative caching mechanism which is less aggressive, i.e. more careful about what code changes could cause a task to change? Two ideas that come to mind:

1. File changes to a given Python module could always bust the cache of all tasks in the module. Of course, this would invalidate the cache much more frequently but, personally, I think I would still prefer this over accidentally running into incorrect cache hits time and again.
2. Use Python's `trace` module to track calls to other functions and, more generally, which lines of code get executed. Use the content of those lines to compute a hash.

[0]: I am looking into using redun to define build & CI/CD pipelines for a (largely) non-Python project. Since the CI/CD code will not be something people will work on all the time, I would like to reduce footguns like erroneous cache hits as far as possible to make it easier for my coworkers to contribute to the pipelines.
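The `trace`-based idea could be prototyped roughly like this. A minimal sketch using only the stdlib: run the task once under the tracer, then hash the text of every line that actually executed. A real implementation would need to filter out library frames, handle non-deterministic branches, and decide when it is safe to run the function at all.

```python
import hashlib
import linecache
import trace


def hash_executed_lines(func, *args, **kwargs):
    # Run func under the stdlib line tracer, then hash the source text
    # of every (file, line) pair that executed. Sketch only: library
    # frames and branch-dependent coverage are not handled.
    tracer = trace.Trace(count=1, trace=0)
    result = tracer.runfunc(func, *args, **kwargs)
    hasher = hashlib.sha256()
    for (filename, lineno), _count in sorted(tracer.results().counts.items()):
        hasher.update(linecache.getline(filename, lineno).encode())
    return result, hasher.hexdigest()
```

Note the inherent tension: you must execute the code to learn which lines it touches, so this works better as a post-hoc validation of a cache hit than as a way to decide whether to execute in the first place.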