BUG: using `pytask.task()(func_defined_in_other_file)` doesn't work #513

NickCrews · 2023-12-05T21:17:26Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pytask.
(optional) I have confirmed this bug exists on the main branch of pytask.

Code Sample, a copy-pastable example

# lib.py
def get_name():
    return "jane"

# task_bug.py
from pathlib import Path

import pytask

from .lib import get_name

task_get_name = pytask.task(produces=Path("name.txt"))(get_name)

Problem description

It appears that inpytask.task(), the created task is stored in COLLECTED_TASKS,
but is stored under COLLECTED_TASKS["lib.py"] (the place where the function was defined), not under COLLECTED_TASKS["task_bug.py"] (where the task was created). Then, when we actually go looking through COLLECTED_TASKS for tasks to execute, we miss it because we only look under "task_bug.py"

Expected Output

I think the created task should be stored under COLLECTED_TASKS["task_bug.py"] where the task was created.

The text was updated successfully, but these errors were encountered:

Possible fix for pytask-dev#513 I didn't add any tests yet until I got confirmation this is the right direction.

NickCrews · 2023-12-06T05:23:37Z

Ehh, actually the solution of using the module where pytask.task() was called will fall apart if I have some make_task() wrapper in my_tasks.py, since then it will always be "my_tasks.py". Why do we even need to store tasks in a dict? Is that only needed for automatic collection, but when using the programmatic interface should it work a different way? Can/should we add a param to task() so you can override the module where the task is supposed to live?

tobiasraabe · 2023-12-06T20:00:40Z

As a workaround, does wrapping it in lambda work? Like lambda **x: get_name(**x).

NickCrews · 2023-12-07T00:01:15Z

yes that does work, but long term it would be nice if that wasn't required.

tobiasraabe · 2023-12-09T00:04:28Z

In #521 I added an error message that tells the user about tasks that have not been collected and that the tasks need to be wrapped with a lambda. So, you would have at least received an error in your case.

Why do we even need to store tasks in a dict?

It is to prevent garbage-collecting tasks defined in a loop that will be overwritten in the next iteration. The dictionary could be replaced by anything else that keeps the objects alive.

Can/should we add a param to task() so you can override the module where the task is supposed to live?

This could be an option. The injected path would be used to store the function, and it would be collected with the correct module. But I am unsure how I feel about it.

I am also wondering what the more reasonable assumption is to determine the task module of a function wrapped with @task. Is it the module where the function receives the decorator or where the function is defined? First, seems more natural.

What are the use cases we have?

We wrap a function imported from another module that should be declared in the task module as a task.
What is the wrapper you were talking about? Is it like a task generator that is called from the task module with a task function? And the task generator lives outside the task module and wraps the function with @task?

What if we walk up the call stack when @task is applied to the function until we hit a task module? This would cover both use cases, but it feels even more hacky.

NickCrews · 2023-12-12T23:49:31Z

Can we take a step back and question: why does a task need to have a module attribute?

For CLI-discovered tasks (eg a def task_* inside a task_*.py file, found using pytask collect then it makes sense why every task needs a module attribute (so that the live rendering of tasks completing works, and maybe for other reasons?).

But for programmatically-defined tasks (eg created with pytask.task()), do they really need to know which module they were defined in? If not, maybe we can sidestep this whole problem?

tobiasraabe · 2023-12-18T00:00:58Z

I agree that tasks passed via the programmatic interface pytask.build(tasks=[...]) or marked with the @task decorator should be collected regardless of the location.

The linked PR fixes that and allows setting the module name via the module argument to correct the displayed path.

I still have to see whether it conflicts with code in pytask-paralle. The path is used as a fallback if the function does not yield a path.

NickCrews · 2023-12-18T00:48:10Z

Wait wait, if we can I really think we should try to avoid adding this argument to .task(). Why does a task need to have a module attribute? I'm wondering if we can just get rid of that requirement for programmatically defined tasks? I'm probably missing something obvious and there is a real reason they do, but I want to double check. Thanks!

tobiasraabe · 2024-01-03T14:51:48Z

Hi! Belated happy holidays and a happy new year!

Sorry for skipping right to a change without answering your question. I appreciate it. Let's dig deeper and see whether we find a better solution.

And, sorry for the lengthy response. I am trying to figure out how to best look at the problem.

A bit of background on why there is a path attribute.

There are currently two kinds of tasks, PTaskWithPath not PTask, and only the first kind of tasks have a .path attribute.

PTask is used for tasks defined in Jupyter notebooks or the terminal without a real source file.

PTaskWithPath is the task protocol based on functions found in task modules.

A task decorated with @task can be converted into either of the two, depending on the situation.

The .path attribute is used in the following situations.

To detect whether a task has changed, pytask hashes the module in which a PTaskWithPath is defined. For PTask, pytask hashes the source code of the function found with inspect.getsource.
To create the signature of a task that uniquely identifies it among all other tasks, PTaskWithPath uses the path + function name + identifier when part of a repetition. PTask only uses the name + identifier.
For @task decorated tasks, the path is necessary to collect the task. In Collect all tasks in COLLECTED_TASKS. #529, I changed it and added a collection step for all tasks decorated with @task even if pytask does not step through their module.
Generally, .path is used when displaying the task in the execution table or elsewhere. Then, parts of the path are shown next to the task name to ease identification.

The amount of path parts shown depends on how many are needed to make the displayed name unique.

Here, your imported function might need some help because the task's path points to the imported module instead of the task module where the task is defined or that triggered the task's creation. Since this might confuse a user, I added patching the location with @task(module=...).

This is purely cosmetic and more helpful for the user, especially in the absence of
vscode's support for terminal hyperlinks. Since we have the signature, the names do not need to be unique or long
for pytask.
.path is also used in pytask-parallel as a fallback to register task modules with cloudpickle which allows to pickle them and send them to another process. But I wonder if it is essential.

What are possible solutions? 2-4 are more cosmetic, and five would be a bigger change.

Leave it as it is. The module name might be misleading, but everything is still working, and I think this use case will only happen sometimes.
Whenever the task name is set automatically, like ../task_module.py::task_name, and the module name is not prefixed with task_, we remove the path part from the task name. The name becomes just task_example.

It avoids confusion with the module but makes it harder for users to track down the source of the task.

As long as vscode does not support terminal links (https://github.com/microsoft/vscode/issues/176812) navigating to the source of the task without knowing the module is cumbersome.
Walking up the stack until we are in a task module and take that as the path module.
Allow users to patch the module name with @task(module=...). Probably the worst way. People will still be irritated at first; even if they want to solve it, they probably would not read about the module keyword easily.
Do we have to show the module or parts of the path to identify the task?

Again, vscode poses a problem for the short- to mid-run.

On the other hand, we could unify PTask and PTaskWithPath which is a weird differentiation.

If we would remove the .path attribute, what about the five points above?
1. We could fall back to inspect.getsource even for PTaskWithPath. Or, better, we treat the module, if it exists, as a special dependency of the task. 2. Without .path, the signature still needs the module path as a hash component to be unique within bigger projects. We could take it from some attribute for unique dependencies.
2. ignore
3. Use only the task name + identifier.
  
  For pytest, it makes more sense that modules are inherent in how tasks are structured.
  
  For pytask, the task name might capture all relevant information. But, usually, the module or sometimes the first parent directory can carry information as well.
4. ignore

NickCrews added the bug Something isn't working label Dec 5, 2023

NickCrews added a commit to NickCrews/pytask that referenced this issue Dec 5, 2023

use caller's module as ID for task in task()

c98217a

Possible fix for pytask-dev#513 I didn't add any tests yet until I got confirmation this is the right direction.

NickCrews mentioned this issue Dec 5, 2023

use caller's module as ID for task in task() #516

Closed

3 tasks

tobiasraabe mentioned this issue Dec 8, 2023

Add error message for not collected tasks with @task decorator. #521

Merged

tobiasraabe closed this as completed in #521 Dec 12, 2023

tobiasraabe reopened this Dec 12, 2023

tobiasraabe mentioned this issue Dec 17, 2023

Collect all tasks in COLLECTED_TASKS. #529

Closed

Ostheer mentioned this issue Jan 11, 2024

BUG: task.execute returns None despite the function returning a value #547

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: using `pytask.task()(func_defined_in_other_file)` doesn't work #513

BUG: using `pytask.task()(func_defined_in_other_file)` doesn't work #513

NickCrews commented Dec 5, 2023 •

edited by tobiasraabe

Loading

NickCrews commented Dec 6, 2023

tobiasraabe commented Dec 6, 2023

NickCrews commented Dec 7, 2023

tobiasraabe commented Dec 9, 2023

NickCrews commented Dec 12, 2023

tobiasraabe commented Dec 18, 2023

NickCrews commented Dec 18, 2023 •

edited

Loading

tobiasraabe commented Jan 3, 2024 •

edited

Loading

BUG: using pytask.task()(func_defined_in_other_file) doesn't work #513

BUG: using pytask.task()(func_defined_in_other_file) doesn't work #513

Comments

NickCrews commented Dec 5, 2023 • edited by tobiasraabe Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

NickCrews commented Dec 6, 2023

tobiasraabe commented Dec 6, 2023

NickCrews commented Dec 7, 2023

tobiasraabe commented Dec 9, 2023

NickCrews commented Dec 12, 2023

tobiasraabe commented Dec 18, 2023

NickCrews commented Dec 18, 2023 • edited Loading

tobiasraabe commented Jan 3, 2024 • edited Loading

BUG: using `pytask.task()(func_defined_in_other_file)` doesn't work #513

BUG: using `pytask.task()(func_defined_in_other_file)` doesn't work #513

NickCrews commented Dec 5, 2023 •

edited by tobiasraabe

Loading

NickCrews commented Dec 18, 2023 •

edited

Loading

tobiasraabe commented Jan 3, 2024 •

edited

Loading