MADNESS firing tasks into PaRSEC deadlocks when MAD_NUM_THREADS=1 #139

Open
evaleev opened this issue Aug 29, 2021 · 10 comments

@evaleev
Contributor

evaleev commented Aug 29, 2021

No description provided.

@devreal
Contributor

devreal commented Aug 29, 2021

Does that happen with any example? Do you think it's a bug in MADNESS or PaRSEC?

@evaleev
Contributor Author

evaleev commented Aug 29, 2021

@devreal this probably happens with all examples. The issue is how PaRSEC is used in MADNESS: with MAD_NUM_THREADS=1, PaRSEC is not supposed to use any threads to execute tasks; only the main thread is supposed to execute tasks, during fences. My guess is that there is currently no way to make the main thread part of the PaRSEC thread group, nor to make a non-PaRSEC main thread execute the current task pool.

This really is a MADNESS issue, but I put it here to attract the eyes of @therault and @bosilca; the PaRSEC backend in MADNESS was implemented in the TESSE project anyway.

@robertjharrison
Contributor

robertjharrison commented Aug 30, 2021 via email

@therault
Contributor

therault commented Aug 30, 2021 via email

@evaleev
Contributor Author

evaleev commented Aug 30, 2021

@robertjharrison yes, they have parsec_context_wait ... but simply replacing ThreadPool::run_task() in ThreadPool::await() with parsec_context_wait(parsec) causes a lack of progress. So clearly the logic is a bit more spread out.

@devreal
Contributor

devreal commented Aug 30, 2021

What progress is lacking in that case? Communication?

@robertjharrison
Contributor

robertjharrison commented Aug 30, 2021 via email

@devreal
Contributor

devreal commented Aug 30, 2021

Does TTG+PaRSEC itself execute OK with just one thread (i.e., no threads in the task pool)?

It does. In the PaRSEC backend in TTG, we call parsec_context_wait so that the main thread participates in task execution. PaRSEC has a separate communication thread, though. I'm not sure whether the main thread would have to drive communication in MADNESS (which it couldn't if it is stuck in parsec_context_wait). I don't know the PaRSEC integration in MADNESS, though.

@therault
Contributor

The current PaRSEC implementation in MADNESS mimics the TBB implementation:

  • In the constructor of madness::ThreadPool:
    • parsec_init
    • parsec_remote_dep_set_ctx
    • parsec_taskpool_update_runtime_nb_tasks(+1)
    • parsec_context_start
  • In the destructor of madness::ThreadPool:
    • parsec_taskpool_update_runtime_nb_tasks(-1)
    • parsec_context_wait
    • parsec_fini

Threads call __parsec_schedule with the ready task to schedule. The main thread only enters parsec_context_wait during destruction, so this thread never does any work.

There are two ways to provide active participation of the main thread:

  • active polling: we could provide a PaRSEC function that runs one task and returns. MADNESS could call it in its progress loop in await() if it needs to progress things other than tasks
  • enter parsec_context_wait() when the main thread of MADNESS has only one thing to do: progress tasks.
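
As a rough illustration of the first option, here is a minimal single-threaded sketch of a progress loop that runs one task per iteration. Everything in it is an assumption made for illustration: `ToyPool`, `try_run_one` and `await_with_polling` are hypothetical names, not part of PaRSEC or MADNESS, and PaRSEC does not necessarily expose a "run one task and return" entry point today.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Toy stand-in for a task runtime that has no worker threads.
// Hypothetical; this is not the PaRSEC or MADNESS API.
class ToyPool {
public:
    void submit(std::function<void()> task) {
        assert(task);  // reject empty std::function objects
        ready_.push_back(std::move(task));
    }

    // Hypothetical "run one task and return" call.
    // Returns false if no task was ready.
    bool try_run_one() {
        if (ready_.empty()) return false;
        auto task = std::move(ready_.front());
        ready_.pop_front();
        task();  // a task may submit further tasks
        return true;
    }

    std::size_t pending() const { return ready_.size(); }

private:
    std::deque<std::function<void()>> ready_;
};

// Progress loop in the spirit of ThreadPool::await(): poll until done() holds,
// running at most one task per iteration so that the caller could interleave
// other progress work. Returns the number of tasks run by the calling thread.
template <typename Probe>
std::size_t await_with_polling(ToyPool& pool, Probe done) {
    std::size_t executed = 0;
    while (!done()) {
        if (pool.try_run_one()) ++executed;
        // else: a real loop would progress communication or yield here
    }
    return executed;
}
```

Because the loop returns to the caller after every task, the calling thread keeps control between tasks, which is the property that distinguishes this option from a blocking wait.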

@evaleev In addition to calling parsec_context_wait(), we need to update runtime_nb_tasks before and after calling it, and do some things with the taskpool, before we can call parsec_context_wait() again.

The second option is probably cleaner. Does that happen? Where should I look?
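
The counter update mentioned for the second option can be pictured with a toy model. The sketch below assumes that a wait call returns only once a task counter reaches zero, and that the +1 applied in the constructor acts as a hold on that counter. `ToyContext`, `update_runtime_hold` and `fence` are hypothetical names, and the additional taskpool handling mentioned above is not modeled; the real PaRSEC behavior may differ.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy model of a context whose wait() completes only at a task count of zero.
// Illustrative only; this is not the PaRSEC API.
class ToyContext {
public:
    // Loosely mirrors the +1/-1 update of the runtime task count:
    // an extra count that keeps the pool from looking finished.
    void update_runtime_hold(int delta) { nb_tasks_ += delta; }

    void schedule(std::function<void()> task) {
        ++nb_tasks_;
        ready_.push_back(std::move(task));
    }

    // Runs ready tasks on the calling thread. Returns true once the counter
    // reaches zero. Returns false at the point where a blocking wait would
    // stall: nothing is ready but the counter is still positive.
    bool wait() {
        while (nb_tasks_ > 0) {
            if (ready_.empty()) return false;
            auto task = std::move(ready_.front());
            ready_.pop_front();
            task();
            --nb_tasks_;
            assert(nb_tasks_ >= 0);
        }
        return true;
    }

    int nb_tasks() const { return nb_tasks_; }

private:
    int nb_tasks_ = 0;
    std::deque<std::function<void()>> ready_;
};

// Fence: drop the hold, drain on the calling thread, then restore the hold
// so that tasks can be scheduled and waited on again.
bool fence(ToyContext& ctx) {
    ctx.update_runtime_hold(-1);
    const bool drained = ctx.wait();
    ctx.update_runtime_hold(+1);
    return drained;
}
```

In this model, calling `wait()` while the hold is still in place runs the ready tasks but can never complete, whereas `fence()` completes and leaves the context reusable.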

@bosilca
Contributor

bosilca commented Aug 30, 2021

Calling parsec_context_wait works in all DSLs where all dependency tracking happens in PaRSEC, as the main thread cannot escape this blocking function until all known tasks in the context are completed. If MADNESS has its communication thread outside the main thread, and the communication thread continues to guarantee communication progress (i.e., it will trigger known but not-yet-ready tasks), blocking the main thread in parsec_context_wait should work.
