
EnTK deadlock on non-trivial pipeline counts. #410

Closed

andre-merzky opened this issue Jan 27, 2020 · 7 comments
@andre-merzky
Member

I am trying to run the following code (simplified, but reproduces the problem):

#!/usr/bin/env python3

import radical.entk  as re

if __name__ == '__main__':

    pipelines = set()
    for i in range(128):
        t = re.Task()
        t.executable = '/bin/date'

        s = re.Stage()
        s.add_tasks(t)

        p = re.Pipeline()
        p.add_stages(s)

        pipelines.add(p)


    amgr = re.AppManager(autoterminate=True, hostname='localhost', port=5672)
    amgr.resource_desc = {'resource': 'local.localhost',
                          'walltime': 10,
                          'cpus'    : 4}

    amgr.workflow = set(pipelines)
    amgr.run()
    amgr.terminate()

For small pipeline counts (line 8 of the script), this usually runs fine, but the more pipelines I use, the more likely the code is to hang during workload execution (about a 50% chance at 128; this may vary across machines). From what I have seen with the RU DebugHelper, the AppMgr and the WorkflowManager eventually hold locks on the same pipeline and race for those locks, and more pipelines make that race much more likely.
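
For illustration, here is a minimal standalone sketch of that failure mode - not EnTK code, just two threads acquiring the same pair of locks in opposite order, which is the classic shape of such a deadlock:

#!/usr/bin/env python3

# Standalone sketch (not EnTK code): two threads take the same pair of locks
# in opposite order.  Once the timing lines up, both wait forever - analogous
# to two components contending for per-pipeline locks.

import threading
import time

lock_a = threading.Lock()      # stands in for a pipeline's state lock
lock_b = threading.Lock()      # stands in for another shared lock

def worker_1():
    with lock_a:
        time.sleep(0.1)        # widen the race window
        with lock_b:
            print('worker_1 done')

def worker_2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:
            print('worker_2 done')

t1 = threading.Thread(target=worker_1)
t2 = threading.Thread(target=worker_2)
t1.start()
t2.start()
t1.join()                      # never returns: each thread waits on the other's lock
t2.join()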

FWIW, the last line of output is:

Update: pipeline.0127.stage.0127 state: SCHEDULED

I am using this stack:

$ rs

  python               : 3.7.5
  pythonpath           :
  virtualenv           : /home/merzky/radical/radical.repex/ve3

  radical.entk         : 1.0.0-v1.0.0-7-gd6c4c290@devel
  radical.pilot        : 1.0.0-v1.0.0-8-g2f930ad2@devel
  radical.saga         : 1.0.0
  radical.utils        : 1.0.0

If you are able to reproduce the lockup, you can inspect the locks like this:

  • apply this patch to RE (devel):
--- i/src/radical/entk/pipeline/pipeline.py
+++ w/src/radical/entk/pipeline/pipeline.py
@@ -15,6 +15,9 @@ class Pipeline(object):

     """

+    pipe_num = 0
+
+
     def __init__(self):

         self._uid = None
@@ -32,7 +35,9 @@ class Pipeline(object):
         self._cur_stage = 0

         # Lock around current stage
-        self._lock = threading.Lock()
+
+        self._lock = ru.Lock(name='pipe.%06d' % Pipeline.pipe_num)
+        Pipeline.pipe_num += 1

(FWIW, I would suggest making this patch permanent; it has next to no runtime overhead.)

  • export RADICAL_DEBUG_HELPER=True
  • export RADICAL_DEBUG=True
  • run the code
  • in a second shell, send kill -USR1 <pid1> <pid2> to the two python processes (one is created internally by RE)
  • find stack traces and lock info in /tmp/ru.<pid1|pid2>.log
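
If patching RE is not convenient, a rough standard-library approximation of the stack-dump step (it gives thread stack traces, but not the lock bookkeeping the RU DebugHelper adds) could look like this sketch:

#!/usr/bin/env python3

# Generic alternative sketch: dump all thread stack traces to a file when the
# process receives SIGUSR1, using only the standard library.

import os
import signal
import faulthandler

out = open('/tmp/stacks.%d.log' % os.getpid(), 'w')
faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)

# ... run the EnTK workflow here, then from a second shell:
#         kill -USR1 <pid>
# and inspect /tmp/stacks.<pid>.log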

Let me know if I can provide more details. Alas, this is somewhat urgent as one of our collaborators depends on this functionality...

@andre-merzky
Member Author

Hi @lee212: did you manage to make progress on this item, or do you need help doing so? Was the problem reproducible for you?

@lee212
Contributor

lee212 commented Feb 1, 2020

I confirmed this is reproducible at the moment.

@andre-merzky
Member Author

Hi @lee212 - this issue can be worked around by placing a sleep(3) right here, which basically delays the first send on the enqueue RMQ channel. I think this smells like a synchronization issue between thread startup and RMQ startup, but alas, I did not manage to track it down further. A git bisect did not help either - the problem seems to stem from before the Py3 transition.

For now the RepEx use case is unstuck with that workaround, so I lowered the priority of this issue and would like to pass it back to you. My suggestion would be to look into options for tracking the messages on the RMQ channels, and to check whether any differences show up with and without the delay.
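
For illustration only, here is a standalone sketch of what that kind of experiment could look like with a plain pika client - it is not the EnTK code path, and the queue name 'enqueue_test' is made up for this example:

#!/usr/bin/env python3

# Standalone sketch (not EnTK code): delay the first publish on a RabbitMQ
# channel and print a timestamp for every message sent, so that traces with
# and without the delay can be compared.  Connection parameters match the
# reproducer above; the queue name is made up.

import time
import pika

conn    = pika.BlockingConnection(
              pika.ConnectionParameters(host='localhost', port=5672))
channel = conn.channel()
channel.queue_declare(queue='enqueue_test')

time.sleep(3)      # the workaround: give consumer threads time to come up

for i in range(10):
    body = 'msg.%03d' % i
    channel.basic_publish(exchange='', routing_key='enqueue_test', body=body)
    print('sent %s at %.3f' % (body, time.time()))

conn.close()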

@lee212 lee212 added this to the Jan 2021 Release milestone Jul 7, 2020
@mturilli mturilli assigned iparask and unassigned lee212 Nov 23, 2020
@mturilli
Contributor

@iparask could you pick this up? Based on its priority, and after you have reviewed it, could you tell us whether we can target this for the Dec release?

@mturilli
Contributor

mturilli commented Dec 3, 2020

@iparask ping

@iparask
Contributor

iparask commented Jan 4, 2021

I made the change suggested by @andre-merzky and removed the sleep(3) in the wfprocessor, but I could not reproduce the issue. @lee212, do you have any logs I can look at?

@iparask
Contributor

iparask commented Jan 5, 2021

I do not think this is an issue anymore, as I am able to run thousands of tasks. Also, the ICEBERG use cases run several thousand pipelines and have never faced this issue. I'm closing.

@iparask iparask closed this as completed Jan 5, 2021