
EnTK deadlock on non-trivial pipeline counts. #410

Closed

andre-merzky opened this issue Jan 27, 2020 · 7 comments
@andre-merzky
Member

I am trying to run the following code (simplified, but reproduces the problem):

#!/usr/bin/env python3

import radical.entk  as re

if __name__ == '__main__':

    pipelines = set()
    for i in range(128):
        t = re.Task()
        t.executable = '/bin/date'

        s = re.Stage()
        s.add_tasks(t)

        p = re.Pipeline()
        p.add_stages(s)

        pipelines.add(p)


    amgr = re.AppManager(autoterminate=True, hostname='localhost', port=5672)
    amgr.resource_desc = {'resource': 'local.localhost',
                          'walltime': 10,
                          'cpus'    : 4}

    amgr.workflow = set(pipelines)
    amgr.run()
    amgr.terminate()

For small pipeline counts (line 8 of the script), this usually runs fine, but the more pipelines I use, the more likely the code is to hang during workload execution (about a 50% chance at 128; this may vary across machines). From what I have seen with the RU DebugHelper, the AppMgr and the WorkflowManager eventually hold locks on the same pipeline and race for those locks, and more pipelines make that race much more likely.
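
For illustration, here is a minimal standalone sketch of that failure mode - not EnTK code, just two threads acquiring the same pair of locks in opposite order, which is the classic shape of such a deadlock:

#!/usr/bin/env python3

# Standalone sketch (not EnTK code): two threads take the same pair of locks
# in opposite order.  Once the timing lines up, both wait forever - analogous
# to two components contending for per-pipeline locks.

import threading
import time

lock_a = threading.Lock()      # stands in for a pipeline's state lock
lock_b = threading.Lock()      # stands in for another shared lock

def worker_1():
    with lock_a:
        time.sleep(0.1)        # widen the race window
        with lock_b:
            print('worker_1 done')

def worker_2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:
            print('worker_2 done')

t1 = threading.Thread(target=worker_1)
t2 = threading.Thread(target=worker_2)
t1.start()
t2.start()
t1.join()                      # never returns: each thread waits on the other's lock
t2.join()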

FWIW, the last line of output is:

Update: pipeline.0127.stage.0127 state: SCHEDULED

I am using this stack:

$ rs

  python               : 3.7.5
  pythonpath           :
  virtualenv           : /home/merzky/radical/radical.repex/ve3

  radical.entk         : 1.0.0-v1.0.0-7-gd6c4c290@devel
  radical.pilot        : 1.0.0-v1.0.0-8-g2f930ad2@devel
  radical.saga         : 1.0.0
  radical.utils        : 1.0.0

If you are able to reproduce the lockup, you can inspect the locks like this:

  • apply this patch to RE (devel):
--- i/src/radical/entk/pipeline/pipeline.py
+++ w/src/radical/entk/pipeline/pipeline.py
@@ -15,6 +15,9 @@ class Pipeline(object):

     """

+    pipe_num = 0
+
+
     def __init__(self):

         self._uid = None
@@ -32,7 +35,9 @@ class Pipeline(object):
         self._cur_stage = 0

         # Lock around current stage
-        self._lock = threading.Lock()
+
+        self._lock = ru.Lock(name='pipe.%06d' % Pipeline.pipe_num)
+        Pipeline.pipe_num += 1

(FWIW, I would suggest making this patch permanent; it has next to no runtime overhead.)

  • export RADICAL_DEBUG_HELPER=True
  • export RADICAL_DEBUG=True
  • run the code
  • in a second shell, send kill -USR1 <pid1> <pid2> to the two python processes (one is created internally by RE)
  • find stack traces and lock info in /tmp/ru.<pid1|pid2>.log
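
If patching RE is not convenient, a rough standard-library approximation of the stack-dump step (it gives thread stack traces, but not the lock bookkeeping the RU DebugHelper adds) could look like this sketch:

#!/usr/bin/env python3

# Generic alternative sketch: dump all thread stack traces to a file when the
# process receives SIGUSR1, using only the standard library.

import os
import signal
import faulthandler

out = open('/tmp/stacks.%d.log' % os.getpid(), 'w')
faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)

# ... run the EnTK workflow here, then from a second shell:
#         kill -USR1 <pid>
# and inspect /tmp/stacks.<pid>.log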

Let me know if I can provide more details. Alas, this is somewhat urgent as one of our collaborators depends on this functionality...

@andre-merzky
Member Author

Hi @lee212: did you manage to make progress on this item, or do you need help doing so? Was the problem reproducible for you?

@lee212
Contributor

lee212 commented Feb 1, 2020

I confirmed this is reproducible at the moment.

@andre-merzky
Member Author

Hi @lee212 - this issue can be worked around by placing a sleep(3) right here, which basically delays the first send on the enqueue RMQ channel. I think this smells like a synchronization issue between thread startup and RMQ startup, but alas, I did not manage to track it down further. A git bisect did not help either - the problem seems to stem from before the Py3 transition.

For now the RepEx use case is unstuck with that workaround, so I lowered the priority of this issue and would like to pass it back to you. My suggestion would be to look into options for tracking the messages on the RMQ channels, and to check whether any differences show up with and without the delay.
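
For illustration only, here is a standalone sketch of what that kind of experiment could look like with a plain pika client - it is not the EnTK code path, and the queue name 'enqueue_test' is made up for this example:

#!/usr/bin/env python3

# Standalone sketch (not EnTK code): delay the first publish on a RabbitMQ
# channel and print a timestamp for every message sent, so that traces with
# and without the delay can be compared.  Connection parameters match the
# reproducer above; the queue name is made up.

import time
import pika

conn    = pika.BlockingConnection(
              pika.ConnectionParameters(host='localhost', port=5672))
channel = conn.channel()
channel.queue_declare(queue='enqueue_test')

time.sleep(3)      # the workaround: give consumer threads time to come up

for i in range(10):
    body = 'msg.%03d' % i
    channel.basic_publish(exchange='', routing_key='enqueue_test', body=body)
    print('sent %s at %.3f' % (body, time.time()))

conn.close()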

@lee212 lee212 added this to the Jan 2021 Release milestone Jul 7, 2020
@mturilli mturilli assigned iparask and unassigned lee212 Nov 23, 2020
@mturilli
Contributor

@iparask could you pick this up? Based on its priority, and after you have reviewed it, could you tell us whether we can target this for the Dec release?

@mturilli
Contributor

mturilli commented Dec 3, 2020

@iparask ping

@iparask
Contributor

iparask commented Jan 4, 2021

I made the change suggested by @andre-merzky and removed the sleep(3) in the wfprocessor, but I could not reproduce the issue. @lee212, do you have any logs I can look at?

@iparask
Contributor

iparask commented Jan 5, 2021

I do not think this is an issue anymore, as I am able to run thousands of tasks. Also, the ICEBERG use cases run several thousand pipelines and have never faced this issue. I'm closing.

@iparask iparask closed this as completed Jan 5, 2021