EnTK deadlock on non-trivial pipeline counts. #410
Comments
Hi @lee212: did you manage to make progress on this item, or do you need help doing so? Was the problem reproducible for you?
I confirmed this is reproducible at the moment.
Hi @lee212 - this issue can be worked around by placing a delay; a hedged sketch of one possible placement follows below. For now the RepEx use case is unstuck with that workaround, and thus I lowered the priority of this issue and would like to pass it back to you. My suggestion would be to look into options to track the messages on the RMQ channels, and to see whether any differences show up w.r.t. the presence of the delay.
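(A minimal sketch of such a workaround, purely illustrative: the exact placement and duration of the delay used for RepEx are not stated in this thread, and `create_pipeline()` is a hypothetical helper that builds one EnTK `Pipeline`.)

```python
import time

def generate_pipelines(n):
    # create_pipeline() is a hypothetical helper returning one radical.entk.Pipeline
    pipelines = set()
    for _ in range(n):
        pipelines.add(create_pipeline())
        time.sleep(0.1)   # assumed workaround: short delay between pipeline setups
    return pipelines
```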
@iparask, could you pick this up? After you review it, could you tell, based on priority, whether we can target this for the Dec release?
@iparask ping
I made the change suggested by @andre-merzky and removed a [...]
I do not think this is an issue anymore, as I am able to run thousands of tasks. Also, the ICEBERG use cases have several thousand pipelines and have never hit this issue. I'm closing.
Original issue description:

I am trying to run the following code (simplified, but it reproduces the problem):
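(The original snippet did not survive in this page extract. The following is a minimal reconstruction of an EnTK script of the kind described - many pipelines, each with one stage holding one trivial task; the task executable, RabbitMQ host/port, and resource description are assumptions, not the original values.)

```python
from radical.entk import Pipeline, Stage, Task, AppManager

n_pipelines = 128                            # pipeline count referred to below

pipelines = set()
for _ in range(n_pipelines):

    t = Task()
    t.executable = '/bin/date'               # assumed trivial workload

    s = Stage()
    s.add_tasks(t)

    p = Pipeline()
    p.add_stages(s)

    pipelines.add(p)

appman = AppManager(hostname='localhost', port=5672)    # assumed local RabbitMQ
appman.resource_desc = {'resource': 'local.localhost',  # assumed resource description
                        'walltime': 10,
                        'cpus'    : 1}
appman.workflow = pipelines
appman.run()
```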
For a small number of pipelines (the count set at the top of the snippet), this usually runs OK - but the more pipelines I use, the more likely the code is to hang during workload execution (roughly a 50% chance at 128 pipelines; this may vary across machines). From what I have seen with the RU DebugHelper, the AppManager and the WorkflowManager eventually hold locks on the same pipeline, and bang - they seem to race for those locks; more pipelines make that race much more likely.
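(For illustration only - this is generic Python, not EnTK's actual locking code: two components that each need the same pair of locks can deadlock when they acquire them in opposite order, which is the kind of race suspected above.)

```python
import threading
import time

lock_a = threading.Lock()   # think: lock guarding a pipeline's state
lock_b = threading.Lock()   # think: lock held by the other manager component

def worker_1():
    with lock_a:
        time.sleep(0.1)     # widen the race window to make the hang likely
        with lock_b:        # blocks forever if worker_2 already holds lock_b
            pass

def worker_2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:        # blocks forever if worker_1 already holds lock_a
            pass

t1 = threading.Thread(target=worker_1)
t2 = threading.Thread(target=worker_2)
t1.start(); t2.start()
t1.join();  t2.join()       # with both inner acquires blocked, this never returns
```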
FWIW, the last line of output is:
I am using this stack:
If you are able to reproduce the lockup, you can inspect the locks like this:

```
export RADICAL_DEBUG_HELPER=True
export RADICAL_DEBUG=True
```

(FWIW, I would suggest making this patch permanent, it has next to no runtime overhead.)

Then send

```
kill -USR1 <pid1> <pid2>
```

to the two python processes (one is created internally by RE) and inspect the resulting dumps in /tmp/ru.<pid1|pid2>.log.
Let me know if I can provide more details. Alas, this is somewhat urgent as one of our collaborators depends on this functionality...