Task Manager process gets terminated during runtime #285
Comments
Hey Giannis, this is a pending issue to fix in EnTK. Once a pilot manager or unit manager fails, it is recreated automatically by starting a new process. Since it is a new process, the uids (which are unique only within a process) start again from 0000. When the pmgr/umgr is pushed to MongoDB, there is a conflict and you see the error:
I have had a discussion with Andre on how to tackle this. I'll be picking this up in the next release. However, this error is probably triggered by a previous error. Can you share the entire verbose EnTK log? It might be useful to attach the RP logs as well if you have them.
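As a rough illustration of the uid conflict described above, here is a minimal sketch. It is not EnTK or RP code: `UidGenerator`, `FakeStore`, and `simulate_restart` are hypothetical names, and the in-memory store only stands in for a database collection that rejects duplicate keys. The uid format `umgr.0000` is also an assumption made for the example.

```python
import itertools


class UidGenerator:
    """Hypothetical per-process uid counter (not the actual EnTK/RP code)."""

    def __init__(self, prefix):
        self._prefix = prefix
        # The counter lives in process memory, so a new process starts at 0.
        self._counter = itertools.count()

    def next_uid(self):
        return "%s.%04d" % (self._prefix, next(self._counter))


class FakeStore:
    """Stand-in for a database collection with a unique key constraint."""

    def __init__(self):
        self._docs = {}

    def insert(self, uid, doc):
        if uid in self._docs:
            raise KeyError("duplicate uid: %s" % uid)
        self._docs[uid] = doc


def simulate_restart():
    """Return (first uid, uid after restart, whether a conflict occurred)."""
    store = FakeStore()

    # The original process registers its unit manager.
    first = UidGenerator("umgr")
    uid_a = first.next_uid()
    store.insert(uid_a, {"owner": "original process"})

    # The restarted process has a fresh counter and reuses the same uid.
    second = UidGenerator("umgr")
    uid_b = second.next_uid()
    try:
        store.insert(uid_b, {"owner": "restarted process"})
        conflict = False
    except KeyError:
        conflict = True

    return uid_a, uid_b, conflict
```

Calling `simulate_restart()` shows both processes producing the same first uid, which is why the second insert is rejected. Any fix would need uids that stay unique across process restarts, not just within one process.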
Hey, I am not sure that the Unit Manager failed. I did not see anything in the logs that indicated that the unit manager failed. Is there a way for me to verify it? Also, is it possible that the process monitoring the unit manager fails?
Yes, that is possible as well (but unlikely unless you have >2048 concurrent tasks). I think you have added only a part of the verbose log starting from 8:52:42. There should be more messages before that which will let us determine the source of the error. If you have access to them, could you add those messages as well?
So I have 4096 pipelines :). I cannot find an ERROR from the Unit Manager. In addition, I see that the Task Manager process closes and tries to start again. If I understand correctly, every time the task manager process restarts it tries to create a unit manager, right? If the Unit Manager exists and the task manager tries to create a new one, it will fail. Also, when is the release that fixes this scheduled?
Ah okay. In that case, you are right that the heartbeat times out (#270). You can increase the heartbeat by
The following issues are what I was referring to and require some more time to add:
I just have to these primarily for the next release. The next release should be on Dec 4.
I see, let me increase the interval and I will close if all my experiments succeed. It may take a few days, though.
Cool, let me know how it goes 👍
This worked!
I do not know whether this is the expected sequence here. The task manager process gets terminated after it has submitted all the tasks of stage 0 from a set of pipelines.
I am seeing the following ERROR from EnTK:
My stack includes EnTK 0.7.8.