Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task Manager process gets terminated during runtime #285

Closed
iparask opened this issue Nov 19, 2018 · 8 comments
Closed

Task Manager process gets terminated during runtime #285

iparask opened this issue Nov 19, 2018 · 8 comments

Comments

@iparask
Copy link
Contributor

iparask commented Nov 19, 2018

I do not know that is the expected sequence here. The task manager rocess gets terminated after it submitted all the tasks of a stage 0 from a set of pipelines.

I am seeing the following ERROR from EnTK:

2018-11-19 08:52:42,820: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2018-11-19 08:52:43,004: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Task Manager process started
Process task-manager:
Traceback (most recent call last):
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 270, in _tmgr
    raise EnTKError(ex)
EnTKError: E11000 duplicate key error collection: entk.bridges.4096.0000 index: _id_ dup key: { : "umgr.0000" }
2018-11-19 08:53:12,860: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating heartbeat thread
2018-11-19 08:53:12,862: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Hearbeat thread terminated
2018-11-19 08:53:12,904: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating tmgr process
2018-11-19 08:53:12,905: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Task manager process closed
2018-11-19 08:53:12,905: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Restarting task manager process
2018-11-19 08:53:12,907: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting task manager process
2018-11-19 08:53:12,921: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Restarting heartbeat thread
2018-11-19 08:53:12,922: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting heartbeat thread
2018-11-19 08:53:12,949: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2018-11-19 08:53:13,005: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Task Manager process started
Process task-manager:
Traceback (most recent call last):
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 270, in _tmgr
    raise EnTKError(ex)
EnTKError: E11000 duplicate key error collection: entk.bridges.4096.0000 index: _id_ dup key: { : "umgr.0000" }
2018-11-19 08:53:42,955: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : ERROR   : Heartbeat failed with error: (404, "NOT_FOUND - no queue 'bridges.4096.0000-hb-response' in vhost '/'")
Traceback (most recent call last):
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 152, in _heartbeat
    method_frame, props, body = mq_channel.basic_get(queue=self._hb_response_q)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2077, in basic_get
    self._basic_getempty_result.is_ready)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
    *waiters)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 458, in _flush_output
    self._impl.ioloop.poll()
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 495, in poll
    self._poller.poll()
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 1114, in poll
    self._dispatch_fd_events(fd_event_map)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 831, in _dispatch_fd_events
    handler(fileno, events)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 410, in _handle_events
    self._handle_read()
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 464, in _handle_read
    self._on_data_available(data)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/connection.py", line 2021, in _on_data_available
    self._process_frame(frame_value)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/connection.py", line 2142, in _process_frame
    if self._process_callbacks(frame_value):
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/connection.py", line 2123, in _process_callbacks
    frame_value)  # Args
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/callback.py", line 60, in wrapper
    return function(*tuple(args), **kwargs)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/callback.py", line 92, in wrapper
    return function(*args, **kwargs)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/callback.py", line 236, in process
    callback(*args, **keywords)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1358, in _on_channel_closed
    method.reply_text)
ChannelClosed: (404, "NOT_FOUND - no queue 'bridges.4096.0000-hb-response' in vhost '/'")
Exception in thread heartbeat:
Traceback (most recent call last):
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 152, in _heartbeat
    method_frame, props, body = mq_channel.basic_get(queue=self._hb_response_q)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 2077, in basic_get
    self._basic_getempty_result.is_ready)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
    *waiters)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 458, in _flush_output
    self._impl.ioloop.poll()
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 495, in poll
    self._poller.poll()
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 1114, in poll
    self._dispatch_fd_events(fd_event_map)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/select_connection.py", line 831, in _dispatch_fd_events
    handler(fileno, events)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 410, in _handle_events
    self._handle_read()
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/base_connection.py", line 464, in _handle_read
    self._on_data_available(data)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/connection.py", line 2021, in _on_data_available
    self._process_frame(frame_value)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/connection.py", line 2142, in _process_frame
    if self._process_callbacks(frame_value):
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/connection.py", line 2123, in _process_callbacks
    frame_value)  # Args
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/callback.py", line 60, in wrapper
    return function(*tuple(args), **kwargs)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/callback.py", line 92, in wrapper
    return function(*args, **kwargs)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/callback.py", line 236, in process
    callback(*args, **keywords)
  File "/home/iparask/miniconda2/envs/rp_conda_forge/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 1358, in _on_channel_closed
    method.reply_text)
ChannelClosed: (404, "NOT_FOUND - no queue 'bridges.4096.0000-hb-response' in vhost '/'")

2018-11-19 08:53:42,978: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating heartbeat thread
2018-11-19 08:53:42,978: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Hearbeat thread terminated
2018-11-19 08:53:43,002: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating tmgr process

My stack includes EnTK 0.7.8.

@vivek-bala
Copy link
Contributor

Hey Giannis, this is a pending issue to fix in EnTK. Once a pilot manager or unit manager fails, it is created automatically by invoking a new process. Since it is a new process the uids (which are unique within a process) start back from 0000. When the pmgr/umgr is pushed to mongodb, there is a conflict and you see the error:

EnTKError: E11000 duplicate key error collection: entk.bridges.4096.0000 index: _id_ dup key: { : "umgr.0000" }

I have had a discussion with Andre on how to tackle this. I'll be picking this up in the next release.

However, this error is probably triggered by a previous error. Can you share the entire verbose EnTK log? It might be useful to attach the RP logs as well if you have them.

@iparask
Copy link
Contributor Author

iparask commented Nov 20, 2018

Hey, I am not sure that the Unit Manager failed. I did not see anything in the logs that indicated that the unit manager failed. Is there a way for me to verify it?

Also, is it possible that the process monitoring the unit manager fails?

@vivek-bala
Copy link
Contributor

Yes, that is possible as well (but unlikely unless you have >2048 concurrent tasks). I think you have added only a part of the verbose log starting from 8:52:42. There should be more messages before that which will let us determine the source of the error. If you have access to them, could you add those messages as well.

@iparask
Copy link
Contributor Author

iparask commented Nov 20, 2018

So I got 4096 pipelines :).

I cannot find an ERROR from the Unit Manager. In addition, I see that the Task Manager process closes and tries to start again. If I understand correctly, every time the task manager process restarts it tries to create a unit manager, right? If the Unit Manager exists and the task manager tries to create a new one it will fail.
All my logs:
bridges_4096_1.zip

entk.tar.gz

Also, when is the release that fixes this scheduled?

@vivek-bala
Copy link
Contributor

Ah okay. In that case, you are right that the heartbeat timesout (#270). You can increase the heartbeat by export ENTK_HB_INTERVAL=200 (default is 30).

The following issues are what I was referring to and require some more time to add:

I just have to these primarily for the next release. Next release should be on Dec 4.

@iparask
Copy link
Contributor Author

iparask commented Nov 20, 2018

I see, let me increase the interval and I will close if all my experiments succeed. It may take a few days, though

@vivek-bala
Copy link
Contributor

Cool, let me know how it goes 👍

@iparask
Copy link
Contributor Author

iparask commented Dec 6, 2018

This worked!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants