ROS2 / Navigation Server Heartbeats #1754

SteveMacenski · 2020-05-20T00:17:21Z

As discussed in #1745, we really have no way of dealing with a random server crashing mid-run. If a user-provided controller plugin crashes the server, the BT node calling it will spin indefinitely because it can't know it failed.

For servers with feedback, we can track the feedback and if it stops coming in, then we know there's a problem, but not all actions (or services for that matter) provide feedback.

We should think about a general way in ROS2 that we can have heartbeats of all the servers and report or transition into inactive if one that is critical is unresponsive for some period of time. In ROS1, we could do this with Bond, but that wasn't translated over to ROS2. In ROS2, we have lifecycle nodes, so potentially we can use the lifecycle API to ping nodes to see if they're active.

Because we have the lifecycle manager, this may be a suitable job for it: ping its nodes and, for safety, transition all of them down if one stops responding. I think once this is complete, we should look at moving Lifecycle Manager to its own repository along with its nav2_utils and removing Navigation2 specific code (there's only a little). At that point, it's a stand-alone lifecycle manager that anyone can use for their lifecycle nodes and that also does heartbeat checking.

shivaang12 · 2020-05-21T23:55:45Z

I would like to add one more thing. Suppose you press Ctrl+C while the lifecycle manager is running: the lifecycle manager should automatically call the on_shutdown method on all the registered lifecycle nodes and only then exit the process. This gives users a place to finish any cleanup tasks before the whole thing gets destroyed.

tl;dr we should add a Ctrl+C handler that calls on_shutdown before exiting.

SteveMacenski · 2020-05-22T01:36:27Z

I'm not sure we can do that with the current capabilities of ros launch. Keep in mind the launch system is what sees the Ctrl+C, not the servers themselves. I think that would only be possible if the lifecycle manager was part of the launch system itself. It would then need to use the fact that we have access to Python to override SIGINT and call a service asking the lifecycle manager to exit.
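As a rough illustration of the SIGINT override idea, here is a minimal Python sketch. It only uses the standard `signal` module; `request_shutdown` is a hypothetical callable standing in for the service call to the lifecycle manager, and nothing here is an existing ros2 launch or Nav2 API.

```python
import signal


def make_sigint_handler(request_shutdown):
    """Build a SIGINT handler that asks the manager to shut down once.

    `request_shutdown` is a hypothetical callable standing in for a service
    call to the lifecycle manager; it is not an existing launch or Nav2 API.
    """
    state = {"called": False}

    def handler(signum, frame):
        if state["called"]:
            # Second Ctrl+C: stop waiting for a graceful exit.
            raise KeyboardInterrupt
        state["called"] = True
        request_shutdown()

    return handler


def install(request_shutdown):
    # signal.signal must be called from the main thread; it returns the
    # previous handler so the caller can restore it later.
    return signal.signal(signal.SIGINT, make_sigint_handler(request_shutdown))
```

The first Ctrl+C triggers the graceful path, and a second one falls back to the usual `KeyboardInterrupt`, so a hung shutdown request cannot trap the user.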

This is a topic that has been brought up in rclcpp and other places as well. There should be a solution to that, but for right now there's nothing "ROS-y" out of the box; we'd have to implement it ourselves like I describe above. Unless someone wants to get their hands dirty in ros2 launch, in which case we could figure out how to handle signals in the backend for lifecycle nodes and have it do this.

Really good point. We just need to figure out the best way to get the signal to the lifecycle manager. The "navigation2" way is simpler, but this is a larger issue in ROS2 Lifecycle that would be great to have a general solution to.

naiveHobo · 2020-05-22T19:30:33Z

Ticket related to what @shivaang12 mentioned: ros2/rclcpp#997

SteveMacenski · 2020-05-22T19:59:43Z

I'd be curious whether that's already possible for the lifecycle manager (e.g. the lifecycle API has a ping "you still ok?" or a callback you can register that will trigger if a node goes down) to do the heartbeat checks. If so, that should be really easy to add: if any die, bring everything into the finalized state in order, ignoring the crashed server(s).

SteveMacenski · 2020-05-22T21:05:35Z

What I'm thinking could be done is polling the servers with get_state() (https://github.com/ros-planning/navigation2/blob/master/nav2_util/include/nav2_util/lifecycle_service_client.hpp#L53). It's not really my idea of the most ideal solution in the world, but it's pretty decent. If we fail to get the state after 2 calls to a server, we say that there's a critical failure and then transition everything into the finalized state.
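A minimal Python sketch of that polling rule, under stated assumptions: `get_state(name)` and `transition_down(name)` are hypothetical callables standing in for the lifecycle service calls, a failed call is assumed to raise `TimeoutError`, and the node names in the usage note are only examples.

```python
def poll_once(nodes, get_state, failures, max_failures=2):
    """Poll every node once; return the names that failed `max_failures` times in a row.

    `get_state(name)` is a hypothetical stand-in for a lifecycle get_state
    service call; it is assumed to raise TimeoutError when no reply arrives.
    `failures` maps node name -> consecutive failed polls, updated in place.
    """
    dead = []
    for name in nodes:
        try:
            get_state(name)
            failures[name] = 0  # any reply resets the count
        except TimeoutError:
            failures[name] = failures.get(name, 0) + 1
            if failures[name] >= max_failures:
                dead.append(name)
    return dead


def shutdown_all(nodes, dead, transition_down):
    """Transition the surviving nodes down in the order given, skipping dead ones."""
    for name in nodes:
        if name not in dead:
            transition_down(name)
```

A manager would call `poll_once` from a timer with something like `["controller_server", "planner_server"]` and, as soon as it returns a non-empty list, call `shutdown_all` with that list so the crashed servers are ignored.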

This is something I'm thinking about working on myself next week. It's been too long since I pushed actual code here 🤣 and I'd like to take a run at cleaning up the lifecycle manager so we can make it a general package and push it out of the repo.

gramss · 2020-05-23T00:44:10Z

@SteveMacenski wrote:

"I'm not sure with the current capabilities of ros launch we can do that. Keep in mind the launch system is who sees the control C, not the servers themselves."

https://github.com/ros2/launch/blob/f243fc6b3bcd276c194151e6933cda90761b1944/launch/launch/actions/execute_process.py#L410

I would say that signals are forwarded to the launched sub-processes, at least for actions(?). But I have not tested or verified it. Also worth noting: Windows does not support SIGINT, only SIGKILL, for "now" (since 2015). This has been part of the launch capabilities of ROS2 since the very beginning: ros2/launch#4

SteveMacenski · 2020-05-23T01:33:00Z

This is a bit of an off-topic discussion, but good to know. This ticket is more about how we make sure everyone's alive, to be safe. Ctrl+C transitioning to shutdown is another problem.

SteveMacenski · 2020-06-30T00:59:18Z

https://github.com/ros/bond_core/blob/ros2/ros2_migration_readme.md

It looks like bond was ported to ROS2, so we can set up a bond for each of these servers to tell the lifecycle manager if one fails. I think this makes more sense than polling the lifecycle for its state. So we should add a bond to all the lifecycle nodes (add it to the lifecycle base class?), and then the lifecycle manager should register when it stops or when the servers go down.
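To make the bond idea concrete, here is a small Python sketch of a heartbeat monitor that a manager could run. It only mimics the concept behind bond (periodic heartbeats with a timeout and a "broken" callback); it does not use the bond_core API, and the class and method names are made up for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks one heartbeat per node and reports when any of them goes silent.

    Illustrative only: this imitates the bond concept and is not bond_core.
    """

    def __init__(self, timeout_s, on_broken, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._on_broken = on_broken  # called once per node that goes silent
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = {}
        self._broken = set()

    def register(self, name):
        self._last_beat[name] = self._clock()

    def beat(self, name):
        # A node calls this periodically to say it is still alive.
        if name in self._last_beat and name not in self._broken:
            self._last_beat[name] = self._clock()

    def check(self):
        # The manager calls this periodically; returns all broken node names.
        now = self._clock()
        for name, last in self._last_beat.items():
            if name not in self._broken and now - last > self._timeout_s:
                self._broken.add(name)
                self._on_broken(name)
        return sorted(self._broken)
```

Compared with polling for state, the servers push liveness themselves, so the manager only needs a cheap periodic `check()` and can bring everything down from the `on_broken` callback.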

SteveMacenski · 2020-07-28T22:21:19Z

Fun: in testing that my bring-up and bring-down work through bond, I've uncovered several simple action server and planner server bugs I need to fix too :-)

Turns out if you activate -> deactivate -> activate again, the planner server only sees the last goal (or a 0,0 goal if there was no last goal), even if the BT navigator gets a real one.

SteveMacenski · 2020-07-30T22:34:20Z

Merge incoming

SteveMacenski self-assigned this May 22, 2020

SteveMacenski added this to the Galactic Milestone milestone Jul 1, 2020

SteveMacenski mentioned this issue Jul 14, 2020 in "prototype of lifecycle bond system" #1869 (Closed)

SteveMacenski closed this as completed Jul 30, 2020
