ROS2 / Navigation Server Heartbeats #1754
Comments
I would like to add one more thing. Suppose you hit Ctrl+C while the lifecycle manager is running: the lifecycle manager should automatically call tl;dr we should add
I'm not sure we can do that with the current capabilities of ROS launch. Keep in mind that the launch system is what sees the Ctrl+C, not the servers themselves. I think it would only be possible if the lifecycle manager were part of the launch system itself. It would then need to use the fact that we have access to Python to override SIGINT and call a service asking the lifecycle manager to exit. This topic has been brought up in rclcpp and other places as well. There should be a solution, but right now there's nothing "ROS-y" out of the box; we'd have to implement it ourselves as I describe above. Unless someone wants to get dirty in ros2 launch, in which case we could figure out how to handle signals in the backend for lifecycle nodes and have it do this. Really good point. We just need to figure out the best way to get the signal to the lifecycle manager. The "navigation2" way is simpler, but this is a larger issue in ROS2 lifecycle that would be great to have a general solution to.
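The SIGINT override described above might look roughly like the following sketch. It is plain Python with no ROS dependency. `request_shutdown` is a hypothetical stand-in for whatever call would ask the lifecycle manager to exit (for example a service client); none of these function names come from ros2 launch or navigation2.

```python
import signal


def make_sigint_handler(request_shutdown):
    """Build a SIGINT handler that asks for a clean shutdown first.

    `request_shutdown` is a hypothetical callable standing in for the
    service call to the lifecycle manager; it is not a real nav2 API.
    """
    def handler(signum, frame):
        try:
            request_shutdown()
        finally:
            # Fall back to the usual Ctrl+C behaviour once the request is sent.
            raise KeyboardInterrupt
    return handler


def install_sigint_handler(request_shutdown):
    """Install the handler and return the previous one.

    signal.signal() only works from the main thread of the process.
    """
    previous = signal.getsignal(signal.SIGINT)
    signal.signal(signal.SIGINT, make_sigint_handler(request_shutdown))
    return previous
```

Whether the launch process is the right place to install such a handler is exactly the open question in this comment; the sketch only shows the mechanism.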
Ticket related to what @shivaang12 mentioned: ros2/rclcpp#997
I'd be curious whether the lifecycle manager can already do the heartbeat checks (e.g. the lifecycle API has a "you still OK?" ping, or a callback you can register that triggers if a node goes down). If so, that should be really easy to add: if any die, bring everything into the finalized state in order, ignoring the crashed server(s).
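As a rough illustration of "bring it all into the finalized state in order, ignoring the crashed server(s)", here is a minimal sketch. It assumes (this is not taken from the lifecycle manager source) that nodes come down in the reverse of their bring-up order and that each transition is applied to every live node before moving to the next one. `plan_shutdown` and the node names in the usage note are illustrative only.

```python
def plan_shutdown(managed_nodes, crashed):
    """Return the (node, transition) steps that bring the stack down.

    Assumptions for this sketch: `managed_nodes` is in bring-up order,
    nodes are taken down in reverse, and live nodes go through
    deactivate -> cleanup -> shutdown to reach the finalized state.
    """
    transitions = ("deactivate", "cleanup", "shutdown")
    steps = []
    for transition in transitions:
        for node in reversed(managed_nodes):
            if node in crashed:
                # A crashed server cannot answer transition requests.
                continue
            steps.append((node, transition))
    return steps
```

For example, `plan_shutdown(["controller_server", "planner_server", "bt_navigator"], {"planner_server"})` yields six steps, none of which touch `planner_server`.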
What I'm thinking could be done is polling the servers with https://github.com/ros-planning/navigation2/blob/master/nav2_util/include/nav2_util/lifecycle_service_client.hpp#L53. This is something I'm thinking about working on myself next week. It's been too long since I pushed actual code here 🤣 and I'd like to take a run at cleaning up the lifecycle manager so we can make it a general package and move it out of the repo.
I would say that signals are forwarded to the launched sub-processes, at least for actions(?). But I have not tested or proven it. Also worth noting: Windows does not support SIGINT, only SIGKILL, for "now" (since 2015). (This has been in the ROS2 launch capabilities since the very beginning: ros2/launch#4)
This is a bit of an off-topic discussion, but good to know. This ticket is more about how we make sure everyone's alive, for safety. Transitioning to shutdown on Ctrl+C is a separate problem.
https://github.com/ros/bond_core/blob/ros2/ros2_migration_readme.md It looks like bond was ported to ROS2, so we can set up bond for all of these servers to tell the lifecycle manager if one fails. I think this makes more sense than polling each lifecycle node for its state. So we should add a bond to all the lifecycle nodes (in the lifecycle base class?), and then the lifecycle manager should register when it stops or when the servers go down.
Fun: in testing that my bring-up and bring-down work through bond, I've uncovered several simple action server and planner server bugs that I need to fix too :-) It turns out that if you activate -> deactivate -> activate again, the planner server only sees the last goal, or a 0,0 goal if there is no last goal, even if the BT navigator gets a real one.
Merge incoming
As discussed in #1745, we really have no way of dealing with a random server crashing mid-run. If a user-provided controller plugin crashes the server, the BT node calling it will spin indefinitely because it can't know it failed.
For servers with feedback, we can track the feedback and if it stops coming in, then we know there's a problem, but not all actions (or services for that matter) provide feedback.
We should think about a general way in ROS2 that we can have heartbeats of all the servers and report or transition into inactive if one that is critical is unresponsive for some period of time. In ROS1, we could do this with Bond, but that wasn't translated over to ROS2. In ROS2, we have lifecycle nodes, so potentially we can use the lifecycle API to ping nodes to see if they're active.
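The heartbeat idea (and the feedback-tracking variant above it) comes down to a per-server timeout. The following is a minimal sketch in plain Python with no ROS dependency; `HeartbeatMonitor`, `beat`, and `unresponsive` are hypothetical names, and how a heartbeat would actually arrive (lifecycle state query, bond, action feedback) is left open.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per server and report the silent ones."""

    def __init__(self, servers, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        now = clock()
        # Every server starts with a fresh heartbeat at construction time.
        self._last_seen = {name: now for name in servers}

    def beat(self, name):
        """Record a heartbeat (or any feedback message) from `name`."""
        self._last_seen[name] = self._clock()

    def unresponsive(self):
        """Return the servers that have been silent longer than the timeout."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self._timeout_s
        )
```

A manager could call `unresponsive()` on a timer and, if a critical server shows up in the result, report it or transition the remaining nodes down.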
Because we have the lifecycle manager, this may be a suitable job for it: ping its nodes and, for safety, transition all of them down if one stops working. I think once this is complete, we should look at moving the lifecycle manager to its own repository with its nav2_utils and removing the Navigation2-specific code (there's only a little). At that point, it's a stand-alone lifecycle manager that anyone can use for their lifecycle nodes and that also does heartbeat checking.