Address thread starvation bug on joint trajectory server preemption #150
Done with requested fixes. @hello-binit I don't have access to a Stretch right now, so could you just run this code on your Stretch and ensure it still works with the […] Also, I saw that you've been adding some of these edge cases to the regression tests. What do you think of adding the above test case to the regression tests? I can create an issue for it.
I tested this on […]
Verified on 3004.
Description
There is a thread starvation race condition in `stretch_driver`/`joint_trajectory_server`. Consider the following scenario, where steps 2 and 3 happen in quick succession:

1. `stretch_driver` receives a joint trajectory server goal.
2. `stretch_driver` receives another joint trajectory server goal.
3. `stretch_driver` receives a change mode request.

The expected behavior here is that the second goal will preempt the first, then the second goal will execute, and then the change mode request will be processed.
In actuality, the following can happen. Note that the executor is only allocated two threads to process all callbacks.

1. The first goal's execution acquires `robot_mode_rwlock`, and occupies all the resources of one thread (it never hands control back to the executor, because it uses `time.sleep()` instead of `rate.sleep()`). Thus, the executor is only left with one thread until the first goal's execution completes.
2. The change mode request waits for `robot_mode_rwlock` to be released, which only happens after execution of the first goal finishes. This uses all the resources of the second thread. Thus, there are no threads left to process the new goal.
3. Because `latest_goal_id` is only incremented after the goal starts executing, and goal execution uses the `latest_goal_id` to determine whether to preempt, the first goal never knows whether to preempt.

This PR implements multiple fixes to address the above bug and clean up the joint trajectory server. Each fix directly corresponds to one of the commits.
- […] `stretch_driver`. That way, if their application commands short trajectories, they can reduce the timeout.
- […] `stretch_driver` node, so it doesn't need to be a node. This may reduce unnecessary executor computation.
- […] `server.action_server_rate`, which invokes callbacks at a rate of 10 Hz and unnecessarily takes up executor compute.
- […] `latest_goal_id` when the goal is accepted, as opposed to when the goal starts executing, so that previous goals know to cancel as soon as a next goal is accepted.
- […] `MutuallyExclusiveCallbackGroup` has all services. That needs 1 thread.
- […] `cmd_vel` and `gamepad_joy`, have a dedicated `MutuallyExclusiveCallbackGroup`. That needs 1 thread.
- […] `command_mobile_base_velocity_and_publish_state` at a rate of 30 Hz, which is what moves the base (and stops it if the motion command is stale). So that timer should have 1 thread available to handle its callbacks.
- […] `ReentrantCallbackGroup` for all its callbacks (goal request, cancellation request, execution). We should account for at least 2 actions being executed in parallel (e.g., one action starts executing while the next one is wrapping up and preempting; this happens quite often). Thus, we should have 2 threads for this.

Testing
The below script is a minimal example of the problem. Testing was done as detailed below. I did the testing with the tablet on the end-effector, which is helpful because of #141 (where trajectories go till timeout when the wrist has additional force on it).
- […] `humble`, copy the below script into a ROS package and add it to the `CMakeLists.txt`/`setup.py` file.
- `ros2 launch stretch_core stretch_driver.launch.py`
- `ros2 run <package_with_the_text_script> stretch_driver_thread_starvation_test.py`
- `ros2 launch stretch_core stretch_driver.launch.py`
- `ros2 run <package_with_the_text_script> stretch_driver_thread_starvation_test.py`

`stretch_driver_thread_starvation_test.py`
(To more reliably re-create the actual race condition, I swap the nav mode callback with the second goal callback, compared with what I wrote above. However, the reality is that even if we follow the steps I wrote above from the web app, it is possible for the requests to get swapped over roslibjs and/or rosbridge.)
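For readers without a robot, the starvation sequence from the Description can be approximated without ROS. The following is a minimal sketch, not the `stretch_driver` code: a `ThreadPoolExecutor` stands in for the multi-threaded executor, a plain `threading.Lock` stands in for `robot_mode_rwlock`, and `run_scenario`, `accept_goal`, `execute_goal`, and `change_mode` are hypothetical names invented for illustration. The second goal is only accepted (never executed) because the sketch only checks whether the first goal notices it.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_scenario(num_threads: int, goal_timeout: float = 0.5) -> str:
    """Simulate goal 1, then a mode change, then goal 2 on a fixed-size pool.

    Returns "preempted" or "timed_out" for the first goal.
    """
    mode_lock = threading.Lock()       # assumption: stand-in for robot_mode_rwlock
    state_lock = threading.Lock()
    state = {"latest_goal_id": 0}
    goal_1_running = threading.Event()
    outcome = {}

    def accept_goal() -> int:
        # The ID is bumped at acceptance time, so a running goal can see it.
        with state_lock:
            state["latest_goal_id"] += 1
            return state["latest_goal_id"]

    def execute_goal(goal_id: int) -> None:
        start = time.monotonic()
        with mode_lock:
            goal_1_running.set()
            # A blocking sleep loop holds this worker for the whole execution.
            while time.monotonic() - start < goal_timeout:
                with state_lock:
                    if state["latest_goal_id"] != goal_id:
                        outcome[goal_id] = "preempted"
                        return
                time.sleep(0.01)
            outcome[goal_id] = "timed_out"

    def change_mode() -> None:
        # Blocks a worker until the executing goal releases the lock.
        with mode_lock:
            pass

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        first_id = accept_goal()             # goal 1 is accepted up front
        pool.submit(execute_goal, first_id)  # occupies worker 1
        goal_1_running.wait()
        pool.submit(change_mode)             # occupies worker 2, waiting on the lock
        pool.submit(accept_goal)             # goal 2: needs a free worker to be seen
    return outcome[first_id]


if __name__ == "__main__":
    for threads in (2, 3):
        print(threads, "threads ->", run_scenario(threads))
```

Under these assumptions, a two-worker pool leaves the second goal's acceptance queued until the first goal times out, while a third worker lets the first goal notice the new `latest_goal_id` and preempt. The real fix in this PR also involves callback groups and the rwlock semantics, which this sketch does not model.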