-
Notifications
You must be signed in to change notification settings - Fork 425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Starvation in MultiThreadedExecutor #2645
Comments
Can you re-profile with the callback groups set to |
Hi @mjcarroll What you asked for: MultiThreadedExecutor with Callback Group ReentrantAs you might expect, there is no starvation, but the user must manage any race conditions. Bonus!! EventsExecutorThis is correct, as it is not multithreaded, but it confirms that there is no starvation. Discussion on Starvation on a reasonable setupI understand that it may be an antipattern, but I think it is not unreasonable for a user to want to use a MultiThreadedExecutor with MutuallyExclusive when he has several nodes, to parallelize both. In this case, there is also a starvation of one of the two callbacks of each node (
|
In this case, each node (and all corresponding internal entities) would end up with a separate callback group (by default, at least). I believe this also makes a small difference in your first example and second example. In the first, it was multiple subscriptions on the same node (and callback group). In the second, you created two separate nodes, each having their own callback group, so re-entrant vs mutex has less of an impact. There are two goals. By default, users shouldn't have to think about the threading safety, so all callbacks will be fired mutually exclusive. |
I understand that you don't consider this a problem if the user knows it can suffer starvation and actively do something to avoid it. It would be good to write a warning anywhere in the documentation to warn the user to take the appropriate actions. Anyway, I'd like to waste a few cycles of my boring youth exploring some alternatives :P Thanks for your time @mjcarroll |
I do think it's probably surprising and warrants a notice or warning in the documentation. Everything comes with a trade-off on effort vs reward, I suppose. The best case performance should come when each possible entity is either re-entrant or part of it's own callback group. The worst case performance should be when every entity is in a single mutually exclusive callback group. Also, since the executor has a fixed priority order where timers are executed first, high rate timers can cause the starvation issue that you are discussing as well. Since you are also in the process of evaluating, executors, you may also want to take a look at @jmachowinski's multithreaded events executor: https://discourse.ros.org/t/the-ros-2-c-executors/38296/21 |
IMO, this result indicates the inefficiency
This result from 1st example.
with after all, i think this issue stands out as inefficiency of CC: @alsora @jmachowinski |
Yes, there is a cost to the bookkeeping that the multithreaded executor requires. Perhaps it can continue to be made better (that overhead can be reduced), but there is always going to be some additional cost over the single-threaded implementation. We should generally be advising people to use the multithreaded executor only when it makes sense for their application. There isn't a general "one thread good, more threads better" recommendation in this case. |
I believe that we determined that even the single threaded executor can be starved with high enough frequency timer updates #392 |
I think that with waitset executors, the implicit ordering in which entities is checked (timers -> subscriptions -> services -> clients -> waitables from https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/executor.cpp#L754) means that every type of entity can starve the lower-priority ones. Events-based executors address the problem by executing entities in the order they are triggered.
Yes, when possible single-threaded executors should currently be preferred. |
For me, this is total expected behavior. This is the 'if a timer takes longer than it's period, only the timer is executed' problem, just a bit different. Note that in the example shown, the subscriptions already take all available processing time. Therefore the timer will stall one callback because of the priority ordering. |
The effect we are seeing here is the rebuild / masking of the waitsets.
Multi Threaded Executor:
This all comes down to the fact, that the list of ready entities is 'forgotten' as soon as a new waitset returns from wait. And exactly the reason why I programmed the CallbackGroupExecutor... |
@jmachowinski thanks for the explanation. and yes, this is really expensive.
with understanding all the facts, what i mean is that this is really hard to see for user... i think that is the origin of this issue.
probably we should add documentation https://docs.ros.org/en/rolling/Concepts/Intermediate/About-Executors.html about this? in general, |
IMHO, this is not a solid generalization because we could use a MultiThreadedExecutor with a node with two callback groups (all MutuallyExclusive) for an example like the following figures, in which callbacks do not execute concurrently, but the timer callback does. I think it could be enough to warn that with a MultiThreadedExecutor, it is not guaranteed that all callbacks in the same callback group will execute when there are always messages to process in more than one subscription. |
Hi, I have found a possible fix to this problem. The idea is to isolate from the order in which the subscriptions were created. The method I have implemented a draft of the solution here: There is probably a better implementation, but the idea seems to work. Before this changeAfter this change |
While working for this special problem, your fix would not work for the timer starvation problem. Try this executor : https://github.com/cellumation/cm_executors |
We had previously discussed having the execution order of ready entities be managed by some sort of policy. That way users could specify which items to prioritize. I think that the idea wasn't followed through on because it's a difficult space to generalize for all applications. Depending on the level of control and the layout of the application, I would recommend trying the cellumation executor (or at least coming to hear about it at ROSCon). Alternatively, if you are looking for deep control of the scheduling semnatics, you could also directly use the |
Hello, I had a pull request for this issue last year around this time (#2360), and have since proposed an executor design that prevents starvation in the multi-threaded executor: You can read about it here https://ieeexplore.ieee.org/document/9622336 https://daes.cs.tu-dortmund.de/storages/daes-cs/r/publications/teper2024emsoft_preprint.pdf I fixed this issue using an additional mutex lock, which prevents callbacks from being removed from blocked callback groups. Although this may introduce large latencies for callbacks of such groups, it prevents callbacks from being completely starved, a property that I think is very useful to natively support the execution of "most" systems. If needed, I can modify the current executor design and implement the changes I proposed in the paper. This would guarantee that no callback is starved due to mutually exclusive callback groups. |
i do not think that #2360 can fix the problem here. i may be mistaken but can you elaborate a bit on #2360 i left the comment there. |
My previous pull request does not include the fix that is proposed in the work that I published, as I completely reworked the mechanism to prevent starvation. However, when I opened that pull request, I was looking if this issue is worth fixing. I am planning to either update my previous pull request or create a pull request with my newest executor version. Which option do you prefer? |
New PR |
Bug report
Required Info:
Steps to reproduce issue
Use this program. The commented lines are used by me to trace the problem, using https://github.com/fmrico/yaets
Expected behavior
The behavior expected is more or less what happens with SingleThreadedExecution:
Actual behavior
Callback for /topic2 (cb2) suffers starvation because most of the time timer and cb1 has event and messages to process
The text was updated successfully, but these errors were encountered: