Feature/deadlock detection #299
base: develop
Conversation
Improve documentation of the DeadlockDetector and add unit tests
Move the variable to Communicator to simplify some logic
Also pass the manager as a direct argument to the Communicator
Good start! I do have an architectural issue and an algorithmic one, and some small comments.
Reduces the complexity of the Communicator
This reverts commit a727b1c.
Keep on `poll()`ing until the message to receive is available, and check with the manager whether we are deadlocked after each timeout. This approach also enables deadlock detection for runs where the instances are not started by `muscle_manager`.
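To make the approach concrete, here is a minimal sketch of such a receive loop. The manager calls (`waiting_for_receive()`, `waiting_for_receive_done()`, `is_deadlocked()`) are hypothetical stand-ins for the MMP requests used in this PR, not the actual MMPClient API:

```cpp
#include <cerrno>
#include <poll.h>
#include <stdexcept>

struct ManagerClient {                      // hypothetical stand-in for MMPClient
    void waiting_for_receive() {}           // tell the manager we are blocked on a receive
    void waiting_for_receive_done() {}      // tell the manager the receive completed
    bool is_deadlocked() { return false; }  // ask the manager whether a deadlock was detected
};

// Poll until a message is available, checking in with the manager on every timeout.
void wait_until_receivable(int socket_fd, ManagerClient & manager, int timeout_ms) {
    pollfd fds = {socket_fd, POLLIN, 0};
    bool reported_waiting = false;

    while (true) {
        int ready = poll(&fds, 1, timeout_ms);
        if (ready > 0)
            break;                          // a message is available: go receive it
        if (ready < 0 && errno != EINTR)
            throw std::runtime_error("Unexpected error during poll()");
        if (ready == 0) {                   // timed out: check in with the manager
            if (!reported_waiting) {
                manager.waiting_for_receive();
                reported_waiting = true;
            }
            if (manager.is_deadlocked())
                throw std::runtime_error("Deadlock detected, aborting");
        }
    }
    if (reported_waiting)
        manager.waiting_for_receive_done();
}
```

Because the check happens on the instance side after each timeout, it works even when the instances can reach the manager but were not launched by it.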
- Remove the ability to shut down the simulation (components will now crash when deadlocked)
- Process messages immediately (in the MMP server thread) instead of creating a custom DeadlockDetector thread
And fix bugs with the C++ implementation.
Okay, nice work again! I've left some comments, but it's looking good overall.
libmuscle/python/libmuscle/manager/test/test_deadlock_detector.py (comment thread outdated and resolved)
public:
    ReceiveTimeoutHandler(
            MMPClient & manager,
            std::string const & peer_instance,
Could this be a `ymmsl::Reference`? That would save converting to a string when calling this in Communicator, and instances are referred to by a Reference everywhere. Of course there'd still be a conversion to a string in the MMPClient, but those are all over the place there, so that makes sense.
Sure thing, I've also updated it for the Python library to keep that consistent.
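For reference, the adjusted declaration would then look roughly like this; only the parameters visible in the diff above are shown, and the rest of the signature is omitted:

```cpp
#include <ymmsl/ymmsl.hpp>   // for ymmsl::Reference

class MMPClient;

class ReceiveTimeoutHandler {
    public:
        ReceiveTimeoutHandler(
                MMPClient & manager,
                ymmsl::Reference const & peer_instance   // was: std::string const &
                /* remaining parameters omitted */);
};
```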
    throw std::runtime_error("Unexpected error during poll(): " + std::to_string(errno));

    // poll() was interrupted by a signal: retry with re-calculated timeout
} while (1);
`while (true)`? The `1` reads like C to me, because C doesn't have a boolean, or at least it didn't, and the one it now has is ugly.
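For illustration, a compilable sketch of that retry loop with `while (true)` and a recomputed remaining timeout; the deadline bookkeeping here is made up for the example, not the PR's exact code:

```cpp
#include <algorithm>
#include <cerrno>
#include <chrono>
#include <poll.h>
#include <stdexcept>
#include <string>

// Wait until fd is readable or the timeout expires; returns poll()'s result
// (0 on timeout, 1 if the fd became readable).
int wait_readable(int fd, int timeout_ms) {
    using namespace std::chrono;
    auto deadline = steady_clock::now() + milliseconds(timeout_ms);
    pollfd fds = {fd, POLLIN, 0};
    do {
        auto remaining = duration_cast<milliseconds>(deadline - steady_clock::now());
        int ready = poll(&fds, 1, std::max<int>(0, static_cast<int>(remaining.count())));
        if (ready >= 0)
            return ready;
        if (errno != EINTR)
            throw std::runtime_error(
                    "Unexpected error during poll(): " + std::to_string(errno));
        // poll() was interrupted by a signal: retry with re-calculated timeout
    } while (true);
}
```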
double ReceiveTimeoutHandler::get_timeout()
{
-   return timeout_;
+   // Increase timeout by a factor 1.5 with every timeout we hit:
+   return timeout_ * std::pow(1.5, (double)num_timeout_);
`static_cast<double>` is the C++ way to do `(double)` here, but can't it just be omitted? Or is that a narrowing conversion?
`(double)` can be omitted 👍
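A small self-contained example of the resulting backoff with the cast dropped; `std::pow` has overloads that promote integral arguments to `double`, so nothing narrows here. The base timeout and class name are made up for the demo:

```cpp
#include <cmath>
#include <cstdio>

// Simplified stand-in for the handler's timeout bookkeeping.
class TimeoutBackoff {
    public:
        double get_timeout() const {
            // Increase timeout by a factor 1.5 with every timeout we hit:
            return timeout_ * std::pow(1.5, num_timeout_);
        }
        void on_timeout() { ++num_timeout_; }
    private:
        double timeout_ = 10.0;     // hypothetical base timeout in seconds
        int num_timeout_ = 0;
};

int main() {
    TimeoutBackoff backoff;
    for (int i = 0; i < 4; ++i) {
        std::printf("timeout #%d: %.2f s\n", i, backoff.get_timeout());  // 10.00, 15.00, 22.50, 33.75
        backoff.on_timeout();
    }
}
```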
    WAITING_FOR_RECEIVE_DONE MMP messages, which are submitted by the MMPServer.

    When a deadlock is detected, the cycle of instances that is waiting on each other is
-   logged with FATAL severity. If this deadlock does not get resoled in
+   logged with FATAL severity. If this deadlock does not get resolved in
Actually, it's only logged if it isn't resolved, right? That's how it should be, so the comment should be clarified I think.
This is currently logged as soon as the manager identifies a cycle of waiting instances. I think this will never be resolved, but I'm not sure.
I can also update the logic to only print the deadlock cycle when any of the deadlocked instances calls `is_deadlocked` and starts shutting down. At that point we're sure that there was a deadlock and that the simulation will shut down.
I'd prefer that. Either we're convinced that the grace period is unnecessary and that there's an actual deadlock as soon as a cycle is detected, in which case we should remove the grace period; or we decide that we need it, but then the warning should be consistent with it. Otherwise you could get a false positive in the log, and that could confuse people who are doing everything right.
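For context on what the logged cycle is, here is an illustrative sketch of following the wait-for chain until it closes on itself. The real detector is the Python DeadlockDetector in the manager; the data structure and the assumption that a blocked receive waits on exactly one peer are simplifications for this example:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Maps each instance that reported it is waiting on a receive to the peer it waits for.
using WaitingOn = std::unordered_map<std::string, std::string>;

// Follow the wait-for chain from `start`; if it loops back on itself, return the
// instances forming the cycle, otherwise return an empty vector (no deadlock found).
std::vector<std::string> find_deadlock_cycle(
        WaitingOn const & waiting, std::string const & start) {
    std::vector<std::string> chain;
    std::string current = start;
    while (true) {
        auto it = waiting.find(current);
        if (it == waiting.end())
            return {};      // chain ends at an instance that is not waiting
        for (std::size_t i = 0; i < chain.size(); ++i)
            if (chain[i] == current)
                return std::vector<std::string>(chain.begin() + i, chain.end());   // cycle
        chain.push_back(current);
        current = it->second;
    }
}
```

Whether the cycle is logged as soon as it is found or only once an instance confirms it via `is_deadlocked` is exactly the grace-period question discussed above.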
Add a deadlock detection mechanism to MUSCLE3.
New functionality:
- `muscle_deadlock_receive_timeout` seconds (configurable in muscle settings)

TODO