
Feature/deadlock detection #299 (Open)

maarten-ic wants to merge 19 commits into develop.

Conversation

maarten-ic (Collaborator) commented Aug 13, 2024

Add a deadlock detection mechanism to MUSCLE3.

New functionality:

  • An instance sends a message to the manager when it has been waiting for longer than muscle_deadlock_receive_timeout seconds (configurable in the MUSCLE settings)
  • The manager reports deadlocks (a cycle of instances that are all waiting on each other) and shuts down the simulation (see the sketch after this list)
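
As a rough illustration of the cycle check described above (a hypothetical C++ sketch; the actual DeadlockDetector lives in the Python manager, and names such as `waiting_for_` and `detect_cycle` are made up here):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the manager-side bookkeeping: every instance that
// reports it is waiting is recorded with the peer it waits on; a deadlock is
// a chain of waiting instances that loops back on itself.
class DeadlockSketch {
    public:
        // An instance reported that it started waiting for a peer.
        void waiting(std::string const & instance, std::string const & peer) {
            waiting_for_[instance] = peer;
        }

        // The instance reported that the receive completed after all.
        void done(std::string const & instance) {
            waiting_for_.erase(instance);
        }

        // Follow the waiting-for chain from a given instance; if we revisit
        // an instance, everything from that point on forms a deadlock cycle.
        std::vector<std::string> detect_cycle(std::string const & start) const {
            std::vector<std::string> chain;
            std::string current = start;
            while (waiting_for_.count(current)) {
                auto seen = std::find(chain.begin(), chain.end(), current);
                if (seen != chain.end())
                    return std::vector<std::string>(seen, chain.end());
                chain.push_back(current);
                current = waiting_for_.at(current);
            }
            return {};      // chain ends at an instance that is not waiting
        }

    private:
        std::unordered_map<std::string, std::string> waiting_for_;
};
```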

TODO

  • Implement Instance logic in C++ as well
  • Add integration tests
  • Add documentation

Commits:
  • Improve documentation of the DeadlockDetector and add unit tests
  • Move the variable to Communicator to simplify some logic
  • Also pass the manager as a direct argument to the Communicator
maarten-ic requested a review from LourensVeen, August 13, 2024 14:52
LourensVeen (Contributor) left a comment:

Good start! I do have an architectural issue and an algorithmic one, and some small comments.

maarten-ic self-assigned this Aug 16, 2024
Commits:
  • Reduce the complexity of the Communicator
  • Keep on `poll()`ing until a message to receive is available, and check with the manager whether we are deadlocked after each timeout (sketched below). This approach also enables deadlock detection for runs where the instances are not started by `muscle_manager`.
  • Remove the ability to shut down the simulation (components will now crash when deadlocked)
  • Process messages immediately (in the MMP server thread) instead of creating a custom DeadlockDetector thread
  • Fix bugs in the C++ implementation
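
The receive path described in the `poll()` commit above could be pictured roughly like this (a hedged sketch; both helper functions are stubs standing in for the real transport poll and the MMP "am I deadlocked?" request, not the actual libmuscle API):

```cpp
#include <chrono>
#include <stdexcept>

// Stub standing in for the transport-level poll: returns true when a message
// is available, false when the timeout expired.
static bool poll_for_message(std::chrono::duration<double> /*timeout*/) {
    return true;    // stub: pretend a message arrived immediately
}

// Stub standing in for asking the manager whether this instance is part of a
// detected deadlock cycle.
static bool manager_says_deadlocked() {
    return false;   // stub: no deadlock detected
}

void receive_with_deadlock_check(std::chrono::duration<double> timeout) {
    // Poll until a message is available; after every timeout, check with the
    // manager and crash the component if it is part of a deadlock cycle.
    while (!poll_for_message(timeout)) {
        if (manager_says_deadlocked())
            throw std::runtime_error(
                    "Deadlock detected while waiting for a message");
    }
    // A message is available; the normal receive proceeds from here.
}
```
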
maarten-ic marked this pull request as ready for review August 20, 2024 12:24
maarten-ic requested a review from LourensVeen August 20, 2024 12:24
LourensVeen (Contributor) left a comment:

Okay, nice work again! I've left some comments, but it's looking good overall.

libmuscle/python/libmuscle/manager/deadlock_detector.py (outdated review thread, resolved)
    public:
        ReceiveTimeoutHandler(
                MMPClient & manager,
                std::string const & peer_instance,
LourensVeen (Contributor):

Could this be a ymmsl::Reference? That would save converting to a string when calling this in Communicator, and instances are referred to by a Reference everywhere. Of course there'd still be a conversion to a string in the MMPClient, but those are all over the place there so that makes sense.

maarten-ic (Collaborator, Author):

Sure thing, I've also updated it for the Python library to keep that consistent.
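
For clarity, the agreed change would make the constructor look roughly like this (a sketch with forward declarations standing in for the real headers; the remaining constructor parameters are elided):

```cpp
// Sketch only: forward declarations replace the real includes, and only the
// parameters discussed here are shown.
namespace ymmsl { class Reference; }
class MMPClient;

class ReceiveTimeoutHandler {
    public:
        ReceiveTimeoutHandler(
                MMPClient & manager,
                ymmsl::Reference const & peer_instance
                /* , further parameters elided */);
};
```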

        throw std::runtime_error(
                "Unexpected error during poll(): " + std::to_string(errno));

        // poll() was interrupted by a signal: retry with re-calculated timeout
    } while (1);
LourensVeen (Contributor):

`while (true)`? The `1` reads like C to me, because C doesn't have a boolean type, or at least it didn't, and the one it now has is ugly.
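
The loop shape being suggested might look something like this (a self-contained sketch; the deadline bookkeeping is illustrative and not the PR's actual code):

```cpp
#include <cerrno>
#include <chrono>
#include <poll.h>
#include <stdexcept>
#include <string>

// Sketch of the retry loop using while (true) rather than while (1): retry
// poll() after EINTR with a re-calculated timeout, report any other error.
int poll_with_deadline(struct pollfd * fds, nfds_t nfds,
                       std::chrono::steady_clock::time_point deadline) {
    while (true) {
        auto remaining = std::chrono::duration_cast<std::chrono::milliseconds>(
                deadline - std::chrono::steady_clock::now()).count();
        if (remaining < 0)
            remaining = 0;
        int ret = poll(fds, nfds, static_cast<int>(remaining));
        if (ret >= 0)
            return ret;     // > 0: data available, 0: timed out
        if (errno != EINTR)
            throw std::runtime_error(
                    "Unexpected error during poll(): " + std::to_string(errno));
        // poll() was interrupted by a signal: retry with re-calculated timeout
    }
}
```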

libmuscle/cpp/src/libmuscle/mcp/tcp_transport_client.cpp (outdated review thread, resolved)
libmuscle/cpp/src/libmuscle/receive_timeout_handler.cpp (outdated review thread, resolved)

 double ReceiveTimeoutHandler::get_timeout()
 {
-    return timeout_;
+    // Increase timeout by a factor 1.5 with every timeout we hit:
+    return timeout_ * std::pow(1.5, (double)num_timeout_);
LourensVeen (Contributor):

static_cast<double> is the C++ way to do (double) here, but can't it just be omitted? Or is that a narrowing conversion?

maarten-ic (Collaborator, Author):

(double) can be omitted 👍
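
So the resulting method might read as below (a minimal sketch; the surrounding class is abbreviated and the member values shown are illustrative):

```cpp
#include <cmath>

// Minimal sketch: std::pow promotes the integer exponent to double, so the
// explicit cast can be dropped without any narrowing.
class ReceiveTimeoutHandler {
    public:
        double get_timeout() const {
            // Increase timeout by a factor 1.5 with every timeout we hit:
            return timeout_ * std::pow(1.5, num_timeout_);
        }

    private:
        double timeout_ = 10.0;     // base receive timeout (illustrative value)
        int num_timeout_ = 0;       // number of timeouts hit so far
};
```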

WAITING_FOR_RECEIVE_DONE MMP messages, which are submitted by the MMPServer.

 When a deadlock is detected, the cycle of instances that is waiting on each other is
-logged with FATAL severity. If this deadlock does not get resoled in
+logged with FATAL severity. If this deadlock does not get resolved in
LourensVeen (Contributor):

Actually, it's only logged if it isn't resolved, right? That's how it should be, so the comment should be clarified I think.

maarten-ic (Collaborator, Author):

This is currently logged as soon as the manager identifies a cycle of waiting instances. I think this will never be resolved, but I'm not sure.

I can also update the logic to only print the deadlock cycle when any of the deadlocked instances calls is_deadlocked and starts shutting down. At that point we're sure that there was a deadlock and that the simulation will shut down.

LourensVeen (Contributor):

I'd prefer that. Either we're convinced that the grace period is unnecessary and that as soon as a cycle is detected there's an actual deadlock, in which case we should remove the grace period, or we decide that we need it, but then the warning should be consistent. Otherwise you could get a false positive in the log, and that could confuse people who are doing everything right.
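
One way to make it consistent, as discussed (an illustrative C++ sketch; the real DeadlockDetector is part of the Python manager, and all names here are hypothetical):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch of the agreed behaviour: a detected cycle is stored but
// only reported once one of its members actually asks whether it is
// deadlocked, so a cycle that resolves in the meantime never produces a
// false-positive FATAL log.
struct DetectedCycle {
    std::vector<std::string> members;   // instances waiting on each other
    bool reported = false;

    bool is_deadlocked(std::string const & instance) {
        bool in_cycle = std::find(members.begin(), members.end(), instance)
                != members.end();
        if (in_cycle && !reported) {
            // log the cycle with FATAL severity here (logging call omitted)
            reported = true;
        }
        return in_cycle;
    }
};
```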
