
Feature/deadlock detection #299 (Open)

maarten-ic wants to merge 19 commits into develop.

Conversation

maarten-ic (Collaborator) commented Aug 13, 2024

Add a deadlock detection mechanism to MUSCLE3.

New functionality:

  • An instance sends a message to the manager when it has been waiting for longer than muscle_deadlock_receive_timeout seconds (configurable in the MUSCLE settings)
  • The manager reports deadlocks (a cycle of instances that are all waiting on each other) and shuts down the simulation (see the sketch after this list)
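
As a rough illustration of the cycle check described above (a hypothetical C++ sketch; the actual DeadlockDetector lives in the Python manager, and names such as `waiting_for_` and `detect_cycle` are made up here):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the manager-side bookkeeping: every instance that
// reports it is waiting is recorded with the peer it waits on; a deadlock is
// a chain of waiting instances that loops back on itself.
class DeadlockSketch {
    public:
        // An instance reported that it started waiting for a peer.
        void waiting(std::string const & instance, std::string const & peer) {
            waiting_for_[instance] = peer;
        }

        // The instance reported that the receive completed after all.
        void done(std::string const & instance) {
            waiting_for_.erase(instance);
        }

        // Follow the waiting-for chain from a given instance; if we revisit
        // an instance, everything from that point on forms a deadlock cycle.
        std::vector<std::string> detect_cycle(std::string const & start) const {
            std::vector<std::string> chain;
            std::string current = start;
            while (waiting_for_.count(current)) {
                auto seen = std::find(chain.begin(), chain.end(), current);
                if (seen != chain.end())
                    return std::vector<std::string>(seen, chain.end());
                chain.push_back(current);
                current = waiting_for_.at(current);
            }
            return {};      // chain ends at an instance that is not waiting
        }

    private:
        std::unordered_map<std::string, std::string> waiting_for_;
};
```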

TODO

  • Implement Instance logic in C++ as well
  • Add integration tests
  • Add documentation

Commits:
  • Improve documentation of the DeadlockDetector and add unit tests
  • Move the variable to Communicator to simplify some logic
  • Also pass the manager as a direct argument to the Communicator
maarten-ic requested a review from LourensVeen, August 13, 2024 14:52
LourensVeen (Contributor) left a comment:

Good start! I do have an architectural issue and an algorithmic one, and some small comments.

maarten-ic self-assigned this Aug 16, 2024
Commits:
  • Reduce the complexity of the Communicator
  • Keep on `poll()`ing until a message to receive is available, and check with the manager whether we are deadlocked after each timeout (sketched below). This approach also enables deadlock detection for runs where the instances are not started by `muscle_manager`.
  • Remove the ability to shut down the simulation (components will now crash when deadlocked)
  • Process messages immediately (in the MMP server thread) instead of creating a custom DeadlockDetector thread
  • Fix bugs in the C++ implementation
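
The receive path described in the `poll()` commit above could be pictured roughly like this (a hedged sketch; both helper functions are stubs standing in for the real transport poll and the MMP "am I deadlocked?" request, not the actual libmuscle API):

```cpp
#include <chrono>
#include <stdexcept>

// Stub standing in for the transport-level poll: returns true when a message
// is available, false when the timeout expired.
static bool poll_for_message(std::chrono::duration<double> /*timeout*/) {
    return true;    // stub: pretend a message arrived immediately
}

// Stub standing in for asking the manager whether this instance is part of a
// detected deadlock cycle.
static bool manager_says_deadlocked() {
    return false;   // stub: no deadlock detected
}

void receive_with_deadlock_check(std::chrono::duration<double> timeout) {
    // Poll until a message is available; after every timeout, check with the
    // manager and crash the component if it is part of a deadlock cycle.
    while (!poll_for_message(timeout)) {
        if (manager_says_deadlocked())
            throw std::runtime_error(
                    "Deadlock detected while waiting for a message");
    }
    // A message is available; the normal receive proceeds from here.
}
```
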
maarten-ic marked this pull request as ready for review August 20, 2024 12:24
maarten-ic requested a review from LourensVeen August 20, 2024 12:24
LourensVeen (Contributor) left a comment:

Okay, nice work again! I've left some comments, but it's looking good overall.

libmuscle/python/libmuscle/manager/deadlock_detector.py (outdated review thread, resolved)
    public:
        ReceiveTimeoutHandler(
                MMPClient & manager,
                std::string const & peer_instance,
LourensVeen (Contributor):

Could this be a ymmsl::Reference? That would save converting to a string when calling this in Communicator, and instances are referred to by a Reference everywhere. Of course there'd still be a conversion to a string in the MMPClient, but those are all over the place there so that makes sense.

maarten-ic (Collaborator, Author):

Sure thing, I've also updated it for the Python library to keep that consistent.
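
For clarity, the agreed change would make the constructor look roughly like this (a sketch with forward declarations standing in for the real headers; the remaining constructor parameters are elided):

```cpp
// Sketch only: forward declarations replace the real includes, and only the
// parameters discussed here are shown.
namespace ymmsl { class Reference; }
class MMPClient;

class ReceiveTimeoutHandler {
    public:
        ReceiveTimeoutHandler(
                MMPClient & manager,
                ymmsl::Reference const & peer_instance
                /* , further parameters elided */);
};
```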

        throw std::runtime_error(
                "Unexpected error during poll(): " + std::to_string(errno));

        // poll() was interrupted by a signal: retry with re-calculated timeout
    } while (1);
LourensVeen (Contributor):

`while (true)`? The `1` reads like C to me, because C doesn't have a boolean type, or at least it didn't, and the one it now has is ugly.
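
The loop shape being suggested might look something like this (a self-contained sketch; the deadline bookkeeping is illustrative and not the PR's actual code):

```cpp
#include <cerrno>
#include <chrono>
#include <poll.h>
#include <stdexcept>
#include <string>

// Sketch of the retry loop using while (true) rather than while (1): retry
// poll() after EINTR with a re-calculated timeout, report any other error.
int poll_with_deadline(struct pollfd * fds, nfds_t nfds,
                       std::chrono::steady_clock::time_point deadline) {
    while (true) {
        auto remaining = std::chrono::duration_cast<std::chrono::milliseconds>(
                deadline - std::chrono::steady_clock::now()).count();
        if (remaining < 0)
            remaining = 0;
        int ret = poll(fds, nfds, static_cast<int>(remaining));
        if (ret >= 0)
            return ret;     // > 0: data available, 0: timed out
        if (errno != EINTR)
            throw std::runtime_error(
                    "Unexpected error during poll(): " + std::to_string(errno));
        // poll() was interrupted by a signal: retry with re-calculated timeout
    }
}
```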

libmuscle/cpp/src/libmuscle/mcp/tcp_transport_client.cpp (outdated review thread, resolved)
libmuscle/cpp/src/libmuscle/receive_timeout_handler.cpp (outdated review thread, resolved)

 double ReceiveTimeoutHandler::get_timeout()
 {
-    return timeout_;
+    // Increase timeout by a factor 1.5 with every timeout we hit:
+    return timeout_ * std::pow(1.5, (double)num_timeout_);
LourensVeen (Contributor):

static_cast<double> is the C++ way to do (double) here, but can't it just be omitted? Or is that a narrowing conversion?

maarten-ic (Collaborator, Author):

(double) can be omitted 👍
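
So the resulting method might read as below (a minimal sketch; the surrounding class is abbreviated and the member values shown are illustrative):

```cpp
#include <cmath>

// Minimal sketch: std::pow promotes the integer exponent to double, so the
// explicit cast can be dropped without any narrowing.
class ReceiveTimeoutHandler {
    public:
        double get_timeout() const {
            // Increase timeout by a factor 1.5 with every timeout we hit:
            return timeout_ * std::pow(1.5, num_timeout_);
        }

    private:
        double timeout_ = 10.0;     // base receive timeout (illustrative value)
        int num_timeout_ = 0;       // number of timeouts hit so far
};
```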

WAITING_FOR_RECEIVE_DONE MMP messages, which are submitted by the MMPServer.

 When a deadlock is detected, the cycle of instances that is waiting on each other is
-logged with FATAL severity. If this deadlock does not get resoled in
+logged with FATAL severity. If this deadlock does not get resolved in
LourensVeen (Contributor):

Actually, it's only logged if it isn't resolved, right? That's how it should be, so the comment should be clarified I think.

maarten-ic (Collaborator, Author):

This is currently logged as soon as the manager identifies a cycle of waiting instances. I think this will never be resolved, but I'm not sure.

I can also update the logic to only print the deadlock cycle when any of the deadlocked instances calls is_deadlocked and starts shutting down. At that point we're sure that there was a deadlock and that the simulation will shut down.

LourensVeen (Contributor):

I'd prefer that. Either we're convinced that the grace period is unnecessary and that as soon as a cycle is detected there's an actual deadlock, in which case we should remove the grace period, or we decide that we need it, but then the warning should be consistent. Otherwise you could get a false positive in the log, and that could confuse people who are doing everything right.
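
One way to make it consistent, as discussed (an illustrative C++ sketch; the real DeadlockDetector is part of the Python manager, and all names here are hypothetical):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch of the agreed behaviour: a detected cycle is stored but
// only reported once one of its members actually asks whether it is
// deadlocked, so a cycle that resolves in the meantime never produces a
// false-positive FATAL log.
struct DetectedCycle {
    std::vector<std::string> members;   // instances waiting on each other
    bool reported = false;

    bool is_deadlocked(std::string const & instance) {
        bool in_cycle = std::find(members.begin(), members.end(), instance)
                != members.end();
        if (in_cycle && !reported) {
            // log the cycle with FATAL severity here (logging call omitted)
            reported = true;
        }
        return in_cycle;
    }
};
```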
