
Lua: switch to recursive mutex in lua_lock() #5061

Merged
merged 1 commit into DFHack:develop on Dec 1, 2024

Conversation

lethosor (Member)

Previously, calling a Lua C API function that throws an error (e.g. with any of
the `lua_error()` family of functions, which use `luaD_throw()` internally) in a
function that also uses our own `StackUnwinder` would deadlock, because:

- `luaD_throw()` calls `lua_lock()` but apparently expects something else to
  call `lua_unlock()`
- `~StackUnwinder()` calls `lua_settop()` to rewind the stack, which internally
  also calls `lua_lock()` and `lua_unlock()`. This is called before the mutex
  can be unlocked.

This was the root cause of the issue behind DFHack#5040, DFHack#5055, DFHack#5056 - `GetVector()`
was called with an invalid index (only when invoked from `gui/launcher`, because
this changed the data on the Lua stack) and threw an exception, which caused the
`~StackUnwinder()` destructor to run while the Lua state was still locked.

Other things to note:
- We control the locking/unlocking implementation - the default, defined in
  `llimits.h`, is a _no-op_ (a sketch of a recursive pthreads replacement is
  shown after this list).
- @ab9rf points out that the `CRITICAL_SECTION` API we're using on Windows
  already appears to be equivalent to a recursive mutex, so this issue likely
  would not have occurred there.
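
For context, here is a minimal, self-contained sketch of the property the fix
relies on: a pthread mutex configured with `PTHREAD_MUTEX_RECURSIVE` can be
re-acquired by the thread that already holds it. This is only an illustration
of the mechanism, not the actual patch (which wires a per-state mutex into
Lua's `lua_lock()`/`lua_unlock()` and `luai_userstateopen()` hooks).

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lua_state_mutex;

static void init_recursive_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&lua_state_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}

int main(void) {
    init_recursive_mutex();
    /* First acquisition: e.g. luaD_throw() entering lua_lock(). */
    pthread_mutex_lock(&lua_state_mutex);
    /* Second acquisition from the same thread: e.g. ~StackUnwinder() ->
     * lua_settop() -> lua_lock(). A plain mutex would deadlock here; a
     * recursive one just increments its lock count. */
    pthread_mutex_lock(&lua_state_mutex);
    pthread_mutex_unlock(&lua_state_mutex);
    pthread_mutex_unlock(&lua_state_mutex);  /* one unlock per lock */
    puts("no deadlock with PTHREAD_MUTEX_RECURSIVE");
    return 0;
}
```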

The issue can be reproduced relatively easily with a simple test command, e.g.:

```diff
@@ -716,6 +723,15 @@ void ls_helper(color_ostream &con, const std::vector<std::string> &params) {
 command_result Core::runCommand(color_ostream &con, const std::string &first_, std::vector<std::string> &parts)
 {
     std::string first = first_;
+    if (first == "luaerror") {
+        auto L = Lua::Core::State;
+        Lua::StackUnwinder top(L);
+
+        luaL_error(L, "test error");
+
+        return CR_OK;
+    }
+
     CommandDepthCounter counter;
     if (!counter.ok())
     {
```

In `gui/launcher`:
- before this change, invoking `luaerror` would hang DF.
- after this change, `luaerror` prints `nil` in red to the native console and
  does nothing. This comes from the last-resort error handler in
  `LuaTools.cpp:report_error()`, and is equivalent to how other unexpected
  errors are handled in code invoked from Lua. Not ideal, but better than a
  crash.

From the native console, invoking `luaerror` crashes DF in both cases (SIGABRT),
and logs `PANIC: unprotected error in call to Lua API (test error)` to
`stderr.log`. The difference in behavior is because `report_error()` above is
part of the error handling system present when code is called _from Lua_,
through variants of `SafeCall`. There is no Lua layer involved in the native
console. (Note: the same crash would have been observed in the original issue
in DFHack#5056 et al. if the error had occurred when invoked through the console.)
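
To make the difference between those two invocation paths concrete, here is a
generic Lua C API sketch (plain Lua, not DFHack's `SafeCall`/`report_error`
machinery; `raise_test_error` is a stand-in for the `luaerror` test command
above): the same `luaL_error()` is recoverable when something up the stack
called `lua_pcall()`, and becomes an unprotected error (panic, then abort)
when nothing did.

```c
#include <stdio.h>

#include <lua.h>
#include <lauxlib.h>

static int raise_test_error(lua_State *L) {
    return luaL_error(L, "test error");
}

int main(void) {
    lua_State *L = luaL_newstate();

    /* Invoked under lua_pcall (roughly the "called from Lua" case): the
     * error is caught and left on the stack for the caller to report. */
    lua_pushcfunction(L, raise_test_error);
    if (lua_pcall(L, 0, 0, 0) != LUA_OK)
        printf("caught: %s\n", lua_tostring(L, -1));

    /* Invoked with no protected frame anywhere (roughly the native-console
     * case): luaL_error() has nowhere to longjmp to, so Lua's panic handler
     * runs and the process aborts with
     * "PANIC: unprotected error in call to Lua API (test error)". */
    raise_test_error(L);

    lua_close(L);  /* never reached */
    return 0;
}
```
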
@lethosor lethosor requested review from myk002 and ab9rf November 30, 2024 07:02
@lethosor lethosor (Member Author) commented

I would like for someone to confirm that the behaviors I listed also occur on Windows before merging, at least the "after" behaviors, to make sure the deadlock really is absent there.

```c
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE); \
pthread_mutex_init(luai_mutex(L), &attr); \
} while (0)
#define luai_userstateclose(L) do { \
```
lethosor (Member Author)

(no changes, just reformatted to match luai_userstateopen)

@myk002 myk002 (Member) left a comment

LGTM for Linux, but waiting for Windows testing (though obviously the code in this PR so far is all non-Windows-specific).

@ab9rf ab9rf (Member) commented Nov 30, 2024

> I would like for someone to confirm that the behaviors I listed also occur on Windows before merging, at least the "after" behaviors, to make sure the deadlock really is absent there.

Windows develop HEAD exhibits the behavior you describe: `luaerror` in the launcher with this code results in `nil` being printed on the console in red text, with the launcher appearing to do nothing. `luaerror` in the console caused an assertion failure window or uncaught exception to pop up (this was a debug build), but it crashed before I could read the window to tell which it was.

Further poking around with the debugger leads me to think that there is a logic error in `convert_to_exception` in LuaTools, specifically in the case where there are no stack frames on the Lua stack. The Lua engine throws two assertions for invalid indices before causing a native access violation, and I strongly suspect the problem is that it's being asked to produce a traceback with no stack frames to trace.

In any case, this doesn't produce a hang on Windows; instead it causes two assertion failures and an application exit. I don't think there's a locking issue on Windows, but I actually think the locking issue is due to an underlying defect elsewhere, and that "fixing" the locking issue isn't actually fixing the problem.

I have no comment on this PR one way or the other because it solely applies to Linux pthreads, which I know basically nothing about. There's definitely a problem here regardless, and I don't think it's architecture-specific.
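
As a purely illustrative aside on the `convert_to_exception` hypothesis above
(this is not the actual LuaTools code, and `push_error_with_traceback` is a
hypothetical helper): a guard of this shape would avoid asking Lua for a
traceback when no call frames exist.

```c
#include <lua.h>
#include <lauxlib.h>

/* Only request a traceback if at least one Lua call frame is active;
 * otherwise fall back to the bare error message, so an error raised with an
 * empty Lua call stack never drives the traceback machinery into invalid
 * indices. */
static void push_error_with_traceback(lua_State *L, const char *msg) {
    lua_Debug ar;
    if (lua_getstack(L, 0, &ar))
        luaL_traceback(L, L, msg, 1);   /* msg plus a stack traceback */
    else
        lua_pushstring(L, msg);         /* no frames: message only */
}
```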

@ab9rf ab9rf (Member) left a comment

I can't specifically approve this PR because it only affects Linux code. With or without this PR, there is definitely a problem on Windows, in that the example test reliably causes Lua assertion failures followed by abrupt program termination, and this doesn't appear to be related to locking issues in any way.

No opinion on merging this

@myk002 myk002 (Member) commented Nov 30, 2024

Ack. I think we should go ahead and merge to at least bring the Linux behavior in line with Windows.

@myk002 myk002 merged commit acb24ba into DFHack:develop Dec 1, 2024
14 checks passed
@lethosor lethosor deleted the le/lua-lock-recursive branch December 1, 2024 02:07
@lethosor lethosor (Member Author) commented Dec 1, 2024

You just beat me to it :)

From following up on Discord, the crash from running `luaerror` from the native console is expected; it's just that the specific way it's crashing on Windows is weird. So I'm not too concerned here.

@ab9rf ab9rf (Member) commented Dec 1, 2024

I am concerned about the possibility of unsynchronized access to `DFHack::Core::State`, but I don't think it will make a difference in non-pathological situations, so I'm not inclined to spend a lot of effort on it.

One thing to think about: once Bay 12 makes the render hook available (which is called from DF's main, or render, thread instead of from the subsidiary simulation thread), we will have to be careful about not allowing Lua code to run from the render thread, or about figuring out the circumstances under which it is safe for such code to run.
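
A purely hypothetical sketch of the kind of guard that last point might call
for (nothing like this exists in the PR; `lua_owner_thread` and both helpers
are invented for illustration): remember which thread owns the Lua state and
assert before dispatching into Lua.

```c
#include <assert.h>
#include <pthread.h>

static pthread_t lua_owner_thread;

/* Call once from the simulation thread that owns the Lua state. */
static void lua_owner_init(void) {
    lua_owner_thread = pthread_self();
}

/* Call before running Lua code; fires if e.g. a future render-thread hook
 * tries to drive Lua directly. */
static void lua_owner_check(void) {
    assert(pthread_equal(pthread_self(), lua_owner_thread) &&
           "Lua invoked from a thread that does not own the state");
}
```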

dhthwy added a commit to dhthwy/dfhack that referenced this pull request Dec 6, 2024
ab9rf pushed a commit that referenced this pull request Dec 6, 2024