Lua: switch to recursive mutex in lua_lock() #5061
Previously, calling a Lua C API function that throws an error (e.g. with any of the `lua_error()` family of functions, which use `luaD_throw()` internally) in a function that also uses our own `StackUnwinder` would deadlock, because:

- `luaD_throw()` calls `lua_lock()` but apparently expects something else to call `lua_unlock()`
- `~StackUnwinder()` calls `lua_settop()` to rewind the stack, which internally also calls `lua_lock()` and `lua_unlock()`. This is called before the mutex can be unlocked.

This was the root cause of the issue behind DFHack#5040, DFHack#5055, DFHack#5056: `GetVector()` was called with an invalid index (only when invoked from `gui/launcher`, because this changed the data on the Lua stack) and threw an exception, which caused the `~StackUnwinder()` destructor to run while the Lua state was still locked.

Other things to note:

- We control the locking/unlocking implementation - the default, defined in `llimits.h`, is a _no-op_.
- @ab9rf points out that the `CRITICAL_SECTION` API we're using on Windows already appears to be equivalent to a recursive mutex, so this issue likely would not have occurred there.

The issue can be reproduced relatively easily with a simple test command, e.g.:

```diff
@@ -716,6 +723,15 @@ void ls_helper(color_ostream &con, const std::vector<std::string> &params) {
 command_result Core::runCommand(color_ostream &con, const std::string &first_, std::vector<std::string> &parts)
 {
     std::string first = first_;
+    if (first == "luaerror") {
+        auto L = Lua::Core::State;
+        Lua::StackUnwinder top(L);
+
+        luaL_error(L, "test error");
+
+        return CR_OK;
+    }
+
     CommandDepthCounter counter;
     if (!counter.ok())
     {
```

In `gui/launcher`:

- before this change, invoking `luaerror` would hang DF.
- after this change, `luaerror` prints `nil` in red to the native console and does nothing. This comes from the last-resort error handler in `LuaTools.cpp:report_error()`, and is equivalent to how other unexpected errors are handled in code invoked from Lua. Not ideal, but better than a crash.

From the native console, invoking `luaerror` crashes DF in both cases (SIGABRT), and logs `PANIC: unprotected error in call to Lua API (test error)` to `stderr.log`. The difference in behavior is because `report_error()` above is part of the error handling system present when code is called _from Lua_, through variants of `SafeCall`. There is no Lua layer involved in the native console. (Note: the same crash would have been observed in the original issue in DFHack#5056 et al. if the error had occurred when invoked through the console.)
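For reviewers who want to see the failure mode in isolation, here is a minimal sketch that does not use Lua or DFHack at all. `StateLock`, `Unwinder`, and `unwinder_can_relock()` are hypothetical stand-ins invented for this example: the `throw` plays the role of `luaD_throw()` (lock taken, never released on that path) and the destructor plays the role of `~StackUnwinder()` calling `lua_settop()`. It uses `pthread_mutex_trylock()` rather than `pthread_mutex_lock()` so that the non-recursive case reports a failure instead of actually hanging.

```cpp
#include <cassert>
#include <pthread.h>
#include <stdexcept>

// Hypothetical stand-in for the per-state Lua lock; not DFHack's real code.
struct StateLock {
    pthread_mutex_t mutex;

    explicit StateLock(int type) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, type);
        pthread_mutex_init(&mutex, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    ~StateLock() { pthread_mutex_destroy(&mutex); }
};

// Hypothetical stand-in for StackUnwinder: its destructor needs the lock,
// just as lua_settop() does internally.
struct Unwinder {
    StateLock &lock;
    bool &relocked;

    ~Unwinder() {
        // trylock, so a non-recursive mutex fails here instead of deadlocking.
        relocked = (pthread_mutex_trylock(&lock.mutex) == 0);
        if (relocked)
            pthread_mutex_unlock(&lock.mutex);
    }
};

// Returns true if the destructor could take the lock during unwinding.
bool unwinder_can_relock(int type) {
    StateLock lock(type);
    bool relocked = false;
    try {
        Unwinder guard{lock, relocked};
        pthread_mutex_lock(&lock.mutex);         // lock taken by the throwing path...
        throw std::runtime_error("test error");  // ...and still held while unwinding
    } catch (const std::runtime_error &) {
        // guard's destructor has already run at this point
    }
    pthread_mutex_unlock(&lock.mutex);  // release the lock left held by the throw
    return relocked;
}
```

With `PTHREAD_MUTEX_NORMAL` the destructor cannot reacquire the lock, which is where a blocking `lua_lock()` would hang; with `PTHREAD_MUTEX_RECURSIVE` the same thread is allowed back in.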
I would like for someone to confirm that the behaviors I listed also occur on Windows before merging, at least the "after" behaviors, to make sure the deadlock really is absent there.

    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE); \
    pthread_mutex_init(luai_mutex(L), &attr); \
} while (0)
#define luai_userstateclose(L) do { \
(no changes, just reformatted to match `luai_userstateopen`)
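One property of the recursive attribute worth keeping in mind while reviewing: a recursive mutex lets the owning thread lock again, but every lock still needs a matching unlock before any other thread can get in. The sketch below illustrates that with plain pthreads. `init_recursive()`, `other_thread_can_lock()`, and `nested_lock_demo()` are hypothetical helpers written for this example; only the `pthread_*` calls are real API, and the attribute setup mirrors the two calls visible in the hunk.

```cpp
#include <cassert>
#include <pthread.h>
#include <thread>

// Hypothetical helper: initialise `m` as a recursive mutex, checking return codes.
int init_recursive(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Returns true if a *different* thread can acquire `m` right now.
bool other_thread_can_lock(pthread_mutex_t *m) {
    bool ok = false;
    std::thread t([&] {
        ok = (pthread_mutex_trylock(m) == 0);
        if (ok)
            pthread_mutex_unlock(m);
    });
    t.join();
    return ok;
}

struct Availability {
    bool after_one_unlock;
    bool after_two_unlocks;
};

// Lock twice on one thread, then check availability after each unlock.
Availability nested_lock_demo() {
    pthread_mutex_t m;
    if (init_recursive(&m) != 0)
        return {false, false};

    pthread_mutex_lock(&m);
    pthread_mutex_lock(&m);  // same thread: allowed for a recursive mutex
    pthread_mutex_unlock(&m);
    bool one = other_thread_can_lock(&m);  // still held once by this thread
    pthread_mutex_unlock(&m);
    bool two = other_thread_can_lock(&m);  // fully released

    pthread_mutex_destroy(&m);
    return {one, two};
}
```

In other words, switching to a recursive mutex removes the self-deadlock, but a lock that is taken and never released on some path would still keep other threads out.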
LGTM for Linux, but waiting for Windows testing (though obviously the code in this PR so far is all non-windows-specific)
Windows

further poking around with the debugger leads me to think that there is a logic error in

in any case, this doesn't produce a hang on Windows, instead causing two assertion failures and an application exit. i don't think there's a locking issue on Windows, but i actually think the locking issue is due to an underlying defect elsewhere and that "fixing" the locking issue isn't actually fixing the problem

i have no comment on this PR one way or the other because it solely applies to Linux pthreads which i know basically nothing about. there's definitely a problem here, regardless, and i don't think it's architecture specific
I can't specifically approve this PR because it solely impacts Linux code. With or without this PR, there is definitely a problem on Windows: the example test reliably causes Lua assertion failures followed by abrupt program termination, and this doesn't appear to be related to locking issues in any way.
No opinion on merging this
ack. I think we should go ahead and merge to at least bring the Linux behavior in line with Windows

You just beat me to it :) From following up on Discord, the crash from running

i am concerned about the possibility of unsynchronized access to

one thing to think about: once bay12 makes the