Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C#] Fix thread deadlock when using a worker thread to load a script with a generic base class #99798

Closed
wants to merge 1 commit into from

Conversation

preslavnpetrov
Copy link
Contributor

@preslavnpetrov preslavnpetrov commented Nov 28, 2024

Fix for #99839 and #99128

Currently, if we use either a worker thread with ResourceLoader::load() or use ResourceLoader::load_threaded_request(), and try to load any resource with a dependency on a script that inherits a generic C# class, we run into a _scriptTypeBiMap.ReadWriteLock deadlock.

This happens in a very specific circumstance:

  1. The script and its base hasn't been loaded before, so it's not populated into the BiMap
  2. Calling the load from a worker thread will make the ResourceLoader::load() call from inside godot_sharp_internal_script_load to spin up a separate worker thread to load the dependency base script
  3. This secondary worker thread will try to get the script from the BiMap unsuccessfully, since it hasn't been populated yet, and will then fall back to creating the script bridge. While doing so, it will try to lock the already locked BiMap ReadWriteLock from when we were loading it's child script

This PR addresses this by moving the call to godot_sharp_internal_reload_registered_script after we've unlocked the lock on the BiMap. This method doesn't need, and in fact shouldn't be run while we're locked. This has made it necessary to always call this method after we call CreateScriptBridgeForType from inside the lock, however that method has only two callsites at the moment, so it wasn't a difficult change to make.

To make sure the method is called by anyone making further changes to the code in the future, I've added a doc comment to the CreateScriptBridgeForType method noting that the registration needs to be done manually.


Detailed callstack of the problem with explanations:

Managed code is italics to make it easier to read when we switch between managed and unmanaged

  1. ResourceFormatLoaderCSharpScript::load(derived_script_path)
  2. ScriptManagerBridge_GetOrCreateScriptBridgeForPath
  3. GetOrCreateScriptBridgeForType
  4. lock(_scriptTypeBiMap.ReadWriteLock)
  5. CreateScriptBridgeForType
  6. godot_sharp_internal_reload_registered_script(derived_script_path)
  7. reload_registered_script
  8. update_script_class_info
  9. ScriptManagerBridge_UpdateScriptClassInfo -- As part of populating the class info, this will try to load the base script
  10. *GetOrLoadOrCreateScriptForType(generic_base_type)
  11. godot_sharp_internal_script_load
  12. ResourceLoader::load -- Because we're calling this whole callstack from a worker thread, the resource loader will automatically start the loading of this resource in a different worker thread, with LOAD_THREAD_SPAWN_SINGLE
  13. wait -- The thread will now wait for the worker thread to resolve before it continues execution

Then we call the following on the new worker thread:

  1. ResourceFormatLoaderCSharpScript::load(generic_base_type)
  2. ScriptManagerBridge_GetOrCreateScriptBridgeForPath
  3. GetOrCreateScriptBridgeForType
  4. lock(_scriptTypeBiMap.ReadWriteLock) -- Second lock of the already locked BiMap

lock to prevent worker threads from the
ResourceLoader::load method from causing a
deadlock
@preslavnpetrov preslavnpetrov requested a review from a team as a code owner November 28, 2024 15:39
@Chaosus Chaosus added this to the 4.4 milestone Nov 29, 2024
@clayjohn
Copy link
Member

CC @RandomShaper I suspect this is an alternative approach to #99539 as #99128 sounds very similar to #99839

@raulsntos
Copy link
Member

Closing as superseded by #99539.

As discussed in RC, I believe the other PR to be less risky. I'm worried about potential regressions from moving the native call outisde the lock, it's possible that this results in not locking in some cases where we should be. Thanks for the contribution nonetheless.

@raulsntos raulsntos closed this Dec 18, 2024
@raulsntos raulsntos removed this from the 4.4 milestone Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants