-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[1.x] stuck and crash on version from master #3463
Comments
We'll need more details than that 😁 |
The logs suggest either a deadlock or a resource exhaustion, with the latter potentially being an effect of the former. |
ok, we build and run version with asan, I will return with additional info |
voex-logs.zip situattion were at ~05:30 |
this is asan version, so no warnings were fired |
we run branch without d014cff commit and it works normally |
@spscream is that the result of a |
I'm not sure line numbers match the commits I'm looking at: line 4521, for instance, is an unlock, and it makes no sense for many threads to be stuck there. Not sure if line numbers are messed up because libasan was built in (I seem to remember asan and gdb don't always get along). |
I tried to rebase on different commits from master, since we are using our custom logic(https://github.com/spscream/janus-gateway/tree/bisect/1.3.0-1 branch) and I can't use bisect directly. |
@spscream can you double check that Asan has been linked?
|
sure:
|
I'm sorry but we can't look at forks. Please try and reproduce with a branch from this repo. |
I reviewed commit d014cff and found lock here d014cff#diff-06af86ce190449ef81620ff5ad5f6ecda030a07fff31683a30c0a1960657e279R5828 doesn't unlock in case of error of opening udp socket - it jumps to prepare_response. Added pr with fix |
It got stuck today on branch with fix today. We will switch to master version and provide log with gdb info. |
Ack, reopening then. |
#3469 another one found and checked all other cases, seems like they are ok. Please don't close the issue, we will test it on load tomorrow morning again and I'll provide feedback. |
I didn't close the issue yesterday, your PR did because it mentioned this issue with a "fix" in it 🤭 |
:) today feedback is good, janus is working without hangs, we will run it on more servers to check it under higher load. |
nope, hangs again |
Have you tried with locks debugging enabled? |
I'm checking it now. Plan now is the following:
|
I have suspicion on race condition between streams_mutex and room->mutex,
theoretically if in other place they obtained in different order: room->mutex is obtained first and streams_mutex second, it possibly could be a deadlock. |
this place: janus-gateway/src/plugins/janus_videoroom.c Line 8385 in 51fe38a
and other place calling in other order is listparticipants command janus-gateway/src/plugins/janus_videoroom.c Line 6920 in 51fe38a
|
this is what I see in my gdb:
and other one is
based on our codebase second is listparticipants request trying to obtain streams_mutex |
We still can run master on our servers to prove it with gdb, if it is neccessary |
Yes, an inverted lock order can definitely cause that problem. Maybe we should take the room->mutex lock out of the |
We can test performance if you provide pr with fix, we have pretty solid testing environment. |
or other variant is make lock of streams inside of janus_videoroom_notify_about_publisher and remove outer lock |
but I think it could lead to strange races - state of streams can change while we entering janus_videoroom_notify_about_publisher |
I'm preparing a patch |
#3473 I suggest this, just in case, if you have better option it would be great |
Closing as we merged the PR. |
What version of Janus is this happening on?
branch based on master from b092e3d commit
Have you tested a more recent version of Janus too?
yes, branch based on master from 504daf5
Was this working before?
yes
Is there a gdb or libasan trace of the issue?
no
Additional context
we got janus stuck in production on recent commits from master, it stops to accept creation of sessions, logs are having "DestroyingWebSocket client" message repeatedly - it accepts ws connections but doesn't create new sessions:
before stuck logs looks normally:
and we also have crash on another server, before crash it had strange messages, unfortuately we don't have asan enabled on this version:
The text was updated successfully, but these errors were encountered: