Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.x] stuck and crash on version from master #3463

Closed
spscream opened this issue Oct 24, 2024 · 35 comments · Fixed by #3468
Closed

[1.x] stuck and crash on version from master #3463

spscream opened this issue Oct 24, 2024 · 35 comments · Fixed by #3468
Labels
multistream Related to Janus 1.x

Comments

@spscream
Copy link
Contributor

What version of Janus is this happening on?
branch based on master from b092e3d commit

Have you tested a more recent version of Janus too?
yes, branch based on master from 504daf5

Was this working before?
yes

Is there a gdb or libasan trace of the issue?
no

Additional context
we got janus stuck in production on recent commits from master, it stops to accept creation of sessions, logs are having "DestroyingWebSocket client" message repeatedly - it accepts ws connections but doesn't create new sessions:

2024-10-24T07:46:05.181877549Z [WSS-0x7f9e608f5420] Destroying WebSocket client
2024-10-24T07:46:05.195416485Z [WSS-0x7f9e608f5980] Destroying WebSocket client
2024-10-24T07:46:05.220954130Z [WSS-0x7f9e608f5ea0] Destroying WebSocket client
2024-10-24T07:46:05.224866921Z [WSS-0x7f9e608f6400] Destroying WebSocket client
2024-10-24T07:46:05.237925976Z [WSS-0x7f9e608f6960] Destroying WebSocket client
2024-10-24T07:46:05.241920620Z [WSS-0x7f9e57f8e010] Destroying WebSocket client
2024-10-24T07:46:05.245381267Z [WSS-0x7f9e57f8e550] Destroying WebSocket client
2024-10-24T07:46:05.254307072Z [WSS-0x7f9e57f8f520] Destroying WebSocket client
2024-10-24T07:46:05.258169941Z [WSS-0x7f9e57f8f000] Destroying WebSocket client

before stuck logs looks normally:

2024-10-24T07:02:18.806658918Z [WSS-0x7f9e608f0030] Destroying WebSocket client
2024-10-24T07:02:18.806665306Z Destroying session 5866593402157672; 0x7f9e5ad2eb50
2024-10-24T07:02:18.806668003Z Detaching handle from JANUS VideoRoom plugin; 0x7f9dff8ff790 0x7f9e24a842e0 0x7f9dff8ff790 0x7f9e4f1254e0
2024-10-24T07:02:18.806670957Z Detaching handle from JANUS AudioBridge plugin; 0x7f9e1cc50920 0x7f9e4cb184d0 0x7f9e1cc50920 0x7f9e0610ecf0
2024-10-24T07:02:18.806736777Z No WebRTC media anymore
2024-10-24T07:02:18.807177631Z [570986776409605] Handle and related resources freed; 0x7f9e1cc50920 0x7f9e5ad2eb50
2024-10-24T07:02:18.810302128Z [WSS-0x7f9e41845460] Destroying WebSocket client
2024-10-24T07:02:18.810308679Z Destroying session 7089416138070467; 0x7f9df436b630
2024-10-24T07:02:18.810311270Z Detaching handle from JANUS VideoRoom plugin; 0x7f9dfe7b8ac0 0x7f9e0aa12460 0x7f9dfe7b8ac0 0x7f9e1a69e950
2024-10-24T07:02:18.810313991Z Detaching handle from JANUS AudioBridge plugin; 0x7f9e00b64990 0x7f9e2603c380 0x7f9e00b64990 0x7f9e1dbbbb10
2024-10-24T07:02:18.810354973Z No WebRTC media anymore
2024-10-24T07:02:18.810372382Z [janus.plugin.videoroom-0x7f9e0aa12460] No WebRTC media anymore; 0x7f9dfe7b8ac0 0x7f9e1a69e950
2024-10-24T07:02:18.811349618Z [2724536361622841] Handle and related resources freed; 0x7f9e00b64990 0x7f9df436b630
2024-10-24T07:02:18.814146440Z [WSS-0x7f9e3381bff0] Destroying WebSocket client
2024-10-24T07:02:18.817745061Z [WSS-0x7f9e608f0aa0] Destroying WebSocket client
2024-10-24T07:02:18.817751693Z Destroying session 2354713624930254; 0x7f9e51e7e5a0
2024-10-24T07:02:18.817754209Z Detaching handle from JANUS AudioBridge plugin; 0x7f9e2dd104d0 0x7f9dffc07e80 0x7f9e2dd104d0 0x7f9e134e7d10
2024-10-24T07:02:18.817760655Z Detaching handle from JANUS VideoRoom plugin; 0x7f9df8e5f1d0 0x7f9e3fb7e830 0x7f9df8e5f1d0 0x7f9e60f1f970
2024-10-24T07:02:18.817763935Z No WebRTC media anymore
2024-10-24T07:02:18.817865921Z [janus.plugin.videoroom-0x7f9e3fb7e830] No WebRTC media anymore; 0x7f9df8e5f1d0 0x7f9e60f1f970
2024-10-24T07:02:18.820019772Z [6571659618785227] Handle and related resources freed; 0x7f9e2dd104d0 0x7f9e51e7e5a0
2024-10-24T07:02:18.821921905Z [WSS-0x7f9e3381c4f0] Destroying WebSocket client
2024-10-24T07:02:18.825347627Z [WSS-0x7f9e3381ca90] Destroying WebSocket client
2024-10-24T07:02:18.834477339Z [WSS-0x7f9e58c41050] Destroying WebSocket client
2024-10-24T07:02:18.837975765Z [WSS-0x7f9e338213f0] Destroying WebSocket client
2024-10-24T07:02:18.841785953Z [WSS-0x7f9e525a9ea0] Destroying WebSocket client
2024-10-24T07:02:18.850660737Z [WSS-0x7f9e3381f9b0] Destroying WebSocket client

and we also have crash on another server, before crash it had strange messages, unfortuately we don't have asan enabled on this version:

2024-10-24T05:22:57.608134337Z ESC[31m[ERR]ESC[0m [ice.c:janus_ice_handle_attach_plugin:1528] [1115535250946754] Got error 0 (Error creating thread: Resource temporarily unavailable) trying
to launch the handle thread...
2024-10-24T05:22:57.608144372Z Detaching handle from JANUS VideoRoom plugin; 0x7f2b9dc0f020 0x7f2b9ac96740 0x7f2b9dc0f020 0x7f2b9e1d8f20
2024-10-24T05:22:57.608146956Z ESC[31m[ERR]ESC[0m [janus.c:janus_process_incoming_request:1268] Couldn't attach to plugin 'janus.plugin.videoroom', error '-1'
2024-10-24T05:22:57.611193773Z [WSS-0x7f2cbb8d6590] Destroying WebSocket client
2024-10-24T05:22:57.611202457Z Destroying session 8543728103486778; 0x7f2b9d4a60e0
2024-10-24T05:22:57.777902886Z Creating new session: 2405354435022227; 0x7f2b9b9cc950
2024-10-24T05:22:57.779105497Z Creating new handle in session 2405354435022227: 8941610963864597; 0x7f2b9b9cc950 0x7f2b9dc0f660
2024-10-24T05:22:57.782748571Z ESC[31m[ERR]ESC[0m [ice.c:janus_ice_handle_attach_plugin:1528] [8941610963864597] Got error 0 (Error creating thread: Resource temporarily unavailable) trying
to launch the handle thread...
2024-10-24T05:22:57.782754197Z Detaching handle from JANUS VideoRoom plugin; 0x7f2b9dc0f660 0x7f2b9ac964a0 0x7f2b9dc0f660 0x7f2b9fd057f0
2024-10-24T05:22:57.782756726Z ESC[31m[ERR]ESC[0m [janus.c:janus_process_incoming_request:1268] Couldn't attach to plugin 'janus.plugin.videoroom', error '-1'
2024-10-24T05:22:57.785420646Z [WSS-0x7f2cbb8d6a90] Destroying WebSocket client
2024-10-24T05:22:57.785460698Z Destroying session 2405354435022227; 0x7f2b9b9cc950
2024-10-24T05:22:57.918043092Z [WSS-0x7f2cbb69f3d0] Destroying WebSocket client
2024-10-24T05:22:57.918067859Z Destroying session 6480537602660548; 0x7f2b9e804bd0
2024-10-24T05:22:57.918070597Z Detaching handle from JANUS VideoRoom plugin; 0x7f2b9e1c8610 0x7f2b9e1ca880 0x7f2b9e1c8610 0x7f2b9e804c90
2024-10-24T05:22:57.952247654Z Creating new session: 145744641918553; 0x7f2b9e1d9160
2024-10-24T05:22:57.954430940Z Creating new handle in session 145744641918553: 4885986968201247; 0x7f2b9e1d9160 0x7f2b9dc0f7f0
2024-10-24T05:22:57.960693299Z ESC[31m[ERR]ESC[0m [janus.c:janus_transport_requests:3578] Got error 0 (Error creating thread: Resource temporarily unavailable) trying to push task in thread
pool...
2024-10-24T05:22:57.962866653Z [WSS-0x7f2cbb8d74f0] Destroying WebSocket client
2024-10-24T05:22:57.962896045Z Destroying session 145744641918553; 0x7f2b9e1d9160
2024-10-24T05:22:57.962899170Z Detaching handle from JANUS VideoRoom plugin; 0x7f2b9dc0f7f0 0x7f2ba5518cb0 0x7f2b9dc0f7f0 0x7f2b9e804e70
2024-10-24T05:22:57.965095743Z [4885986968201247] Handle and related resources freed; 0x7f2b9dc0f7f0 0x7f2b9e1d9160
2024-10-24T05:22:58.118391275Z [WSS-0x7f2cb1ac6020] Destroying WebSocket client
2024-10-24T05:22:58.118402152Z Destroying session 2176121126391641; 0x7f2b9e7fec30
2024-10-24T05:22:58.118404756Z Detaching handle from JANUS VideoRoom plugin; 0x7f2b9fdc0800 0x7f2b9e1c94a0 0x7f2b9fdc0800 0x7f2b9e7feed0
2024-10-24T05:22:58.145426850Z Creating new session: 133155872958563; 0x7f2b9e1d9160
2024-10-24T05:22:58.146833021Z Creating new handle in session 133155872958563: 7842851111017661; 0x7f2b9e1d9160 0x7f2b9dc0fb20
2024-10-24T05:22:58.932496292Z Janus version: 1300 (1.3.0)
2024-10-24T05:22:58.932530620Z Janus commit: bcc31f873addbeeded71642bc238ae86fde411c8
2024-10-24T05:22:58.932533915Z Compiled on:  Mon Oct 21 09:59:54 UTC 2024
@spscream spscream added the multistream Related to Janus 1.x label Oct 24, 2024
@lminiero
Copy link
Member

We'll need more details than that 😁
A gdb trace may help see if there's a deadlock somewhere. You can also try to do a git bisect to spot the commit that causes it for you, if you know for sure that on a specific commit it wasn't happening.

@atoppi
Copy link
Member

atoppi commented Oct 24, 2024

The logs suggest either a deadlock or a resource exhaustion, with the latter potentially being an effect of the former.
As you already know, the first step should be running with ASan in order to exlcude "use-after-free" situations that lead to undefined behaviors.

@spscream
Copy link
Contributor Author

ok, we build and run version with asan, I will return with additional info

@spscream
Copy link
Contributor Author

voex-logs.zip
we got the same situation today, I did gdb dump and logs.

situattion were at ~05:30

@spscream
Copy link
Contributor Author

this is asan version, so no warnings were fired

@spscream
Copy link
Contributor Author

we run branch without d014cff commit and it works normally

@lminiero
Copy link
Member

@spscream is that the result of a git bisect or just trying different commits that involved the VideoRoom?
At any rate, if that's the culprit, I'd say that a full gdb trace when an instance deadlocks is more helpful than a libasan dump, as it might tell us on which lines/locks the instance is stuck on. The crash, in fact, may be the result of stuff piling up while the server is unresponding, and so of limited use if we don't know why it locked in the first place.

@lminiero
Copy link
Member

I'm not sure line numbers match the commits I'm looking at: line 4521, for instance, is an unlock, and it makes no sense for many threads to be stuck there. Not sure if line numbers are messed up because libasan was built in (I seem to remember asan and gdb don't always get along).

@spscream
Copy link
Contributor Author

I tried to rebase on different commits from master, since we are using our custom logic(https://github.com/spscream/janus-gateway/tree/bisect/1.3.0-1 branch) and I can't use bisect directly.

@atoppi
Copy link
Member

atoppi commented Oct 25, 2024

@spscream can you double check that Asan has been linked?

ldd /path/to/janus/bin | grep -i asan

@spscream
Copy link
Contributor Author

sure:

# ldd /usr/bin/janus | grep -i asan
        libasan.so.8 => /lib/libasan.so.8 (0x00007f9b853d1000)

@lminiero
Copy link
Member

I'm sorry but we can't look at forks. Please try and reproduce with a branch from this repo.

@spscream
Copy link
Contributor Author

I reviewed commit d014cff and found lock here d014cff#diff-06af86ce190449ef81620ff5ad5f6ecda030a07fff31683a30c0a1960657e279R5828 doesn't unlock in case of error of opening udp socket - it jumps to prepare_response. Added pr with fix

@spscream
Copy link
Contributor Author

It got stuck today on branch with fix today. We will switch to master version and provide log with gdb info.

@lminiero
Copy link
Member

Ack, reopening then.

@lminiero lminiero reopened this Oct 29, 2024
@spscream
Copy link
Contributor Author

#3469 another one found and checked all other cases, seems like they are ok. Please don't close the issue, we will test it on load tomorrow morning again and I'll provide feedback.

@lminiero
Copy link
Member

I didn't close the issue yesterday, your PR did because it mentioned this issue with a "fix" in it 🤭

@spscream
Copy link
Contributor Author

I didn't close the issue yesterday, your PR did because it mentioned this issue with a "fix" in it 🤭

:) today feedback is good, janus is working without hangs, we will run it on more servers to check it under higher load.

@spscream
Copy link
Contributor Author

nope, hangs again

@atoppi
Copy link
Member

atoppi commented Oct 30, 2024

Have you tried with locks debugging enabled?
A missing unlock is only a potential reason of deadlock. Also a double lock or unlock might lead to that (and other undefined behaviors).

@spscream
Copy link
Contributor Author

I'm checking it now.
We don't run with locks debugging, it won't work normally since of high load on our servers. We also testing version of our app running from master janus(I moved our functionallity from janus to business logic of our app).

Plan now is the following:

  • I'm looking into our current code, what could possibly lead to deadlock (based on gdb dump I provided above and based on our code it hangs on sessions_mutex since most of threads are waiting of it).
  • run version of our app on master and wait for hangs to obtain fresh logs and gdb dump

@spscream
Copy link
Contributor Author

I have suspicion on race condition between streams_mutex and room->mutex,
there is a place in janus_videroom_notify_about_publisher where room->mutex obtained:

 	if(room && !g_atomic_int_get(&room->destroyed)) {
 		janus_refcount_increase(&room->ref);
 		janus_mutex_lock(&room->mutex);
		janus_videoroom_notify_participants(p, pub, FALSE);
		janus_mutex_unlock(&room->mutex);
 		janus_refcount_decrease(&room->ref);
	}

theoretically if in other place they obtained in different order: room->mutex is obtained first and streams_mutex second, it possibly could be a deadlock.

@spscream
Copy link
Contributor Author

spscream commented Oct 30, 2024

this place:

janus_videoroom_notify_about_publisher(participant, FALSE);
is calling janus_videoroom_notify_about_publisher so streams mutex is locked first and room->mutex second

and other place calling in other order is listparticipants command

janus_mutex_lock(&videoroom->mutex);

@spscream
Copy link
Contributor Author

this is what I see in my gdb:

#2  0x00007f027a31f2d5 in janus_videoroom_notify_about_publisher (p=p@entry=0x6150027d2680, update=update@entry=0) at plugins/janus_videoroom.c:4390

and other one is

#2  0x00007f027a37e5aa in janus_videoroom_process_synchronous_request (session=<optimized out>, session@entry=0x607001d10530, message=<optimized out>, message@entry=0x607001e08860) at plugins/janus_videoroom.c:7090

based on our codebase second is listparticipants request trying to obtain streams_mutex

@spscream
Copy link
Contributor Author

We still can run master on our servers to prove it with gdb, if it is neccessary

@lminiero
Copy link
Member

Yes, an inverted lock order can definitely cause that problem. Maybe we should take the room->mutex lock out of the janus_videoroom_notify_about_publisher function, and always do it outside instead, so that it can be put before mutexes it always precedes in other contexts. This would also mean the lock would cover more code and could take longer to be released, though, so it's a matter of figuring out if this could have an impact on performance in certain use cases.

@brave44
Copy link
Contributor

brave44 commented Oct 31, 2024

Yes, an inverted lock order can definitely cause that problem. Maybe we should take the room->mutex lock out of the janus_videoroom_notify_about_publisher function, and always do it outside instead, so that it can be put before mutexes it always precedes in other contexts. This would also mean the lock would cover more code and could take longer to be released, though, so it's a matter of figuring out if this could have an impact on performance in certain use cases.

We can test performance if you provide pr with fix, we have pretty solid testing environment.

@spscream
Copy link
Contributor Author

or other variant is make lock of streams inside of janus_videoroom_notify_about_publisher and remove outer lock

@spscream
Copy link
Contributor Author

but I think it could lead to strange races - state of streams can change while we entering janus_videoroom_notify_about_publisher

@spscream
Copy link
Contributor Author

spscream commented Nov 4, 2024

@lminiero @atoppi so, what will we do on this?

@atoppi
Copy link
Member

atoppi commented Nov 4, 2024

I'm preparing a patch

@spscream
Copy link
Contributor Author

spscream commented Nov 4, 2024

#3473 I suggest this, just in case, if you have better option it would be great

@atoppi
Copy link
Member

atoppi commented Nov 4, 2024

@spscream please test #3474

spscream pushed a commit to spscream/janus-gateway that referenced this issue Nov 5, 2024
@brave44
Copy link
Contributor

brave44 commented Nov 9, 2024

@spscream please test #3474

Fix helps and working, no stuck any more. )

@lminiero
Copy link
Member

Closing as we merged the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multistream Related to Janus 1.x
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants