Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jack stops working after some time #307

Open
loblik opened this issue Nov 11, 2017 · 64 comments
Open

Jack stops working after some time #307

loblik opened this issue Nov 11, 2017 · 64 comments

Comments

@loblik
Copy link

loblik commented Nov 11, 2017

I start jack using following command:
jackd -R -d alsa -d hw:Device -r 44100

Then I also load netmanager with:
jack_load netmanager -i "-c"

After some time jack stops working with error "Connection time dout". This is probably because sem_timedwait in JackPosixSemaphore.cpp returns ETIMEDOUT.

I noticed JackAndroidSemaphore.cpp has very similiar code and it does some special handling of ETIMEDOUT error. So could it be something is missing here?

It's running as realtime process on quadcore ARMv7 without any other workload. So I guess performance is not a problem.

Full log follows.

Nov 10 01:10:23 echo jackd[4543]: jackdmp 1.9.11
Nov 10 01:10:23 echo jackd[4543]: Copyright 2001-2005 Paul Davis and others.
Nov 10 01:10:23 echo jackd[4543]: Copyright 2004-2014 Grame.
Nov 10 01:10:23 echo jackd[4543]: jackdmp comes with ABSOLUTELY NO WARRANTY
Nov 10 01:10:23 echo jackd[4543]: This is free software, and you are welcome to redistribute it
Nov 10 01:10:23 echo jackd[4543]: under certain conditions; see the file COPYING for details
Nov 10 01:10:23 echo jackd[4543]: JACK server starting in realtime mode with priority 10
Nov 10 01:10:23 echo jackd[4543]: self-connect-mode is "Don't restrict self connect requests"
Nov 10 01:10:23 echo jackd[4543]: audio_reservation_init
Nov 10 01:10:23 echo jackd[4543]: Acquire audio card Audio1
Nov 10 01:10:23 echo jackd[4543]: creating alsa driver ... hw:Device|hw:Device|1024|2|44100|0|0|nomon|swmeter|-|32bit
Nov 10 01:10:23 echo jackd[4543]: configuring for 44100Hz, period = 1024 frames (23.2 ms), buffer = 2 periods
Nov 10 01:10:23 echo jackd[4543]: ALSA: final selected sample format for capture: 24bit little-endian in 3bytes format
Nov 10 01:10:23 echo jackd[4543]: ALSA: use 2 periods for capture
Nov 10 01:10:23 echo jackd[4543]: ALSA: final selected sample format for playback: 24bit little-endian in 3bytes format
Nov 10 01:10:23 echo jackd[4543]: ALSA: use 2 periods for playback
Nov 10 01:10:24 echo jack_wait[4544]: server is available
Nov 10 01:10:24 echo jackd[4543]: Starting Jack NetManager
Nov 10 01:10:24 echo jackd[4543]: Listening on '225.3.19.154:19000'
Nov 10 01:10:24 echo jack_load[4552]: netmanager is running.
Nov 10 01:10:24 echo jack_load[4552]: client name = netmanager
Nov 10 01:10:24 echo systemd[1]: Started Jack Audio Connection Kit daemon.
Nov 11 06:43:21 echo jackd[4543]: JackPosixSemaphore::TimedWait err = Connection timed out
Nov 11 06:43:21 echo jackd[4543]: SuspendRefNum error
Nov 11 06:43:21 echo jackd[4543]: JackClient::Execute error name = netmanager
Nov 11 06:43:21 echo jackd[4543]: JackPosixProcessSync::LockedTimedWait error usec = 464380 err = Connection timed out
Nov 11 06:43:21 echo jackd[4543]: JackEngine::ClientDeactivate wait error ref = 3 name = netmanager
Nov 11 10:56:34 echo jackd[4543]: JackPosixProcessSync::LockedTimedWait error usec = 5000000 err = Connection timed out
Nov 11 10:56:34 echo jackd[4543]: Driver is not running
Nov 11 10:56:34 echo jackd[4543]: Cannot create new client
Nov 11 10:56:34 echo jackd[4543]: CheckSize error size = 32 Size() = 12
Nov 11 10:56:34 echo jackd[4543]: CheckRead error
Nov 11 10:56:34 echo jackd[4543]: CheckSize error size = -1 Size() = 4
Nov 11 10:56:34 echo jackd[4543]: CheckRead error
Nov 11 10:56:34 echo jackd[4543]: CheckSize error size = 0 Size() = 12
Nov 11 10:56:34 echo jackd[4543]: CheckRead error
Nov 11 16:22:29 echo jackd[4543]: Jack main caught signal 15

@falkTX
Copy link
Member

falkTX commented Nov 11, 2017

JackPosixSemaphore is not used anymore since 1.9.11-RC1, please update your jack2 version and try again.
no point trying to fix this since the latest code does not use the code where the timeout occurs...

@loblik
Copy link
Author

loblik commented Nov 14, 2017

Ok. I've recompiled from source and now semaphore error is gone. But JackPosixProcessSync::LockedTimedWait error is still there.

Nov 13 12:24:42 echo jackd[2911]: jackdmp 1.9.12
Nov 13 12:24:42 echo jackd[2911]: Copyright 2001-2005 Paul Davis and others.
Nov 13 12:24:42 echo jackd[2911]: Copyright 2004-2016 Grame.
Nov 13 12:24:42 echo jackd[2911]: Copyright 2016-2017 Filipe Coelho.
Nov 13 12:24:42 echo jackd[2911]: jackdmp comes with ABSOLUTELY NO WARRANTY
Nov 13 12:24:42 echo jackd[2911]: This is free software, and you are welcome to redistribute it
Nov 13 12:24:42 echo jackd[2911]: under certain conditions; see the file COPYING for details
Nov 13 12:24:42 echo jackd[2911]: JACK server starting in realtime mode with priority 10
Nov 13 12:24:42 echo jackd[2911]: self-connect-mode is "Don't restrict self connect requests"
Nov 13 12:24:43 echo jackd[2911]: creating alsa driver ... hw:Device|hw:Device|1024|2|44100|0|0|nomon|swmeter|-|32bit
Nov 13 12:24:43 echo jackd[2911]: configuring for 44100Hz, period = 1024 frames (23.2 ms), buffer = 2 periods
Nov 13 12:24:43 echo jackd[2911]: ALSA: final selected sample format for capture: 24bit little-endian in 3bytes format
Nov 13 12:24:43 echo jackd[2911]: ALSA: use 2 periods for capture
Nov 13 12:24:43 echo jackd[2911]: ALSA: final selected sample format for playback: 24bit little-endian in 3bytes format
Nov 13 12:24:43 echo jackd[2911]: ALSA: use 2 periods for playback
Nov 13 12:24:43 echo jack_wait[2912]: server is available
Nov 13 12:24:43 echo jackd[2911]: Starting Jack NetManager
Nov 13 12:24:43 echo jackd[2911]: Listening on '225.3.19.154:19000'
Nov 13 12:24:43 echo jack_load[2920]: netmanager is running.
Nov 13 12:24:43 echo jack_load[2920]: client name = netmanager
Nov 13 12:24:44 echo systemd[1]: Started Jack Audio Connection Kit daemon.
Nov 14 20:14:39 echo jackd[2911]: Takes physical 2 audio input(s) for slave
Nov 14 20:14:39 echo jackd[2911]: Takes physical 2 audio output(s) for slave
Nov 14 20:14:39 echo jackd[2911]: Takes physical 0 MIDI input(s) for slave
Nov 14 20:14:39 echo jackd[2911]: Takes physical 0 MIDI output(s) for slave
Nov 14 20:14:39 echo jackd[2911]: Sending parameters to lemra...
Nov 14 20:14:39 echo jackd[2911]: New NetMaster started
Nov 14 20:14:39 echo jackd[2911]: **************** Network parameters ****************
Nov 14 20:14:39 echo jackd[2911]: Name : lemra
Nov 14 20:14:39 echo jackd[2911]: Protocol revision : 8
Nov 14 20:14:39 echo jackd[2911]: MTU : 1500
Nov 14 20:14:39 echo jackd[2911]: Master name : echo
Nov 14 20:14:39 echo jackd[2911]: Slave name : lemra
Nov 14 20:14:39 echo jackd[2911]: ID : 1
Nov 14 20:14:39 echo jackd[2911]: Transport Sync : no
Nov 14 20:14:39 echo jackd[2911]: Send channels (audio - midi) : 2 - 0
Nov 14 20:14:39 echo jackd[2911]: Return channels (audio - midi) : 2 - 0
Nov 14 20:14:39 echo jackd[2911]: Sample rate : 44100 frames per second
Nov 14 20:14:39 echo jackd[2911]: Period size : 1024 frames per period
Nov 14 20:14:39 echo jackd[2911]: Network latency : 5 cycles
Nov 14 20:14:39 echo jackd[2911]: SampleEncoder : Float
Nov 14 20:14:39 echo jackd[2911]: Slave mode : async
Nov 14 20:14:39 echo jackd[2911]: ****************************************************
Nov 14 20:14:39 echo jackd[2911]: Waiting for a slave...
Nov 14 20:14:39 echo jackd[2911]: Synching with latency = 0
Nov 14 20:14:39 echo jackd[2911]: Synching with latency = 1
Nov 14 20:14:39 echo jackd[2911]: Synching with latency = 2
Nov 14 20:14:39 echo jackd[2911]: Synching with latency = 3
Nov 14 20:14:39 echo jackd[2911]: Synching with latency = 4
Nov 14 20:14:43 echo jackd[2911]: Send fd = 16 err = Connection refused
Nov 14 20:14:43 echo jackd[2911]: Send connection lost error = Connection refused, 'lemra' exiting
Nov 14 20:14:43 echo jackd[2911]: Exiting 'lemra' 225.3.19.154
Nov 14 20:14:43 echo jackd[2911]: Waiting for a slave...
Nov 14 21:59:51 echo jackd[2911]: JackPosixProcessSync::LockedTimedWait error usec = 5000000 err = Connection timed out
Nov 14 21:59:51 echo jackd[2911]: Driver is not running
Nov 14 21:59:51 echo jackd[2911]: Cannot create new client
Nov 14 21:59:51 echo jackd[2911]: CheckSize error size = 32 Size() = 12
Nov 14 21:59:51 echo jackd[2911]: CheckRead error
Nov 14 21:59:51 echo jackd[2911]: CheckSize error size = -1 Size() = 4
Nov 14 21:59:51 echo jackd[2911]: CheckRead error
Nov 14 21:59:51 echo jackd[2911]: CheckSize error size = 0 Size() = 12
Nov 14 21:59:51 echo jackd[2911]: CheckRead error

@arre525
Copy link
Contributor

arre525 commented Jan 14, 2018

Hi loblik,

Can you run jackd with -v (verbose mode) and paste the output while it is failing?

I had the same issue and narrowed it down to the alsa driver thread being stuck in xruns. I have a fix in my local copy, and will issue a pull request for it soon. Would be nice to know if it would help you too.

@sadko4u
Copy link

sadko4u commented Jan 25, 2018

I'm getting the same problem. I'm just running LSP plugins as standalone JACK applications in debug mode and spontaneously get:

Jack: JackRequest::Notification
Jack: JackDriver::ClientNotify ref = 1 driver = freewheel name = freewheel notify = 18
Jack: JackExternalClient::ClientNotify ref = 2 client = ardour name = ardour notify = 18
Jack: JackDriver::ClientNotify ref = 1 driver = freewheel name = freewheel notify = 18
Jack: JackDriver::ClientNotify ref = 1 driver = freewheel name = freewheel notify = 18
Jack: JackExternalClient::ClientNotify ref = 2 client = ardour name = ardour notify = 18
Jack: JackDriver::ClientNotify ref = 1 driver = freewheel name = freewheel notify = 18
Jack: JackEngine::ClientNotify: no callback for notification = 4
Jack: JackEngine::ClientNotify: no callback for notification = 4
Jack: JackExternalClient::ClientNotify ref = 2 client = ardour name = ardour notify = 4
Jack: JackSocketServerChannel::Execute : fPollTable i = 2 fd = 8
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 7
Jack: JackSocketServerChannel::Execute : fPollTable i = 2 fd = 8
Jack: JackSocketServerChannel::ClientCreate socket
Jack: JackSocketServerChannel::BuildPoolTable size = 4
Jack: JackSocketServerChannel::BuildPoolTable fSocketTable i = 1 fd = 7
Jack: JackSocketServerChannel::BuildPoolTable fSocketTable i = 2 fd = 8
Jack: JackSocketServerChannel::BuildPoolTable fSocketTable i = 3 fd = 11
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 7
Jack: JackSocketServerChannel::Execute : fPollTable i = 2 fd = 8
Jack: JackSocketServerChannel::Execute : fPollTable i = 3 fd = 11
Jack: JackSocketServerChannel::Execute : poll client error err = No such file or directory
Jack: JackSocketServerChannel::ClientKill ref = -1 fd = 11
Jack: Client was not opened : probably correspond to server_check
Jack: JackClientSocket::Close

On the plugin side:
[TRC][../src/ui/ws/x11/X11Window.cpp: 296] handle_event: redraw location = 0 x 0, size = 913 x 549
[TRC][../src/ui/ws/x11/X11Display.cpp: 898] handleEvent: The keyboard state was changed!
SuspendRefNum error
JackClient::Execute error name = spectrum_analyzer_x1
[WRN][/home/sadko/eclipse/lsp-plugins-rc-1.1.1/include/container/jack/wrapper.h: 200] shutdown: JACK NOTIFICATION: shutdown
Jack: JackLibClient::ShutDown
Jack: JackClient::ShutDown
Jack: JackPosixThread::Terminate
[TRC][../src/container/jack.cpp: 43] jack_ui_sync: Connection to JACK was lost
Jack: JackClient::Deactivate
[TRC][/home/sadko/eclipse/lsp-plugins-rc-1.1.1/include/core/plugin.h: 63] deactivate_ui: UI has been deactivated
Jack: JackClient::PortUnRegister port_index = 67
Server is not running
Jack: JackClient::PortUnRegister port_index = 68
Server is not running
Jack: jack_client_close
Jack: JackClient::Close ref = 3
Jack: JackClient::Deactivate
Jack: JackSocketClientChannel::Stop
Jack: JackPosixThread::Kill
Server is not running
Jack: JackClientSocket::Close
Jack: JackClientSocket::Close
Jack: JackLibClient::~JackLibClient
Jack: JackShmReadWritePtr1::~JackShmReadWritePtr1 3
Jack: Succeeded in unlocking 422 byte memory area
Jack: JackLibGlobals Destroy c6ae00
Jack: ~JackLibGlobals
Jack: no message buffer overruns
Jack: JackPosixThread::ThreadHandler : exit
Jack: JackPosixThread::Stop
Jack: JackShmReadWritePtr::~JackShmReadWritePtr 1
Jack: Succeeded in unlocking 1187 byte memory area
Jack: JackShmReadWritePtr::~JackShmReadWritePtr 0
Jack: Succeeded in unlocking 82280346 byte memory area
Jack: jack_client_close res = -1
[TRC][../src/container/jack.cpp: 80] jack_ui_sync: Trying to connect to JACK
Jack: jack_client_open spectrum_analyzer_x1
Jack: JackLibGlobals Init 0
Jack: JackLibGlobals
Jack: JackPosixThread::StartImp : create non RT thread
Jack: JackPosixThread::ThreadHandler : start
Jack: JackGenericClientChannel::ServerCheck = default
Jack: JackClientSocket::Connect : addr.sun_path /dev/shm/jack_default_1000_0
Jack: JackClientSocket::Close
Jack: JackLibClient::JackLibClient table = c6ae20
Jack: JackLibClient::Open name = spectrum_analyzer_x1
Jack: JackSocketClientChannel::Open name = spectrum_analyzer_x1
Jack: JackClientSocket::Connect : addr.sun_path /dev/shm/jack_default_1000_0
Jack: JackServerSocket::Bind : addr.sun_path /dev/shm/jack_spectrum_analyzer_x1_1000_0
Jack: JackSocketClientChannel::Start
Jack: JackPosixThread::StartImp : create non RT thread
Jack: JackPosixThread::ThreadHandler : start
Jack: JackSocketClientChannel::Init
Jack: JackServerSocket::Close /dev/shm/jack_spectrum_analyzer_x1_1000_0
Jack: JackClient::ClientNotify ref = 0 name = system notify = 0
Jack: JackClient::AddClient name = system, ref = 0
Jack: JackLinuxFutex::Connect name = jack_sem.1000_default_system
Jack: JackClient::ClientNotify ref = 1 name = freewheel notify = 0
Jack: JackClient::AddClient name = freewheel, ref = 1
Jack: JackLinuxFutex::Connect name = jack_sem.1000_default_freewheel
Jack: JackClient::ClientNotify ref = 2 name = ardour notify = 0
Jack: JackClient::AddClient name = ardour, ref = 2
Jack: JackLinuxFutex::Connect name = jack_sem.1000_default_ardour
Jack: JackShmReadWritePtr::Init 1 -1
Jack: Succeeded in locking 1187 byte memory area
Jack: JackShmReadWritePtr::Init 0 -1
Jack: Succeeded in locking 82280346 byte memory area
Jack: JackShmReadWritePtr1::Init 3 -1
Jack: Succeeded in locking 422 byte memory area
Jack: JackClient::SetupDriverSync driver sem in flush mode
Jack: JackLinuxFutex::Connect name = jack_sem.1000_default_spectrum_analyzer_x1
Jack: Clock source : system clock via clock_gettime
Jack: JackLibClient::Open name = spectrum_analyzer_x1 refnum = 3
Jack: JackClient::PortRegister ref = 3 name = spectrum_analyzer_x1:in0 type = 32 bit float mono audio port_index = 67
Jack: JackClient::PortRegister ref = 3 name = spectrum_analyzer_x1:out0 type = 32 bit float mono audio port_index = 68
[TRC][/home/sadko/eclipse/lsp-plugins-rc-1.1.1/include/core/plugin.h: 53] activate_ui: UI has been activated
[TRC][../src/ui/ctl/CtlButton.cpp: 84] commit_value: commit value=1.000000
[TRC][../src/ui/ctl/CtlButton.cpp: 84] commit_value: commit value=0.000000
Jack: JackClient::Activate
Jack: JackPosixThread::StartImp : create non RT thread
Jack: JackPosixThread::ThreadHandler : start
Jack: JackClient::kBufferSizeCallback buffer_size = 1024
Jack: JackClient::Init : period = 21333 computation = 100 constraint = 21333
Jack: JackPosixThread::AcquireRealTimeImp priority = 75
Jack: JackClient::ClientNotify ref = 3 name = spectrum_analyzer_x1 notify = 2
Jack: JackClient::kActivateClient name = spectrum_analyzer_x1 ref = 3
[TRC][../src/container/jack.cpp: 83] jack_ui_sync: Successful connected to JACK

JACK version:
jackdmp version 1.9.12 tmpdir /dev/shm protocol 8

@Rippert
Copy link

Rippert commented Apr 15, 2018

I had the same issue and narrowed it down to the alsa driver thread being stuck in xruns. I have a fix in my local copy, and will issue a pull request for it soon. Would be nice to know if it would help you too.

arre525 - I have similar issues with both 1.9.11 and 1.9.12 versions. My use case is different, I don't use the network manger, I use various plugin hosts and effects programs for realtime music performance. I've suspected the ALSA driver also, but did not have the insight on how to fix it.

I see that you have a clone of the Jack2 repo, but I could not find any commit that looked like a fix for the ALSA driver. I would be happy to test your fix, or to provide you with verbose data logs from a driver failure event. If you're still monitoring this issue, please get back to me. I would love to get this issue fixed and be able to reliably use jack.

@arre525
Copy link
Contributor

arre525 commented Apr 22, 2018

@Rippert - Yes, I had been planning on pushing my change to my repo and then issueing a pull request, but when doing some final tests I noticed there were still more issues then just the one I fixed. (And due to lack of time to do a more proper analysis I stopped working on it altogether).

Anyway, the main point was that in verbose mode, you'd get something like this:

Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 9
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -1177622.016 msecs
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 9
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -1177651.968 msecs
.. (repeats over and over again)

Which goes on and on.
What I did (see http://arre234.blogspot.be/ at section "Patching jackd to deal with latency spikes") was to force the jackd alsa "driver" (part of jackd) to do an alsa_prepare again each time and xrun occured. This helps the linux kernel sound driver to better recover before restarting and made quite a difference. But.. as my analysis showed, not in all cases. Further analysis would have involved actually digging into the specific kernel code to see what it was doing wrong (but again, this is then a kernel driver issue, and not jackd).

Anyway, I'll commit what I had already on my repo (or you just patch it from the diff on my blog), and hope it helps you as well!

@arre525
Copy link
Contributor

arre525 commented Apr 22, 2018

@Rippert : See arre525@a4c9ee5

@Rippert
Copy link

Rippert commented Apr 22, 2018

@arre525 : Thanks. I have seen similar errors. I didn't save them, as you said, they go on and on. I've also seen your blog before. It's one of the reasons I kept trying to get Jack to work on a Raspberry Pi. I will try your code when I get back home.

Which version of the kernel are you using? The xrun problem seemed to get better for me recently when Raspbian upgraded to 4.14. Not totally fixed, just less frequent. The lockups do seem to occur after a burst of xruns, latency spikes as you call them. Like when opening or closing a process that connects to Jack. Sometimes when starting up Jack itself. Pretty random, and increasing the number of samples per period doesn't really fix it as you still get the bursts of xruns occasionally.

You should go ahead and submit your pull request to the official repo. You've documented it and it's shortcomings pretty well, so the maintainers can decide whether it's worth including.

Thanks,
-Ted

@Rippert
Copy link

Rippert commented Apr 22, 2018

@falkTX and other jack audio devs: I have seen the problem arre525 mentions on my Macbook Air with a core i5 running the KXStudio OS, so it's not just an ARM problem. It is a lot less frequent on my x86 machine, but still happens occasionally. I'm not sure if there's anything you can do if it's a bug in the kernel code as arre525 suspects, but just to let you know.

@arre525
Copy link
Contributor

arre525 commented Apr 23, 2018

@Rippert : I am using kernel 4.18.8 (with the 4.18.8-RT9 patchset for realtime behavior)

@falkTX
Copy link
Member

falkTX commented Mar 22, 2019

@arre525 your change makes sense, and something I worked-around in ARM devices too (though in a different way).
Since we have a develop branch now in jack2, I will merge your fix there.

@7890
Copy link
Contributor

7890 commented Mar 26, 2019

@loblik @sadko4u @Rippert are issues solved for you with the most recent jack?
@arre525 I've added you to AUTHORS.rst (#445)

@sadko4u
Copy link

sadko4u commented Mar 26, 2019

@7890 not tested yet. I believe I should build and install JACK from the develop branch?

@7890
Copy link
Contributor

7890 commented Mar 26, 2019

@sadko4u yes, thanks

@Rippert
Copy link

Rippert commented Mar 27, 2019

Just did a test on Raspbian Lite - looks like it still has the problem.

Control:
I fired up jackd 1.9.11 installed via apt-get with -r 48000 and p 64. Then start Guitarix and loaded a heavy preset. after a couple minutes, I could see the endlessly scrolling xruns in the Guitarix messages window, jack_lsp returned only errors and I could not stop jackd with killall jackd. I had to use killall -KILL jackd. This is what has happened in the past.

Experiment:
Cloned jack2 develop branch. ./waf configure succeeded, ./waf build succeeded, sudo ./waf install succeeded. jackd -V said jackdmp version 1.9.12 tmpdir /dev/shm protocol 8. loaded jackd with -r 48000, -p 64. Ran guitarix with heavy preset. After a couple minutes, rolling xruns in messages window and jack_lsp errors. again had to use the -KILL switch to kill jackd.

Please let me know if you want me to try something differently, I will try.

Also, please note - I have been using Arch Linux Arm for the last month with the default repo version of jack2 from pacman (1.9.12), and it does not have this problem. This is with the same hardware. I can get some heavy xruns, but it is always recoverable by changing the buffersize, and I can always kill jackd without the -KILL switch. This seems in line with the statement by @arre525 that his patch does not fix all cases and a change to the kernel may be needed. The Arch kernel is more recent and has the PREEMPT patch compared with the Raspbian kernel. There are surely many other differences also.

Let me know if you would like more info. This was just a very quick test.

Thanks for all your work on this,
-Ted

@falkTX
Copy link
Member

falkTX commented Mar 27, 2019

Getting a new kernel for raspbian would be the final test then, if it is possible.

@Rippert
Copy link

Rippert commented Mar 27, 2019

I tried using rpi-update to upgrade the kernel. it started at:
Linux raspberrypi 4.14.98-v7+ #1200 SMP Tue Feb 12 20:27:48 GMT 2019 armv7l GNU/Linux
and updated to:
Linux raspberrypi 4.19.30-v7+ #1209 SMP Tue Mar 26 13:14:20 GMT 2019 armv7l GNU/Linux

The new kernel didn't change anything in terms of jackd locking up. It still ended up with rolling xruns and needing the -KILL switch to stop it.

I'm not sure that this is a valid test as the kernel wasn't compiled with the same options as the Arch Linux Kernel, for example it doesn't have the PREEMPT option. I'll check the uname -a output on Arch tomorrow when I get a chance to re-configure the RPi setup.

There are also other differences between Arch and Raspbian, Arch being a much more minimal system than even Raspbian Lite intrinsically.

I could try compiling a new kernel for Raspbian, but I'd need to be able to get the compilation parameters of the Arch kernel to try to get them equal. If anyone has suggestions for where to find such information, let me know.

@arre525
Copy link
Contributor

arre525 commented Mar 27, 2019

Hi,

Thanks for merging the change and promoting me to co-author:)
Some clarifications:

  • I used the PREEMT**-RT** patch, not just the PREEMPT kernel option. This is a patchset that is not fully upstreamed yet, but makes the kernel much more low-latent than it can otherwise be.
  • It is normal for jackd to report xruns if it takes too long for it to get scheduled to feed or consume audio buffer data from the kernel. The idea is of course that, the bigger the buffer, the more margin (in time) before the under/overrun occurs. The more low latent the kernel scheduler (non-PREEMPT is bad, PREEMPT is good, PREEMPT RT is better, ..), the more frequent jackd gets scheduled and can do this, avoiding xruns. @Rippert I am assuming that the volume of xruns you are seeing is not "normal" (not in proportion to possible latency spikes ), and that the situation never seems to recover (even when using big buffers), right?
  • It is possible that there is some underlying issue in an alsa driver which is only triggered during a number of consecutive xruns. By going with a more low-latent kernel, you might hence "avoid" the issue, but perhaps not fully fix it. (This is pure speculation at this point)
  • The exact audio HW used (and hence the driver) is very relevant, aside from indeed the kernel version and config. @Rippert , which audio device are you using on the pi? (arecord -L, aplay -L, lsmod and dmesg are of course useful)

In any case, the ideal scenario is that we do a more proper investigation for the real cause here.
@Rippert , since you are running two kernels on the same HW, one of which gives better results, I would actually even recommended trying to boot raspbian with the arch kernel. (This is perhaps easier said then done, but on the other hand doing a kernel recompilation is perhaps around the same difficulty).
You asked which were the relevant options? I would assume mainly the preemptiveness, but you can just as well take the full config from the arch kernel as well. (effectively building the same kernel assuming the version is the same). (you can easily check the config of a running kernel by doing "zcat /proc/config.gz" (if it was compiled with CONFIG_IKCONFIG_PROC)

I have plenty of pi's laying around here. If I find some time (might not be any time soon but I'll do my best), I can also have a further look into this. @Rippert , perhaps to kickstart me, do you have an image that I can simply flash on an SD card (and a description of the HW you are using?)

We'll nail this the bug eventually:)

@Rippert
Copy link

Rippert commented Mar 27, 2019

@arre525 thanks for the detailed info. Responding to your bullets:

  • The PREEMPT I was referring to was the default ARCH kernel setup, which is different than the stock Raspbian kernel (NO PREEMPT) or your kernels (PREEMPT RT). It's one of the obvious differences between Raspbian and ARCH on ARM.

  • You are correct that the essence of the bug is that the jackd process becomes unrecoverable, not only failing to operate properly with clients, but requiring a SIGKILL signal rather than just a SIGTERM signal to stop. This happens on Raspbian, but not on ARCH. It is possible to get the ARCH jackd to get into a state where it is undergoing a large amount of xruns, but it has always been recoverable, and I have never has to use a SIGKILL signal to stop jackd on ARCH. In fact it is usually recoverable by just using jack_bufsize to increase the buffer, and the audio never really stops, but becomes full of "static" from all the xruns. Guitarix is useful as a tool to see this, as the message window not only shows the xruns, but shows a somewhat specific signature when the unrecoverable xruns start. Instead of listing xruns of a few or even fractional microseconds, it will list xruns that are something like xx.xx e09 usec. essentially a nonsense result that each xrun is taking thousands of seconds.

  • I understand your point that the ARCH system may still be susceptible to the unrecoverable state, but so far, I have not been able to reproduce it on that system. I can push the buffer down to ridiculously small levels like 16 periods, and still recover, although the clients tend to spontaneously die at such small buffer sizes.

  • I am using a Behringer UMC 204HD USB interface on my test setup, with a RPi 3B+ booting off of a usb SSD. I have also seen the unrecoverable xruns using an AudioInjector Stereo Hat on a 3B and a 3B+ running Raspbian Full. I'll have to pull that unit out of my Amp Rack and test it with Arch when I get a chance. I'll get the full listings out later for both Raspbian and ARCH, but for now:

aplay -l
**** List of PLAYBACK Hardware Devices ****
card 1: U192k [UMC204HD 192k], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

Unfortunately I'm getting this on Raspbian:

$ zcat /proc/config.gz
gzip: /proc/config.gz: No such file or directory

I like the idea of just using the ARCH kernel in Raspbian. Other than copying over the /boot partition and adjusting the PART_UUIDs in there, what else would I have to do?

I don't have an image right now, but could make one. It will be fairly large given the multiple operating systems (using PINN) and all the customization inside each one. It's on a 60GB SSD, but only about 10GB of actual data between ARCH and Raspbian. I can probably extract each OS from the disk separately if that would help.

@Rippert
Copy link

Rippert commented Mar 27, 2019

I rebooted into Arch and tried reducing the buffersize of jackd. I got it down to 8 periods (jack_bufsize spit errors when I tried 4). Looking at guitarix, there were the rolling xruns, and they had the xx.xxe08 signature. However, I was still able to increase the buffer back to 64 without losing control of jackd. Guitarix needed a restart, but it worked fine afterward. So, similar results in terms of xruns, but jack remains responsive to commands and does not need to be killed to recover. The sound never stopped working.

For Arch:

$ uname -a
Linux alarmpi 4.19.25-2-ARCH #1 SMP PREEMPT Tue Feb 26 01:20:52 UTC 2019 armv7l GNU/Linux

@arre525 I was able to get the kernel configuration out of Arch. It's attached, along with the other diagnostics you asked for, as text files because they are so long. Let me know if you have a problem with those and I can just cut and paste them.

Raspbian_lsmod_dmesg.txt
Arch_uname_aplay-l_lsmod_dmesg.txt
Arch_KernelConfig.txt

I'll start looking at how to compile a new kernel for raspbian with the Arch config. I compiled kernels back in the 90s, but not since then. I think some things may have changed in the meantime :)

@arre525
Copy link
Contributor

arre525 commented Mar 28, 2019

Thanks.

I don't have an image right now, but could make one

Nah, never mind then. It will take us more time than it saves.

I like the idea of just using the ARCH kernel in Raspbian. Other than copying over the /boot partition and adjusting the PART_UUIDs in there, what else would I have to do?

Copying the full /boot would probably also work, but it somewhat overkill. Normally the minimum that you should do it (from the top of my head):

  • Copy the zImage-arch to the raspbian boot part.
  • modify /boot/config.txt to use it
  • (As long as the two kernel verions are similar, no need to modify the dts'ses normally.)
  • And then also copy the modules (from /lib/modules/<ARCHKERNEL/ into the /lib/modules/ of the raspbian). The modules from the two kernels can just coexist next to each other, and depending on which kernel you boot, one or the other will be taken.

I would guess that's all you need to swap kernels. (Does seems easier now then compiling the kernel yourself:)

@Rippert
Copy link

Rippert commented Mar 28, 2019

@arre525 If you want a prepackaged image for ARCH, you can install the PINN image on your SD card (or HD):
https://sourceforge.net/projects/pinn/files/Etcher/pinn-lite.img.zip/download

PINN is a fork of NOOBS, and one of the OS options is ARCH. That's where I got my current ARCH install from. I also tried out the install direct from the ARCH LINUX ARM site (not an image, so more work), and it worked well with jack too, but I wanted to get more use out of my 64G SSD, so I switched to PINN so I could have multiple OSes on one drive.

I'll try your instructions to get the ARCH kernel into Raspbian when I get the chance, thanks.

@Rippert
Copy link

Rippert commented Mar 28, 2019

I followed your instructions and was able to boot Raspbian using the ARCH kernel, but it wouldn't load any modules:

$ uname -a
Linux raspberrypi 4.19.25-2-ARCH #1 SMP PREEMPT Tue Feb 26 01:20:52 UTC 2019 armv7l GNU/Linux
$ lsmod
Module                  Size  Used by

Took a look at the snd_bcm2835 module to see if I could load it but no go:

$ sudo modprobe snd-bcm2835
modprobe: ERROR: could not insert 'snd_bcm2835': Exec format error
$ sudo modinfo snd-bcm2835
filename:       /lib/modules/4.19.25-2-ARCH/kernel/drivers/staging/vc04_services/bcm2835-audio/snd-bcm2835.ko.gz
modinfo: ERROR: could not get modinfo from 'snd_bcm2835': Exec format error

Seems to be looking in the right place, but acting like the modules were compiled against a different kernel or something.

I did try replacing the whole /boot partition contents, and I had to change /sbin/init from an absolute symbolic link to a relative one to get it to finish booting. Unfortunately, still the same Exec format error on the modules.

@Rippert
Copy link

Rippert commented Mar 29, 2019

OK, that was dumb. Raspbian doesn't like compressed modules. So a quick gunzip and depmod later, the audio was working.

Unfortunately, I still ended up with unrecoverable xruns in jackd.

So it's something else that's different between ARCH and Raspbian.

By the way, I did double check:

$ uname -a
Linux raspberrypi 4.19.25-2-ARCH #1 SMP PREEMPT Tue Feb 26 01:20:52 UTC 2019 armv7l GNU/Linux
$ modinfo snd-bcm2835
filename:       /lib/modules/4.19.25-2-ARCH/kernel/drivers/staging/vc04_services/bcm2835-audio/snd-bcm2835.ko
license:        GPL
description:    Alsa driver for BCM2835 chip
author:         Dom Cobley
alias:          of:N*T*Cbrcm,bcm2835-audioC*
alias:          of:N*T*Cbrcm,bcm2835-audio
depends:        snd-pcm,snd
staging:        Y
intree:         Y
name:           snd_bcm2835
vermagic:       4.19.25-2-ARCH SMP preempt mod_unload ARMv6 p2v8 
parm:           force_bulk:Force use of vchiq bulk for audio (bool)
parm:           enable_hdmi:Enables HDMI virtual audio device (bool)
parm:           enable_headphones:Enables Headphones virtual audio device (bool)
parm:           enable_compat_alsa:Enables ALSA compatibility virtual audio device (bool)

It's definitely running the ARCH kernel and modules.

@arre525
Copy link
Contributor

arre525 commented Mar 29, 2019

Nice, thanks for putting the time into it.

Fascinating. The exact same kernel and same jackd version, and still different results?
You could always try to build jackd statically (--static) on arch and copy the binary to raspbian to have the exact same code twice (to make sure its not a compilation issue or library). If it then still gives a difference, it's pretty baffling.

@Rippert
Copy link

Rippert commented Mar 29, 2019

It's not really the same jackd version. They just both say 1.9.12 when queried. On ARCH I just installed from the standard repo using pacman. On Raspbian, I built from source using the latest develop branch code (originally to test the new commit added from your fork). I'll look into trying to transfer the binary from ARCH to Raspbian, or building on ARCH statically as you mention if that doesn't work.

I suppose it's possible to download the source and the PKGBUILD files on ARCH similar to how you can do that on Debian. I've only ever done that on Debian. It's possible the Arch folks have some trick up their sleeve in there.

I seem to remember that you were using some different hardware for one of your builds in your blog, an OrangePi? Is that right? If it had the same problem, then there should be some kind of common factor.

@arre525
Copy link
Contributor

arre525 commented Mar 29, 2019

If you haven't compiled it yet on arch, I guess the simplest is to compile (statically) on raspbian and transfer that to ARCH. (If you compile staticallly you dont depend on any libs and so its easier to be sure it works when you move it to a new platform)
(I used the Orange Pi Zero with armbian.)

@7890
Copy link
Contributor

7890 commented Mar 29, 2019

Hi, I don't overview the current status - just a note on statically: there is libjack somewhere in this scenario, you can't (or shouldn't) statically link to it. If you see different behaviour of same jackd, you could check if libjack is (compiled) the same (way). If this is not related or you have thought already about it, please just ignore.

@arre525
Copy link
Contributor

arre525 commented Mar 30, 2019

If you believe that the xrun msecs listed are a sort of timestamp

Yes, I do. This is the code (linux/alsa/alsa_driver.c)

                struct timeval now, diff, tstamp;
		driver->xrun_count++;
		snd_pcm_status_get_tstamp(status,&now);
		snd_pcm_status_get_trigger_tstamp(status, &tstamp);
		timersub(&now, &tstamp, &diff);
		*delayed_usecs = diff.tv_sec * 1000000.0 + diff.tv_usec;
		jack_log("**** alsa_pcm: xrun of at least %.3f msecs",*delayed_usecs / 1000.0);

The idea is that is shows the delay between when the last trigger tstamp changed (e.g. when audio started processing again) and the current time. Regardless of the fact that is is printed in the wrong format (a big negative number) , as long as the alsa tstamp is constant, the difference between those numbers is still an accurate delay.

From what I make up from your post and mine, is that it simply takes too long for it to service it, causing the "logical" xruns. Simply having more processing running that linux has to schedule between could be a simple explanation between raspbian and arch, but to know for sure you'd have to force xruns on both and compare the delays.

What I still want to do, is look into whether jackd isn't being too inefficient in handling the xrun. (If it has to do extra processing recovering from it, it will take more time, and jump right back into one. ). (Again, once I find more time, will keep you posted)

Regarding the comment of the static build, yes, typically in the configure you can specify this so it ends up in the cflags. But as @7890 mentionned this might be more tricky as I first naively anticipated. Since you would compile on the two targets anyway, it seems to make more sense simply to compile the exact same code and take it from there.

@Rippert
Copy link

Rippert commented Mar 30, 2019

Ahh, thanks for the code snippet. I understand the large negative numbers now. diff is supposed to be the difference between now and stamp (diff = now - stamp). But when the runaway xrun condition happens, now becomes a small constant, probably zero or the last valid value of now, so diff becomes a large negative number.

@Rippert
Copy link

Rippert commented Mar 30, 2019

Well, I can't get ARCH to become unrecoverable, but I can get it to spit out a lot of xruns just by lowering the number of periods. So here is some output from ARCH when Guitarix is running and I reduce the periods to 8:

Jack: **** alsa_pcm: xrun of at least -584367.424 msecs
Jack: JackGraphManager::GetBuffer : port = 12 is released state
Jack: JackGraphManager::GetBuffer : port = 13 is released state
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 11
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -584387.264 msecs
Jack: JackGraphManager::GetBuffer : port = 12 is released state
Jack: JackGraphManager::GetBuffer : port = 13 is released state
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 11
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -584407.232 msecs
Jack: JackGraphManager::GetBuffer : port = 12 is released state
Jack: JackGraphManager::GetBuffer : port = 13 is released state
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 11
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -584427.264 msecs
Jack: JackGraphManager::GetBuffer : port = 12 is released state
Jack: JackGraphManager::GetBuffer : port = 13 is released state
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 11
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -584447.232 msecs
Jack: JackGraphManager::GetBuffer : port = 12 is released state
Jack: JackGraphManager::GetBuffer : port = 13 is released state
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 11
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -584467.264 msecs
Jack: JackGraphManager::GetBuffer : port = 12 is released state
Jack: JackGraphManager::GetBuffer : port = 13 is released state
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 11
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -584487.232 msecs
Jack: JackGraphManager::GetBuffer : port = 12 is released state
Jack: JackGraphManager::GetBuffer : port = 13 is released state
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 11
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -584507.264 msecs
Jack: JackGraphManager::GetBuffer : port = 12 is released state
Jack: JackGraphManager::GetBuffer : port = 13 is released state
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 11
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least -584527.296 msecs

So, instead of the ~1 msec differential on Raspbian, ARCH has ~20 msec. So it's possible that ARCH simply has more time to recover between each xrun than Raspbian. Which seems different than the idea that ARCH has more available processor cycles than Raspbian due to less load. Of course, it's possible that whatever is causing the weird operation of the time routines is not really providing a good insight into the time differences (i.e. now is not a constant but is simply falling behind stamp at some constant rate)

@Rippert
Copy link

Rippert commented Mar 30, 2019

Forget the last post. I just tried it again, and now the timestamps are ~1 msec apart on ARCH. Not sure why the difference, but obviously not a very reliable metric.

@arre525
Copy link
Contributor

arre525 commented Mar 31, 2019

Check this out.

I quickly tested this on my x86 virtual machine, recompiling jackd myself.

diff --git a/linux/alsa/alsa_driver.c b/linux/alsa/alsa_driver.c
index 7b42566..2ffb856 100644
--- a/linux/alsa/alsa_driver.c
+++ b/linux/alsa/alsa_driver.c
@@ -1294,6 +1294,15 @@ alsa_driver_wait (alsa_driver_t *driver, int extra_fd, int *status, float
        jack_time_t poll_enter;
        jack_time_t poll_ret = 0;
 
+       {
+               struct timeval now, diff;
+               static struct timeval prevtv;
+               gettimeofday(&now,NULL);
+               timersub(&now, &prevtv, &diff);
+               jack_log("%d us (and %d s) passed since last alsa_driver_wait", diff.tv_usec, diff.tv_sec);
+               memcpy(&prevtv, &now, sizeof(now));
+       }
+
        *status = -1;
        *delayed_usecs = 0;

This allows to accurately show how long it takes for the alsa_driver_wait function to be called each time (without relying on alsa timestamps).

Some findings:

  • As expected, simply doing the log message each time creates a significant latency issue, causing xruns to occur much more frequently. (Contributing to the hanging xruns)
  • And also my suspicion was confirmed that during the xrun, somehow jackd takes considerably longer to recover:
testbox@testbox-VirtualBox:~/tmp/bin$ LD_LIBRARY_PATH=/home/testbox/tmp/lib ./jackd -v -dalsa -r48000 -p128 -D -Chw:CARD=Headset,DEV=0 -Phw:CARD=Headset,DEV=0 

...
hw:CARD=Headset,DEV=0|hw:CARD=Headset,DEV=0|128|2|48000|0|0|nomon|swmeter|-|32bit
configuring for 48000Hz, period = 128 frames (2.7 ms), buffer = 2 periods
ALSA: final selected sample format for capture: 16bit little-endian
ALSA: use 2 periods for capture
..
Jack: 3030 us (and 0 s) passed since last alsa_driver_wait
Jack: 2954 us (and 0 s) passed since last alsa_driver_wait
Jack: 1716 us (and 0 s) passed since last alsa_driver_wait
Jack: 3052 us (and 0 s) passed since last alsa_driver_wait
Jack: 2257 us (and 0 s) passed since last alsa_driver_wait
Jack: **** alsa_pcm: xrun of at least 0.031 msecs
Jack: Repreparing capture
Jack: Repreparing playback
Jack: ALSA XRun wait_status = 0
Jack: 29275 us (and 0 s) passed since last alsa_driver_wait
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 10
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: 3985 us (and 0 s) passed since last alsa_driver_wait
Jack: 2774 us (and 0 s) passed since last alsa_driver_wait
Jack: 1576 us (and 0 s) passed since last alsa_driver_wait
Jack: 3018 us (and 0 s) passed since last alsa_driver_wait

Note how " Jack: 29275 us (and 0 s) passed since last alsa_driver_wait " takes 30ms while the rest only takes 2-5ms. Also the buffer length (and even jackd) says the buffer is 2 ms. The delays are around that, but they don't immediately cause an xrun.

Going to look a bit further, but I think I can pretty much nail it down from here.
(btw on x86 the alsa xrun delay is accurate. TBD what goes wrong there on arm)

@arre525
Copy link
Contributor

arre525 commented Mar 31, 2019

Some more printing:

Jack: 5249 us (and 0 s) (+- 251 frames) passed in entire function (5075 us in poll time)
Jack: 5278 us (and 0 s) passed since last alsa_driver_wait
Jack: 3632 us (and 0 s) passed due to poll
Jack: playback ok
Jack: 1960 us (and 0 s) passed due to poll
Jack: capture ok
Jack: 288 frames capture avail
Jack: 288 frames play avail
Jack: 5768 us (and 0 s) (+- 276 frames) passed in entire function (5592 us in poll time)
Jack: 5782 us (and 0 s) passed since last alsa_driver_wait
Jack: 3138 us (and 0 s) passed due to poll
Jack: playback ok
Jack: 1796 us (and 0 s) passed due to poll
Jack: capture ok
Jack: 272 frames capture avail
Jack: 272 frames play avail
Jack: 5123 us (and 0 s) (+- 245 frames) passed in entire function (4934 us in poll time)
Jack: 5173 us (and 0 s) passed since last alsa_driver_wait
Jack: 4165 us (and 0 s) passed due to poll
Jack: playback ok
Jack: 2 us (and 0 s) passed due to poll
Jack: capture ok
Jack: 256 frames capture avail
Jack: 256 frames play avail
Jack: 4272 us (and 0 s) (+- 205 frames) passed in entire function (4167 us in poll time)
Jack: 4285 us (and 0 s) passed since last alsa_driver_wait
Jack: 4528 us (and 0 s) passed due to poll
Jack: playback ok
Jack: 8200 us (and 0 s) passed due to poll
POLLERR
Jack: capture ok
Jack: -32 frames capture avail
Jack: -32 frames play avail
Jack: **** alsa_pcm: xrun of at least 1.687 msecs
Jack: Repreparing capture
Jack: Repreparing playback
Jack: 33989 us (and 0 s) (+- 1631 frames) passed in entire function (12728 us in poll time)
Jack: ALSA XRun wait_status = 0
Jack: 34291 us (and 0 s) passed since last alsa_driver_wait
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 10
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: 3124 us (and 0 s) passed due to poll
Jack: playback ok
Jack: 1902 us (and 0 s) passed due to poll
Jack: capture ok
Jack: 288 frames capture avail
Jack: 288 frames play avail
Jack: 5200 us (and 0 s) (+- 249 frames) passed in entire function (5026 us in poll time)
Jack: 5219 us (and 0 s) passed since last alsa_driver_wait
Jack: 4092 us (and 0 s) passed due to poll
Jack: playback ok
Jack: 1314 us (and 0 s) passed due to poll
Jack: capture ok
Jack: 272 frames capture avail
Jack: 272 frames play avail
Jack: 5550 us (and 0 s) (+- 266 frames) passed in entire function (5406 us in poll time)
Jack: 5570 us (and 0 s) passed since last alsa_driver_wait
Jack: 4496 us (and 0 s) passed due to poll
Jack: playback ok
Jack: capture ok

This is pretty retarted:

  • First of all, the poll function (system call) is executed twice EACH time. The whole code seems to suggest it being able to do it in a single call, which would be alot more efficient, but it does not
  • For some reason, the poll function also waits quite long (I assume to have at least a period of buffers ready). In theory it should be perfectly possible for jackd to split up its samples and doing a bit more kernel communication instead of having only 2 frames ready and hoping we never fail.

Anyway, pretty interesting stuff. If I find some more jucy details, will share:)

@arre525
Copy link
Contributor

arre525 commented Mar 31, 2019

@Rippert I just discovered "soft mode" in jackd. Even without recompiling, could you simply try the -s option (in the alsa parameters).
What this does, is avoid the kernel from reporting overruns and silently overwriting data. Clients wont be informed of the xrun (and obviously you will still "hear" it), but it serves as a good test to check whether the actual result you get out of it is better than .. well nothing if it hangs:)

@arre525
Copy link
Contributor

arre525 commented Mar 31, 2019

Finally brought out the orange pi again. Guitarix has a damn good reverb setting:)

Anyway, some observations:

  • -s setting does not really help and makes stuff worse
  • The issue I run into at this point is not jackd overrunning in alsa, rather guitarix not being fast enough. (it cannot follow at a period of 64 (48000Hz), while it does at 128. I even tried using-n 128 or something (a huge buffer in alsa), the sound is still crap due to the lost packets of guitarix itself.
  • Since I access the device over wifi, the printfs cause major xruns. If I send both the output of guitarix and jackd to /dev/null, I can get the best performance (-n2 -p 128 -r 48000). But this is most likely the reason that I assumed to patch not to be complete in my case. (Once a sporadic xrun comes through when the system is already at its limit, it just never recovers). But this is not really a jackd issue.

@arre525
Copy link
Contributor

arre525 commented Mar 31, 2019

For the negative xrun delay printing, I can confirm this is fixed already now by commit 756b4fa (setting snd_pcm_sw_params_set_tstamp_mode).

Before:

root@orangepizero:~# LD_LIBRARY_PATH=/root/jackdcomp/build/lib /root/jackdcomp/build/bin/jackd -v -dalsa -r48000 -p64 -n 2 -D -Chw:Device,0 -Phw:Headset,0 | grep least
no message buffer overruns
no message buffer overruns
no message buffer overruns
Jack: **** alsa_pcm: xrun of at least -1011733.504 msecs
Jack: **** alsa_pcm: xrun of at least -1011753.664 msecs
Jack: **** alsa_pcm: xrun of at least -1011779.648 msecs
Jack: **** alsa_pcm: xrun of at least -1011799.680 msecs

Now:


root@orangepizero:~# LD_LIBRARY_PATH=/root/jackdcomp/build/lib /root/jackdcomp/build/bin/jackd -v -dalsa -r48000 -p64 -n 2 -D -Chw:Device,0 -Phw:Headset,0 | grep least
no message buffer overruns
no message buffer overruns
no message buffer overruns
Jack: **** alsa_pcm: xrun of at least 0.026 msecs
Jack: **** alsa_pcm: xrun of at least 0.039 msecs
Jack: **** alsa_pcm: xrun of at least 0.014 msecs
Jack: **** alsa_pcm: xrun of at least 0.012 msecs
Jack: **** alsa_pcm: xrun of at least 0.014 msecs

@arre525
Copy link
Contributor

arre525 commented Mar 31, 2019

@Rippert what build are you using? (git show) Cause I don't understand why you still see the negative timestamps..

@7890
Copy link
Contributor

7890 commented Mar 31, 2019

@arre525 really interesting findings!
A short note on period size: Running at size of 128 or lower usually needs special tweaks to the system in order to run smoothly. It of course depends on the backend, AFAIK USB devices have at least 1ms delay due to how audio is read from the device. With firewire (separate IRQ) it's possible to go very low. If the system runs smoothly with tight buffers, it also means that every client behaves "correctly". It even depends on the graph, eg. how much in can be parallelized. I want to say that seeing xruns with a buffer below say 64 samples is not uncommon, when considering that rpi normally doesn't run with rt kernel, it's not unexpected to see xruns. However from what I see you identified possible points of improvement for the alsa driver that are beyond the current issue! Such as

First of all, the poll function (system call) is executed twice EACH time. The whole code seems to suggest it being able to do it in a single call, which would be alot more efficient, but it does not

This must be remembered - it's a valuable discovery

@arre525
Copy link
Contributor

arre525 commented Mar 31, 2019

Maybe given this much traffic (mea cupla:) on this thread, perhaps also to focus it back on the initial issue.
What I still see is the following (no guitarix needed):


root@orangepizero:~# LD_LIBRARY_PATH=/root/jackdcomp/build/lib /root/jackdcomp/build/bin/jackd -v -dalsa -r48000 -p64 -n 2 -D -Chw:Device,0 -Phw:Headset,0 2>&1 | grep xrun
^Croot@orangepizero:~# LD_LIBRARY_PATH=/root/jackdcomp/build/lib /root/jackdcomp/build/bin/jackd -v -dalsa -r48000 -p64 -n 2 -D -Chw:Device,0 -Phw:Headset,0 2>&1 | gr xrun
^Croot@orangepizero:~# LD_LIBRARY_PATH=/root/jackdcomp/build/lib /root/jackdcomp/build/bin/jackd -v -dalsa -r48000 -p64 -n 2 -D -Chw:Device,0 -Phw:Headset,0 2>&1 | gr xrun
Jack: **** alsa_pcm: xrun of at least 0.044 msecs
Jack: **** alsa_pcm: xrun of at least 0.052 msecs
Jack: **** alsa_pcm: xrun of at least 0.050 msecs
Jack: **** alsa_pcm: xrun of at least 0.061 msecs
Jack: **** alsa_pcm: xrun of at least 0.070 msecs
Jack: **** alsa_pcm: xrun of at least 0.077 msecs
Jack: **** alsa_pcm: xrun of at least 0.076 msecs
Jack: **** alsa_pcm: xrun of at least 0.071 msecs

Most of the time there is no issue; but sometimes it suddenly goes into a loop of xruns and does not recover. Even though printfs can play a major role, this is not the only cause: If, in the case it "works", I open other ssh session and do a print as fast as possible, it does not trigger the xruns.

=> So this is I guess still the main focus of the thread

@7890
Copy link
Contributor

7890 commented Mar 31, 2019

Yes.. it seems like once the error is triggered, it can not recover. It also seems like the system can cope with desired arguments, but not for every try. It has the signature of a bug.

@Rippert
Copy link

Rippert commented Mar 31, 2019

$ git show
commit b35fa69f3faee3f95fc11497c1775e21b1d93019
Author: Thomas Brand <[email protected]>
Date:   Tue Mar 26 22:47:48 2019 +0100

    Add Arnout Diels (arre525) to authors

diff --git a/AUTHORS.rst b/AUTHORS.rst
index 515e6a3f..c00784f6 100644
--- a/AUTHORS.rst
+++ b/AUTHORS.rst
@@ -12,6 +12,7 @@ Andrew Kelley
 Andrzej Szombierski
 Andy Wingo
 Anthony Van Groningen
+Arnout Diels
 Arnaud Rebillout
 Arnold Krille
 Bernhard M. Wiedemann

But it may be using the wrong library path. I tried to use LD_LIBRARY_PATH, but it won't start when I do. I need to create a new fresh Raspbian and ARCH to try out the static build anyway, so I'll have to get back to you after I do that.

@Rippert
Copy link

Rippert commented Mar 31, 2019

Tried one more thing, but the results are weird, in a kinda good way. I reconfigured compiled and installed jack2 with the libpath set to overwrite the old apt-get installed libraries in /usr/lib/arm-linux-gnueabihf/. I was still getting problems starting jackd, until I removed the -v flag. Something in the new libraries was making the verbose mode crash. So I started it without it, opened Guitarix, and just used the Guitarix message window to see the xruns. Now I'm getting results like on ARCH, where I can lower and raise the bufsize without getting an unrecoverable state. Also, all the xruns times reported are small positive numbers.

So it looks like the code changes did do something, but there is some new problem now with verbose output. It may still be just something in my system setup, but the positive changes in the non-recoverable mode and the xrun out to clients seem pretty good.

@arre525 I'm not sure why you are still getting unrecoverable states, except that you're using an OrangePi zero which is a bit slower than an RPi 3B+.

Anyway, thanks for the improved code, and sorry about my poor use of libraries.
-Ted

@Rippert
Copy link

Rippert commented Apr 1, 2019

Just for completeness, I reverted to the experimental Raspbian kernel:

uname -a
Linux raspberrypi 4.19.30-v7+ #1209 SMP Tue Mar 26 13:14:20 GMT 2019 armv7l GNU/Linux

Results were the same, no unrecoverable state.
Then back to the current released Raspbian kernel:

uname -a
Linux raspberrypi 4.14.98-v7+ #1200 SMP Tue Feb 12 20:27:48 GMT 2019 armv7l GNU/Linux

Again, no unrecoverable state when lowering jackd all the way down to 8 periods and then back up to a reasonable number.

@Rippert
Copy link

Rippert commented Apr 1, 2019

Testing @arre525 alsa driver patch. I reverted to the commit just before it was added and rebuilt/installed:

$ git show
commit cbe17f421059ba0cebd2d594e8a1a62dd30e9b7c
Merge: 59550f67 e298b440
Author: falkTX <[email protected]>
Date:   Fri Mar 22 12:33:53 2019 +0100

    Merge branch 'master' into develop

Sure enough, it is now quite easy to get into the unrecoverable state by lowering the periods. So Arnout's patch definitely improved things.

@7890
Copy link
Contributor

7890 commented Apr 1, 2019

Something is still odd - I've tried with and without patch and jackd gets into unrecoverable state when doing this:
jackd -dalsa this starts jackd fine with on-board card (I've left out -R and --verbose)
A sequence of jack_bufsize 4096 and jack_bufsize 128 shows expected behaviour, everything seems sane. As soon as going to a lower bufsize (64 is enough), jackd gets into unrecoverable state. When trying to set a higher period size again with jack_bufsize, client hangs, server shows errors.
Some output here:

jackd:

....
configuring for 48000Hz, period = 4096 frames (85.3 ms), buffer = 2 periods
ALSA: final selected sample format for capture: 32bit integer little-endian
ALSA: use 2 periods for capture
ALSA: final selected sample format for playback: 32bit integer little-endian
ALSA: use 2 periods for playback
configuring for 48000Hz, period = 128 frames (2.7 ms), buffer = 2 periods
ALSA: final selected sample format for capture: 32bit integer little-endian
ALSA: use 2 periods for capture
ALSA: final selected sample format for playback: 32bit integer little-endian
ALSA: use 2 periods for playback
configuring for 48000Hz, period = 4096 frames (85.3 ms), buffer = 2 periods
ALSA: final selected sample format for capture: 32bit integer little-endian
ALSA: use 2 periods for capture
ALSA: final selected sample format for playback: 32bit integer little-endian
ALSA: use 2 periods for playback
configuring for 48000Hz, period = 128 frames (2.7 ms), buffer = 2 periods
ALSA: final selected sample format for capture: 32bit integer little-endian
ALSA: use 2 periods for capture
ALSA: final selected sample format for playback: 32bit integer little-endian
ALSA: use 2 periods for playback
configuring for 48000Hz, period = 64 frames (1.3 ms), buffer = 2 periods
ALSA: final selected sample format for capture: 32bit integer little-endian
ALSA: use 2 periods for capture
ALSA: final selected sample format for playback: 32bit integer little-endian
ALSA: use 2 periods for playback
JackPosixProcessSync::LockedTimedWait error usec = 5332 err = Connection timed out
JackEngine::ClientCloseAux wait error ref = 2
JackPosixProcessSync::LockedTimedWait error usec = 5000000 err = Connection timed out
Driver is not running
Cannot create new client
Unknown request 4294967295
CheckSize error size = 0 Size() = 12
CheckRead error

jack_bufsize 64 works, then trying to set 128 again stalls:

.....
$ jack_bufsize 128
$ jack_bufsize 64
$ jack_bufsize 128
^C^C^C^C^C^CCannot read socket fd = 7 err = Connection reset by peer
Could not read result type = 24
^C^C^C


^C^C^C^C^C^C^C^C^C^C

Actually it can happen that it hangs very hard and only ctrl+z and killall -9 jack_bufsize helps in this case.

Worst case is if hitting once a limit will render the server unresponsive, not good. Btw there were no other clients running during test. I think this behavior is not limited to rpi computers - we'll have to investigate more how alsa + jackd organize for a smooth data passing and handle odd cases in a reasonably robust way, this is not the most simple task.

Thank you both, OP and all on this thread to bring it up and track so far!

Unknown request 4294967295 needs a separate look. Interestingly enough using that number incremented by one makes it a multiple of 1024 (?).
echo "scale=3; 4294967296/1024/1024/1024"|bc #4.000 However not sure if relevant here :)

@7890
Copy link
Contributor

7890 commented Apr 2, 2019

Re "with and without patch": I naively only applied the patch @arre525 posted 2 days ago. If there is anything else to apply I will give it a spin! Meanwhile I'll try to understand the core of the issue better, we'll find a solution.

@Rippert
Copy link

Rippert commented Apr 2, 2019

@7890 your comment about not being limited to rpi computers sounds like you are using some other hardware, are you? If it's an rpi, can you show your /boot/config.txt contents and the output of aplay -L? I can try to reproduce your results if so.

@7890
Copy link
Contributor

7890 commented Apr 2, 2019

@Rippert yes not an rpi just a random PC.

$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: PCH [HDA Intel PCH], device 0: ALC887-VD Analog [ALC887-VD Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: PCH [HDA Intel PCH], device 1: ALC887-VD Digital [ALC887-VD Digital]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: PCH [HDA Intel PCH], device 3: HDMI 0 [HDMI 0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

@Rippert
Copy link

Rippert commented Apr 2, 2019

@7890 I've seen similar problems on my old MacBook running KXStudio, a while back, but they went away after an update and I haven't been able to reproduce them since. You may be having problems with your internal soundcard. Do you have a USB interface you could try?

@vikwil
Copy link
Contributor

vikwil commented Apr 24, 2019

I'm running into this problem a lot now, I think this is a regression in the Alsa driver. I have not bisected to find out when it occurred, but I don't recall it being this bad when running v1.9.12. I'm running commit b35fa69 (Tue Mar 26 22:47:48 2019 +0100).

When I run jackd with the Alsa driver I can get into the unrecoverable state by loading the system (my method is simply spamming ctrl+n in Chrome to spawn new windows).
The printed errors look like this:

Jack: **** alsa_pcm: xrun of at least 0.016 msecs
Jack: Repreparing capture
Jack: Repreparing playback
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 10
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least 0.016 msecs
Jack: Repreparing capture
Jack: Repreparing playback
Jack: ALSA XRun wait_status = 0
Jack: JackSocketServerChannel::Execute : fPollTable i = 1 fd = 10
Jack: JackRequest::Notification
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: JackEngine::ClientNotify: no callback for notification = 3
Jack: **** alsa_pcm: xrun of at least 0.023 msecs

I get the same behavior even if I don't use verbose mode.

If I instead run jackd with the dummy driver and hook up alsa_in and alsa_out to the same sound card I do not end up in the unrecoverable state. I get some audio glitches when loading the system but it always recovers.

@jofemodo
Copy link

jofemodo commented May 14, 2019

JackPosixSemaphore is not used anymore since 1.9.11-RC1, please update your jack2 version and try again. No point trying to fix this since the latest code does not use the code where the timeout occurs...

Hi @falkTX !

We're having similar issues:

 JackPosixSemaphore::TimedWait err = Connection timed out
 SuspendRefNum error
 JackClient::Execute error name = ZynMidiRouter
 JackPosixSemaphore::TimedWait err = Connection timed out
 SuspendRefNum error
 JackClient::Execute error name = jackpeak
 Server is not running
 Server is not running
 Server is not running
 ...

with jackd2 1.9.12, fresh build from repository code (tag v1.9.12).

Exactly same problem with 1.9.11, installed package from stretch repo.

Initially everything works OK, but after a variable time (from minutes to hours), all clients are disconnected like that. Jack server is still running and, in fact, we can restart the clients (without restarting jackd) and everything works OK until the next "error event".

Please, note that v1.9.12 also prints the "JackPosixSemaphore::TimedWait err = Connection timed out", what doesn't fit your statement about.

Thanks!

@jofemodo
Copy link

Hi @falkTX & everybody!
It seems we found a solution. It's a curious bug, related with changing system date/time. Take a look to this thread:

https://github.com/zynthian/zynthian-sys/issues/93

Resume:
If the time change is greater than a dT (2149s??), we get the error described above.
The problem is quite bizarre, but we have a time synchronization daemon and some time server was returning a date/time with 1 hour offset randomly, so the problem was perceived as "random" until @riban-bw realize the cause ...

Regards,

@arre525
Copy link
Contributor

arre525 commented May 15, 2019

Nice find @jofemodo! I have a feeling jackd must be trying to compare system timestamps somewhere instead of relying on a monotonic clock for this. If someone had time to look at this, can't be too hard to trace back to the relevant code. Don't think this is the same root cause as most other ppl are encountering in this thread, but anyway one that needs fixing. Fancy work at zynthian btw

@jofemodo
Copy link

Thanks! We are really happy with our little wonderful machine ;-)
I will try to take a look to the code. Perhaps i could trace back the "bug" ;-)

Regards,

@jofemodo
Copy link

I open a separate issue: #469 , although both could be related ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants