Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jack_transport_locate crashes on macOS / Apple Silicon / HomeBrew #749

Open
cme opened this issue Apr 18, 2021 · 13 comments
Open

jack_transport_locate crashes on macOS / Apple Silicon / HomeBrew #749

cme opened this issue Apr 18, 2021 · 13 comments
Labels

Comments

@cme
Copy link
Contributor

cme commented Apr 18, 2021

Describe the bug

Use of the jack_transport_locate function seems to reliably cause a crash in clients on Apple Silicon, with the current Homebrew packages of JACK.

This can be seen just be running jack_test -v, the process will crash after calling jack_transport_locate()

Environment

  • JACK Version: 1.9.17 (current Homebrew version)
  • Operating System: macOS Big Sur 11.2.3, Apple Silicon M1
  • Installation: Homebrew

Steps To Reproduce

% jack_test -v

Expected vs. actual behavior

Expected behaviour (from a Linux ARM 32-bit host):

Transport is rolling...
Test jack_transport_locate while rolling
Transport current frame = 4096
Transport is rolling...
Do jack_transport_locate
Locating.... frame = 5120
Transport is rolling...
Locating.... frame = 5120
Transport is rolling...
Locating.... frame = 6144
Transport is rolling...
Locating.... frame = 6144
Transport is rolling...
sync callback : Releasing status : now ready...

Observed behaviour:

Transport is rolling...
Test jack_transport_locate while rolling
Transport current frame = 6144
Transport is rolling...
Do jack_transport_locate
zsh: bus error  jack_test -v
@cme cme added the bug label Apr 18, 2021
@falkTX
Copy link
Member

falkTX commented Apr 18, 2021

Can you get a (back)trace?

@cme
Copy link
Contributor Author

cme commented Apr 20, 2021

The crash is in CAS (in macosx/JackAtomic_os.h ) under JackTransportEngine::GenerateUniqueID(), and the faulting instruction is a CASAL instruction (generated by the __sync_bool_compare_and_swap intrinsic added in #685 )

The memory address looks valid and can be read and written via the debugger, although it's in shared memory which should probably be mapped with MAP_HASSEMAPHORE but isn't. I suspected this might be a problem, but adding the flag doesn't seem to help.

I can't seem to find any reference that mentions any new restrictions or requirements on semaphores in shared memory with these instrinsics or instructions on AArch64 macOS.

@falkTX
Copy link
Member

falkTX commented Apr 20, 2021

the code was crashing in a similar path before.
and was fixed by offseting some struct members so memory was aligned to 32bit (or 64bits not sure now)

this might need something like that too.

@falkTX
Copy link
Member

falkTX commented Apr 20, 2021

Odd enough, using something like https://kx.studio/Applications:Carla I remember the transport controls behaving normal, no crashes from that.

@cme
Copy link
Contributor Author

cme commented Apr 20, 2021

Alignment had been my first thought, but it seemed that some CAS calls were misaligned without raising an alignment exception, so I discounted that thought.

But after a bit more digging, it is indeed an alignment fault. It looks like memory that's potentially shared with another core will raise an alignment fault on misaligned compare-and-swap (which should be expected, because the architecture can't guarantee atomicity for misaligned acesses) but for local memory the OS is letting it get away with it without raising an exception.

I think these struct members do really need to be naturally aligned. If I'm understanding correctly, the only reason they're packed on macOS is because x86_64 and aarch64 get the same structures that way to share memory between native & Rosetta2 processes?

Theoretically we should be able to force alignment on anything that's accessed atomically, but the __packed__ attribute on containing structs seems to make contained structs lose their aligment requirements. That may be a Clang bug.

@cme
Copy link
Contributor Author

cme commented May 29, 2021

I've let this sit for a while since I've been busy with other things.

Specifying the alignment for members of a packed inherently doesn't work as 'packed' structs inherently have an alignment requirement of only byte-alignment. This breaks atomicity guarantees on ARM, and having packed structs is anyway undesirable for performance reasons on most architectures as well. I'd like to fix this, but before I do so I wanted to run the options by interested parties.

There are two potential options I see:

  1. Remove use of packed structs and use other means (standard sized ints, field oldering and explicit padding) to ensure that shared-memory structures have exactly the same layout whether compiled for 32 bit or 64 bit.
    • this would also involve adding some test framework to verify the format is as expected

or:

  1. encapsulate atomically-accessed counters into something that ensures atomicity without the same alignment constraints

I think #1 is more desirable as it should also give some performance improvement as well, but would likely be more work than #2. Hence why I want to check in before embarking on such a project!

Just to double-check my understanding, JACK is required to use shared-libraries so that client and server both pick up their definitions of structures from the same place. Is there any other mechanism for keeping things in sync? Is there any guarantee that a system with both 32-bit and 64-bit libraries will have the same versions (and hence structure definitions)?

Thanks!

@cme
Copy link
Contributor Author

cme commented May 29, 2021

Perhaps this warrants a more general discussion, perhaps I should take this to the mailing list?

@falkTX
Copy link
Member

falkTX commented May 29, 2021

It is fine to discuss it here.

It is hard to know the right answer, because any change has implications for long-term future.
So I am very reserved when it comes to changing stuff like the struct padding.

The theory behind packing is that we want:

  1. compatibility between different architectures of the same base system (e.g. linux 32bit and linux 64bit)
  2. compatibility across different OSes, mainly for use as network packets (encoded as little-endian afaik)
  3. compatibility across compilers, like mingw and msvc on windows

I dont think all those points are achieved right now though, specially no. 2.

If we keep the packing active, but place the item(s) that require specific alignment as the first member in the structs, would that work?

@cme
Copy link
Contributor Author

cme commented May 30, 2021

The structures are used directly for network transfer? I hadn't realised that. That would mean that changing the data formats at all will break compatability across network uses. I guess I will also review for which structures are used for network and which for shared memory.

Adding test framework for the structure layouts has the effect of building an (executable) specification of the data formats, so that seems pretty desirable.

Placing things with alignment requirements first doesn't work in the general case becase the containing structs themselves don't have alignment requirements, but might happen to work if it's possible to arrange it that way. It's fragile though, and can be broken by things like a struct containing more than one member struct which has alignment requirements.

But it's worth trying since it seems a smaller amount of work! The fragility could be mitigated by a small amount of selective assertions at call sites for CAS that assert the struct offset just for these members.

@cme
Copy link
Contributor Author

cme commented May 30, 2021

Following uses of atomically accesed structures and adding some checks, I have changed a few field orders and now seem to have working JACK on Apple Silicon (including Hydrogen).

Are there any other access controls or API version information that need to be updated on changing the struct layout? I didn't find anything.

@falkTX
Copy link
Member

falkTX commented Jul 5, 2021

I recall there is a global ABI version macro.. we need to bump this.

@falkTX
Copy link
Member

falkTX commented Jul 5, 2021

Done in 13d5ecc

So if a client is running on an old jack server somehow (jack updated while clients running or different versions of jack connected via network) an error is thrown.

We can consider this fixed, right?

(v1.9.19 to be tagged in around 10 days)

@cme
Copy link
Contributor Author

cme commented Jul 5, 2021

We can consider this fixed, right?

Awesome. That should take care of it yup! I'll leave you to close since I don't know what process you follow for testing and verification. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants