
Avoid calling useResource on resources in argument buffers #2402

Open · wants to merge 6 commits into base: main

Conversation

@js6i (Collaborator) commented Dec 3, 2024

This PR implements execution barriers with Metal fences and puts all resources in a residency set, avoiding the need to call useResource: on every resource in bound argument buffers. That makes it possible to efficiently run programs that use descriptor indexing with large descriptor tables.
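For readers unfamiliar with the macOS 15 residency-set API, here is a minimal sketch of the overall idea, not MoltenVK's actual code (the function name and the buffer list are illustrative assumptions): create one set on the device, add every allocation to it, commit, and attach the set to the command queue so that per-encoder useResource: calls are no longer needed.

#import <Metal/Metal.h>

// Hypothetical sketch: one device-wide residency set attached to the command queue.
static id<MTLResidencySet> makeGlobalResidencySet(id<MTLDevice> device,
                                                  id<MTLCommandQueue> queue,
                                                  NSArray<id<MTLBuffer>> *buffers) {
    NSError *err = nil;
    MTLResidencySetDescriptor *desc = [MTLResidencySetDescriptor new];
    desc.label = @"Global residency set";
    id<MTLResidencySet> set = [device newResidencySetWithDescriptor:desc error:&err];
    if (!set) { NSLog(@"Failed to create residency set: %@", err); return nil; }

    for (id<MTLBuffer> buf in buffers) {
        [set addAllocation:buf];     // MTLResource conforms to MTLAllocation
    }
    [set commit];                    // additions take effect only after commit
    [queue addResidencySet:set];     // resources stay resident for all work on this queue
    return set;
}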

Consider a pipeline executing some render passes with a couple of vertex-to-fragment barriers:

1 2   3 4   5 6
v v B v v B v v
f f B f f B f f

Here v and f symbolize the vertex and fragment stages of a render pass, and B stands for the barrier.
In this example, stages v1 and v2 need to run before f3..6, and v1..4 before f5 and f6.

To implement this, I maintain a set of fences that will be waited on before each stage and updated after it. Here's a diagram with the fences a and b placed before the stage symbol when waited on, and after it when updated:

1  2     3   4     5  6
va va B avb avb B av av
f  f  B af  af  B bf bf

Here v1 updates fence a, v4 waits for a and updates b, f4 waits for a, etc.

Note that the synchronization is a little stronger than the original: v3..6 are forced to execute after v1 and v2. This is for practical reasons: I want to keep a constant, limited set of fences active, only wait for one fence per stage pair, and only update one fence per stage.
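To make the fence placement concrete, here is a hypothetical sketch (names are illustrative, not the PR's actual MVKCommandEncoder code) of how a single render pass from the diagram could encode its fence traffic. For pass 3 above, vertexWaits = {a}, vertexUpdate = b, fragmentWaits = {a}, and fragmentUpdate = nil.

#import <Metal/Metal.h>

static void encodeRenderPassFences(id<MTLRenderCommandEncoder> enc,
                                   NSArray<id<MTLFence>> *vertexWaits,
                                   id<MTLFence> vertexUpdate,
                                   NSArray<id<MTLFence>> *fragmentWaits,
                                   id<MTLFence> fragmentUpdate) {
    // Waits go in before the pass's draws, scoped to the stage that depends on them.
    for (id<MTLFence> f in vertexWaits)
        [enc waitForFence:f beforeStages:MTLRenderStageVertex];
    for (id<MTLFence> f in fragmentWaits)
        [enc waitForFence:f beforeStages:MTLRenderStageFragment];

    // ... encode the pass's draw calls here ...

    // Updates are signalled once the named stage of the pass's work has finished.
    if (vertexUpdate)   [enc updateFence:vertexUpdate   afterStages:MTLRenderStageVertex];
    if (fragmentUpdate) [enc updateFence:fragmentUpdate afterStages:MTLRenderStageFragment];
}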

There are some things that could be improved here:

  • Keep the number of fences in flight more limited and reuse them, at the potential cost of incurring extra synchronization.
  • Don't add so many release handlers. I am quite defensive with retain/release here, but doing any less caused use-after-free errors. I think it should be possible to do better, though, or at least batch the releases into a single handler.
  • I think the fences should be assigned per queue, not per device, and I'm a bit worried about using fences across queues. I don't think we want to rely on knowing which queue we'll be executing on in order to encode, though.

Comment on lines 4795 to 4796
@synchronized (_physicalDevice->getMTLDevice()) {
for (auto fence: _activeBarriers[stage]) {
Contributor:

Vulkan barriers run in submission order, so the fact that this is on MVKDevice (and requires synchronization) worries me.
Have you tested what happens if, e.g., you encode command buffers in immediate mode and then submit them in the opposite order from the one you encoded them in? Yes, it won't crash thanks to the @synchronized, but the fact that this is in a place that requires synchronization at all means that two threads could fight over the _activeBarriers list and probably do unexpected (but non-crashy) things.

Also, any reason you're retaining and releasing all the fences? Don't they live as long as the MVKDevice (which according to Vulkan should outlive any active work on it)?

Collaborator Author:

Right, that's a good point about keeping the fences there, in addition to the multiple queue problem.

Maybe I could avoid requiring encoding to happen only after submission (which would let us keep fences on MVKQueue) by keeping most fences local to the command buffer and doing some boundary trick to synchronize between submissions on the queue. I'm not sure what that trick is yet.

The fences are currently only supposed to live as long as the last command buffer that uses them. When one gets removed from all wait/update slots, the only references left are those attached to the command buffer. It sure is more retaining and releasing than I originally expected, so I might just pull the trigger and keep a fixed number of reusable fences.

Contributor:

One possibility is to make sure the last group in a submission always updates a known fence, and then always start with waiting on that fence on new submissions:

 1   2     3   4     5   6
avb avb B bvc bvc B cva cva
 f   f  B bf  bf  B cf  cf

(And if you go the reusable fence route, just have everyone use the same array of fences. Always start at index 0, and update index 0 at the end of a submission. Note that fences in Metal, like barriers in Vulkan, also work in submission order, so the worst that could happen using the same fences across multiple encoders at once is more synchronization than you wanted, but assuming you don't mix fences for different pipeline stages, I don't think that will be a big issue.)
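A rough sketch of that reusable-fence suggestion (the pool size, the struct, and the use of a small blit encoder at the submission boundary are all illustrative assumptions, not actual MoltenVK code): slot 0 is the hand-off fence between submissions, and the remaining slots are handed out to barriers within a submission.

enum { kFencesPerStage = 8 };    // arbitrary example size for the fixed, reusable pool

struct StageFencePool {
    id<MTLFence> fences[kFencesPerStage];
    uint32_t     next = 1;       // slots 1..N-1 are handed out to barriers within a submission

    // At the start of a submission, pick up whatever the previous submission signalled on slot 0.
    void beginSubmission(id<MTLBlitCommandEncoder> enc) {
        [enc waitForFence:fences[0]];
        next = 1;
    }
    // A barrier inside the submission grabs a fresh slot, wrapping (and over-synchronizing
    // slightly) if the pool runs out.
    id<MTLFence> fenceForNewBarrier() {
        id<MTLFence> f = fences[next];
        next = (next + 1 < kFencesPerStage) ? next + 1 : 1;
        return f;
    }
    // At the end of a submission, funnel everything back into slot 0 for the next submission.
    void endSubmission(id<MTLBlitCommandEncoder> enc) {
        [enc updateFence:fences[0]];
    }
};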

@billhollings (Contributor):

Since there are a few design and implementation points under discussion, I've moved this to WIP.

@billhollings billhollings changed the title Avoid calling useResource on resources in argument buffers WIP: Avoid calling useResource on resources in argument buffers Dec 10, 2024
Comment on lines 4811 to 4812
// Initialize fences for execution barriers
for (auto &stage: _barrierFences) for (auto &fence: stage) fence = [_physicalDevice->getMTLDevice() newFence];
Contributor:

Could you give the fences labels like [fence setLabel:[NSString stringWithFormat:@"%s Fence %d", stageName(stage), idx]]? Would be very convenient for debugging.
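For illustration, the labelled initialization might look roughly like this (kMVKBarrierStageCount, kMVKBarrierFenceCount, and stageName() are assumed names based on the snippet above, not verified against the final PR):

for (int stage = 0; stage < kMVKBarrierStageCount; stage++) {
    for (int idx = 0; idx < kMVKBarrierFenceCount; idx++) {
        id<MTLFence> fence = [_physicalDevice->getMTLDevice() newFence];
        [fence setLabel:[NSString stringWithFormat:@"%s Fence %d", stageName(stage), idx]];
        _barrierFences[stage][idx] = fence;
    }
}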

Collaborator Author:

Sure, pushed it.

@js6i (Collaborator Author) commented Dec 17, 2024

Note that I removed the host stage; I don't think it needs to be explicit, but there probably should be some waits in applyMemoryBarrier and applyBufferMemoryBarrier before synchronizeResource (hence still WIP).
I don't think pullFromDevice needs any, as callers require the client to sync with the device in some other way, which I think is sufficient?

@etang-cw (Contributor):

Note that I removed the host stage, I don't think it needs to be explicit

My understanding is that Metal guarantees memory coherency once you're able to observe that an operation has completed (e.g. through a shared event or by checking the completed status of a command buffer), so I think this is correct, since you'd need to do the same even with the host memory barrier in Vulkan.

Some old Metal docs:

Similarly, after the MTLDevice object executes a MTLCommandBuffer object, the host CPU is only guaranteed to observe any changes the MTLDevice object makes to the storage allocation of any resource referenced by that command buffer if the command buffer has completed execution (that is, the status property of the MTLCommandBuffer object is MTLCommandBufferStatusCompleted).
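A minimal sketch of the coherency point being quoted (the buffer contents and handler are illustrative): the CPU only reads results from a shared buffer after it has observed that the command buffer completed.

#import <Metal/Metal.h>

static void readBackWhenDone(id<MTLCommandBuffer> cmdBuf, id<MTLBuffer> sharedBuffer) {
    [cmdBuf addCompletedHandler:^(id<MTLCommandBuffer> cb) {
        if (cb.status == MTLCommandBufferStatusCompleted) {
            const float *results = (const float *)sharedBuffer.contents;
            NSLog(@"first result: %f", results[0]);   // safe: completion has been observed
        }
    }];
    [cmdBuf commit];
    // Alternatively, [cmdBuf waitUntilCompleted] gives the same guarantee synchronously.
}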

@js6i js6i force-pushed the barriers branch 2 times, most recently from c9ed102 to edaefc8 Compare December 19, 2024 17:12
@js6i (Collaborator Author) commented Dec 19, 2024

Alright, my concern with synchronizeResource memory barriers seems moot, as it's only relevant on non-Apple devices, which don't support residency sets anyway.

@js6i js6i changed the title WIP: Avoid calling useResource on resources in argument buffers Avoid calling useResource on resources in argument buffers Dec 19, 2024
@billhollings (Contributor):

@js6i I see you've removed the WIP tag. Is this PR ready for overall review and merging?

@js6i (Collaborator Author) commented Dec 31, 2024

@js6i I see you've removed the WIP tag. Is this PR ready for overall review and merging?

Yes, I meant to submit it for review.

@billhollings billhollings requested a review from cdavis5e January 23, 2025 03:20
@billhollings (Contributor) left a comment:

Thanks for submitting this!

I don't necessarily have any required changes (most of the changes I've recommended are style-related).

However, I do have some significant design and behaviour questions that I'd like to get responses to before pulling this PR in.

I have run a full CTS on this PR, and it seems to be behaving well.

MoltenVK/MoltenVK/GPUObjects/MVKDevice.h (comment resolved; outdated)
MoltenVK/MoltenVK/GPUObjects/MVKDevice.h (comment resolved)
@@ -339,6 +339,7 @@
// Retrieves and initializes the Metal command queue and Xcode GPU capture scopes
void MVKQueue::initMTLCommandQueue() {
_mtlQueue = _queueFamily->getMTLCommandQueue(_index); // not retained (cached in queue family)
_device->addResidencySet(_mtlQueue);
Contributor:

Huh.

When I was contemplating how to add residency sets into the flow, I was thinking of attaching one to each descriptor set, and then adding it to the MTLCommandBuffer when the corresponding Metal argument buffer was used. Basically, one Vulkan descriptor set = one Metal argument buffer = one MTLResidencySet.

I see the Metal docs make some noise about not flipping residency sets in and out willy-nilly, but this is going to the opposite extreme, where we're basically requesting everything resident all the time. I'm amazed that is even possible. And if it is, why does Metal bother getting us to make resources resident at all? Why not just hide it all away under Metal's own management?

I can't see any guidance in Metal docs about not doing it this way, but do we know what kind of under-the-cover gymnastics Metal has to do to swap what's really resident on the GPU, compared to the entirety of all resources in the app? I'm a little concerned that under the covers, there are going to be constant GPU residency cache hits.

I'm sure it's much better than potentially tens of thousands of calls to useResource:, but I'm curious: have we run a sizeable performance comparison of using residency sets this way?

Collaborator:

And if it is, why does Metal bother getting us to make resources resident at all? Why not just hide it all away under Metal's own management.

The new hotness is giving the engine developer complete control over everything. Compare Direct3D 12 and ID3D12Device::MakeResident()/Evict().

Collaborator:

And I have to second Bill's concerns about the performance implications of essentially forcing Metal to juggle many thousands of resources just to satisfy the residency requirement--assuming it will even let us do this. Not to mention the possibility that this could cause an unrecoverable GPU page fault at a critical juncture...

Collaborator Author:

As for how that performs, I ran Diablo 4 and Diablo 2 Resurrected (with descriptor indexing heaps) on an M1 Air using this code, and it was fine, so I'm not too worried about the practicality of this solution, given that the alternative is often better measured in seconds per frame (and yes, I tried optimizing to useResource: only as few things as needed). I'll compare some games that don't use descriptor indexing to see what difference it makes and report back.

I could try keeping things per-descriptor-set as you say; that should work for descriptor sets and may be worth implementing, but looking at VK_EXT_descriptor_buffer, it seems that the Vulkan model is in fact that all (non-sparse?) things are resident (and we should be able to support that too).

Collaborator Author:

I checked a couple of D3D11-on-wined3d-on-MoltenVK games with Metal argument buffers and do not see any negative impact, maybe even a slight performance increase with these patches. They are strongly CPU-bound, though; it would be interesting to see how something that stresses the GPU more is affected.

Contributor:

I could try keeping things per-descriptor set as you say, that should work for descriptor sets and may be worth implementing, but looking at VK_EXT_descriptor_buffer it seems that the Vulkan model is in fact that all (non-sparse?) things are resident (and we should be able to support that too).

From my reading of VK_EXT_descriptor_buffer, it seems to map exactly to a Metal argument buffer, especially under Metal 3. That would seem to align with the design I discussed above, where one Metal argument buffer = one Metal residency set (but minus the mapping to one descriptor set).

In MoltenVK then, perhaps MVKMetalArgumentBuffer could also hold the corresponding Metal residency set, and as resources are added and removed from the Metal argument buffer, they are also added and removed from the attached residency set. Then in MVKResourcesCommandEncoderState::encodeMetalArgumentBuffer(), when the Metal argument buffer is bound, so would the corresponding Metal residency set be added to the MTLCommandBuffer.

When VK_EXT_descriptor_buffer comes along, perhaps an MVKBuffer with VK_BUFFER_USAGE_RESOURCE_DESCRIPTOR_BUFFER_BIT_EXT enabled could track a MVKMetalArgumentBuffer. Unfortunately, since the app directly copies memory into the Metal argument buffer data pointer via vkGetDescriptorEXT(), MoltenVK wouldn't know which Metal argument buffer is being used for what resources, and therefore, which Metal residency set to add the resource to. That would seem to be a problem that will need resolution, and maybe we'd end up back here with making all resources resident at all times, since we'd have no way of knowing that the app had mem-copied into the descriptor buffer. Damn. I'd love to figure out how VK_EXT_descriptor_buffer deals with this residency issue, given that the app could be mem-copying anything into any descriptor buffer.

Please give that approach some thought. It definitely feels more encapsulated, and aligns resource residency with resource use better.
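A hypothetical sketch of that per-argument-buffer alternative (the class and its methods are invented for illustration; they are not proposed MoltenVK API): each argument buffer owns a residency set tracking exactly the resources written into it, and binding the argument buffer also attaches that set to the MTLCommandBuffer.

class ArgBufferResidency {
public:
    ArgBufferResidency(id<MTLDevice> device) {
        MTLResidencySetDescriptor *desc = [MTLResidencySetDescriptor new];
        desc.label = @"Argument buffer residency";
        _set = [device newResidencySetWithDescriptor:desc error:nil];
    }
    // Mirror descriptor writes into the residency set.
    void setResource(id<MTLResource> oldRes, id<MTLResource> newRes) {
        if (oldRes) [_set removeAllocation:oldRes];
        if (newRes) [_set addAllocation:newRes];
        _dirty = true;
    }
    // Called when the argument buffer is bound for encoding.
    void bind(id<MTLCommandBuffer> cmdBuf) {
        if (_dirty) { [_set commit]; _dirty = false; }
        [cmdBuf useResidencySet:_set];   // only this set's resources are made resident
    }
private:
    id<MTLResidencySet> _set = nil;
    bool _dirty = false;
};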

In the end, if it looks like it would be a large amount of effort to approach it that way, since you've done a fair bit of testing (and my CTS run), we could pull in your device-level residency implementation, and see if we hit any problems in the wild with its all-or-nothing approach, and then optimize at that point.

Contributor:

Damn. I'd love to figure out how VK_EXT_descriptor_buffer deals with this residency issue, given that the app could be mem-copying anything into any descriptor buffer.

As Jan said, "it seems that the Vulkan model is in fact that all (non-sparse?) things are resident". Vulkan assumes that all resources are available to the GPU at all times. No residency management is required, because everything is always resident. Games (and vkd3d) are built around the performance characteristics of operations that this implies. It's been this way since at least descriptor indexing, and our attempts at scanning entire descriptor sets for stuff have done terribly. I'd like to move away from reference-counted descriptors that can't use memcpy to copy descriptor sets, not perpetuate them.

Collaborator Author:

In the end, if it looks like it would be a large amount of effort to approach it that way, since you've done a fair bit of testing (and my CTS run), we could pull in your device-level residency implementation, and see if we hit any problems in the wild with its all-or-nothing approach, and then optimize at that point.

I'd suggest doing that; it's unclear if there are benefits to splitting the sets. Annoyingly, from a quick test, it looks like they don't reference-count added allocations (if you add twice and remove once, it's out), so we'd have to do that ourselves.
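A sketch of the reference counting that would have to be layered on top (illustrative only, assuming the macOS 15 MTLResidencySet API): the set keeps an allocation only while our own count for it is non-zero.

#import <Metal/Metal.h>
#include <unordered_map>

struct RefCountedResidency {
    id<MTLResidencySet> set;
    std::unordered_map<void*, uint32_t> counts;   // keyed by the allocation pointer

    void retainAllocation(id<MTLAllocation> a) {
        if (counts[(__bridge void*)a]++ == 0) [set addAllocation:a];
    }
    void releaseAllocation(id<MTLAllocation> a) {
        auto it = counts.find((__bridge void*)a);
        if (it == counts.end()) return;
        if (--it->second == 0) { [set removeAllocation:a]; counts.erase(it); }
    }
};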

MoltenVK/MoltenVK/Commands/MVKCommandBuffer.mm (comment resolved; outdated)
finishQueries();

// Synchronize all stages to their fences at index 0, which will be waited on in the next command buffer.
if (isUsingMetalArgumentBuffers()) {
Contributor:

If MVK_CONFIG_PREFILL_METAL_COMMAND_BUFFERS_STYLE_NO_PREFILL is enabled (the very widely used default case), there will only be one MTLCommandBuffer per queue submission, even though that queue submission might have many (I've seen hundreds sometimes) of Vulkan command buffers.

In that case, "next command buffer" here does not mean another MTLCommandBuffer. Does the waiting and updating in this code here have meaning in that scenario? Is it doing anything? Is it necessary?

Collaborator:

I wonder if he meant "next command encoder." IIRC fences are always manipulated on encoder boundaries, regardless of where and when the calls happen; waits happen at the beginning of an encoder, and updates happen at the end.

Collaborator Author:

If MVK_CONFIG_PREFILL_METAL_COMMAND_BUFFERS_STYLE_NO_PREFILL is enabled (the very widely used default case), there will only be one MTLCommandBuffer per queue submission, even though that queue submission might have many (I've seen hundreds sometimes) of Vulkan command buffers.

In that case, "next command buffer" here does not mean another MTLCommandBuffer. Does the waiting and updating in this code here have meaning in that scenario? Is it doing anything? Is it necessary?

That's right, this part is relevant between Metal command buffers and introduces superfluous synchronization in the case you mention (only within stages, though, not between them). I did not want to require deferring Metal encoding to the point of queue submission (by e.g. keeping fence slot indices there), so each MVKCommandEncoder/Vulkan command buffer has its own set of fence indices that it uses, with the boundaries synchronizing to the fences at index 0.

It would be possible, and a good idea, to optimize the no-prefill case by passing the fence slots from the previous MVKCommandEncoder to the next, or something to that effect, so it can continue using them.

Collaborator Author:

Alright, now I'm keeping the current state of which fences we're using in MVKCommandEncodingContext. I think that lives as long as our knowledge of what order things are submitted in.
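A hypothetical sketch of what keeping this state in the encoding context buys (field and constant names are invented, not the PR's actual MVKCommandEncodingContext members): consecutive Vulkan command buffers encoded into the same submission keep advancing the same fence slots instead of all funneling through slot 0 at every command-buffer boundary.

static const int kBarrierStageCount = 3;   // e.g. vertex, fragment, compute (assumed)

struct EncodingContextFenceState {
    int waitSlot[kBarrierStageCount]   = {0, 0, 0};  // slot each stage currently waits on
    int updateSlot[kBarrierStageCount] = {0, 0, 0};  // slot each stage will update next
};
// Each MVKCommandEncoder in the submission reads and advances this state rather than
// resetting to slot 0.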

@cdavis5e (Collaborator) left a comment:

An alternative for consideration: -[MTLRenderCommandEncoder useResources:count:usage:stages:] (Note the s.) This also reduces the overhead of calling -useResource:usage:stages: thousands of times, and has the advantage of working prior to macOS 15.
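For reference, a small usage sketch of that batched call (the resource list, usage flags, and stage mask are illustrative): one call per encoder covers a whole descriptor heap's worth of resources.

#import <Metal/Metal.h>

static void makeResourcesResident(id<MTLRenderCommandEncoder> enc,
                                  const id<MTLResource> *resources, NSUInteger count) {
    [enc useResources:resources
                count:count
                usage:MTLResourceUsageRead
               stages:MTLRenderStageVertex | MTLRenderStageFragment];
}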

Comment on lines +405 to +433
#pragma mark Barriers

/** Encode waits in the current command encoder for the stage that corresponds to given use. */
void encodeBarrierWaits(MVKCommandUse use);

/** Update fences for the currently executing pipeline stage. */
void encodeBarrierUpdates();

/** Insert a new execution barrier */
void setBarrier(uint64_t sourceStageMask, uint64_t destStageMask);

/** Encode waits for a specific stage in given encoder. */
void barrierWait(MVKBarrierStage stage, id<MTLRenderCommandEncoder> mtlEncoder, MTLRenderStages beforeStages);
void barrierWait(MVKBarrierStage stage, id<MTLBlitCommandEncoder> mtlEncoder);
void barrierWait(MVKBarrierStage stage, id<MTLComputeCommandEncoder> mtlEncoder);

/** Encode update for a specific stage in given encoder. */
void barrierUpdate(MVKBarrierStage stage, id<MTLRenderCommandEncoder> mtlEncoder, MTLRenderStages afterStages);
void barrierUpdate(MVKBarrierStage stage, id<MTLBlitCommandEncoder> mtlEncoder);
void barrierUpdate(MVKBarrierStage stage, id<MTLComputeCommandEncoder> mtlEncoder);

Collaborator:

Just making a note that we should explore the possibility of reimplementing VkEvents on top of these...


This lets us share them between command buffers if the encoding style allows for it, avoiding superfluous synchronization.