Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Vulkan deferred operation handling errors #1192

Conversation

mizhen
Copy link
Contributor

@mizhen mizhen commented Jul 11, 2023

The problem
when using deferred operation object in command vkcreateraytracingpipelines, there are two cases with the driver. One is driver proceed all workload in current thread without deferring this command, another case is deferring this command to other threads. The second case is not covered by existing handling. It cause some
title playback failure because reading wrong pipeline data and capture/playback shader group handle, it also cause trim playback failure because ray tracing pipeline creation failure.

The solution
The pull request is the fix for these errors. It added needed handling for deferred command and fixed related errors on pipeline reading, capture/playback shader group handle handling and trim deferred operation object handling.

Result
All tests passed for the target title and other applications include trim trace and full trace capture/playback, these test results show the pull-request fixed the target title playback issue caused by these deferred operation handling errors.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 4468.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 2958 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 2958 failed.

@mizhen mizhen force-pushed the ming-dev-vk-fix-deferred-operation-handling-errors branch from 3cf98ed to 7b11eaf Compare July 18, 2023 17:55
@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 8587.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 2993 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 2993 passed.

@mizhen mizhen force-pushed the ming-dev-vk-fix-deferred-operation-handling-errors branch from 7b11eaf to 15e0f69 Compare August 2, 2023 13:50
@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 16590.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3050 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3050 passed.

@@ -6155,7 +6161,33 @@ VkResult VulkanReplayConsumerBase::OverrideCreateRayTracingPipelinesKHR(
createInfoCount,
modified_create_infos.data(),
in_pAllocator,
out_pPipelines);
deferred_operation_info->replayPipelines.data());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deferred_operation_info->replayPipelines is resized when if (deferred_operation_info). It means if it doesn't use deferred operation, it will cause an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. I'll follow up with a commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 30721.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3137 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3137 passed.

framework/encode/custom_vulkan_api_call_encoders.h Outdated Show resolved Hide resolved
Comment on lines 625 to 647
void VulkanExportJsonConsumerBase::Process_vkCreateRayTracingPipelinesKHR(
const ApiCallInfo& call_info,
VkResult returnValue,
format::HandleId device,
format::HandleId deferredOperation,
format::HandleId pipelineCache,
uint32_t createInfoCount,
StructPointerDecoder<Decoded_VkRayTracingPipelineCreateInfoKHR>* pCreateInfos,
StructPointerDecoder<Decoded_VkAllocationCallbacks>* pAllocator,
HandlePointerDecoder<VkPipeline>* pPipelines)
{
nlohmann::ordered_json& jdata = WriteApiCallStart(call_info, "vkCreateRayTracingPipelinesKHR");
FieldToJson(jdata[NameReturn()], returnValue, json_options_);
auto& args = jdata[NameArgs()];
HandleToJson(args["device"], device, json_options_);
HandleToJson(args["deferredOperation"], deferredOperation, json_options_);
HandleToJson(args["pipelineCache"], pipelineCache, json_options_);
FieldToJson(args["createInfoCount"], createInfoCount, json_options_);
FieldToJson(args["pCreateInfos"], pCreateInfos, json_options_);
FieldToJson(args["pAllocator"], pAllocator, json_options_);
HandleToJson(args["pPipelines"], pPipelines, json_options_);
WriteBlockEnd();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this, it is the same as the default generated version. We don't want custom code where generated code can work. You probably added this because the default implementation disappeared when you added vkCreateRayTracingPipelinesKHR to vulkan_generators/blacklists.json. Each generator can have its own blocklisk.blacklist JSON file: they don't have to share this one.

Either give the JSON consumer its own blocklist or remove this function from the default blocklist in the JSON consumer header and body generators with a line like self.APICALL_BLACKLIST.remove('vkCreateRayTracingPipelinesKHR') (untested), probably at the botton of __init__() or the top of generate_feature().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated it accordingly, thanks.

StructPointerDecoder<Decoded_VkRayTracingPipelineCreateInfoKHR>* pCreateInfos,
StructPointerDecoder<Decoded_VkAllocationCallbacks>* pAllocator,
HandlePointerDecoder<VkPipeline>* pPipelines) override;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this copy of the generated version. See other comment on body of same module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated it accordingly, thanks.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 72512.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3436 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3436 failed.

@bradgrantham-lunarg
Copy link
Contributor

CI gfxreconstruct build # 3436 failed.

This failure is just a replay failure for an unknown struct in a vulkaninfo capture in our CI tests. The failure occurred because the PR branch needs to be rebased on dev to bring in support for the upgraded Vulkan structs that were added in recent SDKs.

@mizhen
Copy link
Contributor Author

mizhen commented Oct 30, 2023

CI gfxreconstruct build # 3436 failed.

This failure is just a replay failure for an unknown struct in a vulkaninfo capture in our CI tests. The failure occurred because the PR branch needs to be rebased on dev to bring in support for the upgraded Vulkan structs that were added in recent SDKs.

I'll rebase, thank you.

@bradgrantham-lunarg bradgrantham-lunarg added the P1 Prevents an important capture from being replayed label Oct 30, 2023
@mizhen mizhen force-pushed the ming-dev-vk-fix-deferred-operation-handling-errors branch from dfadf84 to 46813fd Compare October 31, 2023 17:19
@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 74963.

@mizhen
Copy link
Contributor Author

mizhen commented Nov 2, 2023

While trying to test this, I made a simple case based on Sascha Willem's raytracingbasic example that creates a VkDeferredOperationKHR, uses it with vkCreateRayTracingPipelinesKHR, joins the operation until complete, and then destroys the operation. Please see https://github.com/bradgrantham-lunarg/Vulkan/tree/brad-sascha-raytracingbasic-deferred-crtp.

Is this use of deferredoperation valid, in your opinion? If not, would you let me know why?

With this branch, GFXR capture at c46523a (dev at this moment) can capture and replay raytracingbasic without a problem, but your PR branch outputs this warning on capture:

[gfxrecon] WARNING - GetWrappedId() couldn't find Handle: 14829735431805717965's wrapper. It might have been destroyed

And then gfxrecon-replay from your PR branch fails with this assertion failure:

Assertion failed: (device_info != nullptr) && (pipeline_info != nullptr) && (pData != nullptr) && (pData->GetOutputPointer() != nullptr), file F:\gfxreconstruct\framework\decode\vulkan_replay_consumer_base.cpp, line 6598

Can you please address what might be happening there? Thank you!

EDIT: I tested on an "NVIDIA GeForce RTX 2070 SUPER"

I'm looking into it, will keep you updated, thanks.

deferred_operation_info->record_modified_create_infos.clear();
deferred_operation_info->record_modified_pgroups.clear();
deferred_operation_info->replayPipelines.resize(createInfoCount);
deferred_operation_info->capturePipelines.resize(createInfoCount);
memcpy(deferred_operation_info->capturePipelines.data(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use deferred_operation_info->capturePipelines->insert? t allows you to omit the sizeof and is more idiomatic C++. I realize there are uses of memcpy in this file but we'd like to move away from memcpy when possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if ((result != VK_OPERATION_DEFERRED_KHR) ||
(!device_wrapper->property_feature_info.feature_rayTracingPipelineShaderGroupHandleCaptureReplay))
{
// If the operation is not deferred by driver. or the system doesn't support
Copy link
Contributor

@bradgrantham-lunarg bradgrantham-lunarg Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, this block ends up being written twice by the code in this branch. I don't think that's what we want for this solution. If there has to be a deferred operation read of the output variable, we probably need a MetaData command to write that value, and then replay should handle the case of postponing creation of the VkPipeline handles until that MetaData command is written.

I'm confused about what's happening here. Is the problem that the application creates an RT pipeline and gets VK_OPERATION_DEFERRED_KHR then joins the deferred operation until success, then uses the pipeline handles but GFXR somehow writes the operations out of order so the pipeline handles are used in replay before their creation is complete?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a test case where the application creates an RT pipeline and gets VK_OPERATION_DEFERRED_KHR then joins the deferred operation until success, then uses the pipeline handles but GFXR somehow writes the operations out of order so the pipeline handles are used in replay before their creation is complete?

If that's what's happening, then I think I need to understand how that could happen and we may need to introduce synchronization around the pipeline handles and anything using those pipeline handles, and not create this block twice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe in steps how the problem occurs so I can understand how this branch is expected to the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The modified app is valid.

Yes, I think the hardware doesn't support the feature rayTracingPipelineShaderGroupHandleCaptureReplay, it triggers the issue that the above code block generates one more trace block. I'll add a commit for the issue. It may playback the trace of raytracingbasic as current dev branch. However, for this hardware, it's not guarantee playback always work, the reason is the following:

GFXR current framework requires the feature for capture/replay raytracing titles. It needs the related GPU address/ShaderGroupHandle capture/replay extension to reproduce these capture time GPU address/handle value during replay time. Without the feature (rayTracingPipelineShaderGroupHandleCaptureReplay) support, it's not guarantee for replaying raytracing trace file because shadergroup handle value may have different value during replay time, especially for complicated apps or real game, simple app and full trace file replay sometimes may work because the used GPU addresses/handles way less than real title.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 87408.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3536 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3536 passed.

@locke-lunarg
Copy link
Contributor

locke-lunarg commented Dec 1, 2023

Thank you, @mizhen , for fixing it. I didn't realize the TrackRayTracingShaderGroupHandles issue for the deferred case. This branch looks good to me. Could you please add a warning message for delaying writing the block of vkCreateRayTracingPipelinesKHR to the capture file? I think users need to know it.

One idea is maybe we could still write the block of vkCreateRayTracingPipelinesKHR to the capture file at original moment, only delay TrackRayTracingShaderGroupHandles. The pipeline handles that are written could be invalid. We could add a MetaData to write pipelines when the pipelines are ready. But it's complicated. It might not be worth. Delaying writing is good.

@mizhen
Copy link
Contributor Author

mizhen commented Dec 8, 2023

Thank you, @mizhen , for fixing it. I didn't realize the TrackRayTracingShaderGroupHandles issue for the deferred case. This branch looks good to me. Could you please add a warning message for delaying writing the block of vkCreateRayTracingPipelinesKHR to the capture file? I think users need to know it.

One idea is maybe we could still write the block of vkCreateRayTracingPipelinesKHR to the capture file at original moment, only delay TrackRayTracingShaderGroupHandles. The pipeline handles that are written could be invalid. We could add a MetaData to write pipelines when the pipelines are ready. But it's complicated. It might not be worth. Delaying writing is good.

Thank you for your review, I will make some changes.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 97935.

@mizhen
Copy link
Contributor Author

mizhen commented Dec 8, 2023

vkCreateRayTracingPipelinesKHR

Done.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3612 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3612 passed.

@locke-lunarg
Copy link
Contributor

@mizhen Thank you. Could you please rebase dev to fix conflicts? And then, it will be ready to merge.

@mizhen
Copy link
Contributor Author

mizhen commented Dec 11, 2023

@mizhen Thank you. Could you please rebase dev to fix conflicts? And then, it will be ready to merge.
OK, will do.

when using deferred operation object in command vkcreateraytracingpipelines
, driver has two choices. One is driver proceed all workload in current
thread within this command, another choice is deferring this command to
other threads. The commit added handling to the second choice and fixed
related errors on pipeline reading and capture/playback shader group handle
error.

Change-Id: Id4704502714b866211555d007e09dcefa69a4138
The deferred operation object info pointer will be nullptr on the condition
and the following access to this pointer will cause access error. The
commit fixed the issue.

Change-Id: Ia9e729f4cf083360801a7ffbefcef759e00efe79
Remove the condition of checking system feature
rayTracingPipelineShaderGroupHandleCaptureReplay which may trigger error
for some hardware/driver which support raytracing without the feature.
Add warning message for delaying writing the block of
vkCreateRayTracingPipelinesKHR to the capture file.
@mizhen mizhen force-pushed the ming-dev-vk-fix-deferred-operation-handling-errors branch from 5c4311c to adee9fb Compare December 15, 2023 20:17
@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 101804.

@mizhen
Copy link
Contributor Author

mizhen commented Dec 15, 2023

@mizhen Thank you. Could you please rebase dev to fix conflicts? And then, it will be ready to merge.
OK, will do.

Done.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3626 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3626 failed.

@locke-lunarg
Copy link
Contributor

CI gfxreconstruct build # 3629 passed. It's good to merge.

@locke-lunarg locke-lunarg merged commit af796d3 into LunarG:dev Dec 18, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 Prevents an important capture from being replayed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants