gfxrecon-replay: Implement retry if replayed api call times out #1220

davidlunarg · 2023-08-07T17:44:49Z

If any of the following api calls return VK_TIMEOUT during replay, they are retried with a 1 second timeout value:

vkAcquireNextImageKHR
vkAcquireNextImageKHR2
vkWaitForFences
vkWaitSemaphores
vkWaitForPresentKHR
vkAcquireProfilingLockKHR

The retry can be skipped by specifying "--no-retry-on-timeout" on the command line invoking gfxrecon-replay.

ci-tester-lunarg · 2023-08-07T17:44:53Z

CI gfxreconstruct build queued with queue ID 19321.

ci-tester-lunarg · 2023-08-07T17:45:01Z

CI gfxreconstruct build # 3058 running.

ci-tester-lunarg · 2023-08-07T18:34:15Z

CI gfxreconstruct build # 3058 passed.

If any of the following api calls return VK_TIMEOUT during replay, they are retried with a 1 second timeout value: vkAcquireNextImageKHR vkAcquireNextImageKHR2 vkWaitForFences vkWaitSemaphores vkWaitForPresentKHR vkAcquireProfilingLockKHR The retry can be skipped by specifying "--no-retry-on-timeout" on the command line invoking gfxrecon-replay.

ci-tester-lunarg · 2023-08-07T20:26:50Z

CI gfxreconstruct build queued with queue ID 19418.

ci-tester-lunarg · 2023-08-07T20:30:01Z

CI gfxreconstruct build # 3059 running.

ci-tester-lunarg · 2023-08-07T21:16:32Z

CI gfxreconstruct build # 3059 passed.

andrew-lunarg · 2023-08-09T15:51:18Z

framework/decode/vulkan_replay_consumer_base.cpp

-    if (original_result == VK_SUCCESS)
-    {
-        // Ensure that wait for fences waits until the fences have been signaled (or error occurs) by changing the
-        // timeout to UINT64_MAX.


Our grandchildren would still have been waiting :-O.

Still appears to me a better outcome than trying to continue with a broken synchronization after just one second, as this can result in silently giving bad data. Also, one second might seem like an eternity to developers working on highend GPUs in silicon, but eg for those of us who work on simulators where a single frame can take days to process, it is not. Please reconsider.

per-mathisen-arm

Not sure if I understand the motivation behind this change in general and it seems to lead to some wrong things happening. Also OverrideGetFenceStatus is still looping infinitely even with this option, which seems counterintuitive, I think. I would actually want to see the spinlock in OverrideGetFenceStatus replaced by a wait call to vkWaitForFences, as this will have less negative effect on a CPU overhead analysis.

per-mathisen-arm · 2023-08-11T12:40:45Z

framework/decode/vulkan_replay_consumer_base.cpp

-    if (original_result == VK_SUCCESS)
-    {
-        // Ensure that wait for fences waits until the fences have been signaled (or error occurs) by changing the
-        // timeout to UINT64_MAX.


Still appears to me a better outcome than trying to continue with a broken synchronization after just one second, as this can result in silently giving bad data. Also, one second might seem like an eternity to developers working on highend GPUs in silicon, but eg for those of us who work on simulators where a single frame can take days to process, it is not. Please reconsider.

per-mathisen-arm · 2023-08-11T12:43:58Z

USAGE_android.md

@@ -806,6 +807,10 @@ optional arguments:
                        setup for replay. The default without this option is to use a Virtual Swapchain
                        of images which match the swapchain in effect at capture time and which are
                        copied to the underlying swapchain of the implementation being replayed on.
+  --no-retry-on-timeout By default, Vulkan api calls that have a timeout parameter


What is the use case where you want to prioritize reproducing the exact timeout values rather than correct synchronizations of the original content?

It may be a good idea to rename this to something like --honor-captured-timeouts so it's clear we're not trying to change the synchronization behavior.

davidlunarg · 2023-08-14T20:34:26Z

@per-mathisen-arm : Responding to all your comments above in one response...

This change came about because we discovered at least one game title that called vkAcquireNextImage with a timeout of 0. If an image was not yet available, gfxrecon-replay would generate an error while replaying that API call, and subsequent api calls would fail because vkAcquireNextImage did not return a valid image.

A app may use a timeout of 0 and not get a VK_TIMEOUT because it did other activities before the call to vkAcquireNextImage such that a timeout does not occur when it is called from the app. gfxrecon-replay doesn't have this delay in calling vkAcquireNextImage, so when it is called, it times out. (Yes, one could argue that this is a bug in the app, but the reality is apps will do such things and it seems that gfxrecon-replay should try to handle that situation.)

Another scenario might be where an app calls vkAcquireNextImage with a timeout of 0, and if the call returns VK_TIMEOUT because an image is not yet available, it might try again. If the first call does not timeout when the trace is captured, only one vkAcquireNextImage call is captured and on replay it is expected that the return value will be success.

The one second timeout was chosen because it is is a very long time for a GPU/CPU, but should the user of gfxrecon-replay encounter a timeout in one of these api calls, a delay of one second before the error is reported is a very reasonable amount of time to see the error. An alternative might be the maximum int64 value, effectively forever, but that would hang the replay program. Would a different timeout value be preferable?

The reason the --no-retry-on-timeout is included in thei PR is just in case a user of gfxrecon-replay really does want API calls that timeout when they are replayed to return an error. For example, the retry might be disabled so as to detect a difference in timing in the Vulkan implementation in the driver/GPU.

The implementation of OverrideGetFenceStatus definitely includes a busy wait, consuming CPU cycles. Yes it might make sense to change it to a call to vkWaitForFences. Could you you submit an issue on it or possibly a pull request?

bradgrantham-lunarg · 2023-08-14T21:40:17Z

USAGE_android.md

@@ -806,6 +807,10 @@ optional arguments:
                        setup for replay. The default without this option is to use a Virtual Swapchain
                        of images which match the swapchain in effect at capture time and which are
                        copied to the underlying swapchain of the implementation being replayed on.
+  --no-retry-on-timeout By default, Vulkan api calls that have a timeout parameter


It may be a good idea to rename this to something like --honor-captured-timeouts so it's clear we're not trying to change the synchronization behavior.

bradgrantham-lunarg · 2023-08-14T21:43:49Z

USAGE_desktop_D3D12.md

@@ -189,7 +189,7 @@ Usage:
                        [-m <mode> | --memory-translation <mode>]
                        [--fw <width,height> | --force-windowed <width,height>]
                        [--log-level <level>] [--log-file <file>] [--log-debugview]
-                        [--api <api>] <file>
+                        [--api <api>] [--no-retry-on-timeout] <file>


Does this PR implement any changes in timeouts for D3D12?

bradgrantham-lunarg · 2023-08-14T21:51:02Z

framework/decode/vulkan_replay_consumer_base.cpp

@@ -5174,6 +5178,22 @@ VkResult VulkanReplayConsumerBase::OverrideAcquireNextImageKHR(PFN_vkAcquireNext
            result = swapchain_->AcquireNextImageKHR(
                func, device_info, swapchain_info, timeout, semaphore_info, fence_info, captured_index, replay_index);

+            // Retry if replay result is VK_TIMEOUT
+            if (result == VK_TIMEOUT && !options_.no_retry_on_timeout && original_result == VK_SUCCESS &&
+                timeout < kNanosPerSecond)


Why also timeout < kNanosPerSecond in this if?

davidlunarg · 2023-08-22T22:16:33Z

Will abandon this pull request. It will be split in two replaced by two new pull requests - one that addresses timeouts in vkAcquireNextImageKHR/vkAcquireNextImageKHR2, and another that addresses timeouts in the remaining api calls that specify a timeout.

davidlunarg force-pushed the davidp_acquire_timeout branch from 61a5384 to 949b1d5 Compare August 7, 2023 20:26

davidlunarg requested review from bradgrantham-lunarg, andrew-lunarg and locke-lunarg August 7, 2023 21:21

andrew-lunarg reviewed Aug 9, 2023

View reviewed changes

per-mathisen-arm reviewed Aug 11, 2023

View reviewed changes

bradgrantham-lunarg requested changes Aug 14, 2023

View reviewed changes

davidlunarg closed this Aug 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gfxrecon-replay: Implement retry if replayed api call times out #1220

gfxrecon-replay: Implement retry if replayed api call times out #1220

davidlunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

andrew-lunarg Aug 9, 2023

per-mathisen-arm Aug 11, 2023

per-mathisen-arm left a comment

per-mathisen-arm Aug 11, 2023

per-mathisen-arm Aug 11, 2023

bradgrantham-lunarg Aug 14, 2023

davidlunarg commented Aug 14, 2023

bradgrantham-lunarg Aug 14, 2023

bradgrantham-lunarg Aug 14, 2023

bradgrantham-lunarg Aug 14, 2023

davidlunarg commented Aug 22, 2023

gfxrecon-replay: Implement retry if replayed api call times out #1220

gfxrecon-replay: Implement retry if replayed api call times out #1220

Conversation

davidlunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

ci-tester-lunarg commented Aug 7, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

per-mathisen-arm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davidlunarg commented Aug 14, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davidlunarg commented Aug 22, 2023