Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gfxrecon-replay: Implement retry if replayed api call times out #1220

Closed
wants to merge 1 commit into from

Conversation

davidlunarg
Copy link
Contributor

If any of the following api calls return VK_TIMEOUT during replay, they are retried with a 1 second timeout value:

vkAcquireNextImageKHR
vkAcquireNextImageKHR2
vkWaitForFences
vkWaitSemaphores
vkWaitForPresentKHR
vkAcquireProfilingLockKHR

The retry can be skipped by specifying "--no-retry-on-timeout" on the command line invoking gfxrecon-replay.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 19321.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3058 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3058 passed.

If any of the following api calls return VK_TIMEOUT during replay,
they are retried with a 1 second timeout value:

    vkAcquireNextImageKHR
    vkAcquireNextImageKHR2
    vkWaitForFences
    vkWaitSemaphores
    vkWaitForPresentKHR
    vkAcquireProfilingLockKHR

The retry can be skipped by specifying "--no-retry-on-timeout" on
the command line invoking gfxrecon-replay.
@davidlunarg davidlunarg force-pushed the davidp_acquire_timeout branch from 61a5384 to 949b1d5 Compare August 7, 2023 20:26
@ci-tester-lunarg
Copy link

CI gfxreconstruct build queued with queue ID 19418.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3059 running.

@ci-tester-lunarg
Copy link

CI gfxreconstruct build # 3059 passed.

if (original_result == VK_SUCCESS)
{
// Ensure that wait for fences waits until the fences have been signaled (or error occurs) by changing the
// timeout to UINT64_MAX.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our grandchildren would still have been waiting :-O.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still appears to me a better outcome than trying to continue with a broken synchronization after just one second, as this can result in silently giving bad data. Also, one second might seem like an eternity to developers working on highend GPUs in silicon, but eg for those of us who work on simulators where a single frame can take days to process, it is not. Please reconsider.

Copy link
Contributor

@per-mathisen-arm per-mathisen-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I understand the motivation behind this change in general and it seems to lead to some wrong things happening. Also OverrideGetFenceStatus is still looping infinitely even with this option, which seems counterintuitive, I think. I would actually want to see the spinlock in OverrideGetFenceStatus replaced by a wait call to vkWaitForFences, as this will have less negative effect on a CPU overhead analysis.

if (original_result == VK_SUCCESS)
{
// Ensure that wait for fences waits until the fences have been signaled (or error occurs) by changing the
// timeout to UINT64_MAX.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still appears to me a better outcome than trying to continue with a broken synchronization after just one second, as this can result in silently giving bad data. Also, one second might seem like an eternity to developers working on highend GPUs in silicon, but eg for those of us who work on simulators where a single frame can take days to process, it is not. Please reconsider.

@@ -806,6 +807,10 @@ optional arguments:
setup for replay. The default without this option is to use a Virtual Swapchain
of images which match the swapchain in effect at capture time and which are
copied to the underlying swapchain of the implementation being replayed on.
--no-retry-on-timeout By default, Vulkan api calls that have a timeout parameter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use case where you want to prioritize reproducing the exact timeout values rather than correct synchronizations of the original content?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be a good idea to rename this to something like --honor-captured-timeouts so it's clear we're not trying to change the synchronization behavior.

@davidlunarg
Copy link
Contributor Author

@per-mathisen-arm : Responding to all your comments above in one response...

This change came about because we discovered at least one game title that called vkAcquireNextImage with a timeout of 0. If an image was not yet available, gfxrecon-replay would generate an error while replaying that API call, and subsequent api calls would fail because vkAcquireNextImage did not return a valid image.

A app may use a timeout of 0 and not get a VK_TIMEOUT because it did other activities before the call to vkAcquireNextImage such that a timeout does not occur when it is called from the app. gfxrecon-replay doesn't have this delay in calling vkAcquireNextImage, so when it is called, it times out. (Yes, one could argue that this is a bug in the app, but the reality is apps will do such things and it seems that gfxrecon-replay should try to handle that situation.)

Another scenario might be where an app calls vkAcquireNextImage with a timeout of 0, and if the call returns VK_TIMEOUT because an image is not yet available, it might try again. If the first call does not timeout when the trace is captured, only one vkAcquireNextImage call is captured and on replay it is expected that the return value will be success.

The one second timeout was chosen because it is is a very long time for a GPU/CPU, but should the user of gfxrecon-replay encounter a timeout in one of these api calls, a delay of one second before the error is reported is a very reasonable amount of time to see the error. An alternative might be the maximum int64 value, effectively forever, but that would hang the replay program. Would a different timeout value be preferable?

The reason the --no-retry-on-timeout is included in thei PR is just in case a user of gfxrecon-replay really does want API calls that timeout when they are replayed to return an error. For example, the retry might be disabled so as to detect a difference in timing in the Vulkan implementation in the driver/GPU.

The implementation of OverrideGetFenceStatus definitely includes a busy wait, consuming CPU cycles. Yes it might make sense to change it to a call to vkWaitForFences. Could you you submit an issue on it or possibly a pull request?

@@ -806,6 +807,10 @@ optional arguments:
setup for replay. The default without this option is to use a Virtual Swapchain
of images which match the swapchain in effect at capture time and which are
copied to the underlying swapchain of the implementation being replayed on.
--no-retry-on-timeout By default, Vulkan api calls that have a timeout parameter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be a good idea to rename this to something like --honor-captured-timeouts so it's clear we're not trying to change the synchronization behavior.

@@ -189,7 +189,7 @@ Usage:
[-m <mode> | --memory-translation <mode>]
[--fw <width,height> | --force-windowed <width,height>]
[--log-level <level>] [--log-file <file>] [--log-debugview]
[--api <api>] <file>
[--api <api>] [--no-retry-on-timeout] <file>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR implement any changes in timeouts for D3D12?

@@ -5174,6 +5178,22 @@ VkResult VulkanReplayConsumerBase::OverrideAcquireNextImageKHR(PFN_vkAcquireNext
result = swapchain_->AcquireNextImageKHR(
func, device_info, swapchain_info, timeout, semaphore_info, fence_info, captured_index, replay_index);

// Retry if replay result is VK_TIMEOUT
if (result == VK_TIMEOUT && !options_.no_retry_on_timeout && original_result == VK_SUCCESS &&
timeout < kNanosPerSecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why also timeout < kNanosPerSecond in this if?

@davidlunarg
Copy link
Contributor Author

Will abandon this pull request. It will be split in two replaced by two new pull requests - one that addresses timeouts in vkAcquireNextImageKHR/vkAcquireNextImageKHR2, and another that addresses timeouts in the remaining api calls that specify a timeout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants