-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gfxrecon-replay: Implement retry if replayed api call times out #1220
Conversation
CI gfxreconstruct build queued with queue ID 19321. |
CI gfxreconstruct build # 3058 running. |
CI gfxreconstruct build # 3058 passed. |
If any of the following api calls return VK_TIMEOUT during replay, they are retried with a 1 second timeout value: vkAcquireNextImageKHR vkAcquireNextImageKHR2 vkWaitForFences vkWaitSemaphores vkWaitForPresentKHR vkAcquireProfilingLockKHR The retry can be skipped by specifying "--no-retry-on-timeout" on the command line invoking gfxrecon-replay.
61a5384
to
949b1d5
Compare
CI gfxreconstruct build queued with queue ID 19418. |
CI gfxreconstruct build # 3059 running. |
CI gfxreconstruct build # 3059 passed. |
if (original_result == VK_SUCCESS) | ||
{ | ||
// Ensure that wait for fences waits until the fences have been signaled (or error occurs) by changing the | ||
// timeout to UINT64_MAX. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our grandchildren would still have been waiting :-O.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still appears to me a better outcome than trying to continue with a broken synchronization after just one second, as this can result in silently giving bad data. Also, one second might seem like an eternity to developers working on highend GPUs in silicon, but eg for those of us who work on simulators where a single frame can take days to process, it is not. Please reconsider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I understand the motivation behind this change in general and it seems to lead to some wrong things happening. Also OverrideGetFenceStatus is still looping infinitely even with this option, which seems counterintuitive, I think. I would actually want to see the spinlock in OverrideGetFenceStatus replaced by a wait call to vkWaitForFences, as this will have less negative effect on a CPU overhead analysis.
if (original_result == VK_SUCCESS) | ||
{ | ||
// Ensure that wait for fences waits until the fences have been signaled (or error occurs) by changing the | ||
// timeout to UINT64_MAX. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still appears to me a better outcome than trying to continue with a broken synchronization after just one second, as this can result in silently giving bad data. Also, one second might seem like an eternity to developers working on highend GPUs in silicon, but eg for those of us who work on simulators where a single frame can take days to process, it is not. Please reconsider.
@@ -806,6 +807,10 @@ optional arguments: | |||
setup for replay. The default without this option is to use a Virtual Swapchain | |||
of images which match the swapchain in effect at capture time and which are | |||
copied to the underlying swapchain of the implementation being replayed on. | |||
--no-retry-on-timeout By default, Vulkan api calls that have a timeout parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the use case where you want to prioritize reproducing the exact timeout values rather than correct synchronizations of the original content?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be a good idea to rename this to something like --honor-captured-timeouts
so it's clear we're not trying to change the synchronization behavior.
@per-mathisen-arm : Responding to all your comments above in one response... This change came about because we discovered at least one game title that called vkAcquireNextImage with a timeout of 0. If an image was not yet available, gfxrecon-replay would generate an error while replaying that API call, and subsequent api calls would fail because vkAcquireNextImage did not return a valid image. A app may use a timeout of 0 and not get a VK_TIMEOUT because it did other activities before the call to vkAcquireNextImage such that a timeout does not occur when it is called from the app. gfxrecon-replay doesn't have this delay in calling vkAcquireNextImage, so when it is called, it times out. (Yes, one could argue that this is a bug in the app, but the reality is apps will do such things and it seems that gfxrecon-replay should try to handle that situation.) Another scenario might be where an app calls vkAcquireNextImage with a timeout of 0, and if the call returns VK_TIMEOUT because an image is not yet available, it might try again. If the first call does not timeout when the trace is captured, only one vkAcquireNextImage call is captured and on replay it is expected that the return value will be success. The one second timeout was chosen because it is is a very long time for a GPU/CPU, but should the user of gfxrecon-replay encounter a timeout in one of these api calls, a delay of one second before the error is reported is a very reasonable amount of time to see the error. An alternative might be the maximum int64 value, effectively forever, but that would hang the replay program. Would a different timeout value be preferable? The reason the --no-retry-on-timeout is included in thei PR is just in case a user of gfxrecon-replay really does want API calls that timeout when they are replayed to return an error. For example, the retry might be disabled so as to detect a difference in timing in the Vulkan implementation in the driver/GPU. The implementation of OverrideGetFenceStatus definitely includes a busy wait, consuming CPU cycles. Yes it might make sense to change it to a call to vkWaitForFences. Could you you submit an issue on it or possibly a pull request? |
@@ -806,6 +807,10 @@ optional arguments: | |||
setup for replay. The default without this option is to use a Virtual Swapchain | |||
of images which match the swapchain in effect at capture time and which are | |||
copied to the underlying swapchain of the implementation being replayed on. | |||
--no-retry-on-timeout By default, Vulkan api calls that have a timeout parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be a good idea to rename this to something like --honor-captured-timeouts
so it's clear we're not trying to change the synchronization behavior.
@@ -189,7 +189,7 @@ Usage: | |||
[-m <mode> | --memory-translation <mode>] | |||
[--fw <width,height> | --force-windowed <width,height>] | |||
[--log-level <level>] [--log-file <file>] [--log-debugview] | |||
[--api <api>] <file> | |||
[--api <api>] [--no-retry-on-timeout] <file> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this PR implement any changes in timeouts for D3D12?
@@ -5174,6 +5178,22 @@ VkResult VulkanReplayConsumerBase::OverrideAcquireNextImageKHR(PFN_vkAcquireNext | |||
result = swapchain_->AcquireNextImageKHR( | |||
func, device_info, swapchain_info, timeout, semaphore_info, fence_info, captured_index, replay_index); | |||
|
|||
// Retry if replay result is VK_TIMEOUT | |||
if (result == VK_TIMEOUT && !options_.no_retry_on_timeout && original_result == VK_SUCCESS && | |||
timeout < kNanosPerSecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why also timeout < kNanosPerSecond
in this if
?
Will abandon this pull request. It will be split in two replaced by two new pull requests - one that addresses timeouts in vkAcquireNextImageKHR/vkAcquireNextImageKHR2, and another that addresses timeouts in the remaining api calls that specify a timeout. |
If any of the following api calls return VK_TIMEOUT during replay, they are retried with a 1 second timeout value:
The retry can be skipped by specifying "--no-retry-on-timeout" on the command line invoking gfxrecon-replay.