Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selectively disable direct-scanout to reduce bandwidth #969

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Drakulix
Copy link
Member

So this is an attempt for a problem, that has been bugging me for quite some time, but I am not convinced this is a good implementation...

The problem

So we have run into a few issues with amdgpu cards, where certain modes were not selectable, e.g.:

The underlying problem, as I was able to confirm with drm_info logs and talking to some AMD engineers seems to be, that we are using overlay planes. Or rather, that overlay planes contribute to bandwidth usage, making new configurations with higher requirements (faster / more high-res mode) potentially impossible without disabling some planes. Unfortunately the drm-api currently has no way to communicate why a commit failed, so we just reject the new mode.

Solution 1: So the initial idea would be to disable planes on a given crtc before attempting to modeset. compositor.clear() seems to be an obvious solution here, given we modeset anyway.

Unfortunately that doesn't work. Roaming planes (e.g. planes that can be attached to multiple crtcs) exist and given this runs concurrently other DrmCompositors might be able to grab the just released planes before we are done with modesetting. Fun. Also this problem can apply to enabling completely new outputs.

Solution 2: So we need to synchronize this for the whole device?

Almost works, though compositor.clear is pretty ugly here, as we would be blanking all surfaces for but I have also seen another variant of this issue, where drm leasing fails, because we are using too much bandwidth. So now we suddenly need to keep resource usage low, while another process is involved, because we have no (good) way to asses how much bandwidth that adds at all. At this point we just want to disable overlay plane usage (though just disabling direct-scanout for now is roughly equivalent) for both the modeset (to avoid flicker) and while a lease is active.

Solution 3: This PR.

Solution

  • b538289 - Add a way to ask the thread to disable direct scanout and notify us, once the next (forced) frame returned, so we can be sure the planes are actually not in use. I don't like the complexity this commit introduces.
  • 458f668 - Disable direct scanout on all surfaces of a device before applying a new output configuration, re-enable afterwards - if applicable
  • a46be89 - Disable direct scanout while a lease is active

Alternatives

Perhaps we should instead synchronize by terminating the surface threads, doing everything on the main thread and restarting the threads. We clearly need more access to the DrmCompositors state and that is tricky. So instead maybe we should just add a way to destruct a DrmCompositor into it's DrmSurface and make that Send. That way we should hopefully be able to still avoid flickering, though we might need to still somehow obtain a fully-composited rendered image to not break the visuals for a frame while disabling planes.

This would also help us solve another issue in the future, that wlroots recently tackled:

Unfortunately it looks like some gpus are so resource constraint, that we need more logic to even enable all possible displays without any planes in use. And we need device-commits to make this work at all, which complicates synchronizing even more and makes shutting down the threads temporarily even more appealing.

That is a much more significant rework of both smithay code and cosmic-comp code though.

cc @ids1024 for general comments on both the problem and code. @cmeissl for ideas on the problem and the smithay side of things.

Draft while design isn't worked out and testing so far as only been done to verify, that this doesn't break anything, not to verify, that this actually fixes these bugs. (Lacking hardware that reproduces this.)

@cmeissl
Copy link
Contributor

cmeissl commented Oct 31, 2024

Okay, I am also procrastinating on exactly this topic for the last few weeks...so this is a quick brain dump:

General

What I have come up with so far (partially some PoC impls, but mostly just ideas):

  • Adding a mode parameter to DrmCompositor::render_frame allowing to force composition
  • Adding a DrmCompositor::commit_frame function allowing to make a blocking commit to make sure a specific state is reached.
  • Adding a DrmCompositor::with_format function in addition to the existing DrmCompositor::new allowing to force a format with a specific set of modifiers.
  • Adding a DrmCompositor::set_format function allowing to change the swapchain format.
  • Adding a function on DrmDevice to test a combination of CRTC with certain primary plane formats
  • Adding a DrmDeviceManager which internally manages all DrmCompositor with something like a add_crtc function.

DrmDeviceManager

Using an RwLock holding a list of current active CRTC/DrmCompositor combinations. add_crtc would return a token to the CRTC allowing access to the DrmCompositor which would take a read lock allowing concurrent access.
add_crtc could evaluate the format(s) with DrmDevice::test_format and take a write lock changing formats "atomically" on all crtc.

Before adding a crtc we might need to make sure there is no direct scan-out in place. This could be done by clear or by re-rendering all crtc under the write lock + DrmCompositor::commit_frame. So changing format might require rendering twice, once with the existing format but with mode Composition and again with the changed format. After that each thread could continue as usual including assigning overlay planes.

Other

There is still an issue with doing direct scan-out on the primary plane and a second CRTC. It could (actually I have seen this happen...) happen that the direct scan-out on CRTC 1 reduces the overall bandwidth requirement allowing CRTC 2 to use an additional overlay plane. But going back to composition on CRTC 1 might then fail...
I have thought about that extensively and have come up with two possible solutions:

  • Always include all other CRTC with their format of the pimary plane during atomic testing
  • Allow to limit which format is allowed for direct scan-out on the primary plane (probably just using the same format/modifier as the swapchain has)

The first option might fail in funny ways and the second sounds way easier. The second one could be implemented with something similar to the Capabilities we have in the gles renderer.
(Could enable/disable that depending on count of active CRTC in DrmDeviceManager)

@cmeissl
Copy link
Contributor

cmeissl commented Oct 31, 2024

Some kind of use_mode wrapper in the device token could use a similar approach. First optimistic by just calling use_mode on the compositor and when that fails take a write lock and retry after making sure all crtc do not use direct scan-out. Issues is more or less the same, we might have to redraw all crtc (or at least commit some buffer)

@Drakulix
Copy link
Member Author

Drakulix commented Nov 1, 2024

What I have come up with so far (partially some PoC impls, but mostly just ideas):

+1 for extending the DrmCompositor interface, making it more suitable for the complex reality of these issues.

* Adding a `mode` parameter to `DrmCompositor::render_frame` allowing to force composition

+1

* Adding a `DrmCompositor::commit_frame` function allowing to make a blocking commit to make sure a specific state is reached.

+1

* Adding a `DrmCompositor::with_format` function in addition to the existing `DrmCompositor::new` allowing to force a format with a specific set of modifiers.

So this is to aid a similar approach to wlroots swapchain helper by forcing a specific modifier set?

* Adding a `DrmCompositor::set_format` function allowing to change the swapchain format.

When would this be used? To fall back to 8-bit for more bandwidth?

* Adding a function on `DrmDevice` to test a combination of CRTC with certain primary plane formats

Not sure I get the picture here, is this to build the device-wide test commit?

DrmDeviceManager

Using an RwLock holding a list of current active CRTC/DrmCompositor combinations. add_crtc would return a token to the CRTC allowing access to the DrmCompositor which would take a read lock allowing concurrent access. add_crtc could evaluate the format(s) with DrmDevice::test_format and take a write lock changing formats "atomically" on all crtc.

+1 for the approach of a sharable (assuming DrmOutputManager would be Send+Sync) manager.
Not sure we actually need an RwLock though, we don't need access on more than one thread in parallel, so a simple Mutex might be enough?

Before adding a crtc we might need to make sure there is no direct scan-out in place. This could be done by clear or by re-rendering all crtc under the write lock + DrmCompositor::commit_frame. So changing format might require rendering twice, once with the existing format but with mode Composition and again with the changed format. After that each thread could continue as usual including assigning overlay planes.

+1

There is still an issue with doing direct scan-out on the primary plane and a second CRTC. It could (actually I have seen this happen...) happen that the direct scan-out on CRTC 1 reduces the overall bandwidth requirement allowing CRTC 2 to use an additional overlay plane. But going back to composition on CRTC 1 might then fail... I have thought about that extensively and have come up with two possible solutions:

Oh yeah, I forgot about this one...

* Always include all other CRTC with their format of the pimary plane during atomic testing

This could then fail scanout later, we essentially would need to make two tests, right? I like the idea conceptionally, but I also fear this adds to much complexity.

* Allow to limit which format is allowed for direct scan-out on the primary plane (probably just using the same format/modifier as the swapchain has)

I feel like this might potentially be too limiting, I don't think there is any guarantee of gbm selecting the same modifier as a client using vulkan/egl might, so we might leave a bunch of performance on the table here...

I guess we could limit the format, but not the modifier and if fallback fails re-create the swapchain with the same modifier as the previous scan-out modifier? I wonder if that could have weird performance effects, but it shouldn't be too different from selecting modifiers based on bandwidth for enabling more outputs as wlroots is doing with the helper.

@cmeissl
Copy link
Contributor

cmeissl commented Nov 1, 2024

What I have come up with so far (partially some PoC impls, but mostly just ideas):

+1 for extending the DrmCompositor interface, making it more suitable for the complex reality of these issues.

* Adding a `mode` parameter to `DrmCompositor::render_frame` allowing to force composition

+1

* Adding a `DrmCompositor::commit_frame` function allowing to make a blocking commit to make sure a specific state is reached.

+1

* Adding a `DrmCompositor::with_format` function in addition to the existing `DrmCompositor::new` allowing to force a format with a specific set of modifiers.

So this is to aid a similar approach to wlroots swapchain helper by forcing a specific modifier set?

Yeah, in the first version also probably falling back to Invalid if we fail to initialize a CRTC after making sure
we do not use overlay planes.

* Adding a `DrmCompositor::set_format` function allowing to change the swapchain format.

When would this be used? To fall back to 8-bit for more bandwidth?

Basically doing the same as DrmCompositor::with_format, but just for already initialized ones. I also thought
about destructing existing ones into their DrmSurface. But after thinking about it I do not see any benefit.
The destruction will make it harder to use because it requires introducing some Options here and there and
it also will drop the internal fb caches. I also thought about adding a builder which could transfer some of the
state with something like DrmCompositor::builder() and DrmCompositor::into_builder(self), but in the end I don't
think it is worth atm.

* Adding a function on `DrmDevice` to test a combination of CRTC with certain primary plane formats

Not sure I get the picture here, is this to build the device-wide test commit?

Yeah, we need to test all CRTC at once which can be a bit tricky without accessing internal state of DrmDevice.
So I thought about adding a function that can exactly do that, issue an atomic test commit enabling all provided
CRTC/Encoders with the specified format/modifier per CRTC and disabling all planes.

DrmDeviceManager

Using an RwLock holding a list of current active CRTC/DrmCompositor combinations. add_crtc would return a token to the CRTC allowing access to the DrmCompositor which would take a read lock allowing concurrent access. add_crtc could evaluate the format(s) with DrmDevice::test_format and take a write lock changing formats "atomically" on all crtc.

+1 for the approach of a sharable (assuming DrmOutputManager would be Send+Sync) manager. Not sure we actually need an RwLock though, we don't need access on more than one thread in parallel, so a simple Mutex might be enough?

Not exactly, the idea was to have it once and only a single place which is responsible for enabling CRTC. Adding a crtc would return an DrmOutput which would be Send but not Sync. We can use some tricks here to mitigate needed locks, especially for single threaded users.

In my Poc DrmOutputManager internally holds this:

type CompositorList<A, F, U, G> = Arc<RwLock<HashMap<crtc::Handle, RefCell<DrmCompositor<A, F, U, G>>>>>;

and DrmOutput is basically a wrapper around the CRTC and the list.

struct DrmOutput<A, F, U, G>
where
    A: Allocator,
    F: ExportFramebuffer<<A as Allocator>::Buffer>,
    <F as ExportFramebuffer<<A as Allocator>::Buffer>>::Framebuffer: std::fmt::Debug + 'static,
    G: AsFd + 'static,
{
    crtc: crtc::Handle,
    compositor: CompositorList<A, F, U, G>,
}

Before adding a crtc we might need to make sure there is no direct scan-out in place. This could be done by clear or by re-rendering all crtc under the write lock + DrmCompositor::commit_frame. So changing format might require rendering twice, once with the existing format but with mode Composition and again with the changed format. After that each thread could continue as usual including assigning overlay planes.

+1

There is still an issue with doing direct scan-out on the primary plane and a second CRTC. It could (actually I have seen this happen...) happen that the direct scan-out on CRTC 1 reduces the overall bandwidth requirement allowing CRTC 2 to use an additional overlay plane. But going back to composition on CRTC 1 might then fail... I have thought about that extensively and have come up with two possible solutions:

Oh yeah, I forgot about this one...

* Always include all other CRTC with their format of the pimary plane during atomic testing

This could then fail scanout later, we essentially would need to make two tests, right? I like the idea conceptionally, but I also fear this adds to much complexity.

Yeah and I am also not sure this will work on all devices tbh.

* Allow to limit which format is allowed for direct scan-out on the primary plane (probably just using the same format/modifier as the swapchain has)

I feel like this might potentially be too limiting, I don't think there is any guarantee of gbm selecting the same modifier as a client using vulkan/egl might, so we might leave a bunch of performance on the table here...

I guess we could limit the format, but not the modifier and if fallback fails re-create the swapchain with the same modifier as the previous scan-out modifier? I wonder if that could have weird performance effects, but it shouldn't be too different from selecting modifiers based on bandwidth for enabling more outputs as wlroots is doing with the helper.

Not sure this will work, I have seen CRTC fail to enable just because of the modifier. Afaik the intel CCS modifier requires more bandwidth and could again cause this issue when direct scan-out uses something like Y-Tiled. But on the other hand
I also have no good idea here...maybe we can just try to disable the capability in response to failed during render_frame.

@Drakulix
Copy link
Member Author

Drakulix commented Nov 4, 2024

I feel like this might potentially be too limiting, I don't think there is any guarantee of gbm selecting the same modifier as a client using vulkan/egl might, so we might leave a bunch of performance on the table here...
I guess we could limit the format, but not the modifier and if fallback fails re-create the swapchain with the same modifier as the previous scan-out modifier? I wonder if that could have weird performance effects, but it shouldn't be too different from selecting modifiers based on bandwidth for enabling more outputs as wlroots is doing with the helper.

Not sure this will work, I have seen CRTC fail to enable just because of the modifier. Afaik the intel CCS modifier requires more bandwidth and could again cause this issue when direct scan-out uses something like Y-Tiled. But on the other hand I also have no good idea here...maybe we can just try to disable the capability in response to failed during render_frame.

I think I had a typo here in my original response, that made this somewhat hard to pass.

Lets say we let this happen and once we fall-back to composition we fail because of the modifier of our swapchain. Couldn't we re-create the swapchain with the modifier previously in use when we did direct scanout?

@Drakulix Drakulix mentioned this pull request Nov 29, 2024
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants