Hardware Decoding 10x faster than Software Decoding? #443

Open
softworkz opened this issue Nov 21, 2024 · 12 comments

@softworkz
Collaborator

softworkz commented Nov 21, 2024

I'm afraid not. Not even close...

                                                          PC                      Laptop
                                                          Speed  CPU   GPU (3D)   Speed  CPU   GPU (3D)
SW Decoding                                               9.05   100%  0%         3.75   100%  0%
Intel
HW Decoding                                               7.87   25%   0%         7.66   20%   0%
HW Decoding + HW Download                                 3.68   32%   80%        5.5    51%   61%
HW Decode, HW Download, Subtitles, HW Upload, HW Encode   1.95   40%   80%        1.89   48%   61%
SW Decode, Subtitles, HW Upload, HW Encode                3.54   75%   60%        2.79   100%  40%
Nvidia
HW Decoding                                               7.6    8%    79%
HW Decoding + HW Download                                 7.58   12%   90%
HW Decode, HW Download, Subtitles, HW Upload, HW Encode   1.34   25%   20%
SW Decode, Subtitles, HW Upload, HW Encode                4.65   60%   100%

Reproducing

Here's an Excel file including all the ffmpeg commands: SubtitleBurnInTests.xlsx

The test.mkv is "Samsung Dubai", which you can find on DemoLandia.net
Subtitles file: subs.zip

Assessing the Results

General

First of all, this aligns exactly with what I wrote in #439 (comment) and the subsequent posts below it.

My Laptop (Tigerlake) has similar graphics to my PC (RocketLake), that's why the results are similar. Unfortunately the older Laptop I have is too old for this. Feel free to run these tests on weaker machines. You will see somewhat different results and relations, but all of the following conclusions will generally hold true (exceptions are always possible).

Transcoding performance results often appear odd and unexpected. It is important to understand that:

  • There's always one limiting factor which prevents the processing from being faster
  • In every single case, the limiting factor can be something different
  • With those fixed-function blocks in play, things can get very tricky. For example:
    • Intel GPUs can have multiple VDBOXes and VEBOXes, and Nvidia GPUs can have multiple NVDEC "chips" (that's what they call them)
      So you may see a max speed of 7.6 for decoding a stream, but it may still be possible for the GPU to process a second stream at the same speed and without affecting the speed of the first one
    • Or when you use fixed-function processing for scaling, it can be that you add brightness/contrast adjustment and deinterlacing but the performance values don't change by a single percent - that's fixed-function processing: the feature is there, and whether you use it or not doesn't change anything. That's what I meant by "zero-cost" (I didn't mean "low" - low is not zero. Only zero is zero)
  • The consequence of the points above is that you need to avoid thinking in ratios or factors when interpreting those results. These are useless in this context, because the next results you get will often bust them (the factors you calculated from earlier results)
    It's more useful to think in terms of "faster, slower, much faster, much slower", which is a lot more relatable than factors
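
The "one limiting factor" idea in the list above can be sketched as a toy model in which concurrently running stages are capped by the slowest one. This is a deliberate simplification, and the stage names and numbers are made up for illustration; they are not measurements from the table.

```python
# Toy model: a pipeline of concurrently running stages moves at the pace of
# its slowest stage. All numbers are hypothetical multiples of realtime.

def pipeline_speed(stage_limits):
    """Return (speed, bottleneck_name) for a dict of per-stage speed limits."""
    bottleneck = min(stage_limits, key=stage_limits.get)
    return stage_limits[bottleneck], bottleneck

decode_only = {"hw_decode": 8.0}
with_download = {"hw_decode": 8.0, "hw_download": 4.0}
# A fixed-function step that is faster than the current bottleneck
# changes nothing: this is the "zero-cost" case.
with_scaling = {"hw_decode": 8.0, "hw_download": 4.0, "ff_scale": 12.0}
# A slow CPU filter becomes the new bottleneck.
with_subtitles = {"hw_decode": 8.0, "hw_download": 4.0, "sw_subtitles": 2.0}

for label, stages in [
    ("decode only", decode_only),
    ("+ download", with_download),
    ("+ fixed-function scaling", with_scaling),
    ("+ subtitle burn-in", with_subtitles),
]:
    speed, limit = pipeline_speed(stages)
    print(f"{label:26s} {speed:4.1f}x  (limited by {limit})")
```

In this model the factor between two configurations depends only on which stage happens to be the bottleneck in each, so a factor computed from one pair of runs says nothing about the next pair.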

This is why I said that you cannot reasonably talk in factors when trying to make comparisons in this area.

Observations

  • Impact of Data Transfer
    • In case of Intel we see that downloading the frame data to CPU memory after decoding hits hard: speed drops noticeably
    • In case of Nvidia, there's no change, which is probably because it has a higher bandwidth (via its 16x PCIe lanes) than the iGPU
  • SW decoding can be a better choice than HW decoding in some cases
    That's essentially what I said and you can easily see that it's true when comparing the last two lines for Intel and Nvidia
    • Unlike the previous lines, these are full transcodes with hw encoding as well, and that also means that the frames need to be hw uploaded for encoding
    • While the hw download alone didn't impact Nvidia much, you can see that it does have an impact when there's also a hw upload in play
    • In all those cases you get better results when doing sw decoding instead of hw decoding

Yet, my statements aren't based on some synthetic test results. Up until 2 or 3 years ago, we regularly received user reports about stuttering audio, where it turned out that it was caused by transcodes with subtitle burn-in. After half a year of research and testing, we made the change in the stable release to use sw decoding instead of hw decoding in those specific cases.
After this change, we have rarely seen any such report. Many are running our server on NAS devices with non-recent and non-high-end CPUs, and this change has helped to lift the transcoding speed over the critical bar (1.0x, everything below cannot play fluidly) for many users.

Final Notes

Nr 1

If it was in the context of FFmpegInteropX playback where you came to that "10x faster" impression, then you might not have considered the following:

When comparing decoding speed while switching FFmpegInteropX between hw decoding and ffmpeg sw decoding, you are not comparing "sw decoding" to "hw decoding". Instead you are actually comparing "sw decoding + hw uploading" to "hw decoding without data transfer".

Nr 2

Some things you wrote just do not sound right.

Yes, that's why I wanted to tell you about it.
There's not much point in telling things you already know 😆

But it's not about "I know something that you don't know" - it's about knowledge transfer. Since FFmpegInteropX is driving our Xbox app now, we have a natural interest in getting it even better.

@brabebhin
Collaborator

What does "speed" mean? Is it GHz? Time it takes to do something?

@softworkz
Collaborator Author

softworkz commented Nov 21, 2024

The speed is a value that ffmpeg prints in its progress output. It indicates how fast ffmpeg progresses through the file relative to realtime playback.
At 1.0x, it would take the same amount of time to process the file as the playback duration. So, everything <1.0x cannot be played smoothly without interruption.
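
As a minimal sketch of how that value can be read and interpreted, the snippet below extracts `speed=` from a progress line and applies the 1.0x rule. The sample line is an assumed example of ffmpeg's usual status format, not output captured from the tests above, and the helper names are hypothetical.

```python
import re

# Matches the "speed=7.87x" field of an ffmpeg progress line.
SPEED_RE = re.compile(r"speed=\s*([0-9]*\.?[0-9]+)x")

def parse_speed(status_line):
    """Return the speed factor as a float, or None if it is absent or N/A."""
    match = SPEED_RE.search(status_line)
    return float(match.group(1)) if match else None

def processing_time(duration_seconds, speed):
    """Wall-clock time needed to process a file of the given playback length."""
    return duration_seconds / speed

def can_play_smoothly(speed):
    """Below 1.0x the transcode cannot keep up with realtime playback."""
    return speed >= 1.0

# Assumed example of a status line (values are illustrative only).
sample = "frame= 3000 fps=189 q=-0.0 size=N/A time=00:00:50.00 bitrate=N/A speed=7.87x"
speed = parse_speed(sample)
print(speed, can_play_smoothly(speed), round(processing_time(120.0, speed), 1))
```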

@softworkz
Collaborator Author

softworkz commented Nov 21, 2024

We actually have a plugin for the server which is able to generate (depending on available gpus) and run such tests automatically, here's another set of figures for subtitle burn-in:

image

...showing that sw decoding is faster than hwdecoding+download.

@softworkz
Collaborator Author

The two videos here are showing these tests in action, including a visualization of the transcoding pipeline topology:
https://github.com/softworkz/SubtitleFilteringDemos/tree/master/TestRun1?rgh-link-date=2024-11-19T10%3A47%3A59Z

@softworkz
Collaborator Author

For convenience, all-in-one (ff binary + test files): https://1drv.ms/u/c/8a9863d7afb15f9b/Ebkn1YXBEBlMutUN5NuzO1oB7NSZNnrKyNdpJtJqKkrxhw?e=jqUq9B

Just unzip and you can run the commands in the Excel file.

@lukasf
Member

lukasf commented Nov 21, 2024

@softworkz These are interesting results. But you are totally missing the point. The discussion was about efficiency (power consumption) - not decoding speed. I never said that a HW unit can decode 10x as fast as a CPU. That would be ridiculous - why would they add such an overpowered HW unit, wasting silicon area? I said that it uses 10x less energy during playback, thus saving battery and keeping the device cool. And I gave you proof that the numbers are actually even higher for most modern codecs.

It is a rule of thumb that an ASIC can perform an algorithm about an order of magnitude more efficiently than a general-purpose CPU (of course, strongly depending on the actual algorithm and implementation details). That's why they are used so often. Modern CPUs have also started integrating ASICs for crypto algorithms, since these are becoming more and more common and would eat a considerable amount of CPU power without ASIC support. CPU manufacturers surely would not do that if it did not have a considerable effect. And crypto mining of the popular coins is mainly done on ASICs now, since they are so much more efficient, allowing much higher revenues than GPU mining.

@lukasf
Member

lukasf commented Nov 21, 2024

I wonder if it would be possible to speed up hwdownload and hwupload by parallelizing it (doing multiple downloads and uploads at the same time). Theoretically, the speed of PCIe 3.0 x16 should be high enough for download+upload of 4K frames in real time. And when running on an iGPU, things do not even have to go through PCIe. Why is it so slow then?

FFmpeg 7 does run filters in a graph in parallel, but a single hwdownload or hwupload does only process one frame at a time.
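
The bandwidth claim can be checked with a rough back-of-the-envelope calculation. The sketch below assumes 8-bit NV12 frames at 3840x2160 and 60 fps (the thread does not state the pixel format or frame rate) and uses the raw PCIe 3.0 line rate, ignoring protocol overhead. PCIe is full duplex, so download and upload each use their own direction of the link.

```python
def nv12_frame_bytes(width, height):
    """Size of one 8-bit NV12 frame: full-res luma plus half-res chroma."""
    return width * height * 3 // 2

def pcie3_bytes_per_second(lanes):
    """Raw PCIe 3.0 rate per direction: 8 GT/s per lane, 128b/130b encoding."""
    return lanes * 8e9 * (128 / 130) / 8

def link_utilization(width, height, fps, lanes=16):
    """Fraction of one link direction used when every frame crosses it once."""
    return nv12_frame_bytes(width, height) * fps / pcie3_bytes_per_second(lanes)

# Assumed workload: 4K, 8-bit NV12, 60 fps, at 1.0x realtime.
frame = nv12_frame_bytes(3840, 2160)
usage = link_utilization(3840, 2160, 60)
print(f"frame size: {frame / 1e6:.1f} MB")
print(f"per-direction traffic at 60 fps: {frame * 60 / 1e6:.0f} MB/s")
print(f"share of a PCIe 3.0 x16 direction: {usage:.1%}")
```

Under these assumptions the raw link has plenty of headroom at realtime speed, which suggests that the slowdown comes from somewhere other than the PCIe line rate itself.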

@brabebhin
Collaborator

The PCIe 3 port is not the bottleneck; even a 4090 can barely saturate it.
Some of the overhead comes from DirectX11 itself.
There are some optimizations that can be done with iGPUs, but these are only available in DirectX12 IIRC. Even so, DRAM and memory buses to the CPU are significantly slower than what a dedicated GPU can achieve.

@softworkz
Collaborator Author

softworkz commented Nov 21, 2024

I wonder if it would be possible to speed up hwdownload and hwupload by parallelizing it (doing multiple downloads and uploads at the same time). Theoretically, the speed of PCIe 3.0 x16 should be high enough for download+upload of 4K frames in real time.

There's locking for D3D11 frame access in ffmpeg. I have removed that in our ffmpeg, because ffmpeg filtering is (was) single-threaded, but that brought just a small improvement in certain cases.
Anyway, D3D11 doesn't support multi-threading (access from multiple threads yes, but the calls must be serialized AFAIR).
D3D12 supports real multi-threading.

I wonder

What I've been often wondering is why they can't just remap the memory instead of copying in case of iGPUs - it's the same memory anyway.

FFmpeg 7 does run filters in a graph in parallel, but a single hwdownload or hwupload does only process one frame at a time.

I have not worked with the code from newer versions, but running filters in parallel can only mean that multiple filters can execute in parallel. From the architecture, it's not possible to have a single filter executing in parallel.

@brabebhin
Collaborator

Remapping for iGPU is available in DirectX12.

@softworkz
Collaborator Author

Oh, and there's a doubling involved. When you upload or download a d3d texture, you get a pointer in CPU memory for accessing the data, but you don't "own" the data, so you need to copy it to or from your own memory range.

It doesn't double PCIe bandwidth, but memory bandwidth. And CPU time for copying.
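
The ownership issue can be illustrated with a pure-Python analogy. This is not the D3D11 API: `map_staging` and `download_frame` are hypothetical names, and a `bytearray` stands in for driver-owned memory that is only valid while it is mapped.

```python
from contextlib import contextmanager

@contextmanager
def map_staging(driver_buffer):
    """Hand out a temporary view of memory the caller does not own."""
    view = memoryview(driver_buffer)
    try:
        yield view
    finally:
        # After "unmapping", the view must no longer be used.
        view.release()

def download_frame(driver_buffer):
    """Copy the mapped data into caller-owned memory before unmapping."""
    with map_staging(driver_buffer) as mapped:
        # This extra pass over the frame is the "doubling": the data already
        # sits in CPU-visible memory, yet it is read and written once more.
        own_copy = bytes(mapped)
    return own_copy

staging = bytearray(b"\x10" * 16)   # stand-in for one mapped frame
frame = download_frame(staging)
staging[0] = 0                      # the driver may reuse its buffer later
print(frame[0], len(frame))         # the owned copy is unaffected
```

The copy costs memory bandwidth and CPU time on top of the transfer, which matches the point that the doubling affects memory bandwidth, not PCIe bandwidth.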

@softworkz
Collaborator Author

There's also the requirement of using array textures with D3D11, that's why it's slower than DXVA2 - or wait - I think that was just a requirement for QSV with D3D11.
