FX: support using FP16 #1049

Blinue · 2025-01-02T14:41:29Z

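Effects can declare support for half-precision floating point with //!USE FP16. When the conditions are met, the following changes apply:

MP_FP16 is defined.
The MF family of macros is defined as the min16float family, e.g. MF4 is min16float4 and MF3x3 is min16float3x3. When FP16 is not in use, these macros are defined as the corresponding float types.
Eligible textures are declared with min16float types. For example, for the R16G16B16A16_FLOAT format the input definition becomes Texture2D<min16float4> and the output definition becomes RWTexture2D<min16float4>; for the R16G16_UNORM format the input definition becomes Texture2D<min16float2> and the output definition becomes RWTexture2D<unorm min16float2>. Formats that contain 32-bit floats still use float types.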

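Even if an effect declares FP16 support, FP16 is not necessarily used. There are two exceptions: the GPU does not support FP16, or FP16 has been disabled in the developer options.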

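A new built-in function, MulAdd, has been added. It is equivalent to a matrix multiplication followed by a vector addition, and it lets us switch flexibly between dp4 and mad. Most machine-learning-based effects currently rely heavily on dp4; in my tests, switching to mad gives a considerable performance gain. With FP16, mad gets faster still, whereas dp4 actually gets slower.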
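To make the dp4/mad distinction concrete, here is a minimal reference model in Python, not the HLSL implementation. It assumes a row vector `x`, a matrix given as rows `m`, and a bias `b`; that argument order and the function names are assumptions for illustration only. Both formulations compute the same result and differ only in how the work is grouped.

```python
# Reference model of a MulAdd-style operation: result = x * M + b.
# The argument order (row vector x, matrix rows m, bias b) is an assumption
# made for this sketch, not the signature used by Magpie.

def mul_add_dp4(x, m, b):
    """dp4 style: each output component is one dot product against a column of m."""
    cols = range(len(b))
    return [sum(x[i] * m[i][j] for i in range(len(x))) + b[j] for j in cols]


def mul_add_mad(x, m, b):
    """mad style: scale each row of m by one component of x and accumulate."""
    acc = list(b)
    for xi, row in zip(x, m):
        # One vector multiply-add per input component: acc = xi * row + acc
        acc = [xi * r + a for r, a in zip(row, acc)]
    return acc


x = [1.0, 2.0, 3.0, 4.0]
m = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
]
b = [0.5, 0.5, 0.5, 0.5]

print(mul_add_dp4(x, m, b))
print(mul_add_mad(x, m, b))
```

In the mad form every step is a whole-vector multiply-add, which is the shape that can benefit when two fp16 lanes are processed per instruction.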

All suitable effects will be adapted to FP16 and MulAdd. Performance comparison:

Effect	Current	This PR	Uses FP16	Improvement
Jinc	0.205ms	0.201ms	No	+2%
Anime4K_3D_(AA_)Upscale_US	0.132ms	0.131ms	Yes	+0.1%
Anime4K_Restore_(Soft_)S	0.161ms	0.154ms	Yes	+4.3%
Anime4K_Restore_(Soft_)M	0.351ms	0.344ms	Yes	+2%
Anime4K_Restore_(Soft_)L	0.559ms	0.433ms	Yes	+22.5%
Anime4K_Restore_(Soft_)VL	1.22ms	0.862ms	Yes	+29.3%
Anime4K_Restore_(Soft_)UL	2.82ms	1.83ms	Yes	+35.1%
Anime4K_Upscale_(Denoise_)S	0.144ms	0.113ms	Yes	+21.5%
Anime4K_Upscale_(Denoise_)L	0.549ms	0.432ms	Yes	+21.3%
Anime4K_Upscale_(Denoise_)VL	1.38ms	1.08ms	Yes	+21.7%
Anime4K_Upscale_(Denoise_)UL	2.44ms	1.92ms	Yes	+21.3%
Anime4K_Upscale_GAN_x2_S	0.82ms	0.689ms	Yes	+16%
Anime4K_Upscale_GAN_x2_M	1.73ms	1.31ms	Yes	+24.3%
Anime4K_Upscale_GAN_x3_L	4.7ms	3.3ms	Yes	+29.8%
CAS	0.016ms	0.015ms	Yes	+6.3%
CuNNy-2x4C-NVL(-DN)	0.167ms	0.132ms	Yes	+21%
CuNNy-3x4C-NVL(-DN)	0.213ms	0.166ms	Yes	+22.1%
CuNNy-4x4C-NVL(-DN)	0.259ms	0.202ms	Yes	+22%
CuNNy-8x4C-NVL(-DN)	0.443ms	0.336ms	Yes	+24.2%
CuNNy-4x8C-NVL(-DN)	0.744ms	0.5ms	Yes	+32.8%
CuNNy-6x8C-NVL(-DN)	1.19ms	0.702ms	Yes	+41%
CuNNy-8x8C-NVL(-DN)	1.52ms	0.907ms	Yes	+40.3%
CuNNy-4x16C-NVL(-DN)	4.8ms	1.8ms	Yes	+62.5%
CuNNy-8x16C-NVL(-DN)	8.6ms	3.35ms	Yes	+61%
CuNNy-16x16C-NVL(-DN)	15.7ms	6.4ms	Yes	+59.2%
FSRCNNX(_LineArt)	0.403ms	0.363ms	Yes	+9.9%
ACNet	0.612ms	0.514ms	Yes	+16%
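The PR does not state how the improvement column is computed, but most rows are consistent with the relative reduction in render time, (before - after) / before. A small sketch under that assumption (the function name is hypothetical):

```python
def improvement_percent(before_ms, after_ms):
    """Relative reduction in render time, in percent (negative means slower)."""
    return (before_ms - after_ms) / before_ms * 100.0


# Anime4K_Restore_(Soft_)L and CuNNy-4x16C-NVL(-DN) from the table above.
print(round(improvement_percent(0.559, 0.433), 1))
print(round(improvement_percent(4.8, 1.8), 1))
```

Under this definition a value above 50% means the render time was more than halved.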

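Other changes:

Added a developer option, performance test mode. When enabled, rendering runs continuously without waiting, which is useful for measuring effect performance.
wil::CreateDirectoryDeepNoThrow is no longer used because it does not support relative paths; use Win32Helper::CreateDir instead.
Inline constants are now implemented as global read-only variables to avoid name conflicts caused by macro definitions, as in Fix effect shader compile error #678.
Introduced rapidhash and removed the existing wyhash implementation. This invalidates existing caches, but it is also a good opportunity to clean up technical debt.
Improved the effect cache logic so that a hash collision no longer causes the wrong cache entry to be read; cache file names have changed.
Effects can include StubDefs.hlsli in the //!MAGPIE EFFECT block to reduce errors shown in the IDE; this does not affect the compiled result.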
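The collision safeguard in the cache change can be sketched roughly as follows. This is an illustration in Python, not Magpie's C++ code: the `EffectCache` class and its methods are hypothetical, and `hashlib.blake2b` stands in for rapidhash. The idea is to store the source next to the compiled result and compare it on load, so that two sources with the same hash cannot be confused.

```python
import hashlib


def source_hash(source):
    """Stand-in hash for this sketch; the PR itself uses rapidhash."""
    return hashlib.blake2b(source.encode("utf-8"), digest_size=8).hexdigest()


class EffectCache:
    """Hypothetical in-memory cache keyed by a hash of the effect source."""

    def __init__(self, hash_fn=source_hash):
        self._hash_fn = hash_fn
        self._entries = {}

    def store(self, source, compiled):
        # Keep the source alongside the compiled blob so it can be verified.
        self._entries[self._hash_fn(source)] = (source, compiled)

    def load(self, source):
        entry = self._entries.get(self._hash_fn(source))
        if entry is None:
            return None
        cached_source, compiled = entry
        # A matching hash is not enough: reject the entry on a collision.
        if cached_source != source:
            return None
        return compiled


# Force every source to collide to show that the check rejects a wrong hit.
cache = EffectCache(hash_fn=lambda source: "same-key")
cache.store("effect A", b"compiled A")
print(cache.load("effect A"))
print(cache.load("effect B"))
```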

Blinue · 2025-01-03T12:39:04Z

I tested an NVIDIA card (RTX 4070 Laptop) and an Intel card (Intel UHD) under identical conditions. Results:

Effect	FP32-N	FP16-N	Improvement	FP32-I	FP16-I	Improvement
ACNet	0.64ms	0.56ms	+12.5%	19.5ms	7.6ms	+61%
Anime4K_Upscale_L	0.55ms	0.61ms	-10.9%	53.9ms	63.3ms	-17.4%
CuNNy-6x8C-NVL	0.93ms	1.1ms	-18.3%	38.6ms	105ms	-172%
Anime4K_Upscale_Denoise_UL	2.67ms	2.85ms	-6.7%	490ms	421ms	+14.1%
Anime4K_Restore_UL	2.95ms	2.81ms	+4.7%	657ms	474ms	+27.9%
Anime4K_Restore_Soft_UL	2.95ms	2.81ms	+4.7%	657ms	470ms	+28.5%
FSRCNNX	0.486ms	0.506ms	-4.1%	14.6ms	6.2ms	+57.5%

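On the NVIDIA card only ACNet gains noticeably, and most other effects actually get slower. On the Intel card ACNet and FSRCNNX gain the most, while Anime4K_Upscale_L and CuNNy-6x8C-NVL get slower, and the swings are very large. FP16 performance evidently varies a lot between GPUs: configured correctly it can improve performance substantially, otherwise it can reduce it just as much. This is not what I expected; it seems FP16 cannot simply be enabled or disabled globally.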

Blinue · 2025-01-04T05:41:10Z

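> I chained three of them to add some load and found that toggling fp16 only makes a difference within the margin of error... both hover around 22.5xx ms.

That is because those effects had not been adapted yet. They are all adapted now.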

hooke007 · 2025-01-04T06:16:56Z

The difference still doesn't seem significant.

Blinue · 2025-01-04T06:40:55Z

Try CuNNy-16x16C-NVL; the difference is fairly clear on my machine.

FP16

FP32

hooke007 · 2025-01-04T07:20:49Z

...It got slower.
fp16 -- fp32

Blinue · 2025-01-06T02:18:41Z

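Since fp16 capability differs between GPUs, we should support enabling or disabling fp16 per effect. I can think of two approaches:

Let users enable fp16 for individual effects
Run an automatic performance test to decide whether to use fp16, similar to TensorRT

I prefer the second approach. It is complex, but it has major advantages:

The granularity can be reduced to individual passes: each pass within an effect is tested separately to decide whether to use fp16
Users don't need to test anything themselves; it works out of the box
It paves the way for TensorRT: since the mechanism is similar, the code path can be shared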
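A rough sketch of what the second approach could look like, written in Python for brevity. Nothing here exists in Magpie: `choose_precision_per_pass` and the `measure_ms` callback are hypothetical, and the timings are made-up stand-ins for real GPU measurements.

```python
from statistics import median


def choose_precision_per_pass(passes, measure_ms, runs=5):
    """Return {pass_name: use_fp16} by timing each pass in both precisions.

    measure_ms(pass_name, use_fp16) is a hypothetical callback that renders
    the pass once and returns the GPU time in milliseconds.
    """
    decisions = {}
    for name in passes:
        fp32 = median(measure_ms(name, False) for _ in range(runs))
        fp16 = median(measure_ms(name, True) for _ in range(runs))
        # Only switch to fp16 when it is actually faster on this GPU/driver.
        decisions[name] = fp16 < fp32
    return decisions


# Fake timings standing in for real GPU measurements.
fake_timings = {
    ("pass1", False): 1.00, ("pass1", True): 0.60,
    ("pass2", False): 0.80, ("pass2", True): 0.95,
}
result = choose_precision_per_pass(
    ["pass1", "pass2"], lambda name, fp16: fake_timings[(name, fp16)]
)
print(result)
```

Using the median of several runs is one simple way to damp clock-frequency noise; a real implementation would also need to cache the decision per GPU and driver version.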

hooke007 · 2025-01-06T02:38:26Z

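After a quick search, it does seem the difference amounts to little more than measurement noise and the effect of GPU clock variation under load.

On NVIDIA consumer cards fp16 and fp32 appear to run at the same rate. The fp16 tensor cores added in the last two generations are not the same thing as plain fp16.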
https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix

Blinue · 2025-01-06T03:02:40Z

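In terms of raw compute speed, fp16 and fp32 are the same. The main advantage of fp16 is that the driver can pack two fp16 values into a single 32-bit VGPR.

If the driver supports it, one instruction can operate on two fp16 values at once, which effectively halves the time
The number of VGPRs used is halved; using too many VGPRs hurts concurrency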
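The packing idea can be illustrated on the CPU with Python's `struct` module, whose `e` format is an IEEE 754 half-precision float. This only shows the bit layout and the register arithmetic; how a given driver actually allocates VGPRs is not something this sketch can show, and the helper names are hypothetical.

```python
import struct


def pack_half2(a, b):
    """Pack two IEEE 754 half-precision values into one 32-bit word."""
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]


def unpack_half2(word):
    """Recover the two half-precision values from a 32-bit word."""
    return struct.unpack("<2e", struct.pack("<I", word))


def registers_needed(value_count, use_fp16):
    """32-bit registers needed for value_count scalars, assuming ideal packing."""
    return (value_count + 1) // 2 if use_fp16 else value_count


word = pack_half2(1.5, -2.0)
print(hex(word), unpack_half2(word))
print(registers_needed(16, False), registers_needed(16, True))
```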

Blinue · 2025-01-08T13:49:32Z

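@soi8391 These are the optimized CuNNy effects. Performance improved significantly on my machine; could you test them? When testing, turn on "performance test mode" in the developer options.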

CuNNy.zip

soi8391 · 2025-01-08T14:48:40Z

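@Blinue Thanks. I quickly picked a visual novel and tested with Magpie-dev-13b0ecb-x64, with performance test mode enabled.

Rough observations:
New CuNNy-8x32 from the page above: 60ms, 16FPS
New CuNNy-4x32 from the page above: 33ms, 30FPS

Old CuNNy-8x32 from the author's page: 57ms, 18FPS
Old CuNNy-4x32 from the author's page: 30ms, 32FPS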

Blinue · 2025-01-09T01:02:02Z

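@soi8391 Thanks for testing. How do the optimized CuNNy effects perform after disabling FP16 in the developer options? Please use the same game and the same window size when testing.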

soi8391 · 2025-01-09T02:22:46Z

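@Blinue Thanks. From rough observation, render time and FPS are essentially identical.

"Disable FP16" unchecked
New CuNNy-8x32: 60ms, 16FPS
New CuNNy-4x32: 33ms, 30FPS

"Disable FP16" checked
New CuNNy-8x32: 61ms, 16FPS
New CuNNy-4x32: 33ms, 30FPS

Tested with the same visual novel at the same size, using Magpie-dev-57d0962-x64, with performance test mode enabled.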

Blinue · 2025-01-09T02:44:13Z

@soi8391 Please provide the log.

soi8391 · 2025-01-09T03:07:22Z

@Blinue Please take a look.
magpie.log

Blinue · 2025-01-09T04:45:55Z

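Summary of the Intel Arc A750 test results:

Instruction \ Precision	fp32	fp16
dp4		57ms
mad	61ms	60ms

mad is slightly slower than dp4, and fp16 brings little benefit. The Intel Arc A750 was released in 2022 and performs on par with a 2070.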

soi8391 · 2025-01-10T10:16:56Z

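@Blinue Sorry, I reinstalled an older Intel Arc driver (32.0.101.5972) and tried again.

Tested with the same visual novel at the same size, using Magpie-dev-57d0962-x64, with performance test mode enabled.

New results, screenshots, and log below.

"Disable FP16" unchecked
New CuNNy-8x32: 70FPS, 13ms

Old CuNNy-8x32: 31FPS, 32ms

"Disable FP16" checked
New CuNNy-8x32: 16FPS, 59ms

magpie.log

Note: the earlier results were obtained with Intel Arc driver 32.0.101.6449. Both drivers were installed fresh after removing the previous one with DDU. I'm not sure whether this is an installation problem or a driver problem.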

plainround · 2025-01-11T09:33:41Z

@soi8391 Hi, could you share the build of 57d0962? I have an A750 and can test it.

Blinue · 2025-01-11T10:10:59Z

@plainround Download it here: https://github.com/Blinue/Magpie/actions/runs/12688326756/artifacts/2406480622

plainround · 2025-01-11T17:55:51Z

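@soi8391 My results match yours: 5972 gives a big performance gain, and 6319 is basically the same as 6449. How did you find out about this version's performance advantage?
@Blinue
fp16 disabled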

5972_checkBanFp16_nativeCunny

5972_checkBanFp16_newCunny
——————————
fp16 not disabled

5972_uncheckBanFp16_nativeCunny

5972_uncheckBanFp16_newCunny

fp16 disabled

6319_checkBanFp16_nativeCunny

6319_checkBanFp16_newCunny
——————————
fp16 not disabled

6319_uncheckBanFp16_nativeCunny

6319_uncheckBanFp16_newCunny

magpie.log
The new CuNNy is impressive.

soi8391 · 2025-01-12T04:24:30Z

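@plainround Actually, I didn't expect such a large performance gap either. The only reason I tested with 5972 is that, while playing other games, I noticed FPS on 6449 seemed less stable than on 5972, the driver I had used before, so I decided to roll back and took the opportunity to retest.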

plainround · 2025-01-12T07:19:39Z

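@soi8391 I also find games more stable on the older version.
I'm going to file an issue with Intel about this Magpie performance problem. If they can't fix it, I'll stay on 5972 until I replace the card 🤪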

Blinue · 2025-01-12T08:55:34Z

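Summary of @plainround's test results:

Driver version 32.0.101.5972:

Instruction \ Precision	fp32	fp16
dp4		30.3ms
mad	55.6ms	12.4ms

Driver version 32.0.101.6319:

Instruction \ Precision	fp32	fp16
dp4		49.9ms
mad	52.6ms	52.7ms

It looks like fp16 performance dropped sharply in version 6319.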

Blinue and others added 10 commits January 1, 2025 17:00

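feat: use half-precision floats automatically unless disabled in the developer options

c3f5a5f

feat: add a mode for testing effect performance that renders continuously without waiting

0b54ca1

chore: avoid different configurations sharing the same shader header files

c6ef833

fix: stop using wil::CreateDirectoryDeepNoThrow because it does not support relative paths

c102cd5

feat: implement inline constants as global read-only variables

ca68838

This avoids name conflicts caused by macro definitions, such as #678

feat: introduce rapidhash and stop using wyhash

1424ddc

This invalidates the effect cache

feat: improve the cache system

2a45fda

Loading a cache now checks that the source matches; cache file names changed

ui: improve the developer options UI

fea02ca

perf: avoid copies

9848df4

feat: use the USE_FP16 directive to declare that an effect supports FP16

2de6df2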

Blinue added enhancement New feature or request area: performance area: effect labels Jan 2, 2025

Blinue and others added 4 commits January 2, 2025 23:27

fix: minor fixes

4df6dee

chore: reword

9aade85

Merge branch 'dev' into feat/fp16

8e49d36

feat: add FP16 support to several effects, though the performance change fell short of expectations

e1ccbb5

This comment was marked as outdated.

feat: adapt several effects for testing

181e4f9

CuNNy-D16N16

5f65096

Merge branch 'dev' into feat/fp16

9fd81ca

ACNet: switch from mad to dp4

0f6489c

Blinue mentioned this pull request Jan 8, 2025

Noticeable latency and low FPS when using Magpie's bundled CuNNy #1053

Closed

Merge branch 'dev' into feat/fp16

57d0962

perf: optimize CAS

87d5942

plainround mentioned this pull request Jan 12, 2025

The new driver performs poorly in Magpie IGCIT/Intel-GPU-Community-Issue-Tracker-IGCIT#942

Open

10 tasks

Blinue and others added 12 commits January 13, 2025 17:08

perf: optimize FSRCNNX

7ca07db

Merge branch 'dev' into feat/fp16

5c767e3

perf: optimize more effects

b16ffb1

perf: optimize more effects

3fba18c

perf: optimize more effects

5715d19

perf: optimize more effects

09ddff9

perf: optimize more effects

1d99f5f

Merge branch 'dev' into feat/fp16

9742f24

fix: correct string resources

b5f3ae6

Merge branch 'dev' into feat/fp16

a1eb9a1

fix: correct string resources

c271fb0

Merge branch 'dev' into feat/fp16

9d88503
