
Why is GPU inference slow? #4

Open
budgetdevv opened this issue Sep 15, 2024 · 2 comments

@budgetdevv
Owner

Find out why. I suspect it's because I'm not using IOBinding.

@budgetdevv budgetdevv self-assigned this Sep 15, 2024
@budgetdevv
Owner Author

Apparently it is just slow on DirectML; I see full GPU utilization when running inference on CUDA.

It could still be faster, though, by using IOBinding.

@budgetdevv
Owner Author

Inference speed for the full, non-quantized model is around 6 seconds on my 8700K + RTX 3070 setup (the model fits nicely into VRAM).
