Is there a reason why backend couldn't be selected at runtime? #891
Comments
Backends often need to link to a shared library that may not be available on systems without the supported hardware drivers installed. E.g., you can't run the CUDA backend on systems without the CUDA driver. In the future I would like to move the backends to dynamic libraries that can be loaded at runtime, but that's a more complex change than an if statement.
You can easily have the host-side CPU inference method behind an if statement, right? It would be really convenient to switch it out and see the performance difference. For example, I found my Vulkan implementation performs about the same as my CPU with 4 threads.
Switching backends at runtime requires building all backends in the first place, which is complicated to set up, takes a lot of time, and produces a large binary. For the same reason, PyTorch offers different packages for CUDA/CPU/ROCm. Out of the box, ggml comes with CPU + a backend of your choice.
There is nothing stopping you from building ggml with multiple backends and using all of them with …
+1 for that feature.
We select the backend at build time by selecting CUDA, Vulkan, SYCL, etc. Wouldn't it be better to build with the backends you want to support and then select the backend at runtime? It's literally just one runtime if statement, and that would make it much easier to compare the performance of the different backends.