Enable use of the rebar feature to upload buffers to the device. #9251
Conversation
Benchmark with Mistral-Nemo-Instruct-2407.Q5_K.gguf
@mtavenrath I'm sorry that it took me so long to get to the review.
Thank you for the contribution, this makes a significant difference in specific cases. Looks good to me.
@0cc4m No worries, this was just the prelude to my current work with Win32 io rings to get very fast load times. On Windows I've been able to read ~45 GB/s from a 4x NVMe Win32 LVM RAID into CPU memory. The open question is whether reading into GPU memory exposed through rebar achieves similar read performance. If so, the next question is how we can expose CPU pointers of the Vulkan rebar tensors to llama.cpp. If that is not possible, the follow-up question is whether ggml should support file I/O itself to hit the fastest possible path. A rough sketch of the idea follows below.
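A minimal, hypothetical sketch of that idea, not code from this PR: if the Vulkan backend exposed the mapped CPU pointer of a rebar tensor buffer, the model loader could read file data straight into VRAM. `pread()` stands in here for the io_uring / Win32 IoRing path mentioned above, and `rebar_ptr` is an assumed, already-mapped pointer.

```cpp
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Read `size` bytes at `file_offset` from `fd` directly into a mapped
// ReBAR pointer, avoiding the intermediate copy into CPU memory.
static bool read_into_rebar(int fd, off_t file_offset, void * rebar_ptr, size_t size) {
    uint8_t * dst  = static_cast<uint8_t *>(rebar_ptr);
    size_t    done = 0;
    while (done < size) {
        ssize_t n = pread(fd, dst + done, size - done, file_offset + (off_t) done);
        if (n <= 0) {
            return false;  // error or unexpected EOF
        }
        done += (size_t) n;
    }
    return true;
}
```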
You could probably make the buffer type a host buffer (return
Instead of copying host -> host staging -> device, one can use the rebar feature to copy host -> device directly, skipping the staging latency and the second memcpy, which triples memory bandwidth consumption.
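A minimal sketch of the mechanism (not the PR's actual code): with ReBAR/SAM enabled, the driver exposes a memory type that is both `DEVICE_LOCAL` and `HOST_VISIBLE`, so a buffer allocated from it can be mapped and written by the CPU directly, with no staging buffer and no `vkCmdCopyBuffer`. Function names and usage flags here are illustrative assumptions.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstring>

// Find a memory type that is both device-local and host-visible.
// Returns UINT32_MAX if none exists (ReBAR not available) -> use the staging path.
static uint32_t find_rebar_memory_type(VkPhysicalDevice phys_dev, uint32_t type_bits) {
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys_dev, &props);
    const VkMemoryPropertyFlags wanted =
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
        VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if ((type_bits & (1u << i)) &&
            (props.memoryTypes[i].propertyFlags & wanted) == wanted) {
            return i;
        }
    }
    return UINT32_MAX;
}

// Upload `src` directly into a ReBAR-visible device buffer: one host -> device
// write through the BAR instead of host -> staging -> device.
static bool upload_via_rebar(VkDevice device, VkPhysicalDevice phys_dev,
                             const void * src, VkDeviceSize size,
                             VkBuffer * out_buf, VkDeviceMemory * out_mem) {
    VkBufferCreateInfo buf_info = {};
    buf_info.sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    buf_info.size        = size;
    buf_info.usage       = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
    buf_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
    if (vkCreateBuffer(device, &buf_info, nullptr, out_buf) != VK_SUCCESS) {
        return false;
    }

    VkMemoryRequirements req;
    vkGetBufferMemoryRequirements(device, *out_buf, &req);

    const uint32_t type = find_rebar_memory_type(phys_dev, req.memoryTypeBits);
    if (type == UINT32_MAX) {
        vkDestroyBuffer(device, *out_buf, nullptr);
        return false;  // caller falls back to the staging-buffer path
    }

    VkMemoryAllocateInfo alloc = {};
    alloc.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc.allocationSize  = req.size;
    alloc.memoryTypeIndex = type;
    if (vkAllocateMemory(device, &alloc, nullptr, out_mem) != VK_SUCCESS) {
        vkDestroyBuffer(device, *out_buf, nullptr);
        return false;
    }
    vkBindBufferMemory(device, *out_buf, *out_mem, 0);

    // Direct write into VRAM through the resizable BAR.
    void * mapped = nullptr;
    vkMapMemory(device, *out_mem, 0, size, 0, &mapped);
    memcpy(mapped, src, size);
    vkUnmapMemory(device, *out_mem);
    return true;
}
```

The CPU write still crosses PCIe once, but the extra system-memory read/write of the staging copy disappears, which is where the bandwidth saving in the benchmark above comes from.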