Bug: Lower performance in SYCL vs IPEX LLM. #9505
Comments
Command used to start the server in parallel mode:
Four tabs were opened and the same question was fed to the server from each. The output contained no garbage values like those emitted by ipex-llm, but it was definitely slow. This is the log from when simultaneous requests were sent from the four different tabs.
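For reference, a concurrent test of this kind could be reproduced with a small script along the lines of the sketch below. This is only an illustrative sketch, not the script used for the log above: the server address, the /completion endpoint, the prompt, and the response field names are assumptions and may need adjusting for your llama-server build.

# Illustrative sketch: send the same prompt to a local llama-server from four
# concurrent clients and time each reply. URL, endpoint, and response fields
# are assumptions; adjust them for your setup.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/completion"   # assumed server address/endpoint
PROMPT = "Explain the difference between a process and a thread."

def ask(i: int) -> str:
    payload = json.dumps({"prompt": PROMPT, "n_predict": 256}).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    elapsed = time.time() - start
    tokens = body.get("tokens_predicted", "?")   # field name may vary by version
    return f"request {i}: {elapsed:.1f} s, ~{tokens} tokens"

with ThreadPoolExecutor(max_workers=4) as pool:
    for line in pool.map(ask, range(4)):
        print(line)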
Just for the sake of curiosity, can you try building and testing with this patch, please?

diff --git a/ggml/src/ggml-sycl.cpp b/ggml/src/ggml-sycl.cpp
index acef7c6d..009911ff 100644
--- a/ggml/src/ggml-sycl.cpp
+++ b/ggml/src/ggml-sycl.cpp
@@ -3496,8 +3496,12 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
     bool use_mul_mat_vec_q = ggml_is_quantized(src0->type)
         && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32
-        && src1->ne[1] <= MMVQ_MAX_BATCH_SIZE
-        && (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda || src1->ne[1] > MMVQ_MIN_BATCH_SIZE);
+        && src1->ne[1] <= MMVQ_MAX_BATCH_SIZE;
+
+
+    if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda) {
+        use_mul_mat_vec_q = use_mul_mat_vec_q && (src1->ne[1] > MMVQ_MIN_BATCH_SIZE);
+    }

     bool use_mul_mat_q = ggml_sycl_supports_mmq(src0->type)
         && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32;
@qnixsynapse What exactly do the above changes do?
Revert a change and move that specific change to Nvidia only. Please see #9088
@qnixsynapse
@adi-lb-phoenix please check whether the lower-performance issue still exists.
There was no significant change.
Thank you for confirming. This isn't related to my issue. Edit: I saw 3 tokens/sec in your server test, so I thought this might be related.
That was when I ran a server and executed tasks through four different tabs.
Yeah, I also get that when running a server. That revert fixes it in my testing.
Can you please share the logs, the test conditions, and the model used to test? I used the model Meta-Llama-3-8B-Instruct.Q8_0.gguf.
I am using quantized models such as iq4_xs to test on my server. The master branch has no problem with fp16 or fp32 models. The PR I linked seems to cause a regression in my case.
@adi-lb-phoenix We could reach the same performance as IPEX-LLM, but we need time, because all developers work on this in their spare time. As far as I know, some developers are working on it. In the past half year, we focused on functionality and bug fixes.
Anyone who has an Intel integrated GPU should try the koboldcpp_nocuda program and choose Vulkan. The integrated GPU will work without any extra setup.
@NeoZhangJianyu Thank you for the info. This is such a great tool. Can you tag the contributors working on this, and possibly see if we can work together to improve performance?
This issue was closed because it has been inactive for 14 days since being marked as stale. |
What happened?
We expected to see similar performance from llama.cpp compared to ipex-llm, but llama.cpp was almost two times slower than ipex-llm with all parameters the same.
Result from ipex-llm:
Below is the result from llama.cpp:
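For re-measuring the gap, a small harness along these lines could be used. This is an illustrative sketch only, not how the numbers above were produced: the binary paths, the ipex-llm build location, and the exact wording of the timing line are assumptions.

# Rough comparison sketch: run the same prompt through two llama-cli builds
# and pull the "tokens per second" figure from the timing summary printed to
# stderr. Paths and timing-line wording are assumptions; adjust as needed.
import re
import subprocess

MODEL = "Meta-Llama-3-8B-Instruct.Q8_0.gguf"
BUILDS = {
    "llama.cpp": "./build/bin/llama-cli",
    "ipex-llm": "./ipex-llm-build/llama-cli",   # hypothetical path
}

def eval_tokens_per_second(binary: str):
    proc = subprocess.run(
        [binary, "-m", MODEL, "-p", "Hello", "-n", "128", "-ngl", "99"],
        capture_output=True, text=True)
    # Grab the last "NN.NN tokens per second" match from the timing summary.
    matches = re.findall(r"([\d.]+) tokens per second", proc.stderr)
    return float(matches[-1]) if matches else None

for name, path in BUILDS.items():
    print(f"{name}: {eval_tokens_per_second(path)} tokens/s (eval)")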
Name and Version
The output below is from llama.cpp:
./build/bin/llama-cli --version
version: 3769 (d54c21d)
built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
The output below is from ipex-llm:
/llama-cli --version
version: 1 (ce3a83b)
built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
What operating system are you seeing the problem on?
No response
Relevant log output