
[SYCL] fallback mmvq #9088

Merged: 3 commits merged into master on Aug 20, 2024

Conversation

airMeng (Collaborator) commented Aug 19, 2024

There is a bug in the SYCL MMVQ implementation: it gets stuck during accuracy evaluation. This PR falls back from MMVQ to avoid the hang. Reproducer:

./bin/llama-perplexity  --hellaswag -m ~/Meta-Llama-3-8B-Instruct-Q4_K_S.gguf -s 0 -ngl 99 --hellaswag-tasks 40 -f ../hellaswag_val_full.txt
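
For reference, here is a minimal sketch of the idea behind the fallback, inferred from this PR's commit messages ("fallback mmvq to mul_mat", "mmvq in cuda path") rather than copied from the merged diff: keep the quantized mat-vec kernel only on the oneAPI CUDA backend, and let every other SYCL backend take the mul_mat path.

// Sketch only, not the exact merged change: gate mmvq on the backend so
// non-CUDA SYCL devices fall back to the dequantize + mul_mat path.
bool use_mul_mat_vec_q = ggml_is_quantized(src0->type)
    && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32
    && src1->ne[1] <= MMVQ_MAX_BATCH_SIZE;

// Keep the mmvq kernel only when the SYCL stream runs on the CUDA backend.
use_mul_mat_vec_q = use_mul_mat_vec_q
    && ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda;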

@airMeng airMeng requested a review from joeatodd August 19, 2024 05:44
@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels Aug 19, 2024
Review thread on ggml/src/ggml-sycl.cpp (outdated, resolved)
@airMeng airMeng requested a review from NeoZhangJianyu August 20, 2024 03:47
@airMeng airMeng force-pushed the sycl-fallback-mmvq branch from 56581db to 7c332dc Compare August 20, 2024 06:45
@NeoZhangJianyu NeoZhangJianyu merged commit 50addec into master Aug 20, 2024
55 checks passed
qnixsynapse (Contributor) commented

I am getting a performance regression with this PR (from 35 tokens/sec down to 4 tokens/sec on an Arc A750, on the latest master branch). Reverting it fixes the issue. Subscribing myself so I can look into this later.

airMeng (Collaborator, Author) commented Aug 23, 2024

> I am getting a performance regression with this PR (from 35 tokens/sec down to 4 tokens/sec on an Arc A750, on the latest master branch). Reverting it fixes the issue.

@qnixsynapse what is the command you are using?

qnixsynapse (Contributor) commented

This:

build/bin/llama-server -c 8192 -t 4 -m meta-llama-3.1-8b-instruct-iq4_xs-imat.gguf --no-mmap -ngl 99 -a "LLaMA" -b 128 --port 8080 --log-disable -s 0 -mg 0 -sm none --log-format text -dt 0.1 --metrics

I haven't fully tested it yet; I just identified the offending patch and reverted it.

qnixsynapse (Contributor) commented Sep 16, 2024

This is still causing a performance regression on quantized models (especially iq4_xs) here.

> it gets stuck during accuracy evaluation. This PR falls back from MMVQ to avoid the hang.

Can you please elaborate on why this is necessary? If Nvidia is unaffected, we can make the fallback Nvidia-exclusive.

Here is what I thought of doing:

// Enable the quantized mat-vec kernel for quantized src0 with F32 src1/dst
// within the batch limit, as before this PR.
bool use_mul_mat_vec_q = ggml_is_quantized(src0->type)
    && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32
    && src1->ne[1] <= MMVQ_MAX_BATCH_SIZE;

// Apply the extra minimum-batch-size restriction only on the oneAPI CUDA backend.
if (ctx.stream()->get_backend() == sycl::backend::ext_oneapi_cuda) {
    use_mul_mat_vec_q = use_mul_mat_vec_q && (src1->ne[1] > MMVQ_MIN_BATCH_SIZE);
}
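
In this shape, mmvq stays enabled for the quantized F32 case on every SYCL backend, as before this PR, and only the CUDA backend gets the extra lower bound. (MMVQ_MIN_BATCH_SIZE is assumed to be a new lower threshold, defined alongside the existing MMVQ_MAX_BATCH_SIZE.)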

qnixsynapse (Contributor) commented

Update: with this PR, the boolean variable use_mul_mat_vec_q is always false:

[Screenshot from 2024-09-17 20-23-08]

But after reverting this PR, the value is sometimes true:

[Screenshot from 2024-09-17 20-30-21]

cc @NeoZhangJianyu: please take a look at this.
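
A quick way to confirm this locally (a hypothetical debugging aid, not code from the PR) is to log the decision next to where use_mul_mat_vec_q is computed in ggml/src/ggml-sycl.cpp:

// Hypothetical debug helper, not part of the PR: print which matmul path is
// chosen so the regression can be confirmed without a debugger. Call it right
// after use_mul_mat_vec_q is computed, passing src1->ne[1] as the batch size.
#include <cstdint>
#include <cstdio>

static void log_mmvq_choice(bool use_mul_mat_vec_q, int64_t batch_size) {
    std::fprintf(stderr, "ggml-sycl: use_mul_mat_vec_q=%s (src1->ne[1]=%lld)\n",
                 use_mul_mat_vec_q ? "true" : "false", (long long) batch_size);
}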

@airMeng airMeng deleted the sycl-fallback-mmvq branch September 21, 2024 14:06
airMeng (Collaborator, Author) commented Sep 21, 2024

@qnixsynapse sorry for the carelessness; could you open a PR to revert it?

qnixsynapse pushed a commit to qnixsynapse/llama.cpp that referenced this pull request Sep 21, 2024
qnixsynapse (Contributor) commented

@airMeng Sure, no problem.

airMeng pushed a commit that referenced this pull request Sep 23, 2024
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* fallback mmvq to mul_mat
* mmvq in cuda path
* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels: ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language)
5 participants