sycl-exp : dequant q4 k improvements #7972

Merged (4 commits) on Jun 18, 2024

AidanBeltonS
Contributor

This PR provides improvements to the dequantize_block_q4_K kernel. It focuses on improving the global memory accesses.

Three main changes are implemented:

  • Use a single 32-bit load for the half2 rather than two 16-bit loads
  • Load all scales into local memory, then do random access on the results
  • Vectorize the q load so that 32 bits are loaded each time rather than 8 bits
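
As a rough illustration of the first and third changes, here is a minimal host-side C++ sketch, not the kernel code from this PR. The helper names (`load_q4_vectorized`, `unpack_nibbles`, `load_half_pair`) and the `HalfPairBits` struct are hypothetical, and the half-precision values are treated as raw 16-bit patterns because standard C++ has no half type. The idea is simply that one 4-byte read replaces several narrower reads of adjacent data.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: fetch four packed quant bytes with one 32-bit read
// instead of four separate 8-bit reads.
inline std::array<uint8_t, 4> load_q4_vectorized(const uint8_t* q) {
    uint32_t word;
    std::memcpy(&word, q, sizeof(word));            // single 32-bit load
    std::array<uint8_t, 4> bytes;
    std::memcpy(bytes.data(), &word, sizeof(word)); // split back into bytes
    return bytes;
}

// Each byte is assumed to hold two 4-bit quants: low nibble and high nibble.
inline void unpack_nibbles(const uint8_t* q, uint8_t lo[4], uint8_t hi[4]) {
    const std::array<uint8_t, 4> b = load_q4_vectorized(q);
    for (int i = 0; i < 4; ++i) {
        lo[i] = b[i] & 0x0F;
        hi[i] = b[i] >> 4;
    }
}

// Hypothetical stand-in for a half2: two adjacent 16-bit values,
// stored here as raw bit patterns.
struct HalfPairBits {
    uint16_t d;
    uint16_t dmin;
};
static_assert(sizeof(HalfPairBits) == 4, "pair must fit one 32-bit load");

// Read both 16-bit values with one 32-bit copy rather than two 16-bit reads.
inline HalfPairBits load_half_pair(const uint16_t* p) {
    HalfPairBits out;
    std::memcpy(&out, p, sizeof(out));
    return out;
}
```

In a real device kernel the equivalent would be a vector or reinterpreted load on a suitably aligned pointer; `std::memcpy` is used here only to keep the sketch portable and free of aliasing problems.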

All results below were collected on an A100 GPU.

| Metric | Without Changes | With Changes | % Change | Note |
| llama-bench 70B PP throughput (t/s) | 503.36 | 564.04 | -11.85 | Negative change is better |
| NSYS avg. kernel time (us) | 587.54 | 409.52 | 30.30 | Positive change is better |

No meaningful change in Intel GPU results has been observed.

AidanBeltonS requested a review from joeatodd on June 17, 2024 09:58
github-actions bot added the "SYCL" label (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on Jun 17, 2024
mofosyne added the "Review Complexity : Medium" label (Generally require more time to grok but manageable by beginner to medium expertise level) on Jun 18, 2024
Contributor

joeatodd left a comment

Looks good, and I've tested it all locally 👍

joeatodd merged commit 0e4699e into codeplay/sycl-main on Jun 18, 2024
67 checks passed
Alcpz pushed a commit to Alcpz/llama.cpp that referenced this pull request on Jun 20, 2024:
* Remove double lines

* Single load for half2

* Store scales in local mem

* Vectorize q load