Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

b2902 #122

Merged
merged 8 commits into from
May 16, 2024
Merged

b2902 #122

merged 8 commits into from
May 16, 2024

Conversation

Nexesenex
Copy link
Owner

No description provided.

danbev and others added 8 commits May 15, 2024 23:41
… MSVC (#7191)

* logging: add proper checks for clang to avoid errors and warnings with VA_ARGS

* build: add CMake Presets and toolchian files for Windows ARM64

* matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings

* ci: add support for optimized Windows ARM64 builds with MSVC and LLVM

* matmul-int8: fixed typos in q8_0_q8_0 matmuls

Co-authored-by: Georgi Gerganov <[email protected]>

* matmul-int8: remove unnecessary casts in q8_0_q8_0

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Switch to Ninja Multi-Config CMake generator to resurect bin/Release path
that broke artifact packaging in CI.
…l. (#7288)

* chore: add references to the quantisation space.

* fix grammer lol.

* Update README.md

Co-authored-by: Julien Chaumond <[email protected]>

* Update README.md

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
This can be overridden with the -m command line option

ref: #7293
@Nexesenex Nexesenex merged commit bff35ad into Nexesenex:downstream May 16, 2024
32 of 40 checks passed
Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
* Adding q6_0_r4

We get PP-512(LLaMA-3.1-8B) = 257 t/s on a Ryzen-7950X.

* q6_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 95 t/s on M2-Max.
In terms of ops, q6_0_r4 is identical to q5_0_r4
except for loading the high bits being
vld1q_u8_x2 instead of vld1q_u8. It is strange that
this can make a 5% difference in performance, especially
considering that this is amortized (re-used) over 8 columns
in the right matrix. Or am I running out of vector registers?

* Fix AVX2

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants