
Implementations of Q4_0_8_8 quantization-based functions for the AVX2 SIMD architecture #8713

Conversation

@Srihari-mcw (Contributor):

  • The PR replicates the Q4_0_8_8 quantization-based functions for x86/x64 SIMD architectures
  • The PR contains AVX2 implementations of the quantize_q8_0_4x8, ggml_gemv_q4_0_8x8_q8_0 and ggml_gemm_q4_0_8x8_q8_0 functions
  • Good gains were observed with these changes, especially in prompt processing, compared to the current default path for the Q4_0 model; currently the Q4_0 model goes through the LLAMAFILE (sgemm.cpp) implementation for mul_mat operations by default
  • The PR introduces an integer variant of the mul_sum_i8_pairs function for performing dot product operations, and macros for converting from half precision to full precision based on F16C intrinsics support (a sketch of both is shown below, after the performance details)
  • Performance details:

GCC Linux:

Q4_0 model:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 43.28 ± 0.08 | | de280085 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 68.04 ± 0.08 | 57.2% | 9737b2e |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.69 ± 0.00 | | de280085 |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.90 ± 0.01 | 1.4% | 9737b2e |

The models were quantized and tested starting from the Meta Llama 2 7B model: https://huggingface.co/meta-llama/Llama-2-7b

The PR was tested on an AMD Raphael 7600X, which supports the following flags by default:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1|
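
As a rough illustration of the integer mul_sum_i8_pairs variant and the F16C conversion macro mentioned in the description, here is a minimal sketch; the function name and exact signature are assumptions for illustration rather than the PR's actual code:

#include <immintrin.h>

// Integer dot product of pairs of signed int8 values: multiplies x[i]*y[i]
// and accumulates adjacent results into 32-bit lanes.
static inline __m256i mul_sum_i8_pairs_int(const __m256i x, const __m256i y) {
    const __m256i ax  = _mm256_sign_epi8(x, x);          // |x|
    const __m256i sy  = _mm256_sign_epi8(y, x);          // y with the sign of x
    const __m256i dot = _mm256_maddubs_epi16(ax, sy);    // unsigned*signed -> 16-bit pair sums
    return _mm256_madd_epi16(_mm256_set1_epi16(1), dot); // widen and sum into 32-bit lanes
}

// Half- to full-precision conversion of 8 values via F16C, when available
#if defined(__F16C__)
#define GGML_F32Cx8_LOAD(x) _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x)))
#endif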

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jul 26, 2024
// Index order that swaps the lower and upper 128-bit halves when used with _mm256_permutevar8x32_epi32
__m256i requiredOrder = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);

// Take a group of four block_q8_0x4 structures in each pass of the loop and perform the dot product operation
for (; y < nr / 4; y += 4) {
@Srihari-mcw (Contributor, Author) commented on this hunk, Jul 26, 2024:

ne11 is processed in batches of 16 in the GEMM function, and the leftover ne11 is processed in batches of four. We saw a bigger performance boost when processing ne11 in batches of 16 with the leftover in batches of 4, versus processing all of ne11 in batches of four. A minimal sketch of this tiling follows.
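
A standalone sketch of this two-level batching over rows (illustrative only; the PR's actual loop bounds and variable names differ):

#include <stdio.h>

// Process nr rows in tiles of 16 first, then handle the remainder in tiles
// of 4, mirroring the ne11 batching strategy described above.
static void process_rows(int nr) {
    int y = 0;
    for (; y + 16 <= nr; y += 16)
        printf("16-row tile starting at row %d\n", y);
    for (; y + 4 <= nr; y += 4)
        printf("4-row tile starting at row %d\n", y);
}

int main(void) {
    process_rows(28); // one 16-row tile, then three 4-row tiles
    return 0;
}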

@bartowski1182 (Contributor):

This isn't a new conversion type, right? It's just a new way of calculating Q4_0?

@Srihari-mcw (Contributor, Author):

Hi @bartowski1182, Q4_0_8_8 is a quantization format where the values are stored in the same 4-bit quantized format, with the same delta values, as Q4_0. The 4-bit quant values across eight different blocks are interleaved with each other. This was introduced in PR #5780. Models that need to use this particular code path need to be quantized in the Q4_0_8_8 format; a rough sketch of the two block layouts follows below. Thanks
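
As a sketch of the two layouts (based on ggml's block definitions; the field names and exact interleaving shown here are illustrative, not authoritative):

#include <stdint.h>

#define QK4_0 32

// Standard Q4_0: one fp16 delta plus 32 four-bit quants per block
typedef struct {
    uint16_t d;              // delta, stored as fp16 bits
    uint8_t  qs[QK4_0 / 2];  // 32 nibbles
} block_q4_0;

// Q4_0_8_8: eight Q4_0 blocks packed together; the deltas are kept
// per source block, while the quants of the eight blocks are
// interleaved for SIMD-friendly access
typedef struct {
    uint16_t d[8];           // the eight original deltas
    uint8_t  qs[QK4_0 * 4];  // 8 blocks x 16 bytes of nibbles, interleaved
} block_q4_0x8;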

@nisten commented on Aug 1, 2024:

I just tested this. It works, albeit nowhere near as drastically as it helps on ARM CPUs, but it helps: tps for Intel CPU inference of 4-bit Llama 405B went from 0.78 (meta-llama-405b-Q_4_NL) to 0.89 (meta-llama-405b-Q_4_0_8_8). The Q_4_NL and Q_4_0_8_8 file sizes are identical.

prompt: ./llama-cli -m ~/meta-406b-q4_0_4_4.gguf -t 4 -co -p "You are a Nasa jpl engineer.Human: How to build a city on Mars via calculating Aldrin-Cycler orbits? Assistant:" -fa -e -c 512 -n 512 -t 64 -b 128

Again, both x86 and ARM CPUs max out batch-size-1 inference at 64 cores/threads; any more and it slows down a bit or stays stagnant.

@mofosyne added the Review Complexity: Medium label (generally requires more time to grok but manageable by beginner to medium expertise level) on Aug 1, 2024
@ggerganov (Owner):

> Albeit nowhere as drastically as it helps on ARM cpus but it helps, tps on intel cpu inference ..

The main benefit from these changes should be in prompt processing speed, not text generation. It is better to use llama-bench to make the comparison.
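
For example, one run per model (the filenames here are placeholders):

./llama-bench -m llama-2-7b-q4_0.gguf -p 512 -n 128 -t 6
./llama-bench -m llama-2-7b-q4_0_8_8.gguf -p 512 -n 128 -t 6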

@ggerganov (Owner) left a review comment:

@Srihari-mcw Could you make a perplexity comparison before merging? For example, against a Q4_0 model at 32 PPL chunks.

@Srihari-mcw force-pushed the block_interleaving_q4_0_8_8_avx2_implementation branch from 81d9078 to c950fc3 on September 4, 2024 13:44
@Srihari-mcw (Contributor, Author) commented on Sep 4, 2024:

Hi @ggerganov,

The perplexity was measured for models quantized from the Meta Llama 2 7B model with the following command:
./llama-perplexity -m <model_name> -f wikitext-2-raw/wiki.test.raw --chunk-size 32

It calculated perplexity over 655 chunks:
perplexity: calculating perplexity over 655 chunks, n_ctx=512, batch_size=2048, n_seq=4

The perplexity results are tabulated as follows:

| model | perplexity (final estimate PPL) | commit id |
| --- | --- | --- |
| llama 7B Q4_0 | 5.9627 +/- 0.03348 | c950fc306 |
| llama 7B Q4_0_8_8 | 5.9625 +/- 0.03348 | c950fc306 |

The perplexity readings were found to be almost identical after the tests.

Further, with the latest changes in the master branch and in the PR, the performance readings are as follows:

GCC Linux:

Q4_0 model:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 58.20 ± 0.10 | | 7605ae7da (base commit before changes) |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 68.96 ± 0.08 | 18.48% | c950fc306 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.48 ± 0.01 | | 7605ae7da (base commit before changes) |
| llama 7B Q4_0_8_8 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.87 ± 0.00 | 2.7% | c950fc306 |

GCC Version = 12.3

The PR was tested on an AMD Raphael 7600X, which supports the following flags by default:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1|

Thanks

@ggerganov merged commit 581c305 into ggerganov:master on Sep 4, 2024
52 checks passed
@slaren (Collaborator) commented on Sep 5, 2024:

It looks like the ROCm compiler is crashing when compiling this code, which is breaking the generation of Docker images.

fatal error: error in backend: Instruction Combining seems stuck in an infinite loop after 1000 iterations.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: /opt/rocm/llvm/bin/clang -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -std=c11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml/src/ggml-aarch64.c -o ggml/src/ggml-aarch64.o
1.	<eof> parser at end of file
2.	Optimizer
 #0 0x00005579e5d7d866 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/opt/rocm/llvm/bin/clang+0x27cf866)
 #1 0x00005579e5d7b6b4 llvm::sys::CleanupOnSignal(unsigned long) (/opt/rocm/llvm/bin/clang+0x27cd6b4)
 #2 0x00005579e5cd8877 llvm::CrashRecoveryContext::HandleExit(int) (/opt/rocm/llvm/bin/clang+0x272a877)
 #3 0x00005579e5d732c2 llvm::sys::Process::Exit(int, bool) (/opt/rocm/llvm/bin/clang+0x27c52c2)
 #4 0x00005579e44bf197 (/opt/rocm/llvm/bin/clang+0xf11197)
 #5 0x00005579e5ce1bc0 llvm::report_fatal_error(llvm::Twine const&, bool) (/opt/rocm/llvm/bin/clang+0x2733bc0)
 #6 0x00005579e58168ab combineInstructionsOverFunction(llvm::Function&, llvm::InstructionWorklist&, llvm::AAResults*, llvm::AssumptionCache&, llvm::TargetLibraryInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::OptimizationRemarkEmitter&, llvm::BlockFrequencyInfo*, llvm::ProfileSummaryInfo*, unsigned int, llvm::LoopInfo*) InstructionCombining.cpp:0:0
 #7 0x00005579e5816c6e llvm::InstCombinePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/opt/rocm/llvm/bin/clang+0x2268c6e)
 #8 0x00005579e61237c6 llvm::detail::PassModel<llvm::Function, llvm::InstCombinePass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/opt/rocm/llvm/bin/clang+0x2b757c6)
 #9 0x00005579e4501ce1 llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/opt/rocm/llvm/bin/clang+0xf53ce1)
#10 0x00005579e4d6c037 llvm::CGSCCToFunctionPassAdaptor::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) (/opt/rocm/llvm/bin/clang+0x17be037)
#11 0x00005579e44f5276 llvm::detail::PassModel<llvm::LazyCallGraph::SCC, llvm::CGSCCToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&>::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) (/opt/rocm/llvm/bin/clang+0xf47276)
#12 0x00005579e4d654b9 llvm::PassManager<llvm::LazyCallGraph::SCC, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&>::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) (/opt/rocm/llvm/bin/clang+0x17b74b9)
#13 0x00005579e5781256 llvm::detail::PassModel<llvm::LazyCallGraph::SCC, llvm::PassManager<llvm::LazyCallGraph::SCC, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&>::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) (/opt/rocm/llvm/bin/clang+0x21d3256)
#14 0x00005579e4d689e1 llvm::DevirtSCCRepeatedPass::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) (/opt/rocm/llvm/bin/clang+0x17ba9e1)
#15 0x00005579e5781206 llvm::detail::PassModel<llvm::LazyCallGraph::SCC, llvm::DevirtSCCRepeatedPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&>::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) (/opt/rocm/llvm/bin/clang+0x21d3206)
#16 0x00005579e4d665f2 llvm::ModuleToPostOrderCGSCCPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/opt/rocm/llvm/bin/clang+0x17b85f2)
#17 0x00005579e5789bd2 llvm::ModuleInlinerWrapperPass::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/opt/rocm/llvm/bin/clang+0x21dbbd2)
#18 0x00005579e7061546 llvm::detail::PassModel<llvm::Module, llvm::ModuleInlinerWrapperPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/opt/rocm/llvm/bin/clang+0x3ab3546)
#19 0x00005579e5658195 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/opt/rocm/llvm/bin/clang+0x20aa195)
#20 0x00005579e6134c13 (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>&, std::unique_ptr<llvm::ToolOutputFile, std::default_delete<llvm::ToolOutputFile>>&) BackendUtil.cpp:0:0
#21 0x00005579e6137e95 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>) (/opt/rocm/llvm/bin/clang+0x2b89e95)
#22 0x00005579e700507d clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/opt/rocm/llvm/bin/clang+0x3a5707d)
#23 0x00005579e7bd6191 clang::ParseAST(clang::Sema&, bool, bool) (/opt/rocm/llvm/bin/clang+0x4628191)
#24 0x00005579e691ac99 clang::FrontendAction::Execute() (/opt/rocm/llvm/bin/clang+0x336cc99)
#25 0x00005579e68a3301 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/opt/rocm/llvm/bin/clang+0x32f5301)
#26 0x00005579e69dc160 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/opt/rocm/llvm/bin/clang+0x342e160)
#27 0x00005579e44c0555 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/opt/rocm/llvm/bin/clang+0xf12555)
#28 0x00005579e44bb84f ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) driver.cpp:0:0
#29 0x00005579e66e7a89 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::'lambda'()>(long) Job.cpp:0:0
#30 0x00005579e5cd8767 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/opt/rocm/llvm/bin/clang+0x272a767)
#31 0x00005579e66e7e17 clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const (.part.0) Job.cpp:0:0
#32 0x00005579e66a8bc1 clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/opt/rocm/llvm/bin/clang+0x30fabc1)
#33 0x00005579e66a95d6 std::_Function_handler<void (), clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const::'lambda'()>::_M_invoke(std::_Any_data const&) Compilation.cpp:0:0
#34 0x00005579e66aeaf8 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const (/opt/rocm/llvm/bin/clang+0x3100af8)
#35 0x00005579e66beb7c clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) (/opt/rocm/llvm/bin/clang+0x3110b7c)
#36 0x00005579e44be4d7 clang_main(int, char**) (/opt/rocm/llvm/bin/clang+0xf104d7)
#37 0x00007fd4e1d9fd90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#38 0x00007fd4e1d9fe40 call_init ./csu/../csu/libc-start.c:128:20
#39 0x00007fd4e1d9fe40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#40 0x00005579e44b73f5 _start (/opt/rocm/llvm/bin/clang+0xf093f5)
clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.0 23243 be997b2f3651a41597d7a41441fff8ade4ac59ac)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang-16: note: diagnostic msg: /tmp/ggml-aarch64-3228e7.c
clang-16: note: diagnostic msg: /tmp/ggml-aarch64-3228e7.sh

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* Add AVX2 based implementations for quantize_q8_0_4x8, ggml_gemv_q4_0_8x8_q8_0 and ggml_gemm_q4_0_8x8_q8_0 functions

* Update code to fix issues occurring due to non-alignment of elements to be processed as a multiple of 16 in MSVC

* Update comments and indentation

* Make updates to reduce number of load instructions
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024