Ag indirect copy dest #292

Nexesenex · 2024-08-13T10:26:17Z

No description provided.


…ggerganov#8748) In this code, elements should retain their previous values when mask[i] is false, so the undisturbed policy should be used. With the default agnostic policy of the RVV intrinsics, these values may either be kept or be overwritten with 1s. Co-authored-by: carter.li <[email protected]>


…ov#8751)

* added android implementation of ggml_print_backtrace_symbols
* Update ggml/src/ggml.c
  Co-authored-by: slaren <[email protected]>
* Update ggml/src/ggml.c
  Co-authored-by: slaren <[email protected]>
* Update ggml/src/ggml.c
  Co-authored-by: slaren <[email protected]>
* Update ggml/src/ggml.c
  Co-authored-by: slaren <[email protected]>
* Update ggml/src/ggml.c
  Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

…gerganov#8774)

* gguf_writer.py: add_array() should not add to kv store if empty
* Apply suggestions from code review
  I was wondering if there was a specific reason for `if val` but good to hear we can safely use `len(val) == 0`
  Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: compilade <[email protected]>


* Adding support for unified memory
* adding again the documentation about unified memory
* refactoring: Moved the unified memory code in the correct location.
* Fixed compilation error when using hipblas
* cleaning up the documentation
* Updating the documentation
  Co-authored-by: Johannes Gäßler <[email protected]>
* adding one more case where the PR should not be enabled

---------

Co-authored-by: matteo serva <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>


* add truncate_bf16
* truncate intermediate fp32 if converting bf16 to bf16
* fix masking in __compute_fp32_to_bf16
* np.int16 no longer used
* missing cast and additional numpy 2.x fix
* ggml-impl : do not flush bf16 subnormals to zero
* ggml : add reference fp32 to bf16 conversion
  The fast version is no longer equivalent for all platforms because of the handling of subnormal values.
* gguf-py : remove flush to zero for bf16 subnormals
* gguf-py : remove float32 truncation to bf16
  Rounding achieves the same thing in the cases where this was used.
* missed prototype update in merge
* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <[email protected]>
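The rounding behaviour described in this commit can be illustrated with a small sketch. It is not the ggml implementation: the function names `fp32_to_bf16_sketch` and `bf16_to_fp32_sketch` are hypothetical, and the code only assumes the usual bf16 layout (the upper 16 bits of an IEEE 754 binary32 value). It rounds to nearest-even and leaves subnormal inputs alone instead of flushing them to zero.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper, not ggml's API: convert fp32 to bf16 bits using
// round-to-nearest-even, without flushing subnormals to zero.
static uint16_t fp32_to_bf16_sketch(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits)); // bit-exact reinterpretation

    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: keep the upper bits and force the quiet bit so the payload
        // cannot be truncated into an infinity.
        return static_cast<uint16_t>((bits >> 16) | 0x0040u);
    }

    // Add 0x7fff plus the lowest kept bit: ties round toward the even result.
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical helper: widen bf16 bits back to fp32 (always exact).
static float bf16_to_fp32_sketch(uint16_t h) {
    const uint32_t bits = static_cast<uint32_t>(h) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

With this scheme a bf16 subnormal such as the bit pattern `0x0001` survives a round trip through fp32, which a flush-to-zero conversion would not preserve.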

* ggml : reading the runtime sve config of the cpu
* change to one time init to prevent performance drop
* prefix variable to avoid possible conflicts
* revert xxhash fix and add brackets

---------

Co-authored-by: domke <[email protected]>

* [example] batched-bench "segmentation fault"

  When `llama-batched-bench` is invoked _without_ setting `-npl`, "number of parallel prompts", it segfaults. The segfault is caused by invoking `max_element()` on a zero-length vector, `n_pl`.

  This commit addresses that by first checking to see if the number of parallel prompts is zero, and if so sets the maximum sequence size to 1; otherwise, sets it to the original, the result of `max_element()`.

  Fixes, when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69       llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71       // ensure enough sequences are available
-> 72       ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```

* Update examples/batched-bench/batched-bench.cpp
  Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: compilade <[email protected]>
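The guard described in this commit message can be sketched as follows. This is a minimal illustration rather than the merged code: the helper name `max_parallel_or_default` is hypothetical, and only the vector name `n_pl` and the fallback value of 1 come from the message above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of batched-bench: returns the largest
// requested number of parallel prompts, or 1 when none were supplied.
static int32_t max_parallel_or_default(const std::vector<int32_t> & n_pl) {
    // std::max_element returns end() for an empty range, and dereferencing
    // end() is undefined behavior; that is the crash shown in the lldb trace.
    if (n_pl.empty()) {
        return 1;
    }
    return *std::max_element(n_pl.begin(), n_pl.end());
}
```

A caller would then assign the result to `ctx_params.n_seq_max` instead of dereferencing the iterator directly.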


This commit moves the comment for the c parameter from ggml_rope to ggml_rope_ext. The comment is currently incorrect as ggml_rope does not have a c parameter (freq_factors tensor). Signed-off-by: Daniel Bevenius <[email protected]>

* Fix Vulkan repeat op
* Implement Vulkan concat op
* Delete old Vulkan shader generator
* Implement Vulkan im2col op
* Implement Vulkan unary gelu_quick op
* Implement Vulkan group_norm op
* Implement Vulkan timestep_embedding op
* Implement Vulkan upscale op
* Fix Vulkan vk_context tensor extra index issue
* Fix Vulkan matmul shader parameter bug
* Properly fix Vulkan matmul shader parameter bug
* Add Vulkan ADD f16 + f32 -> f16 operator support
* Implement Vulkan tanh op
* Fix Vulkan group count too large Validation error on non-Nvidia GPUs
* Throw error when too much memory is requested
* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs
* Fix matmul MMQ condition
* Implement Vulkan pad op
* Fix Vulkan crash when tensor is used multiple times in a compute graph
* Add Vulkan CONCAT f16 + f16 -> f16 op
* Add Vulkan LEAKY_RELU op


…ronization overhead. (ggerganov#8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.
  - Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
  - ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers seems to be sufficient, judging by the code, which either launches compute kernels or copies tensors.
* Fix small typo

---------

Co-authored-by: 0cc4m <[email protected]>


Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724 In order to access the above bug you need to login using one of the emails in https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5 Signed-off-by: David Korczynski <[email protected]>


* readme: introduce gpustack
  GPUStack is an open-source GPU cluster manager for running large language models, which uses llama.cpp as the backend.
  Signed-off-by: thxCode <[email protected]>
* readme: introduce gguf-parser
  GGUF Parser is a tool to review/check the GGUF file and estimate the memory usage without downloading the whole model.
  Signed-off-by: thxCode <[email protected]>

---------

Signed-off-by: thxCode <[email protected]>


Previously, the CUDA graphs implementation was complicated by frequently changing parameters to the copy kernels associated with the K and V cache pointers. This patch simplifies it by using indirection so that these parameters no longer change frequently, which avoids the need for frequent graph updates.
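The idea of indirection can be illustrated with a CPU-only analogy. This is a sketch under stated assumptions, not the patch itself: `RecordedCopy` is a hypothetical stand-in for a captured graph node whose parameters are frozen at record time, and no CUDA API is used. Instead of storing the destination pointer, the node stores the address of a slot that holds it, so the destination can change between replays without re-recording the node.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a captured copy node. Its fields play the role of
// kernel parameters that are fixed once the graph has been recorded.
struct RecordedCopy {
    const float * src;      // source data, fixed at record time
    float **      dst_slot; // address of the slot holding the destination
    std::size_t   count;    // number of floats to copy

    // Replaying reads the current destination through the slot, so the
    // recorded parameters never need to be updated.
    void replay() const {
        std::memcpy(*dst_slot, src, count * sizeof(float));
    }
};
```

Usage under the same assumptions: record the node once with `&slot`, then write a new destination into `slot` before each replay. The recorded parameters stay identical across replays, which in the CUDA graphs setting is what removes the need to update the graph.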

JohannesGaessler and others added 30 commits July 28, 2024 22:32

cmake: use 1 more thread for non-ggml in CI (ggerganov#8740)

6eeaeba

[SYCL] add conv support (ggerganov#8688)

0832de7

cuda : organize vendor-specific headers into vendors directory (ggerg…

439b3fc

…anov#8746) Signed-off-by: Xiaodong Ye <[email protected]>

[SYCL] Add TIMESTEP_EMBEDDING OP (ggerganov#8707)

c887d8b

Signed-off-by: zhentaoyu <[email protected]>

cann: update cmake (ggerganov#8765)

6e2b600

flake.lock: Update (ggerganov#8729)

140074b

nix: cuda: rely on propagatedBuildInputs (ggerganov#8772)

268c566

Listing individual outputs no longer necessary to reduce the runtime closure size after NixOS/nixpkgs#323056.

cmake : fix use of external ggml (ggerganov#8787)

44d28dd

Adding Gemma 2 2B configs (ggerganov#8784)

398ede5

* Adding Gemma 2 2B configs Updates to Q scaling and Gemma 2 model sizes to match v2 2B model. * Update src/llama.cpp Co-authored-by: slaren <[email protected]> --------- Co-authored-by: slaren <[email protected]>

Build: Fix potential race condition (ggerganov#8781)

ed9d285

* Fix potential race condition as pointed out by @fairydreaming in ggerganov#8776 * Reference the .o rather than rebuilding every time. * Adding in CXXFLAGS and LDFLAGS * Removing unnecessary linker flags.

server : update llama-server embedding flag documentation (ggerganov#…

afbbcf3

…8779) Fixes ggerganov#8763

cann: support q8_0 for Ascend backend (ggerganov#8805)

c8a0090

cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (ggerganov#8800)

7a11eb3

* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X * update asserts * only use dmmv for supported types * add test

Build: Only include execinfo.h on linux systems that support it (gger…

b7a08fd

…ganov#8783) * Only enable backtrace on GLIBC linux systems * fix missing file from copy * use glibc macro instead of defining a custom one

[SYCL] Fixing wrong VDR iq4nl value (ggerganov#8812)

0fbbd88

cann: Fix ggml_cann_im2col for 1D im2col (ggerganov#8819)

e09a800

* fix ggml_cann_im2col for 1D im2col * fix build warning

flake.lock: Update (ggerganov#8847)

4b77ea9

baby-llama : remove duplicate vector include

01aae2b

Server: Don't ignore llama.cpp params (ggerganov#8754)

978ba3d

* Don't ignore llama.cpp params * Add fallback for max_tokens

Install curl in runtime layer (ggerganov#8693)

0d6fb52

cann: support q4_0 model (ggerganov#8822)

c02b0a8

ngxson and others added 15 commits August 10, 2024 13:04

llama : default n_swa for phi-3 (ggerganov#8931)

7eb2384

* default n_swa for phi-3 * fix * double check swa

metal : fix uninitialized abort_callback (ggerganov#8968)

6e02327

llama : check all graph nodes when searching for result_embd_pooled (g…

33309f6

…gerganov#8956) Co-authored-by: Stanisław Szymczyk <[email protected]>

update guide (ggerganov#8909)

a21c6fd

Co-authored-by: Neo Zhang <>

flake.lock: Update (ggerganov#8979)

8cd1bcf

gguf-py : Numpy dequantization for most types (ggerganov#8939)

4134999

* gguf-py : Numpy dequantization for most types * gguf-py : Numpy dequantization for grid-based i-quants

server : handle models with missing EOS token (ggerganov#8997)

5ef07e2

ggml-ci

py : fix requirements check '==' -> '~=' (ggerganov#8982)

d3ae0ee

* py : fix requirements check '==' -> '~=' * cont : fix the fix * ci : run on all requirements.txt

Fix a spelling mistake (ggerganov#9001)

2589292

grammar-parser : fix possible null-deref (ggerganov#9004)

1262e7e

Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680 Signed-off-by: David Korczynski <[email protected]>

llama : model-based max number of graph nodes calculation (ggerganov#…

0fd93cd

…8970) * llama : model-based max number of graph nodes calculation * Update src/llama.cpp --------- Co-authored-by: slaren <[email protected]>

Nexesenex merged commit 0a21af5 into Nexesenex:lcpp_pr_cuda_graphs_improve Aug 13, 2024
8 of 11 checks passed

github-actions bot added the documentation, Nvidia GPU, testing, examples, python, server, ggml, devops, SYCL, Vulkan, build, script, Apple Metal, and nix labels Aug 13, 2024
