Merge b3565 #286

Nexesenex · 2024-08-10T18:44:37Z

No description provided.

* add truncate_bf16
* truncate intermediate fp32 if converting bf16 to bf16
* fix masking in __compute_fp32_to_bf16
* np.int16 no longer used
* missing cast and additional numpy 2.x fix
* ggml-impl : do not flush bf16 subnormals to zero
* ggml : add reference fp32 to bf16 conversion
  The fast version is no longer equivalent for all platforms because of the handling of subnormal values.
* gguf-py : remove flush to zero for bf16 subnormals
* gguf-py : remove float32 truncation to bf16
  Rounding achieves the same thing in the cases where this was used.
* missed prototype update in merge
* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <[email protected]>
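For readers unfamiliar with the conversion these commits discuss, here is a minimal sketch of an fp32 to bf16 conversion that rounds to nearest-even and does not flush subnormals to zero. The function names are hypothetical and chosen for this illustration; this is not claimed to be the code the commits added.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret raw IEEE-754 bits as a float (helper for this sketch only).
inline float bits_to_fp32(uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper: fp32 -> bf16 with round-to-nearest-even.
inline uint16_t fp32_to_bf16_round(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: keep the top bits and set a mantissa bit so it stays a NaN.
        return static_cast<uint16_t>((bits >> 16) | 0x0040u);
    }
    // Add 0x7fff plus the lowest kept bit so that ties round to even.
    // Subnormal inputs take the same path; nothing is flushed to zero.
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>(bits >> 16);
}

// bf16 -> fp32 is exact: the 16 bits become the high half of the float.
inline float bf16_to_fp32(uint16_t h) {
    return bits_to_fp32(static_cast<uint32_t>(h) << 16);
}
```

With this sketch, a subnormal input such as the bit pattern `0x00010000` converts to the non-zero bf16 pattern `0x0001`, and a tie such as `0x3F808000` rounds down to the even pattern `0x3F80`.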

* ggml : reading the runtime sve config of the cpu
* change to one time init to prevent performance drop
* prefix variable to avoid possible conflicts
* revert xxhash fix and add brackets

---------

Co-authored-by: domke <[email protected]>
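The "one time init" idea can be illustrated generically: query an expensive value once and reuse the cached result afterwards. The sketch below is an assumption-laden illustration, not the commit's code. `query_vector_length_slow` is a hypothetical stand-in for a hardware query and returns a made-up value.

```cpp
// Counts how often the pretend query runs; for illustration only.
static int g_query_count = 0;

// Hypothetical stand-in for an expensive hardware query.
inline int query_vector_length_slow() {
    ++g_query_count;
    return 32; // made-up value, not a real SVE vector length
}

// The function-local static is initialized exactly once, on first use,
// so later calls return the cached value without repeating the query.
inline int cached_vector_length() {
    static const int value = query_vector_length_slow();
    return value;
}
```

Calling `cached_vector_length()` repeatedly runs the slow query only on the first call, which is the property that avoids a per-call performance cost.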

* [example] batched-bench "segmentation fault"

When `llama-batched-bench` is invoked _without_ setting `-npl`, "number of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on a zero-length vector, `n_pl`.

This commit addresses that by first checking to see if the number of parallel prompts is zero, and if so sets the maximum sequence size to 1; otherwise, sets it to the original, the result of `max_element()`.

Fixes, when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  // ensure enough sequences are available
-> 72  ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```

* Update examples/batched-bench/batched-bench.cpp

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: compilade <[email protected]>
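The guard described above can be sketched as a small helper. Dereferencing the iterator returned by `std::max_element` on an empty range is undefined behavior, so the empty case is handled first. The helper name `max_parallel_or_default` is hypothetical; the commit itself may express the check differently.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: largest requested parallel count, or 1 when
// no counts were given (the empty-vector case that used to crash).
inline int max_parallel_or_default(const std::vector<int> & n_pl) {
    if (n_pl.empty()) {
        return 1;
    }
    return *std::max_element(n_pl.begin(), n_pl.end());
}
```

An empty vector yields 1, and a vector such as `{1, 4, 2}` yields 4.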

This commit moves the comment for the c parameter from ggml_rope to ggml_rope_ext. The comment is currently incorrect as ggml_rope does not have a c parameter (freq_factors tensor). Signed-off-by: Daniel Bevenius <[email protected]>

* Fix Vulkan repeat op
* Implement Vulkan concat op
* Delete old Vulkan shader generator
* Implement Vulkan im2col op
* Implement Vulkan unary gelu_quick op
* Implement Vulkan group_norm op
* Implement Vulkan timestep_embedding op
* Implement Vulkan upscale op
* Fix Vulkan vk_context tensor extra index issue
* Fix Vulkan matmul shader parameter bug
* Properly fix Vulkan matmul shader parameter bug
* Add Vulkan ADD f16 + f32 -> f16 operator support
* Implement Vulkan tanh op
* Fix Vulkan group count too large Validation error on non-Nvidia GPUs
* Throw error when too much memory is requested
* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs
* Fix matmul MMQ condition
* Implement Vulkan pad op
* Fix Vulkan crash when tensor is used multiple times in a compute graph
* Add Vulkan CONCAT f16 + f16 -> f16 op
* Add Vulkan LEAKY_RELU op

… Llama 3.1 tool call support (#8858)

* gguf-py, llama : add constants and methods related to Llama-3.1 <|eom_id|> token
* llama : find Llama-3.1 <|eom_id|> token id during vocab loading
* llama-vocab : add Llama-3.1 <|eom_id|> token to the set of tokens stopping the generation

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
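The idea of a "set of tokens stopping the generation" can be sketched as a lookup in a small set of token ids. Everything below is an assumption made for illustration: the type names are hypothetical, and real token ids come from the model's vocabulary rather than from constants in code.

```cpp
#include <cstdint>
#include <unordered_set>

using token_id = int32_t;

// Hypothetical container for every token id that should end generation.
struct stop_tokens_sketch {
    std::unordered_set<token_id> ids;

    // True when the sampled token is one of the registered stop tokens.
    bool is_stop(token_id t) const {
        return ids.count(t) != 0;
    }
};
```

Supporting an additional end-of-message token then amounts to inserting its id into the set once, at vocabulary load time.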

When using CMake to build with Vulkan support, compiling vulkan-shaders-gen fails due to missing a CMakeLists.txt specification to link vulkan-shaders-gen with the threading library, resulting in the following error.

    [5/172] Linking CXX executable bin/vulkan-shaders-gen
    FAILED: bin/vulkan-shaders-gen
    : && /usr/bin/c++ ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o -o bin/vulkan-shaders-gen && :
    ld: error: undefined symbol: pthread_create
    >>> referenced by vulkan-shaders-gen.cpp
    >>> ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o:(std::__1::__libcpp_thread_create[abi:se180100](pthread**,
    >>> void* (*)(void*), void*))
    c++: error: linker command failed with exit code 1 (use -v to see invocation)
    [6/172] Generating build details from Git
    -- Found Git: /usr/local/bin/git (found version "2.45.2")
    ninja: build stopped: subcommand failed.

Add the CMakeLists.txt specification to link vulkan-shaders-gen with the threading library and fix the above error.

Fixes #8834

* gguf-py : use classes for quants
* convert_hf : simplify internal quantization type selection
* gguf-py : fix flake8 lint
* gguf-py : fix BF16 numpy view type
* gguf-py : remove LlamaFileTypeMap
  Too specific to 'llama.cpp', and would be a maintenance burden to keep up to date.
* gguf-py : add generic quantize and dequantize functions
  The quant classes no longer need to be known, only the target or the source type, for 'quantize' and 'dequantize', respectively.

This commit adds the `--pooling` option to the README.md file in the `examples/embedding` directory.

The motivation for adding this option is that currently, if the model used does not specify a pooling type, the embedding example will fail with the following error message:

```console
main: error: pooling type NONE not supported
```

This commit also updates the name of the executable in the examples section.

* ggml: use vulkan as gpu backend when available
  Signed-off-by: Matt Stephenson <[email protected]>
* whisper: enable using vk as default buffer type
  Signed-off-by: Matt Stephenson <[email protected]>

---------

Signed-off-by: Matt Stephenson <[email protected]>

* init
* rename
* add run android for termux in readme
* add android readme
* add instructions in readme
* change name in readme
* Update README.md
* fixed line
* add result in readme
* random pos_embed
* add positions index
* change for ollama
* change for ollama
* better pos_embed in clip
* support ollama
* updata cmakelist
* updata cmakelist
* rename wrapper
* clear code
* replace and organize code
* add link
* sync master
* fix warnings
* fix warnings
* fix bug in bicubic resize when need resize iamge smaller
* receive review comments and modify
* receive review comments and modify
* put all code into llava dir
* fix quality problem in pr code
* change n_layer
* add space in "-1"
* imitate reshape bug of python code
* fix bug in clip
* fix issues for merging
* fix llama-minicpmv-cli in cmake file
* change pr readme
* fix code review
* remove in line 33 directory in the /cmakelists.txt (not in example, in the main dir
* fix cmakefile
* add warn
* fix KEY_HAS_MINICPMV_PROJ
* remove load_image_size into clip_ctx
* remove the extern "C", MINICPMV_API
* fix uhd code for review comment
* delete minicpmv-wrapper in pr
* remove uhd_image_embed
* Modify 2 notes
* clip : style changes
* del common.h in clip
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix makefile error
* fix ubuntu-make error
* try fix clip
* try fix 1

---------

Co-authored-by: Hongji Zhu <[email protected]>
Co-authored-by: harvestingmoon <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* gguf-py : add T5ENCODER model architecture
* common : call llama_decode() during warmup only if the model has decoder
* convert-hf : add T5EncoderModel
* llama : add llama_model_has_decoder() API function
* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()
* llama : add support for LLM_ARCH_T5ENCODER
* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE
* llama-embedding : add support for encoder-only models

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>

CISC and others added 30 commits August 2, 2024 15:11

flake.lock: Update (#8847)

4b77ea9

baby-llama : remove duplicate vector include

01aae2b

Server: Don't ignore llama.cpp params (#8754)

978ba3d

* Don't ignore llama.cpp params
* Add fallback for max_tokens

Install curl in runtime layer (#8693)

0d6fb52

cann: support q4_0 model (#8822)

c02b0a8

sync : ggml

5587e57

ggml-ci

vulkan : fix Qantized Mat-Vec Mul on AMD GPUs for ncols < 64 (#8855)

064cdc2

* Fix Vulkan mul mat vec invalid results when ncols < warp size
* Only run backend ops mul mat vec block size test if block size not already covered

llama : better replace_all (#8852)

f1ea514

readme : update model list (#8851)

400ae6f

cmake: fix paths for vulkan shaders compilation on Windows (#8573)

e31a4f6

* Vulkan-shaders: attempt fix compilation on windows
* fix miss-matched parenthesis

py: Add more authorship metadata from model card (#8810)

1ef14b3

* py: add more authorship metadata from model card
* fixup! py: add more authorship metadata from model card

ggml : fix overflows in elu function (#8866)

b9dfc25

It's helpful to use expm1f(x), because expf(x)-1 will result in overflow for 25% of single-precision floating point numbers.
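A minimal sketch of an ELU that uses `expm1` for the non-positive branch follows. The function name `elu_sketch` is hypothetical and this is not claimed to be the ggml implementation. One well-known benefit of `expm1` is accuracy for inputs close to zero, where computing `exp(x)` first and subtracting 1 afterwards loses most of the significant digits.

```cpp
#include <cmath>

// Hypothetical ELU sketch with alpha fixed at 1:
//   x            for x > 0
//   exp(x) - 1   otherwise, computed via expm1 for numerical accuracy
inline float elu_sketch(float x) {
    return x > 0.0f ? x : std::expm1(x);
}
```

For a tiny negative input such as `-1e-10f`, this sketch returns a value close to `-1e-10f` instead of collapsing toward zero.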

readme : add ramalama to the availables UI (#8811)

b42978e

ramalama is a repo agnostic boring CLI tool that supports pulling from ollama, huggingface and oci registries. Signed-off-by: Eric Curtin <[email protected]>

cann: fix buffer_num and runtime speed slowly error (#8865)

bc0f887

common : Changed tuple to struct (TODO fix) (#8823)

0a4ce78

* common : Changed tuple to struct (TODO fix)
  Use struct `llama_init_result` to replace the previous std::tuple<struct llama_model *, struct llama_context *>
* delete llama_init_default_params()
* delete the extra whitespace
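The motivation for returning a struct instead of a tuple can be sketched with stand-in types. All names below (`model_sketch`, `context_sketch`, `init_result_sketch`, `init_sketch`) are hypothetical; the real `llama_init_result` may have different members.

```cpp
// Stand-in types for this illustration; not the real llama.cpp types.
struct model_sketch {};
struct context_sketch {};

// Named members document themselves at the call site, whereas a
// std::tuple forces callers to remember positions (std::get<0>, ...).
struct init_result_sketch {
    model_sketch   * model   = nullptr;
    context_sketch * context = nullptr;
};

// Hypothetical initializer returning both objects in one struct.
inline init_result_sketch init_sketch() {
    static model_sketch   the_model;
    static context_sketch the_context;
    init_result_sketch result;
    result.model   = &the_model;
    result.context = &the_context;
    return result;
}
```

A caller can then write `init_sketch().model` rather than unpacking a tuple by index, and new members can be added later without changing existing call sites.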

[SYCL] correct cmd name (#8877)

d4ff847

[CANN]: Fix ggml_backend_cann_buffer_get_tensor (#8871)

c21a896

* cann: fix ggml_backend_cann_buffer_get_tensor
  1. fix data ptr offset
  2. enable the acquisition of incomplete tensors
* fix backend cann set_tensor

convert : add support for XLMRoberta embedding models (#8658)

cdd1889

* add conversion for bge-m3; small fix in unigram tokenizer
* clean up and simplify XLMRoberta conversion

ggml : add epsilon as a parameter for group_norm (#8818)

2d5dd7b

Signed-off-by: Molly Sophia <[email protected]>

contributing : add note about write access

0bf16de

[Vulkan] Fix compilation of vulkan-shaders-gen on w64devkit after `e31a4f6` (#8880)

efda90c

* Fix compilation issue in `vulkan-shaders-gen`
  e31a4f6 broke compilation on w64devkit. Including `algorithm` seems to fix that.
* Guard it under `#ifdef _WIN32`

simple : update name of executable to llama-simple (#8885)

5f4dcb1

This commit updates the name of the executable in README.md from `simple` to `llama-simple`.

CUDA: fix padding logic for FP16/FP32 (#8884)

641f5dd

ggerganov and others added 18 commits August 8, 2024 14:40

scripts : fix sync filenames (#0)

366d486

scripts : sync cann files (#0)

afd27f0

llama : reduce useless copies when saving session (#8916)

345a686

* llama : avoid useless copies in dummy session writer
* llama : avoid double tensor copy when saving session to buffer

server : add one level list nesting for embeddings (#8936)

daef3ab

llama : fix typo in llama_tensor_get_type comment [no ci] (#8937)

6f6496b

sync : ggml

4305b57

llama : better replace_all (cont) (#8926)

45a55b9

* llama : better replace_all (cont)
  ggml-ci
* code : deduplicate replace_all
  ggml-ci
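For context, one common way to write a `replace_all` that stays linear in the input size is to build the result in a separate buffer instead of editing the string in place. The sketch below illustrates that approach under the hypothetical name `replace_all_sketch`; it is not claimed to match the project's implementation.

```cpp
#include <string>
#include <utility>

// Replace every occurrence of `search` in `s` with `replacement`.
// The output is built in a separate buffer, so each character is copied
// once and text inserted by a replacement is never scanned again.
inline void replace_all_sketch(std::string & s,
                               const std::string & search,
                               const std::string & replacement) {
    if (search.empty()) {
        return; // nothing sensible to search for
    }
    std::string out;
    out.reserve(s.size());
    std::string::size_type last = 0;
    std::string::size_type pos  = 0;
    while ((pos = s.find(search, last)) != std::string::npos) {
        out.append(s, last, pos - last); // text before the match
        out.append(replacement);
        last = pos + search.size();      // continue after the match
    }
    out.append(s, last, std::string::npos); // remaining tail
    s = std::move(out);
}
```

Because scanning resumes after each match in the original string, a replacement that itself contains the search text (for example replacing `a` with `aa`) cannot cause an endless loop.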

make : fix llava obj file race (#8946)

272e3bd

ggml-ci

llama : add support for lora adapters in T5 model (#8938)

6afd1a9

Co-authored-by: Stanisław Szymczyk <[email protected]>

Merge commit from fork

b72942f

gguf-py : fix double call to add_architecture() (#8952)

911b437

Signed-off-by: tarilabs <[email protected]>

llama : default n_swa for phi-3 (#8931)

7eb2384

* default n_swa for phi-3
* fix
* double check swa

metal : fix uninitialized abort_callback (#8968)

6e02327

github-actions bot added the Nvidia GPU, testing, examples, python, server, ggml, devops, SYCL, Vulkan, script, and Apple Metal labels on Aug 10, 2024

Nexesenex merged commit 14f4f40 into Nexesenex:patch-1 Aug 10, 2024
48 of 57 checks passed
