b3565 #287

Nexesenex · 2024-08-10T20:47:01Z

No description provided.

* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()
* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Unexpected vocab type as test fail instead of error
    - Useful when automating tests:
      - If you don't know in advance the vocab type
      - Differentiate other loading errors
  - Skip unicode surrogates and undefined
  - Gracefully exit threads
    - Using exit() is throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint
* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check
* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexs splits undefined unicode codepoints
* 'viking' detokenizer clean spaces

* llama : add early return for empty range

  This commit adds an early return to the llama_kv_cache_seq_add and llama_kv_cache_seq_div functions.

  The motivation for adding this is to avoid looping over the cache when the range is empty. I ran into this when using the self-extend feature in main.cpp.

  Signed-off-by: Daniel Bevenius <[email protected]>

* llama : add static_cast to fix CI warning/error

  This commit attempts to fix the following warning/error:

  ```console
  src/llama.cpp:7271:31: error: comparison of integer expressions of different signedness: ‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
   7271 |         if (i < hparams.n_layer_dense_lead) {
        |             ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ```

  This can be reproduced locally by setting -Wsign-compare in the Makefile.

  Signed-off-by: Daniel Bevenius <[email protected]>

* squash! llama : add early return for empty range

  Remove the setting of cache.head to 0 when the range is empty.

  Signed-off-by: Daniel Bevenius <[email protected]>

* Update src/llama.cpp

---------

Signed-off-by: Daniel Bevenius <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
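The two ideas in the commit message above can be illustrated with a small, self-contained sketch. This is not the code from `src/llama.cpp`: the structs `hparams_sketch` and `cell_sketch`, the functions `count_dense_lead_layers` and `shift_range`, and the choice to treat `p0 >= p1` as "empty" are all assumptions made for illustration. Only the field name `n_layer_dense_lead` and the `int` versus `uint32_t` comparison come from the quoted warning.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the hyperparameter struct; only the field
// named in the compiler warning is modeled here.
struct hparams_sketch {
    uint32_t n_layer_dense_lead = 0;
};

// Counts layers whose index is below n_layer_dense_lead. The loop index is
// a signed int, so it is cast before the comparison; comparing it directly
// with the uint32_t field is what -Wsign-compare complains about.
int count_dense_lead_layers(const hparams_sketch & hparams, int n_layer) {
    assert(n_layer >= 0);
    int n_dense = 0;
    for (int i = 0; i < n_layer; ++i) {
        if (static_cast<uint32_t>(i) < hparams.n_layer_dense_lead) {
            ++n_dense;
        }
    }
    return n_dense;
}

// Hypothetical cache cell holding only a position.
struct cell_sketch {
    int pos;
};

// Adds delta to every cell whose position lies in [p0, p1) and returns how
// many cells were touched. An empty range (assumed here: p0 >= p1) returns
// before the loop, so the cache is never scanned for nothing.
int shift_range(std::vector<cell_sketch> & cells, int p0, int p1, int delta) {
    if (p0 >= p1) {
        return 0; // early return: nothing to do for an empty range
    }
    int n_changed = 0;
    for (cell_sketch & c : cells) {
        if (c.pos >= p0 && c.pos < p1) {
            c.pos += delta;
            ++n_changed;
        }
    }
    return n_changed;
}
```

Calling `shift_range(cells, 2, 2, 5)` leaves `cells` untouched and returns 0 without iterating, which is the behavior the first commit describes.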

* server: Retrieve prompt template in /props

  This PR adds the following:
  - Expose the model's Jinja2 prompt template in the /props endpoint.
  - Change the log level from Error to Warning for the warning about template mismatch.

  The front-end stands a better chance of actually executing the Jinja template format correctly. The server is currently just guessing it.

  Ideally this should have been inside a JSON block that exposes the same key/value pairs as listed during startup in the "llm_load_print_meta" function.

* Make string buffer dynamic
* Add doc and better string handling
* Using chat_template naming convention
* Use intermediate vector for string assignment

* add chatglm3-6b model support

  huggingface model: https://hf-mirror.com/THUDM/chatglm3-6b

  Signed-off-by: XingXing Qiao <[email protected]>

* remove .rotary_pos_emb.inv_freq and unused code for chatglm3 model

  Signed-off-by: XingXing Qiao <[email protected]>

* fix lint error

  Signed-off-by: XingXing Qiao <[email protected]>

* optimize convert-hf-to-gguf.py for chatglm model

  Signed-off-by: XingXing Qiao <[email protected]>

* support glm-4-9b-chat

  Signed-off-by: XingXing Qiao <[email protected]>

* fix eos tokens to glm4
* remove unused log
* add preprocess to chatglm3 and chatglm4
* add eos_id_list to llama.cpp
* fix code style
* fix code style
* fix conflicts
* fix conflicts
* Revert "add eos_id_list to llama.cpp"

  This reverts commit 3a4d579.

* set <|endoftext|> as eos and <|user|> as eot
* fix chat template bug
* add comment to glm prefix and suffix
* fix conflicts and add rope_ratio & ChatGLMForConditionalGeneration
* fix chat template bug
* fix codestyle
* fix conflicts
* modified the general name of glm model
* fix conflicts
* remove prefix and suffix
* use normal glm4 chat template & use LLM_FFN_SWIGLU in phi3
* fix: resolve Flake8 errors in `convert-hf-to-gguf.py`
  - Fix E302 by adding two blank lines before top-level function definitions
  - Replace print statements to fix NP100
  - Fix E303 by ensuring only one blank line between lines of code
* fix rope ratio to solve incorrect answers
* fix by comments

---------

Signed-off-by: XingXing Qiao <[email protected]>
Co-authored-by: XingXing Qiao <[email protected]>
Co-authored-by: Umpire2018 <[email protected]>

…8048)

CLI to hash GGUF files to detect differences at a per-model and per-tensor level.

The supported hash types are:

- `--xxh64`: use xxhash 64-bit hash mode (default)
- `--sha1`: use sha1
- `--uuid`: use uuid
- `--sha256`: use sha256

While most POSIX systems already have hash-checking programs like sha256sum, these are designed to check entire files. This is not ideal for our purpose if we want to check the consistency of the tensor data even when the metadata content of the gguf KV store has been updated.

This program is designed to hash a gguf tensor payload on a 'per tensor layer' basis in addition to an 'entire tensor model' hash. The intent is that the entire tensor model can be checked first, and if any inconsistencies are detected, the per-tensor hash can be used to narrow down the specific tensor layer that is inconsistent.

Co-authored-by: Georgi Gerganov <[email protected]>
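The two-level scheme described above (one hash over the whole tensor payload, plus one hash per tensor to localize a mismatch) can be sketched as follows. This is an illustration only, not the tool's implementation: it uses FNV-1a as a stand-in because it fits in a few lines, and FNV-1a is not one of the hash types the tool lists. The types `tensor_sketch` and `hash_report` and the functions `hash_tensors` and `first_mismatch` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory view of a tensor: a name plus its raw payload.
// Metadata (the KV store) is deliberately not part of what gets hashed.
struct tensor_sketch {
    std::string name;
    std::vector<uint8_t> data;
};

// 64-bit FNV-1a, used purely as a compact stand-in hash.
constexpr uint64_t FNV_OFFSET = 0xcbf29ce484222325ULL;
constexpr uint64_t FNV_PRIME  = 0x100000001b3ULL;

uint64_t fnv1a_update(uint64_t h, const std::vector<uint8_t> & bytes) {
    for (uint8_t b : bytes) {
        h ^= b;
        h *= FNV_PRIME;
    }
    return h;
}

struct hash_report {
    std::vector<uint64_t> per_tensor; // one hash per tensor payload
    uint64_t whole = FNV_OFFSET;      // one hash over all payloads in order
};

// Hashes every tensor payload on its own and also feeds the same bytes
// into a running hash that covers the whole model.
hash_report hash_tensors(const std::vector<tensor_sketch> & tensors) {
    hash_report r;
    for (const tensor_sketch & t : tensors) {
        r.per_tensor.push_back(fnv1a_update(FNV_OFFSET, t.data));
        r.whole = fnv1a_update(r.whole, t.data);
    }
    return r;
}

// Returns the index of the first tensor whose hash differs, or -1 when the
// two reports agree. A differing tensor count is reported at the first
// index that exists in only one of the reports.
int first_mismatch(const hash_report & a, const hash_report & b) {
    const std::size_t n = a.per_tensor.size() < b.per_tensor.size()
                              ? a.per_tensor.size() : b.per_tensor.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a.per_tensor[i] != b.per_tensor[i]) {
            return static_cast<int>(i);
        }
    }
    return a.per_tensor.size() == b.per_tensor.size() ? -1 : static_cast<int>(n);
}
```

A caller would compare the `whole` fields first and fall back to `first_mismatch` only when they differ, mirroring the workflow the commit message describes.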

* py : type-check all Python scripts with Pyright
* server-tests : use trailing slash in openai base_url
* server-tests : add more type annotations
* server-tests : strip "chat" from base_url in oai_chat_completions
* server-tests : model metadata is a dict
* ci : disable pip cache in type-check workflow

  The cache is not shared between branches, and it's 250MB in size, so it would become quite a big part of the 10GB cache limit of the repo.

* py : fix new type errors from master branch
* tests : fix test-tokenizer-random.py

  Apparently, gcc applies optimisations even when pre-processing, which confuses pycparser.

* ci : only show warnings and errors in python type-check

  The "information" level otherwise has entries from 'examples/pydantic_models_to_grammar.py', which could be confusing for someone trying to figure out what failed, considering that these messages can safely be ignored even though they look like errors.

Repeatedly calling `emplace_back` is slower than preallocating the vector to the vocab size and inserting the data directly. Rudimentary profiling with `chrono` shows this block of code improving from ~500us/op to ~40us/op. Overall, this slightly improves sampling performance, which has a more substantial impact on the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
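The difference between the two approaches can be sketched as below. This is a minimal illustration under assumptions, not the code changed in the commit: `token_data_sketch` and both function names are hypothetical, and no timing figures are implied by the sketch itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical candidate record: token id, its logit, and a probability slot.
struct token_data_sketch {
    int   id;
    float logit;
    float p;
};

// Grows the vector one element at a time. Without a prior reserve(), the
// vector reallocates and moves its contents several times as it grows.
std::vector<token_data_sketch> build_with_emplace_back(const std::vector<float> & logits) {
    std::vector<token_data_sketch> out;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out.emplace_back(token_data_sketch{static_cast<int>(i), logits[i], 0.0f});
    }
    return out;
}

// Sizes the vector to the vocabulary size up front and writes each slot
// directly, so there is exactly one allocation and no per-element growth check.
std::vector<token_data_sketch> build_preallocated(const std::vector<float> & logits) {
    std::vector<token_data_sketch> out(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = token_data_sketch{static_cast<int>(i), logits[i], 0.0f};
    }
    assert(out.size() == logits.size());
    return out;
}
```

Both functions produce the same contents; only the allocation pattern differs, which is where the commit message locates the speedup.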

* conv transpose 1d passing test for 1d input and kernel
* working for different input and output channel counts, added test for variable stride
* initial draft appears to work with stride other than 1
* working with all old and new conv1d tests
* added a test for large tensors
* removed use cuda hardcoding
* restored test-conv-transpose.c
* removed unused arguments, and fixed bug where test failure would cause subsequent tests to fail
* fixed accumulator bug
* added test to test-backend-ops
* fixed mistake
* addressed review
* fixed includes
* removed blank lines
* style and warning fixes
* return failure when test fails
* fix supports_op

---------

Co-authored-by: slaren <[email protected]>

* gguf-py : use classes for quants
* convert_hf : simplify internal quantization type selection
* gguf-py : fix flake8 lint
* gguf-py : fix BF16 numpy view type
* gguf-py : remove LlamaFileTypeMap

  Too specific to 'llama.cpp', and would be a maintenance burden to keep up to date.

* gguf-py : add generic quantize and dequantize functions

  The quant classes no longer need to be known, only the target or the source type, for 'quantize' and 'dequantize', respectively.

This commit adds the `--pooling` option to the README.md file in the `examples/embedding` directory.

The motivation for adding this option is that currently, if the model used does not specify a pooling type, the embedding example will fail with the following error message:

```console
main: error: pooling type NONE not supported
```

This commit also updates the name of the executable in the examples section.

* ggml: use vulkan as gpu backend when available

  Signed-off-by: Matt Stephenson <[email protected]>

* whisper: enable using vk as default buffer type

  Signed-off-by: Matt Stephenson <[email protected]>

---------

Signed-off-by: Matt Stephenson <[email protected]>

* init
* rename
* add run android for termux in readme
* add android readme
* add instructions in readme
* change name in readme
* Update README.md
* fixed line
* add result in readme
* random pos_embed
* add positions index
* change for ollama
* change for ollama
* better pos_embed in clip
* support ollama
* update cmakelist
* update cmakelist
* rename wrapper
* clear code
* replace and organize code
* add link
* sync master
* fix warnings
* fix warnings
* fix bug in bicubic resize when need resize image smaller
* receive review comments and modify
* receive review comments and modify
* put all code into llava dir
* fix quality problem in pr code
* change n_layer
* add space in "-1"
* imitate reshape bug of python code
* fix bug in clip
* fix issues for merging
* fix llama-minicpmv-cli in cmake file
* change pr readme
* fix code review
* remove in line 33 directory in the /cmakelists.txt (not in example, in the main dir)
* fix cmakefile
* add warn
* fix KEY_HAS_MINICPMV_PROJ
* remove load_image_size into clip_ctx
* remove the extern "C", MINICPMV_API
* fix uhd code for review comment
* delete minicpmv-wrapper in pr
* remove uhd_image_embed
* Modify 2 notes
* clip : style changes
* del common.h in clip
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix Type-Check error
* fix makefile error
* fix ubuntu-make error
* try fix clip
* try fix 1

---------

Co-authored-by: Hongji Zhu <[email protected]>
Co-authored-by: harvestingmoon <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* gguf-py : add T5ENCODER model architecture
* common : call llama_decode() during warmup only if the model has decoder
* convert-hf : add T5EncoderModel
* llama : add llama_model_has_decoder() API function
* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()
* llama : add support for LLM_ARCH_T5ENCODER
* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE
* llama-embedding : add support for encoder-only models

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>

…ronization overhead. (#8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.
  - Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
  - ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers seems to be sufficient, judging by the code, which either launches compute kernels or copies tensors.
* Fix small typo

---------

Co-authored-by: 0cc4m <[email protected]>

OuadiElfarouki and others added 30 commits July 5, 2024 13:23

Enabled more data types for oneMKL gemm_batch (#8236)

1f3e1b6

cmake : add GGML_BUILD and GGML_SHARED macro definitions (#8281)

1d894a7

llama : fix compile warning (#8304)

7ed03b8

Reorganize documentation pages (#8325)

be20e7f

* re-organize docs
* add link among docs
* add link to build docs
* fix style
* de-duplicate sections

update main readme (#8333)

60d83a0

added support for Authorization Bearer tokens when downloading model (#8307)

86e7299

* added support for Authorization Bearer tokens
* removed auth_token, removed set_ function, other small fixes
* Update common/common.cpp

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>

finetune: Rename an old command name in finetune.sh (#8344)

210eb9e

This patch replaces an old command "main" with "llama-cli" in finetune.sh. The part that I fixed is a comment, so it doesn't change the script's behavior.

Signed-off-by: Masanari Iida <[email protected]>

finetune: Rename command name in README.md (#8343)

b81ba1f

Rename an old command name "finetune" to "llama-finetune" in README.md Signed-off-by: Masanari Iida <[email protected]>

py : use cpu-only torch in requirements.txt (#8335)

d39130a

llama : fix n_rot default (#8348)

b504008

ggml-ci

readme : update bindings list (#8222)

f1948f1

* adding guile_llama_cpp to binding list
* fix formatting
* fix formatting

ci : add checks for cmake,make and ctest in ci/run.sh (#8200)

4090ea5

* Added checks for cmake, make and ctest
* Removed erroneous whitespace

Update llama-cli documentation (#8315)

a8db2a9

* Update README.md
* Update README.md
* Update README.md

  fixed llama-cli/main, templates on some cmds added chat template sections and fixed typos in some areas

* Update README.md
* Update README.md
* Update README.md

readme : add supported glm models (#8360)

04ce3a8

common : avoid unnecessary logits fetch (#8358)

ffd0079

infill : assert prefix/suffix tokens + remove old space logic (#8351)

6f0dbf6

tests : fix whitespace (#0)

6847d54

sync : ggml

2ee44c9

ggml-ci

scripts : fix sync for sycl

3f2d538

sycl : fix powf call in device code (#8368)

2ec846d

readme : fix web link error [no ci] (#8347)

c4dd11d

labeler : updated sycl to match docs and code refactor (#8373)

a130ecc

slaren and others added 29 commits August 7, 2024 13:29

ggml-backend : fix async copy from CPU (#8897)

be55695

* ggml-backend : fix async copy from CPU
* cuda : more reliable async copy, fix stream used when the devices are the same

make : use C compiler to build metal embed object (#8899)

15fa07a

* make : use C compiler to build metal embed object
* use rm + rmdir to avoid -r flag in rm

make : clean llamafile objects (#8923)

ebd541a

`ggml/src/llamafile/sgemm.o` was not deleted on `make clean`

metal : add abort callback (ggml/905)

85fca8d

metal : fix struct name (ggml/912)

5b33ea1

ggml-ci

ggml : ignore more msvc warnings (ggml/906)

f93d49a

sync : ggml

e44a561

scripts : fix sync filenames (#0)

366d486

scripts : sync cann files (#0)

afd27f0

llama : reduce useless copies when saving session (#8916)

345a686

* llama : avoid useless copies in dummy session writer
* llama : avoid double tensor copy when saving session to buffer

server : add one level list nesting for embeddings (#8936)

daef3ab

llama : fix typo in llama_tensor_get_type comment [no ci] (#8937)

6f6496b

sync : ggml

4305b57

llama : better replace_all (cont) (#8926)

45a55b9

* llama : better replace_all (cont)

  ggml-ci

* code : deduplicate replace_all

  ggml-ci

make : fix llava obj file race (#8946)

272e3bd

ggml-ci

llama : add support for lora adapters in T5 model (#8938)

6afd1a9

Co-authored-by: Stanisław Szymczyk <[email protected]>

Merge commit from fork

b72942f

gguf-py : fix double call to add_architecture() (#8952)

911b437

Signed-off-by: tarilabs <[email protected]>

llama : default n_swa for phi-3 (#8931)

7eb2384

* default n_swa for phi-3
* fix
* double check swa

metal : fix uninitialized abort_callback (#8968)

6e02327

llama : check all graph nodes when searching for result_embd_pooled (#8956)

33309f6

Co-authored-by: Stanisław Szymczyk <[email protected]>

update guide (#8909)

a21c6fd

Co-authored-by: Neo Zhang <>

flake.lock: Update (#8979)

8cd1bcf

Nexesenex closed this Aug 11, 2024
