b1860 #72

Nexesenex · 2024-01-13T18:31:09Z

No description provided.

* imatrix: 1st version
* imatrix: WIP
* Cleanup
* Update examples/imatrix/imatrix.cpp
  Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (#4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU
  ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model
  ggml-ci
* metal : page align the data ptr (#4854)
* Apply suggestions from code review
  Co-authored-by: Johannes Gäßler <[email protected]>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance
  ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>

* Create pydantic-models-to-grammar.py
* Added some comments for usage
* Refactored Grammar Generator
  Added example and usage instruction.
* Update pydantic_models_to_grammar.py
* Update pydantic-models-to-grammar-examples.py
* Renamed module and imported it.
* Update pydantic-models-to-grammar.py
* Renamed file and fixed grammar generator issue.

* metal : detect more GPU families
* metal : refactor kernel loading
* metal : set kernel family requirements
* metal : fix kernel init + fix compile options
* metal : take into account simdgroup reduction support
* metal : print only skipped kernels
* metal : fix check for simdgroup reduction support
* metal : check for Metal 3
* metal : free allocations
* metal : normalize encoder:setComputePipelineStatus calls
  ggml-ci
* metal : fix Metal3 family check
  ggml-ci
* metal : check for simdgroup matrix mul. feature
  ggml-ci

* examples : save-load-state: save only required state
* llama : only reserve `n_vocab * n_batch` at most for logits
  llama_decode asserts that only n_batch tokens are passed each call, and n_ctx is expected to be bigger than n_batch.
* llama : always reserve `n_vocab * n_batch` for logits
  llama_context de-serialization breaks if the contexts have differing capacity for logits and llama_decode will at maximum resize to `n_vocab * n_batch`.
* llama : only save and restore used logits
  for batch sizes of 512 this reduces save state in the best case by around 62 MB, which can be a lot if planning to save on each message to allow regenerating messages.
* llama : use ostringstream and istringstream for save and load
* llama : serialize rng into minimum amount of space required
* llama : break session version due to serialization changes
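The "around 62 MB" figure is consistent with a simple size calculation. The sketch below assumes a 32000-token vocabulary and 4-byte float logits; neither value is stated in the commit message, and `logits_bytes`, `kVocab` and `kBatch` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Bytes occupied by a logits buffer holding n_rows rows of n_vocab floats.
constexpr std::uint64_t logits_bytes(std::uint64_t n_vocab, std::uint64_t n_rows) {
    return n_vocab * n_rows * sizeof(float);
}

// Assumed values, not taken from the commit message.
constexpr std::uint64_t kVocab = 32000; // assumed vocabulary size
constexpr std::uint64_t kBatch = 512;   // batch size quoted in the commit message

// Full reservation: 32000 * 512 * 4 bytes = 65,536,000 bytes (62.5 MiB).
static_assert(logits_bytes(kVocab, kBatch) == 65536000, "full n_vocab * n_batch buffer");

// Best case, when only one row of logits is in use: 32000 * 4 bytes = 128,000 bytes.
static_assert(logits_bytes(kVocab, 1) == 128000, "single row of logits");
```

Under these assumptions, saving only the used logits instead of the full buffer trims the state by roughly the difference between the two sizes, which is close to the quoted saving.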

* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2
  PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up from 133.2 t/s.
* Fix AVX2
  In addition to fixing iq4_nl, it seems I never adjusted the AVX2 implementation for iq2_tn to the block scale removal? This commit also fixes that.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>

ggerganov and others added 30 commits January 11, 2024 21:58

swift : track ggml release branch (#4867)

b037787

main : disable token count by default (#4874)

3ca63b4

main : better name for variable n_print (#4874)

7edefbd

server : fix infill when prompt is empty (#4833)

1d11838

llama : fix llm_build_k_shift to use correct n_rot (#4889)

f445c0e

* llama : fix llm_build_k_shift to use correct n_rot
  ggml-ci
* llama : always use hparams.n_rot for ggml_rope_custom
  ggml-ci
* convert : fix persimmon conversion to write correct n_rot

py : fix lint (#4889)

2d00741

common : streamline the formatting of help (#4890)

4315a94

* common : streamline the formatting of help
  - Separate alternative parameters by a comma
  - Do not indent `--version` differently
* Update common/common.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>

llama : fix typo "imp_embd" -> "inp_embd"

3cabe80

CUDA: fix softmax compile for old CUDA versions (#4862)

1b280c9

gitignore : imatrix

5537d9d

llama.swiftui : update models layout (#4826)

e790eef

* Updated Models Layout
  - Added a models drawer
  - Added downloading directly from Hugging Face
  - Load custom models from local folder
  - Delete models by swiping left
* trimmed trailing white space
* Updated Models Layout

export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)

930f907

This commit replaces the magic number used in export-lora.cpp with the one defined in llama.h, which is indirectly included via common.h. Signed-off-by: Daniel Bevenius <[email protected]>
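As a rough illustration of why a named constant reads better than a bare literal, here is a minimal sketch of a magic check. `kMagicGGLA` is a local stand-in assumed to mirror the value of `LLAMA_FILE_MAGIC_GGLA` in llama.h, and `has_ggla_magic` is a hypothetical helper, not code from export-lora.cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: same value as LLAMA_FILE_MAGIC_GGLA in llama.h (ASCII 'ggla').
constexpr std::uint32_t kMagicGGLA = 0x67676c61u;

// Hypothetical helper: interprets the first four bytes of a buffer as a
// host-order 32-bit integer and compares it with the named constant.
inline bool has_ggla_magic(const unsigned char * data, std::size_t size) {
    if (size < sizeof(std::uint32_t)) {
        return false; // too short to contain a magic value
    }
    std::uint32_t magic = 0;
    std::memcpy(&magic, data, sizeof(magic)); // avoids unaligned reads
    return magic == kMagicGGLA;
}
```

With the constant defined in one place, a reader of the check no longer has to recognise the hexadecimal literal, and every tool that reads the format compares against the same value.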

llama : remove redundant assert for StableLM (#4901)

584d674

CUDA: faster q8_0 -> f16 dequantization (#4895)

3fe8178

backend_sched : fix assignments

fa5c1fb

ggml-ci

ggml : fix 32-bit ARM compat for IQ2_XS (whisper/1758)

f238461

* ggml : fix 32-bit ARM compat
* ggml : fix fix
* ggml : fix fix fix

sync : ggml

de473f5

convert : update phi-2 to latest HF repo (#4903)

15ebe59

* convert : update phi-2 to latest HF repo
  ggml-ci
* py : try to fix flake stuff

server : fix crash with multimodal models without BOS token (#4904)

ee8243a

server : fix deadlock that occurs in multi-prompt scenarios (#4905)

356327f

* fix deadlock
* don't ruin all whitespace

compare-llama-bench: tweak output format (#4910)

7dc7876

gguf : fix potential infinite for-loop (#4600)

c30b1ef

Co-authored-by: Bernhard Gstrein <[email protected]>

main : add parameter --no-display-prompt (#4541)

722d33f

* add the parameter `--no-display-prompt`; combined with `--log-disable`, it displays only the generated tokens
* remove empty line

---------

Co-authored-by: Georgi Gerganov <[email protected]>

workflows: unbreak nix-build-aarch64, and split it out (#4915)

6b48ed0

The fix should be just the `sudo apt-get update`

metal : disable log for loaded kernels (#4794)

2d57de5

ggerganov and others added 2 commits January 13, 2024 18:47

llama : fix detokenization of non-special added-tokens (#4916)

f172de0

Co-authored-by: goerch <[email protected]>

server : fix prompt caching with system prompt (#4914)

0ea069b

Nexesenex merged commit c7f60af into Nexesenex:_master_up Jan 13, 2024
4 checks passed
