b1827 #71

Nexesenex · 2024-01-11T19:54:21Z

No description provided.

Uses ggml functions instead of hardcoded names and adds support for quantizing into the modern Q-K variants. This is just the bare minimum to get k-types working; a more refined choice of types would be needed to get the best quality at low quantizations. I ran a few tests: it doesn't break anything that I could notice, and a Q6_K ViT works almost as well as Q8_0 but at 3 times the inference speed.

* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state

  When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* updated `server` readme to document the `/health` endpoint too
* used LOG_INFO after successful model loading
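The state handling described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from `server.cpp`: the state names come from the commit message, while the HTTP status codes, the JSON bodies, and the helper names `health_response` and `finish_loading` are assumptions made for the example.

```cpp
#include <atomic>
#include <string>
#include <utility>

// State names taken from the commit message.
enum class ServerState { LOADING_MODEL, READY, ERROR };

// Atomic so the HTTP thread can read the state while the loader thread writes it.
std::atomic<ServerState> server_state{ServerState::LOADING_MODEL};

// Hypothetical helper: returns (HTTP status, body) for a /health request.
// The status codes and message bodies are assumptions for illustration.
std::pair<int, std::string> health_response() {
    switch (server_state.load()) {
        case ServerState::READY:
            return {200, R"({"status": "ok"})"};
        case ServerState::LOADING_MODEL:
            return {503, R"({"status": "loading model"})"};
        case ServerState::ERROR:
            return {500, R"({"status": "error"})"};
    }
    return {500, R"({"status": "error"})"};
}

// Hypothetical loader hook: called once model loading has finished.
void finish_loading(bool ok) {
    server_state.store(ok ? ServerState::READY : ServerState::ERROR);
}
```

Because the HTTP server is started before the model is initialized, a client polling `/health` during startup would see the loading response first and the ready or error response afterwards.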

* server: added support for multiple api keys, added loading api keys from file
* minor: fix whitespace
* added file error handling to --api-key-file, changed code to better reflect current style
* server: update README.md for --api-key-file

---------

Co-authored-by: Michael Coppola <[email protected]>
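As a rough sketch of what loading several API keys from a file could look like: the one-key-per-line format, the rule that an empty key list disables the check, and the helper names `load_api_keys` and `is_valid_key` are all assumptions for this example rather than details confirmed by the commit message.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read one key per line (assumed format), skipping blank lines.
std::vector<std::string> load_api_keys(std::istream & in) {
    std::vector<std::string> keys;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back(); // tolerate Windows line endings
        }
        if (!line.empty()) {
            keys.push_back(line);
        }
    }
    return keys;
}

// Hypothetical helper: accept the request if the presented key matches any configured key.
bool is_valid_key(const std::vector<std::string> & keys, const std::string & presented) {
    if (keys.empty()) {
        return true; // assumption: no keys configured means the check is disabled
    }
    return std::find(keys.begin(), keys.end(), presented) != keys.end();
}
```

The loader takes any `std::istream`, so it works with a `std::ifstream` opened on the key file as well as with a `std::istringstream` in a test.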

* iq2_xs: basics
* iq2_xs: this should have been in the basics
* iq2_xs: CUDA and scalar CPU works
* iq2_xs: WIP Metal
* iq2_xs: Metal now works
* iq2_xs: working, but dog slow, ARM_NEON dot product
* iq2_xs: better ARM_NEON dot product

  We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU.
* iq2_xs: AVX2 dot product - 19.5 t/s
* iq2_xs: faster AVX2 dot product

  21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version.
* iq2_xs: had forgotten to delete iq2-data.h
* Add llama enum for IQ2_XS

---------

Co-authored-by: Iwan Kawrakow <[email protected]>

* Restore intended k-quants quantization mixes for MoE models
* Update Q2_K_S values in the quantize tool

  Still using LLaMA-v1 PPL values in the quant description today does not make much sense. But let's leave this update for another PR.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

cmp-nct and others added 23 commits January 10, 2024 15:37

llama : recognize 1B phi models (#4847)

329ff61

This update categorizes models with 24 layers as MODEL_1B, ensuring compatibility with different Phi model variants without impacting existing Phi-2 model functionality.
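A layer-count check of this kind might look like the sketch below. The enum and function names are illustrative, and the mapping of 32 layers to a 3B category for Phi-2 is an assumption; only the 24-layer to `MODEL_1B` rule is stated in the commit message.

```cpp
#include <cstdint>

// Illustrative enum; MODEL_1B is named in the commit, the other members are assumptions.
enum class ModelType { UNKNOWN, MODEL_1B, MODEL_3B };

// Hypothetical helper: map a Phi-family layer count to a size category.
ModelType classify_phi(uint32_t n_layer) {
    switch (n_layer) {
        case 24: return ModelType::MODEL_1B; // newly recognized 1B variants
        case 32: return ModelType::MODEL_3B; // Phi-2 (assumed mapping), left unchanged
        default: return ModelType::UNKNOWN;
    }
}
```

Adding a new `case` rather than changing an existing one is what keeps the Phi-2 path unaffected.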

llama : add additional suffixes for model params (#4834)

57d016b

* llm_load_print_meta: Add additional suffixes for model params
* Update llama.cpp model param log

  remove unneeded comments and convert from > to >=
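To illustrate why the comparison operator matters when choosing a suffix, here is a small sketch. The function name `format_param_count`, the thresholds at 1e6, 1e9 and 1e12, and the output format are assumptions for the example, not a copy of the logging code in `llama.cpp`.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format a parameter count with a K/M/B/T suffix.
// With a strict '>' a count of exactly 1e9 would fall through to the
// "M" branch and print "1000.00 M"; '>=' makes it print "1.00 B".
std::string format_param_count(uint64_t n) {
    char buf[64];
    const double x = static_cast<double>(n);
    if (x >= 1e12) {
        std::snprintf(buf, sizeof(buf), "%.2f T", x / 1e12);
    } else if (x >= 1e9) {
        std::snprintf(buf, sizeof(buf), "%.2f B", x / 1e9);
    } else if (x >= 1e6) {
        std::snprintf(buf, sizeof(buf), "%.2f M", x / 1e6);
    } else {
        std::snprintf(buf, sizeof(buf), "%.2f K", x / 1e3);
    }
    return std::string(buf);
}
```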

server : fix build + rename enums (#4870)

5c1980d

fix : cuda order of synchronization when setting a buffer (ggml/679)

f34432c

* fix : cuda order of synchronization when setting a buffer
* also sync before memcpy

---------

Co-authored-by: slaren <[email protected]>

Fix execlp call (ggml/689)

c910e3c

NULL can be an integer constant expression with the value zero; in that case the behavior would be undefined, because a value of the wrong type is passed through the variable arguments.
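The issue can be shown with any variadic function that expects a null pointer as its terminator, which is the convention `execlp` uses. The sketch below uses a hypothetical helper, `count_args`, instead of calling `execlp` itself, so it can be run without replacing the current process.

```cpp
#include <cstdarg>

// Hypothetical helper with the same convention as execlp's argument list:
// a run of const char * values terminated by a null *pointer*.
int count_args(const char * first, ...) {
    int n = 0;
    va_list ap;
    va_start(ap, first);
    // Each va_arg reads a pointer-sized value. If the caller passed a plain
    // integer 0 as the terminator, this read would have the wrong type.
    for (const char * s = first; s != nullptr; s = va_arg(ap, const char *)) {
        ++n;
    }
    va_end(ap);
    return n;
}
```

A safe call passes a terminator that is explicitly a pointer, for example `count_args("ls", "-l", static_cast<const char *>(nullptr))`. The equivalent form for the real function is `execlp("ls", "ls", "-l", (char *) NULL);`.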

ggml : change GGML_MAX_NAME at compile time (ggml/682)

e739de7

* change GGML_MAX_NAME to 128
* allow controlling the value of GGML_MAX_NAME through external macro definitions

metal : wrap each operation in debug group (ggml/690)

5362e43

ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693)

f85a973

metal : fix deprecation warning (ggml/690)

3267c2a

sync : ggml

64802ec

metal : put encoder debug group behind a define (#4873)

2a7c94d

server : fix typo in model name (#4876)

2f04332

main : print total token count and tokens consumed so far (#4874)

43f76bf

* Token count changes
* Add show token count
* Updating before PR
* Two requested changes
* Move param def posn

ci: nix-flake-update: new token with pr permissions (#4879)

d8d90aa

* ci: nix-flake-update: new token with pr permissions

---------

Co-authored-by: Georgi Gerganov <[email protected]>

server : implement credentialed CORS (#4514)

4330bd8

* Implement credentialed CORS according to MDN
* Fix syntax error
* Move validate_api_key up so it is defined before its first usage
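For context, a credentialed CORS response cannot use the wildcard `*` for `Access-Control-Allow-Origin`; the server has to name a specific origin and also send `Access-Control-Allow-Credentials: true`. A minimal sketch of building those headers follows. The helper name `cors_headers` and the use of a `std::map` are assumptions for the example and do not reflect how `server.cpp` is structured.

```cpp
#include <map>
#include <string>

// Hypothetical helper: build CORS response headers for a credentialed request.
std::map<std::string, std::string> cors_headers(const std::string & request_origin) {
    std::map<std::string, std::string> headers;
    if (request_origin.empty()) {
        return headers; // no Origin header: not a cross-origin browser request
    }
    // Echo the specific origin; "*" is not permitted together with credentials.
    headers["Access-Control-Allow-Origin"]      = request_origin;
    headers["Access-Control-Allow-Credentials"] = "true";
    // The response now depends on the Origin header, so caches must know.
    headers["Vary"] = "Origin";
    return headers;
}
```

Echoing back any origin is the simplest form; a deployment that cares about which sites may send credentials would check the origin against an allowlist before echoing it.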

swift : pin ggml commit + remove ggml.h from spm-headers (#4878)

3ba5b8c

ggml-ci

Nexesenex merged commit 23056db into Nexesenex:_master_up Jan 11, 2024
30 of 41 checks passed

Nexesenex pushed a commit that referenced this pull request Dec 22, 2024

iqk_mul_mat: better strategy when nrc_y not divisible by ny (#71)

8cba478

Co-authored-by: Iwan Kawrakow <[email protected]>
