forked from LostRuins/koboldcpp
b2532 #102 (Merged)

Nexesenex merged 64 commits into Nexesenex:Nexesenex-IQ1_XS-IQ1_S-quant-strategies from ggerganov:master on Mar 25, 2024.
Conversation
* Add MobileVLM_V2 backup
* Update MobileVLM-README.md
* Update examples/llava/MobileVLM-README.md
* Update examples/llava/convert-image-encoder-to-gguf.py
* clip : fix whitespace

Co-authored-by: Georgi Gerganov <[email protected]>
This reverts commit f8c4e74.
* server: version bump for httplib and json * fix build * bring back content_length
* cuda : refactor to remove global resources
* Add MobileVLM_V2 backup
* Update MobileVLM-README.md
* Update examples/llava/MobileVLM-README.md
* Update examples/llava/convert-image-encoder-to-gguf.py
* clip : fix whitespace
* fix definition mistake in clip.cpp

Co-authored-by: Georgi Gerganov <[email protected]>
* k_cache: be able to use Q5_0
* k_cache: be able to use Q5_1 on CUDA
* k_cache: be able to use Q5_0 on Metal
* k_cache: be able to use Q5_1 on Metal
* k_cache: be able to use IQ4_NL - just CUDA for now
* k_cache: be able to use IQ4_NL on Metal
* k_cache: add newly added supported types to llama-bench and CUDA supports_op

Co-authored-by: Iwan Kawrakow <[email protected]>
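These cache types are selected per context at creation time. A minimal sketch, assuming the `type_k`/`type_v` fields of `llama_context_params` as they appear in `llama.h` around this revision (this is not code from the commit itself):

```cpp
// Sketch: requesting a quantized K cache at context creation.
// Assumption: llama_context_params exposes type_k/type_v, as in llama.h near b2532.
#include "llama.h"

llama_context * context_with_q5_0_k_cache(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.type_k = GGML_TYPE_Q5_0;  // one of the newly supported K-cache types
    cparams.type_v = GGML_TYPE_F16;   // V cache left at its default precision
    return llama_new_context_with_model(model, cparams);
}
```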
* Initial commit - add mac prebuilds. * forward contribution credits for building the workflow. * minor : remove trailing whitespaces --------- Co-authored-by: Nicolas Patry <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>
* json: fix arrays (disallow `[,1]`)
* json: support tuple types (`[number, string]`)
* json: support additionalProperties (`{[k: string]: [string,number][]}`)
* json: support required / optional properties
* json: add support for pattern
* json: resolve $ref (and support https schema urls)
* json: fix $ref resolution
* json: support union types (mostly for nullable types I think)
* json: support allOf + nested anyOf
* json: support any (`{}` or `{type: object}`)
* json: fix merge
* json: temp fix for escapes
* json: spaces in output and unrestricted output spaces
* json: add typings
* json: fix typo
* Create ts-type-to-grammar.sh
* json: fix _format_literal (json.dumps already escapes quotes)
* json: merge lit sequences and handle negatives `{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}`
* json: handle pattern repetitions
* Update json-schema-to-grammar.mjs
* Create regex-to-grammar.py
* json: extract repeated regexp patterns to subrule
* Update json-schema-to-grammar.py (x3)
* json: handle schema from pydantic Optional fields
* Update json-schema-to-grammar.py (x2)
* Update ts-type-to-grammar.sh (x2)
* json: simplify nullable fields handling
* json: accept duplicate identical rules
* json: revert space to 1 at most
* json: reuse regexp pattern subrules
* json: handle uuid string format
* json: fix literal escapes
* json: add --allow-fetch
* json: simplify range escapes
* json: support negative ranges in patterns
* Delete commit.txt
* json: custom regex parser, adds dot support & JS-portable
* json: rm trailing spaces
* Update json-schema-to-grammar.mjs
* json: updated server & chat `( cd examples/server && ./deps.sh )`
* json: port fixes from mjs to python
* Update ts-type-to-grammar.sh
* json: support prefixItems alongside array items
* json: add date format + fix uuid
* json: add date, time, date-time formats
* json: preserve order of props from TS defs
* json: port schema converter to C++, wire in ./server
* json: nits
* Update json-schema-to-grammar.cpp (x3)
* json: fix mjs implementation + align outputs
* Update json-schema-to-grammar.mjs.hpp
* json: test C++, JS & Python versions
* json: nits + regen deps
* json: cleanup test
* json: revert from c++17 to 11
* json: nit fixes
* json: dirty include for test
* json: fix zig build
* json: pass static command to std::system in tests (fixed temp files)
* json: fix top-level $refs
* json: don't use c++20 designated initializers
* nit
* json: basic support for reserved names (`{number:{number:{root:number}}}`)
* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)
* json: re-ran server deps.sh
* json: simplify test
* json: support mix of additional props & required/optional
* json: add tests for some expected failures
* json: fix type=const in c++, add failure expectations for non-str const&enum
* json: test (& simplify output of) empty schema
* json: check parsing in test + fix value & string refs
* json: add server tests for OAI JSON response_format
* json: test/fix top-level anyOf
* json: improve grammar parsing failures
* json: test/fix additional props corner cases
* json: fix string patterns (was missing quotes)
* json: ws nit
* json: fix json handling in server when there's no response_format
* json: catch schema conversion errors in server
* json: don't complain about unknown format type in server if unset
* json: cleaner build of test
* json: create examples/json-schema-pydantic-example.py
* json: fix date pattern
* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common
* json: indent 4 spaces
* json: fix naming of top-level c++ function (+ drop unused one)
* json: avoid using namespace std
* json: fix zig build
* Update server.feature
* json: iostream -> fprintf
* json: space before & refs for consistency
* json: nits
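With the converter ported to C++ and moved into common, a tool or the server can obtain a GBNF grammar directly from a schema. A hedged sketch, where the function name and header are assumptions based on the commit messages above (`common/json-schema-to-grammar.{cpp,h}`), not verified against the diff:

```cpp
// Sketch: JSON schema -> GBNF grammar via the C++ port described above.
// Assumed: json_schema_to_grammar() takes an nlohmann::ordered_json schema.
#include "json-schema-to-grammar.h"
#include "json.hpp"
#include <string>

std::string grammar_for_qa_schema() {
    nlohmann::ordered_json schema = nlohmann::ordered_json::parse(R"({
        "type": "object",
        "properties": {
            "question": { "type": "string" },
            "response": { "type": "string" }
        },
        "required": ["question", "response"]
    })");
    // The returned grammar string can then constrain sampling, e.g. for the
    // server's OAI JSON response_format mentioned above.
    return json_schema_to_grammar(schema);
}
```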
* Make quantize_row_iq4_nl do the same thing as quantization on CUDA
* Make quantize_row_iq4_nl do the same thing as quantization on CUDA - this time for real; backend-ops tests pass
* Now fix test-quantize-fns

Co-authored-by: Iwan Kawrakow <[email protected]>
The stated file `./devops/main-server.Dockerfile` does not exist. I figure that `.devops/server-intel.Dockerfile` was meant.
* Fix params underscore convert to dash. * Update common/common.cpp --------- Co-authored-by: slaren <[email protected]>
* metal : require ne00 >= 128 for mat-mat kernels ggml-ci * llama : pad n_ctx by 32 ggml-ci
* metal : proper assert for mat-mat memory alignment ggml-ci * readme : add notice about the bug fix * metal : fix the fix ggml-ci
* split: support in llama_model_loader
* avoid copying the entire vector
* split: move llama_tensor_offset to llama_model_loader
* llama_model_loader: PR feedback:
  - use only one gguf_context for metadata only
  - store all ggml_context in a vector as the files and mappings
  - store all weights in a vector along with the source tensor
  - rename ctx_gguf to meta
  - rename ctx_meta to contexts
* avoid copying the entire vector
* simplify this by making these optional; switch some layer-creation tensors to optional
* handle optional tensors
* llama_model_loader: fail if backend cannot allocate buffer
* fix mmap buffer management
* llama_model_loader: map file to backend buffer only if the allocation succeeds
* llama_model_loader: only map tensors included in the context
* llama_model_loader: minor; use the same variable name for consistency, fix spacing in type casts
* llama_model_loader: fail if any backend buffer cannot be allocated
* spacing
* fix loop over pointer
* llama_model_loader: if the declared n_tensors does not equal the tensors loaded from the splits, throw an exception instead of asserting
* llama_model_loader: ensure the mappings vector has the expected size
* llama_model_loader: use at() instead of operator[] where lookups should never add to the map
* llama_model_loader: immediately add the backend buffer to the model buffers so it can be freed if an error occurs in the next allocation; reserve the expected size
* llama_model_loader: make sure the model mappings have enough capacity before allocating the backend buffer
* llama_model_loader: fix map -> unordered_map
* llama_split_prefix: use a clearer version; pass the destination max length, not the split path length
* llama : minor
* llama : introduce some typedef helpers
* docs: add model shards to hot topics
* llama_model_loader: put the mapping in a unique_ptr from the moment it is allocated
* fix llama_split_prefix

Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
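The split naming convention these commits rely on is exposed through two public helpers. A sketch under the assumption that `llama_split_path` and `llama_split_prefix` carry the signatures added around this change:

```cpp
// Sketch: building and parsing split-GGUF file names.
// Assumption: signatures of llama_split_path/llama_split_prefix as in llama.h here.
#include "llama.h"
#include <cstdio>

void demo_split_naming() {
    char split_path[512];
    // Expected to yield something like "models/ggml-model-q4_0-00002-of-00004.gguf"
    llama_split_path(split_path, sizeof(split_path), "models/ggml-model-q4_0", 2, 4);

    char prefix[512];
    // Per the "clearer version" bullet above: the destination max length is
    // passed, not the split path length.
    if (llama_split_prefix(prefix, sizeof(prefix), split_path, 2, 4) > 0) {
        printf("prefix: %s\n", prefix);  // recovers "models/ggml-model-q4_0"
    }
}
```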
* quantize: be able to specify the output tensor type * quantize: be able to specify the token embedding tensor type --------- Co-authored-by: Iwan Kawrakow <[email protected]>
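In API terms, these two commits add per-tensor type overrides to the quantization parameters. A minimal sketch, assuming the `output_tensor_type` and `token_embedding_type` fields as they appear in `llama_model_quantize_params` after this change:

```cpp
// Sketch: overriding the output and token-embedding tensor types when quantizing.
// Assumption: field names match llama.h after this change.
#include "llama.h"
#include <cstdint>

uint32_t quantize_with_overrides(const char * fname_inp, const char * fname_out) {
    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype                = LLAMA_FTYPE_MOSTLY_Q4_K_M;
    qparams.output_tensor_type   = GGML_TYPE_Q6_K; // keep the output tensor higher-precision
    qparams.token_embedding_type = GGML_TYPE_Q4_K; // explicit type for token embeddings
    return llama_model_quantize(fname_inp, fname_out, &qparams); // 0 on success
}
```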
* convert-llama2c-to-ggml: enable conversion of multiqueries, #5608 * add test in build action * Update build.yml * Update build.yml * Update build.yml * gg patch
* Add support for Grok model architecture
* Revert convert-hf-to-gguf to default options
* Fixed f_norm_rms_eps bug
* Fix whitespaces
* llama : fix grok rope type
* llama : minor

Co-authored-by: Georgi Gerganov <[email protected]>
* llama: llama_split_prefix fix - strncpy does not include string termination
* common: llama_load_model_from_url:
  - fix header name case sensitivity
  - support downloading additional splits in parallel
  - hide password in url
* common: EOL at EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if the file cannot be loaded in gguf
* server: tests: add split tests and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: enable back Release test on PR
* spacing (x3)

Co-authored-by: Georgi Gerganov <[email protected]>
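The password-hiding step is plain string surgery on the URL before it is logged. An illustrative sketch only, not the exact lambda from the commit:

```cpp
// Illustrative sketch: mask userinfo credentials in a URL before logging it.
// Not the verbatim lambda moved into llama_download_file.
#include <string>

static std::string hide_password_in_url(const std::string & url) {
    const size_t scheme_end = url.find("://");
    const size_t at_sign    = url.find('@');
    if (scheme_end == std::string::npos || at_sign == std::string::npos || at_sign < scheme_end) {
        return url;  // no userinfo component, nothing to mask
    }
    // Keep "scheme://", replace "user:password", keep "@host/path..." unchanged.
    return url.substr(0, scheme_end + 3) + "********" + url.substr(at_sign);
}
```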
also fix missing #defines before windows.h, and BPE LF token on MSVC
* support release win * fix value * fix value * fix value * fix error * fix error * fix format
* remove no USM methods * leave the schedule to ggml_backend_sched entirely
* sampling: remove duplicated code for probability distribution access
* free original_logits
* fix original_logits allocation
* fixes based on review @cebtenzzre
* change function name to `llama_sampling_prepare`
* imatrix : fix wname for mul_mat_id ops * also filter tensor names in mul_mat_id ops --------- Co-authored-by: slaren <[email protected]>
* would throw error on VS2022 on GGML_FREE(wmode)
* wchar_t is usually 2 bytes, but malloc wants bytes
* therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation
* fixes error possibly introduced by #6248
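The shape of the fix follows directly from the bullets above: the buffer must hold strlen(mode) + 1 wide characters, so the byte count passed to malloc has to be scaled by sizeof(wchar_t). A sketch of the corrected allocation, not the verbatim patch:

```cpp
// Sketch of the corrected allocation described above (not the verbatim patch):
// allocate (strlen(mode) + 1) * sizeof(wchar_t) bytes, then widen each char.
#include <cstdlib>
#include <cstring>

wchar_t * mode_to_wide(const char * mode) {
    const size_t n = strlen(mode) + 1;  // include the terminating NUL
    wchar_t * wmode = (wchar_t *) malloc(n * sizeof(wchar_t));  // bytes, not element count
    if (wmode) {
        for (size_t i = 0; i < n; i++) {
            wmode[i] = (wchar_t) mode[i];  // plain widening, as in the original loop
        }
    }
    return wmode;  // caller frees it (GGML_FREE in the original context)
}
```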
This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some architectures (e.g. AMD Zen 4).
* add `retrieval` example
* add README
* minor fixes
* cast filepos on print
* remove use of variable sized array
* store similarities in separate vector
* print error on insufficient batch size
* fix error message printing
* assign n_batch value to n_ubatch
* fix param definitions
* define retrieval-only parameters in retrieval.cpp
* fix `--context-file` option to be provided multiple times for multiple files
* use vector for `query_emb`
* add usage description in README
* fix merge conflict
* fix usage printing
* remove seed setting
* fix lint
* increase file read buffer size
* retrieval : minor

Co-authored-by: Georgi Gerganov <[email protected]>
* fix LOG() error for SYCL, enhance error check by CI * rollback to bash * add newline at end of file
* server: clean up oai parsing function * fix response_format * fix empty response_format * minor fixes * add TODO for logprobs * update docs
Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14) → 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <[email protected]>
Since no blas was provided to buildInputs, the executable is built without blas support. This is a backport of NixOS/nixpkgs#298567
Nexesenex merged commit f4949bc into Nexesenex:Nexesenex-IQ1_XS-IQ1_S-quant-strategies on Mar 25, 2024. 112 of 163 checks passed.
Nexesenex pushed a commit that referenced this pull request on Dec 22, 2024:

* Add Granite and GraniteMoE models
* Granite: avoid NaNs on CUDA by scaling Q before K*Q multiplication

Co-authored-by: Iwan Kawrakow <[email protected]>