forked from LostRuins/koboldcpp
b2532 #102 (Merged)

Nexesenex merged 64 commits into Nexesenex:Nexesenex-IQ1_XS-IQ1_S-quant-strategies from ggerganov:master on Mar 25, 2024.
Conversation
* Add MobileVLM_V2 backup
* Update MobileVLM-README.md
* Update examples/llava/MobileVLM-README.md
* Update examples/llava/convert-image-encoder-to-gguf.py
* clip : fix whitespace

Co-authored-by: Georgi Gerganov <[email protected]>
This reverts commit f8c4e74.
* server: version bump for httplib and json * fix build * bring back content_length
* cuda : refactor to remove global resources
* Add MobileVLM_V2 backup
* Update MobileVLM-README.md
* Update examples/llava/MobileVLM-README.md
* Update examples/llava/convert-image-encoder-to-gguf.py
* clip : fix whitespace
* fix definition mistake in clip.cpp

Co-authored-by: Georgi Gerganov <[email protected]>
* k_cache: be able to use Q5_0
* k_cache: be able to use Q5_1 on CUDA
* k_cache: be able to use Q5_0 on Metal
* k_cache: be able to use Q5_1 on Metal
* k_cache: be able to use IQ4_NL - just CUDA for now
* k_cache: be able to use IQ4_NL on Metal
* k_cache: add newly added supported types to llama-bench and CUDA supports_op

Co-authored-by: Iwan Kawrakow <[email protected]>
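These cache types are selected per context at creation time. A minimal sketch, assuming the `type_k`/`type_v` fields of `llama_context_params` as they appear in `llama.h` around this revision (this is not code from the commit itself):

```cpp
// Sketch: requesting a quantized K cache at context creation.
// Assumption: llama_context_params exposes type_k/type_v, as in llama.h near b2532.
#include "llama.h"

llama_context * context_with_q5_0_k_cache(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.type_k = GGML_TYPE_Q5_0;  // one of the newly supported K-cache types
    cparams.type_v = GGML_TYPE_F16;   // V cache left at its default precision
    return llama_new_context_with_model(model, cparams);
}
```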
* Initial commit - add mac prebuilds. * forward contribution credits for building the workflow. * minor : remove trailing whitespaces --------- Co-authored-by: Nicolas Patry <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>
* json: fix arrays (disallow `[,1]`)
* json: support tuple types (`[number, string]`)
* json: support additionalProperties (`{[k: string]: [string,number][]}`)
* json: support required / optional properties
* json: add support for pattern
* json: resolve $ref (and support https schema urls)
* json: fix $ref resolution
* json: support union types (mostly for nullable types I think)
* json: support allOf + nested anyOf
* json: support any (`{}` or `{type: object}`)
* json: fix merge
* json: temp fix for escapes
* json: spaces in output and unrestricted output spaces
* json: add typings
* json: fix typo
* Create ts-type-to-grammar.sh
* json: fix _format_literal (json.dumps already escapes quotes)
* json: merge lit sequences and handle negatives `{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}`
* json: handle pattern repetitions
* Update json-schema-to-grammar.mjs
* Create regex-to-grammar.py
* json: extract repeated regexp patterns to subrule
* Update json-schema-to-grammar.py (x3)
* json: handle schema from pydantic Optional fields
* Update json-schema-to-grammar.py (x2)
* Update ts-type-to-grammar.sh (x2)
* json: simplify nullable fields handling
* json: accept duplicate identical rules
* json: revert space to 1 at most
* json: reuse regexp pattern subrules
* json: handle uuid string format
* json: fix literal escapes
* json: add --allow-fetch
* json: simplify range escapes
* json: support negative ranges in patterns
* Delete commit.txt
* json: custom regex parser, adds dot support & JS-portable
* json: rm trailing spaces
* Update json-schema-to-grammar.mjs
* json: updated server & chat `( cd examples/server && ./deps.sh )`
* json: port fixes from mjs to python
* Update ts-type-to-grammar.sh
* json: support prefixItems alongside array items
* json: add date format + fix uuid
* json: add date, time, date-time formats
* json: preserve order of props from TS defs
* json: port schema converter to C++, wire in ./server
* json: nits
* Update json-schema-to-grammar.cpp (x3)
* json: fix mjs implementation + align outputs
* Update json-schema-to-grammar.mjs.hpp
* json: test C++, JS & Python versions
* json: nits + regen deps
* json: cleanup test
* json: revert from c++17 to 11
* json: nit fixes
* json: dirty include for test
* json: fix zig build
* json: pass static command to std::system in tests (fixed temp files)
* json: fix top-level $refs
* json: don't use c++20 designated initializers
* nit
* json: basic support for reserved names (`{number:{number:{root:number}}}`)
* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)
* json: re-ran server deps.sh
* json: simplify test
* json: support mix of additional props & required/optional
* json: add tests for some expected failures
* json: fix type=const in c++, add failure expectations for non-str const&enum
* json: test (& simplify output of) empty schema
* json: check parsing in test + fix value & string refs
* json: add server tests for OAI JSON response_format
* json: test/fix top-level anyOf
* json: improve grammar parsing failures
* json: test/fix additional props corner cases
* json: fix string patterns (was missing quotes)
* json: ws nit
* json: fix json handling in server when there's no response_format
* json: catch schema conversion errors in server
* json: don't complain about unknown format type in server if unset
* json: cleaner build of test
* json: create examples/json-schema-pydantic-example.py
* json: fix date pattern
* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common
* json: indent 4 spaces
* json: fix naming of top-level c++ function (+ drop unused one)
* json: avoid using namespace std
* json: fix zig build
* Update server.feature
* json: iostream -> fprintf
* json: space before & refs for consistency
* json: nits
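With the converter ported to C++ and moved into common, a tool or the server can obtain a GBNF grammar directly from a schema. A hedged sketch, where the function name and header are assumptions based on the commit messages above (`common/json-schema-to-grammar.{cpp,h}`), not verified against the diff:

```cpp
// Sketch: JSON schema -> GBNF grammar via the C++ port described above.
// Assumed: json_schema_to_grammar() takes an nlohmann::ordered_json schema.
#include "json-schema-to-grammar.h"
#include "json.hpp"
#include <string>

std::string grammar_for_qa_schema() {
    nlohmann::ordered_json schema = nlohmann::ordered_json::parse(R"({
        "type": "object",
        "properties": {
            "question": { "type": "string" },
            "response": { "type": "string" }
        },
        "required": ["question", "response"]
    })");
    // The returned grammar string can then constrain sampling, e.g. for the
    // server's OAI JSON response_format mentioned above.
    return json_schema_to_grammar(schema);
}
```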
* Make quantize_row_iq4_nl do the same thing as quantization on CUDA
* Make quantize_row_iq4_nl do the same thing as quantization on CUDA - this time for real; backend-ops tests pass
* Now fix test-quantize-fns

Co-authored-by: Iwan Kawrakow <[email protected]>
The stated file `./devops/main-server.Dockerfile` does not exist. I figure that `.devops/server-intel.Dockerfile` was meant.
* Fix params underscore convert to dash. * Update common/common.cpp --------- Co-authored-by: slaren <[email protected]>
* metal : require ne00 >= 128 for mat-mat kernels ggml-ci * llama : pad n_ctx by 32 ggml-ci
* metal : proper assert for mat-mat memory alignment ggml-ci * readme : add notice about the bug fix * metal : fix the fix ggml-ci
* split: support in llama_model_loader
* avoid copying the entire vector
* split: move llama_tensor_offset to llama_model_loader
* llama_model_loader: PR feedback:
  - use only one gguf_context for metadata only
  - store all ggml_context in a vector as the files and mappings
  - store all weights in a vector along with the source tensor
  - rename ctx_gguf to meta
  - rename ctx_meta to contexts
* avoid copying the entire vector
* simplify this by making these optional; switch some layer-creation tensors to optional
* handle optional tensors
* llama_model_loader: fail if backend cannot allocate buffer
* fix mmap buffer management
* llama_model_loader: map file to backend buffer only if the allocation succeeds
* llama_model_loader: only map tensors included in the context
* llama_model_loader: minor; use the same variable name for consistency, fix spacing in type casts
* llama_model_loader: fail if any backend buffer cannot be allocated
* spacing
* fix loop over pointer
* llama_model_loader: if the declared n_tensors does not equal the tensors loaded from the splits, throw an exception instead of asserting
* llama_model_loader: ensure the mappings vector has the expected size
* llama_model_loader: use at() instead of operator[] where lookups should never add to the map
* llama_model_loader: immediately add the backend buffer to the model buffers so it can be freed if an error occurs in the next allocation; reserve the expected size
* llama_model_loader: make sure the model mappings have enough capacity before allocating the backend buffer
* llama_model_loader: fix map -> unordered_map
* llama_split_prefix: use a clearer version; pass the destination max length, not the split path length
* llama : minor
* llama : introduce some typedef helpers
* docs: add model shards to hot topics
* llama_model_loader: put the mapping in a unique_ptr from the moment it is allocated
* fix llama_split_prefix

Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
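The split naming convention these commits rely on is exposed through two public helpers. A sketch under the assumption that `llama_split_path` and `llama_split_prefix` carry the signatures added around this change:

```cpp
// Sketch: building and parsing split-GGUF file names.
// Assumption: signatures of llama_split_path/llama_split_prefix as in llama.h here.
#include "llama.h"
#include <cstdio>

void demo_split_naming() {
    char split_path[512];
    // Expected to yield something like "models/ggml-model-q4_0-00002-of-00004.gguf"
    llama_split_path(split_path, sizeof(split_path), "models/ggml-model-q4_0", 2, 4);

    char prefix[512];
    // Per the "clearer version" bullet above: the destination max length is
    // passed, not the split path length.
    if (llama_split_prefix(prefix, sizeof(prefix), split_path, 2, 4) > 0) {
        printf("prefix: %s\n", prefix);  // recovers "models/ggml-model-q4_0"
    }
}
```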
* quantize: be able to specify the output tensor type * quantize: be able to specify the token embedding tensor type --------- Co-authored-by: Iwan Kawrakow <[email protected]>
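In API terms, these two commits add per-tensor type overrides to the quantization parameters. A minimal sketch, assuming the `output_tensor_type` and `token_embedding_type` fields as they appear in `llama_model_quantize_params` after this change:

```cpp
// Sketch: overriding the output and token-embedding tensor types when quantizing.
// Assumption: field names match llama.h after this change.
#include "llama.h"
#include <cstdint>

uint32_t quantize_with_overrides(const char * fname_inp, const char * fname_out) {
    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype                = LLAMA_FTYPE_MOSTLY_Q4_K_M;
    qparams.output_tensor_type   = GGML_TYPE_Q6_K; // keep the output tensor higher-precision
    qparams.token_embedding_type = GGML_TYPE_Q4_K; // explicit type for token embeddings
    return llama_model_quantize(fname_inp, fname_out, &qparams); // 0 on success
}
```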
* convert-llama2c-to-ggml: enable conversion of multiqueries, #5608 * add test in build action * Update build.yml * Update build.yml * Update build.yml * gg patch
* Add support for Grok model architecture
* Revert convert-hf-to-gguf to default options
* Fixed f_norm_rms_eps bug
* Fix whitespaces
* llama : fix grok rope type
* llama : minor

Co-authored-by: Georgi Gerganov <[email protected]>
* llama: llama_split_prefix fix - strncpy does not include string termination
* common: llama_load_model_from_url:
  - fix header name case sensitivity
  - support downloading additional splits in parallel
  - hide password in url
* common: EOL at EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if the file cannot be loaded in gguf
* server: tests: add split tests and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: enable back Release test on PR
* spacing (x3)

Co-authored-by: Georgi Gerganov <[email protected]>
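The password-hiding step is plain string surgery on the URL before it is logged. An illustrative sketch only, not the exact lambda from the commit:

```cpp
// Illustrative sketch: mask userinfo credentials in a URL before logging it.
// Not the verbatim lambda moved into llama_download_file.
#include <string>

static std::string hide_password_in_url(const std::string & url) {
    const size_t scheme_end = url.find("://");
    const size_t at_sign    = url.find('@');
    if (scheme_end == std::string::npos || at_sign == std::string::npos || at_sign < scheme_end) {
        return url;  // no userinfo component, nothing to mask
    }
    // Keep "scheme://", replace "user:password", keep "@host/path..." unchanged.
    return url.substr(0, scheme_end + 3) + "********" + url.substr(at_sign);
}
```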
also fix missing #defines before windows.h, and BPE LF token on MSVC
* support release win * fix value * fix value * fix value * fix error * fix error * fix format
* remove no USM methods * leave the schedule to ggml_backend_sched entirely
* sampling: remove duplicated code for probability distribution access
* free original_logits
* fix original_logits allocation
* fixes based on review @cebtenzzre
* change function name to `llama_sampling_prepare`
* imatrix : fix wname for mul_mat_id ops * also filter tensor names in mul_mat_id ops --------- Co-authored-by: slaren <[email protected]>
* would throw error on VS2022 on GGML_FREE(wmode)
* wchar_t is usually 2 bytes, but malloc wants bytes
* therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation
* fixes error possibly introduced by #6248
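The shape of the fix follows directly from the bullets above: the buffer must hold strlen(mode) + 1 wide characters, so the byte count passed to malloc has to be scaled by sizeof(wchar_t). A sketch of the corrected allocation, not the verbatim patch:

```cpp
// Sketch of the corrected allocation described above (not the verbatim patch):
// allocate (strlen(mode) + 1) * sizeof(wchar_t) bytes, then widen each char.
#include <cstdlib>
#include <cstring>

wchar_t * mode_to_wide(const char * mode) {
    const size_t n = strlen(mode) + 1;  // include the terminating NUL
    wchar_t * wmode = (wchar_t *) malloc(n * sizeof(wchar_t));  // bytes, not element count
    if (wmode) {
        for (size_t i = 0; i < n; i++) {
            wmode[i] = (wchar_t) mode[i];  // plain widening, as in the original loop
        }
    }
    return wmode;  // caller frees it (GGML_FREE in the original context)
}
```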
This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some architectures (e.g. AMD Zen 4).
* add `retrieval` example
* add README
* minor fixes
* cast filepos on print
* remove use of variable sized array
* store similarities in separate vector
* print error on insufficient batch size
* fix error message printing
* assign n_batch value to n_ubatch
* fix param definitions
* define retrieval-only parameters in retrieval.cpp
* fix `--context-file` option to be provided multiple times for multiple files
* use vector for `query_emb`
* add usage description in README
* fix merge conflict
* fix usage printing
* remove seed setting
* fix lint
* increase file read buffer size
* retrieval : minor

Co-authored-by: Georgi Gerganov <[email protected]>
* fix LOG() error for SYCL, enhance error check by CI * rollback to bash * add newline at end of file
* server: clean up oai parsing function * fix response_format * fix empty response_format * minor fixes * add TODO for logprobs * update docs
Flake lock file updates: • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14) → 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Iwan Kawrakow <[email protected]>
Since no blas was provided to buildInputs, the executable is built without blas support. This is a backport of NixOS/nixpkgs#298567
Nexesenex merged commit f4949bc into Nexesenex:Nexesenex-IQ1_XS-IQ1_S-quant-strategies on Mar 25, 2024. 112 of 163 checks passed.
Nexesenex pushed a commit that referenced this pull request on Dec 22, 2024:

* Add Granite and GraniteMoE models
* Granite: avoid NaNs on CUDA by scaling Q before K*Q multiplication

Co-authored-by: Iwan Kawrakow <[email protected]>