forked from LostRuins/koboldcpp
b2069 #81 (Merged)
Conversation
* Replace tanh to avoid NaN in gelu shader on AMD proprietary driver
* Fix another Vulkan CPY buffer size bug
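For reference, a common way to avoid NaN here is to compute tanh through exp of a negative magnitude, which decays to zero for large inputs instead of overflowing. A minimal C++ sketch of that idea, not the actual shader change (the helper names are made up):

```cpp
#include <cmath>

// tanh(x) = (1 - e^(-2|x|)) / (1 + e^(-2|x|)), with the sign restored.
// exp(-2|x|) stays in (0, 1], so nothing overflows for large |x|.
static float safe_tanh(float x) {
    const float e = std::exp(-2.0f * std::fabs(x));
    const float t = (1.0f - e) / (1.0f + e);
    return x < 0.0f ? -t : t;
}

// Standard tanh-approximation GELU, using the safe tanh above.
static float gelu(float x) {
    const float SQRT_2_OVER_PI = 0.7978845608f; // sqrt(2/pi)
    return 0.5f * x * (1.0f + safe_tanh(SQRT_2_OVER_PI * (x + 0.044715f * x * x * x)));
}
```
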
* add --no-mmap, show sycl backend
* fix conflict
* fix code format, change print for --no-mmap
* rename no_mmap to mmap; show mmap in the printer only when it is not the default value
* update guide for mmap
* move position to reduce model reload
The llama_batch_init allocates memory for a fixed number of tokens; however, llama_batch_free only frees memory for the number of tokens that were added to the batch. This changeset uses a null-terminated array for the batch seq_id and frees all the elements until the nullptr is reached. It also renames the first parameter from `n_tokens` to `n_tokens_alloc` to indicate more clearly that this value is the number of tokens allocated to the batch, not the number of tokens in the batch.
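A minimal sketch of the scheme described above, simplified from the real llama_batch (which carries more arrays than just seq_id):

```cpp
#include <cstdint>
#include <cstdlib>

struct batch_sketch {
    int32_t ** seq_id; // n_tokens_alloc + 1 entries; the last one is nullptr
};

batch_sketch batch_init(int32_t n_tokens_alloc, int32_t n_seq_max) {
    batch_sketch b{};
    // one extra slot serves as the null terminator
    b.seq_id = (int32_t **) malloc(sizeof(int32_t *) * (n_tokens_alloc + 1));
    for (int32_t i = 0; i < n_tokens_alloc; ++i) {
        b.seq_id[i] = (int32_t *) malloc(sizeof(int32_t) * n_seq_max);
    }
    b.seq_id[n_tokens_alloc] = nullptr;
    return b;
}

void batch_free(batch_sketch & b) {
    if (b.seq_id) {
        // walk until the nullptr sentinel, so every allocated row is freed
        // regardless of how many tokens were actually added to the batch
        for (int32_t i = 0; b.seq_id[i] != nullptr; ++i) {
            free(b.seq_id[i]);
        }
        free(b.seq_id);
    }
}
```
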
* update guide for make installation, memory, gguf model link; remove todo for windows build
* add vs install requirement
* update for gpu device check
* update help of llama-bench
* fix grammar issues
* get max alloc size from device prop
* fix macro typo
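The device property in question is presumably queried along these lines; this is a sketch using the standard SYCL 2020 API, not the exact ggml-sycl code:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q{sycl::default_selector_v};
    const auto dev = q.get_device();
    // largest single allocation the device supports, in bytes
    const auto max_alloc = dev.get_info<sycl::info::device::max_mem_alloc_size>();
    std::cout << dev.get_info<sycl::info::device::name>()
              << ": max_mem_alloc_size = " << max_alloc << " bytes\n";
    return 0;
}
```
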
* add vulkan dockerfile
* intel dockerfile: compile sycl by default
* fix vulkan dockerfile
* add docs for vulkan
* docs: sycl build in docker
* docs: remove trailing spaces
* docs: sycl: add docker section
* docs: clarify install vulkan SDK outside docker
* sycl: use intel/oneapi-basekit docker image
* docs: correct TOC
* docs: correct docker image for Intel oneMKL
* Tidy some code in ggml-sycl
* Remove blank space
* Remove std::printf comments

Co-authored-by: Abhilash Majumder <[email protected]>
* scripts : parse wtype in server-llm.sh
* scripts : fix check for wfile
* YaRN : store rope scaling type as int32_t in memory
* llama : store mapped names as const char *
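A hypothetical illustration of the pattern; the enum names and values below are made up, and the rationale in the comments is my reading rather than the commit's stated motivation:

```cpp
#include <cstdint>

enum rope_scaling_type { ROPE_SCALING_NONE, ROPE_SCALING_LINEAR, ROPE_SCALING_YARN };

struct hparams_sketch {
    // fixed-width int32_t rather than the enum type, so the field's size and
    // layout do not depend on the compiler's chosen enum representation
    int32_t rope_scaling_type_train = ROPE_SCALING_NONE;
};

// mapped names kept as string literals: const char * entries have static
// storage duration, so the table needs no allocation and no destruction
static const char * ROPE_SCALING_NAMES[] = { "none", "linear", "yarn" };
```
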
* Fix Vulkan on Intel ARC: optimize matmul for Intel ARC, add Vulkan dequant test
* Add Vulkan debug and validate flags to Make and CMakeLists.txt
* Enable asynchronous transfers in Vulkan backend
* Fix flake8
* Disable Vulkan async backend functions for now
* Also add Vulkan run tests command to Makefile and CMakeLists.txt
CMake's option() command is specifically for boolean flags; non-boolean cache variables should be declared with set(... CACHE ...) instead. Fixes #5158
* imatrix: add --combine and --continue-from
* imatrix: be able to start from a specific chunk

Co-authored-by: Iwan Kawrakow <[email protected]>
Flake lock file updates:

• Updated input 'flake-parts':
  'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
  → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
  'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
  → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
  'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
  → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
* Fix cpy with dims of 3
* Remove asserts

Co-authored-by: Abhilash Majumder <[email protected]>
* Update server-llm.sh: add a --non-interactive flag that allows running the script without asking for permission
* Update scripts/server-llm.sh

Co-authored-by: Georgi Gerganov <[email protected]>
* added dynamic temp params in main
* added help text
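For context, the dynamic-temperature idea scales the sampling temperature by the normalized entropy of the candidate distribution: confident (low-entropy) distributions get a lower temperature, uncertain ones a higher one. A rough sketch with illustrative parameter names, not the exact flags or code:

```cpp
#include <cmath>
#include <vector>

// probs: softmax probabilities of the candidate tokens.
// Returns a temperature interpolated between min_temp and max_temp by the
// normalized entropy of the distribution, shaped by `exponent`.
float dynamic_temperature(const std::vector<float> & probs,
                          float min_temp, float max_temp, float exponent) {
    float entropy = 0.0f;
    for (float p : probs) {
        if (p > 0.0f) entropy -= p * std::log(p);
    }
    const float max_entropy = std::log((float) probs.size()); // uniform case
    const float norm = max_entropy > 0.0f ? entropy / max_entropy : 0.0f; // in [0, 1]
    return min_temp + (max_temp - min_temp) * std::pow(norm, exponent);
}
```
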
We get slightly better PPL, and we cut quantization time nearly in half. The trick is to first quantize without forcing points onto the E8 lattice; we can then use a narrower search range around the block scale obtained that way.

Co-authored-by: Iwan Kawrakow <[email protected]>
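A self-contained sketch of that two-stage search; free_scale and quant_error below are hypothetical stand-ins for the real E8-lattice kernels, with plain round-to-nearest standing in for the lattice projection:

```cpp
#include <cfloat>
#include <cmath>

// Stage-1 helper: a free (unconstrained) scale estimate for the block.
static float free_scale(const float * x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    return amax / 127.0f;
}

// Error of quantizing the block at scale d (E8 projection in the real code).
static float quant_error(const float * x, int n, float d) {
    float err = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float q = std::round(x[i] / d);
        const float r = x[i] - q * d;
        err += r * r;
    }
    return err;
}

// Stage 1: pick a scale with no lattice constraint. Stage 2: search only a
// narrow window around that scale instead of sweeping the full range.
float two_stage_scale_search(const float * x, int n) {
    const float d0 = free_scale(x, n);
    if (d0 == 0.0f) return 0.0f; // all-zero block, nothing to search
    float best_d   = d0;
    float best_err = FLT_MAX;
    for (int step = -4; step <= 4; ++step) {
        const float d = d0 * (1.0f + 0.02f * step);
        const float e = quant_error(x, n, d);
        if (e < best_err) { best_err = e; best_d = d; }
    }
    return best_d;
}
```
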
* py : fix internlm2-hf convert to gguf
* ggml-ci
Nexesenex pushed a commit that referenced this pull request on Dec 22, 2024.
Co-authored-by: Iwan Kawrakow <[email protected]>