
master b1794 #70

Merged: 64 commits into Nexesenex:master on Jan 8, 2024

Conversation

Nexesenex (Owner)

No description provided.

ggerganov and others added 30 commits December 30, 2023 23:24
* clip : refactor + bug fixes

ggml-ci

* server : add log message
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (#4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

ggml-ci
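
A hedged aside on the `metal : use uint64_t for strides` item above: a standalone C++ sketch (not the actual ggml Metal code) showing why 32-bit strides stop being enough once tensors grow large, since total byte offsets can exceed `UINT32_MAX`.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical kernel-argument struct: element counts and byte strides per dimension.
struct kernel_args {
    uint64_t ne0, ne1;   // elements along dim 0 and dim 1
    uint64_t nb0, nb1;   // byte strides along dim 0 and dim 1
};

int main() {
    kernel_args a{70000, 70000, 2, 0};     // 2-byte (fp16) elements
    a.nb1 = a.ne0 * a.nb0;                 // bytes per row
    const uint64_t total = a.nb1 * a.ne1;  // total bytes addressed by the kernel
    printf("row stride = %llu bytes, total = %llu bytes, exceeds 32-bit range: %s\n",
           (unsigned long long) a.nb1, (unsigned long long) total,
           total > UINT32_MAX ? "yes" : "no");
    return 0;
}
```
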
* update: awq support llama-7b model

* update: change order

* update: benchmark results for llama2-7b

* update: mistral 7b v1 benchmark

* update: support 4 models

* fix: Readme

* update: ready for PR

* update: readme

* fix: readme

* update: change order import

* black

* format code

* update: work for both mpt and awqmpt

* update: readme

* Rename to llm_build_ffn_mpt_awq

* Formatted other files

* Fixed params count

* fix: remove code

* update: more detail for mpt

* fix: readme

* fix: readme

* update: change folder architecture

* fix: common.cpp

* fix: readme

* fix: remove ggml_repeat

* update: cicd

* update: cicd

* update: remove use_awq arg

* update: readme

* llama : adapt plamo to new ffn

ggml-ci

* fix: update torch version

---------

Co-authored-by: Trần Đức Nam <[email protected]>
Co-authored-by: Le Hoang Anh <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* Changes to server to allow metadata override

* documentation

* flake.nix: expose full scope in legacyPackages

* flake.nix: rocm not yet supported on aarch64, so hide the output

* flake.nix: expose checks

* workflows: nix-ci: init; build flake outputs

* workflows: nix-ci: add a job for eval

* workflows: weekly `nix flake update`

* workflows: nix-flakestry: drop tag filters

...and add a job for flakehub.com

* workflows: nix-ci: add a qemu job for jetsons

* flake.nix: suggest the binary caches

* flake.lock: update

to a commit recently cached by nixpkgs-cuda-ci

---------

Co-authored-by: John <[email protected]>
Co-authored-by: Someone Serge <[email protected]>
* Add n_key_dim and n_value_dim

Some models use values that are not derived from `n_embd`.
Also remove `n_embd_head` and `n_embd_gqa` because it is not clear
which "head" is referred to (key or value).

Fix issue #4648.

* Fix `llm_build_kqv` to use `n_value_gqa`

* Rebase

* Rename variables

* Fix llm_build_kqv to be more generic wrt n_embd_head_k

* Update default values for n_embd_head_k and n_embd_head_v

Co-authored-by: Georgi Gerganov <[email protected]>

* Fix llm_load_tensors: the asserts were not backcompat

---------

Co-authored-by: Georgi Gerganov <[email protected]>
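
As a hedged illustration of the reasoning in the `Add n_key_dim and n_value_dim` commit above, here is a minimal standalone sketch (field names follow the commit's `n_embd_head_k` / `n_embd_head_v` naming, but this is not the llama.cpp code): when the per-head key and value sizes differ, the K and V caches have to be sized from separate dimensions instead of a single `n_embd_head`.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical hyperparameters for a model whose K and V head sizes differ.
struct hparams {
    uint32_t n_head_kv;      // number of key/value heads (GQA)
    uint32_t n_embd_head_k;  // per-head key dimension
    uint32_t n_embd_head_v;  // per-head value dimension
};

int main() {
    hparams hp{8, 128, 96};  // illustrative values only
    const uint32_t n_embd_k_gqa = hp.n_embd_head_k * hp.n_head_kv; // K entries per token
    const uint32_t n_embd_v_gqa = hp.n_embd_head_v * hp.n_head_kv; // V entries per token
    printf("K cache per token: %u, V cache per token: %u\n",
           (unsigned) n_embd_k_gqa, (unsigned) n_embd_v_gqa);
    return 0;
}
```
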
* replaced all API-facing `int`s with `int32_t`

* formatting and missed `int` in `llama_token_to_piece`
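
A small hedged sketch of the `int32_t` point above (hypothetical function, not the real llama.h API): fixed-width types make the public interface independent of the platform's `int` width.

```cpp
#include <climits>
#include <cstdint>
#include <cstdio>

// Hypothetical API function: returns int32_t instead of plain int, so the
// declared width is identical on every platform.
int32_t my_token_count() { return 32000; }

int main() {
    printf("token count: %d (plain int is %zu bits on this platform)\n",
           my_token_count(), sizeof(int) * CHAR_BIT);
    return 0;
}
```
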
* server: add token counts to stats

* server: generate hpp

---------

Co-authored-by: phiharri <[email protected]>
* ggml : disable fast-math for Metal (cmake build only)

ggml-ci

* metal : fix Metal API debug warnings

* cmake : add -fno-inline for Metal build (#4545)

* metal : fix API debug warnings

* metal : fix compile warnings

* metal : use uint64_t for strides

* cmake : rename option to LLAMA_METAL_SHADER_DEBUG

* metal : fix mat-vec Q8_0 kernel for BS > 1

* metal : normalize mat-vec kernel signatures

* cmake : respect LLAMA_QKK_64 option

* metal : fix mat-vec Q4_K kernel for QK_K == 64

* metal : optimizing ggml_mul_mat_id (wip)

* metal : minor fix

* metal : opt mul_mm_id
* add more int ops

* ggml_compute_forward_dup_bytes

* add tests

* PR comments

* tests : minor indentations

---------

Co-authored-by: Georgi Gerganov <[email protected]>
ggml-ci
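
A hedged sketch of the idea behind the `ggml_compute_forward_dup_bytes` item above (standalone code, not ggml's implementation): when source and destination use the same element size and rows are contiguous, duplication can be a per-row `memcpy` instead of an element-by-element conversion loop.

```cpp
#include <cstdio>
#include <cstring>

// Copy nrows rows of row_bytes each, honoring the byte strides of src and dst.
void dup_rows_bytes(const char * src, char * dst, int nrows, size_t row_bytes,
                    size_t src_stride, size_t dst_stride) {
    for (int r = 0; r < nrows; ++r) {
        std::memcpy(dst + r * dst_stride, src + r * src_stride, row_bytes);
    }
}

int main() {
    float src[2][4] = {{1, 2, 3, 4}, {5, 6, 7, 8}};
    float dst[2][4] = {};
    dup_rows_bytes((const char *) src, (char *) dst, 2, sizeof(src[0]),
                   sizeof(src[0]), sizeof(dst[0]));
    printf("%g %g\n", dst[0][0], dst[1][3]); // expect 1 and 8
    return 0;
}
```
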
* updates the package.swift to use ggml as dependency

* changes the ggml package url src to ggerganov
azarovalex and others added 13 commits January 7, 2024 10:20
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* make : add passkey target

* passkey : add "self-extend"-like context extension (#4810)

* llama : "self-extend"-like context extension

* passkey : add comment

* passkey : add readme
* examples : add passkey test

* passkey : better prints

* passkey : select pass key pos from CLI

* passkey : simplify n_past logic

* llama : "self-extend"-like context extension

* passkey : add comment

* main : add Self-Extend support

* llama : add comment about llama_kv_cache_seq_div
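
A much-simplified, hedged sketch of the position-grouping idea behind the "self-extend"-like extension above (the helper below is hypothetical and standalone; per the commits, llama.cpp implements this by operating on KV-cache positions, e.g. via `llama_kv_cache_seq_div`): the first positions keep their exact values, while positions beyond that are divided by a group factor so a long sequence maps into a shorter effective position range that stays within the trained context.

```cpp
#include <cstdio>

// The first n_keep positions stay exact; later positions are compressed by
// group_size, shrinking the effective position range.
int self_extend_pos(int pos, int n_keep, int group_size) {
    if (pos < n_keep) return pos;
    return n_keep + (pos - n_keep) / group_size;
}

int main() {
    const int n_keep = 512, group_size = 4;
    const int test_pos[] = {100, 600, 4096, 16384};
    for (int pos : test_pos) {
        printf("pos %5d -> %5d\n", pos, self_extend_pos(pos, n_keep, group_size));
    }
    return 0;
}
```
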
* iq2_xxs: basics

* iq2_xxs: scalar and AVX2 dot products

Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change it later; for now this is what we have.

* iq2_xxs: ARM_NEON dot product

Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

Dequantization works; something is still wrong with the
dot product.

* iq2_xxs: Metal dot product now works

We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s

Not the greatest performance, but not complete garbage either.

* iq2_xxs: slightly faster dot product

TG-128 is now 48.4 t/s

* iq2_xxs: slightly faster dot product

TG-128 is now 50.9 t/s

* iq2_xxs: even faster Metal dot product

TG-128 is now 54.1 t/s.

Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master

* iq2_xxs: quantized CUDA dot product (MMVQ)

We get TG-128 = 153.1 t/s

* iq2_xxs: slightly faster CUDA dot product

TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum

* iq2_xxs: fix MoE on Metal

* Fix missing MMQ ops when on hipBLAS

I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in qequantize_row_iq2_xxs

The 0.25f factor was missing.
Great detective work by @ggerganov!

* Fixing tests

* PR suggestion

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
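
A hedged illustration of the Q8_K range note in the commit above (a standalone symmetric 8-bit quantizer, not ggml's block-wise Q8_K): clamping quants to [-127, 127] keeps the representation symmetric, which the commit says the IQ2_XXS AVX2 dot product relies on.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize to int8 with a single scale, keeping values in the symmetric
// range [-127, 127] (the asymmetric -128 case is never produced).
void quantize_q8_symmetric(const std::vector<float> & x, std::vector<int8_t> & q, float & scale) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    q.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        q[i] = (int8_t) std::clamp((int) std::lround(x[i] / scale), -127, 127);
    }
}

int main() {
    std::vector<float> x = {0.8f, -1.3f, 0.05f, 2.0f};
    std::vector<int8_t> q;
    float d = 0.0f;
    quantize_q8_symmetric(x, q, d);
    for (size_t i = 0; i < q.size(); ++i) printf("%d ", q[i]);
    printf("(scale %.4f)\n", d);
    return 0;
}
```
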
Nexesenex merged commit e4705be into Nexesenex:master on Jan 8, 2024
4 checks passed
Nexesenex added a commit that referenced this pull request Dec 15, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 15, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 15, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 15, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 15, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 15, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 15, 2024
This reverts commit 2637d2deebed514b45f39df95c88cd9b8f783324.
Nexesenex added a commit that referenced this pull request Dec 19, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 20, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 21, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 21, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
* Adding fused y*unary(x) op

* Fused y*unary(x) op: CUDA

* Fused y*unary(x) op: dedicated CPU implementation for silu and gelu

* Fused y*unary(x) op: Metal

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
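
A hedged sketch of the fused `y*unary(x)` op from the commit above, for the SiLU case (standalone code, not the actual CPU/CUDA/Metal kernels): fusing the activation with the element-wise multiply avoids materializing the intermediate `unary(x)` tensor.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Compute out[i] = y[i] * silu(x[i]) in a single pass over the data.
void fused_silu_mul(const std::vector<float> & x, const std::vector<float> & y,
                    std::vector<float> & out) {
    out.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        const float s = x[i] / (1.0f + std::exp(-x[i])); // silu(x) = x * sigmoid(x)
        out[i] = y[i] * s;                               // fused multiply
    }
}

int main() {
    std::vector<float> x = {1.0f, -2.0f, 0.5f}, y = {0.5f, 1.0f, 2.0f}, out;
    fused_silu_mul(x, y, out);
    for (float v : out) printf("%.4f ", v);
    printf("\n");
    return 0;
}
```
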
Nexesenex added a commit that referenced this pull request Dec 23, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 23, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 24, 2024
Credit : Iwan Kawrakow @ikawrakow
Nexesenex added a commit that referenced this pull request Dec 24, 2024
Credit : Iwan Kawrakow @ikawrakow