Feat: Support for `falcon-mamba` architecture #9074

younesbelkada · 2024-08-18T11:36:30Z

What does this PR do?

Fixes: #9009
Fixes: #9048

This PR adds the support for FalconMamba architecture in llama.cpp. I followed the suggestion from @compilade here: #9009 (comment) by simply extending the current Mamba architecture to be able to perform RMS norm operations for B / dt & C projections, in order to make things simple.

Output from the model converted locally:

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

cc @compilade @ggerganov

compilade

Thanks!

convert_hf_to_gguf.py

compilade · 2024-08-18T13:22:40Z

src/llama.cpp

+                // Some Mamba variants (e.g. FalconMamba) apply RMS norm in B, C & Dt layers
+                if (ssm_b_dt_rms) {
+                    dt = ggml_rms_norm(ctx0, dt, norm_rms_eps);
+                    B = ggml_rms_norm(ctx0, B, norm_rms_eps);
+                    C = ggml_rms_norm(ctx0, C, norm_rms_eps);
+                }


This will eventually be rewritten to use llm_build_norm, because some Mamba-based architectures like Jamba use RMS norms with learnable parameters here.

But for now I think this is fine.

src/llama.cpp

convert_hf_to_gguf.py

Co-authored-by: compilade <[email protected]>

younesbelkada · 2024-08-18T14:19:30Z

Thanks for the detailed review @compilade ! Should be all addressed now

gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

src/llama.cpp

gguf-py/gguf/constants.py

src/llama.cpp

Co-authored-by: compilade <[email protected]>

compilade

My only remaining nitpick is related vertical alignment of a printed string. This will be good to merge after fixing that.

I can confirm old Mamba models still work correctly with this change. I'll try Falcon-Mamba next to see if perplexity looks reasonable.

src/llama.cpp

Co-authored-by: compilade <[email protected]>

compilade · 2024-08-18T16:17:25Z

Getting an error during conversion of https://huggingface.co/tiiuae/falcon-mamba-7b-instruct, will investigate

Traceback (most recent call last):
  File "/.../convert_hf_to_gguf.py", line 4078, in <module>
    main()
  File "/.../convert_hf_to_gguf.py", line 4072, in main
    model_instance.write()
  File "/.../convert_hf_to_gguf.py", line 390, in write
    self.gguf_writer.write_kv_data_to_file()
  File "/.../gguf-py/gguf/gguf_writer.py", line 240, in write_kv_data_to_file
    kv_bytes += self._pack_val(val.value, val.type, add_vtype=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../gguf-py/gguf/gguf_writer.py", line 855, in _pack_val
    kv_data += encoded_val
TypeError: can't concat list to bytearray

This is likely related with metadata extraction from the model card (when the license field is a list). @mofosyne is this already handled in one of your recent PRs?

younesbelkada · 2024-08-18T16:21:52Z

@compilade this commit: https://huggingface.co/tiiuae/falcon-mamba-7b-instruct/commit/5e1687c297b82872dc38b33878d4601810e2ed67 should fix it, somehow gguf is not happy when license is parsed as a list

compilade · 2024-08-19T04:28:03Z

I've been experimenting with quantization, and from what I've seen with Mamba-2, I think it could be safe to quantize Mamba-1's ssm_x and ssm_dt tensors.

Here's the patch I used if you're interested (click to expand)

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 4b843991..108c822c 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -295,6 +295,7 @@ class Model:
                             gguf.MODEL_TENSOR.FFN_GATE_INP,
                             gguf.MODEL_TENSOR.POS_EMBD,
                             gguf.MODEL_TENSOR.TOKEN_TYPES,
+                            gguf.MODEL_TENSOR.SSM_CONV1D,
                         )
                     )
                     or not name.endswith(".weight")
@@ -2786,23 +2787,6 @@ class MambaModel(Model):
 
         return [(new_name, data_torch)]
 
-    def tensor_force_quant(self, name: str, new_name: str, bid: int | None, n_dims: int) -> gguf.GGMLQuantizationType | bool:
-        if bid is not None and new_name in (
-            self.format_tensor_name(
-                n, bid, ".weight" if name.endswith(".weight") else ""
-            )
-            for n in [
-                gguf.MODEL_TENSOR.SSM_CONV1D,
-                gguf.MODEL_TENSOR.SSM_X,
-                gguf.MODEL_TENSOR.SSM_DT,
-                gguf.MODEL_TENSOR.SSM_A,
-                gguf.MODEL_TENSOR.SSM_D,
-            ]
-        ):
-            return gguf.GGMLQuantizationType.F32
-
-        return super().tensor_force_quant(name, new_name, bid, n_dims)
-
 
 @Model.register("CohereForCausalLM")
 class CommandR2Model(Model):
diff --git a/src/llama.cpp b/src/llama.cpp
index 84fe4967..b8fa7684 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -16450,8 +16450,6 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
         // do not quantize Mamba's small yet 2D weights
         // NOTE: can't use LLM_TN here because the layer number is not known
         quantize &= name.find("ssm_conv1d.weight") == std::string::npos;
-        quantize &= name.find("ssm_x.weight")      == std::string::npos;
-        quantize &= name.find("ssm_dt.weight")     == std::string::npos;
 
         // do not quantize relative position bias (T5)
         quantize &= name.find("attn_rel_b.weight") == std::string::npos;

This reduces the Q4_K_S quantization of Falcon-Mamba-7B from 4942.25 MiB (5.70 bpw) to 4007.25 MB (4.62 bpw).

At 4.62 bpw, the story with the prompt Once upon at temp 0 seemed to start quite well (something related to a village in the Amazon rainforest).

I did not measure perplexity yet, because my hardware is a low-power laptop with 8GB of RAM and I only get 2 tokens per second with that model (with 12% of the time in ggml_ssm_scan and at least 75% in matrix multiplications, measured with perf). If you could compare the first few chunks of wikitext-2-raw perplexity between the two quantization "schemes", it would be nice.

$ ./bin/llama-perplexity -m /path/to/falcon-mamba-7B-chat-Q4_K_S.gguf -f /path/to/wiki.test.txt -b 512 -c 512

For me it would take 4 minutes per chunk, and the 5GB model barely fits in my free RAM.

younesbelkada · 2024-08-19T06:54:16Z

Hi @compilade
Thank you very much !
I just tried what you have suggested, and evaluated 20 chunks of the 2 quantization schemes while evaluating them qualitatively as well (by trying few prompts), both seems to generate very coherent results and find the perplexities below:

scheme 1 (current status of the PR): 6.17825
scheme 2 (with the git patch applied): 6.2279

I will let you decide here, happy to push the patch in this PR and I'll upload the converted quants on the TII HF org

mofosyne · 2024-08-19T14:29:29Z

Getting an error during conversion of https://huggingface.co/tiiuae/falcon-mamba-7b-instruct, will investigate

Traceback (most recent call last):
  File "/.../convert_hf_to_gguf.py", line 4078, in <module>
    main()
  File "/.../convert_hf_to_gguf.py", line 4072, in main
    model_instance.write()
  File "/.../convert_hf_to_gguf.py", line 390, in write
    self.gguf_writer.write_kv_data_to_file()
  File "/.../gguf-py/gguf/gguf_writer.py", line 240, in write_kv_data_to_file
    kv_bytes += self._pack_val(val.value, val.type, add_vtype=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../gguf-py/gguf/gguf_writer.py", line 855, in _pack_val
    kv_data += encoded_val
TypeError: can't concat list to bytearray

This is likely related with metadata extraction from the model card (when the license field is a list). @mofosyne is this already handled in one of your recent PRs?

Yeah I thought I already tackled this issue in #8774 . Double checked by copying your falcon mamba model card metadata to my test repo and running against it. No issues detected.

Basically added a 'zero length array' check to add_array() to not add the kv in if that's the case.

In that case then... Should we throw an error if add_key_value() doesn't match the GGUFValueType defined in vtype?

e.g.

    def add_key_value(self, key: str, val: Any, vtype: GGUFValueType) -> None:
        assert(GGUFValueType.get_type(val) == vtype)
        ...

edit: Ah, my sketch above won't deal with different sized integers... but the point still stands

compilade · 2024-08-19T15:03:01Z

@mofosyne

Double checked by copying your falcon mamba model card metadata to my test repo and running against it. No issues detected.

The problematic model card is no longer the latest version (which works fine), it was https://huggingface.co/tiiuae/falcon-mamba-7b/blob/503c3d4eaf202d970aabd81376c9f0d0e3defe2c/README.md.

Basically added a 'zero length array' check to add_array() to not add the kv in if that's the case.

Note that the problem here was that a list was passed to add_string. (through add_license)

The failure was during serialization of the metadata, so a way to reproduce the error would be to make a vocab-only conversion.

Should we throw an error if add_key_value() doesn't match the GGUFValueType defined in vtype?

Yes, failing early would make it easier to debug. This is a good idea, because the traceback I got doesn't explicitly mention the source of the problem, only that the types are wrong somewhere.

But ideally incorrect metadata in the model card should not prevent conversion, which means the types should be checked (and/or coerced) in gguf-py/gguf/metadata.py to avoid passing values with the wrong type to GGUFWriter. At least the types expected by huggingface_hub should not cause failures (arguably, different types than these would cause problems on HF's side anyway).

Actually, in this case huggingface_hub seems to expect license to be Optional[str] for model cards (while for dataset cards it accepts Optional[str | list[str]]).

Anyway, I think that more type checking in GGUFWriter.add_key_value should be useful to write better error messages in case of wrong types.

compilade · 2024-08-19T16:36:16Z

I will let you decide here, happy to push the patch in this PR

I've ran some further tests on a small Mamba model, and I realized that my initial decision to avoid quantizing these tensors was because mamba-130m uses 48 for the row size of ssm_dt.weight, and llama_tensor_get_type (used by llama_model_quantize_internal) does not properly fallback when the fallback type still has an incompatible block size (e.g. 48 % 32 != 0), so the perplexity went through the roof (with 100501489979291.2500 for the first chunk), likely because of something weird happening when the block size doesn't match.

I've added a fallback to the fallback quantization types:

diff --git a/src/llama.cpp b/src/llama.cpp
index b8fa7684..fe3c0db6 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -16122,6 +16122,9 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
             case GGML_TYPE_Q6_K:   new_type = GGML_TYPE_Q8_0;   break;
             default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
         }
+        if (tensor->ne[0] % ggml_blck_size(new_type) != 0) {
+            new_type = GGML_TYPE_F16;
+        }
         LLAMA_LOG_WARN(" - using fallback quantization %s\n", ggml_type_name(new_type));
         ++qs.n_fallback;
     }

Full patch so far for the changes to Mamba's quantization (click to expand)

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 4b843991..108c822c 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -295,6 +295,7 @@ class Model:
                             gguf.MODEL_TENSOR.FFN_GATE_INP,
                             gguf.MODEL_TENSOR.POS_EMBD,
                             gguf.MODEL_TENSOR.TOKEN_TYPES,
+                            gguf.MODEL_TENSOR.SSM_CONV1D,
                         )
                     )
                     or not name.endswith(".weight")
@@ -2786,23 +2787,6 @@ class MambaModel(Model):
 
         return [(new_name, data_torch)]
 
-    def tensor_force_quant(self, name: str, new_name: str, bid: int | None, n_dims: int) -> gguf.GGMLQuantizationType | bool:
-        if bid is not None and new_name in (
-            self.format_tensor_name(
-                n, bid, ".weight" if name.endswith(".weight") else ""
-            )
-            for n in [
-                gguf.MODEL_TENSOR.SSM_CONV1D,
-                gguf.MODEL_TENSOR.SSM_X,
-                gguf.MODEL_TENSOR.SSM_DT,
-                gguf.MODEL_TENSOR.SSM_A,
-                gguf.MODEL_TENSOR.SSM_D,
-            ]
-        ):
-            return gguf.GGMLQuantizationType.F32
-
-        return super().tensor_force_quant(name, new_name, bid, n_dims)
-
 
 @Model.register("CohereForCausalLM")
 class CommandR2Model(Model):
diff --git a/src/llama.cpp b/src/llama.cpp
index 84fe4967..fe3c0db6 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -16122,6 +16122,9 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
             case GGML_TYPE_Q6_K:   new_type = GGML_TYPE_Q8_0;   break;
             default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
         }
+        if (tensor->ne[0] % ggml_blck_size(new_type) != 0) {
+            new_type = GGML_TYPE_F16;
+        }
         LLAMA_LOG_WARN(" - using fallback quantization %s\n", ggml_type_name(new_type));
         ++qs.n_fallback;
     }
@@ -16450,8 +16453,6 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
         // do not quantize Mamba's small yet 2D weights
         // NOTE: can't use LLM_TN here because the layer number is not known
         quantize &= name.find("ssm_conv1d.weight") == std::string::npos;
-        quantize &= name.find("ssm_x.weight")      == std::string::npos;
-        quantize &= name.find("ssm_dt.weight")     == std::string::npos;
 
         // do not quantize relative position bias (T5)
         quantize &= name.find("attn_rel_b.weight") == std::string::npos;

When doing this, a Q5_K_S quantization of mamba-130m (94 MiB) is still smaller than the older Q4_K_S (97 MiB) which kept ssm_x.weight and ssm_dt.weight as F32, yet it gets a much lower perplexity (Q5_K_S with the new settings gets 30.6587 after 20 chunks of wikitext-2-raw, while Q4_K_S with f32 dt and x gets 40.1860 after 20 chunks. For comparison, the "new" Q4_K_S of mamba-130m (84 MiB) gets a perplexity of 40.5705 after 20 chunks).

In summary, for Mamba-130M:

quant scheme	`ssm_x.weight` type	`ssm_dt.weight` type	size	perplexity (20 chunks)
Q4_K_S (old)	F32	F32	97 MiB	40.1860
Q4_K_S (new)	Q4_K	F16 (fallback)	84 MiB	40.5705
Q5_K_S (new)	Q5_K	F16 (fallback)	94 MiB	30.6587

EDIT: for Mamba-370M:

quant scheme	`ssm_x.weight` type	`ssm_dt.weight` type	size	perplexity (20 chunks)
Q4_K_S (old)	F32	F32	271 MiB	21.1734
Q4_K_S (new)	Q4_K	Q5_0 (fallback)	221 MiB	21.1835
Q5_K_S (new)	Q5_K	Q5_1 (fallback)	258 MiB	19.1001

scheme 1 (current status of the PR): 6.17825

scheme 2 (with the git patch applied): 6.2279

This seems reasonable considering the file sizes differ by 1 GB (which in this case corresponds to 5.70 bpw vs 4.62 bpw while using Q4_K for most tensors in both schemes). I expect Falcon-Mamba-7B at Q5_K_S with the patch applied would have a similar size as the old Q4_K_S, yet will likely have a better perplexity.

So I think this patch is worth it, at least when comparing the perplexity for given file sizes with Mamba-130M. I think this should also apply to Falcon-Mamba-7B.

@younesbelkada Do you want me to push this directly here or do you want to commit the patch by yourself? Either way is fine with me.

and I'll upload the converted quants on the TII HF org

To save you some bandwidth, be aware that currently, for Mamba models, there is no difference within variants like Q2_K(|_S), Q3_K_(S|M|L), Q4_K_(S|M) or Q5_K_(S|M) (For a comprehensive list, see the default types in llama_model_quantize_internal, since this is what the Mamba-specific tensors use.)

younesbelkada · 2024-08-19T16:43:00Z

Thank you very much for the detailed answer, with respect to the experiments this are clear on my side!
Feel free to push directly on this branch, I'll upload the quantized weights right after

* llama : use f16 as the fallback of fallback quant types

LiuChaoXD · 2024-08-20T16:10:29Z

Hi, I try this pr.
The hugging face repo I used is: https://huggingface.co/tiiuae/falcon-mamba-7b-instruct

convert hf to gguf by: python ./convert_hf_to_gguf.py --outtype f32 --outfile ../falcon-mamba_f32.gguf path/to/hf/falcon-mamba-7b
quantize: ./llama-quantize ../falcon-mamba_f32.gguf ../falcon-maba_Q8_0.gguf Q8_0
try the cli command: ./llama-cli -m ../falcon-maba_Q8_0.gguf -p "You are a helpful assistant" -cnv

and model response is weird:

...............................................................................................
llama_new_context_with_model: n_ctx      = 1048576
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 103079.22 MB
llama_kv_cache_init:      Metal KV buffer size =    38.00 MiB
llama_new_context_with_model: KV self size  =   38.00 MiB, K (f32):    6.00 MiB, V (f32):   32.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.25 MiB
llama_new_context_with_model:      Metal compute buffer size =   151.13 MiB
llama_new_context_with_model:        CPU compute buffer size =    16.51 MiB
llama_new_context_with_model: graph nodes  = 2568
llama_new_context_with_model: graph splits = 384
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 1048576, n_batch = 2048, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system
You are a helpful

> how to build a website by python
TalesFromYourServer

“I don’t think you deserve a tip for this.” I have a similar story to this. My husband and I went to the casino and I won $1200 in one hour. I was so excited and went to cash in my tickets at the counter, it was $1200 in quarters and nickels. I asked the cashier if I could have them counted so I could take the cash out of the machine and she said she was busy and that it would be about 10 minutes. I said okay and went to sit down with my husband to wait. When I got up 10 minutes later, she had counted them and I had $1100 in $1 bills. I was so upset, I went to the manager and he gave me $200 in cash for the $100 I lost. I was still upset and went to another cashier and told her the story and asked if she could count it for me. She said yes and she counted it in 2 minutes and had $1100 in $1 bills for me. I thanked her and she said that it was no problem and to have a nice night. I was very happy with that and I went to tell my husband that the second cashier counted them for me and she said it was no problem. I told her the whole story and she was shocked. She said that she would never have done that. She said that it was a lot of work and she could have gotten into trouble. I told her that it was the right thing to do and that I was grateful for her help. I don't get it, why wouldn't you just have the $1000 in $1 coins, then you'd have $1000 in $1 coins and you wouldn't need to count them. I didn’t even think about that.

>

younesbelkada · 2024-08-21T06:00:31Z

@LiuChaoXD can you make sure you have compiled llama.cpp using this branch with the command make ?

younesbelkada · 2024-08-21T06:01:06Z

@compilade @ggerganov thanks for all the reviews, is there anything to do before merging this PR that I can help ?

LiuChaoXD · 2024-08-21T06:07:20Z

@LiuChaoXD can you make sure you have compiled llama.cpp using this branch with the command make ?

yes.
Use this branch, and make clean, make again.
I will try again later..

younesbelkada · 2024-08-21T06:21:06Z

@LiuChaoXD using the 4bit quantized model I get coherent results with the same system prompt you shared, see below:

And using the prompt you shared:

LiuChaoXD · 2024-08-21T06:32:46Z

@LiuChaoXD using the 4bit quantized model I get coherent results with the same system prompt you shared, see below:

And using the prompt you shared:

Thanks, appreciate.
maybe i loss some key information. Sorry about that.
I will try.

* feat: initial support for llama.cpp * fix: lint * refactor: better refactor * Update src/llama.cpp Co-authored-by: compilade <[email protected]> * Update src/llama.cpp Co-authored-by: compilade <[email protected]> * fix: address comments * Update convert_hf_to_gguf.py Co-authored-by: compilade <[email protected]> * fix: add more cleanup and harmonization * fix: lint * Update gguf-py/gguf/gguf_writer.py Co-authored-by: compilade <[email protected]> * fix: change name * Apply suggestions from code review Co-authored-by: compilade <[email protected]> * add in operator * fix: add `dt_b_c_rms` in `llm_load_print_meta` * fix: correct printf format for bool * fix: correct print format * Update src/llama.cpp Co-authored-by: compilade <[email protected]> * llama : quantize more Mamba tensors * llama : use f16 as the fallback of fallback quant types --------- Co-authored-by: compilade <[email protected]>

feat: initial support for llama.cpp

59a08be

github-actions bot added the python python script changes label Aug 18, 2024

younesbelkada added 2 commits August 18, 2024 11:40

fix: lint

bfa0286

refactor: better refactor

b97704c

compilade reviewed Aug 18, 2024

View reviewed changes

younesbelkada and others added 5 commits August 18, 2024 18:11

Update src/llama.cpp

343b583

Co-authored-by: compilade <[email protected]>

Update src/llama.cpp

a8109e3

Co-authored-by: compilade <[email protected]>

fix: address comments

184a4c6

Update convert_hf_to_gguf.py

3494265

Co-authored-by: compilade <[email protected]>

fix: add more cleanup and harmonization

f7d2e91

younesbelkada requested a review from compilade August 18, 2024 14:19

fix: lint

9e22bb7

compilade reviewed Aug 18, 2024

View reviewed changes

gguf-py/gguf/gguf_writer.py Outdated Show resolved Hide resolved

younesbelkada and others added 2 commits August 18, 2024 18:41

Update gguf-py/gguf/gguf_writer.py

4553502

Co-authored-by: compilade <[email protected]>

fix: change name

d637bb9

compilade reviewed Aug 18, 2024

View reviewed changes

src/llama.cpp Outdated Show resolved Hide resolved

gguf-py/gguf/constants.py Outdated Show resolved Hide resolved

gguf-py/gguf/constants.py Outdated Show resolved Hide resolved

src/llama.cpp Show resolved Hide resolved

younesbelkada and others added 2 commits August 18, 2024 19:35

Apply suggestions from code review

57c3eb4

Co-authored-by: compilade <[email protected]>

add in operator

bf5e344

younesbelkada requested a review from compilade August 18, 2024 15:41

younesbelkada force-pushed the add-fm-support branch from 5aeaca0 to bf5e344 Compare August 18, 2024 15:44

fix: add dt_b_c_rms in llm_load_print_meta

ca4db9e

younesbelkada force-pushed the add-fm-support branch from 2f1391d to ca4db9e Compare August 18, 2024 15:47

younesbelkada added 2 commits August 18, 2024 15:48

fix: correct printf format for bool

78ad84f

fix: correct print format

7aeccbb

compilade approved these changes Aug 18, 2024

View reviewed changes

src/llama.cpp Outdated Show resolved Hide resolved

Update src/llama.cpp

5c0f108

Co-authored-by: compilade <[email protected]>

ggerganov approved these changes Aug 19, 2024

View reviewed changes

younesbelkada mentioned this pull request Aug 19, 2024

Feature Request: Adding FalconMamba 7B Instruct in ollama ollama/ollama#6415

Open

llama : quantize more Mamba tensors

3491291

* llama : use f16 as the fallback of fallback quant types

compilade added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Aug 19, 2024

mofosyne mentioned this pull request Aug 19, 2024

Refactor: Add more typechecking to GGUFWriter.add_key_value #9095

Open

ggerganov merged commit b40eb84 into ggerganov:master Aug 21, 2024
54 checks passed

younesbelkada deleted the add-fm-support branch August 21, 2024 09:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat: Support for `falcon-mamba` architecture #9074

Feat: Support for `falcon-mamba` architecture #9074

younesbelkada commented Aug 18, 2024

compilade left a comment

compilade Aug 18, 2024

younesbelkada commented Aug 18, 2024

compilade left a comment

compilade commented Aug 18, 2024 •

edited

Loading

younesbelkada commented Aug 18, 2024 •

edited

Loading

compilade commented Aug 19, 2024 •

edited

Loading

younesbelkada commented Aug 19, 2024 •

edited

Loading

mofosyne commented Aug 19, 2024 •

edited

Loading

compilade commented Aug 19, 2024

compilade commented Aug 19, 2024 •

edited

Loading

younesbelkada commented Aug 19, 2024

LiuChaoXD commented Aug 20, 2024 •

edited

Loading

younesbelkada commented Aug 21, 2024

younesbelkada commented Aug 21, 2024

LiuChaoXD commented Aug 21, 2024

younesbelkada commented Aug 21, 2024 •

edited

Loading

LiuChaoXD commented Aug 21, 2024

Feat: Support for falcon-mamba architecture #9074

Feat: Support for falcon-mamba architecture #9074

Conversation

younesbelkada commented Aug 18, 2024

What does this PR do?

compilade left a comment

Choose a reason for hiding this comment

compilade Aug 18, 2024

Choose a reason for hiding this comment

younesbelkada commented Aug 18, 2024

compilade left a comment

Choose a reason for hiding this comment

compilade commented Aug 18, 2024 • edited Loading

younesbelkada commented Aug 18, 2024 • edited Loading

compilade commented Aug 19, 2024 • edited Loading

younesbelkada commented Aug 19, 2024 • edited Loading

mofosyne commented Aug 19, 2024 • edited Loading

compilade commented Aug 19, 2024

compilade commented Aug 19, 2024 • edited Loading

younesbelkada commented Aug 19, 2024

LiuChaoXD commented Aug 20, 2024 • edited Loading

younesbelkada commented Aug 21, 2024

younesbelkada commented Aug 21, 2024

LiuChaoXD commented Aug 21, 2024

younesbelkada commented Aug 21, 2024 • edited Loading

LiuChaoXD commented Aug 21, 2024

Feat: Support for `falcon-mamba` architecture #9074

Feat: Support for `falcon-mamba` architecture #9074

compilade commented Aug 18, 2024 •

edited

Loading

younesbelkada commented Aug 18, 2024 •

edited

Loading

compilade commented Aug 19, 2024 •

edited

Loading

younesbelkada commented Aug 19, 2024 •

edited

Loading

mofosyne commented Aug 19, 2024 •

edited

Loading

compilade commented Aug 19, 2024 •

edited

Loading

LiuChaoXD commented Aug 20, 2024 •

edited

Loading

younesbelkada commented Aug 21, 2024 •

edited

Loading