llama : support for `falcon-mamba` architecture (#9074)
* feat: initial `falcon-mamba` support in llama.cpp
* fix: lint
* refactor: clean up the initial implementation
* Update src/llama.cpp
Co-authored-by: compilade <[email protected]>
* Update src/llama.cpp
Co-authored-by: compilade <[email protected]>
* fix: address comments
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <[email protected]>
* fix: further cleanup and harmonization
* fix: lint
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: compilade <[email protected]>
* fix: change name
* Apply suggestions from code review
Co-authored-by: compilade <[email protected]>
* add in operator
* fix: add `dt_b_c_rms` in `llm_load_print_meta`
* fix: correct printf format for bool (see the first sketch at the end of this message)
* fix: correct print format
* Update src/llama.cpp
Co-authored-by: compilade <[email protected]>
* llama : quantize more Mamba tensors
* llama : use f16 as the fallback of fallback quant types (see the second sketch at the end of this message)
---------
Co-authored-by: compilade <[email protected]>
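
For reference, a minimal C++ sketch of the printf-vs-bool issue behind the two print-format fixes above. The flag name mirrors the `dt_b_c_rms` metadata printed in `llm_load_print_meta`, but everything here is illustrative, not the actual llama.cpp code.

```cpp
#include <cstdio>

int main() {
    bool ssm_dt_b_c_rms = true; // stand-in for the hparams flag printed in llm_load_print_meta

    // printf has no conversion specifier for bool. A bool passed through the
    // variadic "..." is promoted to int, so "%d" is fine; "%s" with a bool is
    // undefined behavior.
    std::printf("ssm_dt_b_c_rms = %d\n", (int) ssm_dt_b_c_rms);

    // Or render it as text explicitly:
    std::printf("ssm_dt_b_c_rms = %s\n", ssm_dt_b_c_rms ? "true" : "false");
    return 0;
}
```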
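A hedged sketch of the quantization fallback idea in the last commit: if a tensor's row size does not divide the block size of the requested quant type, drop to a simpler type, and if that block size does not fit either, fall back to f16 instead of failing. The enum, block sizes, and `pick_quant` helper are assumptions for illustration; the real type selection lives elsewhere in src/llama.cpp.

```cpp
#include <cstdio>

// Stand-ins for the ggml tensor types involved; the real enum lives in ggml.h.
enum class quant_type { Q4_K, Q5_0, F16 };

// Hypothetical helper: pick a quant type for a tensor row of n_per_row elements.
static quant_type pick_quant(quant_type requested, int n_per_row) {
    const int qk_k = 256; // assumed block size of the k-quants
    const int qk   = 32;  // assumed block size of the simpler quants

    if (requested == quant_type::Q4_K && n_per_row % qk_k != 0) {
        requested = quant_type::Q5_0;   // first fallback
    }
    if (requested == quant_type::Q5_0 && n_per_row % qk != 0) {
        requested = quant_type::F16;    // fallback of the fallback
    }
    return requested;
}

int main() {
    // e.g. a tensor whose row length fits neither block size falls through to F16
    std::printf("%d\n", (int) pick_quant(quant_type::Q4_K, 48));  // prints 2 (F16)
    std::printf("%d\n", (int) pick_quant(quant_type::Q4_K, 512)); // prints 0 (Q4_K)
    return 0;
}
```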