OpenELM support #7359
Conversation
Fix formatting
It looks like context shift currently causes crashes. A few other functions seem like they will be broken as well.
We already have this logic for the Refact models: convert-hf-to-gguf.py, lines 1141 to 1143 (at 8513724). You can try to reuse it in a similar way for OpenELM.
Have you run it? We'll probably need to generalize the head number to be determined per layer. Do you need some assistance with that?
@icecream95 I might jump back on this because I'm curious about where I got stuck.
I'll be working on porting the variable GQA stuff written for Jamba (in #7531) to OpenELM in the next days, which will both help with making OpenELM work and with reducing the size of the Jamba PR (ref: #7531 (comment)).

@icecream95 do you mind if I push to this branch, or would you prefer that I use another one? In the meantime I'll work on a local branch.
* llama : add variable GQA and variable FFN sizes. Some metadata keys can now also be arrays to support setting their value per-layer for models like OpenELM.
I've got this working for the 270M model, the 1.1B model, and the 450M model, getting reasonable perplexity. I did not yet test the 3B model, but will try once it finishes downloading. EDIT: the 3B model works too!

According to https://github.com/apple/corenet/blob/2261885b6696950aaf481a862e8926921ef1a067/projects/openelm/instruction_tuning/openelm-instruct.yaml#L7, the prompt template is zephyr, but it seems to perform extremely poorly in practice, while an alpaca-like template works better. This might be because the tokenizer doesn't have tokens for `<|user|>` and `<|assistant|>`, since it's using Llama-2's tokenizer.

Session file saving support isn't yet done, but everything else should pretty much work.
> This might be because the tokenizer doesn't have tokens for `<|user|>` and `<|assistant|>`, since it's using Llama-2's tokenizer.

Do we know what is the reason for the tokenizer confusion? It seems the model is unusable without the correct tokenizer.
src/llama.cpp:
```cpp
// TODO: find a more compact way to add more per-layer hyper-parameters
std::vector<int32_t> n_head_vec;
std::vector<int32_t> n_head_kv_vec;
std::vector<int32_t> n_ff_vec;
```
Do we expect negative values here? If not, we should change to `uint32_t`.
I've used `int32_t` here due to a limitation of `GGUFWriter`, which doesn't allow choosing the specific type of integer arrays and defaults to `INT32`. And also because `get_arr` in `src/llama.cpp` can only load arrays into vectors which have exactly the same type as the loaded array.

I'll see if this can be fixed, but it will likely require using Numpy types (e.g. `np.uint32`) in `GGUFValueType.get_type()` in `gguf-py/gguf/constants.py`.
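To illustrate the constraint, here's a rough, hypothetical helper (not the actual `llama_model_loader::get_arr`) showing why the stored element type has to match the destination vector's element type exactly:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

#include "ggml.h" // for the gguf_* metadata API

// Hypothetical sketch: load a GGUF INT32 array into a std::vector<int32_t>.
// A std::vector<uint32_t> would need the array stored as UINT32, which
// GGUFWriter currently doesn't emit for plain Python integer lists.
static void get_arr_i32(const struct gguf_context * ctx, const std::string & key, std::vector<int32_t> & out) {
    const int kid = gguf_find_key(ctx, key.c_str());
    if (kid < 0) {
        throw std::runtime_error("key not found: " + key);
    }
    if (gguf_get_arr_type(ctx, kid) != GGUF_TYPE_INT32) {
        // the stored element type must match the destination vector's type exactly
        throw std::runtime_error("unexpected array type for key: " + key);
    }
    const int32_t * data = (const int32_t *) gguf_get_arr_data(ctx, kid);
    out.assign(data, data + gguf_get_arr_n(ctx, kid));
}
```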
src/llama.cpp:
```cpp
@@ -2173,18 +2202,53 @@ struct llama_hparams {
        return false;
    }

    uint32_t n_gqa() const {
    // TODO: deduplicate per-layer getters
    uint32_t n_head_l(uint32_t layer) const {
```
We can remove `hparams.n_head` and rename this to:

```cpp
uint32_t n_head(uint32_t layer) const {
```

This way everywhere we reference `hparams.n_head`, we will use `hparams.n_head()`.

We can do this for the rest of the parameters that make sense to be per-layer, the main goal being to avoid having duplicated information in `hparams.n_head` and `hparams.n_head_vec`.

We can do this change in a follow-up refactoring PR, together with finding a more compact way to have per-layer hparams (maybe via a new `struct llama_hparams_layer`).
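For illustration, a minimal sketch of such a per-layer getter with a fallback value (the names and layout are placeholders, not the merged implementation):

```cpp
#include <cstdint>
#include <vector>

// Sketch only: per-layer head counts with a single global fallback value.
// When all layers share the same head count, n_head_vec can stay empty.
struct hparams_sketch {
    uint32_t n_layer = 0;
    uint32_t n_head  = 0;              // fallback, used when n_head_vec is empty
    std::vector<uint32_t> n_head_vec;  // optional per-layer override

    uint32_t n_head_l(uint32_t il) const {
        if (il < n_head_vec.size()) {
            return n_head_vec[il];
        }
        return n_head; // fallback value, as discussed below
    }
};
```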
> the main goal being to avoid having duplicated information in `hparams.n_head` and `hparams.n_head_vec`.
The advantage of duplicating the information is to have a fallback value, and it also avoids having to allocate the vectors when there's only one value for all the layers. The disadvantage is the possible confusion of which value to use.
> We can do this change in a follow-up refactoring PR, together with finding a more compact way to have per-layer hparams (maybe via new `struct llama_hparams_layer`)
Yes, a follow-up refactor seems appropriate. The goal with this initial implementation was to minimize having to change the accesses to `hparams.n_head`, `hparams.n_head_kv`, and `hparams.n_ff` all over the place, and also to re-use the nice abstraction which `llama_model_loader.get_arr` seemed to be (it was introduced in #7225, but isn't used by anything on `master`).
Although a possible huge problem with using `std::vector` in `llama_hparams` is that it's no longer trivially copyable, as assumed by `llama_state_save_file_internal` and `llama_state_load_file_internal`.
> Although a possible huge problem with using `std::vector` in `llama_hparams` is that it's no longer trivially copyable as assumed by `llama_state_save_file_internal` and `llama_state_load_file_internal`.
Hm, correct. We could switch to `uint32_t n_head_vec[LLAMA_MAX_LAYERS];` instead and use `n_layer` as the container size?
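That is roughly the direction the later commits take ("llama : use std::array for per-layer hparams"); a sketch of the idea, again with placeholder names:

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

#define LLAMA_MAX_LAYERS 256

// Sketch: fixed-capacity per-layer storage keeps the struct trivially copyable;
// only the first n_layer entries are meaningful.
struct hparams_with_array {
    uint32_t n_layer = 0;
    std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_arr = {};

    uint32_t n_head(uint32_t il) const {
        return n_head_arr[il]; // assumes il < n_layer
    }
};

static_assert(std::is_trivially_copyable<hparams_with_array>::value,
              "safe to raw-copy for state save/load");
```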
I'm not sure. They clearly say they used the Llama 2 tokenizer, while they also suggest to use it verbatim. This doesn't feel right, but this is apparently what they have done.

From playing around with the OpenELM Instruct models with only the BOS token as a prompt and different seeds: when asking questions or giving instructions, they at least seem to follow them, but the answers are never short (which can be useful in some cases).

It's a bit unfortunate they did not add special tokens (or even re-use the BOS and EOS tokens in a chatml-style way, like the bagel models do), which makes their instruct models worse than they could have been. Although this could likely be fixed by better fine-tuning.
Co-authored-by: Georgi Gerganov <[email protected]>
Question: How did you get the .gguf format OpenELM models? Could you please share the experience? Thanks!
@compilade Shall we merge after green CI?
> Shall we merge after green CI?

Yes, this feels pretty much ready.
src/llama.cpp:
```cpp
#define LLAMA_MAX_LAYERS  256
#define LLAMA_MAX_EXPERTS 160
```
Might be nice to note from which models these values come (or maybe not; it's also fine as-is). The 160 experts come from DeepSeekV2, but I don't know about models with 256 layers.
There are some models with more than 128 layers, like Goliath-120B (137 layers) and TheProfessor-155B (180 layers), so I assume 256 is there for some future-proofing, being the next power of 2.
(PaLM 540B (even though not open-weights) has 118 layers, so Llama-3-405B might still be fine with this limit. The models with more layers are usually merges.)
I've been trying some chat templates, but there is no sign that it is trained for chat. Apparently, Apple did not make it clear whether "Instruction" means single-turn Q&A or multi-turn conversation. Hopefully someone will release a fine-tune with proper chat template support.
* Initial OpenELM support (270M only so far)
* Fill out missing entries in llama_model_type_name
* fixup! Initial OpenELM support (270M only so far): fix formatting
* llama : support all OpenELM models
* llama : add variable GQA and variable FFN sizes. Some metadata keys can now also be arrays to support setting their value per-layer for models like OpenELM.
* llama : minor spacing changes (Co-authored-by: Georgi Gerganov <[email protected]>)
* llama : use std::array for per-layer hparams
* llama : fix save/load state
* llama : do not print hparams for vocab-only models
* llama : handle n_head == 0
* llama : use const ref for print_f and fix division by zero
* llama : fix t5 uses of n_head and n_ff
* llama : minor comment

Co-authored-by: Francis Couture-Harpin <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Hi there! I tried to deploy the OpenELM series models like OpenELM-270M-IT. After converting to .gguf format, I hit an error showing `unknown architecture openelm`.

Platform: Android, Pixel 8 Pro (ARM)

Q: How can I fix this model loading issue?
@sqzhang-jeremy Thanks for trying this. Try rebuilding your binaries: this build is from a commit from before OpenELM support was merged.
After rebuilding, it worked! Thank you!
Fixes: #6868.
Thanks to @joshcarp for an initial try at doing this (#6986); it was very helpful as a source to copy-paste from and check against.
Currently a bunch of the configuration is hardcoded into `llama.cpp`, so only the 270M model works at this point.

The `ffn_up` tensors in the converted model are actually concatenations of `ffn_gate` and `ffn_up`; perhaps the conversion script should separate them out? (A sketch of the alternative, splitting the fused tensor at graph-build time, is below.)

The 270M model is impressively fast, and works fine for generation, but "Chat" mode in `./server` doesn't really work well. Perhaps that's just because it hasn't been finetuned for that? I'm not really sure.
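As a possible alternative to splitting in the conversion script, the fused tensor could be split at graph-build time with views. A hedged sketch follows; the function name, the assumed shape (`n_embd` x `2*n_ff`), and the assumption that the gate half comes first along the output dimension follow OpenELM's GLU ordering and are assumptions, not code from this PR:

```cpp
#include <cstdint>

#include "ggml.h"

// Sketch: split a fused ffn_up weight (ne[0] = n_embd, ne[1] = 2*n_ff)
// into separate gate/up views without copying any data.
static void split_fused_ffn(
        struct ggml_context * ctx,
        struct ggml_tensor  * ffn_up_fused,
        int64_t               n_ff,
        struct ggml_tensor ** gate_out,
        struct ggml_tensor ** up_out) {
    const int64_t n_embd = ffn_up_fused->ne[0];

    // first n_ff output rows -> gate
    *gate_out = ggml_view_2d(ctx, ffn_up_fused, n_embd, n_ff,
                             ffn_up_fused->nb[1], 0);
    // remaining n_ff rows -> up (the offset is in bytes)
    *up_out   = ggml_view_2d(ctx, ffn_up_fused, n_embd, n_ff,
                             ffn_up_fused->nb[1], n_ff*ffn_up_fused->nb[1]);
}
```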