Discussed in #10787
Originally posted by tc-wolf December 11, 2024

Summary
There is a technique called LayerSkip (https://arxiv.org/abs/2404.16710) that accelerates decoding without a separate draft model (self-speculative decoding). As I understand it, the models are trained with layer dropout plus an early-exit loss so that earlier layers can predict tokens directly (and a speedup can be obtained by letting a subset of layers draft tokens). There is a collection of llama models trained with this property, so implementing this would be useful.
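For concreteness, here's a rough sketch of what "predicting tokens from an earlier layer" looks like on the transformers side: reuse the model's own final norm and LM head on an intermediate hidden state. The model id and exit layer E are just illustrative, and I'm assuming the standard Llama layout in transformers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: model id and exit layer E are placeholders.
MODEL = "facebook/layerskip-llama2-7B"
E = 8  # exit layer, fixed ahead of time

model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained(MODEL)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    h_E = out.hidden_states[E]                            # hidden state after the first E layers
    early_logits = model.lm_head(model.model.norm(h_E))   # reuse the final norm + LM head
    full_logits = out.logits                              # full-depth logits, for comparison
```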
A basic implementation could split the model outright, using the first E layers as a separate draft model, but this is not optimal because neither the weights nor the KV cache would be reused. This was the initial strategy in the transformers library before a more optimized implementation was merged.
Challenges
To get "draft" tokens, we'd need to determine an exit layer, get outputs from that layer, and pass them through the LM head of the model. I could see this as being possible with llama.cpp if at startup time we make a split in the graph / make a separate graph with shared weights for early exit and the layer for early exit is fixed.
(Verification works as in "normal" speculative decoding with a separate draft model: the remaining layers are used to predict the token following each drafted token, and the drafts are accepted up to the first contradiction.)
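In other words, something like the following minimal greedy-only sketch (not llama.cpp code; `draft_logits_fn` and `full_logits_fn` are stand-ins for the early-exit pass and the full-depth pass, each mapping a token sequence to per-position next-token logits):

```python
import numpy as np

def speculative_step(tokens, draft_logits_fn, full_logits_fn, n_draft=4):
    # 1) Draft n_draft tokens greedily with the early-exit (first E layers) pass.
    drafted = []
    ctx = list(tokens)
    for _ in range(n_draft):
        nxt = int(np.argmax(draft_logits_fn(ctx)[-1]))
        drafted.append(nxt)
        ctx.append(nxt)

    # 2) Verify with a single full-model pass over context + drafted tokens.
    full = full_logits_fn(list(tokens) + drafted)          # logits at every position
    accepted = []
    for i, tok in enumerate(drafted):
        target = int(np.argmax(full[len(tokens) - 1 + i]))  # full model's prediction at this position
        if target == tok:
            accepted.append(tok)                            # draft agrees with the full model
        else:
            accepted.append(target)                         # first contradiction: take the full model's token
            break
    else:
        # all drafts accepted: the full pass also gives the next token "for free"
        accepted.append(int(np.argmax(full[-1])))
    return accepted
```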
This also has implications for how KV caching is done, since the KV cache filled by the draft pass (the first E layers) can be reused as part of the full model's KV cache, and the exit layer's query vector can be saved as well. I have to read that section of the paper and look at existing implementations more thoroughly to understand the details here.
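My rough mental model of the cache layout, as an assumption from my reading so far and not how llama.cpp actually organizes its KV cache:

```python
# One shared cache indexed by (layer, position). The draft pass fills layers
# [0, E) for the positions it processes; verification then only needs to compute
# K/V for layers [E, N) at those positions, since the lower layers are already cached.
N_LAYERS, E = 32, 8
kv_cache = {}  # (layer, position) -> (key, value)

def layers_to_compute(position, was_drafted):
    """Which layers still need K/V computed at this position during the full pass."""
    if was_drafted:
        return range(E, N_LAYERS)   # lower layers already cached during drafting
    return range(0, N_LAYERS)       # e.g. prompt tokens: compute all layers
```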
I've looked at #4224 and #2783, but I think this may be easier to implement, since the layer at which to exit early can be fixed before inference begins (so it doesn't require the full output from every hidden layer, and the graph could be split ahead of time).
Requests
Any advice on making a GGUF model from a subset of layers of a llama-architecture model? From looking at convert_hf_to_gguf.py, I think I can just limit n_layers in the model's config.json, but I'm not sure whether I'd have to override anything elsewhere, or whether it would take a little graph surgery to keep the first E layers plus the final LM head.
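One way I could imagine doing this (untested sketch; it operates on the HF checkpoint before conversion, the model id / E / paths are placeholders, and I'm assuming the relevant llama config key is num_hidden_layers):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SRC = "facebook/layerskip-llama2-7B"   # placeholder: a LayerSkip checkpoint
E = 8                                  # number of layers to keep for the draft model

model = AutoModelForCausalLM.from_pretrained(SRC, torch_dtype=torch.float16)
model.model.layers = model.model.layers[:E]   # keep only the first E decoder layers
model.config.num_hidden_layers = E            # keep config.json consistent with the weights

# The embedding, final norm, and LM head are untouched, so no further graph surgery needed.
model.save_pretrained("layerskip-draft-E8")
AutoTokenizer.from_pretrained(SRC).save_pretrained("layerskip-draft-E8")
# then: python convert_hf_to_gguf.py layerskip-draft-E8 --outfile draft.gguf
```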
The immediate thing I want to try myself is converting and quantizing their checkpoints, turning a subset of the layers into a draft model, and measuring the token generation rate when the two are treated as fully separate models.
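Roughly what I have in mind for that "fully separate models" baseline, wrapped in Python only for readability (paths and the quant type are placeholders, and exact binary names/flags may differ between llama.cpp versions):

```python
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Convert both the full LayerSkip checkpoint and the truncated draft to GGUF.
run(["python", "convert_hf_to_gguf.py", "layerskip-llama2-7B", "--outfile", "full-f16.gguf"])
run(["python", "convert_hf_to_gguf.py", "layerskip-draft-E8", "--outfile", "draft-f16.gguf"])

# Quantize both.
run(["./llama-quantize", "full-f16.gguf", "full-q4_k_m.gguf", "Q4_K_M"])
run(["./llama-quantize", "draft-f16.gguf", "draft-q4_k_m.gguf", "Q4_K_M"])

# Benchmark with the existing speculative example, treating them as separate models.
run(["./llama-speculative", "-m", "full-q4_k_m.gguf", "-md", "draft-q4_k_m.gguf",
     "-p", "Once upon a time", "-n", "128"])
```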
Do you think it's plausible to implement this (with weight sharing + cache re-use) in llama.cpp? Or would this run into some complications because of the way the graph is constructed / inference is done?