Replies: 1 comment
Looks like there are layer-number-specific checks that set the model type (so the layer count couldn't just be limited; see lines 5549 to 5576 in 235f6e1).
Summary
There is a technique called LayerSkip (https://arxiv.org/abs/2404.16710) that enables a form of accelerated decoding without a separate draft model (self-speculative decoding). As I understand it, the models are trained with layer dropout plus an early-exit loss that makes it possible for earlier layers to predict tokens directly (so a subset of the layers can draft tokens, giving a speedup). There is a collection of Llama models trained with this property, so implementing this would be useful.
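For concreteness, here is a minimal sketch of what the draft-then-verify loop looks like under greedy decoding, assuming two hypothetical callables (not LayerSkip's actual code): `draft_logits(tokens)` runs only the first E layers plus the LM head, and `full_logits(tokens)` runs the whole model; both return per-position logits of shape `[len(tokens), vocab_size]`.

```python
import torch

def self_speculative_step(tokens: list[int], draft_logits, full_logits, k: int = 4):
    # 1) Draft k tokens greedily with the early-exit (first E layers) model.
    drafted = []
    ctx = list(tokens)
    for _ in range(k):
        nxt = int(torch.argmax(draft_logits(ctx)[-1]))
        drafted.append(nxt)
        ctx.append(nxt)

    # 2) Verify: one full-model pass over context + drafted tokens yields the
    #    token the full model would have produced at every drafted position.
    logits = full_logits(tokens + drafted)
    verified = torch.argmax(logits[len(tokens) - 1:], dim=-1).tolist()

    # 3) Accept the longest prefix of drafted tokens that matches the full model,
    #    then append the full model's token at the first disagreement (or the
    #    bonus token if everything matched).
    accepted = []
    for i, tok_id in enumerate(drafted):
        if tok_id == verified[i]:
            accepted.append(tok_id)
        else:
            break
    accepted.append(verified[len(accepted)])
    return tokens + accepted
```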
A basic implementation could be done by making a full split of the model, using the first E layers as a separate draft model, but this is not optimal (weights and the KV cache would not be reused). This was the initial strategy used in the transformers library before a more optimized implementation was merged.
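A rough sketch of that naive baseline with transformers follows (the checkpoint id and E are illustrative, and truncating the layer list is a hack rather than an official API):

```python
# Naive full-split baseline: load the checkpoint twice, keep only the first E
# decoder layers in the copy, and use it as a separate draft model for assisted
# generation. Weights and KV cache are duplicated, which is the inefficiency
# noted above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/layerskip-llama2-7B"   # a LayerSkip checkpoint (name assumed)
E = 4                                       # early-exit depth (assumed)

tok = AutoTokenizer.from_pretrained(model_id)
full = AutoModelForCausalLM.from_pretrained(model_id)
draft = AutoModelForCausalLM.from_pretrained(model_id)

# Chop the copy down to its first E layers; the final norm and LM head stay,
# so early hidden states are decoded directly into token logits.
draft.model.layers = draft.model.layers[:E]
draft.config.num_hidden_layers = E

inputs = tok("The capital of France is", return_tensors="pt")
out = full.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```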
Challenges
To get "draft" tokens, we'd need to pick an exit layer, take the outputs from that layer, and pass them through the model's LM head. I could see this being possible in llama.cpp if, at startup, we make a split in the graph / build a separate graph with shared weights for early exit, with the exit layer fixed.
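To make the "outputs from the exit layer through the LM head" step concrete, here is what it amounts to numerically, shown with transformers' hidden-state outputs (a Llama-style layout with `model.norm` and `lm_head` is assumed). This is only an illustration: a real llama.cpp implementation would stop the computation at layer E instead of running all layers and discarding the rest.

```python
import torch

@torch.no_grad()
def logits_from_exit_layer(model, input_ids: torch.Tensor, exit_layer: int) -> torch.Tensor:
    out = model(input_ids, output_hidden_states=True)
    h = out.hidden_states[exit_layer]   # hidden states after the first E layers
    h = model.model.norm(h)             # shared final norm (Llama-style layout assumed)
    return model.lm_head(h)             # [batch, seq_len, vocab_size]
```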
(Verification is done by predicting, with the remaining layers, the token that follows each drafted position and looking for a contradiction, just as in "normal" speculative decoding with a separate draft model.)
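Wiring the sketches above together (same hypothetical names as before, and still paying the cost of a full forward pass inside `logits_from_exit_layer`) would look roughly like:

```python
def draft_fn(toks):   # early-exit logits from the sketch above (exit at layer E)
    return logits_from_exit_layer(full, torch.tensor([toks]), exit_layer=E)[0]

def full_fn(toks):    # ordinary full-depth forward pass for verification
    return full(torch.tensor([toks])).logits[0]

prompt_ids = tok("The capital of France is", return_tensors="pt").input_ids[0].tolist()
new_ids = self_speculative_step(prompt_ids, draft_fn, full_fn, k=4)
print(tok.decode(new_ids))
```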
This also has some complications for how KV caching is done, since the KV cache of the draft pass (the first E layers) can be reused as part of the KV cache of the full model, and the exit layer's query vector can be saved as well. I have to read this section of the paper and look at implementations more thoroughly to understand the details here.
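A toy way to see the cache-sharing argument, treating the KV cache as a set of (layer, position) entries (the structure is illustrative, not llama.cpp's actual cache):

```python
# With a single cache keyed by (layer, position), the draft pass over positions
# T..T+k-1 fills layers [0, E), and verification only has to fill layers [E, N)
# for those same positions.
N, E, T, k = 32, 4, 100, 5   # total layers, exit layer, prompt length, draft length

draft_writes  = {(layer, pos) for layer in range(0, E) for pos in range(T, T + k)}
verify_writes = {(layer, pos) for layer in range(E, N) for pos in range(T, T + k)}

# Nothing computed by the draft pass is recomputed at verification, and together
# the two passes cover every layer for every drafted position.
assert draft_writes.isdisjoint(verify_writes)
assert len(draft_writes | verify_writes) == N * k
```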
I've looked at #4224 and #2783, but I think this may be easier to implement, since the layer at which to exit early can be fixed before inference begins (so it doesn't require the full output of every hidden layer, and the graph could be split ahead of time).
Requests
Could a quick draft model be produced by just limiting n_layers in the model's config.json (going by a look at convert_hf_to_gguf.py)? I'm not sure if anything else would have to be overridden elsewhere, or if it would require a little graph surgery to take the first E layers plus the final LM head layer.
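For reference, the kind of quick experiment this is asking about would look roughly like the following (untested, and per the reply above the layer-count checks in llama.cpp probably mean it is not sufficient on its own):

```python
# Shrink the layer count in a local copy of the checkpoint's config.json before
# running convert_hf_to_gguf.py, so only the first E decoder layers are described.
import json, pathlib

ckpt = pathlib.Path("models/layerskip-llama2-7B")   # local HF checkpoint (path is illustrative)
E = 4                                               # desired early-exit depth

cfg_path = ckpt / "config.json"
cfg = json.loads(cfg_path.read_text())
cfg["num_hidden_layers"] = E   # the field convert_hf_to_gguf.py picks up as the block count for Llama configs
cfg_path.write_text(json.dumps(cfg, indent=2))

# Afterwards: python convert_hf_to_gguf.py models/layerskip-llama2-7B --outfile draft-E4.gguf
# (the weights for layers >= E are still present in the safetensors files, so
#  extra surgery may be needed — which is exactly the open question above)
```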