Replies: 1 comment
Looks like there are layer-number-specific checks that set the model type (so the layer count couldn't just be limited; see lines 5549 to 5576 in 235f6e1).
Summary
There is a technique called LayerSkip (https://arxiv.org/abs/2404.16710) that enables a form of accelerated decoding without a separate draft model (self-speculative decoding). As I understand it, the models are trained with layer dropout plus an early-exit loss that makes it possible for earlier layers to predict tokens directly (so a subset of the layers can draft tokens, giving a speedup). There is a collection of Llama models trained with this property, so implementing this would be useful.
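For concreteness, here is a minimal sketch of what the draft-then-verify loop looks like under greedy decoding, assuming two hypothetical callables (not LayerSkip's actual code): `draft_logits(tokens)` runs only the first E layers plus the LM head, and `full_logits(tokens)` runs the whole model; both return per-position logits of shape `[len(tokens), vocab_size]`.

```python
import torch

def self_speculative_step(tokens: list[int], draft_logits, full_logits, k: int = 4):
    # 1) Draft k tokens greedily with the early-exit (first E layers) model.
    drafted = []
    ctx = list(tokens)
    for _ in range(k):
        nxt = int(torch.argmax(draft_logits(ctx)[-1]))
        drafted.append(nxt)
        ctx.append(nxt)

    # 2) Verify: one full-model pass over context + drafted tokens yields the
    #    token the full model would have produced at every drafted position.
    logits = full_logits(tokens + drafted)
    verified = torch.argmax(logits[len(tokens) - 1:], dim=-1).tolist()

    # 3) Accept the longest prefix of drafted tokens that matches the full model,
    #    then append the full model's token at the first disagreement (or the
    #    bonus token if everything matched).
    accepted = []
    for i, tok_id in enumerate(drafted):
        if tok_id == verified[i]:
            accepted.append(tok_id)
        else:
            break
    accepted.append(verified[len(accepted)])
    return tokens + accepted
```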
A basic implementation could be done by making a full split of the model, using the first E layers as a separate draft model, but this is not optimal (weights and the KV cache would not be reused). This was the initial strategy used in the transformers library before a more optimized implementation was merged.
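A rough sketch of that naive baseline with transformers follows (the checkpoint id and E are illustrative, and truncating the layer list is a hack rather than an official API):

```python
# Naive full-split baseline: load the checkpoint twice, keep only the first E
# decoder layers in the copy, and use it as a separate draft model for assisted
# generation. Weights and KV cache are duplicated, which is the inefficiency
# noted above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/layerskip-llama2-7B"   # a LayerSkip checkpoint (name assumed)
E = 4                                       # early-exit depth (assumed)

tok = AutoTokenizer.from_pretrained(model_id)
full = AutoModelForCausalLM.from_pretrained(model_id)
draft = AutoModelForCausalLM.from_pretrained(model_id)

# Chop the copy down to its first E layers; the final norm and LM head stay,
# so early hidden states are decoded directly into token logits.
draft.model.layers = draft.model.layers[:E]
draft.config.num_hidden_layers = E

inputs = tok("The capital of France is", return_tensors="pt")
out = full.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```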
Challenges
To get "draft" tokens, we'd need to pick an exit layer, take the outputs from that layer, and pass them through the model's LM head. I could see this being possible in llama.cpp if, at startup, we make a split in the graph / build a separate graph with shared weights for early exit, with the exit layer fixed.
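To make the "outputs from the exit layer through the LM head" step concrete, here is what it amounts to numerically, shown with transformers' hidden-state outputs (a Llama-style layout with `model.norm` and `lm_head` is assumed). This is only an illustration: a real llama.cpp implementation would stop the computation at layer E instead of running all layers and discarding the rest.

```python
import torch

@torch.no_grad()
def logits_from_exit_layer(model, input_ids: torch.Tensor, exit_layer: int) -> torch.Tensor:
    out = model(input_ids, output_hidden_states=True)
    h = out.hidden_states[exit_layer]   # hidden states after the first E layers
    h = model.model.norm(h)             # shared final norm (Llama-style layout assumed)
    return model.lm_head(h)             # [batch, seq_len, vocab_size]
```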
(Verification is done by predicting, with the remaining layers, the token that follows each drafted position and looking for a contradiction, just as in "normal" speculative decoding with a separate draft model.)
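Wiring the sketches above together (same hypothetical names as before, and still paying the cost of a full forward pass inside `logits_from_exit_layer`) would look roughly like:

```python
def draft_fn(toks):   # early-exit logits from the sketch above (exit at layer E)
    return logits_from_exit_layer(full, torch.tensor([toks]), exit_layer=E)[0]

def full_fn(toks):    # ordinary full-depth forward pass for verification
    return full(torch.tensor([toks])).logits[0]

prompt_ids = tok("The capital of France is", return_tensors="pt").input_ids[0].tolist()
new_ids = self_speculative_step(prompt_ids, draft_fn, full_fn, k=4)
print(tok.decode(new_ids))
```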
This also has some complications for how KV caching is done, since the KV cache of the draft pass (the first E layers) can be reused as part of the KV cache of the full model, and the exit layer's query vector can be saved as well. I have to read this section of the paper and look at implementations more thoroughly to understand the details here.
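A toy way to see the cache-sharing argument, treating the KV cache as a set of (layer, position) entries (the structure is illustrative, not llama.cpp's actual cache):

```python
# With a single cache keyed by (layer, position), the draft pass over positions
# T..T+k-1 fills layers [0, E), and verification only has to fill layers [E, N)
# for those same positions.
N, E, T, k = 32, 4, 100, 5   # total layers, exit layer, prompt length, draft length

draft_writes  = {(layer, pos) for layer in range(0, E) for pos in range(T, T + k)}
verify_writes = {(layer, pos) for layer in range(E, N) for pos in range(T, T + k)}

# Nothing computed by the draft pass is recomputed at verification, and together
# the two passes cover every layer for every drafted position.
assert draft_writes.isdisjoint(verify_writes)
assert len(draft_writes | verify_writes) == N * k
```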
I've looked at #4224 and #2783, but I think this may be easier to implement, since the layer at which to exit early can be fixed before inference begins (so it doesn't require the full output of every hidden layer, and the graph could be split ahead of time).
Requests
Could a quick draft model be produced by just limiting n_layers in the model's config.json (going by a look at convert_hf_to_gguf.py)? I'm not sure if anything else would have to be overridden elsewhere, or if it would require a little graph surgery to take the first E layers plus the final LM head layer.
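For reference, the kind of quick experiment this is asking about would look roughly like the following (untested, and per the reply above the layer-count checks in llama.cpp probably mean it is not sufficient on its own):

```python
# Shrink the layer count in a local copy of the checkpoint's config.json before
# running convert_hf_to_gguf.py, so only the first E decoder layers are described.
import json, pathlib

ckpt = pathlib.Path("models/layerskip-llama2-7B")   # local HF checkpoint (path is illustrative)
E = 4                                               # desired early-exit depth

cfg_path = ckpt / "config.json"
cfg = json.loads(cfg_path.read_text())
cfg["num_hidden_layers"] = E   # the field convert_hf_to_gguf.py picks up as the block count for Llama configs
cfg_path.write_text(json.dumps(cfg, indent=2))

# Afterwards: python convert_hf_to_gguf.py models/layerskip-llama2-7B --outfile draft-E4.gguf
# (the weights for layers >= E are still present in the safetensors files, so
#  extra surgery may be needed — which is exactly the open question above)
```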