Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

About

Speculative Prefill is a technique for accelerating LLM inference via token importance transferability. It uses a smaller, usually cheaper, LLM as a "draft" model to speculate which tokens are contextually important. Only these tokens, along with their original position information, are then sent to the main model for inference.
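
The selection step can be illustrated with a minimal sketch. It assumes per-token importance scores are already available (in Speculative Prefill they would come from the draft model; how they are computed is not shown here). The function name `select_important_tokens` and the `keep_ratio` parameter are hypothetical and are not part of this package's API.

```python
from typing import List, Tuple


def select_important_tokens(
    token_ids: List[int],
    importance: List[float],
    keep_ratio: float,
) -> Tuple[List[int], List[int]]:
    """Keep the highest-scoring tokens and return them with their original positions."""
    if len(token_ids) != len(importance):
        raise ValueError("token_ids and importance must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Always keep at least one token.
    num_keep = max(1, int(len(token_ids) * keep_ratio))

    # Rank positions by score (highest first), then restore the original order
    # so the kept tokens still appear in prompt order.
    ranked = sorted(range(len(token_ids)), key=lambda i: importance[i], reverse=True)
    kept_positions = sorted(ranked[:num_keep])

    kept_tokens = [token_ids[i] for i in kept_positions]
    return kept_tokens, kept_positions


# Illustrative values only (made-up token ids and scores).
tokens = [101, 7592, 2088, 2003, 1037, 3231, 1012, 102]
scores = [0.9, 0.1, 0.8, 0.05, 0.02, 0.7, 0.01, 0.6]
kept_tokens, kept_positions = select_important_tokens(tokens, scores, keep_ratio=0.5)
print(kept_tokens)
print(kept_positions)
```

The kept positions are returned alongside the tokens because the main model needs the original position information, not positions renumbered from zero.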

Speculative Prefill substantially reduces time-to-first-token (TTFT) on many downstream tasks, including LongBench and RULER. The implementation is based on vLLM.
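
TTFT is the delay between submitting a request and receiving the first generated token. As a rough sketch of how it could be measured for any streaming response, the helper below times the arrival of the first item from an iterator. Both `measure_ttft` and `fake_stream` are hypothetical names used for illustration; the stream here only simulates prefill with a sleep and does not call vLLM.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def measure_ttft(stream: Iterable[T]) -> Tuple[float, T]:
    """Return (seconds until the first item arrives, the first item)."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the first token is produced
    return time.perf_counter() - start, first


def fake_stream(prefill_seconds: float) -> Iterator[str]:
    """Stand-in for a streaming LLM response; sleeps to simulate prefill."""
    time.sleep(prefill_seconds)
    yield "Hello"
    yield " world"


ttft, first_token = measure_ttft(fake_stream(0.05))
print(f"TTFT: {ttft * 1000:.1f} ms, first token: {first_token!r}")
```

With a real serving stack, the same helper could wrap whatever token iterator the client exposes, which makes it easy to compare runs with and without the patch.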

Performance

Speculative Prefill greatly improves the maximum QPS that a system can support (benchmarked on 8 x NVIDIA H200s).

In terms of downstream quality, Speculative Prefill can preserve quality while keeping only 10% of the tokens for many compressible tasks.

Getting Started

Create a conda environment:

conda create -n sp python=3.10.15 -y
conda activate sp

Install via pip:

pip3 install git+https://github.com/Jingyu6/speculative_prefill.git#egg=speculative_prefill

To reproduce all experiments, clone the repo and install required dependencies:

git clone https://github.com/Jingyu6/speculative_prefill.git
cd speculative_prefill
pip3 install -r requirements.txt

Example Usage

Apply the monkey patch before importing or running any native vLLM code.

from speculative_prefill import enable_prefill_spec

# The monkey patch must be applied before any vLLM import
enable_prefill_spec(
    spec_model='meta-llama/Llama-3.2-1B-Instruct', 
    spec_config_path='./configs/config_p1_full_lah8.yaml'
)

from vllm import LLM, SamplingParams

llm = LLM(
    'meta-llama/Meta-Llama-3.1-70B-Instruct', 
    gpu_memory_utilization=0.8, 
    enforce_eager=True, 
    enable_chunked_prefill=False, 
    tensor_parallel_size=8
)

Evaluation

To reproduce the results from the paper, use the scripts in the experiments directory. This requires cloning the repository, since the scripts are not part of the pip install.

Before running these scripts, be sure to set up the following configurations:

Move all lm_eval configs in eval/lm_eval_patch to the corresponding place in your lm_eval library. These files contain special templates for Llama 3.1 and 3.2.
For RULER experiments, we recommend following run_ruler.md and launching a client using RULER's official script.
All other experiments can be launched by running:

bash experiments/run_{task_of_interest}.sh

All results will be saved in a local folder called local.

(For the RAG baseline experiments, please check out the rag_baseline branch.)

WIP and Contributing

We welcome everyone to try the code and contribute to it! Here are some planned TODOs:

Make sure all experiments in the paper are reproducible.
Package the repo.
Update to the latest vLLM version.

Since vLLM is evolving very quickly, we chose to keep this project as a monkey patch. Help with integrating it into mainline vLLM would be greatly appreciated!

Citation

If you find our work useful, please cite our paper:

@misc{liu2025speculativeprefillturbochargingttft,
      title={Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation}, 
      author={Jingyu Liu and Beidi Chen and Ce Zhang},
      year={2025},
      eprint={2502.02789},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02789}, 
}
