# Inference Acceleration

| Title & Authors | Links |
|:----------------|:------|
| **Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time**<br>Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen | Github, Paper |
| **Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers**<br>Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann | Paper |
| **H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models**<br>Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen | Github, Paper |
| **Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding**<br>Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun | Github, Paper |
| **Compressing Context to Enhance Inference Efficiency of Large Language Models**<br>Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin | Github, Paper |
| **ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference**<br>Ziqian Zeng, Yihuai Hong, Hongliang Dai, Huiping Zhuang, Cen Chen | Paper |
| **Accelerating LLM Inference with Staged Speculative Decoding**<br>Benjamin Spector, Chris Re | Paper |
| **TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction**<br>Junyi Liu, Liangzhi Li, Tong Xiang, Bowen Wang, Yiming Qian | Paper |
| **Inference with Reference: Lossless Acceleration of Large Language Models**<br>Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei | Github, Paper |
| **SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification**<br>Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia | Github, Paper |
| **SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference**<br>Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee | Paper |
| **Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding**<br>Xuefei Ning, Zinan Lin, Zixuan Zhou, Huazhong Yang, Yu Wang | Paper |
| **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**<br>Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra | Github, Paper |
| **Efficient Streaming Language Models with Attention Sinks**<br>Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis | Github, Paper |
| **(Dynamic) Prompting might be all you need to repair Compressed LLMs**<br>Duc N.M Hoang, Minsik Cho, Thomas Merth, Mohammad Rastegari, Zhangyang Wang | Paper |
| **Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning**<br>Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao | Github, Paper |
| **CacheGen: Fast Context Loading for Language Model Applications**<br>Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang | Paper |
| **Context Compression for Auto-regressive Transformers with Sentinel Tokens**<br>Siyu Ren, Qi Jia, Kenny Q. Zhu | Github, Paper |
| **A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models**<br>Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, Guido Zuccon | Github, Paper |
| **SPEED: Speculative Pipelined Execution for Efficient Decoding**<br>Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao | Paper |
| **Accelerating LLM Inference by Enabling Intermediate Layer Decoding**<br>Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, Chitta Baral | Paper |
| **Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster**<br>Hongxuan Zhang, Zhining Liu, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen | Paper |
| **Compressed Context Memory For Online Language Model Interaction**<br>Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song | Github, Paper |
| **SparQ Attention: Bandwidth-Efficient LLM Inference**<br>Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr | Paper |
| **Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy**<br>Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu | Paper |
| **Cascade Speculative Drafting for Even Faster LLM Inference**<br>Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang | Paper |
| **EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation**<br>Yuhui Li, Chao Zhang, Hongyang Zhang | Github, Blog |
| **LoMA: Lossless Compressed Memory Attention**<br>Yumeng Wang, Zhenyang Xiao | Paper |
| **Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads**<br>Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao | Github, Paper |
| **APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding**<br>Mingdao Liu, Aohan Zeng, Bowen Wang, Peng Zhang, Jie Tang, Yuxiao Dong | Paper |
| **BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models**<br>Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao | Github, Paper |
| **Speculative Streaming: Fast LLM Inference without Auxiliary Models**<br>Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi | Paper |
| **RelayAttention for Efficient Large Language Model Serving with Long System Prompts**<br>Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W.H. Lau | Paper |
| **Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement**<br>Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott | Paper |
| **Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens**<br>Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Cen Chen | Github, Paper |
| **CHAI: Clustered Head Attention for Efficient LLM Inference**<br>Saurabh Agarwal, Bilge Acun, Basil Homer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu | Paper |
| **Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference**<br>Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti | Paper |
| **Recurrent Drafter for Fast Speculative Decoding in Large Language Models**<br>Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, Yunfei Cheng | Paper |
| **Optimal Block-Level Draft Verification for Accelerating Speculative Decoding**<br>Ziteng Sun, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh | Paper |
| **Hierarchical Skip Decoding for Efficient Autoregressive Text Generation**<br>Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang | Paper |
| **SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens**<br>Chengbo Liu, Yong Zhu | Github, Paper |
| **Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models**<br>Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover | Github, Paper |
| **Exploring and Improving Drafts in Blockwise Parallel Decoding**<br>Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton | Paper |
| **Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding**<br>Jie Ou, Yueming Chen, Wenhong Tian | Github, Paper |
| **Self-Selected Attention Span for Accelerating Large Language Model Inference**<br>Tian Jin, Wanzin Yazar, Zifei Xu, Sayeh Sharify, Xin Wang | Paper |
| **Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration**<br>Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao | Paper |
| **XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference**<br>João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian | Paper |
| **Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing**<br>Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah | Github, Paper |
| **Efficient LLM Inference with Kcache**<br>Qiaozhi He, Zhihua Wu | Paper |
| **Better & Faster Large Language Models via Multi-token Prediction**<br>Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve | Paper |
| **KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation**<br>Minsik Cho, Mohammad Rastegari, Devang Naik | Paper |
| **You Only Cache Once: Decoder-Decoder Architectures for Language Models**<br>Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei | Github, Paper |
| **Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting**<br>Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang | Github, Paper |
| **Accelerating Speculative Decoding using Dynamic Speculation Length**<br>Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz | Paper |
| **Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge**<br>Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui | Paper |
| **EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models**<br>Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang | Github, Paper |
| **Distributed Speculative Inference of Large Language Models**<br>Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel | Paper |
| **SirLLM: Streaming Infinite Retentive LLM**<br>Yao Yao, Zuchao Li, Hai Zhao | Github, Paper |
| **Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference**<br>Hao (Mark) Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan | Github, Paper |
| **Faster Cascades via Speculative Decoding**<br>Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar | Paper |
| **Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism**<br>Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai | Paper |
| **QuickLLaMA: Query-aware Inference Acceleration for Large Language Models**<br>Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia | Github, Paper |
| **When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models**<br>Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan (Celine) Lin | Github, Paper |
| **HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning**<br>Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang | Paper |
| **Optimized Speculative Sampling for GPU Hardware Accelerators**<br>Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet | Paper |
| **MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression**<br>Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang | Github, Paper |
| **Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention**<br>Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Huanqi Cao, Xiao Chuanfu, Xingcheng Zhang, Dahua Lin, Chao Yang | Paper |
| **Interpreting Attention Layer Outputs with Sparse Autoencoders**<br>Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda | Paper |
| **EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees**<br>Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang | Github, Paper |
| **Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers**<br>Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu | Paper |
| **S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models**<br>Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh | Paper |
| **LiveMind: Low-latency Large Language Models with Simultaneous Inference**<br>Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li | Github, Paper |
| **Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference**<br>Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun | Paper |
| **Adaptive Draft-Verification for Efficient Large Language Model Decoding**<br>Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu | Paper |
| **LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference**<br>Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi | Paper |
| **Inference acceleration for large language models using "stairs" assisted greedy generation**<br>Domas Grigaliūnas, Mantas Lukoševičius | Paper |
| **An Efficient Inference Framework for Early-exit Large Language Models**<br>Ruijie Miao, Yihan Yan, Xinshuo Yao, Tong Yang | Paper |
| **Accelerating Large Language Model Inference with Self-Supervised Early Exits**<br>Florian Valade | Paper |
| **Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding**<br>Bin Xiao, Lujun Gui, Lei Su, Weipeng Chen | Github, Paper |
| **Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion**<br>Jacob K Christopher, Brian R Bartoldson, Bhavya Kailkhura, Ferdinando Fioretto | Paper |
| **Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling**<br>Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che | Paper |
| **Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation**<br>Lujun Gui, Bin Xiao, Lei Su, Weipeng Chen | Paper |
| **Path-Consistency: Prefix Enhancement for Efficient Inference in LLM**<br>Jiace Zhu, Yingtao Shen, Jie Zhao, An Zou | Paper |
| **OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs**<br>Jintian Zhang, Cheng Peng, Mengshu Sun, Xiang Chen, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen, Ningyu Zhang | Github, Paper |
| **Sirius: Contextual Sparsity with Correction for Efficient LLMs**<br>Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen | Github, Paper |
| **RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval**<br>Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu | Paper |
| **CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs**<br>Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie | Github, Paper |
| **Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference**<br>Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun | Paper |
| **Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction**<br>Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty | Github, Paper |
| **Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations**<br>Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse | Paper |
| **A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts**<br>Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng | Paper |
| **TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text**<br>Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang | Github, Paper |
| **SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration**<br>Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li | Github, Paper |
| **ParallelSpec: Parallel Drafter for Efficient Speculative Decoding**<br>Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu | Paper |
| **TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention**<br>Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia | Paper |
| **DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure**<br>Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou | Paper |
| **QSpec: Speculative Decoding with Complementary Quantization Schemes**<br>Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu | Paper |
| **DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads**<br>Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han | Github, Paper |
| **CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion**<br>Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang | Github, Paper |
| **CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation**<br>Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen | Github, Paper |
| **Dynamic Vocabulary Pruning in Early-Exit LLMs**<br>Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec | Github, Paper |
| **Efficient Inference for Augmented Large Language Models**<br>Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher | Paper |
| **Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition**<br>Artem Basharin, Andrei Chertkov, Ivan Oseledets | Paper |
| **MagicPIG: LSH Sampling for Efficient LLM Generation**<br>Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen | Github, Paper |
| **Dynamic Strategy Planning for Efficient Question Answering with Large Language Models**<br>Tanmay Parekh, Pradyot Prakash, Alexander Radovic, Akshay Shekher, Denis Savenkov | Paper |
| **SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference**<br>Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao | Paper |
| **The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation**<br>Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto | Paper |
| **Accelerated AI Inference via Dynamic Execution Methods**<br>Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu | Paper |
| **SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents**<br>Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, Jiayi Shen | Github, Paper |
| **FastDraft: How to Train Your Draft**<br>Ofir Zafrir, Igor Margulis, Dorin Shteyman, Guy Boudoukh | Paper |
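
A recurring primitive behind many of the speculative-decoding entries above (SpecInfer, Medusa, EAGLE, Kangaroo, SWIFT, and others) is the draft-then-verify loop with accept/reject sampling. The sketch below is a minimal illustration of that loop only, not any single paper's method: `target_dist` and `draft_dist` are toy stand-in distributions assumed here in place of real target and draft models, and the vocabulary is a handful of integer token ids.

```python
# Minimal sketch of speculative decoding's draft-then-verify loop.
# Toy assumption: two categorical next-token distributions stand in for
# a large "target" model and a cheap "draft" model.
import random

VOCAB = list(range(8))  # toy vocabulary of 8 token ids

def target_dist(context):
    """Stand-in for the large target model's next-token distribution."""
    seed = sum(context) % 7
    weights = [(i + seed) % 8 + 1 for i in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def draft_dist(context):
    """Stand-in for the small draft model: a smoothed-out target."""
    p = target_dist(context)
    smoothed = [0.5 * pi + 0.5 / len(VOCAB) for pi in p]
    total = sum(smoothed)
    return [s / total for s in smoothed]

def sample(dist):
    return random.choices(VOCAB, weights=dist, k=1)[0]

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the target.

    Returns the tokens kept this step: at least 1, at most k + 1.
    """
    # 1) Draft phase: the cheap model proposes k tokens autoregressively.
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = sample(draft_dist(ctx))
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify phase: accept token x with probability min(1, p(x)/q(x)).
    accepted, ctx = [], list(context)
    for tok in drafted:
        p, q = target_dist(ctx), draft_dist(ctx)
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), normalized,
            # which keeps the overall output distribution equal to the target's.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            accepted.append(sample([r / total for r in residual]))
            return accepted

    # 3) All k drafts accepted: take one bonus token from the target.
    accepted.append(sample(target_dist(ctx)))
    return accepted

if __name__ == "__main__":
    context = [0]
    for _ in range(5):
        context += speculative_step(context, k=4)
    print("generated:", context)
```

The accept/reject rule is what makes the loop lossless: accepted tokens plus the residual resample are distributed exactly as if the target model had decoded alone, so the speedup depends only on how often the draft agrees with the target. The papers above mostly vary the drafting side (auxiliary models, early exits, extra decoding heads, token trees, n-gram lookups) while keeping a verification step of this shape.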