
Improvements: Papers #146

Open
Blaizzy opened this issue Dec 7, 2024 · 0 comments
Blaizzy commented Dec 7, 2024

VisionZip
A simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance.

https://arxiv.org/pdf/2412.04467
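The core idea of selecting informative visual tokens can be sketched roughly as follows. This is a minimal illustration, not the paper's actual algorithm: it assumes we already have per-token attention weights (e.g., from the vision encoder's [CLS] token) and simply keeps the top-k most-attended tokens; the function name and `keep_ratio` parameter are hypothetical.

```python
import numpy as np

def select_informative_tokens(tokens, cls_attention, keep_ratio=0.125):
    """Keep the visual tokens that receive the most attention.

    tokens:        (N, D) array of visual token embeddings
    cls_attention: (N,) attention weights from the [CLS] token
    keep_ratio:    fraction of tokens to keep
    """
    k = max(1, int(len(tokens) * keep_ratio))
    top = np.argsort(cls_attention)[-k:]   # indices of the k most-attended tokens
    return tokens[np.sort(top)]            # preserve original spatial order

rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 64))    # e.g., a 24x24 patch grid
attn = rng.random(576)
kept = select_informative_tokens(tokens, attn)
print(kept.shape)  # (72, 64)
```

With a 0.125 keep ratio, only 72 of 576 visual tokens are passed to the language model, which is where the efficiency gain comes from.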

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

In this work, we propose a training-free adaptive inference method for multimodal LLMs that can accommodate a broad range of efficiency requirements with a minimum performance drop.
Our method consists of (a) iterative token merging based on embedding similarity before the LLM, and (b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs.

https://arxiv.org/pdf/2412.03248
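Step (a), merging tokens by embedding similarity before the LLM, can be sketched with a greedy loop: repeatedly find the most cosine-similar pair of tokens and average them until a target count is reached. This is a toy illustration under assumed simplifications (one pair merged per iteration, plain averaging), not the paper's implementation.

```python
import numpy as np

def merge_tokens(tokens, target_n):
    """Iteratively average the most cosine-similar pair of tokens
    until only target_n tokens remain."""
    toks = [t.astype(float) for t in tokens]
    while len(toks) > target_n:
        mat = np.stack(toks)
        norm = mat / np.linalg.norm(mat, axis=1, keepdims=True)
        sim = norm @ norm.T                  # pairwise cosine similarity
        np.fill_diagonal(sim, -np.inf)       # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        toks[i] = (toks[i] + toks[j]) / 2    # merge the most similar pair
        toks.pop(j)
    return np.stack(toks)

rng = np.random.default_rng(1)
merged = merge_tokens(rng.standard_normal((16, 8)), target_n=4)
print(merged.shape)  # (4, 8)
```

Varying `target_n` is what lets a single model accommodate different efficiency budgets at inference time without retraining.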

@Blaizzy Blaizzy changed the title Improvements: VisionZip Improvements: Papers Dec 7, 2024