The official implementation of "Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training".
Mingjia Shi*, Yuhao Zhou*, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Tanmay Rajpurohit, Ramakrishna Vedantam, Wangbo Zhao†, Kai Wang†, Yang You
(*: equal contribution, †: corresponding authors)
🌟🌟 Mingjia, Ruiji, Zekai, and Zhiyuan are looking for Ph.D. positions; many thanks for considering their applications.
- Why is Mamba sensitive to token reduction?
- Why does R-MeeTo (i.e., Merging + Re-training) work?
The answer to both is key knowledge loss.
(Video: video_pre_v5.mp4)
Key knowledge loss is the main cause of the heavier performance drop after token reduction is applied to Mamba. R-MeeTo is thus proposed: it quickly rebuilds the key knowledge and thereby recovers performance.
R-MeeTo is simple and effective, with only two main modules: merging and re-training. Merging lowers the knowledge loss, while re-training quickly rebuilds Mamba's knowledge structure.
(Video: video_pre_method.mp4)
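For intuition, here is a minimal sketch of the merging step in the spirit of ToMe's bipartite soft matching. `bipartite_merge` is a hypothetical helper, not the repo's implementation, and it omits the order-preserving details that matter for Mamba's sequential scan:

```python
import torch
import torch.nn.functional as F

def bipartite_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """ToMe-style bipartite soft matching, reduced to a sketch.
    x: (B, N, C) tokens; merges the r most similar (a, b) pairs.
    Assumes r <= N // 2. The real merge must also preserve token
    order, which matters for Mamba's scan; this sketch does not."""
    B, N, C = x.shape
    a, b = x[:, 0::2], x[:, 1::2]                        # alternate split into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)
    best_val, best_idx = sim.max(dim=-1)                 # best b-match per a-token
    order = best_val.argsort(dim=-1, descending=True)    # most similar pairs first
    merged, kept = order[:, :r], order[:, r:]            # a-tokens to merge / to keep
    dst = b.clone()
    src = a.gather(1, merged.unsqueeze(-1).expand(-1, -1, C))
    dst_idx = best_idx.gather(1, merged).unsqueeze(-1).expand(-1, -1, C)
    # average each merged a-token into its matched b-token
    dst.scatter_reduce_(1, dst_idx, src, reduce="mean", include_self=True)
    kept_a = a.gather(1, kept.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([kept_a, dst], dim=1)               # (B, N - r, C)
```

Merging (unlike pruning) folds the discarded tokens' information into their nearest neighbors, which is why it preserves more key knowledge before re-training takes over.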
Figure: Sketch of the analysis: Mamba is sensitive to token reduction. The experiments on (i) token reduction are conducted with DeiT-S (Transformer) and Vim-S (Mamba) on ImageNet-1K. In the experiment on (ii) shuffled tokens, the reduction ratios are 0.14 for Vim-Ti and 0.31 for Vim-S/Vim-B. The shuffle strategy is an odd-even shuffle: [0,1,2,3] → [0,2] + [1,3] → [0,2,1,3]. The empirical results for I(X;Y), the mutual information between the inputs X and outputs Y of the attention block and the SSM, are measured by MINE on the middle layers of DeiT-S and Vim-S (the 7th of 12 layers and the 14th of 24 layers, respectively). See this implementation repo of MINE.
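A minimal sketch of that odd-even shuffle on a token sequence (assuming tokens of shape (B, N, C); not the exact analysis code):

```python
import torch

def odd_even_shuffle(x: torch.Tensor) -> torch.Tensor:
    # [0,1,2,3] -> [0,2] + [1,3] -> [0,2,1,3]
    # x: (B, N, C) token sequence; concatenate even- then odd-indexed tokens
    return torch.cat([x[:, 0::2], x[:, 1::2]], dim=1)
```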
Abstract: Vision Mamba (e.g., Vim) has been successfully integrated into computer vision, and token reduction has yielded promising outcomes in Vision Transformers (ViTs). However, token reduction performs less effectively on Vision Mamba than on ViTs. Pruning informative tokens in Mamba leads to a high loss of key knowledge and a drop in performance, making it a poor solution for enhancing efficiency. Token merging, which preserves more token information than pruning, has demonstrated commendable performance in ViTs, but vanilla merging performance also decreases as the reduction ratio increases, failing to maintain the key knowledge and performance in Mamba. Re-training the token-merged model, which effectively rebuilds the key knowledge, enhances the performance of Mamba. Empirically, pruned Vims recovered on ImageNet-1K by our proposed framework R-MeeTo drop only up to 0.9% accuracy in our main evaluation. We show how simply and effectively fast recovery can be achieved at minute level; in particular, Vim-Ti gains a 35.9% accuracy spike over 3 epochs of training. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% accuracy with a 1.2$\times$ (up to 1.5$\times$) speed-up in inference.
2024.12.12: The code is released.
| Hardware | Vim-Ti | Vim-S | Vim-B |
|---|---|---|---|
| 1 x 8 x H100 (single machine) | 16.2 mins | 25.2 mins | 57.6 mins |
| 2 x 8 x H100 (InfiniBand) | 8.1 mins | 12.9 mins | 30.6 mins |
| 4 x 8 x H100 (InfiniBand) | 4.2 mins | 6.8 mins | 16.9 mins |
Wall-clock time in minutes for re-training Vim-Ti, Vim-S, and Vim-B for 3 epochs with R-MeeTo on three hardware configurations. Give us minutes, and we give back a faster Mamba.
- For the image dataset, we use ImageNet-1K; a loading sketch follows this list.
- For the video dataset, we use K400. You can download it from OpenDataLab or its official website. We follow the data list from here to split the dataset.
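For reference, a sketch of loading ImageNet-1K under the standard ImageFolder layout; this layout is an assumption, so check the repo's data loader for the exact structure it expects:

```python
from torchvision import datasets, transforms

# Assumed standard ImageFolder layout (an assumption, not verified against
# the repo's loader):
#   imagenet/train/<wnid>/*.JPEG
#   imagenet/val/<wnid>/*.JPEG
train_set = datasets.ImageFolder(
    "imagenet/train",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)
```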
git clone https://github.com/NUS-HPC-AI-Lab/R-MeeTo
conda env create -f environment.yml
or install the necessary packages via requirements.txt:
conda create -n R_MeeTo python=3.10.12
conda activate R_MeeTo
pip install -r requirements.txt
- For the Vim baseline: pip install the mamba package and causal-conv1d (version 1.1.1) from the Vim repo.
git clone https://github.com/hustvl/Vim
cd Vim
pip install -e causal-conv1d
pip install -e mamba-1p1p1
- For the VideoMamba baseline: pip install the mamba package and causal-conv1d (version 1.1.0) from the VideoMamba repo.
git clone https://github.com/OpenGVLab/VideoMamba
cd VideoMamba
pip install -e causal-conv1d
pip install -e mamba
See PRETRAINED for downloading the pretrained model of our baseline.
bash ./image_task/exp_sh/tab2/vim_tiny.sh
bash ./video_task/exp_sh/tab13/videomamba_tiny.sh
See CKPT to find our reproduced checkpoints and logs of the main results.
R-MeeTo effectively optimizes inference speed and is adaptable to consumer-level, enterprise-level, and other high-performance devices. See this example for testing FLOPs (G) and throughput (im/s).
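For a rough idea of how throughput (im/s) can be measured, here is a minimal probe following the usual warm-up-then-time pattern; `throughput` is a hypothetical helper, not the repo's benchmarking script:

```python
import time
import torch

@torch.no_grad()
def throughput(model, batch_size=128, img_size=224, reps=30, device="cuda"):
    """Measure images/second on a synthetic batch (sketch, not the repo's script)."""
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)
    for _ in range(10):               # warm-up to stabilize CUDA kernels
        model(x)
    torch.cuda.synchronize()          # make sure timing excludes queued work
    start = time.time()
    for _ in range(reps):
        model(x)
    torch.cuda.synchronize()
    return reps * batch_size / (time.time() - start)
```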
See this example for visualizing merged tokens on the ImageNet-1K val set using a re-trained Vim-S.
If you find our work useful, please consider citing us.
@misc{shi2024faster,
title={Faster Vision Mamba is Rebuilt in Minutes Via Merged Token Re-training},
author={Shi, Mingjia and Zhou, Yuhao and Yu, Ruiji and Li, Zekai and Liang, Zhiyuan and Zhao, Xuanlei and
Peng, Xiaojiang and Rajpurohit, Tanmay and Vedantam, Ramakrishna and
Zhao, Wangbo and Wang, Kai and You, Yang},
year={2024},
eprint={2412.12496},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2412.12496},
}
This repo is built in part on ToMe, Vision Mamba, and VideoMamba. We are grateful for their generous contributions to open source.