The official implementation of "Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training".
Mingjia Shi*, Yuhao Zhou*, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Tanmay Rajpurohit, Ramakrishna Vedantam, Wangbo Zhao†, Kai Wang†, Yang You
(*: equal contribution, †: corresponding authors)
🌟🌟 Mingjia, Ruiji, Zekai, and Zhiyuan are looking for Ph.D. positions; many thanks for considering their applications.
- Why is Mamba sensitive to token reduction?
- Why does R-MeeTo (i.e., Merging + Re-training) work?
The answer to both is key knowledge loss.
(Video: video_pre_v5.mp4)
Key knowledge loss is the main cause of the heavier performance drop after token reduction is applied to Mamba. R-MeeTo is thus proposed: it quickly rebuilds the key knowledge and thereby recovers performance.
R-MeeTo is simple and effective, with only two main modules: merging and re-training. Merging lowers the knowledge loss, while re-training quickly rebuilds Mamba's knowledge structure.
(Video: video_pre_method.mp4)
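For intuition, here is a minimal sketch of the merging step in the spirit of ToMe's bipartite soft matching. `bipartite_merge` is a hypothetical helper, not the repo's implementation, and it omits the order-preserving details that matter for Mamba's sequential scan:

```python
import torch
import torch.nn.functional as F

def bipartite_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """ToMe-style bipartite soft matching, reduced to a sketch.
    x: (B, N, C) tokens; merges the r most similar (a, b) pairs.
    Assumes r <= N // 2. The real merge must also preserve token
    order, which matters for Mamba's scan; this sketch does not."""
    B, N, C = x.shape
    a, b = x[:, 0::2], x[:, 1::2]                        # alternate split into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)
    best_val, best_idx = sim.max(dim=-1)                 # best b-match per a-token
    order = best_val.argsort(dim=-1, descending=True)    # most similar pairs first
    merged, kept = order[:, :r], order[:, r:]            # a-tokens to merge / to keep
    dst = b.clone()
    src = a.gather(1, merged.unsqueeze(-1).expand(-1, -1, C))
    dst_idx = best_idx.gather(1, merged).unsqueeze(-1).expand(-1, -1, C)
    # average each merged a-token into its matched b-token
    dst.scatter_reduce_(1, dst_idx, src, reduce="mean", include_self=True)
    kept_a = a.gather(1, kept.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([kept_a, dst], dim=1)               # (B, N - r, C)
```

Merging (unlike pruning) folds the discarded tokens' information into their nearest neighbors, which is why it preserves more key knowledge before re-training takes over.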
Figure: Sketch of the analysis: Mamba is sensitive to token reduction. The experiments on (i) token reduction are conducted with DeiT-S (Transformer) and Vim-S (Mamba) on ImageNet-1K. In the experiment on (ii) shuffled tokens, the reduction ratios are 0.14 for Vim-Ti and 0.31 for Vim-S/Vim-B. The shuffle strategy is an odd-even shuffle: [0,1,2,3] → [0,2] + [1,3] → [0,2,1,3]. The empirical results for I(X;Y), the mutual information between the inputs X and outputs Y of the attention block and the SSM, are measured by MINE on the middle layers of DeiT-S and Vim-S (the 7th of 12 layers and the 14th of 24 layers, respectively). See this implementation repo of MINE.
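A minimal sketch of that odd-even shuffle on a token sequence (assuming tokens of shape (B, N, C); not the exact analysis code):

```python
import torch

def odd_even_shuffle(x: torch.Tensor) -> torch.Tensor:
    # [0,1,2,3] -> [0,2] + [1,3] -> [0,2,1,3]
    # x: (B, N, C) token sequence; concatenate even- then odd-indexed tokens
    return torch.cat([x[:, 0::2], x[:, 1::2]], dim=1)
```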
Abstract: Vision Mamba (e.g., Vim) has been successfully integrated into computer vision, and token reduction has yielded promising outcomes in Vision Transformers (ViTs). However, token reduction performs less effectively on Vision Mamba than on ViTs. Pruning informative tokens in Mamba leads to a high loss of key knowledge and a drop in performance, making it a poor solution for enhancing efficiency. Token merging, which preserves more token information than pruning, has demonstrated commendable performance in ViTs, but vanilla merging performance also decreases as the reduction ratio increases, failing to maintain the key knowledge and performance in Mamba. Re-training the token-merged model, which effectively rebuilds the key knowledge, enhances the performance of Mamba. Empirically, pruned Vims recovered on ImageNet-1K by our proposed framework R-MeeTo drop only up to 0.9% accuracy in our main evaluation. We show how simply and effectively fast recovery can be achieved at minute level; in particular, Vim-Ti gains a 35.9% accuracy spike over 3 epochs of training. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% accuracy with a 1.2$\times$ (up to 1.5$\times$) speed-up in inference.
2024.12.12: The code is released.
| Hardware | Vim-Ti | Vim-S | Vim-B |
|---|---|---|---|
| 1 x 8 x H100 (single machine) | 16.2 mins | 25.2 mins | 57.6 mins |
| 2 x 8 x H100 (InfiniBand) | 8.1 mins | 12.9 mins | 30.6 mins |
| 4 x 8 x H100 (InfiniBand) | 4.2 mins | 6.8 mins | 16.9 mins |
Wall-clock time in minutes for re-training Vim-Ti, Vim-S, and Vim-B for 3 epochs with R-MeeTo on three hardware configurations. Give us minutes, and we give back a faster Mamba.
- For the image dataset, we use ImageNet-1K; a loading sketch follows this list.
- For the video dataset, we use K400. You can download it from OpenDataLab or its official website. We follow the data list from here to split the dataset.
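For reference, a sketch of loading ImageNet-1K under the standard ImageFolder layout; this layout is an assumption, so check the repo's data loader for the exact structure it expects:

```python
from torchvision import datasets, transforms

# Assumed standard ImageFolder layout (an assumption, not verified against
# the repo's loader):
#   imagenet/train/<wnid>/*.JPEG
#   imagenet/val/<wnid>/*.JPEG
train_set = datasets.ImageFolder(
    "imagenet/train",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)
```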
git clone https://github.com/NUS-HPC-AI-Lab/R-MeeTo
conda env create -f environment.yml
or install the necessary packages via requirements.txt:
conda create -n R_MeeTo python=3.10.12
conda activate R_MeeTo
pip install -r requirements.txt
- For the Vim baseline: pip install the mamba package and causal-conv1d (version 1.1.1) from the Vim repo.
git clone https://github.com/hustvl/Vim
cd Vim
pip install -e causal-conv1d
pip install -e mamba-1p1p1
- For the VideoMamba baseline: pip install the mamba package and causal-conv1d (version 1.1.0) from the VideoMamba repo.
git clone https://github.com/OpenGVLab/VideoMamba
cd VideoMamba
pip install -e causal-conv1d
pip install -e mamba
See PRETRAINED for downloading the pretrained model of our baseline.
bash ./image_task/exp_sh/tab2/vim_tiny.sh
bash ./video_task/exp_sh/tab13/videomamba_tiny.sh
See CKPT to find our reproduced checkpoints and logs of the main results.
R-MeeTo effectively optimizes inference speed and is adaptable to consumer-level, enterprise-level, and other high-performance devices. See this example for testing FLOPs (G) and throughput (im/s).
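For a rough idea of how throughput (im/s) can be measured, here is a minimal probe following the usual warm-up-then-time pattern; `throughput` is a hypothetical helper, not the repo's benchmarking script:

```python
import time
import torch

@torch.no_grad()
def throughput(model, batch_size=128, img_size=224, reps=30, device="cuda"):
    """Measure images/second on a synthetic batch (sketch, not the repo's script)."""
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)
    for _ in range(10):               # warm-up to stabilize CUDA kernels
        model(x)
    torch.cuda.synchronize()          # make sure timing excludes queued work
    start = time.time()
    for _ in range(reps):
        model(x)
    torch.cuda.synchronize()
    return reps * batch_size / (time.time() - start)
```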
See this example for visualizing merged tokens on the ImageNet-1K val set using a re-trained Vim-S.
If you find our work useful, please consider citing us.
@misc{shi2024faster,
title={Faster Vision Mamba is Rebuilt in Minutes Via Merged Token Re-training},
author={Shi, Mingjia and Zhou, Yuhao and Yu, Ruiji and Li, Zekai and Liang, Zhiyuan and Zhao, Xuanlei and
Peng, Xiaojiang and Rajpurohit, Tanmay and Vedantam, Ramakrishna and
Zhao, Wangbo and Wang, Kai and You, Yang},
year={2024},
eprint={2412.12496},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2412.12496},
}
This repo is built in part on ToMe, Vision Mamba, and VideoMamba. We are grateful for their generous contributions to open source.