LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
This repository hosts the code, data, and model weights of LLaVA-UHD v2, an advanced MLLM centered on a hierarchical window transformer that captures diverse visual granularity by constructing and integrating a high-resolution feature pyramid. Notably, our model, built on LLaVA-UHD, brings an average boost of 3.7% across 14 benchmarks over the baseline method (9.3% on DocVQA, for instance). Visit our 📃 paper here!
LLaVA-UHD v2 includes two key components:
(i) an inverse feature pyramid, constructed by a ViT-derived feature up-sampling process that exploits high-frequency details from an image pyramid;
(ii) hierarchical window attention, which focuses on a set of key sampling features within cross-scale windows to condense multi-level feature maps (see the conceptual sketch below).
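For intuition, the sketch below shows one way such a condensation step could look in PyTorch: a fixed set of learnable query tokens attends to features gathered from a multi-level pyramid that has been resampled onto a common window grid. This is an illustrative sketch only; the class, its parameters, and the pooling-based key sampling are our assumptions and do not mirror the repository's actual implementation.

```python
# Illustrative sketch (NOT the repository's implementation): learnable queries
# condense a multi-level feature pyramid via attention over a common window grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalWindowAttentionSketch(nn.Module):
    def __init__(self, dim=1024, num_queries=144, num_heads=8, window=12):
        super().__init__()
        self.window = window  # each pyramid level is resampled to a window x window grid
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pyramid):
        """pyramid: list of feature maps [(B, C, H_l, W_l), ...] at different scales."""
        B = pyramid[0].shape[0]
        keys = []
        for feat in pyramid:
            grid = F.adaptive_avg_pool2d(feat, self.window)   # (B, C, w, w) -- stand-in for key-feature sampling
            keys.append(grid.flatten(2).transpose(1, 2))      # (B, w*w, C)
        keys = torch.cat(keys, dim=1)                         # (B, L*w*w, C): cross-scale key set
        q = self.queries.unsqueeze(0).expand(B, -1, -1)       # (B, Nq, C)
        condensed, _ = self.attn(q, keys, keys)               # (B, Nq, C): condensed visual tokens
        return condensed

# Example: condense a 3-level pyramid from a 336x336 image into 144 tokens.
feats = [torch.randn(1, 1024, s, s) for s in (24, 48, 96)]
print(HierarchicalWindowAttentionSketch()(feats).shape)  # torch.Size([1, 144, 1024])
```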
- [2024/12/19] 🔥LLaVA-UHD v2 achieves superior performance over existing MLLMs on 13 popular benchmarks. Notably, our design brings an average boost of 3.7% across 14 benchmarks compared with the baseline method (LLaVA-UHD), 9.3% on DocVQA for instance. Model checkpoints and LLaVA-UHD-v2-SFT-Data are available on Hugging Face.
- [2024/07/29] LLaVA-UHD achieves performance improvements over LLaVA-1.5 on 8 common benchmarks.
Our novel projector, the spatially constrained resampler, realizes high feature compression and convergence efficiency.
Model checkpoints are available on Hugging Face.
You can find the original project instructions and code of LLaVA-UHD in the LLaVA-UHD-v1 branch.
- [2024/07/01] 📢LLaVA-UHD is accepted by ECCV 2024.
- To reproduce the results of the paper, please set up the Python environment using the following code:
conda create -n llava-uhd python=3.10
conda activate llava-uhd
sh install.sh
- Download the checkpoints of CLIP-ViT-L/14-336 and Vicuna-7B-v1.5 and put them into ./pretrained_models. In the vicuna-7b-v1.5 checkpoint directory, set 'do_sample' in 'generation_config.json' to true; otherwise an error occurs when saving training checkpoints.
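If you prefer to script this edit, here is a minimal sketch (not part of the repository), assuming the checkpoint sits under ./pretrained_models/vicuna-7b-v1.5:

```python
# Minimal sketch: enable do_sample in Vicuna's generation_config.json.
# Assumes the checkpoint was placed under ./pretrained_models/vicuna-7b-v1.5.
import json

cfg_path = "./pretrained_models/vicuna-7b-v1.5/generation_config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["do_sample"] = True  # avoids the error raised when saving training checkpoints

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```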
If something goes wrong, please refer to the issues of LLaVA or open an issue in our repository.
- JBU module pre-training Data: Download MS-COCO stuff 2017.
- Pretraining Data: download the 558K subset of the LAION-CC-SBU dataset with BLIP captions that we use in the paper here, and put the data into ./playground/data.
- Fine-tuning Data: please download all images and the instruction-tuning annotations llava-uhd-v2-sft-data.json from LLaVA-UHD-v2-SFT-Data, and place them in ./playground/data.
We organize the data in the same way as the official LLaVA code; refer to it if necessary. An illustrative layout is sketched below.
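The folder and file names in this sketch are assumptions based on LLaVA's official data organization; check the downloaded releases for the exact structure.

```
./playground/data
├── blip_laion_cc_sbu_558k.json   # pretraining annotations (558K LAION-CC-SBU subset)
├── images/                       # pretraining images
├── llava-uhd-v2-sft-data.json    # instruction-tuning annotations
└── ...                           # image folders referenced by the SFT annotations (e.g. coco/, gqa/, textvqa/)
```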
- JBU module pre-training:
Please use jbu-pretrain.sh; all hyperparameters are in ./featup/configs/jbu_upsampler.yaml. Alternatively, you can directly use our pretrained JBU module of CLIP-ViT-L/14-336.
sh jbu-pretrain.sh
- Model training: please refer to train.sh for the pretraining and fine-tuning scripts (commented in the file). If you want to do end-to-end pretraining, fine-tuning, and evaluation, run the following command. You can directly use our pretrained multimodal_projector.
sh model-train.sh
1. Evaluation scripts: we provide scripts to evaluate MME, AI2D, DocVQA, ChartVQA, TextVQA, GQA, and SciQA-IMG. You can run them via eval.sh:
sh eval.sh dir_name_in_checkpoints_new
# e.g. sh eval.sh llava-uhd-144-7b
# llava-uhd-144-7b is the dir_name stored in the path of ./checkpoints_new
Details of data organization:
- Please refer to here for help; we provide the same scripts to complete the testing.
- For DocVQA and ChartVQA, please download the images from ureader-instruction-1.0 and the annotations from LLaVA-UHD-v2-Evaluation, which are also constructed from ureader-instruction-1.0.
2. VLMEvalKit: we use VLMEvalKit to evaluate OCR-Bench, MMMU-val, SEED-Image, MMB, RealWorldQA, and HR-Bench. We integrate VLMEvalKit into this repository for better reproducibility. You can follow the setup instructions of VLMEvalKit and evaluate our model with this script:
sh VLMEvalKit/eval.sh
To use LLaVA-UHD v1, you can follow the original project instructions and code in the LLaVA-UHD-v1 branch, or simply set the following hyperparameters in the training script to switch the training mode to LLaVA-UHD v1:
--mm_projector_type adapt_spatial_resampler_v1
--feature_mode uhd_v1
If you find LLaVA-UHD v2 useful for your research and applications, please cite using this BibTeX:
@inproceedings{guo2024llava-uhd,
title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
author={Guo, Zonghao and Xu, Ruyi and Yao, Yuan and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
booktitle={ECCV},
year={2024}
}
@article{zhang2024llavauhdv2,
title={LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer},
author={Yipeng Zhang and Yifan Liu and Zonghao Guo and Yidan Zhang and Xuesong Yang and Chi Chen and Jun Song and Bo Zheng and Yuan Yao and Zhiyuan Liu and Tat-Seng Chua and Maosong Sun},
journal={arXiv preprint arXiv:2412.13871},
year={2024}
}