forked from thunlp/LLaVA-UHD
Commit 8eb1c97, committed by xuruyi on Mar 18, 2024.
Showing 147 changed files with 34,993 additions and 0 deletions.
@@ -0,0 +1,66 @@
# LLaVA-UHD

We present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and at high resolution.
Notably, our model built on LLaVA-1.5 336×336 supports images at 6 times larger resolution (i.e., 672×1088) using only 94% of the inference computation, and achieves a 6.4-point accuracy improvement on TextVQA. Moreover, the model can be trained efficiently in 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5).

## Overview

LLaVA-UHD includes three key components:
- An image modularization strategy that divides native-resolution images into smaller, variable-sized slices for efficient and extensible encoding (a toy sketch of this step follows the list).
- A compression module that further condenses the image tokens produced by the visual encoder.
- A spatial schema that organizes slice tokens for the LLM.

Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks.
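Below is a minimal sketch of the slicing idea: choose a grid of slices so that each slice stays close to the 336×336 input of the LLaVA-1.5 vision encoder while respecting the image's aspect ratio. The function and its scoring rule are illustrative assumptions, not the repository's actual algorithm.

```python
import math

BASE = 336  # native input resolution of the vision encoder used by LLaVA-1.5


def choose_slice_grid(width: int, height: int, base: int = BASE) -> tuple[int, int]:
    """Pick a (cols, rows) grid using roughly ceil(W*H / base^2) slices,
    preferring grids whose slices are close to square (illustrative heuristic)."""
    n_ideal = max(1, math.ceil(width * height / (base * base)))
    candidates = []
    for n in (n_ideal - 1, n_ideal, n_ideal + 1):  # allow a little slack around the ideal count
        if n < 1:
            continue
        for cols in range(1, n + 1):
            if n % cols:
                continue
            rows = n // cols
            slice_w, slice_h = width / cols, height / rows
            # Score by how far each slice's shape drifts from a square encoder input.
            candidates.append((abs(math.log(slice_w / slice_h)), cols, rows))
    _, cols, rows = min(candidates)
    return cols, rows


# Example: the 672x1088 resolution quoted above maps to a 2x3 grid of ~336px-wide slices.
print(choose_slice_grid(672, 1088))
```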

## Preparing
To reproduce the results of the paper, please set up the Python environment using the following commands:
```bash
conda create -n llava-uhd python=3.10
conda activate llava-uhd
pip install -r requirements.txt
```

## Pretraining Code
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions that we use in the paper [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).

Please refer to the LLaVA-1.5 documentation to set up the environment in the same way, organize the training data accordingly, and place it under ./playground. Then run the following script to start pretraining:

```bash
bash scripts/pretrain.sh
```
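Before launching, it can help to peek at the downloaded annotation file. The path and field names below follow the usual LLaVA-style annotation format and are assumptions; adjust them to your local ./playground layout.

```python
import json
from pathlib import Path

# Assumed location of the BLIP-caption annotations inside ./playground.
ann_path = Path("playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json")

records = json.loads(ann_path.read_text())
print(f"{len(records)} pretraining samples")

sample = records[0]
print(sample["image"])          # image path relative to the dataset root (assumed field name)
print(sample["conversations"])  # a single human/gpt caption exchange (assumed field name)
```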

## Fine-tuning Code

Please download the annotation file for the final mixture of our instruction-tuning data, [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json), and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: images from the download script (we save all files as .jpg)
- TextVQA: train_val_images
- VisualGenome: part1, part2

Download the dataset images as in the LLaVA-1.5 fine-tuning process and place them under ./playground (a layout sketch follows the command below), then run:
```bash
bash scripts/finetune.sh
```
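As referenced above, here is a small sanity check of the assumed data layout. The directory names mirror LLaVA-1.5's conventions and are assumptions; adapt them if your ./playground is organized differently.

```python
from pathlib import Path

root = Path("playground/data")  # assumed data root
expected = [
    "llava_v1_5_mix665k.json",  # annotation file downloaded above
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]
for rel in expected:
    status = "ok" if (root / rel).exists() else "MISSING"
    print(f"{status:7s} {root / rel}")
```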

## Evaluation Code

For evaluation we largely reuse the testing code of LLaVA-1.5, and the basic usage is the same. Please refer to [here](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#evaluation) for help. We provide the same scripts to run the tests.
@@ -0,0 +1 @@
from .model import LlavaLlamaForCausalLM
@@ -0,0 +1,13 @@
# Serving constants (heartbeat timing between the controller and model workers).
CONTROLLER_HEART_BEAT_EXPIRATION = 30   # seconds before the controller drops a silent worker
WORKER_HEART_BEAT_INTERVAL = 15         # seconds between worker heartbeats

LOGDIR = "."

# Model Constants
IGNORE_INDEX = -100                     # label value ignored by the training loss
IMAGE_TOKEN_INDEX = -200                # sentinel id marking where image features are spliced in
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"
IMAGE_PLACEHOLDER = "<image-placeholder>"
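To illustrate how these sentinels are typically used in LLaVA-style code, here is a small, hypothetical helper (not the repository's own implementation) that tokenizes a prompt around DEFAULT_IMAGE_TOKEN and splices IMAGE_TOKEN_INDEX into the ids; the model later replaces that sentinel with projected image features.

```python
# Illustrative sketch only; the function name and behavior are assumptions, not this repo's API.
DEFAULT_IMAGE_TOKEN = "<image>"
IMAGE_TOKEN_INDEX = -200


def splice_image_token(prompt, tokenizer):
    """Tokenize the text around each "<image>" marker and insert IMAGE_TOKEN_INDEX
    where the image goes, so the model knows where to place visual features."""
    pieces = prompt.split(DEFAULT_IMAGE_TOKEN)
    ids = tokenizer(pieces[0]).input_ids  # keeps BOS if the tokenizer adds one
    for piece in pieces[1:]:
        ids += [IMAGE_TOKEN_INDEX] + tokenizer(piece, add_special_tokens=False).input_ids
    return ids


# Usage (with any Hugging Face tokenizer):
#   ids = splice_image_token("USER: <image>\nWhat is in the picture? ASSISTANT:", tok)
```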