A curated list of awesome Document Understanding resources, including papers, code, and datasets.
Continuously updated.
MiniCPM
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (OpenBMB) | 24.8.3
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (THU,ModelBest) | 24.4.9
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (THU,NUS,UCAS) | 24.3.18
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (THU,ShanghaiAILab,Zhihu,ModelBest) | 23.8.23
VILA
- NVILA: Efficient Frontier Visual Language Models (NVIDIA,MIT,UCB,TW,THU) | 24.12.5
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (THU,MIT,NVIDIA,UCB,UCSD) | 24.9.6
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos (NVIDIA,MIT,KAUST) | 24.8.19
- VILA2: VILA Augmented VILA (NVIDIA,MIT,UT-Austin) | 24.7.24
- X-VILA: Cross-Modality Alignment for Large Language Model (NVIDIA,HKUST,MIT) | 24.5.29
- VILA: On Pre-training for Visual Language Models (NVIDIA,MIT) | 23.12.12
LLaMA
- The Llama 3 Herd of Models (Meta) | 24.7.13
- Llama 2: Open Foundation and Fine-Tuned Chat Models (Meta) | 23.7.18
- LLaMA: Open and Efficient Foundation Language Models (Meta) | 23.2.27
Qwen
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Alibaba) | 24.9.18
- Qwen2 Technical Report (Alibaba) | 24.7.15
- Qwen Technical Report (Alibaba) | 23.9.28
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Alibaba) | 23.8.24
InternLM
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (Shanghai AI Lab) | 24.7.3
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (Shanghai AI Lab) | 24.4.9
- InternLM2 Technical Report (Shanghai AI Lab) | 24.3.26
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab) | 24.1.29
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (Shanghai AI Lab) | 23.9.26
- InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities (Shanghai AI Lab) | 23.6.3
2024
- TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens (Huawei) | 24.10.7 | arXiv | Code
- DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models (KAIST,AWS) | 24.10.4 | arXiv | Code
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding (Alibaba,RUC) | 24.9.5 | arXiv | Code
- General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model (StepFun,Megvii,UCAS,THU) | 24.9.3 | arXiv | Code
- LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models (Adobe,Buffalo) | 24.7.27 | arXiv | Code
- Harmonizing Visual Text Comprehension and Generation (ECNU,ByteDance) | 24.7.23 | NIPS24 | Code
- A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding (FDU,ByteDance) | 24.7.2 | arXiv | Code
- Multimodal Table Understanding (UCAS,Baidu) | 24.06.12 | ACL24 | Code
- TRINS: Towards Multimodal Language Models that Can Read (Adobe,GIT) | 24.06.10 | CVPR24 | Code
- TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy (USTC,ByteDance) | 24.6.3 | arXiv | Code
- StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond (Baidu) | 24.5.31 | arXiv | Code
- Focus Anywhere for Fine-grained Multi-page Document Understanding (UCAS,MEGVII) | 24.5.23 | arXiv | Code
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering (ByteDance,HUST) | 24.5.20 | arXiv | Code
- Exploring the Capabilities of Large Multimodal Models on Dense Text (HUST) | 24.5.9 | arXiv | Code
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,CUHK,THU,NJU,FDU,SenseTime) | 24.4.25 | arXiv | Code
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding (USTC) | 24.4.15 | arXiv | Code
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (Shanghai AI Lab,CUHK,THU,SenseTime) | 24.4.9 | arXiv | Code
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding (Alibaba,ZJU) | 24.4.8 | arXiv | Code
- Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (CUHK,Shanghai AI Lab,SenseTime) | 24.3.25 | arXiv | Code
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (Alibaba,RUC) | 24.3.19 | arXiv | Code
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (HUST) | 24.3.7 | arXiv | Code
- HRVDA: High-Resolution Visual Document Assistant (Tencent YouTu Lab,USTC) | 24.2.29 | CVPR24
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models (Tencent YouTu Lab) | 24.2.29 | CVPR24
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab,CUHK,SenseTime) | 24.1.29 | arXiv | Code
- Small Language Model Meets with Reinforced Vision Vocabulary (MEGVII,UCAS,HUST) | 24.1.23 | arXiv | Code
2023
- DocLLM: A layout-aware generative language model for multimodal document understanding (JPMorgan AI Research) | 23.12.31 | arXiv
- Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models (MEGVII,UCAS,HUST) | 23.12.11 | ECCV24 | Code
- mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model (Alibaba) | 23.11.30 | arXiv | Code
- Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs (USTC) | 23.11.22 | arXiv | Code
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding (USTC,ByteDance) | 23.11.20 | arXiv
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models (HUST) | 23.11.11 | CVPR24 | Code
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration (Alibaba) | 23.11.07 | CVPR24 | Code
- Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation (SCUT) | 23.10.25 | arXiv | Code
- UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model (DAMO,RUC,ECNU) | 23.10.08 | arXiv | Code
- Kosmos-2.5: A Multimodal Literate Model (MSRA) | 23.9.20 | arXiv | Code
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions (UC San Diego) | 23.8.19 | AAAI24 | Code
- UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding (USTC,ByteDance) | 23.8.19 | arXiv
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding (DAMO) | 23.7.4 | arXiv | Code
- Document Understanding Dataset and Evaluation (DUDE) | 23.5.15 | arXiv | Website
- On the Hidden Mystery of OCR in Large Multimodal Models (HUST,SCUT,Microsoft) | 23.5.13 | arXiv | Code
- Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution (HUST) | 23.5.12 | arXiv | Code
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training (Baidu) | 23.03.01 | ICLR23 | Code
2022
- Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding (Huawei) | 22.12.19 | ACL23
- Unifying Vision, Text, and Layout for Universal Document Processing (Microsoft) | 22.12.05 | CVPR23 | Code
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding (Baidu) | 22.10.12 | arXiv | Code
- Unified Pretraining Framework for Document Understanding (Adobe) | 22.04.22 | NIPS21
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (Microsoft) | 22.04.18 | ACM MM22 | Code
- XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding (Alibaba) | 22.3.14 | CVPR22 | Code (Unofficial)
- DiT: Self-supervised Pre-training for Document Image Transformer (Microsoft) | 22.03.04 | ACM MM22 | Code
- Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark (Huawei) | 22.2.14 | NIPS22 | Code
2021
- LayoutReader: Pre-training of Text and Layout for Reading Order Detection (Microsoft) | 21.08.26 | EMNLP21 | Code
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding (Microsoft) | 21.04.18 | arXiv | Code
- Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer (Applica) | 21.02.18 | ICDAR21 | Code
2020
- LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding (Microsoft) | 20.12.29 | arXiv | Code
2019
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding (Microsoft) | 19.12.31 | KDD20 | Code
2024
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (OpenBMB) | 24.8.3
- X-VILA: Cross-Modality Alignment for Large Language Model (NVIDIA,HKUST,MIT) | 24.5.29 | arXiv | Code
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,SenseTime,THU,NJU,FDU,CUHK) | 24.04.25 | arXiv | Code
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (THU,ModelBest) | 24.4.9
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (CUHK,SmartMore) | 24.3.27 | Code
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (THU,NUS,UCAS) | 24.03.18 | arXiv | Code
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models (XMU) | 24.03.05 | arXiv | Code
- DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models (CUHK,Shanghai AI Lab) | 24.2.22 | arXiv | Code
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab) | 24.01.29 | arXiv | Code
2023
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (OpenGVLab,NJU,HKU,CUHK,THU,USTC,SenseTime) | 23.12.21 | CVPR24 | Code
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts (UWM,Cruise LLC) | 23.12.01 | CVPR24 | Code
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions (USTC,Shanghai AI Lab) | 23.11.28 | arXiv | Code
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (KAUST,Meta) | 23.10.14 | arXiv | Code
- Improved Baselines with Visual Instruction Tuning (UWM,Microsoft) | 23.10.05 | arXiv | Code
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (Shanghai AI Lab) | 23.09.26 | arXiv | Code
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Alibaba) | 23.08.24 | arXiv | Code
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning (Salesforce) | 23.05.11 | arXiv | Code
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (DAMO) | 23.04.27 | arXiv | Code
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (KAUST) | 23.04.20 | arXiv | Code
- Visual Instruction Tuning (UWM,Microsoft) | 23.04.17 | NIPS23 | Code
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action (Azure) | 23.03.20 | arXiv | Code
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Salesforce) | 23.01.30 | arXiv | Code
2022
- Flamingo: a Visual Language Model for Few-Shot Learning (DeepMind) | 22.11.15 | NIPS22 | Code
2024
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model (UCSD,HKU,NVIDIA) | 24.6.3 | arXiv | Code
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (HKU,ByteDance) | 24.4.19 | ECCV24 | Code
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (CU,UCSB,Apple) | 24.04.11 | arXiv | Code
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apple) | 24.04.08 | arXiv | Code
- GroundingGPT: Language Enhanced Multi-modal Grounding Model (ByteDance,FDU) | 24.03.05 | arXiv | Code
2023
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models (HKUST,SCUT,IDEA,CUHK) | 23.12.05 | arXiv | Code
- Ferret: Refer and Ground Anything Anywhere at Any Granularity (CU,Apple) | 23.10.11 | arXiv | Code
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs (ByteDance) | 23.07.17 | arXiv | Code
- Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (SenseTime,BUAA,SJTU) | 23.06.27 | arXiv | Code
- Kosmos-2: Grounding Multimodal Large Language Models to the World (Microsoft) | 23.06.26 | arXiv | Code
2024
- Artemis: Towards Referential Understanding in Complex Videos (UCAS,UB) | 24.6.1 | arXiv | Code
2023
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding (PKU,Noah) | 23.12.04 | CVPR24 | Code
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models (PKU,PengCheng,Microsoft,FarReel) | 23.11.27 | arXiv | Code
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (PKU,PengCheng) | 23.11.16 | arXiv | Code
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (PKU,PengCheng) | 23.11.14 | arXiv | Code
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (DAMO) | 23.06.05 | arXiv | Code