A curated list of papers about key information extraction.
Papers with Code links are preferred where available.
Contributions are welcome!

Name | Title | Links |
---|---|---|
DUE | DUE: End-to-End Document Understanding Benchmark | [link] |
RVL-CDIP | Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval | [link][download] |
SROIE | ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction | [link][download] |
FUNSD | FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents | [link][download] |
XFUND | XFUND: A Multilingual Form Understanding Benchmark | [link] |
CORD | CORD: A Consolidated Receipt Dataset for Post-OCR Parsing | [link] |
EPHOIE | Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution | [link] |
EATEN | EATEN: Entity-aware Attention for Single Shot Visual Text Extraction | [link] |
Train Ticket | PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks | [link][download] |
POIE | Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution | [link][download] |

Year | Title | Links |
---|---|---|
2023 | On the Hidden Mystery of OCR in Large Multimodal Models | [link] |
2021 | Document AI: Benchmarks, Models and Applications | [link] |

Year | Title | Links |
---|---|---|
2022 | DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding | [paper][code] |
2021 | MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding | [paper][code] |
2020 | PP-OCR: A Practical Ultra Lightweight OCR System | [paper][code] |
2024 | ANLS* -- A Universal Document Processing Metric for Generative Large Language Models | [paper][code] |

Pub. | Year | Title | Links |
---|---|---|---|
Arxiv | 2024 | mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding | [link] |
Arxiv | 2024 | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | [link] |
Arxiv | 2024 | A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | [link] |
ICML | 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | [link] |
Arxiv | 2023 | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | [link] |
Arxiv | 2023 | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | [link] |
Arxiv | 2023 | Visual Instruction Tuning | [link] |
Arxiv | 2023 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | [link] |
Arxiv | 2023 | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | [link] |
Arxiv | 2023 | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | [link] |
Arxiv | 2023 | mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | [link] |
Arxiv | 2023 | Otter: A Multi-Modal Model with In-Context Instruction Tuning | [link] |
Arxiv | 2023 | UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | [link] |
Blog | 2023 | Fuyu-8B: A Multimodal Architecture for AI Agents | [blog][model] |

Pub. | Year | Title | Links |
---|---|---|---|
ICDAR | 2023 | LayoutGCN: A Lightweight Architecture for Visually Rich Document Understanding | [paper] |
ACL-Findings | 2021 | Spatial Dependency Parsing for Semi-Structured Document Information Extraction | [link] |
Arxiv | 2021 | Spatial Dual-Modality Graph Reasoning for Key Information Extraction | [link] |
ICPR | 2020 | PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks | [link] |

Pub. | Year | Title | Links |
---|---|---|---|
ACL | 2022 | LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | [link] |
ACL | 2022 | FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction | [link] |
CVPR | 2022 | XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding | [link] |
Arxiv | 2022 | LoPE: Learnable Sinusoidal Positional Encoding for Improving Document Transformer Model | [link] |
Arxiv | 2022 | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | [link] |
Arxiv | 2022 | ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding | [link] |
AAAI | 2022 | BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents | [link] |
ICDAR | 2021 | ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents | [link][code] |
Arxiv | 2021 | TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models | [link] |
ACM-MM | 2021 | StrucTexT: Structured Text Understanding with Multi-Modal Transformers | [link] |
ACL | 2021 | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | [link] |
KDD | 2020 | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | [link] |

Pub. | Year | Title | Links |
---|---|---|---|
ICDAR | 2021 | ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents | [link] |
ICDAR | 2021 | VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach | [link] |
NeurIPS | 2019 | BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding | [link] |
EMNLP | 2018 | Chargrid: Towards Understanding 2D Documents | [link] |

Pub. | Year | Title | Links |
---|---|---|---|
ICDAR | 2023 | Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution | [link] |
ICML | 2023 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | [link] |
ECCV | 2022 | OCR-free Document Understanding Transformer | [link] |
Arxiv | 2022 | TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents | [link] |
ICCV | 2021 | DocFormer: End-to-End Transformer for Document Understanding | [link] |
ACM-MM | 2020 | TRIE: End-to-End Text Reading and Information Extraction for Document Understanding | [link] |
ICDAR | 2019 | EATEN: Entity-aware Attention for Single Shot Visual Text Extraction | [link] |

Pub. | Year | Title | Links |
---|---|---|---|
ICDAR | 2023 | Information Extraction from Documents: Question Answering vs Token Classification in real-world setups | [link] |