Awesome Key Information Extraction

A curated list of papers about key information extraction.

Papers with Code links are preferred.

Contributions are welcome!

Table of Contents

Datasets

| Name | Title | Links |
| --- | --- | --- |
| DUE | DUE: End-to-End Document Understanding Benchmark | [link] |
| RVL-CDIP | Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval | [link][download] |
| SROIE | ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction | [link][download] |
| FUNSD | FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents | [link][download] |
| XFUND | XFUND: A Multilingual Form Understanding Benchmark | [link] |
| CORD | CORD: A Consolidated Receipt Dataset for Post-OCR Parsing | [link] |
| EPHOIE | Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution | [link] |
| EATEN | EATEN: Entity-aware Attention for Single Shot Visual Text Extraction | [link] |
| Train Ticket | PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks | [link][download] |
| POIE | Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution | [link][download] |

Survey

| Year | Title | Links |
| --- | --- | --- |
| 2023 | On the Hidden Mystery of OCR in Large Multimodal Models | [link] |
| 2021 | Document AI: Benchmarks, Models and Applications | [link] |

Toolkits

| Year | Title | Links |
| --- | --- | --- |
| 2022 | DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding | [paper][code] |
| 2021 | MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding | [paper][code] |
| 2020 | PP-OCR: A Practical Ultra Lightweight OCR System | [paper][code] |
| 2024 | ANLS* -- A Universal Document Processing Metric for Generative Large Language Models | [paper][code] |

Models

⭐LLM-Based

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| Arxiv | 2024 | mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding | [link] |
| Arxiv | 2024 | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | [link] |
| Arxiv | 2024 | A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | [link] |
| ICML | 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | [link] |
| Arxiv | 2023 | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | [link] |
| Arxiv | 2023 | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | [link] |
| Arxiv | 2023 | Visual Instruction Tuning | [link] |
| Arxiv | 2023 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | [link] |
| Arxiv | 2023 | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | [link] |
| Arxiv | 2023 | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | [link] |
| Arxiv | 2023 | mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | [link] |
| Arxiv | 2023 | Otter: A Multi-Modal Model with In-Context Instruction Tuning | [link] |
| Arxiv | 2023 | UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | [link] |
| Blog | 2023 | Fuyu-8B: A Multimodal Architecture for AI Agents | [blog][model] |

Graph-Based

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ICDAR | 2023 | LayoutGCN: A Lightweight Architecture for Visually Rich Document Understanding | [paper] |
| ACL-Findings | 2021 | Spatial Dependency Parsing for Semi-Structured Document Information Extraction | [link] |
| Arxiv | 2021 | Spatial Dual-Modality Graph Reasoning for Key Information Extraction | [link] |
| ICPR | 2020 | PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks | [link] |

Transformer-Based

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ACL | 2022 | LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | [link] |
| ACL | 2022 | FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction | [link] |
| CVPR | 2022 | XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding | [link] |
| Arxiv | 2022 | LoPE: Learnable Sinusoidal Positional Encoding for Improving Document Transformer Model | [link] |
| Arxiv | 2022 | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | [link] |
| Arxiv | 2022 | ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding | [link] |
| AAAI | 2022 | BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents | [link] |
| ICDAR | 2021 | ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents | [link][code] |
| Arxiv | 2021 | TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models | [link] |
| ACM-MM | 2021 | StrucTexT: Structured Text Understanding with Multi-Modal Transformers | [link] |
| ACL | 2021 | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | [link] |
| KDD | 2020 | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | [link] |

Grid-Based

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ICDAR | 2021 | ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents | [link] |
| ICDAR | 2021 | VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach | [link] |
| NeurIPS | 2019 | BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding | [link] |
| EMNLP | 2018 | Chargrid: Towards Understanding 2D Documents | [link] |

End-to-end

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ICDAR | 2023 | Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution | [link] |
| ICML | 2023 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | [link] |
| ECCV | 2022 | OCR-free Document Understanding Transformer | [link] |
| Arxiv | 2022 | TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents | [link] |
| ICCV | 2021 | DocFormer: End-to-End Transformer for Document Understanding | [link] |
| ACM-MM | 2020 | TRIE: End-to-End Text Reading and Information Extraction for Document Understanding | [link] |
| ICDAR | 2019 | EATEN: Entity-aware Attention for Single Shot Visual Text Extraction | [link] |

Others

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ICDAR | 2023 | Information Extraction from Documents: Question Answering vs Token Classification in real-world setups | [link] |

Related Repositories
