Awesome Key Information Extraction
A curated list of papers about key information extraction.
Papers with Code links are preferred.
Contributions are welcome!

| Name | Title | Links |
| --- | --- | --- |
| DUE | DUE: End-to-End Document Understanding Benchmark | [link] |
| RVL-CDIP | Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval | [link][download] |
| SROIE | ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction | [link][download] |
| FUNSD | FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents | [link][download] |
| XFUND | XFUND: A Multilingual Form Understanding Benchmark | [link] |
| CORD | CORD: A Consolidated Receipt Dataset for Post-OCR Parsing | [link] |
| EPHOIE | Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution | [link] |
| EATEN | EATEN: Entity-aware Attention for Single Shot Visual Text Extraction | [link] |
| Train Ticket | PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks | [link][download] |
| POIE | Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution | [link][download] |

| Year | Title | Links |
| --- | --- | --- |
| 2023 | On the Hidden Mystery of OCR in Large Multimodal Models | [link] |
| 2021 | Document AI: Benchmarks, Models and Applications | [link] |

| Year | Title | Links |
| --- | --- | --- |
| 2022 | DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding | [paper][code] |
| 2021 | MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding | [paper][code] |
| 2020 | PP-OCR: A Practical Ultra Lightweight OCR System | [paper][code] |
| 2024 | ANLS* -- A Universal Document Processing Metric for Generative Large Language Models | [paper][code] |

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| Arxiv | 2024 | mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding | [link] |
| Arxiv | 2024 | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | [link] |
| Arxiv | 2024 | A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | [link] |
| ICML | 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | [link] |
| Arxiv | 2023 | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | [link] |
| Arxiv | 2023 | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | [link] |
| Arxiv | 2023 | Visual Instruction Tuning | [link] |
| Arxiv | 2023 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | [link] |
| Arxiv | 2023 | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | [link] |
| Arxiv | 2023 | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | [link] |
| Arxiv | 2023 | mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | [link] |
| Arxiv | 2023 | Otter: A Multi-Modal Model with In-Context Instruction Tuning | [link] |
| Arxiv | 2023 | UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model | [link] |
| Blog | 2023 | Fuyu-8B: A Multimodal Architecture for AI Agents | [blog][model] |

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ICDAR | 2023 | LayoutGCN: A Lightweight Architecture for Visually Rich Document Understanding | [paper] |
| ACL-Findings | 2021 | Spatial Dependency Parsing for Semi-Structured Document Information Extraction | [link] |
| Arxiv | 2021 | Spatial Dual-Modality Graph Reasoning for Key Information Extraction | [link] |
| ICPR | 2020 | PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks | [link] |

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ACL | 2022 | LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | [link] |
| ACL | 2022 | FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction | [link] |
| CVPR | 2022 | XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding | [link] |
| Arxiv | 2022 | LoPE: Learnable Sinusoidal Positional Encoding for Improving Document Transformer Model | [link] |
| Arxiv | 2022 | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | [link] |
| Arxiv | 2022 | ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding | [link] |
| AAAI | 2022 | BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents | [link] |
| ICDAR | 2021 | ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents | [link][code] |
| Arxiv | 2021 | TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models | [link] |
| ACM-MM | 2021 | StrucTexT: Structured Text Understanding with Multi-Modal Transformers | [link] |
| ACL | 2021 | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | [link] |
| KDD | 2020 | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | [link] |

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ICDAR | 2021 | ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents | [link] |
| ICDAR | 2021 | VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach | [link] |
| NIPS | 2019 | BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding | [link] |
| EMNLP | 2018 | Chargrid: Towards Understanding 2D Documents | [link] |

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ICDAR | 2023 | Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution | [link] |
| ICML | 2023 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | [link] |
| ECCV | 2022 | OCR-free Document Understanding Transformer | [link] |
| Arxiv | 2022 | TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents | [link] |
| ICCV | 2021 | DocFormer: End-to-End Transformer for Document Understanding | [link] |
| ACM-MM | 2020 | TRIE: End-to-End Text Reading and Information Extraction for Document Understanding | [link] |
| ICDAR | 2019 | EATEN: Entity-aware Attention for Single Shot Visual Text Extraction | [link] |

| Pub. | Year | Title | Links |
| --- | --- | --- | --- |
| ICDAR | 2023 | Information Extraction from Documents: Question Answering vs Token Classification in real-world setups | [link] |