Extrix is a tool that simplify data extraction from documents (PDF, Word, etc.) and outputs structured data in .json
or .csv
format. It uses a language model (LLM) to get best understanding of the context and extract data.
See test results here
overview.mp4
- Data extraction from documents (Text based or scanned image)
- Structured output
.json
or.csv
- Output validation with a model schema (Pyndatic Model or JSON Schema)
- Multi-language support
- CLI Tool
- LLM that support tooling integrations for data extraction
- LLM that not support tooling integrations for data extraction
- Web API
- Docker container
- Multiple document processing options (Unstructured, EasyOCR, TesserOCR)
- LLM monitoring (tokens consumption, cost estimation, etc.)
- Vision Language Model (VLM) support
- Python library
- ...
- Python 3.10 or higher
- Pip, Virtualenv
- Libmagic, Poppler, Tesseract
Example of installation on macOS
brew install libmagic poppler tesseract
Clone the repository using git
or download the zip file
git clone https://github.com/mendrika261/data-extraction.git
Get inside the project directory
cd data-extraction
Create a virtual environment
python -m venv venv
Activate the virtual environment
source venv/bin/activate
Install dependencies
pip install -r requirements.txt
# Pull the image
docker pull ghcr.io/mendrika261/extrix:latest
# Run with example file
docker run -v $(pwd)/data:/app/data ghcr.io/mendrika261/extrix:latest "data/example.pdf"
# Clone the repository
git clone https://github.com/mendrika261/extrix.git
cd extrix
# Build the image
docker build -t extrix .
# Run with your documents
docker run -v $(pwd)/data:/app/data extrix "data/*.pdf"
Create a .env
file with your configuration:
# LLM Provider (Required)
GOOGLE_API_KEY=your_api_key
# Optional configurations
LANGUAGES=fr,en
PDF_STRATEGY=auto
Run with your config:
docker run --env-file .env -v $(pwd)/data:/app/data extrix "data/*.pdf"
Model is a schema that defines the structure of the output data. It is used to validate the output data. It can be a Pydantic model or a JSON schema.
Description of each field in the model is important to help the LLM to understand the context and extract data.
{
"name": "Leasing",
"description": "Informations extraites d'un contrat de location",
"fields": {
"bailleur": {
"field_type": "str",
"title": "Bailleur",
"description": "prénom nom des personnes SI SEULEMENT des particuliers, sinon nom de la société UNIQUEMENT",
"required": true
},
"preneur": {
"field_type": "str",
"title": "Preneur",
"description": "prénom nom des personnes SI SEULEMENT des particuliers, sinon nom de la société UNIQUEMENT",
"required": true
},
"adresse": {
"field_type": "str",
"title": "Adresse du bien loué",
"description": "Numéro, rue, code postal, ville",
"required": true
},
"description": {
"field_type": "str",
"title": "Description du bien et type d'usage",
"description": "Description sous forme de liste, et types d'usage à la fin uniquement, pas de phrases",
"required": true
},
"surface": {
"field_type": "float",
"title": "Surface",
"description": "Surface du bien loué en m²",
"required": true
},
"date_prise_effet": {
"field_type": "date",
"title": "Date de prise d'effet",
"description": "Date de prise d'effet du bail",
"required": true
},
"date_fin": {
"field_type": "date",
"title": "Date de fin",
"description": "Date de fin du bail",
"required": true,
"validators": [
{
"type": "registered",
"name": "date_after",
"params": {
"field": "date_prise_effet"
}
}
]
},
"duree_bail": {
"field_type": "delay",
"title": "Durée du bail",
"description": "Durée du bail en années et mois",
"required": true,
"validators": [
{
"type": "registered",
"name": "delay_matches_dates",
"params": {
"start_date": "date_prise_effet",
"end_date": "date_fin"
}
}
]
}
}
}
class Leasing(BaseModel):
bailleur: str = Field(
...,
title="Bailleur",
description="Nom de la société UNIQUEMENT ou prénom nom de la personne UNIQUEMENT"
)
preneur: str = Field(
...,
title="Preneur",
description="Nom de la société UNIQUEMENT ou prénom nom de la personne UNIQUEMENT"
)
adresse: str = Field(
...,
title="Adresse du bien loué",
description="Numéro, rue, code postal, ville"
)
description: str = Field(
...,
title="Description du bien",
description="Description sous forme de liste, avec usages et équipements inclus, pas de phrases"
)
surface: float = Field(
...,
title="Surface",
description="Surface du bien loué en m²"
)
date_prise_effet: Date = Field(
...,
title="Date de prise d'effet",
description="Date de prise d'effet du bail"
)
date_fin: Date = Field(
...,
title="Date de fin",
description="Date de fin du bail"
)
duree_bail: Delay = Field(
...,
title="Durée du bail",
description="Durée du bail en années et mois"
)
@field_validator('date_fin')
def validate_date_fin(cls, v: Date, info: ValidationInfo) -> datetime:
# implementation
pass
@field_validator('duree_bail')
def validate_duree_bail(cls, v: Delay, info: ValidationInfo) -> Delay:
# implementation
pass
model_config = {
"json_schema_extra": {
"description": "Informations extraites d'un contrat de location"
}
}
Examples is another way to help the LLM to understand the context and extract data. It is json file that contains examples of the output data.
Role
assistant
is used to indicate the output of the LLM. You can also use roleuser
to indicate the input data. See Message concept for more details.
Examples for a leasing contract according to the model above (data/examples.json
)
{
"examples": [
{
"role": "assistant",
"content": {
"bailleur": "Nvidia",
"preneur": "Jean Dupont",
"adresse": "12 rue de la paix, 75000 Paris",
"description": "Appartement de 3 pièces, 2 chambres, 1 salle de bain, 1 cuisine, 1 salon | usage: habitation uniquement",
"surface": 50.0,
"date_prise_effet": {
"year": 2022,
"month": 1,
"day": 31
},
"date_fin": {
"year": 2023,
"month": 1,
"day": 31
},
"duree_bail": {
"years": 1
}
}
}
]
}
Create a .env
file in the project directory and set the configuration. You can use the .env.example
file as a template.
For now, all providers from here and LLM with tooling integrations are supported.
# File processing configuration
LANGUAGES = "fr"
CACHE_DIR = "./tmp/cache"
# Set the API key for unstructured API if you want to use it
# UNSTRUCTURED_API_KEY =
PDF_STRATEGY = "auto"
# LLM configuration
# Api keys for any supported LLM provider (e.g., google-genai, ollama SEE langchain providers documentation for the name of the variables)
GOOGLE_API_KEY =
# Monitoring
MONITORING_FILE_PATH = "monitoring.json"
COST_MAPPING_PATH = "config/cost_mapping.json"
TOKENS_INPUT_PRICING_UNIT = 1000000
TOKENS_OUTPUT_PRICING_UNIT = 1000000
Basic usage of the CLI tool, see Configuration for more details.
The path can be a file or a glob pattern (e.g.,
data/*.pdf
)
python cli.py "data/BAIL 3.pdf"
Example of output from data/Bail 3.PDF
[
{
"bailleur": "François Girard",
"preneur": "Maison du Livre",
"adresse": "4 Rue de l'Académie, 75009 Paris",
"description": "une grande salle de vente, un bureau et des sanitaires | usage: librairie",
"surface": 60.0,
"date_prise_effet": {
"day": 1,
"month": 6,
"year": 2025
},
"date_fin": {
"day": 31,
"month": 5,
"year": 2034
},
"duree_bail": {
"years": 9,
"months": 0
}
}
]
The monitoring file is a JSON file that contains the usage of the LLM. It can be set using the MONITORING_FILE_PATH
in the .env
file.
The cost is based on the actual pricing of the LLM provider. The cost mapping can be updated in config/cost_mapping.json
.
Example of monitoring file
[
{
"timestamp": "2025-02-10T23:23:22.340297",
"model": "gemini-1.5-flash",
"provider": "google-genai",
"duration_seconds": 393.38787699999995,
"input_tokens": 175982,
"output_tokens": 25027,
"total_tokens": 200009,
"estimated_cost_usd": "xxx"
},
{
"timestamp": "2025-02-11T00:40:50.692436",
"model": "gemini-1.5-pro",
"provider": "google-genai",
"duration_seconds": 24.356106,
"input_tokens": 6984,
"output_tokens": 429,
"total_tokens": 7413,
"estimated_cost_usd": "xxx"
}
]
usage: cli.py [-h] [--languages LANGUAGES [LANGUAGES ...]] [--strategy {auto,hi_res,fast}]
[--no-cache] [--processor {unstructured,easyocr,tesserocr}] [--model MODEL]
[--provider PROVIDER] [--temperature TEMPERATURE] [--examples EXAMPLES]
[--model-schema MODEL_SCHEMA] [--output OUTPUT]
input_pattern
Extract structured data from documents
options:
-h, --help show this help message and exit
Input Configuration:
input_pattern Path or glob pattern for document files
--languages LANGUAGES [LANGUAGES ...]
List of languages (default: ['fr'])
--strategy {auto,hi_res,fast}
--no-cache
--processor {unstructured,easyocr,tesserocr}
LLM Configuration:
--model MODEL LLM model name to use for extraction (default:
gemini-1.5-pro)
--provider PROVIDER LLM provider (e.g., google-genai, ollama) (default: google-
genai)
--temperature TEMPERATURE LLM temperature setting - lower values are more focused
(default: 0.1)
Model Configuration:
--examples EXAMPLES Path to examples JSON file for few-shot learning (default:
data/examples.json)
--model-schema MODEL_SCHEMA Path to model schema JSON file defining the extraction
structure (default: config/model.json)
Output Configuration:
--output OUTPUT Output file path (supported formats: CSV, JSON)
Full documentation is available in the docs directory:
Feel free to contribute or to request features. You can open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.