Document Extraction Libraries are a suite of Python/Java libraries that provide APIs to extract information from documents (e.g. scanned/native PDFs, images etc.). For semi-structured documents (e.g. form-like documents), this can be done in a simple and predictable manner. For unstructured documents, the libraries can extract the raw content, retrieve relevant text from it using semantic search, and use a Large Language Model (LLM) to extract information.
The suite consists of libraries that can be used to generate OCR files using free/commercial tools, parse OCR files to extract regions of interest, and extract text and selection field values from those regions. It can also extract segments from pages, generate chunks, generate embeddings, and save them to a vector DB, from which they can be retrieved as context and given to an LLM to extract information from documents.
Additionally, it includes a data pipeline framework - Document Processor Platform (DPP) - for creating reusable components and configurable pipelines. The pipeline brings all the libraries together to create a logical workflow.
These libraries can be used as SDKs to solve document digitization problems and help with semantic search and information extraction requirements.
- Python >=3.10 and <=3.11
- Java >=8
- OCR Tool (Tesseract / Azure Read OCR V 3.2) (for non-digital documents)
The details of each library and its core functionality are given below. For more details, please read the docs.
S# | Library | Description |
---|---|---|
1 | infy_ocr_generator | Provides APIs to generate OCR files by specifying an OCR provider. |
2 | infy_ocr_parser | Provides APIs to parse OCR files and detect regions of interest (bounding boxes) when given a search criteria. |
3 | infy_field_extractor | Provides APIs for extracting free text and selection fields (checkboxes and radio buttons) from image files using regions of interest (bounding boxes) as input. |
4 | infy_table_extractor | Provides APIs to extract rows and columns from an image of a table. |
5 | infy_common_utils | Provides APIs to invoke external tools like JAR files. |
6 | infy_fs_utils | Provides APIs to abstract the underlying file system and object stores. |
7 | infy_gen_ai_sdk | Provides APIs for using embeddings, Large Language Models (LLMs), vector DBs etc. |
8 | InfyFormatConverterJAR | Provides APIs to convert documents from one format to another. E.g., PDF to image, JSON etc. |
9 | InfyOcrEngineJAR | Provides APIs to invoke OCR engines. Currently, it supports Tesseract. |
10 | infy_dpp_sdk | The SDK for the Document Processor Platform (DPP), containing the interfaces for processors, the schema definition for document data, and in-built orchestrators to execute a data pipeline made of processors. |
11 | infy_dpp_core | A collection of processors for core tasks like request creation, meta-data extraction etc. |
12 | infy_dpp_segmentation | A collection of processors for tasks like document segmentation, chunk creation etc. |
13 | infy_dpp_ai | A collection of processors for tasks like generating embeddings, calling LLMs with prompt templates etc. |
14 | infy_dpp_storage | A collection of processors to store data in a graph DB etc. |
15 | infy_dpp_content_extractor | A collection of processors for extracting raw contents from documents. |
The details of each app and its core functionality are given below.
S# | App | Description |
---|---|---|
1 | infy_dpp_processor | An implementation of the indexing pipeline. This app can be deployed as a Docker image on a Kubernetes cluster and managed via an orchestrator like Airflow or Kubeflow to run the indexing pipeline. |
2 | infy_db_service | Created indexes can be stored in two ways: locally, in the environment where the indexing pipeline is running, or centrally, via infy_db_service, which stores indexes in the environment where this service is hosted. |
3 | infy_search_service | An implementation of the inferencing pipeline. If infy_db_service is used to store the created indexes, use infy_search_service to query those documents. |
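As an illustration, querying a deployed infy_search_service might look like the sketch below. The URL, endpoint path and request fields here are assumptions made purely for illustration; the actual request/response contract is described in apps/infy_search_service/README.md.

```python
import requests

# Hypothetical endpoint and payload, for illustration only.
# The actual API contract is documented in apps/infy_search_service/README.md.
SEARCH_URL = "http://localhost:8004/api/v1/search"  # assumed host/port/path

payload = {
    "question": "What is the invoice due date?",  # natural-language query
    "index_id": "my_document_index",              # assumed id of a stored index
    "top_k": 4,                                   # number of chunks to retrieve
}

response = requests.post(SEARCH_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())  # expected: retrieved chunks and/or the LLM answer
```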
The libraries use computer vision, in the form of an OCR engine (e.g. Tesseract, Azure OCR Read etc.), for positional text detection. They then take a "region definition" as input and apply techniques to detect regions of interest within the document.
This makes it possible to extract attributes - free text, selection fields (checkboxes, radio buttons) and bordered tables - specifically from the regions of interest, and it reduces the risk of future extraction errors should the document layout change while the regions of interest remain the same.
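The sketch below shows the underlying idea of this flow (positional OCR, anchor-based region definition, region-scoped extraction) using Tesseract directly via pytesseract. It is a generic illustration, not the libraries' own APIs; the infy_* libraries wrap this kind of workflow behind their documented interfaces (see docs/notebook for library-specific examples).

```python
# Generic sketch of positional OCR + region-of-interest extraction, for
# illustration only. The infy_* libraries provide their own APIs for this.
import pytesseract
from PIL import Image

image = Image.open("invoice_page_1.jpg")

# 1. Positional text detection: every detected word comes with a bounding box.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
words = list(zip(data["text"], data["left"], data["top"],
                 data["width"], data["height"]))

# 2. "Region definition": locate an anchor word and define the region of
#    interest relative to it (here: same line, to the right of the anchor).
anchor = next(w for w in words if w[0].strip() == "Invoice")  # simplistic match
x, y, w, h = anchor[1], anchor[2], anchor[3], anchor[4]

# 3. Extract only the words that fall inside the region of interest.
value_words = [
    text for (text, left, top, _w, _h) in words
    if text.strip() and left > x + w and abs(top - y) < h
]
print(" ".join(value_words))
```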
The libraries, along with the data pipeline framework, help create a workflow where raw content is extracted from documents and stored in a vector DB as embeddings. From there, useful information is extracted using the Retrieval-Augmented Generation (RAG) approach.
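The sketch below shows the shape of that RAG workflow in generic terms (chunk, embed, store, retrieve, prompt an LLM). It uses sentence-transformers and a plain in-memory store purely for illustration; infy_gen_ai_sdk and the infy_dpp_* processors provide the embedding, vector-DB and LLM integrations actually used by the pipelines.

```python
# Generic RAG sketch for illustration; infy_gen_ai_sdk / infy_dpp_* provide the
# actual embedding, vector-DB and LLM integrations used by the pipelines.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Chunk the raw content extracted from a document (segmentation output).
chunks = [
    "The agreement is effective from 1 January 2024.",
    "The total contract value is USD 250,000.",
    "Either party may terminate with 30 days written notice.",
]

# 2. Generate embeddings and keep them in an in-memory "vector store".
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# 3. Retrieve the chunks most relevant to a question (semantic search).
question = "What is the contract value?"
query_vector = model.encode([question], normalize_embeddings=True)[0]
scores = chunk_vectors @ query_vector
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# 4. Retrieved chunks become the context for an LLM call (the LLM invocation
#    itself is provider-specific and omitted here).
prompt = ("Answer using only the context below.\n\nContext:\n"
          + "\n".join(top_chunks) + f"\n\nQuestion: {question}")
print(prompt)
```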
The API logical input/output is given below.
Step | Library | Input | Output |
---|---|---|---|
1 | infy_ocr_generator | image file | OCR file |
2 | infy_ocr_parser | OCR file, region definition | region of interest [x,y,w,h] |
3 | infy_field_extractor | OCR file, region of interest [x,y,w,h] | text, checkbox state (T/F), radio button state (T/F) |
4 | infy_table_extractor | image file | table data with rows and cols |
5 | infy_dpp_core, infy_dpp_segmentation, infy_dpp_ai, infy_dpp_storage, infy_dpp_content_extractor | config_data, document_data, context_data | document_data, context_data |
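To make the last row concrete: a DPP processor is a component that receives configuration plus the evolving document_data and context_data, and returns updated document_data and context_data for the next processor in the pipeline. The sketch below is a generic illustration of that contract; the base class, method names and data schemas used by infy_dpp_sdk differ and are documented in docs/reference.

```python
# Generic illustration of the processor contract described above.
# The real base class, method names and schemas come from infy_dpp_sdk;
# the names below are assumptions used only to show the data flow.
from typing import Any, Dict, Tuple


class WordCountProcessor:
    """Toy processor: counts words in the extracted raw content, records the
    result in document_data, and writes progress notes into context_data."""

    def execute(
        self,
        config_data: Dict[str, Any],
        document_data: Dict[str, Any],
        context_data: Dict[str, Any],
    ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        min_words = config_data.get("min_words", 0)
        raw_text = document_data.get("raw_content", "")

        word_count = len(raw_text.split())
        document_data["word_count"] = word_count
        context_data["word_count_processor"] = {
            "passed_threshold": word_count >= min_words
        }
        return document_data, context_data


# An orchestrator would chain such processors, passing document_data and
# context_data from one processor to the next.
doc, ctx = WordCountProcessor().execute(
    config_data={"min_words": 5},
    document_data={"raw_content": "Invoice number INV-001 dated 1 Jan 2024"},
    context_data={},
)
print(doc["word_count"], ctx)
```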
For code examples, please read docs/notebook.
For infy_dpp_processor, please read apps/infy_dpp_processor/README.md
For infy_db_service, please read apps/infy_db_service/README.md
For infy_search_service, please read apps/infy_search_service/README.md
For API specifications, please read docs/reference.