infy_ocr_parser
(v0.0.10) is a Python library for parsing OCR output files. It provides APIs to detect regions (bounding boxes) for a given search criteria.
The regions (bounding boxes) are then given as the input to other data extraction libraries.
Currently, it works with the following OCR tools. Support for other OCR tools may be added in the future.
- Tesseract
- Azure Read v3
- Azure OCR v3
The library requires Python 3.6.
For build:
python setup.py bdist_wheel
For install:
pipenv install <whl file>
Creates an instance of OCR Parser.
class OcrParser(ocr_file_list:list,
data_service_provider:infy_ocr_parser.interface.data_service_provider_interface.DataServiceProviderInterface,
config_params_dict={'match_method': 'normal',
'similarity_score': 1,
'max_word_space': '1.5t'},
logger:logging.Logger=None,
log_level:int=None):
Input:
Argument | Description |
---|---|
ocr_file_list (list) | List of OCR documents with full path. E.g. ['C:/1.hocr']. |
data_service_provider (DataServiceProviderInterface) | Provider to parse OCR file of a specified OCR tool. |
config_params_dict (CONFIG_PARAMS_DICT, optional) | Configuration dictionary for the anchor text match method. Defaults to CONFIG_PARAMS_DICT. |
logger (logging.Logger, optional) | Logger object. Defaults to None. |
log_level (int, optional) | log level. Defaults to None. |
Output:
None
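A minimal construction sketch. The import paths and the sample file path below are assumptions (they are not spelled out in this document); the constructor arguments follow the signature above:

```python
# Sketch only: import paths and the hOCR file path are assumptions.
from infy_ocr_parser import ocr_parser
from infy_ocr_parser.providers.tesseract_ocr_data_service_provider \
    import TesseractOcrDataServiceProvider

ocr_file_list = ['C:/1.hocr']  # hypothetical Tesseract hOCR output file
data_service_provider = TesseractOcrDataServiceProvider()
ocr_parser_obj = ocr_parser.OcrParser(
    ocr_file_list,
    data_service_provider=data_service_provider,
    config_params_dict={'match_method': 'normal',
                        'similarity_score': 1,
                        'max_word_space': '1.5t'})
```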
Calculate and return the scaling factor of the given image.
def calculate_scaling_factor(image_width=0,
image_height=0) -> dict:
Input:
Argument | Description |
---|---|
image_width (int, optional) | Value of image width. Default is 0. |
image_height (int, optional) | Value of image height. Default is 0. |
Output:
Param | Description |
---|---|
dict | Dict of calculated scaling factor and warnings (if any). |
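For instance, if the detected regions will be applied to an image whose dimensions differ from the one the OCR was produced from, a sketch (reusing ocr_parser_obj from the example above; values are illustrative):

```python
# Dimensions of the image the regions will be mapped to (illustrative values).
result = ocr_parser_obj.calculate_scaling_factor(image_width=2550,
                                                 image_height=3300)
print(result)  # dict holding the calculated scaling factor and any warnings
```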
Get the relative region for the given region definition by parsing the OCR file(s). Returns the [X1,Y1,W,H] bbox of the anchor text; if 'anchorPoint1' and 'anchorPoint2' are provided, the region bbox is calculated based on them.
def get_bbox_for(region_definition=[{'anchorText': [''],
'pageNum': [],
'anchorTextMatch': {'method': '',
'similarityScore': 1,
'maxWordSpace': '1.5t'},
'anchorPoint1': {'left': None,
'top': None,
'right': None,
'bottom': None},
'anchorPoint2': {'left': None,
'top': None,
'right': None,
'bottom': None},
'pageDimensions': {'width': 1000,
'height': 1000}}],
subtract_region_definition=[[{'anchorText': [''],
'pageNum': [],
'anchorTextMatch': {'method': '',
'similarityScore': 1,
'maxWordSpace': '1.5t'},
'anchorPoint1': {'left': None,
'top': None,
'right': None,
'bottom': None},
'anchorPoint2': {'left': None,
'top': None,
'right': None,
'bottom': None},
'pageDimensions': {'width': 1000,
'height': 1000}}]],
scaling_factor={'hor': 1,
'ver': 1}) -> list:
Input:
Argument | Description |
---|---|
region_definition ([REG_DEF_DICT], optional) | List of REG_DEF_DICT to find the relative region. REG_DEF_DICT keys: anchorText - text to search for in the OCR; pageNum - page numbers on which to look for the anchorText in the doc; anchorPoint1 & anchorPoint2 - a point placed inside/on/outside the bounding box of the anchor text using location properties, accepting units of pixels (Eg. 30, '30', '30px'), percent (Eg. '30%a' - 30% of the page dimension, or '30%r' - 30% of the distance from the anchorText to the end of the page) or text width/height (Eg. '30t', based on direction); anchorTextMatch - accepts the pattern match method and its similarityScore, and also maxWordSpace with units of pixels (Eg. 30, '30', '30px'), percent (Eg. '30%') or text height (Eg. '30t'); pageDimensions - width and height of the page. Defaults to [REG_DEF_DICT]. |
subtract_region_definition ([[REG_DEF_DICT]], optional) | 2D list of REG_DEF_DICT to subtract regions from the regions of interest, e.g. to subtract a header/footer region. Defaults to [[REG_DEF_DICT]]. |
scaling_factor (SCALING_FACTOR, optional) | Scaling factor to apply to the token bboxes while calculating the region. Defaults to SCALING_FACTOR. |
Output:
Param | Description |
---|---|
list | List of region dicts containing the anchorText bbox and the bbox of the region of interest. |
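A sketch of a single-element region definition that anchors on a label via regex and takes the area to its right up to the end of the page (anchor text, page numbers and page dimensions are illustrative):

```python
region_definition = [{
    'anchorText': ['^Invoice No'],   # illustrative regex anchor
    'pageNum': [1],
    'anchorTextMatch': {'method': 'regex', 'similarityScore': 1},
    # anchorPoint1: top-left corner of the anchor text bbox
    'anchorPoint1': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
    # anchorPoint2: bottom edge of the anchor text, 100% of the remaining
    # width towards the right edge of the page ('%r' unit)
    'anchorPoint2': {'left': None, 'top': None, 'right': '100%r', 'bottom': 0},
    'pageDimensions': {'width': 2550, 'height': 3300}
}]
regions = ocr_parser_obj.get_bbox_for(region_definition,
                                      scaling_factor={'hor': 1, 'ver': 1})
```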
Returns an ordered list of tokens in all four directions (top, left, bottom, right) from a given
anchor text. The number of tokens returned in each direction can be set using token_count.
def get_nearby_tokens(anchor_txt_dict:{'anchorText': [''],
'anchorTextMatch': {'method': '',
'similarityScore': 1,
'maxWordSpace': '1.5t'},
'pageNum': [],
'distance': {'left': None,
'top': None,
'right': None,
'bottom': None},
'pageDimensions': {'width': 0,
'height': 0}},
token_type_value:int=3,
token_count:int=1,
token_min_alignment_threshold:float=0.5,
scaling_factor={'hor': 1,
'ver': 1}):
Input:
Argument | Description |
---|---|
anchor_txt_dict (ANCHOR_TXT_DICT) | Anchor text info around which nearby tokens are to be searched. |
token_type_value (int, optional) | 1(WORD), 2(LINE), 3(PHRASE). Defaults to 3. |
token_count (int, optional) | Count of nearby tokens to be returned. Defaults to 1. |
token_min_alignment_threshold (float, optional) | Fraction of the anchor text that must be aligned with a nearby token; the value ranges from 0 to 1. If the threshold is 0, the nearby token must lie completely within the width/height of the anchor text (Eg. if anchor_text_bbox=[10,10,100,20], the nearby top token must be above the anchor text and between the lines x=10 and x=110). For a threshold between 0 and 1, the nearby token must align with at least that fraction of the anchor text (Eg. for anchor_text_bbox=[10,10,100,20] and threshold=0.5, the nearby top token must align with at least 50px of the anchor text on either side, i.e. 0.5 of the anchor text width). For a threshold of 1, the token must start at or before the anchor text's start position and end at or after the anchor text's end position. Defaults to 0.5. |
scaling_factor (SCALING_FACTOR, optional) | Scaling factor to apply to the token bboxes. Defaults to SCALING_FACTOR. |
Output:
Param | Description |
---|---|
list | Info of nearby tokens in each direction. |
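A sketch of fetching the nearest phrase in each direction around an anchor text, continuing with ocr_parser_obj from above (values are illustrative):

```python
anchor_txt_dict = {
    'anchorText': ['Total'],
    'anchorTextMatch': {'method': 'normal', 'similarityScore': 0.9},
    'pageNum': [1],
    'distance': {'left': None, 'top': None, 'right': None, 'bottom': None},
    'pageDimensions': {'width': 2550, 'height': 3300}
}
nearby_tokens = ocr_parser_obj.get_nearby_tokens(
    anchor_txt_dict,
    token_type_value=3,                 # PHRASE
    token_count=1,                      # one token per direction
    token_min_alignment_threshold=0.5)
```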
Get Json data of token_type_value by parsing the OCR file(s).
def get_tokens_from_ocr(token_type_value:int,
within_bbox=[],
ocr_word_list=[],
pages=[],
scaling_factor={'hor': 1,
'ver': 1},
max_word_space='1.5t') -> list:
Input:
Argument | Description |
---|---|
token_type_value (int) | 1(WORD), 2(LINE), 3(PHRASE) |
within_bbox (list, optional) | Return Json data within this region. Default is empty list. |
ocr_word_list (list, optional) | When token_type_value is 3 (PHRASE), the words in ocr_word_list are formed into phrases and their Json data is returned. Default is empty list. |
pages (list, optional) | To get token_type_value Json data for specific page(s) from list of pages. Default is empty list. |
scaling_factor (SCALING_FACTOR, optional) | Scale to given number and then returns token. Defaults to SCALING_FACTOR. |
max_word_space (str, optional) | max space between words to consider them as one phrase. Defaults to '1.5t'. |
Output:
Param | Description |
---|---|
list | List of token dict |
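A sketch of fetching tokens directly (page number and word list are illustrative):

```python
# All WORD tokens from page 1.
word_tokens = ocr_parser_obj.get_tokens_from_ocr(token_type_value=1,
                                                 pages=[1])

# PHRASE tokens built only from the given words.
phrase_tokens = ocr_parser_obj.get_tokens_from_ocr(
    token_type_value=3,
    ocr_word_list=['Invoice', 'No'],
    max_word_space='1.5t')
```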
Save token_type_value Json data to out_file location by parsing the OCR file(s).
def save_tokens_as_json(out_file,
token_type_value:int,
pages=[],
scaling_factor={'hor': 1,
'ver': 1}) -> dict:
Input:
Argument | Description |
---|---|
out_file (str) | Json file full path. E.g. 'C:/word_token.json'. |
token_type_value (int) | 1(WORD), 2(LINE), 3(PHRASE). |
pages (list, optional) | To get token_type_value Json data for specific page(s) from list of pages. Default is empty list. |
scaling_factor (SCALING_FACTOR, optional) | Token data saved to json file after Scaling to given number. Defaults to SCALING_FACTOR. |
Output:
Param | Description |
---|---|
dict | Dict of saved info. |
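A sketch of saving LINE tokens of page 1 to a file (the output path is illustrative):

```python
save_info = ocr_parser_obj.save_tokens_as_json(
    'C:/line_token.json',            # illustrative output path
    token_type_value=2,              # LINE
    pages=[1],
    scaling_factor={'hor': 1, 'ver': 1})
```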
Creates an instance of the TesseractOcrDataServiceProvider class.
class TesseractOcrDataServiceProvider(logger:logging.Logger=None,
log_level:int=None):
Input:
Argument | Description |
---|---|
logger (logging.Logger, optional) | logger object. Defaults to None. |
log_level (int, optional) | log level. Defaults to None. |
Output:
None
Returns list of line dictionaries containing text and bbox values.
def get_line_dict_from(pages:list=None,
line_dict_list:list=None,
scaling_factors:list=None) -> list:
Input:
Argument | Description |
---|---|
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
line_dict_list (list, optional) | Existing line dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of line dictionaries containing the text, words and respective bbox values. |

Returns list of dictionaries containing the page number and its bbox values.
def get_page_bbox_dict() -> list:
Output:
Param | Description |
---|---|
list | List of dictionaries containing the page number and its bbox values. |
Returns list of word dictionaries containing text and bbox values.
def get_word_dict_from(line_obj=None,
pages:list=None,
word_dict_list:list=None,
scaling_factors:list=None) -> list:
Input:
Argument | Description |
---|---|
line_obj ([any], optional) | Existing line object to get the words from. Defaults to None. |
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
word_dict_list (list, optional) | Existing word dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of word dictionary containing the text, bbox and conf values. |
Method used to load the list of input OCR files into the given provider.
def init_provider_inputs(doc_list:list):
Input:
Argument | Description |
---|---|
doc_list (list) | OCR file list |
Output:
None
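The provider can also be used directly when only raw line or word dictionaries are needed. A sketch, reusing the import and the hypothetical hOCR file from the earlier example:

```python
provider = TesseractOcrDataServiceProvider()
provider.init_provider_inputs(['C:/1.hocr'])      # hypothetical hOCR file

line_dicts = provider.get_line_dict_from(pages=[1])
word_dicts = provider.get_word_dict_from(pages=[1],
                                         scaling_factors=[1.0, 1.0])
```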
Creates an instance of the AzureReadOcrDataServiceProvider class.
class AzureReadOcrDataServiceProvider(logger:logging.Logger=None,
log_level:int=None):
Input:
Argument | Description |
---|---|
logger (logging.Logger, optional) | logger object. Defaults to None. |
log_level (int, optional) | log level. Defaults to None. |
Output:
None
Returns list of line dictionaries containing text and bbox values.
def get_line_dict_from(pages:list=None,
line_dict_list:list=None,
scaling_factors=None) -> list:
Input:
Argument | Description |
---|---|
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
line_dict_list (list, optional) | Existing line dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of line dictionaries containing the text, words and respective bbox values. |

Returns list of dictionaries containing the page number and its bbox values.
def get_page_bbox_dict() -> list:
Output:
Param | Description |
---|---|
list | List of dictionaries containing the page number and its bbox values. |
Returns list of word dictionaries containing text and bbox values.
def get_word_dict_from(line_obj=None,
pages:list=None,
word_dict_list:list=None,
scaling_factors=None,
bbox_unit='pixel') -> list:
Input:
Argument | Description |
---|---|
line_obj ([any], optional) | Existing line object to get the words from. Defaults to None. |
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
word_dict_list (list, optional) | Existing word dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
bbox_unit (str, optional) | Unit of bbox value. Defaults to 'pixel'. |
Output:
Param | Description |
---|---|
list | List of word dictionary containing the text, bbox and conf values. |
Method used to load the list of input OCR files into the given provider.
def init_provider_inputs(doc_list:list):
Input:
Argument | Description |
---|---|
doc_list (list) | OCR file list |
Output:
None
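Apart from the input file format (Azure Read v3 output instead of hOCR), this provider plugs into OcrParser the same way as the Tesseract provider. A sketch with an assumed import path and file name, continuing the earlier ocr_parser import:

```python
# Sketch only: import path and file name are assumptions.
from infy_ocr_parser.providers.azure_read_ocr_data_service_provider \
    import AzureReadOcrDataServiceProvider

azure_read_provider = AzureReadOcrDataServiceProvider()
ocr_parser_obj = ocr_parser.OcrParser(
    ['C:/1_azure_read_v3.json'],     # hypothetical Azure Read v3 response file
    data_service_provider=azure_read_provider)
```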
Creates an instance of the AzureOcrDataServiceProvider class.
class AzureOcrDataServiceProvider(logger:logging.Logger=None,
log_level:int=None):
Input:
Argument | Description |
---|---|
logger (logging.Logger, optional) | logger object. Defaults to None. |
log_level (int, optional) | log level. Defaults to None. |
Output:
None
Returns list of line dictionaries containing text and bbox values.
def get_line_dict_from(pages:list=None,
line_dict_list:list=None,
scaling_factors=None) -> list:
Input:
Argument | Description |
---|---|
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
line_dict_list (list, optional) | Existing line dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of line dictionaries containing the text, words and respective bbox values. |

Returns list of dictionaries containing the page number and its bbox values.
def get_page_bbox_dict() -> list:
Output:
Param | Description |
---|---|
list | List of dictionaries containing the page number and its bbox values. |
"Returns list of word dictionary containing text and bbox values.
def get_word_dict_from(line_obj=None,
pages:list=None,
word_dict_list:list=None,
scaling_factors=None) -> list:
Input:
Argument | Description |
---|---|
line_obj ([any], optional) | Existing line object to get the words from. Defaults to None. |
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
word_dict_list (list, optional) | Existing word dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of word dictionary containing the text, bbox and conf values. |
Method used to load the list of input OCR files into the given provider.
def init_provider_inputs(doc_list:list):
Input:
Argument | Description |
---|---|
doc_list (list) | OCR file list |
Output:
None
To create a bbox, we need two diagonal points, which can be derived from the anchor text bbox. To do that, we select any one of the four corners of the anchor text bbox as a reference point and navigate from it through the image. Below are the different combinations by which a particular point can be selected and navigated to another point.
Tips and Suggestions:
- For better and more flexible region definitions, use relative values instead of hard numbers, e.g. 't' or '%' units instead of pixel values when defining anchor points.
- A region definition is a list of at most 2 elements. If a list with 1 element is passed, use 2 anchor points (anchorPoint1 and anchorPoint2) to define the region; if 2 elements are passed, use 1 anchor point (anchorPoint1) for each element in the list. Ex.

```python
# With a single element and two anchor points from the same anchor text
reg_def_1 = [{
    'anchorText': '^name',
    'anchorPoint1': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
    'anchorPoint2': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
    'anchorTextMatch': {'method': 'regex'}
}]

# With two elements and two anchor points from two different anchor texts
# and different match methods
reg_def_2 = [
    {'anchorText': '^name',
     'anchorPoint1': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
     'anchorTextMatch': {'method': 'regex'}},
    {'anchorText': 'Name',
     'anchorPoint1': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
     'anchorTextMatch': {'method': 'normal', 'similarityScore': 0.8}}
]
```