infy_ocr_parser
(v0.0.10) is a Python library for parsing OCR output files. It provides APIs to detect regions (bounding boxes) for a given search criteria.
The regions (bounding boxes) are then given as the input to other data extraction libraries.
Currently, it works with the following OCR tools. Support for other OCR tools may be added in the future.
- Tesseract
- Azure Read v3
- Azure OCR v3
The library requires Python 3.6.
For build:
python setup.py bdist_wheel
For install:
pipenv install <whl file>
Creates an instance of OCR Parser.
class OcrParser(ocr_file_list:list,
data_service_provider:infy_ocr_parser.interface.data_service_provider_interface.DataServiceProviderInterface,
config_params_dict={'match_method': 'normal',
'similarity_score': 1,
'max_word_space': '1.5t'},
logger:logging.Logger=None,
log_level:int=None):
Input:
Argument | Description |
---|---|
ocr_file_list (list) | List of OCR documents with full path. E.g. ['C:/1.hocr']. |
data_service_provider (DataServiceProviderInterface) | Provider to parse OCR file of a specified OCR tool. |
config_params_dict (CONFIG_PARAMS_DICT, optional) | Configuration dictionary for the anchor text match method. Defaults to CONFIG_PARAMS_DICT. |
logger (logging.Logger, optional) | Logger object. Defaults to None. |
log_level (int, optional) | log level. Defaults to None. |
Output:
None
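A minimal construction sketch. The import paths and the sample file path below are assumptions (they are not spelled out in this document); the constructor arguments follow the signature above:

```python
# Sketch only: import paths and the hOCR file path are assumptions.
from infy_ocr_parser import ocr_parser
from infy_ocr_parser.providers.tesseract_ocr_data_service_provider \
    import TesseractOcrDataServiceProvider

ocr_file_list = ['C:/1.hocr']  # hypothetical Tesseract hOCR output file
data_service_provider = TesseractOcrDataServiceProvider()
ocr_parser_obj = ocr_parser.OcrParser(
    ocr_file_list,
    data_service_provider=data_service_provider,
    config_params_dict={'match_method': 'normal',
                        'similarity_score': 1,
                        'max_word_space': '1.5t'})
```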
Calculate and return the scaling factor of the given image.
def calculate_scaling_factor(image_width=0,
image_height=0) -> dict:
Input:
Argument | Description |
---|---|
image_width (int, optional) | Value of image width. Default is 0. |
image_height (int, optional) | Value of image height. Default is 0. |
Output:
Param | Description |
---|---|
dict | Dict of calculated scaling factor and warnings (if any). |
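For instance, if the detected regions will be applied to an image whose dimensions differ from the one the OCR was produced from, a sketch (reusing ocr_parser_obj from the example above; values are illustrative):

```python
# Dimensions of the image the regions will be mapped to (illustrative values).
result = ocr_parser_obj.calculate_scaling_factor(image_width=2550,
                                                 image_height=3300)
print(result)  # dict holding the calculated scaling factor and any warnings
```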
Get the relative region for the given region definition by parsing the OCR file(s). Returns the [X1,Y1,W,H] bbox of the anchor text; if 'anchorPoint1' and 'anchorPoint2' are provided, the region bbox is calculated based on them.
def get_bbox_for(region_definition=[{'anchorText': [''],
'pageNum': [],
'anchorTextMatch': {'method': '',
'similarityScore': 1,
'maxWordSpace': '1.5t'},
'anchorPoint1': {'left': None,
'top': None,
'right': None,
'bottom': None},
'anchorPoint2': {'left': None,
'top': None,
'right': None,
'bottom': None},
'pageDimensions': {'width': 1000,
'height': 1000}}],
subtract_region_definition=[[{'anchorText': [''],
'pageNum': [],
'anchorTextMatch': {'method': '',
'similarityScore': 1,
'maxWordSpace': '1.5t'},
'anchorPoint1': {'left': None,
'top': None,
'right': None,
'bottom': None},
'anchorPoint2': {'left': None,
'top': None,
'right': None,
'bottom': None},
'pageDimensions': {'width': 1000,
'height': 1000}}]],
scaling_factor={'hor': 1,
'ver': 1}) -> list:
Input:
Argument | Description |
---|---|
region_definition ([REG_DEF_DICT], optional) | List of REG_DEF_DICT to find the relative region. REG_DEF_DICT keys: anchorText - text to search for in the OCR; pageNum - page numbers on which to look for the anchorText in the doc; anchorPoint1 & anchorPoint2 - a point placed inside/on/outside the bounding box of the anchor text using location properties, accepting units of pixels (Eg. 30, '30', '30px'), percent (Eg. '30%a' - 30% of the page dimension, or '30%r' - 30% of the distance from the anchorText to the end of the page) or text width/height (Eg. '30t', based on direction); anchorTextMatch - accepts the pattern match method and its similarityScore, and also maxWordSpace with units of pixels (Eg. 30, '30', '30px'), percent (Eg. '30%') or text height (Eg. '30t'); pageDimensions - width and height of the page. Defaults to [REG_DEF_DICT]. |
subtract_region_definition ([[REG_DEF_DICT]], optional) | 2D list of REG_DEF_DICT to subtract regions from the regions of interest, e.g. to subtract a header/footer region. Defaults to [[REG_DEF_DICT]]. |
scaling_factor (SCALING_FACTOR, optional) | Scaling factor to apply to the token bboxes while calculating the region. Defaults to SCALING_FACTOR. |
Output:
Param | Description |
---|---|
list | List of region dicts containing the anchorText bbox and the bbox of the region of interest. |
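A sketch of a single-element region definition that anchors on a label via regex and takes the area to its right up to the end of the page (anchor text, page numbers and page dimensions are illustrative):

```python
region_definition = [{
    'anchorText': ['^Invoice No'],   # illustrative regex anchor
    'pageNum': [1],
    'anchorTextMatch': {'method': 'regex', 'similarityScore': 1},
    # anchorPoint1: top-left corner of the anchor text bbox
    'anchorPoint1': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
    # anchorPoint2: bottom edge of the anchor text, 100% of the remaining
    # width towards the right edge of the page ('%r' unit)
    'anchorPoint2': {'left': None, 'top': None, 'right': '100%r', 'bottom': 0},
    'pageDimensions': {'width': 2550, 'height': 3300}
}]
regions = ocr_parser_obj.get_bbox_for(region_definition,
                                      scaling_factor={'hor': 1, 'ver': 1})
```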
Returns an ordered list of tokens in all four directions (top, left, bottom, right) from a given
anchor text. The number of tokens returned in each direction can be set using token_count.
def get_nearby_tokens(anchor_txt_dict:{'anchorText': [''],
'anchorTextMatch': {'method': '',
'similarityScore': 1,
'maxWordSpace': '1.5t'},
'pageNum': [],
'distance': {'left': None,
'top': None,
'right': None,
'bottom': None},
'pageDimensions': {'width': 0,
'height': 0}},
token_type_value:int=3,
token_count:int=1,
token_min_alignment_threshold:float=0.5,
scaling_factor={'hor': 1,
'ver': 1}):
Input:
Argument | Description |
---|---|
anchor_txt_dict (ANCHOR_TXT_DICT) | Anchor text info around which nearby tokens are to be searched. |
token_type_value (int, optional) | 1(WORD), 2(LINE), 3(PHRASE). Defaults to 3. |
token_count (int, optional) | Count of nearby tokens to be returned. Defaults to 1. |
token_min_alignment_threshold (float, optional) | Fraction of the anchor text that must be aligned with a nearby token; the value ranges from 0 to 1. If the threshold is 0, the nearby token must lie completely within the width/height of the anchor text (Eg. if anchor_text_bbox=[10,10,100,20], the nearby top token must be above the anchor text and between the lines x=10 and x=110). For a threshold between 0 and 1, the nearby token must align with at least that fraction of the anchor text (Eg. for anchor_text_bbox=[10,10,100,20] and threshold=0.5, the nearby top token must align with at least 50px of the anchor text on either side, i.e. 0.5 of the anchor text width). For a threshold of 1, the token must start at or before the anchor text's start position and end at or after the anchor text's end position. Defaults to 0.5. |
scaling_factor (SCALING_FACTOR, optional) | Scaling factor to apply to the token bboxes. Defaults to SCALING_FACTOR. |
Output:
Param | Description |
---|---|
list | Info of nearby tokens in each direction. |
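A sketch of fetching the nearest phrase in each direction around an anchor text, continuing with ocr_parser_obj from above (values are illustrative):

```python
anchor_txt_dict = {
    'anchorText': ['Total'],
    'anchorTextMatch': {'method': 'normal', 'similarityScore': 0.9},
    'pageNum': [1],
    'distance': {'left': None, 'top': None, 'right': None, 'bottom': None},
    'pageDimensions': {'width': 2550, 'height': 3300}
}
nearby_tokens = ocr_parser_obj.get_nearby_tokens(
    anchor_txt_dict,
    token_type_value=3,                 # PHRASE
    token_count=1,                      # one token per direction
    token_min_alignment_threshold=0.5)
```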
Get Json data of token_type_value by parsing the OCR file(s).
def get_tokens_from_ocr(token_type_value:int,
within_bbox=[],
ocr_word_list=[],
pages=[],
scaling_factor={'hor': 1,
'ver': 1},
max_word_space='1.5t') -> list:
Input:
Argument | Description |
---|---|
token_type_value (int) | 1(WORD), 2(LINE), 3(PHRASE) |
within_bbox (list, optional) | Return Json data within this region. Default is empty list. |
ocr_word_list (list, optional) | When token_type_value is 3 (PHRASE), the words in ocr_word_list are formed into phrases and their Json data is returned. Default is empty list. |
pages (list, optional) | To get token_type_value Json data for specific page(s) from list of pages. Default is empty list. |
scaling_factor (SCALING_FACTOR, optional) | Scale to given number and then returns token. Defaults to SCALING_FACTOR. |
max_word_space (str, optional) | max space between words to consider them as one phrase. Defaults to '1.5t'. |
Output:
Param | Description |
---|---|
list | List of token dict |
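A sketch of fetching tokens directly (page number and word list are illustrative):

```python
# All WORD tokens from page 1.
word_tokens = ocr_parser_obj.get_tokens_from_ocr(token_type_value=1,
                                                 pages=[1])

# PHRASE tokens built only from the given words.
phrase_tokens = ocr_parser_obj.get_tokens_from_ocr(
    token_type_value=3,
    ocr_word_list=['Invoice', 'No'],
    max_word_space='1.5t')
```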
Save token_type_value Json data to out_file location by parsing the OCR file(s).
def save_tokens_as_json(out_file,
token_type_value:int,
pages=[],
scaling_factor={'hor': 1,
'ver': 1}) -> dict:
Input:
Argument | Description |
---|---|
out_file (str) | Json file full path. E.g. 'C:/word_token.json'. |
token_type_value (int) | 1(WORD), 2(LINE), 3(PHRASE). |
pages (list, optional) | To get token_type_value Json data for specific page(s) from list of pages. Default is empty list. |
scaling_factor (SCALING_FACTOR, optional) | Token data saved to json file after Scaling to given number. Defaults to SCALING_FACTOR. |
Output:
Param | Description |
---|---|
dict | Dict of saved info. |
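A sketch of saving LINE tokens of page 1 to a file (the output path is illustrative):

```python
save_info = ocr_parser_obj.save_tokens_as_json(
    'C:/line_token.json',            # illustrative output path
    token_type_value=2,              # LINE
    pages=[1],
    scaling_factor={'hor': 1, 'ver': 1})
```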
Creates an instance of the TesseractOcrDataServiceProvider class.
class TesseractOcrDataServiceProvider(logger:logging.Logger=None,
log_level:int=None):
Input:
Argument | Description |
---|---|
logger (logging.Logger, optional) | logger object. Defaults to None. |
log_level (int, optional) | log level. Defaults to None. |
Output:
None
Returns list of line dictionaries containing text and bbox values.
def get_line_dict_from(pages:list=None,
line_dict_list:list=None,
scaling_factors:list=None) -> list:
Input:
Argument | Description |
---|---|
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
line_dict_list (list, optional) | Existing line dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of line dictionaries containing the text, words and respective bbox values. |

Returns list of dictionaries containing the page number and its bbox values.
def get_page_bbox_dict() -> list:
Output:
Param | Description |
---|---|
list | List of dictionaries containing the page number and its bbox values. |
Returns list of word dictionaries containing text and bbox values.
def get_word_dict_from(line_obj=None,
pages:list=None,
word_dict_list:list=None,
scaling_factors:list=None) -> list:
Input:
Argument | Description |
---|---|
line_obj ([any], optional) | Existing line object to get the words from. Defaults to None. |
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
word_dict_list (list, optional) | Existing word dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of word dictionary containing the text, bbox and conf values. |
Method used to load the list of input OCR files into the given provider.
def init_provider_inputs(doc_list:list):
Input:
Argument | Description |
---|---|
doc_list (list) | OCR file list |
Output:
None
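The provider can also be used directly when only raw line or word dictionaries are needed. A sketch, reusing the import and the hypothetical hOCR file from the earlier example:

```python
provider = TesseractOcrDataServiceProvider()
provider.init_provider_inputs(['C:/1.hocr'])      # hypothetical hOCR file

line_dicts = provider.get_line_dict_from(pages=[1])
word_dicts = provider.get_word_dict_from(pages=[1],
                                         scaling_factors=[1.0, 1.0])
```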
Creates an instance of the AzureReadOcrDataServiceProvider class.
class AzureReadOcrDataServiceProvider(logger:logging.Logger=None,
log_level:int=None):
Input:
Argument | Description |
---|---|
logger (logging.Logger, optional) | logger object. Defaults to None. |
log_level (int, optional) | log level. Defaults to None. |
Output:
None
Returns list of line dictionaries containing text and bbox values.
def get_line_dict_from(pages:list=None,
line_dict_list:list=None,
scaling_factors=None) -> list:
Input:
Argument | Description |
---|---|
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
line_dict_list (list, optional) | Existing line dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of line dictionaries containing the text, words and respective bbox values. |

Returns list of dictionaries containing the page number and its bbox values.
def get_page_bbox_dict() -> list:
Output:
Param | Description |
---|---|
list | List of dictionaries containing the page number and its bbox values. |
Returns list of word dictionaries containing text and bbox values.
def get_word_dict_from(line_obj=None,
pages:list=None,
word_dict_list:list=None,
scaling_factors=None,
bbox_unit='pixel') -> list:
Input:
Argument | Description |
---|---|
line_obj ([any], optional) | Existing line object to get the words from. Defaults to None. |
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
word_dict_list (list, optional) | Existing word dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
bbox_unit (str, optional) | Unit of bbox value. Defaults to 'pixel'. |
Output:
Param | Description |
---|---|
list | List of word dictionary containing the text, bbox and conf values. |
Method used to load the list of input OCR files into the given provider.
def init_provider_inputs(doc_list:list):
Input:
Argument | Description |
---|---|
doc_list (list) | OCR file list |
Output:
None
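Apart from the input file format (Azure Read v3 output instead of hOCR), this provider plugs into OcrParser the same way as the Tesseract provider. A sketch with an assumed import path and file name, continuing the earlier ocr_parser import:

```python
# Sketch only: import path and file name are assumptions.
from infy_ocr_parser.providers.azure_read_ocr_data_service_provider \
    import AzureReadOcrDataServiceProvider

azure_read_provider = AzureReadOcrDataServiceProvider()
ocr_parser_obj = ocr_parser.OcrParser(
    ['C:/1_azure_read_v3.json'],     # hypothetical Azure Read v3 response file
    data_service_provider=azure_read_provider)
```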
Creates an instance of the AzureOcrDataServiceProvider class.
class AzureOcrDataServiceProvider(logger:logging.Logger=None,
log_level:int=None):
Input:
Argument | Description |
---|---|
logger (logging.Logger, optional) | logger object. Defaults to None. |
log_level (int, optional) | log level. Defaults to None. |
Output:
None
Returns list of line dictionaries containing text and bbox values.
def get_line_dict_from(pages:list=None,
line_dict_list:list=None,
scaling_factors=None) -> list:
Input:
Argument | Description |
---|---|
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
line_dict_list (list, optional) | Existing line dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of line dictionaries containing the text, words and respective bbox values. |

Returns list of dictionaries containing the page number and its bbox values.
def get_page_bbox_dict() -> list:
Output:
Param | Description |
---|---|
list | List of dictionaries containing the page number and its bbox values. |
"Returns list of word dictionary containing text and bbox values.
def get_word_dict_from(line_obj=None,
pages:list=None,
word_dict_list:list=None,
scaling_factors=None) -> list:
Input:
Argument | Description |
---|---|
line_obj ([any], optional) | Existing line object to get the words from. Defaults to None. |
pages (list, optional) | Page(s) to filter from the given doc_list. Defaults to None. |
word_dict_list (list, optional) | Existing word dictionary to filter for certain page(s). Defaults to None. |
scaling_factors (list, optional) | Values to scale the bbox up/down. The first element is the vertical scaling factor and the second is the horizontal scaling factor. Defaults to [1.0, 1.0]. |
Output:
Param | Description |
---|---|
list | List of word dictionary containing the text, bbox and conf values. |
Method used to load the list of input OCR files into the given provider.
def init_provider_inputs(doc_list:list):
Input:
Argument | Description |
---|---|
doc_list (list) | OCR file list |
Output:
None
To create a bbox, we need two diagonal points, which can be derived from the anchor text bbox. To do that, we select any one of the four corners of the anchor text bbox as a reference point and navigate from it through the image. Below are the different combinations by which a particular point can be selected and navigated to another point.
Tips and Suggestions:
- For better and more flexible region definitions, use relative values instead of hard numbers, e.g. 't' or '%' units instead of pixel values when defining anchor points.
- A region definition is a list of at most 2 elements. If a list with 1 element is passed, use 2 anchor points (anchorPoint1 and anchorPoint2) to define the region; if 2 elements are passed, use 1 anchor point (anchorPoint1) for each element in the list. Ex.

```python
# With a single element and two anchor points from the same anchor text
reg_def_1 = [{
    'anchorText': '^name',
    'anchorPoint1': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
    'anchorPoint2': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
    'anchorTextMatch': {'method': 'regex'}
}]

# With two elements and two anchor points from two different anchor texts
# and different match methods
reg_def_2 = [
    {'anchorText': '^name',
     'anchorPoint1': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
     'anchorTextMatch': {'method': 'regex'}},
    {'anchorText': 'Name',
     'anchorPoint1': {'left': 0, 'top': 0, 'right': None, 'bottom': None},
     'anchorTextMatch': {'method': 'normal', 'similarityScore': 0.8}}
]
```