Optical Character Recognition using Pytesseract or EasyOCR, along with preprocessing and postprocessing techniques. By following and adapting these scripts, you can build an OCR workflow for your own projects.
Most tips and techniques are taken from the Constellate notebooks by Ithaka. See them for further information!
Here, all the steps of a hypothetical project workflow for OCRing an entire multi-page book are covered.
The example I used is a rare Italian book from the 1950s which I OCRed for a friend. The spellchecker used for postprocessing is therefore based on the Italian language. The book is missing a couple of pages, which is why the page count in the OCRed text files does not exactly correspond to the page numbering of the book.
For the last step, text analysis, see my Python script for text analysis.
The project is created using Python 3.7.
The software used is:
Tesseract (see Tesseract-OCR)
Poppler (for pdf2image library).
I recommend adding them to the environment variables of your OS; alternatively, you have to specify their paths in the code.
All used libraries:
pdf2image (1.16.0)
numpy (1.21.6)
opencv-python (4.6.0.66)
pillow (9.1.0)
scipy (1.7.3)
pytesseract (0.3.9)
pandas (1.3.5)
regex (2022.4.24)
The first five scripts are preprocessing steps that can be individually executed:
This step can be useful to remove marks made with highlighters, colored pens or pencils (such as marginalia) from a scanned book. It can be repeated for different colors.
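A minimal sketch of one way to do this with NumPy, assuming yellow highlighter marks on an RGB scan; the channel thresholds here are illustrative values to tune per book, not the ones used in the project's script:

```python
import numpy as np

def remove_yellow_marks(rgb):
    """Replace yellow highlighter pixels with white.

    `rgb` is an HxWx3 uint8 array. A pixel is treated as highlighter
    yellow when red and green are strong but blue is weak; the
    thresholds below are assumptions to adjust for your scans.
    """
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    mask = (r > 150) & (g > 150) & (b < 120)
    cleaned = rgb.copy()
    cleaned[mask] = 255  # paint the mark white so OCR ignores it
    return cleaned
```

Repeating the step for another color only means swapping in a different channel condition for `mask`.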
Converting images to greyscale or black & white can improve the performance of the OCR tools.
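The two conversions can be sketched with NumPy as follows (a luminosity greyscale plus a fixed-threshold binarization; the threshold value is an assumption to tune, and the project's script may use a library routine instead):

```python
import numpy as np

def to_greyscale(rgb):
    """Luminosity greyscale conversion of an HxWx3 uint8 image."""
    grey = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return np.rint(grey).astype(np.uint8)

def to_black_and_white(grey, threshold=128):
    """Binarization: pixels brighter than `threshold` become white (255)."""
    return np.where(grey > threshold, 255, 0).astype(np.uint8)
```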
This technique must be personalized by identifying the correct pixel measurements to crop. The goal is to split double-page scans or to remove page numbers and heading titles from the margins of the book.
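Both operations reduce to array slicing once you have measured the pixel coordinates on your scans; a sketch (the default middle cut and the margin sizes are placeholders for your own measurements):

```python
import numpy as np

def split_double_page(page, gutter_x=None):
    """Split a double-page scan into left and right pages.

    By default the cut is at the horizontal middle; pass `gutter_x`
    (a pixel column measured on your scans) to cut at the book's gutter.
    """
    if gutter_x is None:
        gutter_x = page.shape[1] // 2
    return page[:, :gutter_x], page[:, gutter_x:]

def crop_margins(page, top=0, bottom=0, left=0, right=0):
    """Trim margins (e.g. running heads and page numbers), in pixels."""
    h, w = page.shape[:2]
    return page[top:h - bottom, left:w - right]
```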
Automatic rotation of pages based on an automatic skew measurement. If an image is not askew, it is not rotated.
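One common way to measure skew automatically is the projection-profile method: try candidate rotations and keep the one where the row sums of the binarized page are most sharply peaked, i.e. where text lines are most horizontal. This sketch uses `scipy.ndimage` (already in the dependency list) and is an assumed approach, not necessarily the measurement used in the project's script:

```python
import numpy as np
from scipy import ndimage

def estimate_skew(binary, max_angle=5.0, step=0.5):
    """Estimate the correction angle (in degrees) for a binarized page.

    Returns 0.0 for a page that is already straight, so such pages
    are left untouched.
    """
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = ndimage.rotate(binary, angle, reshape=False, order=0)
        score = rotated.sum(axis=1).var()  # high when lines are horizontal
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def deskew(binary):
    angle = estimate_skew(binary)
    if angle == 0.0:  # not askew: do not rotate
        return binary
    return ndimage.rotate(binary, angle, reshape=False, order=0)
```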
Not only denoising, but contrast and brightness adjustments too. This step can be useful, but it was not used in my project.
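Since this step was not used in the project, here is only a generic sketch of what it could look like: a median filter for salt-and-pepper noise plus a linear contrast/brightness adjustment (the `alpha`/`beta` defaults are arbitrary examples):

```python
import numpy as np
from scipy import ndimage

def denoise(grey, size=3):
    """Median filter: removes isolated specks from a greyscale scan."""
    return ndimage.median_filter(grey, size=size)

def adjust_contrast_brightness(grey, alpha=1.2, beta=10):
    """Linear adjustment: `alpha` scales contrast, `beta` shifts brightness."""
    return np.clip(alpha * grey.astype(float) + beta, 0, 255).astype(np.uint8)
```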
Here are different ways to OCR the preprocessed images. In the 7A, 7B and 7C code I use PyTesseract, the Python library that manages Tesseract. These scripts differ only in the output format:
- 7A is the most rigorous approach: from every preprocessed image we obtain a separate text file as output, which means one text file for each page of the scanned book. See the "texts" folder in the "output" folder. These multiple outputs are very helpful for postprocessing tasks;
- 7B is the simplest way to obtain a single output: one text file from the whole scanned book, without subdivisions. See file.txt in the output folder;
- 7C is similar to 7B but with an advantage: the single text file has page subdivisions marked in the text, to help human reviewers make corrections. See file_paginated.txt in the output folder.
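The per-page OCR of 7A and the page markers of 7C can be sketched roughly as follows. The `"ita"` language code and the marker format are assumptions, not the scripts' exact output, and the OCR call needs the pytesseract library plus a local Tesseract install:

```python
def ocr_page(image_path, lang="ita"):
    """OCR one preprocessed page image, as in 7A.

    Requires pytesseract and a Tesseract installation with the
    Italian language pack; "ita" is an assumption for this book.
    """
    import pytesseract  # imported lazily: the merge helper below has no OCR dependency
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)

def join_with_page_markers(page_texts, first_page=1):
    """Merge per-page texts into a single string with markers, as in 7C.

    The marker format here is illustrative, not the project's exact one.
    """
    parts = []
    for offset, text in enumerate(page_texts):
        parts.append(f"--- page {first_page + offset} ---\n{text.strip()}\n")
    return "\n".join(parts)
```

`first_page` lets the markers follow the book's own numbering even when, as here, some pages are missing from the scan.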
In 7D and 7E I tried EasyOCR 1.7.1 (note: the requirements are higher, and you need a more recent Python version; there could be some issues because of PyTorch). EasyOCR is a newer library that offers great development potential and useful features, such as multilingual recognition and a much more detailed output: a nested list with 3 main items per detection (bounding box, detected text and confidence level). However, I don't know how to obtain a simple text that maintains the original line division of the document, or how to avoid some mistakes in word order. For more details see this Medium article.
- 7D easyocr_images: the output is an image with the OCRed text annotated on each preprocessed page. Based on confidence levels, it is possible to color the annotated text differently (for example: green if confidence > 0.8, yellow if between 0.5 and 0.8, red if < 0.5);
- 7E easyocr_texts: the EasyOCR version of Tesseract's 7A.
In this way, a human reviewer can easily and quickly detect and correct the errors.
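One possible way to recover a plain text with line divisions from EasyOCR's (bounding box, text, confidence) triples is to sort detections by vertical position, group those whose top edges are close into lines, and read each line left to right. This is only a sketch of an idea, not part of the project's scripts, and the pixel tolerance is an assumption:

```python
def detections_to_text(detections, line_tolerance=10):
    """Rebuild reading order from EasyOCR-style results.

    `detections` is a list of (bbox, text, confidence), where bbox is
    four [x, y] corners with the top-left corner first. Detections
    whose top edges lie within `line_tolerance` pixels are treated as
    belonging to one line.
    """
    items = []
    for bbox, text, _conf in detections:
        x, y = bbox[0]  # top-left corner
        items.append((y, x, text))
    items.sort()  # top-to-bottom, then left-to-right
    lines, current, last_y = [], [], None
    for y, x, text in items:
        if last_y is not None and y - last_y > line_tolerance:
            lines.append(" ".join(current))
            current = []
        current.append(text)
        last_y = y
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)
```

This handles the common case but can still misorder words when a line's detections have noticeably different top edges, which is exactly the word-order problem mentioned above.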
The main postprocessing technique is spellchecking, to detect transcription errors in the OCRed text. Since there are not many good libraries for Italian spellchecking, here the task is performed using glossaries: OCRed words not included in our Italian glossaries are reported as errors. The Italian glossaries I use were built by me by combining various sources.
The output is spellcheker_data.csv in the output folder.
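The glossary lookup itself can be sketched like this (the tokenization pattern and the reported columns are illustrative assumptions, not the exact layout of the project's CSV):

```python
import re
from collections import Counter

def find_suspect_words(page_texts, glossary):
    """Report OCRed words not found in the glossary.

    `page_texts` maps a page number to its OCRed text (the per-page
    files of 7A make this easy); `glossary` is a set of known
    lowercase Italian words. Returns (word, occurrences, pages) rows
    ready to be written to a CSV.
    """
    counts, pages = Counter(), {}
    for page, text in page_texts.items():
        # crude tokenizer: runs of lowercase letters, accents included
        for word in re.findall(r"[a-zàèéìòù]+", text.lower()):
            if word not in glossary:
                counts[word] += 1
                pages.setdefault(word, set()).add(page)
    return [(w, n, sorted(pages[w])) for w, n in counts.most_common()]
```

Tracking the pages alongside each suspect word lets a reviewer jump straight to the right per-page text file.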
- converting all the scripts into an object-oriented program
Thank you to Nathan Kelber from Ithaka for teaching me so many things. Many of these scripts are inspired by his courses.
Thank you to my colleague Deborah Grbac for the projects we worked on together.