A library that segments column-based PDF layouts into manageable snippets to improve OCR (Optical Character Recognition) results.
OCR Reading PDF improves the preprocessing stage of OCR applications, focusing on PDFs with newspaper-style, column-based layouts. It uses image processing to segment pages into columns or snippets, which improves subsequent OCR accuracy.
To get a local copy up and running, follow these steps.
Ensure you have Python 3.6 or later installed on your system.
- Clone the repo
git clone https://github.com/ashbate/OCR_Reading_PDF.git
- Navigate to the cloned directory and install required packages:
pip install -r requirements.txt
For detailed usage instructions, please refer to the Documentation.
The library processes scanned PDF files of newspaper-style columnar documents, breaking them down into individual column images for enhanced OCR accuracy. Here's a step-by-step guide using the provided methods:
- Convert the PDF to Image: Begin by converting your PDF pages into images. This example assumes that step is complete and an image of each page is available.
- Adjust the Image: Crop the image with the adjust method to eliminate extraneous borders or edges, streamlining the OCR process.
- Get Cut Points: Calculate the locations at which the image should be cut into columns or snippets with the get_cutpoints method.
- Cut the Image into Snippets: Using the cut points, slice the image into individual columns.
- Prepare for OCR: The individual column images are now ready for OCR processing to extract the textual content.
- Final Look:
A A thru Z Consulting and Distributing, Inc. 7512 Clybourn Avenue Sun Valley, CA 91352-5146 P.O. Box 15969 North Hollywood, CA 91615-5969 (818) 504-0420 Fax: 818-504-0868 Sean A. Stoddard, Owner Employees: 5 Established: 1990 Privately owned Products: Light steel manufacturing ABB HAFO Inc. (formerly Asea Hafo, Inc.) 11501 Rancho Bernardo Road Suite 200 San Diego, CA 92127 (619) 675-3400 Fax: 619-675-3450 Parent Company: ABB HAFO AB Jarfalla, S-175 26 SWEDEN Bruttovagen 1, Box 520 Ralph Waggitt, President Joseph Courtois, Vice President Finance Ron Carino, Vice President Sales and Yukio Nishikawa, Vice President Engineering Company acronym: HAFO Employees: 30 Established: 1983 Privately owned Products: Custom CMOS integrated circuits Sites: Stockholm (Sweden), Paris (France)
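The crop, cut-point, and slicing steps above can be illustrated with a minimal sketch. It is not the library's implementation: the functions crop_margins, find_cutpoints, and cut_columns are hypothetical stand-ins for the adjust and get_cutpoints methods, the page is a tiny nested list rather than a real scanned image, and the gutter detection assumes a simple vertical projection profile (counting ink in each pixel column).

```python
from typing import List

# Assumption: a page is a grid of pixels where 1 = ink and 0 = blank background.
Image = List[List[int]]


def crop_margins(img: Image) -> Image:
    """Trim blank rows and columns from the page edges (hypothetical adjust step)."""
    if not img:
        return []
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    if not rows or not cols:
        return []  # the page is entirely blank
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in img[top:bottom + 1]]


def find_cutpoints(img: Image, min_gap: int = 2) -> List[int]:
    """Return the x position at the middle of each blank gutter between columns."""
    if not img:
        return []
    width = len(img[0])
    # Vertical projection profile: amount of ink in each pixel column.
    profile = [sum(row[c] for row in img) for c in range(width)]
    cutpoints = []
    gap_start = None
    for x, ink in enumerate(profile):
        if ink == 0 and gap_start is None:
            gap_start = x  # a run of blank pixel columns begins here
        elif ink != 0 and gap_start is not None:
            # Ignore runs touching the left edge (margins) and runs that are too narrow.
            if gap_start > 0 and x - gap_start >= min_gap:
                cutpoints.append((gap_start + x) // 2)
            gap_start = None
    return cutpoints


def cut_columns(img: Image, cutpoints: List[int]) -> List[Image]:
    """Slice the page into one sub-image per column at the given x positions."""
    bounds = [0] + cutpoints + [len(img[0])]
    return [[row[a:b] for row in img] for a, b in zip(bounds, bounds[1:])]


# A toy 4 x 12 page: two text columns separated by a blank gutter.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

cropped = crop_margins(page)
cutpoints = find_cutpoints(cropped)
columns = cut_columns(cropped, cutpoints)
print(cutpoints)                        # x positions of the cuts
print([len(col[0]) for col in columns])  # width of each column snippet
```

On real scans the profile is rarely exactly zero in a gutter, so a noise threshold and a larger min_gap would likely be needed; each resulting column image would then be passed to the OCR engine separately.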
For a full demonstration, see the examples below created with images from a sample document.
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project.
- Create your Feature Branch (git checkout -b feature/AmazingFeature).
- Commit your Changes (git commit -m 'Add some AmazingFeature').
- Push to the Branch (git push origin feature/AmazingFeature).
- Open a Pull Request.
Distributed under the MIT License. See LICENSE for more information.
Project Link: https://github.com/ashbate/OCR_Reading_PDF
- Image processing techniques and tools.
- Optical Character Recognition (OCR) technology.
- All contributors and open-source communities.