A library for reading documents laid out in newspaper-style columns into text.

ashbate/OCR_Reading_PDF

OCR Reading PDF

An innovative library designed to segment column-based layouts in PDFs into manageable snippets, enhancing OCR (Optical Character Recognition) processes.
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Contact
  7. Acknowledgments

About The Project

OCR Reading PDF is a preprocessing library for OCR applications, aimed at PDFs with newspaper-style, column-based layouts. It uses image processing to segment each page into columns or snippets, so that OCR runs on one column at a time, with the aim of improving accuracy.

Built With

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

Ensure you have Python 3.6 or later installed on your system.

Installation

  1. Clone the repo
    git clone https://github.com/ashbate/OCR_Reading_PDF.git
    
  2. Navigate to the cloned directory and install required packages:
    pip install -r requirements.txt
    

Usage

For detailed usage instructions, please refer to the Documentation.

The library processes scanned PDF files of newspaper-style columnar documents, breaking them down into individual column images for enhanced OCR accuracy. Here's a step-by-step guide using the provided methods:

  1. Convert the PDF to Image: Begin by converting your PDF pages into images. For the sake of this example, we will assume that this step has been completed and we now have images of each page.

  2. Adjust the Image: Crop the image to eliminate extraneous borders or edges, streamlining the OCR process. Here's the result after using the adjust method:

    Adjusted Image

  3. Get Cut Points: Calculate the precise locations at which the image should be cut into columns or snippets with the get_cutpoints method.

  4. Cut the Image into Snippets: Utilizing the cut points, slice the images into individual columns. Here are the snippets after being processed:

    Snippet 1 Snippet 2 Snippet 3

  5. Prepare for OCR: The individual column images are now prepared for OCR processing to extract the textual content.

  6. Final Look:

      A
     A thru Z Consulting and Distributing, Inc.
     7512 Clybourn Avenue
     Sun Valley, CA 91352-5146
     P.O. Box 15969
     North Hollywood, CA 91615-5969
     (818) 504-0420
     Fax: 818-504-0868
     Sean A. Stoddard, Owner
     Employees: 5
     Established: 1990
     Privately owned
     Products: Light steel manufacturing
     ABB HAFO Inc.
     (formerly Asea Hafo, Inc.)
     11501 Rancho Bernardo Road
     Suite 200
     San Diego, CA 92127
     (619) 675-3400
     Fax: 619-675-3450
     Parent Company: ABB HAFO AB
     Jarfalla, S-175 26 SWEDEN
     Bruttovagen 1, Box 520
     Ralph Waggitt, President
     Joseph Courtois, Vice President Finance
     Ron Carino, Vice President Sales and
     Yukio Nishikawa, Vice President Engineering
     Company acronym: HAFO
     Employees: 30
     Established: 1983
     Privately owned
     Products: Custom CMOS integrated circuits
     Sites: Stockholm (Sweden), Paris (France)
    

For a full demonstration, see the examples below created with images from a sample document.
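
The crop, cut-point, and cut steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the library's implementation: the function names `crop_borders`, `find_cut_points`, and `cut_columns`, the `ink_threshold` and `min_gap` parameters, and the representation of a page as nested lists of grayscale pixels are all assumptions made for this example. The library's own `adjust` and `get_cutpoints` methods may work differently and take different arguments.

```python
from typing import List

# Assumed representation: rows of grayscale pixels, 0 = ink, 255 = paper.
Image = List[List[int]]


def crop_borders(img: Image, ink_threshold: int = 128) -> Image:
    """Trim blank rows and columns from the edges of the page."""
    rows = [y for y, row in enumerate(img) if any(p < ink_threshold for p in row)]
    cols = [x for x in range(len(img[0]))
            if any(row[x] < ink_threshold for row in img)]
    if not rows or not cols:
        return img  # blank page: nothing to crop
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in img[top:bottom + 1]]


def find_cut_points(img: Image, ink_threshold: int = 128, min_gap: int = 2) -> List[int]:
    """Return the x position at the centre of each blank vertical gutter.

    A gutter is a run of at least `min_gap` pixel columns containing no ink.
    Runs that touch the left or right edge are treated as margins and skipped.
    """
    width = len(img[0])
    blank = [all(row[x] >= ink_threshold for row in img) for x in range(width)]
    cuts: List[int] = []
    start = None
    for x, is_blank in enumerate(blank + [False]):  # sentinel closes a trailing run
        if is_blank and start is None:
            start = x
        elif not is_blank and start is not None:
            if x - start >= min_gap and start > 0 and x < width:
                cuts.append((start + x) // 2)
            start = None
    return cuts


def cut_columns(img: Image, cuts: List[int]) -> List[Image]:
    """Slice the page into one image per column, left to right."""
    edges = [0] + cuts + [len(img[0])]
    return [[row[a:b] for row in img] for a, b in zip(edges, edges[1:])]


if __name__ == "__main__":
    # Synthetic page: two 4-pixel-wide ink columns separated by a 4-pixel gutter,
    # surrounded by a blank border.
    line = [255] * 2 + [0] * 4 + [255] * 4 + [0] * 4 + [255] * 2
    blank_line = [255] * len(line)
    page = [blank_line] + [list(line) for _ in range(3)] + [blank_line]

    cropped = crop_borders(page)
    cuts = find_cut_points(cropped)
    columns = cut_columns(cropped, cuts)
    print(cuts, [len(col[0]) for col in columns])
```

In practice the page images would more likely be NumPy, OpenCV, or Pillow arrays. The PDF-to-image and OCR steps are left out of the sketch because they depend on external tools; packages such as pdf2image and pytesseract are one possible choice, though the source does not say which tools this library uses.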

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project.
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature).
  3. Commit your Changes (git commit -m 'Add some AmazingFeature').
  4. Push to the Branch (git push origin feature/AmazingFeature).
  5. Open a Pull Request.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Project Link: https://github.com/ashbate/OCR_Reading_PDF

Acknowledgments

  • Image processing techniques and tools.
  • Optical Character Recognition (OCR) technology.
  • All contributors and open-source communities.
