GitHub - ashbate/OCR_Reading_PDF: This is a library to read newspaper-column wise structured data into text

OCR Reading PDF

A library that segments column-based PDF layouts into per-column snippets, so that OCR (Optical Character Recognition) can process each column separately.

Table of Contents

About The Project
- Built With
Getting Started
- Prerequisites
- Installation
Usage
Contributing
License
Contact
Acknowledgments

About The Project

OCR Reading PDF is a preprocessing library for OCR applications, aimed at PDFs with newspaper-style, column-based layouts. It uses image processing to segment each page into columns or snippets, so that the OCR engine reads one column at a time, which is intended to improve the accuracy of the extracted text.

Built With

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

Ensure you have Python 3.6 or later installed on your system.
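
A quick way to confirm the prerequisite is to ask the interpreter itself. The snippet below is only a convenience sketch and is not part of the library; the name meets_minimum is made up for this example.

```python
import sys

# Minimum version stated in the prerequisites above.
MINIMUM = (3, 6)


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True when the given (or currently running) Python is new enough.

    `version` is a tuple such as (3, 8, 10); when omitted, the running
    interpreter's version is used. Hypothetical helper, not a library function.
    """
    if version is not None:
        current = tuple(version[:2])
    else:
        current = tuple(sys.version_info[:2])
    return current >= tuple(minimum)


if __name__ == "__main__":
    print("OK" if meets_minimum() else "Python 3.6 or later is required")
```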

Installation

Clone the repo, navigate to the cloned directory, and install the required packages:
```
git clone https://github.com/ashbate/OCR_Reading_PDF.git
cd OCR_Reading_PDF
pip install -r requirements.txt
```

Usage

For detailed usage instructions, please refer to the Documentation.

The library processes scanned PDF files of newspaper-style columnar documents, breaking them down into individual column images for enhanced OCR accuracy. Here's a step-by-step guide using the provided methods:

Convert the PDF to Image: Convert each PDF page into an image. This example assumes that step is already done and that an image of each page is available.
Adjust the Image: Use the adjust method to crop the image, removing extraneous borders or edges that would otherwise get in the way of OCR.
Get Cut Points: Use the get_cutpoints method to calculate the locations at which the image should be cut into columns or snippets.
Cut the Image into Snippets: Using the cut points, slice the image into individual columns.
Prepare for OCR: The individual column images are now ready for OCR processing to extract the text.
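
To make the crop, cut-point, and slicing steps concrete, here is a minimal sketch of one common way to do them: a vertical projection profile that looks for blank gutters between columns. It is not the code in OCR_Library.py, and the README does not state which technique the library's adjust and get_cutpoints methods use. The function names, parameters, and thresholds below are assumptions for illustration. The page is modelled as a non-empty rectangular list of grayscale rows so the example runs without any image library.

```python
from typing import List

# A page is a list of rows of grayscale pixels: 0 = black ink, 255 = white paper.
Image = List[List[int]]


def is_blank(values, ink_threshold=128):
    """True when no pixel in the sequence is dark enough to count as ink."""
    return all(v >= ink_threshold for v in values)


def crop_margins(img: Image, ink_threshold: int = 128) -> Image:
    """Trim fully blank rows and columns from the outer edges of the page."""
    rows = [y for y, row in enumerate(img) if not is_blank(row, ink_threshold)]
    cols = [x for x in range(len(img[0]))
            if not is_blank([row[x] for row in img], ink_threshold)]
    if not rows or not cols:
        return img  # page is entirely blank: leave it untouched
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def find_cut_points(img: Image, min_gap: int = 2, ink_threshold: int = 128) -> List[int]:
    """Return the x position at the centre of every interior blank gutter.

    A gutter is a run of at least `min_gap` consecutive ink-free pixel columns.
    Runs touching the left or right edge are margins, not gutters, and are skipped.
    """
    width = len(img[0])
    blank = [is_blank([row[x] for row in img], ink_threshold) for x in range(width)]
    cuts = []
    start = None
    for x in range(width + 1):
        if x < width and blank[x]:
            if start is None:
                start = x  # a blank run begins here
        elif start is not None:
            if x - start >= min_gap and start > 0 and x < width:
                cuts.append((start + x) // 2)
            start = None
    return cuts


def cut_columns(img: Image, cuts: List[int]) -> List[Image]:
    """Slice the page vertically at each cut point, left to right."""
    edges = [0] + list(cuts) + [len(img[0])]
    return [[row[a:b] for row in img] for a, b in zip(edges, edges[1:])]
```

With a real scan, the same idea applies to the pixel array of the page image. The ink threshold and the minimum gutter width usually need tuning per document, and skewed or noisy scans may need deskewing or denoising first.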

Final Look:

  A
 A thru Z Consulting and Distributing, Inc.
 7512 Clybourn Avenue
 Sun Valley, CA 91352-5146
 P.O. Box 15969
 North Hollywood, CA 91615-5969
 (818) 504-0420
 Fax: 818-504-0868
 Sean A. Stoddard, Owner
 Employees: 5
 Established: 1990
 Privately owned
 Products: Light steel manufacturing
 ABB HAFO Inc.
 (formerly Asea Hafo, Inc.)
 11501 Rancho Bernardo Road
 Suite 200
 San Diego, CA 92127
 (619) 675-3400
 Fax: 619-675-3450
 Parent Company: ABB HAFO AB
 Jarfalla, S-175 26 SWEDEN
 Bruttovagen 1, Box 520
 Ralph Waggitt, President
 Joseph Courtois, Vice President Finance
 Ron Carino, Vice President Sales and
 Yukio Nishikawa, Vice President Engineering
 Company acronym: HAFO
 Employees: 30
 Established: 1983
 Privately owned
 Products: Custom CMOS integrated circuits
 Sites: Stockholm (Sweden), Paris (France)
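
Text like the listing above comes from reading the column snippets in order, page by page and left to right. A minimal sketch of that assembly step follows, assuming each snippet is a (page number, column index, image) tuple and that ocr is any callable that turns an image into a string, for example a thin wrapper around an OCR engine of your choice. The name read_in_order and the tuple layout are hypothetical and do not come from the library.

```python
from typing import Any, Callable, Iterable, Tuple

# (page_number, column_index, image); the image type is whatever `ocr` accepts.
Snippet = Tuple[int, int, Any]


def read_in_order(snippets: Iterable[Snippet], ocr: Callable[[Any], str]) -> str:
    """Run OCR on every snippet and join the text in reading order.

    Reading order here means ascending page number, then ascending column index.
    Snippets whose OCR result is empty or whitespace-only are skipped.
    """
    ordered = sorted(snippets, key=lambda s: (s[0], s[1]))
    texts = [ocr(image).strip() for _, _, image in ordered]
    return "\n".join(t for t in texts if t)
```

Keeping the OCR engine behind a plain callable means the ordering logic can be checked with a stand-in function before a real engine is plugged in.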

For a full demonstration, see the examples below created with images from a sample document.

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Fork the Project.
Create your Feature Branch (git checkout -b feature/AmazingFeature).
Commit your Changes (git commit -m 'Add some AmazingFeature').
Push to the Branch (git push origin feature/AmazingFeature).
Open a Pull Request.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Project Link: https://github.com/ashbate/OCR_Reading_PDF

Acknowledgments

Image processing techniques and tools.
Optical Character Recognition (OCR) technology.
All contributors and open-source communities.

