
This code extracts text from a PDF file using OCR, cleans it, and writes it to an Excel spreadsheet. It uses fitz, io, ocrmypdf, and pandas libraries to achieve this task.


felixlu07/ocronpdf


ocronpdf

This code extracts text from a PDF file, cleans the text, and writes it to an Excel spreadsheet.

The code opens the PDF file with the fitz library (PyMuPDF), then loops through the pages, calling page.get_pixmap() to render each page as an image. ocrmypdf then performs optical character recognition (OCR) on each image, and the extracted text is stored in a variable named text.
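The render-and-OCR loop described above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual code: the function names, the 300 DPI default, the temporary PNG files, and the use of ocrmypdf's sidecar text output are all assumptions made for illustration. It also assumes PyMuPDF, ocrmypdf, and a Tesseract installation are available.

```python
import tempfile
from pathlib import Path


def zoom_for_dpi(dpi: int) -> float:
    """Return the fitz zoom factor for a target DPI (PDF space is 72 points per inch)."""
    return dpi / 72.0


def ocr_pdf_pages(pdf_path: str, dpi: int = 300) -> str:
    """Render every page to a PNG, OCR it, and return the concatenated text.

    Hypothetical helper: the name, the DPI default, and the sidecar approach
    are assumptions, not taken from the repository.
    """
    # Third-party imports are local so zoom_for_dpi works without them installed.
    import fitz  # PyMuPDF
    import ocrmypdf

    zoom = zoom_for_dpi(dpi)
    page_texts = []
    with tempfile.TemporaryDirectory() as tmp, fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            # Rasterize the page at the requested resolution.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            png_path = Path(tmp) / f"page_{index}.png"
            pix.save(str(png_path))

            # OCR the image; the recognized text is written to the sidecar file.
            sidecar_path = Path(tmp) / f"page_{index}.txt"
            ocrmypdf.ocr(
                png_path,
                Path(tmp) / f"page_{index}.pdf",
                image_dpi=dpi,
                sidecar=sidecar_path,
            )
            page_texts.append(sidecar_path.read_text(encoding="utf-8"))
    return "\n".join(page_texts)
```

A call might look like `text = ocr_pdf_pages("input.pdf")`, where `input.pdf` is a placeholder path. Rendering at a higher DPI generally helps OCR accuracy at the cost of speed and memory.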

Next, the code cleans the text by removing all newline characters (\n) and extra whitespace, and then stores the resulting clean text back in the text variable.
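One way to do this cleaning step is shown below. It is a sketch under an assumption: each newline is replaced by a single space rather than deleted outright, so that words on adjacent lines do not run together. The function name `clean_text` is hypothetical.

```python
def clean_text(text: str) -> str:
    """Drop newlines and collapse runs of whitespace into single spaces.

    str.split() with no arguments splits on any whitespace (spaces, tabs,
    newlines) and discards empty pieces, so joining with one space also
    trims the leading and trailing whitespace.
    """
    return " ".join(text.split())


raw = "Invoice  No. 42\n\nTotal:\t100 USD \n"
print(clean_text(raw))  # Invoice No. 42 Total: 100 USD
```

If the newlines should be removed with no replacement, `text.replace("\n", "")` would do that instead, but words at line ends would then be fused with the next line's first word.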

Finally, the pandas library is used to create a dataframe named df containing the clean text, and this dataframe is then written to an Excel spreadsheet named training_data.xlsx.
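The export step might look like the sketch below. The single-row layout and the column name `text` are assumptions, since the README does not describe the dataframe's shape; only the names df and training_data.xlsx come from the description above. Writing .xlsx files with pandas also requires an Excel engine such as openpyxl to be installed.

```python
def build_rows(clean_text: str) -> list[dict[str, str]]:
    """Wrap the cleaned text in a one-row table (the 'text' column name is an assumption)."""
    return [{"text": clean_text}]


def write_training_data(clean_text: str, out_path: str = "training_data.xlsx"):
    """Create the dataframe and write it to an Excel file; returns the dataframe."""
    # Imported locally so build_rows can be used without pandas installed.
    import pandas as pd

    df = pd.DataFrame(build_rows(clean_text))
    # index=False keeps the pandas row index out of the spreadsheet.
    df.to_excel(out_path, index=False)
    return df
```

Calling `write_training_data(text)` with the cleaned text would produce a spreadsheet with one header cell and one data cell.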

Overall, the code is a short pipeline for extracting and cleaning text from PDFs and storing it in a structured format for further analysis or use.
