OCRmyPDF FAQ and Usage
OCRmyPDF is excellent at converting PDFs to searchable OCR PDFs in an unsupervised fashion. It uses all available CPU cores and has a pipelined architecture that helps to schedule CPUs efficiently.
It works well as a batch job, whether a for-loop or managed by a directory watching program. It has been used for large scale batch jobs.
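The for-loop style of batch job mentioned above might look like the following minimal sketch. The function names and the directory layout are assumptions made for illustration; the only OCRmyPDF behavior relied on is the basic `ocrmypdf input.pdf output.pdf` invocation. Files are processed one at a time because OCRmyPDF already uses all available CPU cores for each file.

```python
from pathlib import Path
import subprocess


def build_batch_commands(input_dir, output_dir):
    """Return one ocrmypdf command per *.pdf file found directly in input_dir.

    The helper name and the flat directory layout are illustrative assumptions.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    commands = []
    for src in sorted(input_dir.glob("*.pdf")):
        dst = output_dir / src.name  # keep the same file name in the output directory
        commands.append(["ocrmypdf", str(src), str(dst)])
    return commands


def run_batch(input_dir, output_dir):
    """Run the commands sequentially and return the inputs that failed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for cmd in build_batch_commands(input_dir, output_dir):
        result = subprocess.run(cmd)  # requires ocrmypdf to be installed and on PATH
        if result.returncode != 0:
            failed.append(cmd[1])
    return failed
```

Calling `run_batch("scans", "scans-ocr")` (both directory names hypothetical) would leave the originals untouched and write the OCR results alongside them in a separate directory.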
"First do no harm" is a principle it tries to follow. Generally, it tries to change only things in a PDF that must change to complete OCR. If you accidentally run it on a regular "born digital" PDF (for example, a Microsoft Word document converted to PDF), or a PDF that contains a mix of "born digital" and scanned content, it can add OCR without destroying the scanned content (if the --skip-text option is provided). It can also force rasterizing of all this content with --force-ocr.
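As a rough illustration of the two options named above, the sketch below builds the command line for each case. The helper name `ocr_command` and its `existing_text` parameter are hypothetical; only the `--skip-text` and `--force-ocr` flags come from OCRmyPDF itself.

```python
def ocr_command(src, dst, existing_text="default"):
    """Build an ocrmypdf command for a PDF that may already contain text.

    existing_text (a hypothetical parameter) selects the policy:
      "default" - pass no extra flag and accept ocrmypdf's default behavior
      "skip"    - --skip-text: leave pages that already have text alone
      "force"   - --force-ocr: rasterize all content and OCR it
    """
    flags = {
        "default": [],
        "skip": ["--skip-text"],
        "force": ["--force-ocr"],
    }
    if existing_text not in flags:
        raise ValueError(f"unknown policy: {existing_text!r}")
    return ["ocrmypdf", *flags[existing_text], str(src), str(dst)]
```

The resulting list can be passed to `subprocess.run`. For a mixed "born digital" and scanned file, `ocr_command("mixed.pdf", "mixed-ocr.pdf", "skip")` is the conservative choice (the file names are placeholders).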
Abbyy FineReader 12 is the author's recommendation for anyone who needs to create a PDF whose annotated text needs to be perfect. This program automatically recognizes text and then gives the user the opportunity to correct it. However, correcting text by hand is boring and, some might even say, soul-destroying. FineReader is very sophisticated at detecting document elements such as tables and converting them to spreadsheets. For example, if one needs to extract scientific data from a scanned image, turn to FineReader. Its OCR is much slower than OCRmyPDF, but higher quality even when unsupervised.
Adobe Acrobat XI can perform OCR, but it is slower than OCRmyPDF and similar in accuracy. It does a better job of preserving the original page's content, however. If you work with PDFs extensively, you will find many uses for Acrobat.
OCRmyPDF uses Tesseract-OCR as its OCR engine, so it depends entirely on Tesseract for OCR quality. Tesseract gives good results for clear black and white scans with common fonts and normal font sizes, when the correct language is specified, the dictionary contains all the document's words, and the document is oriented in one direction, deskewed, and free of multi-orientation elements; basically, when the stars are aligned. If you can easily read a document yourself with no squinting or special effort, that is a good sign. OCRmyPDF and Tesseract do a good job on files like tests/resources/LinnSequencer.jpg.
OCRmyPDF is good at making documents searchable by identifying keywords within them. In a huge collection, even its ability to only occasionally find useful keywords can be helpful for search.
If you want all of the text extracted perfectly, consider using one of the commercial programs.
Unfortunately, in many files certain patterns will confuse Tesseract, and it will find gibberish and not filter this out. Maps are one example: legend markers and geographic features will be reinterpreted as letters. The --debug-output option reveals its findings for the curious.
If possible, OCRmyPDF will insert a text layer into your PDF and convert the result to PDF/A. PDF/A conversion may (probably does) transcode images to a standardized colorspace, which is what you want for long term archiving. Auto-rotation correction can be done without changing the quality of the image layer.
For some options and some PDF files, it will instead rasterize your PDF at the resolution of the highest quality image on a given page, perform OCR on the image, and then construct a new PDF based on the image and text. In this case, the resulting PDF could be larger than the input if multiple images are present, and vector content will be lost.
OCRmyPDF excels at taking an existing PDF and adding OCR information to it, something that is cumbersome to do with many other tools, because modifying a PDF in a way that preserves as much of its content as possible is not easy.
You can use Tesseract itself to convert single images or multi-page TIFFs to OCR PDFs (without PDF/A). For some people this is enough.
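For reference, running Tesseract directly to get an OCR PDF can be sketched as below. The helper name and file names are hypothetical. The command follows Tesseract's `tesseract imagename outputbase [options] [configfile]` form with the `pdf` config, where the output base is given without an extension and Tesseract appends `.pdf` itself.

```python
from pathlib import Path


def tesseract_pdf_command(image, output_pdf, language=None):
    """Build a tesseract command that writes an OCR PDF for one image or TIFF.

    The helper is an illustrative assumption; output_pdf is expected to end
    in ".pdf", which is stripped to form Tesseract's output base.
    """
    output_base = Path(output_pdf).with_suffix("")  # tesseract adds ".pdf" itself
    cmd = ["tesseract", str(image), str(output_base)]
    if language:
        cmd += ["-l", language]  # e.g. "eng"; the language data must be installed
    cmd.append("pdf")  # the "pdf" config selects PDF output
    return cmd
```

For example, `tesseract_pdf_command("scan.tif", "scan.pdf", "eng")` yields a command equivalent to `tesseract scan.tif scan -l eng pdf`.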
If you want to use OCRmyPDF to run Tesseract for its speed or preprocessing capabilities, or you require PDF/A, then you need to use another program to convert your images to PDF first.
For common images (JPEG, PNG), use img2pdf. It works without transcoding images, and has options for normalizing page sizes.
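A minimal sketch of the two-step route described above: wrap the images in a PDF with img2pdf, then hand that PDF to OCRmyPDF. The function name and all file names are assumptions for illustration; the commands use img2pdf's `-o` output option and the basic `ocrmypdf input.pdf output.pdf` form.

```python
def image_to_ocr_pdf_commands(images, intermediate_pdf, output_pdf):
    """Return the two commands needed to turn images into an OCR PDF.

    Step 1 wraps the images in a PDF with img2pdf (one page per image).
    Step 2 runs ocrmypdf on that intermediate PDF.
    """
    images = [str(image) for image in images]
    if not images:
        raise ValueError("at least one image is required")
    wrap = ["img2pdf", *images, "-o", str(intermediate_pdf)]
    ocr = ["ocrmypdf", str(intermediate_pdf), str(output_pdf)]
    return [wrap, ocr]
```

Each command can be run in order with `subprocess.run(cmd, check=True)`, so that OCR is not attempted if the img2pdf step fails.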
Unfortunately, you might have a TIFF to convert to PDF. Worse still, it might be an exotic multipage TIFF. Here are your options:
- tiff2pdf in libtiff is a good tool for converting some multipage TIFFs to PDF, but it will fail on exotic types of TIFF despite being a reference implementation. For example, YCbCr colorspace TIFFs will be converted to valid PDFs, but the color information will be ruined.
- ImageMagick converts images to PDF by way of Ghostscript, and avoids some pitfalls of tiff2pdf, but it is slow and always transcodes images.
- GDAL is a set of tools for geographic information systems. It includes some good tools for working with scientific TIFFs. It may be the best option.
PDF/A is a subset of the PDF specification designed for archiving. All features in PDF not suitable for archiving, such as external fonts or color profiles, are eliminated. If a future civilization found a PDF/A 100,000 years from now, they would have everything they need to reproduce it. An ordinary PDF, by contrast, could refer to resources that might not be present on all computers.
Adobe Acrobat and Reader display a message when a PDF/A is opened and disable editing until the user clicks "Enable Editing". Otherwise, there is no impact on most users.