A command-line PDF cropping tool targeted specifically at adapting documents formatted for print to the different requirements of e-readers.
Submitted as a Mendicant University session 9 personal project.
Books which are formatted for print often contain large margins which waste a lot of the limited screen real estate on non-print devices, especially e-readers. Because PDF is a final output format, designed to look nearly identical on any display device, it does not contain the semantic data necessary to tell a reader where those margins are. We have to cheat to find them.
The goal of this project is to identify a bounding box for a given PDF which contains most, but not all, of the "ink" on a range of pages within that PDF, and to create a new PDF with the CropBox adjusted to contain only those interesting bits. Page numbers, headers, and rare footnotes or marginal notes should be trimmed in order to maximize the size of the body text.
Barber renders a range of pages as very low resolution raster images and then composes them into a single image, somewhat like running the same piece of paper through a printer many times. For documents with obvious margins, this should produce a large black rectangle in the center of the page.
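To make the "same piece of paper through a printer many times" idea concrete, here is a minimal sketch in plain Ruby. It is not Barber's code (Barber hands real raster files to ImageMagick); it assumes each page has already been reduced to a grid of booleans where true means "ink", and the helper names `parse_bitmap`, `compose_pages`, and `to_ascii` are made up for illustration.

```ruby
# Hypothetical helper: turn rows of '#' and '.' into a grid of booleans.
def parse_bitmap(rows)
  rows.map { |row| row.chars.map { |c| c == '#' } }
end

# Overlay any number of same-sized pages. A pixel is inked in the
# composite if at least one page put ink there.
def compose_pages(pages)
  height = pages.first.length
  width  = pages.first.first.length
  Array.new(height) do |y|
    Array.new(width) do |x|
      pages.any? { |page| page[y][x] }
    end
  end
end

# Hypothetical helper: render a grid back to text for inspection.
def to_ascii(bitmap)
  bitmap.map { |row| row.map { |ink| ink ? '#' : '.' }.join }
end

# Two tiny "pages" whose text sits in different places
# inside the same margins.
page_a = parse_bitmap(['......',
                       '.##...',
                       '.###..',
                       '......',
                       '......'])
page_b = parse_bitmap(['......',
                       '...##.',
                       '......',
                       '.####.',
                       '......'])

puts to_ascii(compose_pages([page_a, page_b]))
```

With only two pages the block in the middle is still ragged. The more pages you stack, the closer the inked area gets to a solid rectangle, while the margins stay empty.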
The composed image is then flood-filled from the center, and the pixels the fill did not reach are removed; this is a crude form of blob detection. The size and position of the remaining region are then compared with the full composite, and the resulting offset and dimensions are scaled up to the coordinate system of the original document. Finally, a new PDF is written with the CropBox set to the new values.
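The steps in the paragraph above can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not Barber's implementation: the composite is a grid of booleans, the fill uses 4-way connectivity, the MediaBox starts at [0, 0], and all of the method names are hypothetical. The one detail worth noticing is the vertical flip: raster rows count down from the top, while PDF coordinates count up from the bottom.

```ruby
# Hypothetical helper: turn rows of '#' and '.' into a grid of booleans.
def parse_bitmap(rows)
  rows.map { |row| row.chars.map { |c| c == '#' } }
end

# Breadth-first flood fill over inked pixels, starting at the center.
# Returns the [y, x] coordinates of every pixel the fill reached.
def flood_fill_from_center(bitmap)
  height = bitmap.length
  width  = bitmap.first.length
  start  = [height / 2, width / 2]
  return [] unless bitmap[start[0]][start[1]]

  seen  = { start => true }
  queue = [start]
  until queue.empty?
    y, x = queue.shift
    [[y - 1, x], [y + 1, x], [y, x - 1], [y, x + 1]].each do |ny, nx|
      next unless ny.between?(0, height - 1) && nx.between?(0, width - 1)
      next unless bitmap[ny][nx]
      next if seen[[ny, nx]]

      seen[[ny, nx]] = true
      queue << [ny, nx]
    end
  end
  seen.keys
end

# Smallest pixel rectangle containing the filled blob (right and bottom
# are exclusive), or nil if nothing was filled.
def bounding_box(pixels)
  return nil if pixels.empty?

  ys = pixels.map(&:first)
  xs = pixels.map(&:last)
  { left: xs.min, top: ys.min, right: xs.max + 1, bottom: ys.max + 1 }
end

# Scale the pixel box to PDF points as [llx, lly, urx, ury],
# flipping the y axis and rounding outward.
def crop_box(box, image_width, image_height, media_width, media_height)
  sx = media_width.to_f / image_width
  sy = media_height.to_f / image_height
  [(box[:left] * sx).floor,
   ((image_height - box[:bottom]) * sy).floor,
   (box[:right] * sx).ceil,
   ((image_height - box[:top]) * sy).ceil]
end

# A 10x8 composite: a solid body plus a stray page number near the bottom.
composite = parse_bitmap(['..........',
                          '..######..',
                          '..######..',
                          '..######..',
                          '..######..',
                          '..........',
                          '....#.....',
                          '..........'])

box = bounding_box(flood_fill_from_center(composite))
p crop_box(box, 10, 8, 500, 640) # => [100, 240, 400, 560]
```

The page number is not connected to the body, so the fill never reaches it and it falls outside the box. That is the whole trick: anything separated from the central blob by white space gets trimmed.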
It's up to you to visually scan the document beforehand to find a good range of pages to use as a basis for the required --range parameter. It's best to skip title pages, tables of contents, and pages whose content runs into the margins, such as those with large images or horizontal rules. A range of about ten pages will usually provide good results.
pdf-barber$ ruby bin/barber.rb --range 1-8 pdfs/bookie-basic-feature.pdf
MediaBox: [0, 0, 504, 661] CropBox: []
Rendering pages 1 to 8...
New CropBox for all pages: [66, 74, 440, 615] Page Size: [374, 541]
Writing PDF with new CropBox to cropped_bookie-basic-feature.pdf...
Options:
--separate
: Process odd- and even-numbered pages separately. This is useful for books in which the binding edge and outside edge of each page have different margins.
--dryrun
: Display the calculated CropBox without writing a new file.
--tmpdir DIR
: Render the working files to the specified directory and retain them, so you can see what the renderer is doing. WARNING: Using the same tmpdir for multiple runs will cause odd behavior.
--verbose
: Echo all of the system commands to stdout.
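As a rough illustration of what --separate implies, the page range can be split by parity so that each group gets its own composite image and its own CropBox. This is a sketch only; `split_by_parity` is a hypothetical name and Barber may divide the work differently.

```ruby
# Split an inclusive page range into odd- and even-numbered pages.
def split_by_parity(first_page, last_page)
  odd, even = (first_page..last_page).partition(&:odd?)
  { odd: odd, even: even }
end

p split_by_parity(1, 8)
# => {:odd=>[1, 3, 5, 7], :even=>[2, 4, 6, 8]} (newer Rubies print {odd: [...], even: [...]})
```

Note that splitting halves the number of pages feeding each composite, so a somewhat longer --range may help when using this option.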
- A *nix-like system. This has been tested on Ubuntu. Believe it or not, I don't currently own a Mac.
- GhostScript, to read and write PDF files.
- ImageMagick, to process the raster files generated by GhostScript.
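If you are curious how GhostScript can be driven for this kind of job, the sketch below builds (but does not run) two command lines: one that renders a page range as low-resolution grayscale PNGs, and one that rewrites a PDF with a CropBox supplied through a pdfmark. The flags are standard GhostScript options, but the exact commands Barber issues may differ, and the method names, the 9 dpi default, and the output file pattern are assumptions made for the example. Run Barber with --verbose to see what it really does.

```ruby
# Build a GhostScript command that renders pages first_page..last_page
# of a PDF as grayscale PNG files in out_dir. The dpi default is a guess
# at "very low resolution", not a value taken from Barber.
def render_command(pdf, first_page, last_page, out_dir, dpi: 9)
  ['gs', '-q', '-dNOPAUSE', '-dBATCH', '-dSAFER',
   '-sDEVICE=pnggray',
   "-r#{dpi}",
   "-dFirstPage=#{first_page}",
   "-dLastPage=#{last_page}",
   "-sOutputFile=#{File.join(out_dir, 'page_%03d.png')}",
   pdf]
end

# Build a GhostScript command that copies a PDF through the pdfwrite
# device while applying one CropBox, given as [llx, lly, urx, ury],
# to all pages.
def crop_command(pdf, out_pdf, crop_box)
  pdfmark = "[/CropBox [#{crop_box.join(' ')}] /PAGES pdfmark"
  ['gs', '-q', '-dNOPAUSE', '-dBATCH', '-dSAFER',
   '-sDEVICE=pdfwrite',
   "-sOutputFile=#{out_pdf}",
   '-c', pdfmark,
   '-f', pdf]
end

puts render_command('book.pdf', 1, 8, 'tmp').join(' ')
puts crop_command('book.pdf', 'cropped_book.pdf', [66, 74, 440, 615]).join(' ')
```

Keeping each command as an array rather than a single string means it can be passed to `system(*command)` without shell quoting problems in file names.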
Barber is intended to be used by a person, from a command line. You must eyeball each document in order to find a good page range. If you really want to, you can tell the Barber to give himself a shave:
require_relative 'lib/barber'
Barber::Shaver.shave(filename: 'pdfs/bookie-basic-feature.pdf', range: [1,8])