Welcome! This repo is a step-by-step guide that provides an in-depth introduction to programatically filling and rendering PDF forms using modern web standards.
This tutorial uses a real government form as an example, the SF 2809 - Health Benefits Election Form.
If you just want to get up and running, see the quickstart instructions. However, the How To section contains in-depth instructions along with runnable code that will teach you how to "digitize" the SF 2809.
Clone this repo, cd
into it and run bundle
.
In one terminal tab, start the API server:
ruby server.rb
In another terminal tab, serve index.html
:
- Ruby:
ruby -rwebrick -e'WEBrick::HTTPServer.new(:Port => 8000, :DocumentRoot => Dir.pwd).start'
- Python 3:
python3 -m http.server
Visit localhost:8000
To modify the form data, alter formData
in sf2809.js
.
# in a new tab, while server.rb is running
curl \
-X POST \
-H "Content-Type: application/json" \
-d '{"part_i_area_code": "20006"}' \
http://localhost:4567 > sf2809.pdf
# ...
open sf2809.pdf
Below are the steps to teach you the process of "digitizing" a PDF form. A benefit of digitizing a PDF form is that you can store information as data and not as difficult-to-search files. Think of the filled PDF as one of many possible rendering implementations. The underlying data, once de-coupled from the PDF can be repurposed in useful ways, such as easy reporting. For example, suppose the underlying data is stored in a SQL database:
SELECT * FROM forms WHERE form.end_date < '2015-09-13';
Try doing that same operation against a folder of PDF files.
Moreovoer, by gaining the ability to represent the information contained in a PDF with semantic HTML, the information and user interface can become much more accessible.
At the end of this tutorial, you will have the following:
- A REST API that takes JSON and returns a filled PDF file
- A Javascript SDK that makes an AJAX request to the REST API and renders the result natively using PDF.js
- A simple Bootstrap HTML form. You fill in the form, submit it, and you get a filled PDF rendered in the same page.
- And most importantly: a particular set of skills that can be applied to any PDF form that accepts FDF data.
To complete this tutorial, you'll need some beginner/intermediate Ruby, Javascript, and HTML skills. Nothing too fancy.
Technically-speaking, your computer will need:
- Ruby (2.x should be fine)
- pdftk
To follow along, clone this repo (and cd
into it and run bundle
) and run the included code samples as prompted in the tutorial.
Included in this repo:
sf2809.pdf
- the form we will be fillinggenerate_json_mappings.rb
- generates a JSON mapping fileserver.rb
- a simple REST APIindex.html
andsf2809.js
- a client-side PDF renderer
Fillable PDF forms use the "Acrobot Forms Data Format" (FDF) to serialize form data. While FDF is less pretty than other data exchance formats such as JSON, parsing it is already a solved problem. There are libraries for Ruby and Node, for example, that make it easy to transform FDF to native data structures (e.g. Ruby's Hash or Javascript's Objects) to FDF.
At its core, a software-based PDF filler simply takes data from user input, serializes it to FDF, and applies that FDF to the PDF.
Fortunately, applying FDF to a PDF is also a solved problem thanks to a library called pdftk
. pdftk
is a command-line utility for editing and manipulating PDFs. The two pdftk
commands we care about for PDF filling are:
dump_data_fields
fill_form
To make matters even easier, there is a Ruby gem called pdf-forms
that provides an interface to PDFs with clean, idiomatic Ruby. The equivalent pdf-forms
functions that we care about are:
PdfForms#get_field_names
PdfForms#fill_form
The goal of this step is to become familiar with pdftk
and pdf-forms
and their output formats.
Run pdftk sf2809.pdf dump_data_fields
. You should see a lot of entries that look like:
FieldType: Button
FieldName: 43. Med B
FieldNameAlt: Part, A,. Enrollee and Family Member Information. Number 43. This is check box two of three. Press the space bar to select this box if the third family member is covered by Medicare Part B.
FieldFlags: 0
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
Notice the information we get about each field in the form. Important to us are its type, name, alt text, and state options.
Writing a parser for this text would be a bit of a pain and thankfully pdf-forms
does this for you. To get the field names in pdf-forms
:
require 'pdf-forms'
require 'cliver'
# Cliver makes it easy to find the path to command line utilities
pdftk = PdfForms.new(Cliver.detect('pdftk'))
pdftk.get_fields('sf2809.pdf')
#=>
# ...
# @flags="8388608",
# @justification="Center",
# @max_length="10",
# @name="H. event date 1",
# @name_alt=
# "Part H. Signature. Number 2. Enter the date you signed the form. Enter a two digit month
# and day and a four digit year.",
# @type="Text">
# ...
Now that we can access the form fields as Ruby object, it's time to serialize the fields as a JSON file.
The goal of this step is to create a data dictionary that maps human-friendly field names with machine-friendly field names.
FDF fields are typically written by humans when using something like Adobe Acrobat. While these names may or may not be user-friendly, they are almost certainly not machine-friendly. For example, the SF 2809 has an FDF field called 47. email address
. Imagine that as a JSON key and it becomes clear that it will be useful to convert these field names to be more machine-friendly. 47. email address
could be converted to "47_email_address"
, "email"
, or "email_47"
, for example.
One could write a single function that takes human-friendly field names and transforms them. A benefit is that the process is quick and automated. However, because the original field names were likely written by humans, they also likely lack the consistency to be addressable by humans. For this reason, it might be worth spending some time manually creating the machine-friendly names.
For this tutorial, to save time, we're going to use the following function to transform the field names:
# downcase it
# remove spaces
# remove periods
field.name.downcase.gsub(' ', '').gsub('.', '')
# ...
The file generate_mappings_json.rb
creates a pretty-formatted JSON file with all the right field names on which you can add machine-friendly field names by hand-editing the file.
require 'bundler/setup'
require 'pdf_forms'
require 'cliver'
require 'json'
pdftk = PdfForms.new(Cliver.detect('pdftk'))
fields = pdftk.get_fields('sf2809.pdf').map do |field|
result = {}
result[:pdf_name] = field.name
result[:api_name] = field.name.downcase.gsub(' ', '').gsub('.', '')
result[:type] = field.type.downcase
result[:alt_text] = field.name_alt
if field.respond_to?(:options) && !field.options.nil?
result[:options] = field.options
end
result
end
File.write('sf2809_mappings.json', JSON.pretty_generate(fields))
All this code does is map an array of Ruby objects as hashes, converts it all to JSON, and writes the JSON to a file.
To generate the JSON file, run ruby generate_mappings_json.rb
.
In a text editor, open sf2809_mappings.json
. It should contain a lot of these:
// ...
{
"pdf_name": "1. Name",
"api_name": "1name",
"type": "text",
"alt_text": "Part, A,. Enrollee and Family Member Information. Number 1. Enter enrollee's last name, first name, and middle initial."
},
{
"pdf_name": "2. SS",
"api_name": "2ss",
"type": "text",
"alt_text": "Part, A,. Enrollee and Family Member Information. Number 2. Enter enrollee's social security number."
},
// ...
The goal of this step is to write a function that fills the PDF programatically and to then wrap that function inside of a Ruby class.
With the JSON mappings file, we can fill the PDF fairly easily using PdfForms#fill_form
. The usage for that function is:
pdftk = PdfForms.new(Cliver.detect('pdftk'))
pdftk.fill_form '/path/to/form.pdf', 'myform.pdf', foo: 'bar'
Adapted for the SF 2809, the usage would be:
fill_values = {}
fill_values['6address1'] = '1800 F Street NW'
pdftk.fill_form 'sf2809.pdf', 'sf2809-filled.pdf', fill_values
Wrapping this into a simple Ruby class (in form_filler.rb
):
require 'bundler/setup'
require 'pdf-forms'
require 'cliver'
require 'json'
class SF2809
attr_accessor :fill_values
def initialize(fill_values: {})
@mappings = JSON.parse(File.read('sf2809_mappings.json'))
@pdftk = PdfForms.new(Cliver.detect('pdftk'))
@fill_values = fill_values
end
def save(path: nil)
full_path = ''
if path.nil?
full_path = 'sf2809-filled.pdf'
else
full_path = path
end
@pdftk.fill_form(
'sf2809.pdf',
full_path,
convert_fill_values(@fill_values)
)
end
private
def convert_fill_values(fill_values)
converted_field_names = {}
fill_values.each_pair do |api_name, value|
converted_field_names[convert_field_name(api_name)] = value
end
converted_field_names
end
def convert_field_name(api_name)
@mappings.select do |field|
field['api_name'] == api_name
end.first['pdf_name']
end
end
The usage of this class is quite simple:
fill_values = {
'6address1' => '1800 F. Street NW'
}
form = SF2809.new(fill_values: fill_values)
path = form.save('tmp')
path #=> tmp/sf2809_1442198514.pdf
The goal of this step is to expose the SF2809
Ruby class to a RESTful API.
At the end of this step, the following cURL command will return a filled PDF:
curl \
-X POST \
-H "Content-Type: application/json" \
-d '{"6address1": "1800 F. Street NW"}' \
http://localhost:4567/sf2809 > sf2809-filled-from-curl.pdf
For this step, we'll use Ruby's Sinatra web framework. This API will be stateless, meaning it won't store any information. It will accept JSON as part of a POST request, and send back the filled PDF file.
POST requests work well because they will help ensure anonymity. If you use SSL in production (which this tutorial assumes), POST data will be encrypted and safe from prying eyes. POST data will also not "leak" the way GET data will.
This is the API from server.rb
require 'bundler/setup'
require 'sinatra'
require 'json'
require 'tempfile'
require_relative 'form_filler.rb'
post '/sf2809' do
begin
json_params = JSON.parse(request.body.read)
form = SF2809.new(fill_values: json_params)
file = form.save('tmp')
bytes = File.read(file)
File.delete(file)
tmpfile = Tempfile.new('response.pdf')
tmpfile.write(bytes)
send_file(tmpfile)
rescue => e
content_type :json
return {
error: e.to_s
}
end
end
The request is wrapped in a begin rescue end
block so that errors can be caught and sent back to the client as JSON.
To start the server, run ruby server.rb
. In another terminal tab, run:
curl \
-X POST \
-H "Content-Type: application/json" \
-d '{"6address1": "1800 F. Street NW"}' \
http://localhost:4567/sf2809 > sf2809-filled-from-curl.pdf
Open sf2809-filled-from-curl.pdf
(and scroll down to the actual form) and notice the address field was filled! Exciting stuff!
The goal of this step is to provide a client-side Javascript interface to the API from the previous step which can render the PDF natively using PDF.js.
The first step is to create an HTML scaffold:
<!DOCTYPE html>
<html>
<head>
<title>SF 2809</title>
<script type="text/javascript" src="pdf.js"></script>
<script type="text/javascript" src="sf2809.js"></script>
</head>
<body>
<div id="main">
<canvas id="1"></canvas>
<canvas id="2"></canvas>
<canvas id="3"></canvas>
<canvas id="4"></canvas>
<canvas id="5"></canvas>
<canvas id="6"></canvas>
<canvas id="7"></canvas>
<canvas id="8"></canvas>
<canvas id="9"></canvas>
<canvas id="10"></canvas>
<canvas id="11"></canvas>
<canvas id="12"></canvas>
<canvas id="13"></canvas>
<canvas id="14"></canvas>
<canvas id="15"></canvas>
</div>
</body>
</html>
PDF.js
will expect a canvas
tag for each page in the PDF. Because SF 2809 always has 15 pages, we can hard-code 15 canvases. We're also loading two scripts, pdf.js
and sf2809.js
. pdf.js
is just copied from https://github.com/mozilla/pdfjs-dist/blob/master/build/pdf.combined.js.
In sf2809.js
, we need to do two things:
- send a POST to the server
- take the POST response and feed it into PDF.js to render it onto the
canvas
tags
We'll be eschewing jQuery in favor of vanilla.js
.
In order to make a POST request, we need to instantiate an XMLHttpRequest
:
var formData = {
"6address1": "1800 F. Street NW"
};
// ...
var xhr = new XMLHttpRequest();
xhr.open('POST', 'http://localhost:4567/sf2809', true);
xhr.setRequestHeader('Content-Type', 'application/json');
xhr.responseType = 'arraybuffer';
xhr.onload = function(e) {
// response is unsigned 8 bit integer
var responseArray = new Uint8Array(this.response);
// rendering goes here
};
xhr.send(JSON.stringify(formData));
With the response, responseArray
, we need to pass it to PDF.js:
var responseArray = new Uint8Array(this.response);
PDFJS.getDocument(responseArray).then(function wub(pdf) {
canvases.forEach(function(myCanvas) {
var pageNumber = parseInt(myCanvas);
pdf.getPage(pageNumber).then(function yoyo(page) {
var scale = 1.5;
var viewport = page.getViewport(scale);
var canvas = document.getElementById(myCanvas);
var context = canvas.getContext('2d');
canvas.height = viewport.height;
canvas.width = viewport.width;
page.render({canvasContext: context, viewport: viewport}).promise.then(function() {
// ...
});
});
});
});
The finished file is in sf2809.js
in this repo.
To run this locally, serve the files using a static file server. Here are two ways to do this (there are many more):
- Ruby:
ruby -rwebrick -e'WEBrick::HTTPServer.new(:Port => 8000, :DocumentRoot => Dir.pwd).start'
- Python 3:
python3 -m http.server
Visit localhost:8000
(scroll down to the form) and notice that the data defined in formData
from sf2809.js
is now in the rendered PDF.
The goal of this step is to create an HTML form which sends an AJAX request to the API using the Javascript from the previous step.
[TODO]
This project is in the worldwide public domain. As stated in CONTRIBUTING:
This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.
All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.