README
ashariyar committed Sep 30, 2022
1 parent a8eb54f commit 30d9f0c
Showing 2 changed files with 31 additions and 31 deletions.
4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -1,8 +1,8 @@
# NEXT RELEASE

* Make `README` more PyPi friendly

# 1.5.0
Bunch of small changes to support releasing on pypi:
Bunch of small changes to support releasing on [pypi](https://pypi.org/project/pdfalyzer/)
* Invoke with shell command `pdfalyze` instead of local python file `./pdfalyzer.py` (options are the same)
* Core class renames: `PdfWalker` -> `Pdfalyzer`, `DataStreamHandler` -> `BinaryScanner`
* Permanent env var configuration moved from a file called `.env` to a file called `.pdfalyzer`
58 changes: 29 additions & 29 deletions README.md
@@ -3,7 +3,12 @@ A PDF analysis tool geared towards visualizing the inner tree-like data structur

**PyPi Users:** If you are reading this document [on PyPi](https://pypi.org/project/pdfalyzer/) be aware that it renders a lot better [over on GitHub](https://github.com/michelcrypt4d4mus/pdfalyzer). Lots of pretty pictures, footnotes that work, etc.

[^1]: The official Adobe PDF specification calls this tree the PDF's "logical structure", which is a good example of nomenclature that does not help those who see it understand anything about what is being described. I can forgive them given that they named this thing back in the 80s, though it's a good example of why picking good names for things at the beginning is so important.
#### Quick Start
```sh
pip install pdfalyzer
pdfalyze the_heidiggerian_themes_expressed_in_illmatic.pdf
```
See [Installation](#installation) and [Usage](#usage) for more details.

### What It Do
1. **Generate summary format as well as in depth visualizations of a PDF's tree structure**[^1] with helpful color themes that conceptually link objects of similar type. See [the examples below](#example-output) to get an idea.
@@ -16,27 +21,18 @@ If you're looking for one of these things this may be the tool for you.

An exception will be raised if there's any issue placing a node while parsing or if there are any nodes not reachable from the root of the tree at the end of parsing.

[^2]: All internal PDF objects are guaranteed to exist in the tree except in these situations when warnings will be printed:
`/ObjStm` (object stream) is a collection of objects in a single stream that will be unrolled into its component objects.
`/XRef` Cross-reference stream objects which hold the same references as the `/Trailer` are hacked in as symlinks of the `/Trailer`

### What It Don't Do
This tool is mostly about examining a PDF's logical structure and assisting with the discovery of malicious content. As such it doesn't have much to offer as far as extracting text from PDFs, rendering PDFs[^3], writing new PDFs, or many of the more conventional things one might do with a portable document.

[^3]: Given the nature of the PDFs this tool is meant to scan, anything resembling "rendering" the document is pointedly NOT offered.

### Did The World Really Need Another PDF Tool?
This tool was built to fill a gap in the PDF assessment landscape following my . Didier Stevens's [pdfid.py](https://github.com/DidierStevens/DidierStevensSuite/blob/master/pdfid.py) and [pdf-parser.py](https://github.com/DidierStevens/DidierStevensSuite/blob/master/pdf-parser.py) are still the best game in town when it comes to PDF analysis tools but they lack in the visualization department and also don't give you much to work with as far as giving you a data model you can write your own code around. [Peepdf](https://github.com/jesparza/peepdf) seemed promising but turned out to be in a buggy, out of date, and more or less unfixable state. And neither of them offered much in the way of tooling for embedded font analysis.

All those things being the case led to a situation where I felt the world might be slightly improved if I strung together a couple of more stable/well known/actively maintained open source projects ([AnyTree](https://github.com/c0fec0de/anytree), [PyPDF2](https://github.com/py-pdf/PyPDF2), and [Rich](https://github.com/Textualize/rich)) into this tool.

### OK Let's Do This
See [Installation](#installation) and [Usage](#usage) below (past the enormous images showing example output).



# Example Output
`pdfalyzer` can export visualizations to HTML, ANSI colored text, and SVG images using the file export functionality that comes with [Rich](https://github.com/Textualize/rich). SVGs can be turned into `png` format images with a tool like `inkscape` or `cairosvg` (Inkscape works a lot better in our experience).
The Pdfalyzer can export visualizations to HTML, ANSI colored text, and SVG images using the file export functionality that comes with [Rich](https://github.com/Textualize/rich). SVGs can be turned into `png` format images with a tool like `inkscape` or `cairosvg` (Inkscape works a lot better in our experience).


### Basic Tree View
@@ -48,8 +44,6 @@ The dimmer (as in "harder to see") nodes[^4] marked with `Non Child Reference` g

That's a pretty basic document. If you'd like to see the tree for a more complicated/longer PDF, [here's an example showing the `nmap` cheat sheet](doc/svgs/rendered_images/NMAP_Commands_Cheat_Sheet_and_Tutorial.pdf.tree.svg.png).

[^4]: Technically they are `SymlinkNodes`, a really nice feature of [AnyTree](https://github.com/c0fec0de/anytree).

### Rich Tree View
This image shows a more in-depth view of the PDF tree for the same document shown above. This tree (AKA the "rich" tree) has almost everything: it shows all PDF object properties and all relationships between objects. It even includes sizable previews of any binary data streams embedded or encrypted in the document. Note that the `/OpenAction` is highlighted in bright red, as is the Adobe Type1 font binary (Google's project zero regards any Adobe Type1 font as "mad sus").

@@ -70,21 +64,19 @@ It's actually `PyPDF2` doing the lifting here but we're happy to take the credit

#### Search Encrypted Binary Font Data for #MadSus Content No Malware Scanner Will Catch[^5]

doc/svgs/rendered_images/font_24_binary_scan.png

Things like, say, a hidden binary `/F` (PDF instruction meaning "URL") followed by a `JS` (I'll let you guess what "JS" stands for) and then a binary `»` character (AKA "the character the PDF specification uses to close a section of the PDF's logical structure"). Put all that together and it says that you're looking at a secret JavaScript instruction embedded in the encrypted part of a font binary. A secret instruction that causes the PDF renderer to pop out of its frame prematurely as it renders the font.

![Font with JS](doc/svgs/rendered_images/font29.js.1.png)
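
To make the idea concrete, here is a minimal sketch of that kind of byte-pattern search using only Python's standard `re` module. It is not The Pdfalyzer's own `BinaryScanner` code: the function name `find_suspicious_js`, the sample bytes, and the 32-byte gap limit are all assumptions made up for illustration.

```python
import re

# Hypothetical pattern (an assumption, not the tool's real one): "/F", then "JS",
# then the byte 0xBB ("»" in Latin-1), with up to 32 arbitrary bytes between pieces.
SUSPICIOUS_JS = re.compile(rb"/F.{0,32}?JS.{0,32}?\xbb", re.DOTALL)


def find_suspicious_js(data: bytes) -> list:
    """Return a list of (offset, matched bytes) tuples for each hit in `data`."""
    return [(match.start(), match.group(0)) for match in SUSPICIOUS_JS.finditer(data)]


# Made-up sample bytes standing in for a chunk of font binary.
sample = b"\x00\x01/F\x02\x03JS\x04\xbb\x05"

for offset, matched in find_suspicious_js(sample):
    print(f"possible hidden JS instruction at byte {offset}: {matched!r}")
```

A compiled `bytes` pattern is used because the data is raw binary rather than text, and `re.DOTALL` lets `.` match newline bytes too.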

#### Extract And Decode Binary Patterns
Like, say, bytes between common regular expression markers that you might want to force a decode of in a lot of different encodings.

![Font Scan Regex](doc/svgs/rendered_images/font_34_frontslash_scan.png)

When all is said and done you can see some stats that may help you figure out what the character encoding may or may not be for the bytes matched by those patterns:
![Font Decode Summary](doc/svgs/rendered_images/font29_summary_stats.png)

![Font Decode Summary](doc/svgs/rendered_images/font29_summary_stats.png)
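
For a feel of what "forcing a decode in a lot of different encodings" can look like, here is a rough standard-library sketch. The encoding list, the front-slash marker pattern, and the helper names `attempt_decodes` and `extract_and_decode` are assumptions for illustration only; the tool's real logic is more involved.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed list of encodings to try; the real tool's list may differ.
ENCODINGS = ["utf-8", "latin-1", "utf-16", "cp1252"]

# Hypothetical marker pattern: capture the bytes between two front slashes.
BETWEEN_SLASHES = re.compile(rb"/(.+?)/", re.DOTALL)


def attempt_decodes(chunk: bytes) -> Dict[str, Optional[str]]:
    """Map each encoding to the decoded string, or None if decoding failed."""
    results: Dict[str, Optional[str]] = {}
    for encoding in ENCODINGS:
        try:
            results[encoding] = chunk.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


def extract_and_decode(data: bytes) -> List[Tuple[bytes, Dict[str, Optional[str]]]]:
    """Pair every captured chunk with its decode attempts."""
    return [(m.group(1), attempt_decodes(m.group(1))) for m in BETWEEN_SLASHES.finditer(data)]


# Made-up sample: the bytes for "café" in Latin-1 wrapped in front slashes.
for chunk, decodes in extract_and_decode(b"\x00/caf\xe9/\x00"):
    for encoding, text in decodes.items():
        print(f"{chunk!r} as {encoding}: {text!r}")
```

Encodings that cannot decode the chunk come back as `None`, which is itself a useful hint about what the bytes are not.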

[^5]: At least they weren't catching it as of September 2022.

#### Now There's Even A Fancy Table To Tell You What The `chardet` Library Would Rank As The Most Likely Encoding For A Chunk Of Binary Data
Behold the beauty:
@@ -115,7 +107,7 @@ If you're lazy and don't want to retrieve his tools yourself there's [a simple b
scripts/install_didier_stevens_pdf_tools.sh
```

If there is a discrepancy between the output of his tools and this one you should assume his tool is correct and `pdfalyzer` is wrong until you conclusively prove otherwise.
If there is a discrepancy between the output of his tools and this one you should assume his tool is correct and The Pdfalyzer is wrong until you conclusively prove otherwise.

#### Installing The `t1utils` Font Suite
`t1utils` is a suite of old but battle tested apps for manipulating old Adobe font formats. You don't need it unless you're dealing with an older Type 1 or Type 2 font binary but given that those have been very popular exploit vectors in the past few years it can be extremely helpful. One of the tools in the suite, [`t1disasm`](https://www.lcdf.org/type/t1disasm.1.html), is particularly useful because it decrypts and decompiles Adobe Type 1 font binaries into a more human readable string representation.
@@ -135,23 +127,17 @@ As of right now these are the options:

![argparse_help](doc/screenshots/rich_help/full_text_of_help_orange_group.png)

**There's some further exposition on the particulars of what these options mean in [the sample `.pdfalyzer` file](.pdfalyzer.example).** Even if you don't configure your own `env` you may still glean some insight from reading the descriptions of the various environment variables.

Beyond that there's [a few scripts](scripts/) in the repo that may be of interest.

### Setting Command Line Options Permanently With A `.pdfalyzer` File
If you find yourself specifying the same options over and over you may be able to automate that with a [dotenv](https://pypi.org/project/python-dotenv/) setup. Documentation on the available configuration options lives in [`.pdfalyzer.example`](.pdfalyzer.example) which doubles as a file you can copy into place and edit to your heart's content.

```sh
cp .pdfalyzer.example .pdfalyzer
```

If you find yourself specifying the same options over and over you may be able to automate that with a [dotenv](https://pypi.org/project/python-dotenv/) setup. When you run `pdfalyze` on some PDF the tool will check for a file called `.pdfalyzer` first in the current directory and then in the home directory. If it finds a file in either place it will load options from it. Documentation on the options that can be configured with these files lives in [`.pdfalyzer.example`](.pdfalyzer.example) which doubles as an example file you can copy into place and edit to your needs. Even if you don't configure your own `.pdfalyzer` file you may still glean some insight from reading the descriptions of the various variables in [.pdfalyzer.example](.pdfalyzer.example); there's a little more exposition there than in the output of `pdfalyze -h`.
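
The lookup order described above (current directory first, then home directory) can be sketched roughly like this. It is an illustration only, not the tool's actual code: `find_config_file` and `CONFIG_FILENAME` are hypothetical names, and only the standard library is used.

```python
from pathlib import Path
from typing import Optional

# The file name comes from the README; the constant name is an assumption.
CONFIG_FILENAME = ".pdfalyzer"


def find_config_file(cwd: Optional[Path] = None, home: Optional[Path] = None) -> Optional[Path]:
    """Return the first `.pdfalyzer` file found, checking `cwd` before `home`."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()

    for directory in (cwd, home):
        candidate = directory / CONFIG_FILENAME
        if candidate.is_file():
            return candidate

    # Neither location has a config file.
    return None


print(find_config_file())
```

With [python-dotenv](https://pypi.org/project/python-dotenv/) installed, a path found this way could then be passed along as `load_dotenv(dotenv_path=...)` to load the options it contains.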

### As A Code Library
The `Pdfalyzer` class is the core of the operation as it holds both the PDF's logical tree as well as a couple of other data structures that have been pre-processed to make them easier to work with. Chief among these is the `FontInfo` class which pulls together various properties of a font strewn across 3 or 4 different PDF objects.
At its core The Pdfalyzer is taking PDF internal objects gathered by [PyPDF2](https://github.com/py-pdf/PyPDF2) and wrapping them in [AnyTree](https://github.com/c0fec0de/anytree)'s `NodeMixin` class. Given that things like searching the tree or accessing internal properties will be done through those packages' code it may be quite helpful to review their documentation.

As far as The Pdfalyzer's unique functionality goes, `Pdfalyzer` is the class at the heart of the operation. It holds both the PDF's logical tree as well as a couple of other data structures that have been pre-processed to make them easier to work with. Chief among these are the `FontInfo` class, which pulls together various properties of a font strewn across 3 or 4 different PDF objects, and the `BinaryScanner` class, which lets you dig through the raw bytes looking for suspicious patterns.

Here's how to get at these objects:
Here's a short intro to how to access these objects:

```python
from pdfalyzer.pdfalyzer import Pdfalyzer
@@ -196,7 +182,6 @@ for regex_match in binary_scanner.extract_regex_capture_bytes(re.compile(b'\xcc(
process_(regex_match)
```

The representation of the PDF objects (e.g. `pdf_object` in the example above) is handled by [PyPDF2](https://github.com/py-pdf/PyPDF2) so for more details on what's going on there check out its documentation.

### Troubleshooting
This tool is by no means complete. It was built to handle a specific use case which encompassed a small fraction of the many and varied types of information that can show up in a PDF. While it has been tested on a decent number of large and very complicated PDFs (500-5,000 page manuals from Adobe itself) I'm sure there are a whole bunch of edge cases that will trip up the code.
@@ -272,3 +257,18 @@ Run all tests by typing `pytest`. Test coverage is relatively spartan but should
* highlight decodes done at `chardet`s behest
* Highlight decodes with a lot of Javascript keywords
* deal with repetitive matches



[^1]: The official Adobe PDF specification calls this tree the PDF's "logical structure", which is a good example of nomenclature that does not help those who see it understand anything about what is being described. I can forgive them given that they named this thing back in the 80s, though it's a good example of why picking good names for things at the beginning is so important.

[^2]: All internal PDF objects are guaranteed to exist in the tree except in these situations when warnings will be printed:
`/ObjStm` (object stream) is a collection of objects in a single stream that will be unrolled into its component objects.
`/XRef` Cross-reference stream objects which hold the same references as the `/Trailer` are hacked in as symlinks of the `/Trailer`

[^3]: Given the nature of the PDFs this tool is meant to scan, anything resembling "rendering" the document is pointedly NOT offered.

[^4]: Technically they are `SymlinkNodes`, a really nice feature of [AnyTree](https://github.com/c0fec0de/anytree).

[^5]: At least they weren't catching it as of September 2022.
