XHTML support, conversion to PDF or Markdown #14548

simjak · 2025-01-15T08:59:08Z

Hey team, does anybody know a good utility for saving XHTML to PDF?
I tested many libraries (pandoc, wkhtmltopdf, WeasyPrint, and others), but none gave a good conversion. The only good XHTML-to-PDF conversion I have found is printing to PDF in Chromium, which is a very slow operation.
example doc
The ultimate goal is XHTML -> Markdown. However, none of the major conversion libraries supports XHTML.

cc @LDOUBLEV

GreatV (Maintainer) · 2025-01-15T11:05:24Z

Below are a few approaches you might consider for converting XHTML to PDF, and ultimately to Markdown, based on what has worked for others. Since you mentioned you already tested pandoc, wkhtmltopdf, WeasyPrint, etc., I’ll suggest a handful of additional ideas and typical “workarounds” that people often use.

1. Commercial Tools (Prince, Antenna House)

If image and layout fidelity is crucial, one of the more reliable XHTML-to-PDF workflows involves commercial engines like Prince XML or Antenna House Formatter. Both do an excellent job handling complex CSS layouts, and they officially support XHTML as input. They aren’t free, but if you’re dealing with complex documents and want high fidelity, they’re worth considering.

2. xhtml2pdf (Python-based “pisa” library)

xhtml2pdf is a Python library that converts (X)HTML/CSS to PDF using the ReportLab toolkit. It’s somewhat older and can be finicky with more advanced CSS, but you might have better luck if your XHTML is simple or if you can tweak its styling. Because you’ve already tested many libraries, this may or may not be an improvement, but it’s specifically designed to accept valid XHTML.

3. DOMPDF (PHP-based)

If you’re comfortable with PHP, DOMPDF is another engine that supports HTML/CSS to PDF conversions and can work with valid XHTML. Again, like xhtml2pdf, it might require simplifying or modifying your markup and styles, but some people get great results with it.

4. Shortest Path to XHTML → Markdown

Ultimately, you mention you want XHTML → Markdown. While there’s no single turnkey solution that says “XHTML → Markdown” the way you might hope, remember that XHTML is essentially HTML written as well-formed XML. So you can often treat your .xhtml file as HTML input (assuming it doesn’t rely on unusual XML-specific features). In practice:

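Convert XHTML to standard HTML 4 or HTML5 (e.g., remove the XML declaration and XHTML DOCTYPE, and expand self-closed non-void elements such as <div/> into <div></div>, since HTML parsers ignore the trailing slash and treat them as opening tags).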
Use a tool like pandoc to convert the resulting HTML to Markdown. For example:
```
pandoc --from=html --to=markdown_strict yourFile.html > output.md
```
If pandoc complains about some tags or fails to parse, try “cleaning” the XHTML first with an HTML tidying tool or by stripping out problematic attributes.

This approach sidesteps ever having to make perfect PDFs from XHTML if your real end goal is to get a Markdown version of the content.
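
As a rough illustration of the two steps above, here is a minimal Python sketch. It assumes the XHTML is well-formed and does not use named entities such as &nbsp; (the standard-library parser rejects those unless they are declared), and that pandoc is on your PATH. The function names xhtml_to_html and html_to_markdown are made up for this example; they are not part of pandoc or any library.

```python
import subprocess
import xml.etree.ElementTree as ET


def xhtml_to_html(xhtml_text: str) -> str:
    """Re-serialize well-formed XHTML as plain HTML markup (hypothetical helper)."""
    root = ET.fromstring(xhtml_text)
    # Strip XML namespaces such as {http://www.w3.org/1999/xhtml} from tag names,
    # otherwise the serializer would emit prefixed tags like <html:div>.
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    # method="html" writes <br> instead of <br/> and <div></div> instead of <div/>.
    return ET.tostring(root, encoding="unicode", method="html")


def html_to_markdown(html_text: str) -> str:
    """Pipe HTML through pandoc on stdin, using the same flags as the command above."""
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=markdown_strict"],
        input=html_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits with an error
    )
    return result.stdout
```

Usage would be something like html_to_markdown(xhtml_to_html(open("yourFile.xhtml", encoding="utf-8").read())). Note that stripping namespaces also flattens any embedded SVG or MathML, so this is only a starting point for documents that are mostly text and tables.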

5. Headless Chromium Print (as a Programmatic Workflow)

Since you noted that printing to PDF from Chromium yields the best layout but is very slow, consider automating it with a headless browser (Puppeteer in Node.js, or Playwright). This still uses Chromium behind the scenes, but it can be fully scripted, which might mitigate the “slow” factor: once set up, you let your script run in the background. For example, in Puppeteer:

Install Puppeteer (Node.js).
Load your XHTML page in headless Chrome (Chromium).
Use the built-in “print to PDF” functionality.
Repeat as many times as needed.

Even though it’s effectively the same mechanism as manually printing from Chrome, once automated, it can handle large batches and might be acceptable performance-wise.
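
If you would rather not pull in Node.js, the same idea can be sketched with Chromium's own command-line flags driven from Python instead of Puppeteer. This is only a sketch under a few assumptions: the browser binary is called chromium on your system (it may be chromium-browser, google-chrome, or a full path), the inputs are *.xhtml files in one directory, and the helper names build_print_command and convert_directory are invented for illustration.

```python
import subprocess
from pathlib import Path


def build_print_command(source: Path, target: Path, browser: str = "chromium") -> list[str]:
    """Build the headless print-to-PDF command line for one file."""
    return [
        browser,  # assumed binary name; adjust for your system
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={target.resolve()}",
        source.resolve().as_uri(),  # file:// URL of the XHTML document
    ]


def convert_directory(input_dir: Path, output_dir: Path, browser: str = "chromium") -> list[Path]:
    """Print every *.xhtml file in input_dir to a PDF in output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(input_dir.glob("*.xhtml")):
        target = output_dir / (source.stem + ".pdf")
        # One browser launch per file; check=True stops on the first failure.
        subprocess.run(build_print_command(source, target, browser), check=True)
        written.append(target)
    return written
```

Launching a fresh browser process per file is the simplest thing that works, but the startup cost is paid every time. Keeping one browser instance open and reusing it across pages, which is what a Puppeteer or Playwright script would typically do, is likely to be faster for large batches.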

Summary

• If your key need is a perfect visual match in PDF, consider either (1) commercial engines like Prince or Antenna House, or (2) scripting a headless Chromium-based solution (e.g., Puppeteer, Playwright).
• If your ultimate goal is Markdown (with only occasional PDFs), then going from XHTML → HTML → Markdown (via pandoc) is often far simpler. You can generate any needed PDFs from the Markdown afterwards (again with pandoc, which needs a separate PDF engine such as LaTeX installed, or with another Markdown-to-PDF tool).

Unfortunately, there’s no single tool that “just works” with XHTML in every scenario. But hopefully, one of the above approaches will suit your specific workflow better than the ones you’ve already tried. Good luck!
