Replies: 1 comment
Below are a few approaches you might consider for converting XHTML to PDF, and ultimately to Markdown, based on what has worked for others. Since you mentioned you already tested pandoc, wkhtmltopdf, WeasyPrint, etc., I’ll suggest a handful of additional ideas and typical workarounds that people often use.

1. Commercial Tools (Prince, Antenna House)
If image and layout fidelity is crucial, one of the more reliable XHTML-to-PDF workflows involves commercial engines like Prince XML or Antenna House Formatter. Both do an excellent job handling complex CSS layouts, and they officially support XHTML as input. They aren’t free, but if you’re dealing with complex documents and want high fidelity, they’re worth considering.

2. xhtml2pdf (Python-based “pisa” library)
xhtml2pdf is a Python library that converts (X)HTML/CSS to PDF using the ReportLab toolkit. It’s somewhat older and can be finicky with more advanced CSS, but you might have better luck if your XHTML is simple or if you can tweak its styling. Because you’ve already tested many libraries, this may or may not be an improvement, but it’s specifically designed to accept valid XHTML.

3. DOMPDF (PHP-based)
If you’re comfortable with PHP, DOMPDF is another engine that supports HTML/CSS to PDF conversions and can work with valid XHTML. Like xhtml2pdf, it might require simplifying or modifying your markup and styles, but some people get great results with it.

4. Shortest Path to XHTML → Markdown
Ultimately, you mention you want XHTML → Markdown. While there’s no single turnkey solution that says “XHTML → Markdown” the way you might hope, remember that XHTML is just well-formed HTML. So you can often treat your XHTML as ordinary HTML and feed it directly to an HTML-to-Markdown converter.
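To illustrate the direct XHTML → Markdown route, here is a minimal sketch that uses only the Python standard library. It relies on the fact that XHTML is well-formed XML, so `xml.etree.ElementTree` can parse it. The function names (`local`, `inline`, `blocks`, `xhtml_to_markdown`) and the sample document are hypothetical, and only a small subset of elements is handled; a real document would likely need tables, images with captions, nested lists, and named entities such as `&nbsp;` (which this parser does not define on its own).

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def local(tag):
    """Strip the XHTML namespace so '{...}p' becomes 'p'."""
    return tag[len(XHTML_NS):] if tag.startswith(XHTML_NS) else tag


def inline(el):
    """Render an element's text and inline children as one Markdown string."""
    parts = [el.text or ""]
    for child in el:
        name = local(child.tag)
        inner = inline(child)
        if name in ("strong", "b"):
            parts.append(f"**{inner}**")
        elif name in ("em", "i"):
            parts.append(f"*{inner}*")
        elif name == "code":
            parts.append(f"`{inner}`")
        elif name == "a":
            parts.append(f"[{inner}]({child.get('href', '')})")
        elif name == "img":
            parts.append(f"![{child.get('alt', '')}]({child.get('src', '')})")
        else:
            parts.append(inner)  # unknown inline element: keep its text only
        parts.append(child.tail or "")
    # Collapse runs of whitespace, as a browser would when rendering.
    return " ".join("".join(parts).split())


def blocks(el, out):
    """Walk block-level elements depth-first, appending Markdown blocks to out."""
    for child in el:
        name = local(child.tag)
        if name in ("h1", "h2", "h3", "h4", "h5", "h6"):
            out.append("#" * int(name[1]) + " " + inline(child))
        elif name == "p":
            out.append(inline(child))
        elif name in ("ul", "ol"):
            items = [c for c in child if local(c.tag) == "li"]
            lines = [
                (f"{i}. " if name == "ol" else "- ") + inline(li)
                for i, li in enumerate(items, 1)
            ]
            out.append("\n".join(lines))
        else:
            blocks(child, out)  # containers such as div or section: recurse


def xhtml_to_markdown(source):
    """Convert an XHTML string to Markdown (simplified subset only)."""
    root = ET.fromstring(source)
    body = root.find(f"{XHTML_NS}body")
    out = []
    blocks(root if body is None else body, out)
    return "\n\n".join(out) + "\n"


sample = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h1>Title</h1>
<p>Some <strong>bold</strong> text with a <a href="https://example.com">link</a>.</p>
<ul><li>one</li><li>two</li></ul>
</body></html>"""

print(xhtml_to_markdown(sample))
```

For the sample above this prints a `# Title` heading, one paragraph with bold text and a link, and a two-item bullet list. Text placed directly inside a container such as `div` is dropped by this sketch, so treat it as a starting point rather than a complete converter.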
This approach sidesteps ever having to make perfect PDFs from XHTML if your real end goal is to get a Markdown version of the content.

5. Headless Chromium Print (as a Programmatic Workflow)
Since you noted that printing from Chromium yields the best PDF layout but is too slow when done by hand, consider automating it with a headless browser approach (like Puppeteer in Node.js or Playwright). While this still uses Chromium behind the scenes, it can be fully scripted, which might mitigate the “slow” factor: once set up, you let your script run in the background. For example, in Puppeteer:
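The same batch idea can also be sketched without Node.js. The following is a minimal Python sketch, not a Puppeteer script: it shells out to Chromium's command-line headless mode (`--headless --print-to-pdf=...`) once per file. The binary name `chromium`, the helper names `build_print_command` and `convert_all`, and the `*.xhtml` file pattern are assumptions; on some systems the binary is called `chromium-browser` or `google-chrome` instead.

```python
import subprocess
from pathlib import Path

# Assumption: the Chromium binary is on PATH under this name.
BROWSER = "chromium"


def build_print_command(src: Path, dst: Path, browser: str = BROWSER) -> list[str]:
    """Build the command line that prints one local file to PDF."""
    return [
        browser,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={dst}",
        src.resolve().as_uri(),  # Chromium expects a URL, so use file://...
    ]


def convert_all(src_dir: Path, out_dir: Path) -> list[Path]:
    """Print every *.xhtml file in src_dir to a PDF of the same name in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.xhtml")):
        dst = out_dir / (src.stem + ".pdf")
        # check=True raises CalledProcessError if the browser exits with an error.
        subprocess.run(build_print_command(src, dst), check=True)
        written.append(dst)
    return written
```

A call such as `convert_all(Path("docs"), Path("pdf"))` would then process a whole folder unattended. Each call starts a fresh browser process, so for large batches a long-lived Puppeteer or Playwright session that reuses one browser instance will likely be faster.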
Even though it’s effectively the same mechanism as manually printing from Chrome, once automated, it can handle large batches and might be acceptable performance-wise.

Summary
• If your key need is a perfect visual match in PDF, consider either (1) commercial engines like Prince or Antenna House, or (2) scripting a headless Chromium-based solution (e.g., Puppeteer, Playwright).

Unfortunately, there’s no single tool that “just works” with XHTML in every scenario. But hopefully, one of the above approaches will suit your specific workflow better than the ones you’ve already tried. Good luck!
Hey team, does anybody know a good utility for saving XHTML to PDF?
I tested many libraries (pandoc, wkhtmltopdf, WeasyPrint, and others), but none gave a good conversion. The only good conversion from XHTML to PDF comes from printing to PDF in Chromium, which is a very slow operation.
example doc
The ultimate goal is XHTML -> Markdown. However, none of the major conversion libraries supports XHTML.
cc @LDOUBLEV