Support for accentued letters #189

pprw · 2024-10-15T14:25:56Z

hocr-pdf does not seem to handle accented characters (éàù...), probably because of an encoding issue.

When the hOCR source contains "formé":

<span class="ocrx_word" id="segment_66" title="bbox 456 735 591 810; x_confs 0.99999523 0.9999963 0.9947832 0.830801 0.9999902 0.9998524; poly 456 736 456 801 591 809 591 735">formé</span>

After running hocr-pdf . > output.pdf and then pdftotext output.pdf -raw output.txt, output.txt contains "formeÄ".

Is there a way to handle UTF-8 characters and accents with hocr-tools?
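
A small diagnostic sketch in Python that may help narrow this down. It rests on an assumption that is not confirmed anywhere in this issue: that the stray character after the "e" comes from the accent being handled as a separate combining code point (decomposed form) rather than as the single precomposed "é". The script only uses the standard-library `unicodedata` module and does not call hocr-tools; the helper name `code_points` is made up for illustration.

```python
import unicodedata

# The word from the hOCR span, written with an explicit escape so the
# source file's own encoding cannot change the result.
word = "form\u00e9"

# NFC keeps "é" as one precomposed code point; NFD splits it into
# "e" followed by the combining acute accent U+0301.
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)


def code_points(text):
    """Return the code points of text as U+XXXX strings (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(len(nfc), code_points(nfc))
print(len(nfd), code_points(nfd))
```

If the text extracted from the PDF shows an "e" followed by a separate character, comparing it against the decomposed form above could indicate whether normalization is involved, or whether the problem lies elsewhere (for example in the font encoding of the generated PDF).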


pprw changed the title ~~Support for accentued letter~~ Support for accentued letters Oct 15, 2024

pprw mentioned this issue Oct 15, 2024

Support for accentued letters stefan6419846/hocr-tools#31

Open
