hocr-pdf does not seem to handle accented characters (é, à, ù, ...), probably because of an encoding problem.
When the hOCR source contains "formé":
<span class="ocrx_word" id="segment_66" title="bbox 456 735 591 810; x_confs 0.99999523 0.9999963 0.9947832 0.830801 0.9999902 0.9998524; poly 456 736 456 801 591 809 591 735">formé</span>
After running the following two commands, output.txt contains "formeÄ" instead of "formé":
hocr-pdf . > output.pdf
pdftotext output.pdf -raw output.txt
Is there a way to handle UTF-8 characters and accents with hocr-tools?
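To narrow down where the accent is lost, a small check like the one below may help. It is only a diagnostic sketch, not part of hocr-tools: the helper names `ocr_words` and `words_lost_in_pdf` are hypothetical, and the regular expression assumes simple `ocrx_word` spans like the one quoted above. It compares the accented words in the hOCR markup with the text extracted by pdftotext, after normalizing both to NFC so that a decomposed "e" plus combining accent is not reported as a loss.

```python
import re
import unicodedata

# Hypothetical helpers for diagnosis only; they are not part of hocr-tools.
# The pattern assumes flat ocrx_word spans without nested spans.
WORD_RE = re.compile(
    r'<span[^>]*class=["\']ocrx_word["\'][^>]*>(.*?)</span>', re.S
)


def ocr_words(hocr_markup):
    """Return the text content of every ocrx_word span in the markup."""
    return [re.sub(r"<[^>]+>", "", inner).strip()
            for inner in WORD_RE.findall(hocr_markup)]


def words_lost_in_pdf(hocr_markup, pdf_text):
    """List non-ASCII hOCR words that are absent from the extracted PDF text."""
    normalized_pdf = unicodedata.normalize("NFC", pdf_text)
    lost = []
    for word in ocr_words(hocr_markup):
        nfc = unicodedata.normalize("NFC", word)
        if not nfc.isascii() and nfc not in normalized_pdf:
            lost.append(nfc)
    return lost


sample = ('<span class="ocrx_word" id="segment_66" '
          'title="bbox 456 735 591 810">formé</span>')
print(words_lost_in_pdf(sample, "formeÄ"))
print(words_lost_in_pdf(sample, "formé"))
```

In practice the two arguments would be the contents of the .hocr file and of output.txt, both read with `encoding="utf-8"`. An empty result for a word that still looks wrong in a viewer would suggest the text layer is intact and the problem lies elsewhere.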