alto2hocr: Content in BottomMargin is not considered (PrintSpace node is missing in this example) #89
Comments
Can you share a complete ALTO file as an example illustrating the problem? |
@zuphilip Many thanks - here you go. ABBYY FineReader 12. The docker web UI gives: <title>Image: </title> On the command line I get (NB: Running an XSLT 2.0 stylesheet with an XSLT 3.0 processor):
Also: Is there a way to ocr-transform directly from the ABBYY schema (which can be validated with ocr-validate) to hocr, or must one start from ABBYY's alto output? Final question: Can hocr be successfully converted back to alto, and if so, which version? Thanks so much! Here is the ABBYY-generated alto:
|
Okay, the problem with this ALTO file is that it uses ...
-<BottomMargin HEIGHT="1051" WIDTH="750" VPOS="0" HPOS="0">
+<PrintSpace HEIGHT="1051" WIDTH="750" VPOS="0" HPOS="0">
...
-</BottomMargin>
+</PrintSpace>
... However, if I understand this correctly, then ABBYY will always output this as
Yes, there is also an abbyy2hocr transformation, which I tried to integrate in PR #92. Let me know if this works for you.
Yes, there are transformations from hocr to alto2.0/alto2.1. However, a transformation can always lose some information, so you need to try it on your own examples to see whether it already works as you expect. |
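The manual rename shown in the diff above (BottomMargin to PrintSpace) could be scripted as a pre-processing step before running the transformation. The following is only a minimal sketch, not part of ocr-transform: the function names are hypothetical, and it assumes that treating the margin element as the print space is acceptable for your files, which loses the distinction between the two.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so the code works for any ALTO version."""
    return tag.rsplit("}", 1)[-1]


def rename_margin_to_printspace(root, margin="BottomMargin"):
    """Rename `margin` children of each Page to PrintSpace.

    Hypothetical helper: pages that already contain a PrintSpace are left
    untouched. Returns the number of elements that were renamed.
    """
    renamed = 0
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        if any(local_name(child.tag) == "PrintSpace" for child in page):
            continue  # nothing to work around on this page
        for child in page:
            if local_name(child.tag) == margin:
                # Keep the original namespace prefix, swap only the local name.
                ns_prefix = child.tag[: len(child.tag) - len(margin)]
                child.tag = ns_prefix + "PrintSpace"
                renamed += 1
    return renamed
```

To use it, parse the file with `ET.parse("in.alto")`, call the function on `tree.getroot()`, and write the result with `tree.write(...)` before passing it to the transformation. Note that ElementTree may rewrite namespace prefixes on output, so check the result before relying on it.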
It seems the TopMargin, LeftMargin, RightMargin and BottomMargin sub-elements under the Page element are NOT covered by the transformation, only Page/PrintSpace. I can add these; it should be simple, i.e. for BottomMargin:
<xsl:template match="Page">
<xsl:variable name="fname"><xsl:value-of select="//alto/Description/sourceImageInformation/fileName"/></xsl:variable>
...
+<xsl:apply-templates select="BottomMargin"/>
<xsl:apply-templates select="PrintSpace"/>
...
</xsl:template>
and then add:
+<xsl:template match="BottomMargin">
+<xsl:apply-templates select="ComposedBlock"/>
+<xsl:apply-templates select="TextBlock"/>
+</xsl:template>
@jtlz2 Would you be so kind as to try this with your ALTO file? |
Well, I have tried it, and it works for the BottomMargin. I think TopMargin and BottomMargin might work if converted into an ocr_page DIV, but the LeftMargin and RightMargin should go somewhere else (?), otherwise they would break the page layout. Should they be considered float elements?
|
I suggest to map [...] There are also float elements in LaTeX, where they are usually used for images or tables that might have to float to the next page when there is not enough space left. The margins on the left or right sound different to me. Anyway, they will be packed in a separate [...]
I think you don't need to copy the same template for different names; you can expand it like this:
<xsl:template match="PrintSpace|BottomMargin|TopMargin|LeftMargin|RightMargin">
<xsl:apply-templates select="ComposedBlock"/>
<xsl:apply-templates select="TextBlock"/>
</xsl:template> |
@zuphilip I am fine with the following: <xsl:template match="PrintSpace|BottomMargin|TopMargin"> LeftMargin|RightMargin will need some testing. Are there any real example files available? |
There is a new DEV version of alto__hocr.xsl which supports TopMargin & BottomMargin: https://github.com/filak/hOCR-to-ALTO/blob/master/dev/alto__hocr.xsl If it works fine I will update the production file. |
I am not aware of such examples, but maybe @cneud can help with this? Thank you very much for implementing the Top and Bottom margin, @filak! I tested the new DEV version with the example here and it works. I will try to test it a little further and let you know of any issues I encounter. |
(ht @bertsky in gitter) |
@kba I do not see any content in the margin elements, so the transformation will produce no output for them. I think the Top and Bottom margins have been fixed now. What should we do with the Left and Right margins? There are no corresponding float elements specified in the hOCR spec (like ocr_header and ocr_footer). If there are no real-life examples with Left/Right margins, I suggest closing this issue and creating another one at https://github.com/filak/hOCR-to-ALTO if it pops up someday. We can then discuss how to implement it. |
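Whether a given ALTO file actually carries content in its margin elements can be checked before deciding how to map them. The sketch below is an illustration only: the function name is hypothetical, and it assumes that counting TextBlock descendants is a reasonable proxy for "has content". It reports, for each direct child of Page, how many TextBlock elements it contains.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix so the code works for any ALTO version."""
    return tag.rsplit("}", 1)[-1]


def text_blocks_per_region(root):
    """Map each direct child of Page (PrintSpace, TopMargin, ...) to the
    number of TextBlock elements found anywhere beneath it.

    Hypothetical helper; regions with no TextBlock are reported with 0.
    """
    counts = Counter()
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        for region in page:
            found = sum(
                1 for el in region.iter() if local_name(el.tag) == "TextBlock"
            )
            counts[local_name(region.tag)] += found
    return dict(counts)
```

Calling `text_blocks_per_region(ET.parse("in.alto").getroot())` on a collection of files would show whether LeftMargin or RightMargin ever contain text in practice, which is the open question above.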
There is a new DEV version of hocr__alto4.xsl for testing which supports TopMargin & BottomMargin: https://github.com/filak/hOCR-to-ALTO/blob/master/dev/hocr__alto4.xsl If it works fine I will update all the production files. |
LGTM |
I have updated the master a while ago, just forgot to let you know... |
cf #95
I am targeting hocr and trying to get there from ABBYY's latest form of alto. The header for the latter is
But when I run
ocr-transform alto2.0 hocr in.alto out.hocr
I only get a header and no content:
@zuphilip Any ideas on how to proceed?
Thanks!