tesseract-recognize creates negative word coordinates #153

Closed
stweil opened this issue Sep 14, 2020 · 8 comments

Comments

@stweil
Contributor

stweil commented Sep 14, 2020

In a workflow with PPN1024726142, tesseract-recognize created a negative coordinate for page 11:

    <pc:TextLine id="region0008_line0000">
        <pc:Coords points="566,1558 0,1539 0,1565 564,1583"/>
        <pc:Word id="region0008_line0000_word0000">
            <pc:Coords points="0,1546 566,1565 565,1584 -1,1565"/>

ocrd-transform fails to process that page without an error message when converting from PAGE to ALTO.
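
For reference, a minimal sketch of how such out-of-range points could be spotted in a PAGE `points` attribute. The helper names `parse_points` and `negative_points` are hypothetical and not part of any OCR-D or Tesseract API; the input string is the word polygon quoted above.

```python
def parse_points(points):
    """Parse a PAGE points string like "0,1546 566,1565" into (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def negative_points(points):
    """Return all (x, y) pairs that have a negative x or y coordinate."""
    return [(x, y) for x, y in parse_points(points) if x < 0 or y < 0]


word_points = "0,1546 566,1565 565,1584 -1,1565"
print(negative_points(word_points))  # prints [(-1, 1565)]
```

Running the same check on the parent line's polygon returns an empty list, so only the word polygon is affected here.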

@bertsky
Collaborator

bertsky commented Sep 14, 2020

> tesseract-recognize created a negative coordinate for page 11:

Yes, we should use the polygon_for_parent mechanism for word and glyph segmentation, too.
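
To illustrate the general idea of constraining a child polygon to its parent, here is a much simplified stand-in. It is not the actual polygon_for_parent implementation, which is not reproduced here; it only clamps each child point to the parent's bounding box, and the function names are hypothetical.

```python
def parse_points(points):
    """Parse a PAGE points string like "0,1546 566,1565" into (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def clamp_to_parent_bbox(child_points, parent_points):
    """Clamp every child point to the bounding box of the parent polygon.

    Assumption: a bounding-box clamp is good enough for illustration;
    a real implementation would intersect the two polygons instead.
    """
    parent = parse_points(parent_points)
    xs = [x for x, _ in parent]
    ys = [y for _, y in parent]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    clamped = [
        (min(max(x, x_min), x_max), min(max(y, y_min), y_max))
        for x, y in parse_points(child_points)
    ]
    return " ".join(f"{x},{y}" for x, y in clamped)


line = "566,1558 0,1539 0,1565 564,1583"
word = "0,1546 566,1565 565,1584 -1,1565"
print(clamp_to_parent_bbox(word, line))  # prints 0,1546 566,1565 565,1583 0,1565
```

With the line and word polygons from the report, the negative x coordinate is pulled back to 0 and the y coordinate that overshoots the line is pulled back to 1583.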

> ocrd-transform fails to process that page without an error message when converting from PAGE to ALTO.

Cannot speak to that.

@stweil
Contributor Author

stweil commented Sep 14, 2020

The negative x coordinate causes an exception in PageConverter:

    $ ocr-transform page alto bad.xml /tmp/bad.xml
    Exception in thread "main" java.lang.NullPointerException
            at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:389)
            at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:216)
            at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)

@bertsky
Collaborator

bertsky commented Sep 14, 2020

> > tesseract-recognize created a negative coordinate for page 11:
>
> Yes, we should use the polygon_for_parent mechanism for word and glyph segmentation, too.

Should be covered by #152 now, too (sorry for the inconvenience). I'll merge as soon as you give approval (again)...

@stweil
Contributor Author

stweil commented Sep 14, 2020

It now no longer creates negative word coordinates. I still get the java.lang.NullPointerException from PageConverter, but now for page 43 (which has no text region and no word coordinates), so that seems to be unrelated.

@stweil
Contributor Author

stweil commented Sep 15, 2020

@bertsky, @kba, the remaining problem with PageConverter occurs when converting pages without any text from PAGE to ALTO. Currently the PAGE XML of an empty page contains an entry <pc:ReadingOrder/>. This empty reading order obviously confuses PageConverter. I don't know whether an empty reading order is valid PAGE XML. Maybe it would be best to simply not write such an entry. See also PRImA-Research-Lab/prima-page-converter#15.

@bertsky
Collaborator

bertsky commented Sep 15, 2020

> the remaining problem with PageConverter occurs when converting pages without any text from PAGE to ALTO. Currently the PAGE XML of an empty page contains an entry <pc:ReadingOrder/>. This empty reading order obviously confuses PageConverter. I don't know whether an empty reading order is valid PAGE XML. Maybe it would be best to simply not write such an entry.

It's invalid according to the schema, which is why the converter fails. We have already been trying to avoid these circumstances, but in the case of ocrd-tesserocr-segment-region we forgot to remove empty ReadingOrder elements.

@kba do you think we could easily solve that for all processors at once with a fix during serialization in core? The alternative would be adding the following to all segmentation processors (right before to_xml):

    # An empty ReadingOrder (no ordered or unordered group) is schema-invalid, so drop it.
    ro = pcgts.get_Page().get_ReadingOrder()
    if ro and not ro.get_OrderedGroup() and not ro.get_UnorderedGroup():
        pcgts.get_Page().set_ReadingOrder(None)
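
As a library-independent illustration of the same idea, here is a sketch that uses only `xml.etree.ElementTree` to drop childless ReadingOrder elements from a parsed document. The function names are hypothetical, the namespace URI in the sample is a placeholder rather than the real PAGE namespace, and matching by local name is an assumption made to avoid hard-coding a schema version.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def remove_empty_reading_order(root):
    """Remove ReadingOrder elements without child elements; return how many were removed."""
    removed = 0
    # Iterate over snapshots so that removing children does not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == "ReadingOrder" and len(child) == 0:
                parent.remove(child)
                removed += 1
    return removed


# Hypothetical minimal document for an empty page (placeholder namespace).
sample = (
    '<pc:PcGts xmlns:pc="urn:example:page">'
    '<pc:Page imageFilename="empty.png"><pc:ReadingOrder/></pc:Page>'
    "</pc:PcGts>"
)
root = ET.fromstring(sample)
print(remove_empty_reading_order(root))  # prints 1
```

A ReadingOrder that still contains an ordered or unordered group has child elements and is left untouched.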

@kba
Member

kba commented Sep 15, 2020

> @kba do you think we could easily solve that for all processors at once with a fix during serialization in core?

Yes. OCR-D/core#602

@bertsky
Collaborator

bertsky commented Sep 15, 2020

> > @kba do you think we could easily solve that for all processors at once with a fix during serialization in core?
>
> Yes. OCR-D/core#602

Thanks!

Closing this issue – solved by #152

bertsky closed this as completed Sep 15, 2020