tesseract-recognize creates negative word coordinates #153

stweil · 2020-09-14T09:50:17Z

In a workflow with PPN1024726142, tesseract-recognize created a negative coordinate for page 11:

         <pc:TextLine id="region0008_line0000">
             <pc:Coords points="566,1558 0,1539 0,1565 564,1583"/>
             <pc:Word id="region0008_line0000_word0000">
                <pc:Coords points="0,1546 566,1565 565,1584 -1,1565"/>

ocrd-transform fails to process that page without an error message when converting from PAGE to ALTO.
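
For reference, a minimal sketch of how one might scan a PAGE file for such coordinates before handing it to a converter. The helpers `parse_points` and `find_negative_points` are hypothetical and not part of any OCR-D tool. The sketch only assumes that coordinates are integer pairs stored in the `points` attribute of `Coords` elements, as in the excerpt above, and the namespace URI in the sample is a placeholder.

```python
import xml.etree.ElementTree as ET


def parse_points(points):
    """Parse a PAGE points string like '0,1546 566,1565' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def find_negative_points(xml_text):
    """Return (parent id, points) for every Coords element with a negative x or y."""
    root = ET.fromstring(xml_text)
    hits = []
    for parent in root.iter():
        for child in parent:
            # Match on the local name so the sketch does not depend on
            # a particular PAGE namespace version.
            if child.tag.rsplit("}", 1)[-1] != "Coords":
                continue
            points = child.get("points", "")
            if any(x < 0 or y < 0 for x, y in parse_points(points)):
                hits.append((parent.get("id"), points))
    return hits


# Simplified sample based on the excerpt above (closing tags added,
# placeholder namespace URI).
SAMPLE = """<pc:TextLine xmlns:pc="urn:example:page" id="region0008_line0000">
  <pc:Coords points="566,1558 0,1539 0,1565 564,1583"/>
  <pc:Word id="region0008_line0000_word0000">
    <pc:Coords points="0,1546 566,1565 565,1584 -1,1565"/>
  </pc:Word>
</pc:TextLine>"""

print(find_negative_points(SAMPLE))
```

On the sample, only the word is reported, because its last point has x = -1.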

bertsky · 2020-09-14T10:09:01Z

> tesseract-recognize created a negative coordinate for page 11:

Yes, we should use the polygon_for_parent mechanism for word and glyph segmentation, too.

> ocrd-transform fails to process that page without an error message when converting from PAGE to ALTO.

Cannot speak to that.

stweil · 2020-09-14T10:12:58Z

The negative x coordinate causes an exception in PageConverter:

    $ ocr-transform page alto bad.xml /tmp/bad.xml
    Exception in thread "main" java.lang.NullPointerException
            at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:389)
            at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:216)
            at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)

bertsky · 2020-09-14T15:24:20Z

> > tesseract-recognize created a negative coordinate for page 11:
>
> Yes, we should use the polygon_for_parent mechanism for word and glyph segmentation, too.

Should be covered by #152 now, too (sorry for inconvenience). I'll merge as soon as you give approval (again)...

stweil · 2020-09-14T19:40:43Z

It now no longer creates negative word coordinates. I still get the java.lang.NullPointerException from PageConverter, but now for page 43 (which has no text region and no word coordinates), so that seems to be unrelated.

stweil · 2020-09-15T07:14:51Z

@bertsky, @kba, the remaining problem with PageConverter occurs when converting pages without any text from PAGE to ALTO. Currently the PAGE XML of an empty page contains an entry <pc:ReadingOrder/>. This empty reading order obviously confuses PageConverter. I don't know whether an empty reading order is valid PAGE XML. Maybe it would be best to simply not write such an entry. See also PRImA-Research-Lab/prima-page-converter#15.

bertsky · 2020-09-15T07:28:46Z

> the remaining problem with PageConverter occurs when converting pages without any text from PAGE to ALTO. Currently the PAGE XML of an empty page contains an entry <pc:ReadingOrder/>. This empty reading order obviously confuses PageConverter. I don't know whether an empty reading order is valid PAGE XML. Maybe it would be best to simply not write such an entry.

It's invalid by the schema, that's why the converter fails. We have already been trying to avoid these circumstances, but in case of ocrd-tesserocr-segment-region we forgot to remove empty ReadingOrder elements.

@kba do you think we could easily solve that for all processors at once with a fix during serialization in core? The alternative would be adding the following to all segmentation processors (right before to_xml):

    ro = pcgts.get_Page().get_ReadingOrder()
    if ro and not ro.get_OrderedGroup() and not ro.get_UnorderedGroup():
        pcgts.get_Page().set_ReadingOrder(None)
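
For files that were already written with an empty reading order, a stand-alone cleanup might look like the sketch below. It uses only the Python standard library rather than the OCR-D API, `strip_empty_reading_order` is a hypothetical helper, and the namespace URI in the sample is a placeholder. It treats a ReadingOrder element with no child elements as empty, which corresponds to the check above (neither an OrderedGroup nor an UnorderedGroup).

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def strip_empty_reading_order(xml_text):
    """Remove ReadingOrder elements that have no child elements.

    Returns the serialized document and the number of removed elements.
    """
    root = ET.fromstring(xml_text)
    removed = 0
    # Iterate over copies so that removing children is safe.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child) == "ReadingOrder" and len(child) == 0:
                parent.remove(child)
                removed += 1
    # Note: ElementTree may rename namespace prefixes (e.g. to 'ns0')
    # when serializing; the namespace URI itself is preserved.
    return ET.tostring(root, encoding="unicode"), removed


# Minimal empty page with a placeholder namespace URI.
SAMPLE = (
    '<pc:PcGts xmlns:pc="urn:example:page">'
    '<pc:Page imageFilename="p.png"><pc:ReadingOrder/></pc:Page>'
    "</pc:PcGts>"
)

cleaned, removed = strip_empty_reading_order(SAMPLE)
print(removed)
```

A reading order that contains a group is left untouched, so the helper can be run over all pages of a workspace without special-casing empty ones.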

kba · 2020-09-15T07:35:12Z

> @kba do you think we could easily solve that for all processors at once with a fix during serialization in core?

Yes. OCR-D/core#602

bertsky · 2020-09-15T07:40:01Z

> > @kba do you think we could easily solve that for all processors at once with a fix during serialization in core?
>
> Yes. OCR-D/core#602

Thanks!

Closing this issue – solved by #152

stweil mentioned this issue Sep 14, 2020

Processor tesserocr-segment-line terminates with exception (TopologyException: Input geom 1 is invalid) #149

Closed

bertsky mentioned this issue Sep 14, 2020

more robust intersection with parent #152

Merged

kba mentioned this issue Sep 15, 2020

Prevent serializing empty reading order OCR-D/core#602

Closed

bertsky closed this as completed Sep 15, 2020
