Existing way of dumping bounds of line segmentation? #266

freen · 2017-11-28T21:37:49Z

I wanted to ask, before getting too creative, if there might be an existing, straight-forward way of extracting not simply images of segmented page lines, but also the coordinates / bounding boxes of those images, via ocropus-gpageseg or a related tool?

The lines are apparently held in memory at the following line: https://github.com/tmbdev/ocropy/blob/61562ce92818cecf6764c57d61e719cd2469a136/ocropus-gpageseg#L426

Does ocropus-gpageseg already drop the x / y / width / height of these coordinates / bounding boxes somewhere? Does an interface already exist to retrieve those values, in addition to the resulting segmented images?

Thanks and best regards,
freen

freen · 2017-11-28T22:00:38Z

Something which seems to come close, unless I'm mistaken, is ocropus-hocr, which in the following article is used to pair extracted text with the bounding box of the text's originating cropped image segment: http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html

In my case, I'm merely looking for the bounding boxes, from within the original image, of the cropped image segments, and not at this time interested in the associated, recognized text.

zuphilip · 2017-11-30T06:20:53Z

Yes, ocropus-hocr outputs an hOCR file which also contains the bounding box information. Alternatively, you can look at the pseg file, in which this information should also be encoded. But I guess it is easier to work with the hOCR format.
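As a rough illustration of reading the boxes back out of an hOCR file, here is a minimal sketch. It assumes each line is an element with class `ocr_line` whose `title` attribute carries a `bbox x0 y0 x1 y1` property, as the hOCR format describes. The helper name `line_bboxes` and the sample markup are made up for this example and are not taken from ocropy; a real HTML parser would be more robust than these regular expressions.

```python
import re

# Matches the bbox property inside an hOCR title attribute,
# e.g. title='bbox 10 20 300 60' (x0 y0 x1 y1 in page pixels).
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

# Matches an opening tag whose class attribute is exactly ocr_line.
# Assumption: class and title sit on the same tag.
LINE_TAG_RE = re.compile(r"<[^>]*class=['\"]ocr_line['\"][^>]*>")


def line_bboxes(hocr_text):
    """Return (x0, y0, x1, y1) tuples for each ocr_line tag, in document order."""
    boxes = []
    for tag in LINE_TAG_RE.finditer(hocr_text):
        match = BBOX_RE.search(tag.group(0))
        if match:
            boxes.append(tuple(int(v) for v in match.groups()))
    return boxes


# Hypothetical hOCR fragment, for illustration only.
sample = """<div class='ocr_page' title='file page.png'>
<span class='ocr_line' title='bbox 10 20 300 60'>first line</span>
<span class='ocr_line' title='bbox 10 70 280 110'>second line</span>
</div>"""

print(line_bboxes(sample))
```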

freen · 2017-11-30T06:34:44Z

Thank you, @zuphilip !

I noticed that, besides the OCR text output, there is no explicit ID in the hocr format which correlates the bounding information to the corresponding gpageseg-segmented file.

What is the best way of correlating the bounding boxes in the hocr file to the corresponding segmented image?

Is it 100% reliable to infer that they share the same order? E.g. the first set of "bounds" in the hocr file will always correspond to the first segmented image piece (alphabetically ordered by filename) generated by ocropus-gpageseg, the last to the last, and so on?

mittagessen · 2017-12-24T09:08:20Z

Yes they do. They are written in reading order with shared identifiers for the line images and pseg file which are used to build the final hOCR. If you just want the segmentation you can run something like:

    t = [lines[i].bounds for i in lsort]
    t = [(s2.start, s1.start, s2.stop, s1.stop) for s1, s2 in t]

after the topsort in ocropus-gpageseg. kraken also writes an explicit segmentation json but the output will be (slightly) different from ocropy's segmenter.
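To make the coordinate order in that snippet explicit, here is a small standalone sketch. It assumes each `bounds` value is a `(row_slice, column_slice)` pair, which is what the unpacking into `s1, s2` above suggests. The helper name `bounds_to_bbox` and the example values are invented for illustration, and whether the coordinates refer to the original image or a rescaled copy would need to be checked against the script itself.

```python
import json


def bounds_to_bbox(bounds):
    """Convert a (row_slice, col_slice) pair to an (x0, y0, x1, y1) tuple."""
    rows, cols = bounds
    # x comes from the column slice, y from the row slice.
    return (cols.start, rows.start, cols.stop, rows.stop)


# Made-up bounds standing in for lines[i].bounds, already in reading order.
example_bounds = [
    (slice(20, 60), slice(10, 300)),
    (slice(70, 110), slice(10, 280)),
]

bboxes = [bounds_to_bbox(b) for b in example_bounds]

# Serialize the boxes so they can be stored next to the line images.
print(json.dumps(bboxes))
```

Because the list keeps the reading order, the n-th box should correspond to the n-th line image when both are produced in the same run.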

zuphilip · 2017-12-25T22:55:31Z

@freen There is a recent PR here which outputs the bbox information of each line in a JSON file: #283

Moreover, I would also like to add some IDs to the hOCR output, see #214.

zuphilip added the ❔ question label Nov 30, 2017
