Existing way of dumping bounds of line segmentation? #266

Open
freen opened this issue Nov 28, 2017 · 5 comments

Comments

@freen

freen commented Nov 28, 2017

I wanted to ask, before getting too creative, whether there is an existing, straightforward way of extracting not only the images of segmented page lines, but also the coordinates / bounding boxes of those images, via ocropus-gpageseg or a related tool?

The lines are apparently in memory at the following line: https://github.com/tmbdev/ocropy/blob/61562ce92818cecf6764c57d61e719cd2469a136/ocropus-gpageseg#L426

Does ocropus-gpageseg already drop the x / y / width / height of these coordinates / bounding boxes somewhere? Does an interface already exist to retrieve those values, in addition to the resulting segmented images?

Thanks and best regards,
freen

@freen
Author

freen commented Nov 28, 2017

Something which seems to come close, unless I'm mistaken, is ocropus-hocr, which in the following article is used to pair extracted text with the bounding box of the text's originating cropped image segment: http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html

In my case, I'm merely looking for the bounding boxes, from within the original image, of the cropped image segments, and not at this time interested in the associated, recognized text.

@zuphilip
Collaborator

Yes, ocropus-hocr outputs an hOCR file which also contains the bounding box information. Alternatively, you can look at the pseg file, in which this information should also be encoded. But I guess it is easier to work with the hOCR format.
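
For readers who only need the boxes, here is a minimal sketch of pulling them out of an hOCR file with the Python standard library. It assumes each line is an element with class `ocr_line` whose `title` attribute carries a `bbox x0 y0 x1 y1` property, as the hOCR format describes. The function name `hocr_line_bboxes` and the sample markup are illustrative only, and the coordinate origin should be checked against real ocropus-hocr output before relying on it.

```python
import re

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
# Matches opening tags of elements marked as text lines (class ocr_line).
LINE_TAG_RE = re.compile(r"<[^>]*class=['\"]ocr_line['\"][^>]*>")


def hocr_line_bboxes(hocr_text):
    """Return (x0, y0, x1, y1) tuples for each ocr_line element, in document order."""
    boxes = []
    for tag in LINE_TAG_RE.findall(hocr_text):
        match = BBOX_RE.search(tag)
        if match:
            boxes.append(tuple(int(v) for v in match.groups()))
    return boxes


# Made-up sample markup, not actual ocropus-hocr output.
sample = """
<div class='ocr_page' title='file 0001.bin.png'>
<span class='ocr_line' title='bbox 36 92 580 122'>first line</span>
<span class='ocr_line' title='bbox 36 130 575 160'>second line</span>
</div>
"""
print(hocr_line_bboxes(sample))
```

A regular expression is enough here because only the opening tags are inspected. For anything beyond extracting boxes, a real HTML parser would be the more robust choice.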

@freen
Author

freen commented Nov 30, 2017

Thank you, @zuphilip !

I noticed that, besides the OCR text output, there is no explicit ID in the hocr format which correlates the bounding information to the corresponding gpageseg-segmented file.

What is the best way of correlating the bounding boxes in the hocr file to the corresponding segmented image?

Is it 100% reliable to infer that they share the same order? E.g. the first set of "bounds" in the hocr file will always correspond to the first segmented image piece (alphabetically ordered by filename) generated by ocropus-gpageseg, the last to the last, and so on?

@mittagessen

Yes, they do. They are written in reading order, with shared identifiers for the line images and the pseg file, which are used to build the final hOCR. If you just want the segmentation, you can run something like:

    # bounds is a (row slice, column slice) pair; emit (x0, y0, x1, y1) in reading order
    t = [lines[i].bounds for i in lsort]
    t = [(s2.start, s1.start, s2.stop, s1.stop) for s1, s2 in t]

after the topsort in ocropus-gpageseg. kraken also writes an explicit segmentation JSON file, but the output will be (slightly) different from ocropy's segmenter.
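
To try the two list comprehensions above outside of ocropus-gpageseg, here is a self-contained sketch. It assumes each `bounds` value is a `(row slice, column slice)` pair and that the line images sort alphabetically in reading order. The `Line` namedtuple, the `bounds_to_bboxes` helper, the sample slices and the file names are all hypothetical stand-ins, not part of ocropy.

```python
from collections import namedtuple

# Hypothetical stand-in for the line records built inside ocropus-gpageseg;
# only the `bounds` attribute (row slice, column slice) is modelled here.
Line = namedtuple("Line", ["bounds"])


def bounds_to_bboxes(lines, order):
    """Convert (row_slice, col_slice) bounds to (x0, y0, x1, y1), following `order`."""
    ordered = [lines[i].bounds for i in order]
    return [(cols.start, rows.start, cols.stop, rows.stop) for rows, cols in ordered]


lines = [
    Line((slice(130, 160), slice(36, 575))),  # second line on the page
    Line((slice(92, 122), slice(36, 580))),   # first line on the page
]
lsort = [1, 0]  # assumed reading order, as a topological sort might return it

bboxes = bounds_to_bboxes(lines, lsort)

# Assumed file names: line images that sort alphabetically in reading order.
filenames = sorted(["010002.bin.png", "010001.bin.png"])
for name, box in zip(filenames, bboxes):
    print(name, box)
```

Pairing by position with `zip` only holds if the number of boxes equals the number of line images, so comparing the two lengths first is a cheap sanity check.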

@zuphilip
Collaborator

@freen There is a recent PR that outputs the bounding box information of each line in a JSON file: #283

Moreover, I would also like to add some IDs to the hOCR output, see #214.
