Use "block" instead of "region" throughout #135

Closed
kba opened this issue Dec 19, 2019 · 5 comments

@kba
Member

kba commented Dec 19, 2019

ALTO tries to be interoperable with IIIF as discussed here. There is a "Text Granularity Extension" for IIIF that defines what we call "levels":

  • page: A page in a paginated document
  • block: An arbitrary region of text
  • paragraph: A paragraph
  • line: A topographic line
  • word: A single word
  • glyph: A single glyph or symbol

Seems reasonably compatible with our definitions, though we call line TextLine and have no distinct notion of a paragraph.

My point is: we do use region instead of block in a few places, such as in the names of some executables (ocrd-*-region). If we decide on a common parameter for level, that would be a good moment to make sure we're consistent.
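The correspondence described above can be written down as a small lookup table. This is only a sketch: the level names come from the list above, the element names are the standard PAGE-XML ones, and the `Level` enum and `page_xml_element` helper are hypothetical names made up for illustration, not part of OCR-D core.

```python
from enum import Enum


class Level(Enum):
    """Levels as named in the IIIF list quoted above."""
    PAGE = "page"
    BLOCK = "block"
    PARAGRAPH = "paragraph"
    LINE = "line"
    WORD = "word"
    GLYPH = "glyph"


# PAGE-XML element used for each level; None marks the level for which
# there is no distinct element of its own (the paragraph).
PAGE_XML_ELEMENT = {
    Level.PAGE: "Page",
    Level.BLOCK: "TextRegion",
    Level.PARAGRAPH: None,
    Level.LINE: "TextLine",
    Level.WORD: "Word",
    Level.GLYPH: "Glyph",
}


def page_xml_element(level_name):
    """Hypothetical helper: map a level name to its PAGE-XML element name."""
    return PAGE_XML_ELEMENT[Level(level_name)]
```

A common level parameter could then accept the six names above regardless of whether the prose calls the second level "block" or "region".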

@bertsky
Collaborator

bertsky commented Jan 3, 2020

I concur, we should use the next opportunity to use the term block instead of region more consistently. (Our METS file group USE classes already use BLOCK, but we are already discussing relaxing that scheme.)

Incidentally, this hierarchy is also identical to Tesseract's RIL (ResultIterator levels).

But I don't think we are entirely incompatible with a paragraph level either. PAGE-XML defines all region types fully recursively, and designates @type="paragraph" for paragraphs. IIUC the current OCR-D spec and implementation is agnostic about whether regions should be used in a flat or multi-level fashion. In some places though, PAGE-XML already requires at least 2 levels (namely table cells and footnotes, perhaps also text blocks comprising a @type="drop-capital" and a @type="paragraph" region). These are also the very places we have not tackled at all with the current toolset yet. Our GT however mostly already uses 2 levels for that – and rightly so, because this is most versatile (it can still be reduced to a flat regime, but can also be used for ANN segmentation training, for which a non-flat representation is the only way to cleanly separate visual from textual cues).
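To illustrate the point that a 2-level representation can still be reduced to a flat one, here is a minimal sketch using only the Python standard library. The XML fragment is a made-up, stripped-down example (not schema-valid, since `Coords` and text content are omitted), and `walk_regions` and `flatten` are hypothetical helper names, not OCR-D functions.

```python
import xml.etree.ElementTree as ET

# PAGE-XML 2019 namespace
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up fragment: one text block holding a drop capital and a paragraph,
# plus one paragraph sitting directly on the page.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextRegion id="r1_dc" type="drop-capital"/>
      <TextRegion id="r1_p1" type="paragraph"/>
    </TextRegion>
    <TextRegion id="r2" type="paragraph"/>
  </Page>
</PcGts>"""


def walk_regions(parent, depth=0):
    """Yield (depth, id, type) for every TextRegion below parent, depth-first."""
    for region in parent.findall("pc:TextRegion", NS):
        yield depth, region.get("id"), region.get("type")
        yield from walk_regions(region, depth + 1)


def flatten(parent):
    """Reduce nested regions to a flat list of innermost region ids."""
    leaves = []
    for region in parent.findall("pc:TextRegion", NS):
        inner = flatten(region)
        # A region without sub-regions is itself a leaf.
        leaves.extend(inner if inner else [region.get("id")])
    return leaves


page = ET.fromstring(SAMPLE).find("pc:Page", NS)
```

Calling `list(walk_regions(page))` keeps the nesting depth, while `flatten(page)` drops the outer block `r1` and keeps only its two sub-regions and `r2`.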

So I would propose also taking that opportunity to decide in favour of a mildly recursive region representation of 2 levels, with both a block level and an explicit paragraph level in the functional model. This would allow e.g. ocrd-tesserocr-segment to operate on 3 distinct output levels:

  1. block segmentation from page to blocks (of any type),
  2. paragraph segmentation from text blocks to paragraphs and from table blocks to table cells (as a prerequisite for further representation),
  3. line segmentation from paragraphs to text lines.
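The three output levels proposed above can be sketched as a chain in which each stage consumes what the previous one produced. Everything here is hypothetical: `STAGES`, `ORDER` and `stages_for` are illustrative names, not an existing parameter or API of ocrd-tesserocr-segment.

```python
# Hypothetical stage table mirroring the numbered list above.
STAGES = {
    "block": {"from": ["page"], "to": ["block"]},
    "paragraph": {
        "from": ["text block", "table block"],
        "to": ["paragraph", "table cell"],
    },
    "line": {"from": ["paragraph"], "to": ["text line"]},
}

# Stages in the order they would have to run.
ORDER = ["block", "paragraph", "line"]


def stages_for(output_level):
    """Return the stages needed, in order, to reach the requested output level."""
    if output_level not in ORDER:
        raise ValueError(f"unknown output level: {output_level!r}")
    return ORDER[: ORDER.index(output_level) + 1]
```

For example, `stages_for("paragraph")` gives `["block", "paragraph"]`: paragraph segmentation presupposes that blocks already exist.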

@kba kba assigned kba, cneud and tboenig Jan 6, 2020
@bertsky bertsky added this to the Final workshop milestone Jan 7, 2020
@bertsky
Collaborator

bertsky commented Jan 7, 2020

Okay, so there was consensus in the VC that:

  • region is the better term than block, because
    1. it is used in the document analysis literature
    2. it's also used in PAGE-XML
    3. it requires fewer changes (only the spec and some places in core, but not many processors and their documentation)
  • recursive regions are just that; there is no need for a flat hierarchy, and mixing this issue with paragraphs is inadequate – I will open a separate issue for tables and footnotes

@cneud
Member

cneud commented May 29, 2020

With b199c62, can we release this next week?

@kba
Member Author

kba commented May 29, 2020

Yes, and also merge OCR-D/assets#73.

@kba
Member Author

kba commented Jun 15, 2020

Released and assets adapted.

@kba kba closed this as completed Jun 15, 2020