Use "block" instead of "region" throughout #135

kba · 2019-12-19T12:29:46Z

ALTO tries to be interoperable with IIIF as discussed here. There is a "Text Granularity Extension" for IIIF that defines what we call "levels":

page	A page in a paginated document
block	An arbitrary region of text
paragraph	A paragraph
line	A topographic line
word	A single word
glyph	A single glyph or symbol

Seems reasonably compatible with our definitions, though we call line TextLine and have no distinct notion of a paragraph.
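To make the comparison concrete, here is a minimal sketch of the level names from the table above next to PAGE-XML element names. The mapping itself is an assumption for illustration (in particular `block` to `TextRegion`), and `Granularity`, `PAGE_XML_ELEMENT` and `page_xml_element` are hypothetical names, not part of any OCR-D or IIIF API.

```python
from enum import Enum
from typing import Optional


class Granularity(Enum):
    """Levels of the IIIF Text Granularity Extension, coarsest first."""
    PAGE = "page"
    BLOCK = "block"
    PARAGRAPH = "paragraph"
    LINE = "line"
    WORD = "word"
    GLYPH = "glyph"


# Assumed mapping to PAGE-XML element names (illustrative only).
# PARAGRAPH maps to None because we have no distinct notion of a paragraph.
PAGE_XML_ELEMENT = {
    Granularity.PAGE: "Page",
    Granularity.BLOCK: "TextRegion",
    Granularity.PARAGRAPH: None,
    Granularity.LINE: "TextLine",
    Granularity.WORD: "Word",
    Granularity.GLYPH: "Glyph",
}


def page_xml_element(level: str) -> Optional[str]:
    """Return the assumed PAGE-XML element name for an IIIF level name."""
    return PAGE_XML_ELEMENT[Granularity(level)]
```

An unknown level name raises `ValueError` from the `Enum` lookup, so a typo in a level parameter would fail early instead of silently mapping to nothing.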

My point is: We do use region instead of block in a few places, such as some executables ocrd-*-region. Should we decide on a common parameter for level, it would be a good moment to make sure we're consistent.


bertsky · 2020-01-03T23:41:41Z

I concur, we should use the next opportunity to use the term block instead of region more consistently. (Our METS file group USE classes already use BLOCK, but we are already discussing relaxing that scheme.)

Incidentally, below the page level this hierarchy is also identical to Tesseract's RIL (the PageIteratorLevel values used by its result iterators).

But I don't think we are entirely incompatible with a paragraph level either. PAGE-XML defines all region types fully recursively, and designates @type="paragraph" for paragraphs. IIUC the current OCR-D spec and implementation is agnostic about whether regions should be used in a flat or multi-level fashion. In some places though, PAGE-XML already requires at least 2 levels (namely table cells and footnotes, perhaps also text blocks comprising a @type="drop-capital" and a @type="paragraph" region). These are also the very places we have not tackled at all with the current toolset yet. Our GT however mostly already uses 2 levels for that – and rightly so, because this is most versatile (it can still be reduced to a flat regime, but can also be used for ANN segmentation training, for which a non-flat representation is the only way to cleanly separate visual from textual cues).

So I would propose also taking that opportunity to decide in favour of a mildly recursive region representation of 2 levels, with both a block level and an explicit paragraph level in the functional model. This would allow e.g. ocrd-tesserocr-segment to operate on 3 distinct output levels:

block segmentation from page to blocks (of any type),
paragraph segmentation from text blocks to paragraphs and from table blocks to table cells (as a prerequisite for further representation),
line segmentation from paragraphs to text lines.
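The three proposed levels can be sketched as an ordered chain, where each operation level consumes the output of the previous one. This is only an illustration of the proposal under my own assumptions: `PROPOSED_STEPS` and `steps_down_to` are hypothetical names and do not describe the actual interface of ocrd-tesserocr-segment.

```python
from typing import Dict, List, Tuple

# Proposed operation levels, in processing order, with the
# (input, output) transitions each one would perform.
PROPOSED_STEPS: Dict[str, List[Tuple[str, str]]] = {
    "block": [("page", "block")],
    "paragraph": [("text block", "paragraph"), ("table block", "table cell")],
    "line": [("paragraph", "text line")],
}

# Dicts preserve insertion order, so this is block, paragraph, line.
ORDER: List[str] = list(PROPOSED_STEPS)


def steps_down_to(level: str) -> List[str]:
    """Return the operation levels needed to reach `level`, starting from the page."""
    if level not in PROPOSED_STEPS:
        raise ValueError(f"unknown operation level: {level!r}")
    return ORDER[: ORDER.index(level) + 1]
```

For example, `steps_down_to("line")` lists all three levels, since line segmentation would start from paragraphs rather than directly from blocks.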

bertsky · 2020-01-07T11:45:29Z

Okay, so there was consensus in the VC that:

region is the better term than block, because
1. it is used in the document analysis literature
2. it's also used in PAGE-XML
3. it requires fewer changes (only the spec and some places in core, but not many processors and their documentation)
recursive regions are just that; there is no need for a flat hierarchy, and mixing this issue with paragraphs is inadequate – I will open a separate issue for tables and footnotes

cneud · 2020-05-29T10:40:44Z

With b199c62, can we release this next week?

kba · 2020-05-29T11:32:00Z

Yes and also merge OCR-D/assets#73

kba · 2020-06-15T17:04:09Z

Released and assets adapted.

kba assigned kba, cneud and tboenig Jan 6, 2020

bertsky added this to the Final workshop milestone Jan 7, 2020

kba added a commit to kba/spec that referenced this issue Apr 8, 2020

use "region" instead of "block", OCR-D#135

b199c62

bertsky mentioned this issue Apr 28, 2020

introduce 2nd region level for paragraphs, table cells, footnote content #150

Open

kba closed this as completed Jun 15, 2020
