Label, Layers and Relation #4

Open
bertsky opened this issue Mar 31, 2021 · 4 comments

Comments

@bertsky
Collaborator

bertsky commented Mar 31, 2021

No description provided.

@kba
Member

kba commented Apr 8, 2021

8c18d4b implements mapping PAGE @type attributes to ALTO LayoutTag/@LABEL.

Layers: I cannot find any mechanism for expressing z-level in ALTO.

As for relations I also doubt it can be easily mapped, at least I don't see how :(

@bertsky
Collaborator Author

bertsky commented Apr 12, 2021

8c18d4b implements mapping PAGE @type attributes to ALTO LayoutTag/@LABEL.

Excellent!

I thought just using ALTO's BlockType/@TYPE would be enough for PAGE's various regions' @type. But TagType looks better I must admit. Just a few comments:

  1. Why LayoutTag and not StructureTag?
  2. Perhaps one could do both kinds of mappings, a verbatim copy of @type as @TYPE and the elaborate tagging?
  3. How about including Page/@type vs Layout/Page/@PAGECLASS via the same mechanism?
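
A minimal sketch of what "both kinds of mappings" could look like, assuming namespace-free ALTO element names for brevity. This is not the code from 8c18d4b; the function name `add_region_type` and the `tag_<type>` ID scheme are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_region_type(alto_root, block, page_type):
    """Record a PAGE region @type on an ALTO block in two redundant ways."""
    # (a) verbatim copy into the block's own TYPE attribute
    block.set("TYPE", page_type)

    # (b) a shared LayoutTag under Tags, referenced from the block via TAGREFS
    tags = alto_root.find("Tags")
    if tags is None:
        tags = ET.Element("Tags")
        layout = alto_root.find("Layout")
        # keep Tags in front of Layout, which is where the ALTO schema expects it
        index = list(alto_root).index(layout) if layout is not None else len(alto_root)
        alto_root.insert(index, tags)

    tag = next((t for t in tags.findall("LayoutTag") if t.get("LABEL") == page_type), None)
    if tag is None:
        # hypothetical ID scheme: one tag per distinct type value
        tag = ET.SubElement(tags, "LayoutTag", ID=f"tag_{page_type}", LABEL=page_type)

    refs = block.get("TAGREFS", "").split()
    if tag.get("ID") not in refs:
        refs.append(tag.get("ID"))
    block.set("TAGREFS", " ".join(refs))
    return tag
```

Consumers that only look at @TYPE and consumers that only follow @TAGREFS would then both see the region type.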

What about PAGE's Label mechanism though? Looks as though it is somewhat equivalent to ALTO's TagsType and @TAGREFS... Perhaps via OtherTag?

Layers: I cannot find any mechanism for expressing z-level in ALTO.

IMHO you could express it as a StructureTag with @ID for @id and @LABEL for @zIndex – but I don't know if this is of any use/relevance for anyone.
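
A rough sketch of that idea, assuming the PAGE 2019-07-15 namespace, PAGE's Layers/Layer/RegionRef structure, and namespace-free ALTO elements. The function name and the `blocks_by_id` lookup are hypothetical, and linking member blocks via @TAGREFS is an extra assumption not proposed above.

```python
import xml.etree.ElementTree as ET

PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def layers_to_structure_tags(page_root, tags, blocks_by_id):
    """Emit one StructureTag per PAGE Layer: @ID from @id, @LABEL from @zIndex."""
    for layer in page_root.findall(".//pc:Layers/pc:Layer", PAGE_NS):
        tag = ET.SubElement(
            tags, "StructureTag", ID=layer.get("id"), LABEL=layer.get("zIndex")
        )
        # assumption: ALTO block IDs are copied from the PAGE region ids,
        # so each RegionRef can be resolved to a block and tagged
        for ref in layer.findall("pc:RegionRef", PAGE_NS):
            block = blocks_by_id.get(ref.get("regionRef"))
            if block is not None:
                refs = block.get("TAGREFS", "").split() + [tag.get("ID")]
                block.set("TAGREFS", " ".join(refs))
```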

As for relations I also doubt it can be easily mapped, at least I don't see how :(

From this recommendation it looks like drop-cap relations should be represented via LayoutTag. Not sure about follow-up regions though.

@kba
Member

kba commented Apr 12, 2021

Why LayoutTag and not StructureTag?

I was unsure myself and let @cneud be the tiebreaker :) I don't really know the difference tbh.

Perhaps one could do both kinds of mappings, a verbatim copy of @type as @TYPE and the elaborate tagging?

I did not realize that ALTO has @TYPE. Being redundant here for implementations that use either mechanism makes sense.

How about including Page/@type vs Layout/Page/@PAGECLASS via the same mechanism?

👍

What about PAGE's Label mechanism though?

Sure, I can have a look. Do you have an example?

IMHO you could express it as a StructureTag with @ID for @id and @LABEL for @zIndex – but I don't know if this is of any use/relevance for anyone.

Sure, why not. Again, an example would help with testing.

From this recommendation it looks like drop-cap relations should be represented via LayoutTag. Not sure about follow-up regions though.

IIUC the example cited is not a relation from drop-cap to region but just tagging that this alto:String is a DropCap with content A (which seems unnecessary). We could use a hack with @ID being the source and @LABEL or @DESCRIPTION being the target region. It would be better than losing that information for sure.
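
One way such a hack might be sketched, with several assumptions: the tag element is OtherTag, @TYPE carries the relation type, and the tag @ID is the source region id behind a hypothetical `rel_` prefix (reusing the region id verbatim would presumably collide with the block's own @ID). Both function names are hypothetical, and the scheme allows only one relation per source region.

```python
import xml.etree.ElementTree as ET

PREFIX = "rel_"  # hypothetical prefix to keep tag IDs distinct from block IDs


def relation_to_tag(tags, rel_type, source_id, target_id):
    """Encode a PAGE relation as an ALTO tag: source in @ID, target in @LABEL."""
    return ET.SubElement(
        tags, "OtherTag", ID=PREFIX + source_id, TYPE=rel_type, LABEL=target_id
    )


def tag_to_relation(tag):
    """Recover (type, source, target) from a tag written by relation_to_tag."""
    return tag.get("TYPE"), tag.get("ID")[len(PREFIX):], tag.get("LABEL")
```

The round trip at least shows that the source, target and relation type survive the conversion.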

@bertsky
Collaborator Author

bertsky commented Apr 12, 2021

I did not realize that ALTO has @TYPE. Being redundant here for implementations that use either mechanism makes sense.

I concur.

What about PAGE's Label mechanism though?

Sure, I can have a look. Do you have an example?

Pass (again), sorry. I have grepped through all my PAGE-XML GT resources (which include various datasets from PRImA), but have not found anything on Labels or Relation. (But the latter is in some of the OCR-D structure GT IIRC.)

It's quite expressive: you can have Labels under MetadataItem, all segment hierarchy types from Page to Glyph, all ReadingOrder group types, and even Relation. We should probably open an issue and demand more documentation/examples.

From this recommendation it looks like drop-cap relations should be represented via LayoutTag. Not sure about follow-up regions though.

IIUC the example cited is not a relation from drop-cap to region but just tagging that this alto:String is a DropCap with content A (which seems unnecessary).

You're right – it looked more promising at first glance.

So we do need a representation for link vs join. PAGE's schema-internal documentation reads as if this should apply on different hierarchy levels, but I cannot find a single GT example.

I would expect:

  • drop-capital vs paragraph:
    • word level: whether or not the drop-cap Word itself is a whole word (i.e. is to be delimited by white space)
    • line/region level: always join (i.e. no extra line break or paragraph break)
  • paragraph vs paragraph:
    • word level: whether or not the last Word of the first is continued in the second (i.e. is to be de-hyphenated)
    • region level: whether or not the first paragraph is continued (i.e. no extra paragraph break)
  • line vs line: whether or not the last Word of the first is continued on the second (i.e. is to be de-hyphenated)

But with ALTO we already have an explicit white-space model – on the line level. So I guess you could argue keeping a SP after the final String could represent link (as opposed to join). But that would just be a convention, and I doubt anyone already uses it. Also, for the third case, we don't know how much use ALTO producers/consumers make of HYP and of String/@SUBS_TYPE (HypPart1 and HypPart2). And beyond that we still need to mark paragraph joins (non-breaks).
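
To make that convention concrete, here is a small sketch that classifies how a TextLine ends. It assumes namespace-free ALTO elements; the function name and the returned labels are hypothetical, and, as said, nobody is known to follow this convention.

```python
import xml.etree.ElementTree as ET


def line_end_relation(text_line):
    """Classify the end of an ALTO TextLine under the trailing-SP convention."""
    children = list(text_line)
    if not children:
        return None  # empty line: nothing to relate
    last = children[-1].tag
    if last == "SP":
        # trailing white space kept: next line starts a new word ("link")
        return "link"
    if last == "HYP":
        # explicit hyphen: the last word continues on the next line
        return "join-hyphenated"
    # line ends in a String with no SP after it: continue without white space
    return "join"
```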

I was curious how TEI converters handle this. Sifting through https://github.com/cneud/ocr-conversion and https://github.com/altoxml/documentation/wiki/Software:

I cannot believe there is no existing ALTO-TEI converter capable of unwrapping lines and concatenating text into a linear sequence (based on reading order and block/paragraph boundaries). 😦
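
For illustration, unwrapping a single block could be sketched as below. It assumes namespace-free ALTO elements, TextLines already in reading order, and producers that set String/@SUBS_TYPE and @SUBS_CONTENT on hyphenated words (which, as noted above, is not guaranteed). The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def unwrap_block(text_block):
    """Concatenate all Strings of a TextBlock into one line of text,
    de-hyphenating words marked with SUBS_TYPE HypPart1/HypPart2."""
    words = []
    skip_part2 = False
    for string in text_block.iterfind("TextLine/String"):
        subs_type = string.get("SUBS_TYPE")
        if subs_type == "HypPart1" and string.get("SUBS_CONTENT"):
            # first half of a split word: emit the full word once
            words.append(string.get("SUBS_CONTENT"))
            skip_part2 = True
        elif subs_type == "HypPart2" and skip_part2:
            # second half is already covered by the full word above
            skip_part2 = False
        else:
            # ordinary word (or a split word without usable SUBS_CONTENT)
            words.append(string.get("CONTENT", ""))
    return " ".join(words)
```

SP and HYP elements are ignored here; words are simply re-joined with single spaces, so paragraph joins across blocks would still need a separate marker.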

We could use a hack with @ID being the source and @LABEL or @DESCRIPTION being the target region. It would be better than losing that information for sure.

Not sure anymore we strictly need a relation type (see above: probably just a marker for "join-with-next" on various levels)...

kba added a commit that referenced this issue Apr 13, 2021