Implementation Notes
We begin our exploration by modeling the domain.
The first goal of this project is to extract named entities from the texts in an archival collection so that they may be studied. Our first development effort, then, is to build software that can do this for us, in a straightforward fashion, without the need for human intervention.
Our first approach is to use an off-the-shelf natural-language-processing system (SpaCy) to extract named entities from OCR’d text. We must acknowledge at the outset that the naive approach, using an untrained NER system on uncorrected OCR, will produce dirty results, but these results may prove adequate for our needs.
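As a rough illustration of this first pass, the sketch below runs a stock spaCy pipeline over a page of OCR text and keeps each entity's surface text, label, and character offsets. The model name `en_core_web_sm`, the `EntityRecord` type, and both function names are assumptions made for the example; they are not necessarily what the project uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EntityRecord:
    """Hypothetical container for the properties kept from a spaCy entity."""
    text: str        # surface string exactly as it appears in the OCR
    label: str       # spaCy entity label, e.g. "PERSON" or "GPE"
    start_char: int  # character offset where the entity starts
    end_char: int    # character offset just past the entity


def records_from_doc(doc) -> List[EntityRecord]:
    """Convert the entity spans of a processed spaCy Doc into plain records."""
    return [
        EntityRecord(ent.text, ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]


def extract_entities(ocr_text: str, model_name: str = "en_core_web_sm") -> List[EntityRecord]:
    """Run an off-the-shelf pipeline (not tuned to this collection) over OCR text."""
    import spacy  # imported lazily so records_from_doc works without spaCy installed

    nlp = spacy.load(model_name)  # assumed model; it must be installed separately
    return records_from_doc(nlp(ocr_text))
```

Calling `extract_entities(page_text)` on uncorrected OCR will return whatever the model finds, including misreadings; nothing here attempts to clean the input or filter the output.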
The digital surrogates for Princeton’s archival collections are represented as digital containers of digital pages, each corresponding to a container element in the collection’s finding aid. These digital containers are, in turn, represented as IIIF Manifests, with each digitized page corresponding to a IIIF Canvas. IIIF’s strong support of Web Annotations suggests we might represent named entities as annotations on a container or canvas. Such a linked-data approach leads naturally to modeling these entities and relationships with an ontology that supports graph queries and reasoning.
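To make the annotation idea concrete, here is a minimal sketch of how one recognized entity might be expressed as a W3C Web Annotation whose target is a canvas. The `tagging` and `classifying` purposes and the `TextualBody` type come from the Web Annotation Data Model; the function name, the decision to carry the entity label in a second body, and the example URIs are assumptions for illustration only.

```python
import json


def entity_annotation(annotation_id: str, canvas_uri: str, text: str, label: str) -> dict:
    """Build a Web Annotation (as a JSON-LD dict) that tags a canvas with an entity."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "tagging",
        "body": [
            # The entity's surface text, as recognized on the page.
            {"type": "TextualBody", "value": text, "purpose": "tagging"},
            # The NER label, carried as a classification (an assumed convention).
            {"type": "TextualBody", "value": label, "purpose": "classifying"},
        ],
        "target": canvas_uri,
    }


if __name__ == "__main__":
    # Placeholder URIs; a real annotation would point at an actual canvas.
    annotation = entity_annotation(
        "https://example.org/annotations/1",
        "https://example.org/manifest/canvas/1",
        "Wolff",
        "PERSON",
    )
    print(json.dumps(annotation, indent=2))
```

An annotation like this only records that an entity appears on a canvas; it does not by itself say what kind of thing the entity is in relation to the page, which is where an ontology becomes useful.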
As it happens, the CIDOC-CRM ontology is very well suited to this application. The named entities recognized by an NER program are Symbolic Objects in CIDOC-CRM. From the CIDOC-CRM specification:
This class [E90_Symbolic_Object] comprises identifiable symbols and any aggregation of symbols, such as characters, identifiers, traffic signs, emblems, texts, data sets, images, musical scores, multimedia objects, computer program code or mathematical formulae that have an objectively recognizable structure and that are documented as single units.
Such a symbolic object is a component of an Inscription (E34_Inscription), and the Inscription is carried by a physical page that is represented by a Canvas.
(SpaCy’s entity objects have various properties we may want to exploit. For now, though, we’ll simply extract the lexical properties.)
@prefix ecrm: <http://erlangen-crm.org/200717/> .
@prefix entity: <https://figgy.princeton.edu/concerns/entities/> .
@prefix etype: <https://figgy.princeton.edu/concerns/entity_types#> .
@prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

entity:2DScyJQDKeAVfgR8KdptFU a ecrm:E90_Symbolic_Object ;
    rdfs:label "Wolff" ;
    ecrm:P190_has_symbolic_content "Wolff" .

inscription:4CrkqjTLpyrZd5DqJf82ZK a ecrm:E34_Inscription ;
    ecrm:E55_Type etype:PERSON ;
    ecrm:P106_is_composed_of entity:2DScyJQDKeAVfgR8KdptFU ;
    ecrm:P128i_is_carried_by <https://figgy.princeton.edu/concern/scanned_resources/2a701cb1-33d4-4112-bf5d-65123e8aa8e7/manifest/canvas/c783bbf1-156a-4fc5-b28d-a39fc9b6af03> .
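A record like the one above can be produced mechanically from an extracted entity. The sketch below does so with plain string formatting; an RDF library would probably be a better fit in practice. The function names are hypothetical, and the identifiers are passed in rather than minted, because these notes do not describe how the short IDs are generated. The predicates mirror the sample record exactly.

```python
PREFIXES = """\
@prefix ecrm: <http://erlangen-crm.org/200717/> .
@prefix entity: <https://figgy.princeton.edu/concerns/entities/> .
@prefix etype: <https://figgy.princeton.edu/concerns/entity_types#> .
@prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
"""


def turtle_literal(value: str) -> str:
    """Quote a string as a Turtle literal, escaping the characters that need it."""
    escaped = (
        value.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    return f'"{escaped}"'


def entity_to_turtle(entity_id: str, inscription_id: str, text: str,
                     label: str, canvas_uri: str) -> str:
    """Emit the symbolic object and its inscription, following the sample record."""
    literal = turtle_literal(text)
    return (
        f"entity:{entity_id} a ecrm:E90_Symbolic_Object ;\n"
        f"    rdfs:label {literal} ;\n"
        f"    ecrm:P190_has_symbolic_content {literal} .\n"
        "\n"
        f"inscription:{inscription_id} a ecrm:E34_Inscription ;\n"
        f"    ecrm:E55_Type etype:{label} ;\n"  # predicate kept as in the sample
        f"    ecrm:P106_is_composed_of entity:{entity_id} ;\n"
        f"    ecrm:P128i_is_carried_by <{canvas_uri}> .\n"
    )


if __name__ == "__main__":
    # The canvas URI here is a placeholder, not a real resource.
    print(PREFIXES)
    print(entity_to_turtle("2DScyJQDKeAVfgR8KdptFU", "4CrkqjTLpyrZd5DqJf82ZK",
                           "Wolff", "PERSON", "https://example.org/canvas/1"))
```

One point to keep in mind if this model is developed further: CIDOC-CRM defines E55_Type as a class, and typing is usually expressed with the P2_has_type property. The sketch keeps the predicate as it appears in the sample.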