Rewrite assign_placenames robot #822

edsu · 2024-02-20T16:20:22Z

The existing assign_placenames step in the gisAssemblyWorkflow tries to assign GeoNames IDs to subjects for places and stores them in the Cocina. It does this using the Gazetteer class that looks up names in a CSV file.

Work on #744 led to the conclusion that:

It's not sustainable to keep the CSV mapping up to date with all the place names that will occur.
GeoNames IDs would be difficult to look up automatically in the GeoNames web service due to too many false positives.
Only 58% of GIS items have GeoNames IDs, and those IDs aren't used in GeoBlacklight or elsewhere in SUL access systems.

However, the place names do appear to mostly map to Library of Congress Name Authority headings. 95% of all the GIS items have subject place names that are present in the LC Name and Subject Authority File.

We would like to update assign_placenames to ensure that the place names are valid, and add the URI for the authority record to the Cocina. This will ensure that:

GIS items related to a place colocate in search results when faceting in EarthWorks and SearchWorks
The id.loc.gov URI can be used to look up links to Wikidata, GeoNames, etc., to add more information in the future (e.g., descriptions or images for places in GeoBlacklight).

So, this ticket is to reboot assign_placenames to:

Remove the CSV and all functionality related to it
Modify the Gazetteer class so that it looks up a place name in id.loc.gov, finds an exact match Name or Subject Authority record, and returns the id.loc.gov URI.
Add the id.loc.gov URI to the place subject in the Cocina.

The subject should have a uri and source added:

{
  "value": "Finland",
  "type": "place",
  "uri": "http://id.loc.gov/authorities/names/n79065711",
  "source": {
    "code": "lcnaf",
    "uri": "http://id.loc.gov/authorities/names/"
  }
}

Some of the place names have been found in the subject authority file. You can identify these because they will have sh in their ID:

{
  "value": "Arctic Ocean",
  "type": "place",
  "uri": "http://id.loc.gov/authorities/subjects/sh85006951",
  "source": {
    "code": "lcsh",
    "uri": "http://id.loc.gov/authorities/subjects/"
  }
}
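A minimal sketch of how the source could be chosen from the URI, assuming the authority file can always be inferred from the URI path as in the two examples above. The names `SOURCES` and `build_place_subject` are hypothetical and not part of the existing robot code; the real implementation would be in Ruby and would use the Cocina models rather than plain dictionaries.

```python
from typing import Optional

# Assumption: the URI path identifies the authority file, as in the
# Finland (names) and Arctic Ocean (subjects) examples above.
SOURCES = {
    "/authorities/names/": {
        "code": "lcnaf",
        "uri": "http://id.loc.gov/authorities/names/",
    },
    "/authorities/subjects/": {
        "code": "lcsh",
        "uri": "http://id.loc.gov/authorities/subjects/",
    },
}


def build_place_subject(value: str, uri: Optional[str]) -> dict:
    """Return a place subject, adding uri and source when a URI is known."""
    subject = {"value": value, "type": "place"}
    if uri is None:
        # No authority match: leave the subject without uri/source.
        return subject
    for path, source in SOURCES.items():
        if path in uri:
            subject["uri"] = uri
            subject["source"] = dict(source)  # copy so callers can't mutate SOURCES
            break
    return subject
```

A place name with no match is returned with only its value and type, which is one possible choice; whether unmatched names should instead be flagged is not settled in this ticket.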

An example of looking up a heading using LC's id.loc.gov service can be found in this Jupyter Notebook. The Python implementation is included here, but it should map easily to Ruby. The search results are available in Atom XML. The JSON format didn't make immediate sense, but you are welcome to try to use that instead if you want.

import requests
from xml.etree import ElementTree

def lookup_name(name):
    url = "https://id.loc.gov/search/"
    params = {
        "q":  [
            f'"{name}"',
            'rdftype:Authority'
        ],
        "format": "atom"
    }
    resp = requests.get(url, params=params)
    resp.raise_for_status()
    
    doc = ElementTree.fromstring(resp.content)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for entry in doc.findall('atom:entry', ns):
        title = entry.find("atom:title", ns).text
        uri = entry.find("atom:link", ns).attrib["href"]

        # If the title matches the name exactly, return its URI.
        # Note: some unauthorized headings have dashes in the URI and we want to ignore those:
        # prefer a plain URI like https://id.loc.gov/authorities/names/n79065711 to a variant with a dash in it
        if title == name and '-' not in uri:
            return uri

    return None

lookup_name('Sri Lanka')
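For testing, it may help to separate the matching rule from the HTTP request so it can be run against a saved response. The sketch below assumes the same rule as lookup_name (exact title match, no dash in the URI, first atom:link of each entry). The function name `find_exact_match` and the sample feed are hypothetical; the sample uses made-up example.org URIs, not real id.loc.gov records.

```python
from typing import Optional
from xml.etree import ElementTree

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def find_exact_match(atom_xml: str, name: str) -> Optional[str]:
    """Return the URI of the first entry whose title equals name exactly."""
    doc = ElementTree.fromstring(atom_xml)
    for entry in doc.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)  # first link only, as in lookup_name
        if link is None:
            continue
        uri = link.attrib.get("href", "")
        # Skip URIs that contain a dash (the unauthorized headings noted above).
        if title == name and "-" not in uri:
            return uri
    return None


# Made-up feed for illustration only; not real id.loc.gov output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example Place</title>
    <link href="https://example.org/authorities/names/x0000001-781"/>
  </entry>
  <entry>
    <title>Example Place</title>
    <link href="https://example.org/authorities/names/x0000001"/>
  </entry>
  <entry>
    <title>Example Place (Region)</title>
    <link href="https://example.org/authorities/names/x0000002"/>
  </entry>
</feed>"""

match = find_exact_match(SAMPLE_FEED, "Example Place")
```

Here the first entry is skipped because its URI contains a dash, so the second entry's URI is returned. A name that appears in no entry title yields None.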


edsu · 2024-02-28T18:49:48Z

Much better title, merci!

edsu added this to Geo Workcycles 2024 Feb 16, 2024

edsu converted this from a draft issue Feb 20, 2024

edsu changed the title ~~Rework assign_placenames~~ Reboot assign_placenames Feb 20, 2024

edsu changed the title ~~Reboot assign_placenames~~ Reboot assign_placenames robot Feb 20, 2024

peetucket changed the title ~~Reboot assign_placenames robot~~ Rewrite assign_placenames robot Feb 28, 2024

edsu self-assigned this Feb 28, 2024

edsu moved this from Ready to In Progress in Geo Workcycles 2024 Feb 28, 2024

edsu mentioned this issue Apr 16, 2024

Reconsider Assign Placenames robot #732

Closed

thatbudakguy removed this from Geo Workcycles 2024 Jun 12, 2024
