Rewrite assign_placenames robot #822

Open
edsu opened this issue Feb 20, 2024 · 1 comment
edsu commented Feb 20, 2024

The existing assign_placenames step in the gisAssemblyWorkflow tries to assign GeoNames IDs to place subjects and store them in the Cocina. It does this using the Gazetteer class, which looks up names in a CSV file.

Work on #744 led to the conclusion that:

  • It's not sustainable to keep the CSV mapping up to date with all the place names that will occur.
  • GeoNames IDs would be difficult to look up automatically in the GeoNames web service because of too many false positives.
  • Only 58% of GIS items have GeoNames IDs and they aren't used in GeoBlacklight or elsewhere in SUL access systems.

However, the place names do appear to mostly map to Library of Congress Name Authority headings. 95% of all the GIS items have subject place names that are present in the LC Name and Subject Authority File.

We would like to update assign_placenames to ensure that the place names are valid, and add the URI for the authority record to the Cocina. This will ensure that:

  • GIS items related to a place colocate in search results when faceting in EarthWorks and SearchWorks
  • The id.loc.gov URI can later be used to look up links to Wikidata, GeoNames, etc. and add further information (e.g., descriptions or images for places in GeoBlacklight).

So, this ticket is to reboot assign_placenames to:

  1. Remove the CSV and all functionality related to it
  2. Modify the Gazetteer class so that it looks up a place name in id.loc.gov, finds an exact match Name or Subject Authority record, and returns the id.loc.gov URI.
  3. Add the id.loc.gov URI to the place subject in the Cocina.

The subject should have a uri and source added:

{
  "value": "Finland",
  "type": "place",
  "uri": "http://id.loc.gov/authorities/names/n79065711",
  "source": {
    "code": "lcnaf",
    "uri": "http://id.loc.gov/authorities/names/"
  }
}

Some of the place names have been found in the subject authority file. You can identify these because they will have sh in their ID:

{
  "value": "Arctic Ocean",
  "type": "place",
  "uri": "http://id.loc.gov/authorities/subjects/sh85006951",
  "source": {
    "code": "lcsh",
    "uri": "http://id.loc.gov/authorities/subjects/"
  }
}
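
The two examples above differ only in which authority file the URI points at, so the source can be derived from the URI path. Below is a minimal Python sketch of that step (the robot itself is Ruby). The names `LC_SOURCES`, `source_for`, and `annotate_place_subjects` are hypothetical, and it assumes each subject is a flat dict shaped like the examples above; real Cocina subjects may be structured differently. The `lookup` argument is any callable that takes a place name and returns a URI or None, such as the `lookup_name` function given later in this issue.

```python
from urllib.parse import urlparse

# Hypothetical table: id.loc.gov path prefix -> Cocina source, copied from the examples above.
LC_SOURCES = {
    "/authorities/names/": {"code": "lcnaf", "uri": "http://id.loc.gov/authorities/names/"},
    "/authorities/subjects/": {"code": "lcsh", "uri": "http://id.loc.gov/authorities/subjects/"},
}


def source_for(uri):
    """Return the Cocina source dict for an id.loc.gov URI, or None if unrecognized."""
    parsed = urlparse(uri)
    if parsed.netloc != "id.loc.gov":
        return None
    for prefix, source in LC_SOURCES.items():
        if parsed.path.startswith(prefix):
            return dict(source)  # copy so callers cannot mutate the table
    return None


def annotate_place_subjects(subjects, lookup):
    """Add uri and source to each place subject that `lookup` can resolve."""
    for subject in subjects:
        if subject.get("type") != "place" or "value" not in subject:
            continue
        uri = lookup(subject["value"])
        source = source_for(uri) if uri else None
        if source is None:
            continue  # leave unmatched subjects untouched
        # Assumption: store the http:// form of the URI, as in the examples above.
        subject["uri"] = "http://id.loc.gov" + urlparse(uri).path
        subject["source"] = source
    return subjects
```

Passing the lookup in as an argument keeps this step testable without network access; a plain dict's `get` method works as a stand-in.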

An example of looking up a heading using LC's id.loc.gov service can be found in this Jupyter Notebook. The Python implementation is included here, but it should map easily to Ruby. The search results are available in Atom XML. The JSON format didn't make immediate sense, but you are welcome to try to use that instead if you want.

import requests
from xml.etree import ElementTree

def lookup_name(name):
    url = "https://id.loc.gov/search/"
    params = {
        "q":  [
            f'"{name}"',
            'rdftype:Authority'
        ],
        "format": "atom"
    }
    resp = requests.get(url, params=params)
    resp.raise_for_status()
    
    doc = ElementTree.fromstring(resp.content)
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for entry in doc.findall('atom:entry', ns):
        title = entry.find("atom:title", ns).text
        uri = entry.find("atom:link", ns).attrib["href"]

        # If the strings match, return the URI.
        # Note: some unauthorized headings have dashes in the URI and we want to ignore those,
        # e.g. prefer https://id.loc.gov/authorities/names/n79065711 to a variant of that URI containing a dash.
        if title == name and '-' not in uri:
            return uri

    return None

lookup_name('Sri Lanka')
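
The matching rule in `lookup_name` (exact title match, no dash in the URI) can be checked without calling id.loc.gov if parsing is separated from the HTTP request. The sketch below shows one way to do that. The function name `find_exact_match` is hypothetical, and the sample feed is hand-written: its titles and IDs are made up for illustration and are not real id.loc.gov output.

```python
from xml.etree import ElementTree

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def find_exact_match(atom_xml, name):
    """Return the URI of the first entry whose title equals `name`, else None."""
    doc = ElementTree.fromstring(atom_xml)
    for entry in doc.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        if link is None:
            continue  # entry without a link cannot yield a URI
        uri = link.get("href", "")
        # Same rule as lookup_name: exact title match, and skip URIs containing a dash.
        if title == name and "-" not in uri:
            return uri
    return None


# Hand-written sample feed with made-up titles and IDs (not real records).
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Exampleland</title>
    <link href="http://id.loc.gov/authorities/names/n00000000-000"/>
  </entry>
  <entry>
    <title>Exampleland (Fictional)</title>
    <link href="http://id.loc.gov/authorities/names/n00000001"/>
  </entry>
  <entry>
    <title>Exampleland</title>
    <link href="http://id.loc.gov/authorities/names/n00000000"/>
  </entry>
</feed>"""

match = find_exact_match(SAMPLE, "Exampleland")
```

Here the first entry is skipped because its URI contains a dash and the second because its title is not an exact match, so `match` is the URI of the third entry.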

edsu converted this from a draft issue Feb 20, 2024
edsu changed the title from "Rework assign_placenames" to "Reboot assign_placenames" Feb 20, 2024
edsu changed the title from "Reboot assign_placenames" to "Reboot assign_placenames robot" Feb 20, 2024
peetucket changed the title from "Reboot assign_placenames robot" to "Rewrite assign_placenames robot" Feb 28, 2024
edsu commented Feb 28, 2024

Much better title, merci!

edsu self-assigned this Feb 28, 2024
edsu moved this from Ready to In Progress in Geo Workcycles 2024 Feb 28, 2024