Alignments of the GTAA with other thesauri and vocabularies

mwigham edited this page Sep 3, 2019 · 11 revisions

Note: these alignments are about aligning the GTAA to other domain ontologies. For aligning the Sound and Vision schema to other schemas, see this page

Existing alignments

  • GTAA Onderwerpen to Brinkman thesaurus
  • Non-compliant GTAA Persoonsnamen to Discogs artists
  • GTAA Persoonsnamen to Wikidata (ask Jesse)

Potential alignments

Method

At present, we use SpinqueDesk for creating alignments.

Lessons learned are:

  • It's important to get to know the dataset a bit to discover if there are different groups that should be handled differently. For example, when aligning names, we found it useful to handle names with initials differently to full names.
  • Data tends to be messy. External sources such as Discogs can also contain duplicates and errors.
  • Evaluating the results (working out how to evaluate them!) and deciding how to use them is a large task that should not be underestimated.
  • It's important to be clear about how the metadata fields have been mapped when importing into SpinqueDesk, to ensure you traverse the right relations in SpinqueDesk, and don't miss data. Data can also be lost during the mapping, e.g. artists without roles are currently not included as the mapping of the iMMIx data assumed that there was always a role.
  • Fuzzy matching rapidly introduces a lot of noise. Even a single-letter difference in a fairly long name can still result in an error - e.g. Morgenstein vs. Morgenstern.
  • Fuzzy matching takes a long time in SpinqueDesk.
  • Fingerprinting is a good alternative to fuzzy matching, offering compensation for small differences, e.g. punctuation, special characters, whitespace, while staying close to the original text. Even so, it can produce undesirable results - e.g. F. Lo matching with Flo.
  • Regexes can be used for custom processing of strings, e.g. to rearrange names of the form 'surname, initials' into 'initials surname'.
  • It is very useful to confirm matches of concepts by using other metadata, e.g. when matching artist names based on string matching, to confirm the matches by creating a list of artists that match based on having worked on the same album/track. Matches that appear in both lists are more likely to be correct.
  • All actions in SpinqueDesk are based on processing the data objects, so programming logic also needs to be expressed in terms of the objects - e.g. an if/else can be performed by filtering objects with a regex and having separate processing paths for the True and False outputs.
  • The results of different paths can be combined in a Mix block. This also gives a point at which you can easily control what is included, by choosing which results to link in the Mix block.
  • It is not possible to define and reuse groups of blocks in SpinqueDesk.
  • Objects are passed around in SpinqueDesk with their properties, but the properties are not easily accessible. For example, to get the pref label and description of an object, you need an 'Extract Strings' block with the pref label selected, and then another 'Extract Strings' block with the description selected. From each, you get a pair of the object with the string.
  • Processing blocks in SpinqueDesk accept specific types of inputs. For example, the 'Sample' block accepts only a list of objects. Also, to merge two pairs of objects together, the pairs must be arranged so that the two 'inner' objects are the same. This introduces a lot of overhead with Merge, Swap and Split blocks to ensure the correct input type each time. For example, to get a sample of matches, you need to first split the matches to get object lists, then use a Sample block to get a sample of those objects, then merge that sample with the matches to get a sample of matches.
  • SpinqueDesk does not try to prevent actions that are too computationally heavy, so a matching strategy that has not been well thought through can bring the whole application to a halt.
  • SpinqueDesk does not indicate why a match has occurred - e.g. if you match on the names and alternative names of a concept, you do not see on the basis of which name the match was made. It is possible to reconstruct this to some extent with additional blocks. This is more complicated when the match was made based on an indirect property, e.g. matching two people based on the names of the albums they have worked on. In this case, we needed to download various results as CSV files and combine them in an external script.
  • Documenting the alignment is difficult as the possibilities for annotation in SpinqueDesk are limited (we chose to document it in a separate document, with screenshots).
  • Debugging an alignment is difficult as you cannot step through what is happening. Running with examples or small test sets can help.
  • SpinqueDesk displays lists of results that can be scrolled through. For a full overview, or to be able to filter and sort the results, you can download the results as a CSV file.
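
To illustrate the lesson about fuzzy matching noise, here is a minimal sketch in plain Python (outside SpinqueDesk) of an edit-distance calculation. It is not how SpinqueDesk implements fuzzy matching; the function name `edit_distance` is our own. It shows that Morgenstein and Morgenstern are only one substitution apart, so any threshold loose enough to catch typos will also accept this wrong match.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions or substitutions to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("Morgenstein", "Morgenstern"))
```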
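
The fingerprinting lesson can be sketched as follows. This is one possible fingerprint, assuming that case, accents, punctuation and whitespace should all be ignored; the exact normalisation used in SpinqueDesk may differ, and the function name `fingerprint` is hypothetical.

```python
import unicodedata


def fingerprint(name: str) -> str:
    """Reduce a name to a key that ignores case, accents,
    punctuation and whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    # Lowercase and keep only letters and digits.
    return "".join(c for c in without_marks.casefold() if c.isalnum())


print(fingerprint("F. Lo"), fingerprint("Flo"))
print(fingerprint("Morgenstein") == fingerprint("Morgenstern"))
```

With this normalisation 'F. Lo' and 'Flo' end up with the same key, which reproduces the undesirable match mentioned above, while Morgenstein and Morgenstern stay distinct.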
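
The regex lesson, together with the if/else-by-filtering idea, might look like this in Python. The pattern is an assumption about what 'surname, initials' looks like (initials are single letters each followed by a full stop); names that do not fit the pattern take the "False" path and are returned unchanged. The names `SURNAME_INITIALS` and `reorder_name` are hypothetical.

```python
import re

# Assumed shape: "Surname, J.P." or "Surname, J. P."
SURNAME_INITIALS = re.compile(
    r"^\s*(?P<surname>[^,]+?)\s*,\s*(?P<initials>(?:[^\W\d_]\.\s*)+)$"
)


def reorder_name(name: str) -> str:
    """Rewrite 'surname, initials' as 'initials surname'."""
    match = SURNAME_INITIALS.match(name)
    if match is None:
        # "False" path: not in the expected form, leave untouched.
        return name
    # "True" path: remove spaces between initials and put them first.
    initials = re.sub(r"\s+", "", match.group("initials"))
    return f"{initials} {match.group('surname')}"


print(reorder_name("Jansen, J. P."))
print(reorder_name("Madonna"))
```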
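
Confirming string matches with other metadata, and combining downloaded CSV results in an external script, could be sketched as below. This is not the script we used; the column names `gtaa_id` and `discogs_id`, the helper names and the example rows are all made up for illustration. In practice the two inputs would be CSV files downloaded from SpinqueDesk.

```python
import csv
import io


def load_pairs(csv_file, left_column, right_column):
    """Read a CSV export and return the set of (left, right) ID pairs."""
    reader = csv.DictReader(csv_file)
    return {(row[left_column], row[right_column]) for row in reader}


def split_by_confirmation(name_matches, shared_work_matches):
    """Split name matches into those also found via shared works
    and those found by name only."""
    confirmed = name_matches & shared_work_matches
    unconfirmed = name_matches - shared_work_matches
    return confirmed, unconfirmed


# Made-up data standing in for two downloaded CSV files.
name_csv = io.StringIO("gtaa_id,discogs_id\ng1,d1\ng2,d2\ng3,d9\n")
work_csv = io.StringIO("gtaa_id,discogs_id\ng1,d1\ng3,d3\n")

name_matches = load_pairs(name_csv, "gtaa_id", "discogs_id")
work_matches = load_pairs(work_csv, "gtaa_id", "discogs_id")
confirmed, unconfirmed = split_by_confirmation(name_matches, work_matches)
print(sorted(confirmed))
print(sorted(unconfirmed))
```

Pairs in `confirmed` appear in both lists and are more likely to be correct; pairs in `unconfirmed` are candidates for manual review.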