Prepping data

Michelle Janowiecki edited this page Feb 14, 2022 · 8 revisions

This is a workflow for preparing metadata for new items for ingest into Drupal.

Get list of existing terms from Drupal

1. Get existing taxonomy terms from Drupal.

input: none
script: get/getTaxonomyIdentifiers.py
output: levy-api/existing-taxonomies

First, you need to get existing taxonomy terms from Drupal. This is to ensure you don't make duplicates of already existing terms. To do this, run getTaxonomyIdentifiers.py against your production site. This will create a folder in your levy-api directory called existing-taxonomies and will create a CSV for each type of taxonomy in Drupal.

Currently, there are seven taxonomies in the Lester Levy Sheet Music Collection:

  • Composition Metadata (composition_metadata.csv)
  • Content List (c.csv)
  • Creator Roles (creator_r.csv)
  • Duplicate Reason Codes (duplicat.csv)
  • Instrumentation Metadata (instrumentation_metadata.csv)
  • Publishers (publishers.csv)
  • Subjects (subjects.csv)
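After the script finishes, it can help to confirm that each taxonomy CSV was written and contains rows. The following is a minimal sketch, not part of the levy-api scripts: the function name load_existing_taxonomies is hypothetical, and it assumes nothing about the CSV columns beyond a header row.

```python
import csv
from pathlib import Path


def load_existing_taxonomies(folder):
    """Return {file stem: list of row dicts} for every CSV in the folder."""
    taxonomies = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # DictReader uses the first row as column names, whatever they are.
            taxonomies[path.stem] = list(csv.DictReader(handle))
    return taxonomies


def summarize(taxonomies):
    """Return one 'name: N rows' line per taxonomy for a quick visual check."""
    return [f"{name}: {len(rows)} rows" for name, rows in taxonomies.items()]
```

Run from the levy-api directory, `summarize(load_existing_taxonomies("existing-taxonomies"))` should list one line per taxonomy CSV.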

2. Get existing levy_collection_names from Drupal.

input: none
script: get/getNode_levy_collection_names.py
output: allCollectionNames.csv

Next, you need to get existing levy_collection_names (the entity used for creator/contributor names) from Drupal. This is to ensure you don't make duplicates of already existing names. To do this, run getNode_levy_collection_names.py against your production site. This will create a CSV called allCollectionNames.csv containing all existing levy_collection_names in Drupal in your main levy-api directory.

Get list of terms from new data

3. Get list of taxonomy terms and levy_collection_names from spreadsheet of new data.

input: Metadata spreadsheet
script: explodeTaxonomiesAndNames.py
output: levy-api/aggregated-taxonomies & levy-api/aggregated-roles

Now, we need to determine which taxonomy terms and levy_collection_names appear in the metadata for our new items. This script "explodes" taxonomy terms and levy_collection_names from our metadata spreadsheet into CSVs aggregated by the terms themselves (for instance, "love" or "Smith, Bob"). In Steps 4 and 5, we will compare these terms and names to those found in Step 1 and Step 2, respectively, to determine which are new and should be created in Drupal.

To do this, run explodeTaxonomiesAndNames.py with your metadata spreadsheet in the main levy-api directory.
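To illustrate what "exploding" means here, the sketch below splits a multi-valued column and groups fileIdentifiers by term. It is not the code in explodeTaxonomiesAndNames.py: the function name, the column names fileIdentifier and subjects, and the "|" delimiter are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def explode_column(rows, column, id_column="fileIdentifier", delimiter="|"):
    """Map each distinct term in `column` to the identifiers of the rows using it."""
    aggregated = defaultdict(list)
    for row in rows:
        for term in (row.get(column) or "").split(delimiter):
            term = term.strip()
            if term:  # skip empty cells and stray delimiters
                aggregated[term].append(row[id_column])
    return dict(aggregated)


# Hypothetical two-row spreadsheet; column names and delimiter are assumptions.
sample = io.StringIO(
    "fileIdentifier,subjects\n"
    "item-001,love|courtship\n"
    "item-002,love\n"
)
by_term = explode_column(csv.DictReader(sample), "subjects")
print(by_term)
# {'love': ['item-001', 'item-002'], 'courtship': ['item-001']}
```

Grouping by term rather than by item means each term needs to be looked up in Drupal only once, however many items use it.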

To learn about how to format your metadata spreadsheet and how to name your columns, please see this example.

Determine what terms need to be created in Drupal

4. Compare taxonomy terms from new items to existing terms in Drupal.

input: spreadsheets in levy-api/existing-taxonomies & levy-api/aggregated-taxonomies
script: findExistingTaxTermsAndTermsToCreate.py
output: levy-api/items-matched, termsDone/taxonomyTermsDone.csv, termsToCreate/taxonomyTermsToCreate.csv

This script compares taxonomy terms found in Drupal (Step 1) and terms found in your metadata spreadsheet (Step 3) and produces several helpful spreadsheets.

The first is taxonomyTermsDone.csv, which is a CSV containing a list of taxonomy terms found in your metadata that already exist in Drupal. This is for your reference.

The second is taxonomyTermsToCreate.csv, which is a CSV containing a list of taxonomy terms found in your metadata that DO NOT exist in Drupal and need to be created in later steps.

Finally, a new folder named items-matched is created in your levy-api directory. This folder contains a CSV for each taxonomy with terms in your metadata spreadsheet. The CSV contains all of the taxonomy terms, associated fileIdentifiers, and the taxonomy terms' Drupal identifier (if found). We will rely on these spreadsheets in later steps.
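The core of the comparison can be pictured as a lookup against the existing terms. This is a rough sketch under stated assumptions, not the logic of findExistingTaxTermsAndTermsToCreate.py: the function name split_terms is hypothetical, the identifiers are made up, and the case-insensitive matching is a choice made for the example that the real script may not share.

```python
def split_terms(new_terms, existing_ids):
    """Split new terms into those already in Drupal and those still to create.

    new_terms: iterable of term strings from the metadata spreadsheet.
    existing_ids: dict mapping an existing term to its Drupal identifier.
    """
    # Assumption: matching ignores case and surrounding whitespace.
    lookup = {term.strip().lower(): tid for term, tid in existing_ids.items()}
    done, to_create = {}, []
    for term in new_terms:
        key = term.strip().lower()
        if key in lookup:
            done[term] = lookup[key]
        else:
            to_create.append(term)
    return done, to_create


existing = {"Love": "101", "Courtship": "102"}  # hypothetical identifiers
done, to_create = split_terms(["love", "Railroads"], existing)
print(done)       # {'love': '101'}
print(to_create)  # ['Railroads']
```

Here `done` corresponds to the contents of taxonomyTermsDone.csv and `to_create` to taxonomyTermsToCreate.csv.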

5. Compare levy_collection_names from new items to existing names in Drupal.

input: allCollectionNames.csv & levy-api/aggregated-roles
script: findExistingCollNamesAndNamesToCreate.py
output: matched_CollectionNames.csv, termsDone/levy_collection_namesDone.csv, termsToCreate/levy_collection_namesToCreate.csv

This script compares levy_collection_names found in Drupal (Step 2) and names found in your metadata spreadsheet (Step 3) and produces several helpful spreadsheets.

The first is levy_collection_namesDone.csv, which is a CSV containing a list of levy_collection_names found in your metadata that already exist in Drupal. This is for your reference.

The second is levy_collection_namesToCreate.csv, which is a CSV containing a list of levy_collection_names found in your metadata that DO NOT exist in Drupal and need to be created in later steps.

Finally, a CSV called matched_CollectionNames.csv is created. It contains all of the levy_collection_names, associated fileIdentifiers, and the levy_collection_names' Drupal identifier (if found). We will rely on this spreadsheet in later steps.
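As an illustration of the shape such a matched file might take, the sketch below emits one row per name and fileIdentifier pair, with the Drupal identifier left blank when no match was found. The column names (name, fileIdentifier, drupal_id), the function name, and the sample data are assumptions; the real matched_CollectionNames.csv may be laid out differently.

```python
import csv
import io


def build_matched_rows(aggregated, existing_ids):
    """Yield one row per (name, fileIdentifier) pair with the Drupal id if known."""
    for name, file_ids in aggregated.items():
        for file_id in file_ids:
            yield {
                "name": name,
                "fileIdentifier": file_id,
                # Blank identifier marks a name that still has to be created.
                "drupal_id": existing_ids.get(name, ""),
            }


# Hypothetical inputs: names grouped by item (Step 3) and known ids (Step 2).
aggregated = {"Smith, Bob": ["item-001", "item-002"], "Doe, Jane": ["item-002"]}
existing_ids = {"Smith, Bob": "55"}
rows = list(build_matched_rows(aggregated, existing_ids))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "fileIdentifier", "drupal_id"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Rows with an empty drupal_id are the ones that would need their identifier filled in once the missing names have been created in Drupal.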