Add authority_terms_duplicate_mode and implement behavior
NOTE: Also fixes a bug, regardless of this mode setting, that we
hadn't run into yet, but eventually would have: terms consisting
solely of non-Latin characters would all be treated as identical!
kspurgin committed Nov 18, 2024
1 parent 19ca494 commit 8dc0d4b
Showing 7 changed files with 94 additions and 36 deletions.
32 changes: 30 additions & 2 deletions doc/batch_configuration.adoc
@@ -24,12 +24,13 @@ A JSON config hash may be passed to a new `Mapper::DataHandler` to control vario
"response_mode": "verbose",
"strip_id_values": true,
"multiple_recs_found": "fail",
"authority_terms_duplicate_mode": "exact",
"check_record_status" : true,
"status_check_method" : 'client',
"status_check_method" : "client",
"search_if_not_cached": true,
"force_defaults": false,
"date_format": "month day year",
"two_digit_year_handling": "convert to four digit",
"two_digit_year_handling": "coerce",
"transforms": {
"collection": {
"special": [
@@ -49,6 +50,33 @@ A JSON config hash may be passed to a new `Mapper::DataHandler` to control vario
}
----
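The following is a minimal sketch of how a caller might read the new `authority_terms_duplicate_mode` key from a JSON config string, falling back to the documented default of `normalized` and rejecting values outside the documented `exact`/`normalized` pair. The helper name `duplicate_mode_from` and the constant `ALLOWED_DUPLICATE_MODES` are hypothetical and are not part of the mapper; whether the mapper itself validates this value is not shown in this commit.

```ruby
require "json"

# Allowed values as listed in the documentation added by this commit.
ALLOWED_DUPLICATE_MODES = %w[exact normalized].freeze

# Hypothetical helper (not part of the mapper): reads the setting from a
# JSON config string and falls back to the documented default.
def duplicate_mode_from(config_json)
  config = JSON.parse(config_json)
  mode = config.fetch("authority_terms_duplicate_mode", "normalized")
  unless ALLOWED_DUPLICATE_MODES.include?(mode)
    raise ArgumentError,
      "authority_terms_duplicate_mode must be one of: #{ALLOWED_DUPLICATE_MODES.join(", ")}"
  end
  mode
end

puts duplicate_mode_from('{"authority_terms_duplicate_mode": "exact"}') # exact
puts duplicate_mode_from('{"response_mode": "verbose"}')               # normalized
```

Failing early on an unrecognized value is a design choice of this sketch only; it keeps a typo in the config from silently changing how duplicates are detected.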

== authority_terms_duplicate_mode

Controls how the `shortIdentifier` field value for new authority records is created, and thus what counts as a "duplicate" authority term/id (first `termDisplayName` value in record).

If `exact`, then the following near-variant terms can all be created in a batch:

- Sidewalk cafes
- Sidewalk cafes.
- Sidewalk cafes

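If `normalized`, all of the listed terms would be normalized to `Sidewalkcafes` (non-word characters are removed; case is preserved) before the `shortIdentifier` is generated, so all three would receive the same `shortIdentifier` and would be reported in the processing step as having duplicate IDs.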

The `exact` setting is useful if you do not have the capacity to normalize and programmatically handle near-duplicates in your authority data (and in the object/procedural data where the terms are used to populate fields) prior to loading your data into CollectionSpace. The down-sides of loading authority data that is this messy are that:

- You are left with LOTS of cleanup to do in CollectionSpace, which is very tedious;
- Until that cleanup is done, if you need to do an advanced search for objects where the `contentConcept` (and not the `assocConcept`) is sidewalk cafes, you will have to "OR" together all 3 near-variant terms in your search to get comprehensive results; and
- If used in a field that is faceted on in the public browser, there will be 3 separate facets for the 3 different strings until you clean up the terms.

These down-sides are compounded if you are starting from scratch, loading a large number of authority terms, and they are all full of near-duplicate values.

The default value is `normalized` because the more you can clean up this kind of trivial-to-programmatically-address variant before you put the data into CollectionSpace, the better your data will work inside CollectionSpace.

- *Required?:* no
- *Defaults to:* `normalized`
- *Data type*: string
- *Allowed values*: `exact`, `normalized`
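To make the difference between the two modes concrete, here is a rough sketch modeled on the `AuthorityShortIdentifier.call` logic in this commit. It is an approximation under stated assumptions: `sketch_short_id` is a hypothetical name, `Zlib.crc32` stands in for `XXhash.xxh32` so the example runs without extra gems (the numeric suffixes therefore differ from real values), and the three variant strings are assumed for illustration.

```ruby
require "zlib"

# Hypothetical stand-in for AuthorityShortIdentifier.call as shown in this commit.
# Zlib.crc32 replaces XXhash.xxh32 only so the sketch runs without the xxhash
# gem, so the numeric suffixes will not match real shortIdentifier values.
# The non-Latin fallback is omitted, and any mode other than "exact" is
# treated as "normalized" here.
def sketch_short_id(term, mode = "normalized")
  # Keeps ASCII letters, digits, and underscores; case is preserved.
  prepped = term.gsub(/\W/, "")
  # "exact" hashes the raw term; "normalized" hashes the stripped term.
  hash_input = (mode == "exact") ? term : prepped
  "#{prepped}#{Zlib.crc32(hash_input)}"
end

# Assumed variants for illustration: a trailing period and a doubled space.
terms = ["Sidewalk cafes", "Sidewalk cafes.", "Sidewalk  cafes"]

%w[normalized exact].each do |mode|
  ids = terms.map { |term| sketch_short_id(term, mode) }
  puts "#{mode}: #{ids.uniq.length} distinct id(s) for #{terms.length} terms"
end
```

In `normalized` mode every variant is hashed after stripping, so all three collapse to a single identifier. In `exact` mode the hash is taken from the raw term, so the three identifiers are expected to differ (barring a hash collision) even though they share the same readable prefix.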

== batch_mode

`full record`:: The "normal" mode, using a CSV template that has all fields for the record type. Allows you to create a full record from each row of data in the CSV. For structured dates, a date string is entered in the column for the structured date group field, and the application handles parsing the date details. This processing cannot set period, association, note, certainty, qualifier, value, or unit fields within structured date detailed fields.
2 changes: 1 addition & 1 deletion lib/collectionspace/mapper/data_mapper.rb
@@ -60,7 +60,7 @@ def add_short_id
else
term = response.split_data["termdisplayname"][0]
CollectionSpace::Mapper::Identifiers::AuthorityShortIdentifier.call(
term
term, handler.batch.authority_terms_duplicate_mode
)
end
response.add_identifier(shortid)
2 changes: 2 additions & 0 deletions lib/collectionspace/mapper/handler_full_record.rb
@@ -76,6 +76,8 @@ class HandlerFullRecord
end

setting :batch, reader: true do
setting :authority_terms_duplicate_mode, default: "normalized",
reader: true
setting :check_record_status, default: true, reader: true
setting :date_format, default: "month day year", reader: true
setting :default_values, default: {}, reader: true
@@ -3,20 +3,43 @@
module CollectionSpace
module Mapper
module Identifiers
class AuthorityShortIdentifier < ShortIdentifier
def initialize(**opts)
super
end
class AuthorityShortIdentifier
class << self
def call(term, mode = "normalized")
case mode
when "normalized"
prepped = prepped(term)
"#{prepped}#{hashed(prepped)}"
when "exact"
"#{prepped(term)}#{hashed(term)}"
end
end

def call
"#{prepped_term}#{hashed_term}"
end
private

def prepped(term)
result = term.gsub(/\W/, "")
return result unless result.empty?

private
# All non-Latin characters are removed from
# shortIdentifiers as created by the CollectionSpace
# application. However, CollectionSpace itself is able
# to generate a unique hash value from the string to use
# as the shortIdentifier value. We need to provide a
# unique string that meets the Latin alphanumeric
# requirements of a shortIdentifier value, so that
# unique strings consisting fully of non-Latin
# characters can be loaded without being flagged as
# duplicates of one another.
"spec#{term.bytes.join("b")}"
end

def hashed(term)
XXhash.xxh32(term)
end

def hashed_term
XXhash.xxh32(prepped_term)
end

end
end
end
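The fallback for terms made up entirely of non-Latin characters can be tried in isolation. The sketch below copies the logic of the `prepped` helper from the diff above into a standalone method; the name `prep_for_short_id` is hypothetical, and the hash suffix is deliberately left out so the output depends only on Ruby's standard library.

```ruby
# Standalone copy of the `prepped` logic shown in the diff
# (hypothetical method name; the hash suffix is intentionally omitted).
def prep_for_short_id(term)
  # By default Ruby's \W matches anything outside [a-zA-Z0-9_],
  # so accented and non-Latin characters are removed here.
  stripped = term.gsub(/\W/, "")
  return stripped unless stripped.empty?

  # Nothing survived stripping: fall back to the term's UTF-8 byte values,
  # written as decimal numbers joined by "b" and prefixed with "spec".
  "spec#{term.bytes.join("b")}"
end

puts prep_for_short_id("Jurgén Klopp") # JurgnKlopp
puts prep_for_short_id("手日尸")        # spec230b137b139b230b151b165b229b176b184
puts prep_for_short_id("廿木竹")        # spec229b187b191b230b156b168b231b171b185
```

Because the fallback is built from the term's bytes, two different all-non-Latin terms get different prefixes. Before this change both would have been reduced to an empty string, which is the bug described in the commit message.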
20 changes: 2 additions & 18 deletions lib/collectionspace/mapper/identifiers/short_identifier.rb
@@ -5,26 +5,10 @@ module Mapper
module Identifiers
class ShortIdentifier
class << self
def call(term)
new(term: term).call
def call(term, mode = "normalized")
term.gsub(/\W/, "")
end
end

def initialize(term:)
@term = term
end

def call
prepped_term
end

private

attr_reader :term

def prepped_term
term.gsub(/\W/, "")
end
end
end
end
7 changes: 4 additions & 3 deletions lib/collectionspace/mapper/tools/ref_name.rb
@@ -17,7 +17,8 @@ def from_urn(urn)
end

def from_term(source_type:, type:, subtype:, term:, handler:)
identifier = set_identifier(source_type, term)
mode = handler.batch.authority_terms_duplicate_mode
identifier = set_identifier(source_type, term, mode)
new(
type: type,
subtype: subtype,
@@ -36,8 +37,8 @@ def parse(urn)
fail CollectionSpace::Mapper::UnparseableRefNameUrnError.new(urn)
end

def set_identifier(type, term)
id_class(type).call(term)
def set_identifier(type, term, mode)
id_class(type).call(term, mode)
end

def id_class(type)
@@ -6,14 +6,34 @@
subject(:idgenerator) { described_class }

describe ".call" do
it "generates hashed short identifiers for authorities" do
it "generates hashed short identifiers for normalized authority terms" do
authorities = {
"Jurgen Klopp!" => "JurgenKlopp1289035554",
"Achillea millefolium" => "Achilleamillefolium1482849582"
"Jurgen Klopp" => "JurgenKlopp1289035554",
"Jurgén Klopp" => "JurgnKlopp1116712995",
"Jurgen klopp" => "Jurgenklopp1339498236",
"Achillea millefolium" => "Achilleamillefolium1482849582",
"手日尸" => "spec230b137b139b230b151b165b229b176b1841974196654", # in CS: 1731704961991
"廿木竹" => "spec229b187b191b230b156b168b231b171b1853413799245", # in CS: 1731705039577
}

result = authorities.keys.map { |term| idgenerator.call(term) }
expect(result).to eq(authorities.values)
end

it "generates hashed short identifiers for exact authority terms" do
authorities = {
"Jurgen Klopp!" => "JurgenKlopp1344333070",
"Jurgen Klopp" => "JurgenKlopp2369906287",
"Jurgén Klopp" => "JurgnKlopp1760941770",
"Jurgen klopp" => "Jurgenklopp2197261388",
"Achillea millefolium" => "Achilleamillefolium1698421148",
"手日尸" => "spec230b137b139b230b151b165b229b176b1842743601998", # in CS: 1731704961991
"廿木竹" => "spec229b187b191b230b156b168b231b171b1853049386482", # in CS: 1731705039577
}

result = authorities.keys.map { |term| idgenerator.call(term, "exact") }
expect(result).to eq(authorities.values)
end
end
end
