Add authority_terms_duplicate_mode and implement behavior #180

Merged
merged 2 commits into from
Nov 18, 2024
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,10 @@ This project bumps the version number for any changes (including documentation u

## [Unreleased] - i.e. pushed to main branch but not yet tagged as a release

## [6.1.0] - 2024-11-18
- Add `authority_terms_duplicate_mode` batch config setting that changes the way authority `shortIdentifier` values are generated, allowing near-duplicate terms to be created in a batch.
- BUGFIX: Authority terms consisting solely of non-Latin characters are no longer normalized to a blank string for `shortIdentifier` creation, and thus will not be flagged as duplicate terms.

## [6.0.4] - 2024-10-21
- Make fallback term search fully case-insensitive, rather than just capitalizing/downcasing first letter of term

2 changes: 1 addition & 1 deletion Gemfile.lock
Expand Up @@ -12,7 +12,7 @@ GIT
PATH
remote: .
specs:
-collectionspace-mapper (6.0.4)
+collectionspace-mapper (6.1.0)
activesupport (= 6.0.4.7)
chronic
collectionspace-client (~> 0.15.0)
32 changes: 30 additions & 2 deletions doc/batch_configuration.adoc
Expand Up @@ -24,12 +24,13 @@ A JSON config hash may be passed to a new `Mapper::DataHandler` to control vario
"response_mode": "verbose",
"strip_id_values": true,
"multiple_recs_found": "fail",
+"authority_terms_duplicate_mode": "exact",
"check_record_status" : true,
-"status_check_method" : 'client',
+"status_check_method" : "client",
"search_if_not_cached": true,
"force_defaults": false,
"date_format": "month day year",
-"two_digit_year_handling": "convert to four digit",
+"two_digit_year_handling": "coerce",
"transforms": {
"collection": {
"special": [
Expand All @@ -49,6 +50,33 @@ A JSON config hash may be passed to a new `Mapper::DataHandler` to control vario
}
----

== authority_terms_duplicate_mode

Controls how the `shortIdentifier` field value for new authority records is created, and thus what counts as a "duplicate" authority term/ID. The value is derived from the first `termDisplayName` value in the record.

If `exact`, then the following near-variant terms can all be created in a batch:

- Sidewalk cafes
- Sidewalk cafes.
- Sidewalk cafes

If `normalized`, all of the listed terms would be normalized to `Sidewalkcafes` in the `shortIdentifier` (non-word characters are removed; letter case is kept), and the three terms would be reported in the processing step as having duplicate IDs.
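To make the difference concrete, here is a minimal Ruby sketch of how the two modes group near-variant terms. It is an illustration under stated assumptions, not the mapper's own code: `short_id` and `duplicate_groups` are hypothetical helpers, `Zlib.crc32` stands in for the `XXhash.xxh32` call used by the gem so that the snippet runs on the standard library alone, and the third sample term is an invented variant.

```ruby
require "zlib"

# Assumption: any stable 32-bit hash is enough for illustration.
# The gem itself calls XXhash.xxh32 here.
def hashed(term)
  Zlib.crc32(term)
end

# Remove every character that is not an ASCII letter, digit, or underscore.
def prepped(term)
  term.gsub(/\W/, "")
end

# Hypothetical helper mirroring the two documented modes.
def short_id(term, mode = "normalized")
  case mode
  when "normalized"
    base = prepped(term)
    "#{base}#{hashed(base)}"          # hash of the stripped term
  when "exact"
    "#{prepped(term)}#{hashed(term)}" # hash of the term as entered
  end
end

# Return the groups of terms that share an identifier under the given mode.
def duplicate_groups(terms, mode)
  terms.group_by { |term| short_id(term, mode) }
       .values
       .select { |group| group.size > 1 }
end

# The third term is an invented variant, used only for this example.
TERMS = ["Sidewalk cafes", "Sidewalk cafes.", "Sidewalk-cafes"].freeze

normalized_dupes = duplicate_groups(TERMS, "normalized")
exact_dupes = duplicate_groups(TERMS, "exact")

puts "normalized: #{normalized_dupes.inspect}"
puts "exact: #{exact_dupes.inspect}"
```

Under `normalized`, all three terms strip down to the same string and therefore share one identifier. Under `exact`, the hash is taken over the term as entered, so each variant gets its own identifier even though the readable prefix is the same.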

The `exact` setting is useful if you do not have the capacity to normalize and programmatically handle near-duplicates in your authority data (and in the object/procedural data where the terms are used to populate fields) prior to loading your data into CollectionSpace. The down-sides of loading authority data that is this messy are:

- You are left with LOTS of cleanup to do in CollectionSpace, which is very tedious.
- Until that cleanup is done, if you need to do an advanced search for objects where the `contentConcept` (and not the `assocConcept`) is sidewalk cafes, you will have to "OR" together all 3 near-variant terms in your search to get comprehensive results.
- If the terms are used in a field that is faceted on in the public browser, there will be 3 separate facets for the 3 different strings until you clean up the terms.

These down-sides are compounded if you are starting from scratch, loading a large number of authority terms, and they are all full of near-duplicate values.

The default value is `normalized` because the more you can clean up this kind of trivial-to-programmatically-address variant before you put the data into CollectionSpace, the better your data will work inside CollectionSpace.

- *Required?:* no
- *Defaults to:* `normalized`
- *Data type*: string
- *Allowed values*: `exact`, `normalized`

== batch_mode

`full record`:: The "normal" mode, using a CSV template that has all fields for the record type. Allows you to create a full record from each row of data in the CSV. For structured dates, a date string is entered in the column for the structured date group field, and the application handles parsing the date details. This processing cannot set period, association, note, certainty, qualifier, value, or unit fields within structured date detailed fields.
2 changes: 1 addition & 1 deletion lib/collectionspace/mapper/data_mapper.rb
Expand Up @@ -60,7 +60,7 @@ def add_short_id
else
term = response.split_data["termdisplayname"][0]
CollectionSpace::Mapper::Identifiers::AuthorityShortIdentifier.call(
-term
+term, handler.batch.authority_terms_duplicate_mode
)
end
response.add_identifier(shortid)
2 changes: 2 additions & 0 deletions lib/collectionspace/mapper/handler_full_record.rb
Expand Up @@ -76,6 +76,8 @@ class HandlerFullRecord
end

setting :batch, reader: true do
setting :authority_terms_duplicate_mode, default: "normalized",
reader: true
setting :check_record_status, default: true, reader: true
setting :date_format, default: "month day year", reader: true
setting :default_values, default: {}, reader: true
Original file line number Diff line number Diff line change
Expand Up @@ -3,19 +3,40 @@
module CollectionSpace
module Mapper
module Identifiers
-class AuthorityShortIdentifier < ShortIdentifier
-def initialize(**opts)
-super
-end
+class AuthorityShortIdentifier
+class << self
+def call(term, mode = "normalized")
+case mode
+when "normalized"
+prepped = prepped(term)
+"#{prepped}#{hashed(prepped)}"
+when "exact"
+"#{prepped(term)}#{hashed(term)}"
+end
+end

-def call
-"#{prepped_term}#{hashed_term}"
-end
+private
+
+def prepped(term)
+result = term.gsub(/\W/, "")
+return result unless result.empty?

-private
+# All non-Latin characters are removed from
+# shortIdentifiers as created by the CollectionSpace
+# application. However, CollectionSpace itself is able
+# to generate a unique hash value from the string to use
+# as the shortIdentifier value. We need to provide a
+# unique string that meets the Latin alphanumeric
+# requirements of a shortIdentifier value, so that
+# unique strings consisting fully of non-Latin
+# characters can be loaded without being flagged as
+# duplicates of one another.
+"spec#{term.bytes.join("b")}"
+end

-def hashed_term
-XXhash.xxh32(prepped_term)
+def hashed(term)
+XXhash.xxh32(term)
end
end
end
end
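The fallback for terms made up entirely of non-Latin characters can be tried in isolation. The sketch below copies the `prepped` logic from the diff into a standalone method (a top-level method here is an assumption made for brevity, not the gem's structure) and leaves the hash suffix out, since that part depends on the xxhash gem.

```ruby
# encoding: utf-8

# Standalone copy of the `prepped` logic shown in the diff.
def prepped(term)
  # \W matches anything outside ASCII letters, digits, and underscore,
  # so accented and CJK characters are removed as well.
  result = term.gsub(/\W/, "")
  return result unless result.empty?

  # Nothing survived: build an ASCII-only string from the UTF-8 bytes.
  "spec#{term.bytes.join("b")}"
end

puts prepped("Jurgén Klopp") # => JurgnKlopp
puts prepped("手日尸")        # => spec230b137b139b230b151b165b229b176b184
puts prepped("廿木竹")        # => spec229b187b191b230b156b168b231b171b185
```

Because the byte sequence is unique to each string, two different all-non-Latin terms can no longer collapse to the same empty prefix. A term that mixes Latin and non-Latin characters still takes the first branch and just loses its non-Latin characters.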
20 changes: 2 additions & 18 deletions lib/collectionspace/mapper/identifiers/short_identifier.rb
Expand Up @@ -5,26 +5,10 @@ module Mapper
module Identifiers
class ShortIdentifier
class << self
-def call(term)
-new(term: term).call
+def call(term, mode = "normalized")
+term.gsub(/\W/, "")
end
end
-
-def initialize(term:)
-@term = term
-end
-
-def call
-prepped_term
-end
-
-private
-
-attr_reader :term
-
-def prepped_term
-term.gsub(/\W/, "")
-end
end
end
end
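One apparent effect of giving `ShortIdentifier.call` an unused `mode` parameter is that a caller such as `set_identifier` in `ref_name.rb` can pass the mode to whichever identifier class it selects. The sketch below illustrates that idea with simplified stand-ins. The `Sketch` module, the `ID_CLASSES` mapping, and the `"authority"`/`"vocabulary"` keys are all hypothetical (the body of `id_class` is not shown in the diff), `Zlib.crc32` replaces `XXhash.xxh32`, and the non-Latin fallback is omitted.

```ruby
require "zlib"

module Sketch
  class ShortIdentifier
    # `mode` is accepted only so both classes share one call signature.
    def self.call(term, _mode = "normalized")
      term.gsub(/\W/, "")
    end
  end

  class AuthorityShortIdentifier
    def self.call(term, mode = "normalized")
      base = term.gsub(/\W/, "")
      # "exact" hashes the term as entered; otherwise hash the stripped term.
      hash_input = (mode == "exact") ? term : base
      "#{base}#{Zlib.crc32(hash_input)}" # crc32 is a stand-in hash
    end
  end

  # Hypothetical mapping from source type to identifier class.
  ID_CLASSES = {
    "authority" => AuthorityShortIdentifier,
    "vocabulary" => ShortIdentifier
  }.freeze

  # The caller does not need to know which class uses the mode.
  def self.set_identifier(type, term, mode)
    ID_CLASSES.fetch(type).call(term, mode)
  end
end

puts Sketch.set_identifier("vocabulary", "Sidewalk cafes.", "exact")
puts Sketch.set_identifier("authority", "Sidewalk cafes.", "exact")
puts Sketch.set_identifier("authority", "Sidewalk cafes.", "normalized")
```

The trade-off is a parameter that one class ignores, in exchange for a caller with no type-specific branching.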
7 changes: 4 additions & 3 deletions lib/collectionspace/mapper/tools/ref_name.rb
Expand Up @@ -17,7 +17,8 @@ def from_urn(urn)
end

def from_term(source_type:, type:, subtype:, term:, handler:)
-identifier = set_identifier(source_type, term)
+mode = handler.batch.authority_terms_duplicate_mode
+identifier = set_identifier(source_type, term, mode)
new(
type: type,
subtype: subtype,
Expand All @@ -36,8 +37,8 @@ def parse(urn)
fail CollectionSpace::Mapper::UnparseableRefNameUrnError.new(urn)
end

-def set_identifier(type, term)
-id_class(type).call(term)
+def set_identifier(type, term, mode)
+id_class(type).call(term, mode)
end

def id_class(type)
2 changes: 1 addition & 1 deletion lib/collectionspace/mapper/version.rb
Expand Up @@ -2,6 +2,6 @@

module CollectionSpace
module Mapper
-VERSION = "6.0.4"
+VERSION = "6.1.0"
end
end
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,34 @@
subject(:idgenerator) { described_class }

describe ".call" do
-it "generates hashed short identifiers for authorities" do
+it "generates hashed short identifiers for normalized authority terms" do
authorities = {
"Jurgen Klopp!" => "JurgenKlopp1289035554",
-"Achillea millefolium" => "Achilleamillefolium1482849582"
+"Jurgen Klopp" => "JurgenKlopp1289035554",
+"Jurgén Klopp" => "JurgnKlopp1116712995",
+"Jurgen klopp" => "Jurgenklopp1339498236",
+"Achillea millefolium" => "Achilleamillefolium1482849582",
+"手日尸" => "spec230b137b139b230b151b165b229b176b1841974196654",
+"廿木竹" => "spec229b187b191b230b156b168b231b171b1853413799245"
}

result = authorities.keys.map { |term| idgenerator.call(term) }
expect(result).to eq(authorities.values)
end
+
+it "generates hashed short identifiers for exact authority terms" do
+authorities = {
+"Jurgen Klopp!" => "JurgenKlopp1344333070",
+"Jurgen Klopp" => "JurgenKlopp2369906287",
+"Jurgén Klopp" => "JurgnKlopp1760941770",
+"Jurgen klopp" => "Jurgenklopp2197261388",
+"Achillea millefolium" => "Achilleamillefolium1698421148",
+"手日尸" => "spec230b137b139b230b151b165b229b176b1842743601998",
+"廿木竹" => "spec229b187b191b230b156b168b231b171b1853049386482"
+}
+
+result = authorities.keys.map { |term| idgenerator.call(term, "exact") }
+expect(result).to eq(authorities.values)
+end
end
end