[HOLD] only use first initial in name queries when not ambiguous or no alternate identities #840

Draft
wants to merge 6 commits into
base: main
from

Conversation

peetucket
Member

@peetucket peetucket commented Apr 23, 2018

HOLD - may want to figure out how to do some comparison tests between both styles of queries for some subset of authors to see what the impact is

For people whose combination of first initial and last name is ambiguous within Stanford, it is dangerous to OR that combination into the name query, because it produces many false positives for those people. The solution was to stop using first-initial-only variants when searching (i.e. setting Settings.HARVESTER.USE_FIRST_INITIAL to false), but this has the downside of missing some legitimate publications that were published using only the first initial and not the full first name.
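
To make the trade-off concrete, here is a minimal sketch of how a first-initial variant widens an OR'ed name query. The helper name `name_query` and the `"Last,First"` query syntax are assumptions for illustration only, not the harvester's actual query format.

```ruby
# Hypothetical helper (not from this PR): builds the name variants that
# would be OR'ed together in a name query.
def name_query(last_name, first_name, use_first_initial:)
  variants = ["#{last_name},#{first_name}"]
  # The first-initial variant widens the search. It recovers publications
  # that list only an initial, but it also matches anyone else who shares
  # the same last name and first initial.
  variants << "#{last_name},#{first_name[0]}" if use_first_initial
  variants.map { |variant| %("#{variant}") }.join(' OR ')
end

puts name_query('Smith', 'Jane', use_first_initial: true)
puts name_query('Smith', 'Jane', use_first_initial: false)
```

With the variant enabled, a query for Jane Smith would also match publications by any other "J Smith", which is the false-positive risk described above.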

This PR lets most name queries include a first-initial search variant to maximize results, unless we can determine that:

  • there actually is an ambiguity problem with first initials in our database of names OR
  • there are non-Stanford alternate identities (which means there could be name ambiguities at the other institution, which is also problematic)
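
The two conditions above can be sketched as a single predicate. This is only an illustration: `AuthorSketch`, its `alternate_institutions` field, and the way Stanford is recognized by name are assumptions, not the model or schema used in this PR.

```ruby
# Hypothetical stand-in for an author record; field names are assumptions.
AuthorSketch = Struct.new(:first_name, :last_name, :alternate_institutions,
                          keyword_init: true)

# True if any alternate identity is at an institution other than Stanford.
# Matching on the institution name is an assumption made for this sketch.
def non_stanford_identities?(author)
  author.alternate_institutions.any? { |inst| !inst.to_s.match?(/stanford/i) }
end

# Use the first-initial variant only when neither risk condition applies.
# `ambiguous_first_initial` is assumed to be computed elsewhere.
def use_first_initial?(author, ambiguous_first_initial:)
  return false if ambiguous_first_initial
  return false if non_stanford_identities?(author)

  true
end

jane = AuthorSketch.new(first_name: 'Jane', last_name: 'Smith',
                        alternate_institutions: ['Stanford University'])
puts use_first_initial?(jane, ambiguous_first_initial: false)
```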

Note that we currently keep authors who have left Stanford in our database and mark them inactive. Thus, even after someone has left Stanford, we still will not include the first-initial query for the remaining Stanford authors who share that last name and first initial. This is desirable: otherwise, an author with an ambiguous first initial who leaves Stanford could cause many new false positives to appear for other authors with the same last name and first initial.
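
A rough sketch of an ambiguity check that keeps inactive authors in the comparison, as described above. `PersonSketch` and its `active` flag are hypothetical names for illustration; the real check in this PR may be implemented differently (for example, as a database query).

```ruby
# Hypothetical stand-in for an author row; field names are assumptions.
PersonSketch = Struct.new(:first_name, :last_name, :active, keyword_init: true)

# True if more than one author (active or not) shares this author's
# last name and first initial. `all_authors` is assumed to include `author`.
def ambiguous_first_initial?(author, all_authors)
  key = ->(a) { [a.last_name.downcase, a.first_name[0].downcase] }
  # Deliberately no filter on `active`: someone who has left Stanford may
  # still have publications that a first-initial query would pull in.
  all_authors.count { |other| key.call(other) == key.call(author) } > 1
end

authors = [
  PersonSketch.new(first_name: 'Jane', last_name: 'Smith', active: true),
  PersonSketch.new(first_name: 'John', last_name: 'Smith', active: false),
  PersonSketch.new(first_name: 'Ann',  last_name: 'Lee',   active: true)
]
authors.each do |a|
  puts "#{a.first_name} #{a.last_name}: #{ambiguous_first_initial?(a, authors)}"
end
```

In this example Jane Smith stays ambiguous even though John Smith is inactive, so her queries would continue to omit the first-initial variant.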

But it has some potential issues:

  • it makes search behavior inconsistent from author to author, so coverage for some authors can be better than for others (but at least it's not equally worse for all)
  • this behavior may change for a given author depending on who else is at Stanford, and so can change over time
  • any change to queries runs the risk of unintentional side effects (e.g. some additional false positives being returned in an unexpected way)

A more consistent solution is to have users add a first-initial variant to their alternate names if needed, though this requires user action and is unlikely to happen at scale.

This approach needs to be validated with the Profiles team.

@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 8436412 to 85e42f6 Compare April 24, 2018 19:16
@peetucket peetucket changed the title [WIP] experiments in only using first initial when not ambiguous [WIP] only use first initial in name queries when not ambiguous or no alternate identities Apr 24, 2018
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 85e42f6 to 2705bc5 Compare April 24, 2018 19:33
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 2705bc5 to 8f96e22 Compare January 11, 2019 17:36
@peetucket peetucket changed the title [WIP] only use first initial in name queries when not ambiguous or no alternate identities only use first initial in name queries when not ambiguous or no alternate identities Mar 19, 2019
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch 4 times, most recently from 65c5488 to c4c7494 Compare March 20, 2019 19:47
@peetucket peetucket removed the ready label Apr 4, 2019
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from c4c7494 to 8bf2b1e Compare April 5, 2019 16:15
@peetucket peetucket changed the title only use first initial in name queries when not ambiguous or no alternate identities [WIP] only use first initial in name queries when not ambiguous or no alternate identities Apr 5, 2019
@peetucket peetucket changed the title [WIP] only use first initial in name queries when not ambiguous or no alternate identities only use first initial in name queries when not ambiguous or no alternate identities Apr 5, 2019
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 0fb69c0 to 772e7ca Compare April 5, 2019 18:33
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 772e7ca to 6cc4339 Compare August 14, 2019 19:03
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 6cc4339 to d975288 Compare November 23, 2020 18:11
@peetucket peetucket marked this pull request as draft November 23, 2020 18:20
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from d975288 to 2580879 Compare November 24, 2020 00:35
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 2580879 to 4497509 Compare May 4, 2021 17:07
@peetucket peetucket changed the base branch from master to main May 4, 2021 17:09
@@ -74,7 +74,6 @@ DOI:
 HARVESTER:
   LOG: log/all_sources_harvester.log
   USE_MIDDLE_NAME: true
-  USE_FIRST_INITIAL: false
Member Author

this configuration parameter is no longer needed since we will now selectively decide if we want to use the first initial or not

@peetucket peetucket marked this pull request as ready for review May 4, 2021 18:28
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 4497509 to d99e2fc Compare May 11, 2021 21:56
Member

@jmartin-sul jmartin-sul left a comment

one refactoring suggestion, and one bit of logic that i'm not sure is working as expected, but otherwise LGTM

app/models/author.rb (resolved)
lib/agent/author_name.rb (outdated, resolved)
@peetucket peetucket changed the title only use first initial in name queries when not ambiguous or no alternate identities [HOLD] only use first initial in name queries when not ambiguous or no alternate identities May 12, 2021
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch 3 times, most recently from d4b7c5e to 419f79e Compare May 12, 2021 20:51
@jmartin-sul
Member

re-reviewed, approved, but leaving unmerged due to testing concern outlined in description

@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 419f79e to e487aff Compare May 13, 2021 21:47
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from e487aff to 44b1c45 Compare June 8, 2021 23:12
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 55b19f7 to 89d250b Compare March 11, 2022 03:32
@peetucket peetucket marked this pull request as draft April 28, 2022 18:30
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from 89d250b to e2898c3 Compare July 26, 2023 17:27
@peetucket peetucket force-pushed the only-use-middle-initial-when-not-ambiguous branch from e2898c3 to 594bd23 Compare October 4, 2023 19:27
2 participants