[HOLD] only use first initial in name queries when not ambiguous or no alternate identities #840
@@ -74,7 +74,6 @@ DOI:
 HARVESTER:
   LOG: log/all_sources_harvester.log
   USE_MIDDLE_NAME: true
-  USE_FIRST_INITIAL: false
This configuration parameter is no longer needed, since we now decide selectively whether to use the first initial.
one refactoring suggestion, and one bit of logic that i'm not sure is working as expected, but otherwise LGTM
re-reviewed, approved, but leaving unmerged due to testing concern outlined in description
HOLD - may want to figure out how to do some comparison tests between both styles of queries for some subset of authors to see what the impact is
For people whose first initial and last name combination is ambiguous within Stanford, it is dangerous to OR this combination into the name query, as it produces many false positives for those people. The previous solution was to stop using first-initial-only variants when searching (i.e. setting Settings.HARVESTER.USE_FIRST_INITIAL to false), but this has the downside of missing legitimate publications that were published under only the first initial rather than the full first name.
This PR allows most name queries to execute with a first initial search variant to maximize results, unless we can determine
Note that we currently keep authors who have left Stanford in our database and mark them inactive, so even after someone has left Stanford, we still do not include the first initial query for the remaining authors at Stanford. This is desirable: otherwise, an author with an ambiguous first initial who leaves Stanford could cause many new false positives to appear for other authors with the same last name and first initial.
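As a rough illustration of the rule described above, here is a minimal sketch, not the code in this PR. The `Author` struct, `initial_key`, `ambiguous_first_initial?`, and `name_variants` are hypothetical names made up for this example, and the real ambiguity check (including any handling of alternate identities) may differ. It assumes ambiguity means that more than one known author, active or inactive, shares the same last name and first initial.

```ruby
# Hypothetical stand-in for an author record; not the project's actual model.
Author = Struct.new(:first_name, :last_name, :active, keyword_init: true)

# Collision key: last name plus first initial, compared case-insensitively.
def initial_key(author)
  [author.last_name.downcase, author.first_name[0].downcase]
end

# True when more than one known author shares the key. Inactive authors are
# deliberately counted, so a departure does not re-enable the initial variant.
def ambiguous_first_initial?(author, all_authors)
  all_authors.count { |other| initial_key(other) == initial_key(author) } > 1
end

# Name variants that would be ORed together in a name query (assumed format).
def name_variants(author, all_authors)
  variants = ["#{author.last_name}, #{author.first_name}"]
  unless ambiguous_first_initial?(author, all_authors)
    variants << "#{author.last_name}, #{author.first_name[0]}"
  end
  variants
end

authors = [
  Author.new(first_name: 'Jane', last_name: 'Smith', active: true),
  Author.new(first_name: 'John', last_name: 'Smith', active: false), # left, kept as inactive
  Author.new(first_name: 'Ada', last_name: 'Lovelace', active: true)
]

puts name_variants(authors[0], authors).inspect
puts name_variants(authors[2], authors).inspect
```

In this sketch, Jane Smith gets only the full-name variant because the inactive John Smith still shares "Smith, J", while Ada Lovelace also gets the first-initial variant.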
But it has some potential issues:
A more consistent solution is to have users add a first-initial variant to their alternate names if needed, though this requires user action and is unlikely to happen at scale.
This approach needs to be validated with the Profiles team.