disambr
is an R 📦 that provides a flexible framework for disambiguation of named entities. Currently this package implements the AEV algorithm (van den Akker et al., 2020) for Web of Science author disambiguation.
The idea of basic usage is simply piping the disambiguation procedures. Each procedure takes list of sets as input and returns list of sets either by adding a new set or modifying input list of sets.
data |>
disambr_set_on_same_paper() |>
disambr_set_similar_initials() |>
disambr_set_similar_last_names()
Sequence (piped functions) of disambiguation procedures defines a disambiguation algorithm.
disambr_eva <- function(data) {
data |>
disambr_set_on_same_paper() |>
disambr_set_similar_initials() |>
disambr_set_similar_last_names()
}
The creation of co-authorship networks is a valuable way to depict the social structure of scientific fields. However, these co-authorship networks often get distorted because of the problems of author name synonymy (the same author is split into two nodes because his name is spelled differently in different publications) and author name homonymy (different authors are compounded into one node because they share the same name). The practice of author name disambiguation (AND) tries to solve these problems by correctly identifying the authors of scientific articles.
Several algorithms have been put forward in the context of AND, but none of them are suitable for large datasets of the Web of Science database. Therefore, in an earlier part of this project we proposed a new unsupervised learning algorithm based on the most recent AND literature. This so-called AEV-algorithm involves two phases: a blocking phase, in which pairs of authors are selected that are sufficiently similar, and a disambiguation phase, in which similar author names are either split or combined into one node based on information retrieved from the Web of Science database. In the disambiguation phase, the algorithm uses information about co-authorship, e-mail addresses, institutional affiliations, cited references, and article keywords (van den Akker et al., 2020).
## Installs and loads disambr
devtools::install_github("stasvlasov/disambr")
library("disambr")
All disambiguation procedures used in disambr
package work with sets. A set is basically any R object that can represent mathematical sets (e.g., set of authors, set of companies) with special attributes that are used by disabmr
functions to identify the kind of set it is working with or produced (e.g., a set of authors that are likely to be the same person or a set of companies that are definitely different companies, etc.).
The attributes that are currently used to define/describe sets as well as their values are listed below:
disambr_set_name
- string name of the set
disambr_entity
- either
person
,organization
,publication
- either
disambr_set_type
similar_entities
,different_entities
disambr_set_coefficient
- number between 0 and 1 indicating how strongly entities are similar or different from each other. It is used only for establishing order of sets processing (e.g., start with sets of least similar entities)
disambr_set_collection
single_set_table
(first column assumed to store entity id or entity id is just row number ifentity_id_reference
attribute is set toself
, see below),list_of_sets_as_lists
(each set is a list of entity ids),dyads_table
(first and second columns assumed to be ids for the pair of entities)
disambr_entity_id_reference
self
, name of other set as in itsset_name
attribute
disambr_entity_id_reference_md5_sum
- md5 cache sum of the object where entities ids are referring to ensure that we will get to correct data for entities in the set.
disambr_recipe
- list of
disambr
procedures that were applied to produce given set(s) - if it is a named list then first item is procedure name and the rest are properties:
procedure
file_name
file_md5sum
(to check file identity later)file_header
(to check for consistency between read files)
- list of
- To allow for modular design each disambiguation procedure should accept and return list of sets (e.g., same person sets, different person sets, other probability of being the same person sets)
- List of sets from (chain of) various procedures will be then merged (using basis set algebra) according to the specific disambiguation algorithm to produce final list of sets.
- Initial input should be in the form of a list of initial sets (the simplest input is one set with every person likely to be non unique, e.g., data.table of authors from Web of Science bibliography data).
- When reading data the package should try to do as many sets as possible on a fly (cleaning and splitting initial data to different types of entities)
- Try to implement lazy data loading and processing where possible
- All functions should have a verb
- Variables should not have a verb
- Everything that is available to user should start with “disambr_” (i.e., package name)
- Internal functions and variables does not have to have this prefix
- Use mainstream coding style guides everywhere where possible
This research was supported (in part) by the Fetzer Franklin Fund of the John E. Fetzer Memorial Trust.
van den Akker, O. R., Epskamp, Sacha, & Vlasov, S. A. (2020). The AEV Algorithm—Author name disambiguation for large Web of Science datasets.
FN | File Name |
VR | Version Number |
PT | Publication Type (J=Journal; B=Book; S=Series; P=Patent) |
AU | Authors |
AF | Author Full Name |
BA | Book Authors |
BF | Book Authors Full Name |
CA | Group Authors |
GP | Book Group Authors |
BE | Editors |
TI | Document Title |
SO | Publication Name |
SE | Book Series Title |
BS | Book Series Subtitle |
LA | Language |
DT | Document Type |
CT | Conference Title |
CY | Conference Date |
CL | Conference Location |
SP | Conference Sponsors |
HO | Conference Host |
DE | Author Keywords |
ID | Keywords Plus® |
AB | Abstract |
C1 | Author Address |
RP | Reprint Address |
EM | E-mail Address |
RI | ResearcherID Number |
OI | ORCID Identifier (Open Researcher and Contributor ID) |
FU | Funding Agency and Grant Number |
FX | Funding Text |
CR | Cited References |
NR | Cited Reference Count |
TC | Web of Science Core Collection Times Cited Count |
Z9 | Total Times Cited Count* |
U1 | Usage Count (Last 180 Days) |
U2 | Usage Count (Since 2013) |
PU | Publisher |
PI | Publisher City |
PA | Publisher Address |
SN | International Standard Serial Number (ISSN) |
EI | Electronic International Standard Serial Number (eISSN) |
BN | International Standard Book Number (ISBN) |
J9 | 29-Character Source Abbreviation |
JI | ISO Source Abbreviation |
PD | Publication Date |
PY | Year Published |
VL | Volume |
IS | Issue |
SI | Special Issue |
PN | Part Number |
SU | Supplement |
MA | Meeting Abstract |
BP | Beginning Page |
EP | Ending Page |
AR | Article Number |
DI | Digital Object Identifier (DOI) |
D2 | Book Digital Object Identifier (DOI) |
PG | Page Count |
P2 | Chapter Count (Book Citation Index) |
WC | Web of Science Categories |
SC | Research Areas |
GA | Document Delivery Number |
UT | Accession Number |
PM | PubMed ID |
ER | End of Record |
EF | End of File |
name | comments |
---|---|
bibliometrix | reads only plaintext format into bibliometrixDB object |
wosr | Requires WoS API subscription |
refsplitr | package ‘refsplitr’ is not available (for R version 4.0.1) |
read.wos.R | Does not work… |
metagear | scrape_bibliography by DOI |
hindexcalculator | ? |
Site: https://bibliometrix.org/index.html
git clone https://github.com/massimoaria/bibliometrix
install.packages("bibliometrix")
library("bibliometrix")
library("magrittr")
bmdata <- convert2df(file = 'https://www.bibliometrix.org/datasets/wos_plaintext.txt', dbsource = 'wos', format = "plaintext")
bmdata %>% class
## [1] "data.frame" "bibliometrixDB"
bmdata %>% names
bmdata <- convert2df(file = 'https://www.bibliometrix.org/datasets/wos_plaintext.txt', dbsource = 'wos', format = "csv")
CRAN docs: https://cran.r-project.org/web/packages/metagear/metagear.pdf
GitHub: https://github.com/cran/metagear/
git clone https://github.com/cran/metagear/
scrape_bibliography
CRAN docs: https://cran.r-project.org/web/packages/BibPlots/BibPlots.pdf
Paper: https://arxiv.org/pdf/1905.09095.pdf
CRAN docs: https://cran.r-project.org/web/packages/hindexcalculator/hindexcalculator.pdf
git clone https://github.com/ropensci/refsplitr
install.packages("refsplitr")
library("refsplitr")
Requires premium WoS API - https://clarivate.com/webofsciencegroup/solutions/xml-and-apis
CRAN doc: https://cran.r-project.org/web/packages/wosr/wosr.pdf
Site: https://github.com/vt-arc/wosr
GitHub: https://github.com/vt-arc/wosr
git clone https://github.com/vt-arc/wosr
install.packages("wosr")
library(wosr)
## Get session ID
sid <- auth("[email protected]", password = "")
## Error: No matches returned for Username [email protected]
## Query WoS to see how many results match your query
query <- 'TS = ("animal welfare") AND PY = (2002-2003)'
query_wos(query, sid = sid)
## Download data
pull_wos(query, sid = sid)
git clone https://github.com/alberto-martin/read.wos.R
## load functions
## --------------------------------------------------------------------------------
source("../lib/read.wos.R/read.wos.functions.R")
## --------------------------------------------------------------------------------
## test
wos.data.mp <- read.wos(dir("../data/Journals in Mathematical Psychology", no.. = TRUE, full.names = TRUE))
## Error in substring(fields, 4) : invalid multibyte string at '<ff><fe>P'
## In addition: Warning message:
## In readLines(files[1], n = 1) : line 1 appears to contain an embedded nul
wos.data.mp <- read.wos("../data/Journals in Mathematical Psychology/Psychonomic Bulletin & Review 2.txt")
## Error in substring(fields, 4) : invalid multibyte string at '<ff><fe>P'
## In addition: Warning message:
## In readLines(files[1], n = 1) : line 1 appears to contain an embedded nul
wos.data <- read.wos("/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments/LN Public NAICS records from 10001 to 10500.txt")
## Error in substring(fields, 4) : invalid multibyte string at '<ff><fe>P'
## In addition: Warning message:
## In readLines(files[1], n = 1) : line 1 appears to contain an embedded nul
(save-excursion
(let ((calls '( "reset-working-directory"
"tangle-readme"
"tangle-buffer"
"generate-package-docs"
))
;; turn off babel prompts
org-confirm-babel-evaluate)
(mapcar
(lambda (name)
(let (org-confirm-babel-evaluate)
(save-excursion
(org-babel-goto-named-src-block name)
(or (org-babel-execute-src-block-maybe)
(org-babel-lob-execute-maybe)))))
calls)))
https://github.com/r-lib/actions/blob/v1/examples/README.md
usethis::use_github_action("check-release")
usethis::use_github_action("test-coverage")
usethis::use_github_action("pkgdown")
usethis::use_github_actions_badge(name = "R-CMD-check")
tinytest::setup_tinytest(".")
## Remove tangled and generated files
## --------------------------------------------------------------------------------
files_to_remove <- c(
"DESCRIPTION"
, "NAMESPACE"
, "LICENSE.md"
, "README.md"
)
dirs_to_remove <-c(
"man"
, "inst/tinytest"
, "data"
, "R"
)
## remove files in dirs recursively
sapply(dirs_to_remove
, function(dir) {
file.remove(list.files(dir , recursive = TRUE, full.names = TRUE))
})
## remove files and empty dirs
sapply(c(files_to_remove, dirs_to_remove)
, \(file) if(file.exists(file)) file.remove(file))
(require 'org-goto)
(save-excursion
(org-goto--local-search-headings "Description" nil t)
(org-pandoc-export-to-markdown nil 'subtreep))
(org-babel-tangle)
## --------------------------------------------------------------------------------
## First load default packages getOption("defaultPackages")
## Otherwise it will add it at the end which can mask some funcitons
.First.sys()
## some packages installations read .Rprofile loops the install
## current_wd <- getwd()
## setwd("~/")
## --------------------------------------------------------------------------------
## Load or Install Packages
## --------------------------------------------------------------------------------
for(pkg in c('devtools'
, 'roxygen2'
## , 'xml2'
## , 'tibble'
, 'stringi'
## , 'stringr'
, 'stringdist'
, 'digest'
, 'crayon'
, 'magrittr'
, 'lubridate'
## , 'plyr'
## , 'pipeR'
## , 'ggplot2'
, 'pbapply'
, 'testthat'
, 'microbenchmark'
, 'data.table'
## , 'dplyr'
))
if(!require(pkg, character.only = TRUE)) {
install.packages(pkg, repos = 'http://cloud.r-project.org')
require(pkg, character.only = TRUE) }
## restore current working directore
## setwd(current_wd)
## --------------------------------------------------------------------------------
## library(disambr)
## update.packages(ask = FALSE, repos = 'http://cloud.r-project.org')
## --------------------------------------------------------------------------------
## Load My pakcages
## --------------------------------------------------------------------------------
## detach(package:romRDS, unload = TRUE)
## remove.packages("romRDS")
if (!require("romRDS", character.only = TRUE)) {
if(!require("devtools")) {
install.packages("devtools"
, repos = 'http://cloud.r-project.org'
, dependencies = TRUE)
require("devtools", character.only = TRUE)
}
install_github("stasvlasov/romRDS")
require("romRDS", character.only = TRUE)
}
## --------------------------------------------------------------------------------
packages | link |
---|---|
tinytest | https://github.com/markvanderloo/tinytest/blob/master/pkg/README.md |
crayon |
packages | current_version | ensure_version | link |
---|---|---|---|
data.table | 1.14.2 | 1.13.0 | |
stringi | 1.7.6 | 1.6.0 | |
parallel | 4.1.2 | 4.0.0 | |
pbapply | 1.5.0 | 1.5.0 | |
stringdist | 0.9.8 | 0.9.0 |
packageVersion(pkg)
## Generate package description
## --------------------------------------------------------------------------------
list(Package = "disambr"
, Title = "disambr - Named Entity Disambiguation in R"
, Description = "disambr - Named Entity Disambiguation in R"
, `Authors@R` = c(person(given = c("Stanislav" ,"A.") , family = "Vlasov"
, email = "[email protected]"
, role = c("aut", "cre"))
, person(given = c("Olmo", "R."), family = "van den Akker"
, email = "[email protected]"
, role = "aut")
, person(given = "Sacha", family = "Epskamp"
, email = "[email protected]"
, role = "aut"))
, Imports = paste(
paste0(dependencies$packages, " (>= ", dependencies$ensure_version, ")")
, collapse = ", ")
## , Depends = "R (>= 4.0), data.table(>= 1.13.0)"
, Depends = "R (>= 4.0)"
, Suggests = paste(suggests$packages, collapse = ", ")
, Version = "0.0.0.9000"
, Date = Sys.Date()
, URL = "https://github.com/stasvlasov/disambr"
, BugReports = "https://github.com/stasvlasov/disambr/issues"
, References = "This research was supported (in part) by the Fetzer Franklin Fund of the John E. Fetzer Memorial Trust. The EVA disambiguation algorithm is described in van den Akker, O. R., Epskamp, Sacha, & Vlasov, S. A. (2020). The AEV Algorithm—Author name disambiguation for large Web of Science datasets."
) |> usethis::use_description(check_name = TRUE
, roxygen = TRUE)
usethis::use_lgpl_license()
## use_tidy_description()
## ----------------------------------------------------------------------------
## Update name spaces and documentation for functions
roxygen2::roxygenise()
#' @details
#' This package provides a framework for disambiguating named entities (e.g., authors in large bibliometric databases)
#'
#' Package provides following main functions
#' - disambr_read - reads WoS data
#' - disambr_aev - implementation of the AEV algorithm (van den Akker et al., 2020) for Web of Science author disambiguation.
#'
#' This is work in progress. Please, file an issues or suggestion if you have any.
#' @keywords internal
"_PACKAGE"
## Remove
## --------------------------------------------------------------------------------
detach(package:disambr, unload = TRUE)
remove.packages("disambr")
## Deploy
## --------------------------------------------------------------------------------
devtools::install(".")
option | behaviour when not set |
---|---|
disambr_save_as | NULL |
disambr_save_set_prefix | disambr-set. |
disambr_save_set_dir | disambr-sets |
disambr_save_set_time_stamp | TRUE |
disambr_verbose | TRUE |
disambr_mess_pretty | FALSE |
disambr_get_output_set | FALSE |
disambr_read_output_set | FALSE |
.onAttach <- function(libname, pkgname) {
options(
)
}
##' Formats time difference as X days HH:MM:SS
##'
##' from https://stackoverflow.com/questions/27312292
##' @param t time diff
##' @return formatted time diff string
##'
##' @export
dhms <- function(t) {
t <- abs(as.numeric(t, units = "secs"))
paste(if((t %/% (60*60*24)) > 0) paste(t %/% (60*60*24), "days") else NULL
,paste(formatC(t %/% (60*60) %% 24, width = 2, format = "d", flag = "0")
, formatC(t %/% 60 %% 60, width = 2, format = "d", flag = "0")
, formatC(t %% 60, width = 2, format = "d", flag = "0")
, sep = ":"))
}
##' Creates message string for reporting during procedures
##' @param mess Message to report. If prefixed by h `h_marks` it will be ouline of level `h`
##' @param h Forse specific ouline level of message
##' @param indent Forse indentation
##' @param prefix Add overal prefix
##' @param h_marks Marks that sets outline. Default is "-". Can be many characters, e.g. "-*#".
##' @param h_prefix Character vector of prefixes for each outline level
##' @param h_prefix_sep Separator between `h_prefix` and `mess`
##' @param pretty Whether to use "crayon" package for pretty printing
##' @param mess_color Color of message
##' @param h_prefix_color Color of ouline prefix
##' @param ... Here we can pass `verbose` argument from upper functions. Default is TRUE
##' @return Message string
##'
##' @export
create_message <- function(mess
, h = integer(0)
, indent = integer(0)
, prefix = ""
, h_marks = "-"
, h_prefix = character()
, h_prefix_sep = " "
, pretty = getOption("disambr_mess_pretty")
, mess_color = "green"
, h_prefix_color = "blue"
, ...) {
## set outline
if(isTRUE(length(h) != 1)) {
mess.regex <- paste0("^([", h_marks, "]*)\\s*(.*)")
mess.parsed <-
stringi::stri_match_first_regex(mess, mess.regex)
mess <- mess.parsed[[3]]
h <- nchar(mess.parsed[[2]]) + 1
}
## set h_prefix
if(length(h_prefix) < h) {
h_prefix_l <- length(h_prefix)
## if h_prefix is NULL
if(h_prefix_l == 0) {
h_prefix <- ""
h_prefix_l <- 1
}
h_prefix <- c(h_prefix, rep(h_prefix[h_prefix_l], h - h_prefix_l))
}
## set indentation
if(isTRUE(length(indent) != 1)) {
indent <-
nchar(paste(c("", h_prefix)[1:h], collapse = "")) +
(h-1)*nchar(h_prefix_sep)
indent <- strrep(" ", indent)
} else{
indent <-
switch(class(indent)
, numeric = if(indent == 0) ""
else strrep(" ", indent)
, character = indent)
}
## create message
h_prefix <- h_prefix[h]
mess.plain <-
paste0(prefix
, indent
, h_prefix
, h_prefix_sep
, mess)
if(isTRUE(pretty) &&
## in case I want to move crayon to Sugests:
requireNamespace("crayon", quietly = TRUE)) {
h_prefix.style <-
crayon::make_style(h_prefix_color)
mess.style <-
crayon::make_style(mess_color)
mess.style <-
crayon::combine_styles(crayon::bold, mess.style)
mess <-
Reduce(crayon::`%+%`
, list(prefix
, indent
, h_prefix.style(h_prefix)
, h_prefix_sep
, mess.style(mess)))
} else {
mess <- mess.plain
}
return(mess)
}
##' Report a message with message()
##' @param mess Message to report. If prefixed by h `h_marks` it will be ouline of level `h`
##' @param h_prefix Character vector of prefixes for each outline level
##' @inheritDotParams create_message
##' @return Same as `message` returns
##'
##' @md
##' @export
disambr_message <- function(mess
, h_prefix = c("disambr:", "-")
, ...) {
## skip is not verbose (verbose by default)
if(isFALSE(list(...)$verbose)) return()
mess <- create_message(mess, h_prefix = h_prefix, ...)
## post message
message(mess)
}
expect_message(disambr_message("Hello world!"))
expect_null(disambr_message("Hello world!", verbose = FALSE))
##' Post a starting message for disambr procedure. Records time started in `disambr_start_time` variable in its `parent.frame()`
##' @param start_mess_prefix Prefix for staring message
##' @inheritDotParams disambr_message
##' @return time started
##'
##' @export
disambr_message_start <- function(start_mess_prefix = "Making set -"
, ...) {
## get name of running procedure
running_procedure_name <- deparse(sys.calls()[[sys.nframe() - 1]])
## clean the call string
running_procedure_name <-
stringi::stri_replace_first_regex(running_procedure_name
, c("^disambr_set_([^()]+).*")
, "$1")
mess <- paste(start_mess_prefix, running_procedure_name)
disambr_message(mess, ...)
## record the time started
assign("disambr_start_time", Sys.time(), pos = parent.frame())
}
foo <- function() {
disambr_message_start()
return(disambr_start_time)
}
expect_inherits(foo(), c("POSIXt", "POSIXct"))
expect_message(foo())
##' Post a starting message for disambr procedure. Records time started in `disambr_start_time` variable in its `parent.frame()`
##' @param mess Prefix for staring message
##' @param append_running_procedure_name Whether to append running procedure name
##' @inheritDotParams disambr_message
##' @return time started
##'
##' @export
disambr_message_finish <- function(mess = "Finished -"
, append_running_procedure_name = TRUE
, ...) {
if(isTRUE(append_running_procedure_name)) {
## get name of running procedure
running_procedure_name <- deparse(sys.calls()[[sys.nframe() - 1]])
## clean the call string
running_procedure_name <-
stringi::stri_replace_first_regex(running_procedure_name
, c("^disambr_set_([^()]+).*")
, "$1")
mess <- paste(mess, running_procedure_name)
}
## assess procedure duration
if(exists("disambr_start_time", where = parent.frame())) {
disambr_duration <-
dhms(Sys.time() - get("disambr_start_time", pos = parent.frame()))
mess <- paste(mess, "in", disambr_duration)
}
disambr_message(mess, ...)
}
foo <- function() {
disambr_start_time <- Sys.time() - 1000
disambr_message_finish()
}
expect_message(foo(), "foo.*in")
##' Report a message with warning()
##' @param mess Message to report. If prefixed by h `h_marks` it will be ouline of level `h`
##' @param h_prefix Character vector of prefixes for each outline level
##' @param call. See `warning`
##' @param immediate. See `warning`
##' @inheritDotParams create_message
##' @return Same as `warning` returns
##'
##' @md
##' @export
disambr_warning <- function(mess
, h_prefix = c("disambr:", "-")
, call. = FALSE
, immediate. = TRUE
, ...) {
mess <- create_message(mess, h_prefix = h_prefix, ...)
## post message
warning(mess, call. = call. , immediate. = immediate.)
}
expect_warning(disambr_warning("Ahtung!"))
##' Report a message with stop()
##' @param mess Message to report. If prefixed by h `h_marks` it will be ouline of level `h`
##' @param call. See `stop`
##' @inheritDotParams create_message
##' @return Same as `stop` returns
##'
##' @md
##' @export
disambr_stop <- function(mess
, call. = FALSE
, ...) {
parent.call <- deparse(sys.calls()[[sys.nframe() - 1]])
parent.call <- as.character(parent.call)
mess <-
create_message(mess
, h_prefix = paste0(parent.call, ":")
, ...)
## post message
stop(mess, call. = call.)
}
expect_error(disambr_stop())
##' Extention extractor. Same as tools::file_ext but for NULL input returns NULL instead of logical(0).
##' @param f file name
##' @return extention
##'
##' @export
get_file_extension <- function(f) {
if(length(f) == 0) {
return(NULL)
} else {
tools::file_ext(f)
}
}
## my.file <- '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt'
## my.file1 <- "/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments/LN Public NAICS records from 10001 to 10500.txt"
## get_file_extension(my.file)
## get_file_extension(my.file1)
## get_file_extension("sdfsdf....")
## get_file_extension("sdf")
## get_file_extension("")
## get_file_extension(NULL)
## get_file_extension(NA)
## get_file_extension("...sdf...sdf.df...sd.")
## get_file_extension(".")
## get_file_extension(".....")
## ## build in
## tools::file_ext(my.file)
## tools::file_ext(my.file1)
## tools::file_ext("sdfsdf....")
## tools::file_ext("sdf")
## tools::file_ext("")
## tools::file_ext(NULL)
## tools::file_ext(NA)
## tools::file_ext("...sdf...sdf.df...sd.")
## tools::file_ext(".")
## tools::file_ext(".....")
##' Stops process unless cond is true
##' @param cond condition to test
##' @param message_if_false message_if_false
##' @param stop_if_false stop_if_false
##' @param return_if_true return_if_true
##' @param return_if_false return_if_false
##' @return
##'
##' @export
stop_unless <- function(cond
, message_if_false = paste("cond in not TRUE")
, stop_if_false = TRUE
, return_if_true = TRUE
, return_if_false = isFALSE(return_if_true)) {
if(isTRUE(cond)) {
return(return_if_true)
} else if(isTRUE(stop_if_false)){
stop(message_if_false, call. = FALSE)
} else {
warning(message_if_false, call. = FALSE)
return(return_if_false)
}
}
expect_warning(stop_unless(FALSE, "Lala", FALSE))
expect_error(stop_unless(FALSE))
expect_true(stop_unless(TRUE))
expect_warning(stop_unless("sdfasdf", stop_if_false = FALSE))
expect_warning(stop_unless("sdfasdf", stop_if_false = FALSE, return_if_true = FALSE))
##' Returns vector of file paths from path(s) recursively
##' @param files_path Path(s) where the files are
##' @param recursive Whether to look in subfolders recursively
##' @return Vector of file paths from path(s) recursively
##'
##' @export
parse_files_path <- function(files_path, recursive = TRUE) {
stop_unless(is.character(files_path), "Files path shoud be a character string!")
files_path <-
lapply(files_path, function(file.path) {
if(stop_unless(file.exists(file.path)
, paste(file.path, " - does not exist!")
, stop_if_false = FALSE
, return_if_true = FALSE)) {
NULL
} else if(dir.exists(file.path)) {
dir(file.path
, full.names = TRUE
, recursive = recursive)
} else {
file.path
}
})
return(unique(normalizePath(unlist(files_path))))
}
expect_error(parse_files_path(3423))
expect_warning(parse_files_path(c(".", "gibirish file")))
expect_inherits(parse_files_path("."), "character")
## empty dirs
tmp.dir <- "test_dir_for_parse_files_path"
dir.create(tmp.dir, showWarnings = FALSE)
expect_equal(parse_files_path(tmp.dir), character(0))
file.remove(tmp.dir)
##' Reads file as UTF-8, convert it if other encoding is deteted
##' @param f file path
##' @param bytes_to_check how long to check for encoding (save time for large files)
##' @return file text as string
##'
##' @export
read_to_utf8 <- function(f, bytes_to_check = 2^14) {
## read file as raw bytes (not to Assume any encodings)
bin <- readBin(f, raw(), n = file.size(f))
## check first 2^14 bytes for encoding
encoding <- stringi::stri_enc_detect2(bin[1:bytes_to_check])[[1]][[1]][1]
if(is.na(encoding)) {
message("Could not detect encoding of file: ", f)
s <- rawToChar(bin, multiple = FALSE)
} else if(!(encoding %in% iconvlist())) {
message("Does not know how to convert from ", encoding, "for file: ", f)
} else if(encoding == "UTF8") {
s <- rawToChar(bin, multiple = FALSE)
} else {
## message("Converting to utf-8")
s <- iconv(list(NULL, bin), from = encoding, to = "UTF-8")
}
return(s)
}
## stringi::stri_enc_detect2(NULL)[[1]][[1]][1]
## stringi::stri_enc_detect2(NA)[[1]][[1]][1]
## stringi::stri_enc_detect2(123)[[1]][[1]][1]
## stringi::stri_enc_detect2("")[[1]][[1]][1]
## stringi::stri_enc_detect2("sadf")[[1]][[1]][1]
##' Fixed end of line characters in wierd text
##' @param s text string
##' @param assoc.file file name where it came from
##' @param verbose Be chatty
##' @return fixed sting
##'
##' @export
recode_return_characters <- function(s, assoc.file = NA, verbose = FALSE) {
has_return_chars <- function(s, test.first.n.char = 10^4) {
s <- stringi::stri_sub(s, to = test.first.n.char)
any(stringi::stri_detect_regex(s, "\\r"))
}
if(has_return_chars(s)) {
if(verbose) message("disambr: '\\r' char in the file: ", assoc.file
, "\n- replacing with '\\n' to fix 'datatable::fread'")
s <- stringi::stri_replace_all_regex(s, "\\R+", "\n")
}
return(s)
}
##' Makes list of each element of l
##' @param l sequence or list
##' @param l.name same name will be applies to each element
##' @return list of lists
##'
##' @export
disambr_listify_list <- function(l, l.name = NULL) {
if(isTRUE(l.name == "")) l.name = NULL
## case when all are 1 length (vector or list of single length elements)
lapply(l, function(x) {
x <- list(x)
names(x) <- l.name
return(x)
})
}
##' cbinds lists and names each element as name of each list in ...
##' @param ... Lists to cbin
##' @return Lists
##' @export
disambr_cbind_lists <- function(...) {
lists <- eval(...)
lists_n <- length(lists)
lists_names <- names(lists)
cbind_list <- disambr_listify_list(lists[[1]], lists_names[1])
for (i in 2:lists_n) {
cbind_list <-
mapply(c
, cbind_list
, disambr_listify_list(lists[[i]], lists_names[i])
, SIMPLIFY = FALSE)
}
return(cbind_list)
}
##' Fuzzy match all combinations of character vector
##' @param bank bank
##' @param method see method in stringdist
##' @param max_dist see maxDist in stringdist
##' @param id_name names that will be suffixed with _1 and _2
##' @return data.table
##'
##' @export
match_fuzzy <- function(bank, method, max_dist, id_name) {
id_name_1 <- paste0(id_name, "_1")
id_name_2 <- paste0(id_name, "_2")
match_fuzzy_x <- function(x) {
matched <- stringdist::ain(bank, x
, maxDist = max_dist
, method = method
, matchNA = FALSE)
if(any(matched)) {
matched <- bank[matched]
matched <- data.table::data.table(x, matched)
data.table::setnames(matched, c(id_name_1, id_name_2))
} else {
NULL
}
}
match_x <- function(x) {
matched <- bank %in% x
if(any(matched)) {
matched <- bank[matched]
matched <- data.table::data.table(x, matched)
data.table::setnames(matched, c(id_name_1, id_name_2))
} else {
NULL
}
}
if(max_dist > 0) {
matched_list <- lapply(bank, match_fuzzy_x)
} else if(max_dist == 0) {
matched_list <- lapply(bank, match_x)
} else {
stop()
}
return(data.table::rbindlist(matched_list))
}
expect_equal(
match_fuzzy(c("sdfsdf", "sfawefwsd", "sdfwefad", ";sldwaf", "asdfwaf")
, method = "lv"
, max_dist = 3
, id_name = "id")
, structure(list(id_1 = c("sdfsdf", "sdfsdf", "sfawefwsd", "sdfwefad",
";sldwaf", ";sldwaf", "asdfwaf", "asdfwaf", "asdfwaf"), id_2 = c("sdfsdf",
"asdfwaf", "sfawefwsd", "sdfwefad", ";sldwaf", "asdfwaf", "sdfsdf",
";sldwaf", "asdfwaf")), row.names = c(NA, -9L), class = c("data.table",
"data.frame")))
expect_equal(nrow(match_fuzzy(c("sdfsdf", "sfawefwsd", "sdfwefad", ";sldwaf", "asdfwaf")
, method = "lv"
, max_dist = 39
, id_name = "id")), 25)
##' Returns the index of element in upper triangle of squared `n` by `n` matrix
##'
##' inspired from https://math.stackexchange.com/questions/2134011 but modified so
##' index starts from 1 as the original answer was for index starting from 0
##'
##' @param i row index
##' @param j column index
##' @param n the size of squared matrix
##'
##' @return An index as integer number. Index starts from 1
get_upper_triangle_index <- function(i, j, n) {
if(j >= i) {
n*(n - 1)/2 - (n - i)*(n - i + 1)/2 + j
} else {
get_upper_triangle_index(j, i, n)
}
}
get_upper_triangle_index <- disambr:::get_upper_triangle_index
## 3x3
m <- matrix(c(1,2,3
,0,4,5
,0,0,6)
, nrow = 3
, byrow = TRUE)
i <- 1
j <- 2
expect_equal(
m[i,j]
, get_upper_triangle_index(i,j,dim(m)[1])
)
i <- 3
j <- 3
expect_equal(
m[i,j]
, get_upper_triangle_index(i,j,dim(m)[1])
)
i <- 2
j <- 3
expect_equal(
m[i,j]
, get_upper_triangle_index(j,i,dim(m)[1])
)
## 4x4
m <- matrix(c(1,2,3,4
,0,5,6,7
,0,0,8,9
,0,0,0,10)
, nrow = 4
, byrow = TRUE)
i <- 1
j <- 2
expect_equal(
m[i,j]
, get_upper_triangle_index(i,j,dim(m)[1])
)
i <- 3
j <- 3
expect_equal(
m[i,j]
, get_upper_triangle_index(i,j,dim(m)[1])
)
i <- 2
j <- 3
expect_equal(
m[i,j]
, get_upper_triangle_index(j,i,dim(m)[1])
)
i <- 4
j <- 2
expect_equal(
m[j,i]
, get_upper_triangle_index(i,j,dim(m)[1])
)
i <- 1
j <- 4
expect_equal(
m[i,j]
, get_upper_triangle_index(i,j,dim(m)[1])
)
##' Checks if sets with certain attribures are present
##' @param sets Sets to filter on sets attributes
##' @param match_attr_value_parcially whether attribute values can be matched partially
##' @param check_attr_names_prefix Whether to check for short names of attributes. See `attr_names_prefix`.
##' @param attr_names_prefix If name does not start with this prefix (default is 'disambr_set_'), it will add this prefix before attribute name.
##' @param ... Named sets attributes to filter `sets` on
##' @return logical vector of length `length(sets)`
##'
##' @export
disambr_in_sets <- function(sets
, ...
, match_attr_value_parcially = FALSE
, check_attr_names_prefix = TRUE
, attr_names_prefix = "disambr_set_") {
## check if sets is list
if(!is.list(sets)) disambr_stop("'sets' should be a list of sets!")
attrs_values <- list(...)
## check if ... is provided and if not return all
attrs_values_length <- length(attrs_values)
if(attrs_values_length == 0) return(rep(TRUE, length(sets)))
## check if all named
attrs_values_names <- names(attrs_values)
if(length(attrs_values_names) != attrs_values_length)
disambr_stop("'...' arguments should be all named!")
if(isTRUE(check_attr_names_prefix)) {
## add "disambr_set_" if attr names are short
attrs_values_names_short <-
!stringi::stri_detect_regex(attrs_values_names
, paste0("^", attr_names_prefix))
if(any(attrs_values_names_short)) {
attrs_values_names[attrs_values_names_short] <-
paste0(attr_names_prefix
, attrs_values_names[attrs_values_names_short])
}
}
if(isTRUE(match_attr_value_parcially)) {
filter_sets <- function(attr_name, attr_value) {
vals <- lapply(sets, attr, attr_name, exact = TRUE)
vals <- lapply(vals, unlist)
vals <- lapply(vals, `[`, 1)
vals <- unlist(lapply(vals, function(a) if(is.null(a)) NA else a))
sapply(stringi::stri_detect_fixed(vals, attr_value), isTRUE)
}
} else {
filter_sets <- function(attr_name, attr_value) {
vals <- lapply(sets, attr, attr_name, exact = TRUE)
vals <- lapply(vals, unlist)
vals <- lapply(vals, `[`, 1)
vals <- unlist(lapply(vals, function(a) if(is.null(a)) NA else a))
vals %in% attr_value
}
}
sets_filters <- mapply(filter_sets
, attrs_values_names
, attrs_values
, SIMPLIFY = FALSE)
## return overlap of sets_filters
return(Reduce(`&`, sets_filters))
}
a <- list(disambr_set_attr(c(1,2,3), a = 1)
, disambr_set_attr(c(1,2,3), b = 2, a = 1)
, disambr_set_attr(c(1,2,3), c = 3, a = 2)
, disambr_set_attr(c(1,2,3), d = 4, a = 212))
expect_true(all(disambr_in_sets(a)))
expect_false(any(disambr_in_sets(a, b= 2, a = 2)))
expect_true(any(disambr_in_sets(a, b= 2, a = 1)))
expect_true(any(disambr_in_sets(a, a = 1)))
expect_true(any(disambr_in_sets(a, disambr_set_a = 1)))
expect_equal(sum(disambr_in_sets(a, a = 1, match_attr_value_parcially = TRUE)), 3)
##' Get first data set in list sets
##' @param sets list of sets
##' @param recipe function that produced the data set (parcial match allowed)
##' @param ... other attributes
##' @param match_parcially whether to match recipe partially
##' @inheritDotParams disambr_in_sets
##' @return
##'
##' @export
disambr_get_first_data_set <- function(sets, recipe, ...
, match_parcially = TRUE) {
set_num <-
disambr_in_sets(sets, recipe = recipe, ...
, match_attr_value_parcially = match_parcially)
## get first
set_num <- which(set_num)[1]
if(length(set_num) != 0) {
return(sets[[set_num]])
} else {
disambr_stop(paste("Data set should be available in sets:", recipe))
}
}
a <- list(disambr_set_attr(c(1,2,3), recipe = "my_function_123")
, disambr_set_attr(c(1,2,3,4), recipe = "my_function_23")
, disambr_set_attr(c(1,2,3,4,5), recipe = "my_function_3a")
, disambr_set_attr(c(1,2,3,4,5,6), recipe = "my_function_1"))
expect_equal(
disambr_get_first_data_set(a, "3")
, structure(c(1, 2, 3), disambr_set_recipe = "my_function_123")
)
expect_equal(
disambr_get_first_data_set(a, "3a")
, structure(c(1, 2, 3, 4, 5), disambr_set_recipe = "my_function_3a")
)
expect_null(disambr_get_first_data_set(a, "aaaaa"))
##' Gets last set in sets which strength less or equal than 0.5
##' @param sets sets
##' @param ... other attributes
##' @inheritDotParams disambr_in_sets
##' @return set or NULL if not found
##'
##' @export
disambr_get_last_set <- function(sets, ...) {
set_index <- disambr_in_sets(sets, ...)
set_index <- which(set_index)
set_index <- set_index[length(set_index)]
if(length(set_index) == 1) {
return(sets[[set_index]])
} else {
return()
}
}
a <- list(disambr_set_attr(c(1,2,3), strength = 0.1)
, disambr_set_attr(c(1,2,3,4), strength = 0.6)
, disambr_set_attr(c(1,2,3,4,5), strength = 0.5)
, disambr_set_attr(c(1,2,3,4,5,6), strength = 1))
expect_equal(
disambr_get_last_set(a)
, structure(c(1, 2, 3, 4, 5, 6), disambr_set_strength = 1)
)
a <- list(disambr_set_attr(c(1,2,3,4), strength = 0.6)
, disambr_set_attr(c(1,2), strength = 1))
expect_equal(
disambr_get_last_set(a)
, structure(c(1, 2), disambr_set_strength = 1)
)
##' Gets last set in sets which strength less or equal than 0.5
##' @param sets sets
##' @param ... other attributes
##' @inheritDotParams disambr_in_sets
##' @return set or NULL if not found
##'
##' @export
disambr_get_last_weak_set <- function(sets, ...) {
set_index <- disambr_in_sets(sets, ...
, strength = seq(from = 0.1, to = 0.5, by = 0.01))
set_index <- which(set_index)
set_index <- set_index[length(set_index)]
if(length(set_index) == 1) {
return(sets[[set_index]])
} else {
disambr_stop("- can not find last weak set in sets!")
return()
}
}
a <- list(disambr_set_attr(c(1,2,3), strength = 0.1)
, disambr_set_attr(c(1,2,3,4), strength = 0.6)
, disambr_set_attr(c(1,2,3,4,5), strength = 0.5)
, disambr_set_attr(c(1,2,3,4,5,6), strength = 1))
expect_equal(
disambr_get_last_weak_set(a)
, structure(c(1, 2, 3, 4, 5), disambr_set_strength = 0.5)
)
a <- list(disambr_set_attr(c(1,2,3,4), strength = 0.6)
, disambr_set_attr(c(1,2,3,4,5,6), strength = 1))
expect_error(
disambr_get_last_weak_set(a)
)
##' Get sets with strength parameter of 1 and rbind them into single set
##' @param sets sets
##' @param ... other attributes
##' @inheritDotParams disambr_in_sets
##' @return set or NULL if none found
##'
##' @export
disambr_get_strong_set <- function(sets, ...) {
sets_index <- disambr_in_sets(sets, ...
, type = "similar"
, strength = 1)
## if not sets return NULL
if(!any(sets_index)) return()
sets <- sets[sets_index]
## bind sets depending on class
sets_class <- sapply(lapply(sets, class), `[`, 1)
if(all(sets_class %in% "data.table")) {
sets <- data.table::rbindlist(sets)
} else if(all(sets_class %in% "list")) {
sets <- do.call(c, sets)
}
return(sets)
}
a <- list(disambr_set_attr(list(1,2,3)
, type = "similar"
, strength = 1)
, disambr_set_attr(list(1,2,3,4)
, type = "similar"
, strength = 0.6)
, disambr_set_attr(list(1,2,3,4,5)
, type = "similar"
, strength = 0.4)
, disambr_set_attr(list(1,2,3,4,5,6)
, type = "similar"
, strength = 1))
expect_equal(
disambr_get_strong_set(a)
, list(1, 2, 3, 1, 2, 3, 4, 5, 6)
)
a <- list(disambr_set_attr(data.table::data.table(c(1,2,3))
, type = "similar"
, strength = 1)
, disambr_set_attr(data.table::data.table(c(1,2,3,4))
, type = "similar"
, strength = 0.6)
, disambr_set_attr(data.table::data.table(c(1,2,3,4,5))
, type = "similar"
, strength = 0.4)
, disambr_set_attr(data.table::data.table(c(1,2,3,4,5,6))
, type = "similar"
, strength = 1))
expect_inherits(disambr_get_strong_set(a), "data.table")
a <- list(disambr_set_attr(data.table::data.table(c(1,2,3))
, type = "similar"
, strength = 0.1)
, disambr_set_attr(data.table::data.table(c(1,2,3,4))
, type = "similar"
, strength = 0.6)
, disambr_set_attr(data.table::data.table(c(1,2,3,4,5))
, type = "similar"
, strength = 0.4)
, disambr_set_attr(data.table::data.table(c(1,2,3,4,5,6))
, type = "similar"
, strength = 0.1))
expect_null(disambr_get_strong_set(a))
##' Get sets with strength parameter of 10 (ground thruth)
##' @param sets sets
##' @param ... other attributes
##' @inheritDotParams disambr_in_sets
##' @return set or NULL if none found
##'
##' @export
disambr_get_truth_set <- function(sets, ...) {
sets_index <- disambr_in_sets(sets, ...
, type = "similar"
, strength = 10)
## if not sets return NULL
if(!any(sets_index)) disambr_stop("- can not find truth set!")
if(sum(sets_index) != 1) disambr_stop("- more than one truth set found!")
return(sets[[which(sets_index)]])
}
##' Gets last set from sets with strength <= 0.5 and excludes from this set all sets with strength of 1
##' @param sets sets
##' @return set or NULL
##' @export
disambr_get_last_unstrong_set <- function(sets) {
weak_set <- disambr_get_last_weak_set(sets)
strong_set <- disambr_get_strong_set(sets)
if(is.null(strong_set) || is.null(weak_set)) {
return(weak_set)
} else if("data.table" %in% class(weak_set) &&
"data.table" %in% class(strong_set)) {
return(data.table::fsetdiff(weak_set, strong_set))
## comb_set <- rbind(weak_set, strong_set)
## comb_set <-
## comb_set[!duplicated(comb_set, fromLast = FALSE) &
## !duplicated(comb_set, fromLast = TRUE)]
## return(comb_set) #
## return(weak_set[strong_set[[1]] != weak_set[[1]] ||
## strong_set[[2]] != weak_set[[2]]])
} else {
disambr_stop("Weak and strong sets should be data.tables!")
}
}
a <- list(disambr_set_attr(data.table::data.table(c(1,2,3), c(1,2,8)), strength = 1)
, disambr_set_attr(data.table::data.table(c(1,2,3,4), c(1,2,3,4)), strength = 0.6)
, disambr_set_attr(data.table::data.table(c(1,2,3,4,5,6,7,8), c(1,2,3,4,5,6,7,8)), strength = 0.4)
, disambr_set_attr(data.table::data.table(c(1,2,7,5), c(1,2,7,5)), strength = 1))
expect_equal(nrow(disambr_get_last_unstrong_set(a)), 8)
## disambr_entity
## disambr_set_type
## disambr_set_coefficient
## disambr_set_name
## disambr_set_collection
## disambr_entity_id_reference
## disambr_entity_id_reference_md5_sum
## disambr_recipe
##' Adds attribures to the set with data.table::setattr
##' @param focal_set Set to add attribute to
##' @param check_attr_names_prefix Whether to check for short names of attributes. See `attr_names_prefix`.
##' @param attr_names_prefix If name does not start with this prefix (default is 'disambr_set_'), it will add this prefix before attribute name.
##' @param ... Named attributes
##' @return `focal_set`
##'
##' @export
disambr_set_attr <- function(focal_set
, ...
, check_attr_names_prefix = TRUE
, attr_names_prefix = "disambr_set_") {
attrs_values <- list(...)
## check if ... is provided and if not do nothing
attrs_values_length <- length(attrs_values)
if(attrs_values_length == 0) return()
## check if all attributes in ... are named
attrs_values_names <- names(attrs_values)
if(length(attrs_values_names) != attrs_values_length)
disambr_stop("'...' arguments should be all named!")
if(isTRUE(check_attr_names_prefix)) {
## add "disambr_set_" if attr names are short
attrs_values_names_short <-
!stringi::stri_detect_regex(attrs_values_names
, paste0("^", attr_names_prefix))
if(any(attrs_values_names_short)) {
attrs_values_names[attrs_values_names_short] <-
paste0(attr_names_prefix
, attrs_values_names[attrs_values_names_short])
}
}
## set attributes
for (i in 1:length(attrs_values)) {
## also works for other that data.table objects
data.table::setattr(focal_set, attrs_values_names[i], attrs_values[[i]])
}
return(focal_set)
}
expect_equal(
attributes(
disambr_set_attr(c(1,2,3)
, lalala = "la"
, disambr_set_important_attr = "Hi there"))
, list(disambr_set_lalala = "la", disambr_set_important_attr = "Hi there")
)
##' Add disambr attribures to focal set from template set and update some of them
##'
##' It adds attributes by reference (with setattr {data.table}), i.e. without making a copy
##'
##' It updates:
##' - time stamp (disambr_set_st) to current
##' - disambr_set_file to NULL
##' - adds to disambr_set_recipe the calling procedure
##' - adds disambr_set_duration
##' @param focal_set Set
##' @param template_set Set to inhirit attributes from
##' @param ... other attributes
##' @param attr_names_prefix "disambr_set_" by default. Only prefixed by it will be copied from `template_set`
##' @inheritDotParams disambr_set_attr
##' @return
##'
##' @md
##' @export
disambr_add_set_attr <- function(focal_set
, template_set = NULL
, ...
, attr_names_prefix = "disambr_set_") {
if(is.null(focal_set)) return()
## copy only disambr attr from template_set
template_attr <- attributes(template_set)
template_attr_disambr <-
stringi::stri_detect_regex(names(template_attr)
, paste0("^", attr_names_prefix))
template_attr <- template_attr[template_attr_disambr]
mapply(function(a, name) {
data.table::setattr(focal_set, name, a)
}
, template_attr
, names(template_attr))
## remove file attributes
disambr_set_attr(focal_set, file = NULL)
## add time stamp
disambr_set_attr(focal_set, ts = Sys.time())
## add duration
if(exists("disambr_start_time", where = parent.frame())) {
disambr_start_time <-
get("disambr_start_time", pos = parent.frame())
disambr_set_attr(focal_set
, duration = Sys.time() - disambr_start_time)
}
## add recipe (procedure call) and name
recipe <- attributes(template_set)$disambr_set_recipe
procedure_call <- deparse(sys.calls()[[sys.nframe() - 1]])[[1]]
procedure_name <-
stringi::stri_extract_first_regex(procedure_call
, c("^[^()]+"))
procedure_short_name <-
stringi::stri_replace_first_regex(procedure_name
, paste0("^", attr_names_prefix), "")
disambr_set_attr(focal_set
, name = procedure_short_name
, recipe = c(list(list(func = procedure_name
, call = procedure_call))
, recipe))
## set attributes from ...
disambr_set_attr(focal_set, ...)
return(focal_set)
}
a <- data.table::data.table(a = c(1,2,3,4)
,b = c(11,22,33,44))
b <- data.table::data.table(a = c(6,7,8)
,b = c(66,77,88))
disambr_set_attr(a
, name = "a"
, strength = 0.5
, ts = Sys.time()
, file = "lalala.rds"
, recipe = list("second_procedure"
, "first_procedure"))
foo <- function(b, a = NULL, ...) {
disambr_message_start()
disambr_add_set_attr(b, a, ...)
return(attributes(b))
}
expect_message(foo(b, a))
expect_equal(
foo(b, a)$disambr_set_recipe
, list(list(func = "foo", call = "foo(b, a)"), "second_procedure", "first_procedure")
)
expect_equal(foo(b,a, name = "new.name")$disambr_set_name, "new.name")
expect_equal(foo(b,a, strength = 1)$disambr_set_strength, 1)
expect_equal(foo(b,a)$disambr_set_name, "foo")
expect_equal(foo(a)$disambr_set_name, "foo")
expect_null(foo(NULL, a))
expect_equal(
foo(b, NULL)$disambr_set_recipe
, list(list(func = "foo", call = "foo(b, NULL)"))
)
##' Save set and adds file attribute
##' @param set_to_save set
##' @param save_set_as if TRUE the the file name is made from disambr_set_name attribute
##' @param save_set_prefix file prefix, default is "disambr-set."
##' @param save_set_dir file dir, default is "disambr-sets-rds" in current directory
##' @param use_time_stamp add time stamps at the end of file name, adds by default
##' @return file namej
##'
##' @export
disambr_save_set <- function(set_to_save
, save_set_as = getOption("disambr_save_as")
, save_set_prefix = getOption("disambr_save_set_prefix")
, save_set_dir = getOption("disambr_save_set_dir")
, use_time_stamp = getOption("disambr_save_set_time_stamp")) {
## do not save by default
if(length(save_set_as) != 0) {
## make name if it is just TRUE
if(isTRUE(save_set_as)) {
save_set_as <- attr(set_to_save, "disambr_set_name")[[1]]
## use timestamps by default
if(isTRUE(use_time_stamp) ||
length(use_time_stamp) == 0) {
save_set_as <-
paste0(save_set_as, "."
, format(Sys.time(), "%Y-%m-%dT%H-%M"))
}
save_set_as <- paste0(save_set_as, ".rds")
if(length(save_set_prefix) == 0) {
save_set_prefix <- "disambr-set."
}
}
## if "save as" provided use it, add prefix if it is provided as well
if(is.character(save_set_as)) {
if(length(save_set_prefix) != 0) {
save_set_as <- paste0(save_set_prefix, save_set_as)
}
## add directory or default
if(length(save_set_dir) == 0) {
save_set_dir <- "disambr-sets-rds"
}
dir.create(save_set_dir, showWarnings = FALSE, recursive = TRUE)
save_set_as <- paste0(save_set_dir,"/",save_set_as)
## add file attribute
disambr_set_attr(set_to_save, file = save_set_as)
## save
saveRDS(set_to_save, file = save_set_as[[1]], compress = FALSE)
disambr_message(paste0(
"- set saved as '", save_set_as, "'"))
return(save_set_as)
} else {
disambr_message(
paste0("- do not know how to save 'set_to_save' as '"
, save_set_as, "'"))
return()
}
} else {
return()
}
}
## check files manually
expect_null(disambr_save_set(disambr_set_attr(c(1,2,3)
, name = "bar")
, save_set_as = NULL
, save_set_dir = "../disambr-sets-rds"))
## with making files
expect_inherits(disambr_save_set(disambr_set_attr(c(1,2,3)
, name = "bar")
, save_set_as = TRUE
, save_set_dir = "disambr-save-set-test")
, "character")
## readRDS("../disambr-sets-rds/disambr-set.bar.2020-08-30T10-25.rds")
unlink("disambr-save-set-test", recursive = TRUE)
##' Gets output set from sets (in case we already made it)
##' @param sets sets
##' @param get_output_set Whether to search for output set. Default is not.
##' @param attr_names_prefix prefix for attributes
##' @return NULL or output set
##'
##' @export
disambr_get_output_set <- function(sets
, get_output_set = getOption("disambr_get_output_set")
, attr_names_prefix = "disambr_set_") {
if(isTRUE(get_output_set)) {
procedure_call <- deparse(sys.calls()[[sys.nframe() - 1]])
procedure_name <-
stringi::stri_extract_first_regex(
procedure_call, c("^[^()]+"))
procedure_short_name <-
stringi::stri_replace_first_regex(
procedure_name
, paste0("^", attr_names_prefix), "")
output_set_index <-
which(disambr_in_sets(sets, name = procedure_short_name))
if(length(output_set_index) == 0) {
return()
} else if(length(output_set_index) == 1) {
disambr_message(paste("- reusing output set:", procedure_short_name))
return(sets[[output_set_index]])
} else {
disambr_message(paste("- reusing last output set:", procedure_short_name))
output_set_index <- output_set_index[length(output_set_index)]
return(sets[[output_set_index]])
}
} else {
return()
}
}
a <- list(disambr_set_attr(c(1), name = "foo1")
, disambr_set_attr(c(1,2), name = "foo", a = 1)
, disambr_set_attr(c(1,2,3), name = "foo0", a = 2)
, disambr_set_attr(c(1,2,3,4), name = "bar", a = 212))
foo <- function(sets, ...) {
return(disambr_get_output_set(sets, ...))
}
expect_message(foo(a, get_output_set = TRUE))
expect_equal(foo(a, get_output_set = TRUE)
, structure(c(1, 2), disambr_set_name = "foo", disambr_set_a = 1)
)
expect_null(foo(a))
##' Reads last output set saved on disk
##' @param read_output_set toggle. default is no
##' @param save_set_prefix file prefix
##' @param save_set_dir file dir
##' @param attr_names_prefix arrt prefix
##' @return
##'
##' @export
disambr_read_output_set <- function(read_output_set = getOption("disambr_read_output_set")
, save_set_prefix = getOption("disambr_save_set_prefix")
, save_set_dir = getOption("disambr_save_set_dir")
, attr_names_prefix = "disambr_set_") {
if(isTRUE(read_output_set)){
## make defaults if not provided
if(length(save_set_prefix) == 0) {
save_set_prefix <- "disambr-set."
}
if(length(save_set_dir) == 0) {
save_set_dir <- "disambr-sets-rds"
}
## output set name pattern
procedure_call <- deparse(sys.calls()[[sys.nframe() - 1]])
procedure_name <-
stringi::stri_extract_first_regex(
procedure_call, c("^[^()]+"))
procedure_short_name <-
stringi::stri_replace_first_regex(
procedure_name
, paste0("^", attr_names_prefix), "")
output_set_name_pattern <-
paste0(save_set_prefix, procedure_short_name, ".*", "\\.rds")
## match last file
output_set_file <-
list.files(save_set_dir, pattern = output_set_name_pattern)
if(length(output_set_file) != 0) {
## take the last file (as they are sorted alphabetically)
output_set_file <- output_set_file[length(output_set_file)]
output_set_file <- file.path(save_set_dir, output_set_file)
disambr_message(paste("- reusing saved set:", output_set_file))
return(readRDS(file = output_set_file))
} else {
return()
}
} else {
return()
}
}
a <- list(disambr_set_attr(c(1), name = "foo1")
, disambr_set_attr(c(1,2), name = "foo", a = 1)
, disambr_set_attr(c(1,2,3), name = "foo0", a = 2)
, disambr_set_attr(c(1,2,3,4), name = "bar", a = 212))
foo <- function(sets) {
disambr_save_set(sets[[2]]
, save_set_as = TRUE
, save_set_dir = "disambr_read_output_set_test")
disambr_read_output_set(read_output_set = TRUE
, save_set_dir = "disambr_read_output_set_test")
}
expect_equal(length(foo(a)), 2)
expect_equal(length(attributes(foo(a))), 3)
unlink("disambr_read_output_set_test", recursive = TRUE)
foo <- function(sets) {
disambr_save_set(sets[[2]]
, save_set_as = "tra-la-la.rds"
, save_set_dir = "disambr_read_output_set_test")
disambr_read_output_set(read_output_set = TRUE
, save_set_dir = "disambr_read_output_set_test")
}
expect_null(foo(a))
unlink("disambr_read_output_set_test", recursive = TRUE)
##' Filters list of sets
##' @param sets_list list of sets
##' @param attribute_value_list list of attribute values where list elements name correspond attribute names used for filtering sets
##' @param which_to_return whether to return "all", "first" or "last" set from filtered sets
##' @param negate_subsets whether to return sets that was not matched insted
##' @return list of sets or set if `which_to_return` is either "first" or "last"
##'
##' @export
disambr_subsets <- function(sets_list, attribute_value_list
, which_to_return = c("all", "first", "last")
, negate_subsets = FALSE) {
if (!is.list(sets_list))
stop("disambr: 'sets_list' should be a list!")
if (!is.list(attribute_value_list))
stop("disambr: 'attribute_value_list' should be a list!")
filter_sets <- function(attr_name, attr_value) {
sapply(lapply(sets_list, attr, attr_name), `[`, 1) %in% attr_value
}
sets_list_filters <- mapply(filter_sets
, names(attribute_value_list)
, attribute_value_list
, SIMPLIFY = FALSE
, USE.NAMES = TRUE)
subsets_list <-
if (isTRUE(negate_subsets)) {
sets_list[!Reduce(`&`, sets_list_filters)]
} else {
sets_list[Reduce(`&`, sets_list_filters)]
}
return(switch(which_to_return[1]
, all = subsets_list
, first = subsets_list[[1]]
, last = subsets_list[[length(subsets_list)]]))
}
## a <- c(1,2,3,4)
## b <- c("a","b","c")
## c <- NULL
## attributes(a)$name <- "aaa"
## attributes(b)$name <- "bbb"
## attributes(c)$name <- c("ccc", 3)
## attributes(a)$kind <- "good"
## attributes(b)$kind <- "good"
## attributes(c)$kind <- "bad"
## disambr_subsets(list(a,b,c), list(kind = "good"))
## disambr_subsets(list(a,b,c), list(kind = "good"), which_to_return = "last")
## disambr_subsets(list(a,b,c), list(name = "ccc"
## , kind = "good"))
## disambr_subsets(list(a,b,c), list(name = "ccc"
## , kind = "bad"))
## disambr_entity
## disambr_set_type
## disambr_set_coefficient
## disambr_set_name
## disambr_set_collection
## disambr_entity_id_reference
## disambr_entity_id_reference_md5_sum
## disambr_recipe
disambr_setattr <- function(focal_set, ...) {
attr_value_list <- list(...)
for (i in 1:length(attr_value_list)) {
setattr(focal_set, names(attr_value_list)[i], attr_value_list[[i]])
}
return(focal_set)
}
##' Reads the data for disambiguation
##' @param files_path Path to data. You can specify almost everything
##' @param save_sets_as Save set as
##' @param save_sets_dir Directory to save set in
##' @param use_time_stamp Whether to use timestamp in the file name
##' @return
##'
##' @md
##' @export
disambr_read <- function(files_path
, save_sets_as = NULL
, save_sets_dir = "disambr-data"
, use_time_stamp = FALSE) {
disambr_message_start()
## see if the data is available already
if(is.character(save_sets_as) &&
file.exists(file.path(save_sets_dir, save_sets_as))) {
disambr_message(paste("- reusing saved sets:", save_sets_as))
return(readRDS(file.path(save_sets_dir, save_sets_as)))
}
files_path <- parse_files_path(files_path)
files_data_list <- lapply(files_path, disambr_read_file)
sets <- disambr_make_data(files_data_list)
## save just in case
if(is.character(save_sets_as)) {
disambr_save_set(sets
, save_set_as = save_sets_as
, save_set_dir = save_sets_dir
, use_time_stamp = use_time_stamp)
}
disambr_message_finish()
return(sets)
}
my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")
if(file.exists(my.file)) {
expect_equal(
length(disambr_read(my.file)), 4
)
}
test
## my.dir <- '../data'
## my.dir.large <- '/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments'
## my.dir.huge <- '/mnt/md5/data/wos'
## my.file <- '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt'
## my.file1 <- "/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments/LN Public NAICS records from 10001 to 10500.txt"
## my.files <-
## c('../data/Journals in Mathematical Psychology/Applied Measurement in Education.txt'
## , '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt')
my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
dt <- my.file2 %>% disambr_read(save_sets_as = "data-test.rds")
my.dir.small <- '../data/Journals in Mathematical Psychology'
dt <- my.dir.small %>% disambr_read
dt %>% sapply(class)
##' Reads file based on file extention
##' @param f full file path name
##' @return data
##'
##' @export
disambr_read_file <- function(f) {
f_extention <- tools::file_ext(f)
switch(f_extention
, "tsv" = disambr_read_tsv(f)
## here we can add reading from .txt wos files
, "txt" = disambr_read_tsv(f)
, message("Disambr: can not read file extention: ", f_extention
, "\n - skipping file: ", f))
}
##' Read tsv file
##' @param f path
##' @return data
##'
##' @export
disambr_read_tsv <- function(f) {
## check tsv file type base on first line
first_line <- readLines(f, n = 1
, warn = FALSE
, skipNul = TRUE)
header <- parse_tsv_wos_header(first_line)
if(!isFALSE(header)) {
disambr_read_tsv_wos(f, header)
} else {
## here we can add more tsv types
message("Disambr: unrecognized header of tsv file: ", header
, "\n - skipping file: ", f)
NULL
}
}
parse_tsv_wos_header <- function(first_line) {
header <- stringi::stri_split_fixed(first_line, "\t")[[1]]
if( ## check if at least 10 fields two big letters
sum(stringi::stri_detect_regex(header, "^[A-Z0-9]{2}$")) > 10 &&
## check if main fields are present
all(c('AU', 'TI') %in% header)) {
stringi::stri_extract_first_regex(header, "[A-Z0-9]{2}")
} else {FALSE}
}
##' Reads WoS tsv export file and makes disambr set out of it (just adding some attributes to the data.table)
##'
##' @param f path
##' @param header header
##' @return disambr set
##'
##' @export
disambr_read_tsv_wos <- function(f, header) {
s <- read_to_utf8(f)
s <- recode_return_characters(s, f)
f_data <- data.table::fread(text = s
, skip = 1
, strip.white = TRUE
, header = FALSE
, col.names = header
, select = 1:length(header)
## , colClasses = rep("character", length(header))
, quote=""
, keepLeadingZeros = FALSE
, encoding = "UTF-8"
, sep = "\t")
## set attrib (file, funcall, meanning of the fields and data scheme)
disambr_add_set_attr(f_data, NULL
, unit = "publication"
, reference = "self"
, type = "different"
, id = "index"
, strength = 1
, name = "wos_tsv"
, collection = "unit_table"
, recipe = list(func = "disambr_read_tsv_wos"
, file_name = f
, file_md5sum = tools::md5sum(f)
, file_header = header))
return(f_data)
}
disambr_make_data <- function(files_data_list
, drop_ejected = FALSE) {
## TODO: add other data processing here
## TODO: add processing of wos data with differen headers
## check wos publication
processabe_data <-
disambr_in_sets(files_data_list, name = "wos_tsv")
processabe_data <- files_data_list[processabe_data]
if (length(processabe_data) != 0) {
processabe_data_recipes <-
lapply(processabe_data, attr, "disambr_set_recipe")
processabe_data_headers <-
lapply(processabe_data_recipes, `[[`, "file_header")
## check if all headers ate the same before rbindlist
if (length(unique(processabe_data_headers)) == 1) {
disambr_message("Processing wos tsv export data..")
disambr_message("- rbinding wos publication tables..")
wos_publication <-
disambr_make_wos_tsv_publications(files_data_list, processabe_data_recipes)
disambr_message("- making wos authors table..")
wos_author <- disambr_make_wos_tsv_authors(wos_publication)
if(drop_ejected) {
## remove fields that we do not need
remove_headers <- c("AU", "AF", "C1", "RP", "EM", "RI", "OI")
## filter those that exists
remove_headers <-
remove_headers[remove_headers %in% processabe_data_headers[[1]]]
## remove headers without hard copy and `[.data.table` overhead
data.table::set(wos_publication, , remove_headers, NULL)
}
disambr_message("- making wos references table..")
wos_reference <- disambr_make_wos_tsv_references(wos_publication)
if(drop_ejected) {
remove_headers <- c("CR")
## filter those that exists
remove_headers <-
remove_headers[remove_headers %in% processabe_data_headers[[1]]]
## remove headers without hard copy and `[.data.table` overhead
data.table::set(wos_publication, , remove_headers, NULL)
}
disambr_message("- making author-year citations table..")
citation_name_table <-
disambr_make_wos_tsv_author_year_citations(wos_publication, wos_reference)
## if all data is wos data return only that
if(length(processabe_data) == length(files_data_list)) {
return(list(wos_publication
, wos_author
, wos_reference
, citation_name_table
))
} else {
return(list(
## TODO implement
## disambr_subsets(files_data_list
## , list(disambr_set_name = "wos_records_tsv_export")
## , negate_subsets = TRUE)
wos_publication
, wos_author
, wos_reference
, citation_name_table
))
}
} else {
disambr_message("Files data has different headers. Skipping processing...")
return(files_data_list)
}
} else {
return(files_data_list)
}
}
##' Make set of publications
##'
##' @param tables_list list of tables
##' @param recipes recipes
##' @return set
##'
##' @import data.table
##' @export
disambr_make_wos_tsv_publications <- function(tables_list, recipes) {
publication_table <-
data.table::rbindlist(tables_list, fill=TRUE)
## add ids, first_author_last_name, first_author_first_initials
publication_table[
, `:=`(id = 1:.N
, first_author_last_name =
toupper(stringi::stri_extract_first_regex(AU, "^[^,]+"))
, first_author_first_initial =
toupper(stringi::stri_extract_first_regex(AU, "(?<=, )[A-Za-z]"))
, doi = stringi::stri_match_first_regex(DI
, "10.\\d{4,9}/[-._;()/:A-Za-z0-9]+"))]
publication_table[
, name_year := paste(first_author_last_name
, first_author_first_initial
, PY)]
## set publication attributes
disambr_add_set_attr(publication_table, NULL
, unit = "publication"
, reference = "self"
, type = "different"
, strength = 1
, name = "wos_tsv_publications"
, collection = "unit_table"
## add files recipies
, recipe = c(list('disambr_make_wos_tsv_publications')
, recipes))
return(publication_table)
}
disambr_make_wos_tsv_publications <- disambr:::disambr_make_wos_tsv_publications
my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")
if(file.exists(my.file)) {
dt <- list(disambr_read_file(my.file))
expect_equal(
attributes(disambr_make_wos_tsv_publications(dt, "recipe-lalala"))$disambr_set_recipe[[2]]
, "recipe-lalala")
expect_equal(
length(disambr_make_wos_tsv_publications(dt, "recipe-lalala"))
, 72
)
}
stringi::stri_extract_first_regex("Vlasov, asdf;", "(?<=, )[A-Za-z]")
my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
a <- my.file2 %>% disambr_read_file %>% list %>% disambr_make_wos_tsv_publications("recipe-lalala")
a %>% attributes
a[, .(id, first_author_last_name, first_author_first_initial)]
##' Parses AU column of WoS saved records export
##' @param author_name a record string from AU column
##' @return data.table
##'
##' @md
disambr_wos_tsv_parse_au <- function(author_name) {
author_last_name <-
stringi::stri_extract_first_regex(author_name, "^[^,]+")
author_initials <-
stringi::stri_extract_first_regex(author_name, "(?<=, )[A-Z]+")
data.table::data.table(author_name = author_name
, author_last_name = author_last_name
, author_initials = author_initials
, author_order = 1:length(author_name))
}
## tests
## "Tilly, TB; Nelson, MT; Chakravarthy, KB; Shira, EA; Debrose, MC; Grabinski, CM; Salisbury, RL; Mattie, DR; Hussain, SM" %>% stri_split_fixed("; ") %>%
## disambr_wos_tsv_parse_au
##' Parses AF (author full name) column of WoS saved records export
##' @param name (a record string from AF column)?
##' @return Data.table
disambr_wos_tsv_parse_af <- function(name) {
last_name <- stringi::stri_extract_first_regex(name, "^[^,]+")
first_names <- stringi::stri_extract_first_regex(name, "(?<=, ).*")
first_names <-
stringi::stri_split_fixed(first_names, " ", omit_empty = TRUE)
## first.full.name is first name without dot
first_full_name <-
lapply(first_names, function(n) {
n[!stringi::stri_detect_regex(n, "\\.$")][1]
})
## return
data.table::data.table(
author_full_name = name
## , author_last_name = last_name
, author_first_names = first_names
, author_first_full_name = first_full_name)
}
## test
## "Tilly, Trevor B.; Nelson, M. Tyler; Chakravarthy, Karthik B.; Shira, Emily A.; Debrose, Madeline C.; Grabinski, Christin M.; Salisbury, Richard L.; Mattie, David R.; Hussain, Saber M." %>%
## disambr_wos_tsv_parse_af
##' Parses RP (reprint author) column of WoS saved records export
##' @param record_rp a record string from RP column
##' @return Data.table with two columns - author_name and affiliations
disambr_wos_tsv_parse_rp <- function(record_rp) {
record_rp_init <- ""
authors_table <-
data.table::data.table(author_name = character(0)
, affiliations = character(0))
while(record_rp != record_rp_init) {
record_rp_init <- record_rp
record_rp_split <-
stringi::stri_match_first_regex(
record_rp
, "\\s*([^()]+)\\s+\\((corresponding author|reprint author)\\)([^;]+)")
authors <-
stringi::stri_split_fixed(record_rp_split[1,2], "; ")[[1]]
affiliation <-
stringi::stri_replace_first_regex(
record_rp_split[1,4], "^[\\s,.;]+", "")
for (author in authors) {
## check if author is already in the list
authors_table_match <-
authors_table$author_name %in% author
if(any(authors_table_match)) {
## add affiliation to affiliations of author
## the data.table way..
authors_table[authors_table_match
, affiliations :=
list(c(unlist(affiliations), affiliation))]
} else {
## add new author with affiliation otherwise
authors_table <-
data.table::rbindlist(
list(authors_table
, list(author_name = author
, affiliations =
list(affiliation))))
}
}
record_rp <-
stringi::stri_replace_first_regex(
record_rp
, "[^()]+\\((corresponding author|reprint author)\\)[^;]+[;]", "")
}
## results are not printed but the data.table is returned
return(authors_table)
}
## "Guesmi, S (corresponding author), Natl Agron Inst Tunisia INAT, 43 Ave Charles Nicolle, Tunis 1082, Tunisia.; Guesmi, S; Sghaier, H (corresponding author), Sidi Thabet Technopk, Natl Ctr Nucl Sci & Technol, Lab Energy & Matter Dev Nucl Sci LR16CNSTN02, Sidi Thabet 2020, Tunisia.; Sghaier, H (corresponding author), Sidi Thabet Technopk, Lab Biotechnol & Nucl Technol LR16CNSTN01, Sidi Thabet 2020, Tunisia.; Sghaier, H (corresponding author), Sidi Thabet Technopk, Lab Biotechnol & Biogeo Resources Valorizat LR11E, Sidi Thabet 2020, Tunisia." %>%
## disambr_wos_tsv_parse_rp %>% print
## "" %>%
## disambr_wos_tsv_parse_rp %>% nrow
##' Parses EM (email) column of WoS saved records export
##' @param emails a record string from EM column
##' @param record_au_table a data_tabe after parsing AU column with disambr_wos_tsv_parse_au
##' @param record_rp_table a data_tabe after parsing RP column with disambr_wos_tsv_parse_rp
##' @return Data.table with columns - author_name, affiliations and email
disambr_wos_tsv_parse_em <- function(emails
, record_au_table
, record_rp_table) {
if (isTRUE(length(emails) == 1 && emails == "")) {
## in case there are no emails
record_au_table[, author_email := NA]
} else if (isTRUE(length(emails) == nrow(record_rp_table))) {
## assume that emails corresponds RP authors
record_au_table[match(record_rp_table$author_name, author_name)
, author_email := emails]
} else if (isTRUE(length(emails) == nrow(record_au_table))) {
## assume that emails corresponds AU authors
record_au_table[, author_email := emails]
} else if (isTRUE(nrow(record_rp_table) != 0)) {
## in other cases just use first email for first RP author
record_au_table[match(record_rp_table$author_name, author_name)[1]
, author_email := emails[1]]
} else {
## if no RP assignt to first in AU
record_au_table[1, author_email := emails[1]]
}
## we do not need to return things as it updates record_au_table
return(record_au_table)
}
## tests
## disambr_wos_tsv_parse_em(
## record_em = "a"
## , record_au_table = data.table(author_name = c(1,2,3,4))
## , record_rp_table = data.table(author_name = c(3))
## ) %>% print
##' Parses C1 (author adress/affiliation) column of WoS saved records export
##' @param record_c1 a record string from RP column
##' @param table_af Table
##' @return Data.table with two columns - author_name and affiliations
disambr_wos_tsv_parse_c1 <- function(record_c1
, table_af = NULL) {
record_c1_init <- ""
authors_table <-
data.table::data.table(author_full_name = character()
, affiliations = list())
while(record_c1 != record_c1_init) {
record_c1_init <- record_c1
record_c1_piece <-
stringi::stri_match_first_regex(
record_c1, "\\s*\\[([^\\[\\]]+)\\]\\s+([^;]+)\\s*")
authors <-
stringi::stri_split_fixed(record_c1_piece[1,2], "; ")[[1]]
affiliation <- record_c1_piece[1,3]
for (author in authors) {
## check if author is already in the list
authors_table_match <-
authors_table$author_full_name %in% author
if(any(authors_table_match)) {
## add affiliation to affiliations of author
## the data.table way..
authors_table[authors_table_match
, affiliations :=
list(c(unlist(affiliations), affiliation))]
} else {
## add new author with affiliation otherwise
authors_table <-
data.table::rbindlist(list(authors_table
, list(author_full_name = author
, affiliations = list(affiliation))))
}
}
record_c1 <-
stringi::stri_replace_first_regex(
record_c1, "\\s*\\[[^\\[\\]]+\\][^;]+[;]", "")
}
## merge with table_af if provided
if(length(table_af) != 0) {
return(authors_table[table_af
, on = "author_full_name"
, .(affiliations)])
} else {
return(authors_table)
}
}
## "[Wang, Menglei; Li, Shunyi; Zhu, Rencheng; Zhang, Ruiqin] Zhengzhou Univ, Sch Ecol & Environm, Zhengzhou 450001, Peoples R China; [Wang, Menglei] Zhengzhou Univ, Sch Chem Engn, Zhengzhou 450001, Peoples R China; [Zu, Lei; Wang, Yunjing; Bao, Xiaofeng] Chinese Res Inst Environm Sci, State Environm Protect Key Lab Vehicle Emiss Cont, Beijing 100012, Peoples R China" %>%
## disambr_wos_tsv_parse_c1
## "[Wang, Menglei; Li, Shunyi; Zhu, Rencheng; Zhang, Ruiqin] Zhengzhou Univ, Sch Ecol & Environm, Zhengzhou 450001, Peoples R China; [Wang, Menglei] Zhengzhou Univ, Sch Chem Engn, Zhengzhou 450001, Peoples R China; [Zu, Lei; Wang, Yunjing; Bao, Xiaofeng] Chinese Res Inst Environm Sci, State Environm Protect Key Lab Vehicle Emiss Cont, Beijing 100012, Peoples R China" %>%
## disambr_wos_tsv_parse_c1(disambr_wos_tsv_parse_af("Wang, Menglei; Zu, Lei; Wang, Yunjing; Bao, Xiaofeng"))
##' Parses OI column of WoS saved records export
##' @param authors a record string from OI column
##' @param table_af Table
##' @return data.table
##'
##' @md
disambr_wos_tsv_parse_oi <- function(authors
, table_af = NULL) {
author_full_name <-
stringi::stri_extract_first_regex(authors, "^[^/]+")
author_orcid <-
stringi::stri_extract_first_regex(authors, "(?<=/).+")
authors_table <-
data.table::data.table(author_full_name = author_full_name
, author_orcid = author_orcid)
## take care of propable case of multiple ID for one person
author_full_name_unique <- unique(authors_table$author_full_name)
author_orcid_list <-
lapply(author_full_name_unique
, function(x) {
authors_table$author_orcid[authors_table$author_full_name %in% x]
})
authors_table <-
data.table::data.table(author_full_name = author_full_name_unique
, author_orcid = author_orcid_list)
if(length(table_af) != 0) {
return(authors_table[table_af
, on = "author_full_name"
, .(author_orcid)])
} else {
return(authors_table)
}
}
## "Estrela, Pedro/0000-0001-6956-1146; Maxted, Grace/0000-0002-6816-9107; Rainbow, Joshua/0000-0003-3911-928X; Richtera, Lukas/0000-0002-8288-3999; Moschou, Despina/0000-0001-9175-5852" %>%
## disambr_make_wos_tsv_authors__parse_oi
## "Estrela, Pedro/0000-0001-6956-1146; Maxted, Grace/0000-0002-6816-9107; Rainbow, Joshua/0000-0003-3911-928X; Rainbow, Joshua/0000-0003-3911-928X; Richtera, Lukas/0000-0002-8288-3999; Moschou, Despina/0000-0001-9175-5852" %>%
## disambr_make_wos_tsv_authors__parse_oi(
## table_af = data.table(author_full_name =
## c("Rainbow, Joshua", "Moschou, Despina")))
## "" %>% disambr_make_wos_tsv_authors__parse_oi(
## table_af = data.table(author_full_name = c("Rainbow, Joshua")))
## NA %>% disambr_make_wos_tsv_authors__parse_oi
##' Parses RI column of WoS saved records export
##' @param authors a record string from RI column
##' @param table_af Table
##' @return data.table
##'
##' @md
##' @export
disambr_wos_tsv_parse_ri <- function(authors
, table_af = NULL) {
author_full_name <-
stringi::stri_extract_first_regex(authors, "^[^/]+")
author_researcher_id <-
stringi::stri_extract_first_regex(authors, "(?<=/).+")
authors_table <-
data.table::data.table(author_full_name = author_full_name
, author_researcher_id = author_researcher_id)
## take care of propable case of multiple ID for one person
author_full_name_unique <- unique(authors_table$author_full_name)
author_researcher_id_list <-
lapply(author_full_name_unique
, function(x) {
authors_table$author_researcher_id[
authors_table$author_full_name %in% x]
})
authors_table <-
data.table::data.table(author_full_name = author_full_name_unique
, author_researcher_id = author_researcher_id_list)
if(length(table_af) != 0) {
return(authors_table[table_af
, on = "author_full_name"
, .(author_researcher_id)])
} else {
return(authors_table)
}
}
##' Parses all WoS field related to authors and makes author table
##'
##' Relevant fields that are parsed (as in Web of Science Field Tags 2018-06-27):
##' - AU
##' - AF (full names)
##' - C1 adresses
##' - RP reprint address (one you contact for reprint copy)
##' - EM emails
##' - RI researcher ID
##' - OI ORCID Identifier (Open Researcher and Contributor ID)
##'
##' Fields that are not parsed
##' - BA book - not parsed
##' - BF book - not parsed
##' - CA gp group author (usually organization or group name) - not parsed
##' - BE editors - not parsed
##' @param wos_data_table WoS tsv export data.table
##' @param list_of_author_fields list of data to parse to construct (if not all selected as by default it will save time)
##' @param ... set verbose if needed here
##' @return author table
##'
##' @md
##' @export
disambr_make_wos_tsv_authors <- function(wos_data_table
, list_of_author_fields =
c("author_order"
, "author_short_name"
, "author_initials"
, "author_last_name"
, "author_full_name"
, "author_first_names"
, "author_first_full_name"
, "author_email"
, "author_researcher_id"
, "author_orcid"
, "author_affiliations")
, ...) {
disambr_message_start(start_mess_prefix = "-- making set")
authors_tables <- list()
## AU
if(any(c("author_order"
, "author_short_name"
, "author_last_name"
, "author_initials"
, "author_email") %in% list_of_author_fields) &&
"AU" %in% names(wos_data_table)) {
disambr_message("-- parsing AU field", ...)
authors_tables$au <-
lapply(
stringi::stri_split_fixed(wos_data_table$AU, "; ")
, disambr_wos_tsv_parse_au)
}
## AF
if(any(c("author_full_name"
, "author_first_names"
, "author_first_full_name"
, "author_researcher_id"
, "author_orcid"
, "author_affiliations") %in% list_of_author_fields) &&
"AF" %in% names(wos_data_table)) {
disambr_message("-- parsing AF field", ...)
authors_tables$af <-
lapply(
stringi::stri_split_fixed(wos_data_table$AF, "; ")
, disambr_wos_tsv_parse_af)
}
## RP
if(any(c("author_email") %in% list_of_author_fields) &&
"RP" %in% names(wos_data_table)) {
disambr_message("-- parsing RP field", ...)
## save RP separately as it is different order from AU
rp <-
lapply(wos_data_table$RP
, disambr_wos_tsv_parse_rp)
}
## EM
if(any(c("author_email") %in% list_of_author_fields) &&
all(c("AU", "EM", "RP") %in% names(wos_data_table))) {
disambr_message("-- parsing EM field", ...)
## disambr_wos_tsv_parse_em updates authors_tables$au
## so no need to save it
pbapply::pbmapply(disambr_wos_tsv_parse_em
, stringi::stri_split_fixed(wos_data_table$EM, "; ")
, authors_tables$au
, rp
, SIMPLIFY = FALSE
, USE.NAMES = FALSE)
}
## C1
if(any(c("author_affiliations") %in% list_of_author_fields) &&
all(c("C1", "AF") %in% names(wos_data_table))) {
disambr_message("-- parsing C1 field", ...)
authors_tables$c1 <-
pbapply::pbmapply(disambr_wos_tsv_parse_c1
, wos_data_table$C1
, authors_tables$af
, SIMPLIFY = FALSE)
}
## RI
if(any(c("author_researcher_id") %in% list_of_author_fields) &&
all(c("RI", "AF") %in% names(wos_data_table))) {
disambr_message("-- parsing RI field", ...)
authors_tables$ri <-
pbapply::pbmapply(disambr_wos_tsv_parse_ri
, stringi::stri_split_fixed(wos_data_table$RI, "; ")
, authors_tables$af
, SIMPLIFY = FALSE
, USE.NAMES = FALSE)
}
## OI
if(any(c("author_orcid") %in% list_of_author_fields) &&
all(c("OI", "AF") %in% names(wos_data_table))) {
disambr_message("-- parsing OI field", ...)
authors_tables$oi <-
pbapply::pbmapply(disambr_wos_tsv_parse_oi
, stringi::stri_split_fixed(wos_data_table$OI, "; ")
, authors_tables$af
, SIMPLIFY = FALSE
, USE.NAMES = FALSE)
}
## remove duplicated columns
disambr_message("-- stacking author fields", ...)
authors_tables <-
lapply(authors_tables, data.table::rbindlist, idcol = "paper_id")
authors_table <- do.call(cbind, authors_tables)
authors_table_names <-
stringi::stri_replace_first_regex(names(authors_table), "^[^\\.]+\\.", "")
authors_table_select <- which(!duplicated(authors_table_names))
authors_table_new_names <- authors_table_names[authors_table_select]
authors_table <- authors_table[, authors_table_select, with = FALSE]
## set names
data.table::setnames(authors_table, authors_table_new_names)
## set author attributes
disambr_add_set_attr(authors_table
, wos_data_table
, unit = "person"
, reference = "wos_tsv_publications"
, type = "similar"
, strength = 0.1
, name = "wos_tsv_authors")
#disambr_save_set(authors_table)
disambr_message_finish(mess = "-- finished -")
return(authors_table)
}
my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")
if(file.exists(my.file)) {
dt <- disambr_read_file(my.file)
expect_equal(
length(attributes(disambr_make_wos_tsv_authors(dt))$disambr_set_recipe)
, 5)
}
## test
my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")
my.file <- "inst/testdata/wos-tsv-test-recent.txt"
dt <- disambr_read_file(my.file)
dt %>% attributes
disambr_make_wos_tsv_authors(dt) %>% attributes
## testing dt merge
a <- data.table(name = c("a", "b", "c"), order = c(1,2,3))
b <- data.table(named = c("c", "b", "c"), affil = c("b-adfsa","c-sadfsd"))
cbind(a, b, check.names = FALSE)
##' Parses WoS CR (Cited References) record into separate talbe
##' @param references CR (Cited References) record, i.e., just one row
##' @return references table
##'
##' @export
disambr_wos_tsv_parse_cr <- function(references) {
references_list <- stringi::stri_split_fixed(references, ", ")
references_list <-
lapply(references_list, function(ref) {
first_author_last_name_first_initial <-
stringi::stri_match_first_regex(ref[1], "([a-z ]*[A-Z][^ ]+)\\s+([A-Z])")
ref_tail <- ref[-c(1:3)]
vol <- stringi::stri_extract_first_regex(ref_tail, "(?<=^V)\\d+")
vol <- vol[!sapply(vol, is.na)][1]
page <- stringi::stri_extract_first_regex(ref_tail, "(?<=^P)\\d+")
page <- page[!sapply(page, is.na)][1]
doi <-
stringi::stri_match_first_regex(ref_tail
, "10.\\d{4,9}/[-._;()/:A-Za-z0-9]+")
doi <- doi[!sapply(doi, is.na)][1]
name_year <- paste(first_author_last_name_first_initial[2]
, first_author_last_name_first_initial[3]
, ref[2])
## combine
list(first_author_name = ref[1]
, first_author_last_name = first_author_last_name_first_initial[2]
, first_author_first_initial = first_author_last_name_first_initial[3]
, year = ref[2]
, outlet = ref[3]
, vol = vol
, page = page
, doi = doi
, name_year = name_year)
})
suppressWarnings(data.table::rbindlist(references_list))
}
d <- "Allen C, 2017, ENVIRON SCI-NANO, V4, P741, DOI 10.1039/c7en90014g; Baek YW, 2011, SCI TOTAL ENVIRON, V409, P1603, DOI 10.1016/j.scitotenv.2011.01.014; Baker GL, 2008, TOXICOL SCI, V101, P122, DOI 10.1093/toxsci/kfm243; Bergstrom U, 2015, J TOXICOL ENV HEAL A, V78, P645, DOI 10.1080/15287394.2015.1017682; Bhushan B, 2011, PROG MATER SCI, V56, P1, DOI 10.1016/j.pmatsci.2010.04.003; Biswas P, 2005, J AIR WASTE MANAGE, V55, P708, DOI 10.1080/10473289.2005.10464656; Bitterle E, 2006, CHEMOSPHERE, V65, P1784, DOI 10.1016/j.chemosphere.2006.04.035; Bondarenko O, 2013, ARCH TOXICOL, V87, P1181, DOI 10.1007/s00204-013-1079-4; Bonner J. C., 2003, ENV HLTH PERSPECT, V111, P1289; Brossell D, 2013, J AEROSOL SCI, V63, P75, DOI 10.1016/j.jaerosci.2013.04.012; Clift MJD, 2011, ARCH TOXICOL, V85, P723, DOI 10.1007/s00204-010-0560-6; Cohen J, 2013, NANOTOXICOLOGY, V7, P417, DOI 10.3109/17435390.2012.666576; Cohen JM, 2014, PART FIBRE TOXICOL, V11, DOI 10.1186/1743-8977-11-20; Comouth A, 2013, J AEROSOL SCI, V63, P103, DOI 10.1016/j.jaerosci.2013.04.009"
expect_equal(nrow(disambr_wos_tsv_parse_cr(stringi::stri_split_fixed(d, "; ")[[1]])), 14)
a[[3]]$first_author_name
testing authors harmonization
c("de Sena RC"
, "Viola Roberto"
, "US Environmental Protection Agency"
, "WENG Shi-fu"
, "[Anonymous]"
, "LANGENHEIM JH"
, "van Doosselaere P"
, "USDA"
, "Kosmowska-Ceranowicz Barbara") %>%
## stri_match_first_regex("([a-z ]*[A-Z][^ ]+)\\s+([A-Z])")
stringi::stri_extract_first_regex("[a-z ]*[A-Z][^ ]+")
##' Make references table from WoS tsv data
##' @param wos_data_table WoS data
##' @param ... set verbose here if neededj
##' @return references table
##'
##' @export
disambr_make_wos_tsv_references <- function(wos_data_table
, ...) {
## disambr_message_start()
if("CR" %in% names(wos_data_table)) {
disambr_message("-- parsing references", ...)
references_list <-
pbapply::pblapply(stringi::stri_split_fixed(wos_data_table$CR, "; ")
, disambr_wos_tsv_parse_cr)
disambr_message("-- stacking references", ...)
references_table <-
data.table::rbindlist(references_list, idcol = "paper_id")
disambr_message("-- matching DOI citations", ...)
## assume unique doi
doi_match <- match(references_table$doi
, wos_data_table$doi
, incomparables = NA
, nomatch = NA)
if(any(!is.na(doi_match))){
references_table[, "doi_cited_id" := wos_data_table$id[doi_match]]
} else {
references_table[, "doi_cited_id" := NA]
}
## set references attributes
disambr_add_set_attr(references_table, wos_data_table
, reference = "wos_tsv_publications"
, type = "similar"
, strength = 0.1
, name = "wos_tsv_references"
, collection = "unit_table")
## disambr_message_finish()
return(references_table)
} else {
disambr_stop("THERE IS NOT 'CR' FIELD WITH REFERENCES!")
}
}
my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")
if(file.exists(my.file)) {
dt <- disambr_read_file(my.file)
expect_equal(nrow(disambr_make_wos_tsv_references(dt)), 3760)
}
my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
dt <- disambr_read_file(my.file2)
disambr_make_wos_tsv_references(dt) %>% print
dt[[3]]$doi_cited_id %>% is.na %>% not %>% sum
my.dir.small <- '../data/Journals in Mathematical Psychology'
dt <- my.dir.small %>% disambr_read
dt[[3]]
dt[[3]]$doi_cited_id %>% is.na %>% not %>% sum
dt[[3]]$name_cited_id %>% is.na %>% not %>% sum
##' Makes citations table by matching first author - year keys. This, of course, can produce false positives matched
##' @param pub_table WoS publications data table
##' @param ref_table WoS references data table
##' @return citation table
##'
##' @export
disambr_make_wos_tsv_author_year_citations <- function(pub_table, ref_table) {
## filter before maching
pub_table[, "year" := as.character(PY)]
pub_table <- pub_table[!is.na(first_author_last_name) &
!is.na(first_author_first_initial) &
!is.na(year)
, .(first_author_last_name
, first_author_first_initial
, year
, cited_id = id)]
ref_table <- ref_table[!is.na(first_author_last_name) &
!is.na(first_author_first_initial) &
!is.na(year)
, .(first_author_last_name
, first_author_first_initial
, year
, citing_id = paper_id)]
cit_table <-
merge(pub_table, ref_table
, by = c("first_author_last_name"
, "first_author_first_initial"
, "year"), allow.cartesian = TRUE)
cit_table <- cit_table[, .(citing_id, cited_id
## , first_author_last_name
## , first_author_first_initial
## , year
)]
disambr_add_set_attr(cit_table, ref_table
, name = "wos_tsv_author_year_citations"
, collection = "dyad_table"
, reference = "wos_tsv_publications")
return(cit_table)
}
is.na(c(1,2,NA))
## testing data.table
p <- data.table(c(1,2,3,4)
, b = TRUE
, y = c(7,4,6,3)
, n = c("o", "z", "o", "e")
, DI = c(11,22,33,44))
p[, id := 1:.N]
r <- data.table(paper_id = c(1,1,2,2,4,4)
, b = FALSE
, y = c(2,6,4,8,3,1)
, n = c("a", "c", "z", "o", "e", "o")
, doi = c(88,22,99,55,11,55))
p[r
, on = .(y, n)
, .(DI, doi, citation_id = id, paper_id, b, y, n)]
## test function
my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
dt <- disambr_read(my.file2)
my.dir.small <- '../data/Journals in Mathematical Psychology'
dt <- disambr_read(my.dir.small)
dt[[1]]$first_author_last_name %>% sapply(is.na) %>% which
dt[[1]] %>% nrow
dt[[3]] %>% nrow
disambr_make_wos_tsv_author_year_citations(dt[[1]], dt[[3]])
disambr_make_wos_tsv_author_year_citations(dt[[1]], dt[[3]])$first_author_last_name %>% sapply(is.na) %>% sum
## dt[[1]]$AU[3771]
## dt[[1]]$CR[3514]
disambr_eject_citations_doi <- function(pub_table, ref_table) {
ref_table <- ref_table[!is.na(doi), .(doi, paper_id)]
pub_table <- pub_table[DI != "", .(DI, paper_id)]
pub_table[
, doi := stringi::stri_match_first_regex(DI
, "10.\\d{4,9}/[-._;()/:A-Za-z0-9]+")]
pub_table <- pub_table[!is.na(doi), .(doi, id)]
ref_table[
cited_id := saply doi
]
pub_table[ref_table
, on = doi
, .(citing_id = paper_id
, cited_id = id)]
return(cit_table[!is.na(citing_id)])
}
my.dir.small <- '../data/Journals in Mathematical Psychology'
dt <- my.dir.small %>% disambr_read
dt[[1]]$DI %>% head(1000)
## testing data.table
p <- data.table(c(1,2,3,4)
, b = TRUE
, DI = c(11,22,33,44))
p[, id := 1:.N]
r <- data.table(paper_id = c(1,1,2,2,4,4)
, b = FALSE
, doi = c(88,22,99,55,11,55))
p[r
, on = .(DI=doi)
, .(DI, doi, citation_id = id, paper_id, b)]
## test function
## my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
## dt <- disambr_read(my.file2)
## disambr_eject_citations(dt[[1]], dt[[3]])
## dt[[1]]$DI[11]
## dt[[1]]$DI[9]
## dt[[1]]$DI[20]
## dt[[1]]$CR[11]
## dt[[1]]$CR[9]
## dt[[1]]$CR[20]
## ##' Parses RI (researcher_id) column of WoS saved records export
## ##' @param record_ri a record string from RP column
## ##' @param table_af
## ##' @return Data.table with columns - author_full_name and author_researcher_id
## disambr_wos_tsv_parse_ri <- function(record_ri
## , table_af = NULL) {
## if(isTRUE(record_ri != "")) {
## authors <- stringi::stri_split_fixed(record_ri, "; ")[[1]]
## authors_list <- lapply(authors, function(author) {
## author_split <- stringi::stri_split_fixed(author, "/", n = 2)[[1]]
## list(author_full_name = author_split[1]
## , author_researcher_id = author_split[2])
## })
## authors_table <- data.table::rbindlist(authors_list)
## } else {
## authors_table <- data.table::data.table(author_full_name = character()
## , author_researcher_id = character())
## }
## ## merge with table_af if provided
## if(length(table_af) != 0) {
## return(authors_table[table_af
## , .(author_researcher_id)
## , on = "author_full_name"
## , roll = TRUE])
## } else {
## return(authors_table)
## }
## }
## rbindlist(list("Girabent, Montserrat/B-8536-2008; Maydeu-Olivares, Alberto/B-5178-2010" %>%
## disambr_wos_tsv_parse_ri
## , "" %>%
## disambr_wos_tsv_parse_ri))
## a <- data.table(a1 = c(1,2,2,3,4), a2 = c(11,22,22,33,44))
## b <- data.table(a1 = c(3,2), b2 = c(333,111))
## a[b, on = "a1", roll = TRUE]
## b[a, on = "a1"]
## NA %>%
## disambr_wos_tsv_parse_ri
## ##' Parses OI (ORCID) column of WoS saved records export
## ##' @param record_oi a record string from OI column
## ##' @return Data.table with columns - author_full_name and author_orcid
## disambr_wos_tsv_parse_oi <- function(record_oi
## , table_af = NULL) {
## if(isTRUE(record_oi != "")) {
## authors <- stringi::stri_split_fixed(record_oi, "; ")[[1]]
## authors_list <- lapply(authors, function(author) {
## author_split <- stringi::stri_split_fixed(author, "/", n = 2)[[1]]
## list(author_full_name = author_split[1]
## , author_orcid = author_split[2])
## })
## authors_table <- data.table::rbindlist(authors_list)
## } else {
## authors_table <- data.table::data.table(author_full_name = character()
## , author_orcid = character())
## }
## ## merge with table_af if provided
## if(length(table_af) != 0) {
## return(authors_table[table_af
## , on = "author_full_name"
## , .(author_orcid)
## , roll = TRUE])
## } else {
## return(authors_table)
## }
## }
## "Estrela, Pedro/0000-0001-6956-1146; Maxted, Grace/0000-0002-6816-9107; Rainbow, Joshua/0000-0003-3911-928X; Richtera, Lukas/0000-0002-8288-3999; Moschou, Despina/0000-0001-9175-5852" %>% disambr_wos_tsv_parse_oi
## ----------------------------------------------------------------------------
##' Subsets WoS authors table with Researcher IDs
##'
##' This procedure does not alter sets attributes. Just filters WoS author table.
##'
##' @param sets WoS data
##' @param file_path path to Researcher IDs. The default list of Researcher IDs is taken from Tekles & Bornmann (2019) for reproducing their test samples (see reference)
##' @return updated WoS data
##'
##' @references Tekles, A., & Bornmann, L. (2019). Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches. ArXiv:1904.12746. http://arxiv.org/abs/1904.12746
##'
##' @export
disambr_filter_authors_by_researcher_ids <-
function(sets
, file_path =
system.file("testdata"
, "tekles-bornmann-researcher-ids.txt"
, package = "disambr")) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
disambr_message("- Tekles, A., & Bornmann, L. (2019) researcher IDs")
if(file.exists(file_path)) {
researcher_ids <- readLines(file_path)
} else {
disambr_stop("- Can not find file with researcher IDs. Please, provide")
}
disambr_message("- coping 'wos_tsv_authors' data set")
authors_data_set_index <- disambr_in_sets(sets
, recipe = "wos_tsv_authors"
, match_attr_value_parcially = TRUE)
authors_data_set_index <- which(authors_data_set_index)
if(length(authors_data_set_index) != 1) {
disambr_stop("- can not find 'wos_tsv_authors' data set!")
return(sets)
}
authors_data_set <- sets[[authors_data_set_index]]
## ======================================================================
disambr_message("- filtering authors by Researcher ID")
authors_data_set_new <- authors_data_set[author_researcher_id %in% researcher_ids]
## ======================================================================
mostattributes(authors_data_set_new) <- attributes(authors_data_set)
disambr_set_attr(authors_data_set_new, name = "tekles_bornmann")
## disambr_save_set(authors_data_set_new)
sets[[authors_data_set_index]] <- authors_data_set_new
disambr_message_finish()
return(sets)
}
options(disambr_mess_pretty = TRUE)
dt <- readRDS(file = "../data/wos-slow-export-subset.rds")
attributes(dt[[2]])$disambr_set_recipe[[1]][[1]]
dt %>% sapply(length)
dt.test <-
dt %>% disambr_set_tekles_bornmann
## dt %>% length
## dt[[3]] %>% attributes
## after disambr_set_tekles_bornmann
## dt %>% length
## nrow(dt[[3]])
## 25868 vs 834090
##' Makes sets of co-authors assuming that all authors on paper are different person.
##' @param sets Sests
##' @return Updated sets
##'
##' @export
disambr_split_authors_if_on_the_same_paper <- function(sets) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
## ======================================================================
disambr_message("- spliting co-authors")
output_set <- split(1:nrow(author_data_set)
, author_data_set$paper_id)
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, type = "different"
, strength = 1
, collection = "list_of_lists"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## too slow... and eats all ram
disambr_set_not_on_same_paper <- function(sets
, verbose = FALSE
, save_set_as = TRUE
, data_set_name =
"wos_records_tsv_export_author_table") {
force(sets)
if(verbose) message("disambr: Starting disambr_set_not_on_same_paper...")
if(!is.list(sets)) stop("disambr: 'sets' parameter should be list!")
data_set <-
disambr_subsets(sets
, list(disambr_set_name = data_set_name)
, which_to_return = "first")
if(verbose) message("- spliting co-authors")
return_sets <- data_set %>% {split(1:nrow(.), .$paper_id)}
if(verbose) message("- making combinations of co-authors sets..")
if(length(return_sets) > 50000) stop("--- THE NUMBER OF COMBINATIONS IS TO HIGH!")
return_sets_comb <-
combn(length(return_sets), 2, simplify = FALSE)
if(verbose) message("--- made ,", length(return_sets_comb), " combinations")
if(verbose) message("- expanding combinations")
return_sets <-
pbapply::pblapply(return_sets_comb
, function(comb) {
expand.grid(author_id1 = return_sets[[comb[1]]]
, author_id2 = return_sets[[comb[2]]])
})
if(verbose) message("- rbinding combinations..")
return_sets <- data.table::rbindlist(return_sets)
if(verbose) message("--- rbinded into ", nrow(return_sets), " rows")
## set set's attributes
data_set_recipe <- attr(data_set, "disambr_recipe")
disambr_setattr(return_sets
, disambr_entity = "person"
, disambr_set_type = "similar"
, disambr_set_coefficient = 0.5
, disambr_set_name = "not_on_same_paper"
, disambr_set_collection = "dyads_table"
, disambr_entity_id_reference =
"wos_records_tsv_export_author_table"
, disambr_recipe = c(list("disambr_set_not_on_same_paper")
, data_set_recipe))
if(length(save_set_as) != 0)
return(c(sets, list(return_sets)))
}
##' Makes set of similar authors based on their initials
##' @param sets Sets
##' @param maxDist see max_dist in `match_fuzzy`
##' @return Sets with new appended
##' @export
disambr_merge_authors_with_similar_initials <- function(sets
, maxDist = 1) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
initials_data_set <- toupper(author_data_set$author_initials)
## use only first 2 initials
initials_data_set <- stringi::stri_sub(initials_data_set, to = 2)
input_set <- disambr_get_last_set(sets)
## ======================================================================
disambr_message("- fuzzy matching initials")
## assume all authors will be used in the table
initials_bank <- unique(initials_data_set)
## lets leave NAs
initials_bank <- sort(initials_bank, na.last = TRUE)
initials_match <-
lapply(initials_bank
, function(ini) {
matched_initials <-
stringdist::ain(initials_bank
, ini
, maxDist = maxDist
, method = "lv"
, matchNA = FALSE)
if(any(matched_initials)) {
matched_initials <- initials_bank[matched_initials]
data.table::data.table(
author_initials_1 = ini
, author_initials_2 = matched_initials)
} else NULL
})
initials_match <- data.table::rbindlist(initials_match)
## this is the case if we apply this procedure first
if(attr(input_set, "disambr_set_type") == "different") {
input_set_l <- length(input_set)
disambr_message(paste("- doing combinations on", input_set_l))
## try cluster
## cl <- parallel::makeCluster(20,type="SOCK")
output_set <-
pbapply::pblapply(1:(input_set_l - 1), function(i) {
## combn using is data.table method
comb <-
data.table::CJ(author_id1 = input_set[[i]]
, author_id2 = unlist(input_set[(i+1):input_set_l])
, sorted = FALSE)
## add initials_bank
comb[, `:=`(
author_initials_1 = initials_data_set[author_id1]
, author_initials_2 = initials_data_set[author_id2]
)]
## alternative to the above
## data.table::set(comb, , c("author_initials_1"
## , "author_initials_2")
## , list(initials_data_set[comb$author_id1]
## , initials_data_set[comb$author_id2]))
## check matches
comb <-
data.tables::merge(comb
, initials_match
, by = c("author_initials_1"
, "author_initials_2"))
return(comb[,.(author_id1, author_id2)])
})
##parallel::stopCluster(cl = cl)
disambr_message("- rbinding dyads")
output_set <- data.table::rbindlist(output_set)
## other case is when follow matching last names procedure
} else if(attr(input_set, "disambr_set_type") == "similar") {
output_set <- input_set
## add names
output_set[, `:=`(
author_initials_1 = initials_data_set[author_id1]
, author_initials_2 = initials_data_set[author_id2]
)]
## check matches
output_set <- merge(output_set, initials_match
, by = c("author_initials_1", "author_initials_2"))
output_set <- output_set[,.(author_id1, author_id2)]
## } else if(attr(input_set, "disambr_set_type") == "similar") {
## disambr_message("- subsetting first two initials_bank")
## initials_bank1 <-
## stri_sub(author_data_set$author_initials_bank[input_set$author_id1], to = 2)
## initials_bank2 <-
## stri_sub(author_data_set$author_initials_bank[input_set$author_id2], to = 2)
## disambr_message("- calculating distance b/w initials_bank")
## dist <- stringdist(initials_bank1
## , initials_bank2
## , method = "lv")
## output_set <- input_set[dist < 2]
} else disambr_stop("- UNKNOWN INPUT_SET_NAME!")
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 0.5
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:100]
## mostattributes(dt[[4]]) <- dt_atributes
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE) %>%
## disambr_set_similar_initials_bank(verbose = TRUE)
##' Makes set of similar authors based on their last names
##' @param sets Sets
##' @param max_dist see max_dist in `match_fuzzy`
##' @param max_dist_short max_dist for short last names, default 0
##' @param min_length who is short names defined, default 0 which means do not condider short names
##' @return Sets with new appended
##' @export
disambr_merge_authors_with_similar_last_names <- function(sets
, max_dist = 1
, max_dist_short = 0
, min_length = 0) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
last_name_data_set <- toupper(author_data_set$author_last_name)
input_set <- disambr_get_last_set(sets)
input_set_l <- length(input_set)
## ======================================================================
disambr_message("- fuzzy matching last names")
## assume all authors will be used in the table
last_names_bank <- unique(last_name_data_set)
## lets leave NAs
last_names_bank <- sort(last_names_bank, na.last = TRUE)
## treshholds
last_names_bank_short <- nchar(last_names_bank) <= min_length
last_names_match <- match_fuzzy(last_names_bank[!last_names_bank_short]
, method = "dl"
, max_dist = max_dist
, id_name = "author_last_name")
if(any(last_names_bank_short)) {
last_names_match <- rbind(
last_names_match
, match_fuzzy(last_names_bank[last_names_bank_short]
, method = "dl"
, max_dist = max_dist_short
, id_name = "author_last_name"))
}
## ----------------------------------------------------------------------
if(attr(input_set, "disambr_set_type") == "different") {
disambr_message(paste("- doing combinations on", input_set_l))
output_set <-
pbapply::pblapply(1:(input_set_l-1), function(i) {
## this is data.table method
combs <-
data.table::CJ(author_id1 = input_set[[i]]
, author_id2 =
unlist(input_set[(i + 1) : input_set_l])
, sorted = FALSE)
## add names
combs[, `:=`(
author_last_name_1 = last_name_data_set[author_id1]
, author_last_name_2 = last_name_data_set[author_id2]
)]
## check matches
combs <-
merge(combs, last_names_match
, by = c("author_last_name_1", "author_last_name_2"))
return(combs[,.(author_id1, author_id2)])
})
disambr_message("- rbinding dyads")
output_set <- data.table::rbindlist(output_set)
} else if(attr(input_set, "disambr_set_type") == "similar") {
disambr_message(paste("- fuzzy matching authors by last name"))
output_set <- input_set
## add names
output_set[, `:=`(
author_last_name_1 = last_name_data_set[author_id1]
, author_last_name_2 = last_name_data_set[author_id2]
)]
## check matches
output_set <- merge(output_set, last_names_match
, by = c("author_last_name_1", "author_last_name_2"))
output_set <- output_set[,.(author_id1, author_id2)]
} else disambr_stop("- UNKNOWN INPUT_SET_NAME!")
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 0.5
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
tests
dt %>% length
dt[[2]]$author_initials %>% stri_sub(to = 2) %>% unique %>% length
## 562
dt[[2]]$author_initials %>% length
## 24714
24714^2
8818^2
dt[[2]]$author_last_name %>% length
## 24714
dt[[2]]$author_last_name %>% toupper %>% unique %>% length
## 8818
stringdist::amatch(c("sdfasd", NA, "sei;n;h", "sdfaij'[ksadf", "dpoj'sdf")
, c("sdfas", NA)
, maxDist = 2
, method = "dl"
, matchNA = FALSE
, nomatch = NA) %>% extract(!is.na(.))
sort(c("sdfasd", NA, "sei;n;h", "sdfaij'[ksadf", "dpoj'sdf"), na.last = TRUE)
maxDist
amatch
## full
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE) %>%
## disambr_set_similar_last_names(verbose = TRUE)
## partial
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)
disambr_read("../data/new_export/savedrecs-ms-recent.txt") %>%
disambr_set_on_same_paper() %>%
disambr_set_similar_last_names(max_dist = 1
, max_dist_short = 1
, min_length = 2) %>%
extract2(6)
## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:100]
## mostattributes(dt[[4]]) <- dt_atributes
## dt[[4]] %>% attributes
## dt %>% disambr_set_similar_last_names(verbose = TRUE)
rbindlist(list(list(1,2,3), list(2,3,4), NULL))
##' Makes set of matched authors bases on same email addresses
##' @param sets Sets
##' @return Sets with new attached
##' @export
disambr_merge_authors_with_same_emails <- function(sets) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
## emails case insensitive
email_data_set <- toupper(author_data_set$author_email)
input_set <- disambr_get_last_unstrong_set(sets)
## ======================================================================
disambr_message("- checking emails")
output_set <-
email_data_set[input_set$author_id1] ==
email_data_set[input_set$author_id2]
output_set <- input_set[sapply(output_set, isTRUE)]
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 1
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## ## full
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE) %>%
## disambr_set_similar_last_names(verbose = TRUE)
## partial
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)
## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:1000]
## mostattributes(dt[[4]]) <- dt_atributes
## dt <-
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE) %>%
## disambr_set_similar_initials(verbose = TRUE) %>%
## disambr_set_same_email(verbose = TRUE)
## dt[[7]]
## dt[[3]][26]$author_email
## dt[[3]][90]$author_email
## dt[[3]][2]$author_email
## dt[[3]][264]$author_email
## dt[[3]][406]$author_email
##' Makes set of matched authors based on same affiliation
##' @param sets Sets
##' @return Sets with new attached
##' @export
disambr_merge_authors_with_same_affiliation <- function(sets) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
input_set <- disambr_get_last_unstrong_set(sets)
## ======================================================================
disambr_message("- checking overlapping affiliations")
affiliations1 <- author_data_set$affiliations[input_set$author_id1]
affiliations1 <- lapply(affiliations1, toupper)
## affiliations1 <- ifelse(is.na(affiliations1), NULL, affiliations1)
affiliations2 <- author_data_set$affiliations[input_set$author_id2]
affiliations2 <- lapply(affiliations2, toupper)
## affiliations2 <- ifelse(is.na(affiliations2), NULL, affiliations2)
affiliations_match <-
mapply(function(a1, a2) {
any(match(a1, a2, incomparables = NA, nomatch = 0) > 0)
}
, affiliations1
, affiliations2)
output_set <- input_set[affiliations_match]
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 1
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## full
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE) %>%
## disambr_set_similar_last_names(verbose = TRUE)
## partial
## dt <- readRDS(file = "my.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)
## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:100]
## mostattributes(dt[[4]]) <- dt_atributes
## dt.new <-
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE) %>%
## disambr_set_same_affiliation(verbose = TRUE, input_set_name = "similar_last_names")
## dt.new[[6]] %>% head
## dt[[3]][96]$affiliations
## dt[[3]][31]$affiliations
## dt[[3]][2]$author_email
## dt[[3]][264]$author_email
## dt[[3]][406]$author_email
##' Make a set of matched authors bases on the cases when one author cites the others paper.
##' @param sets Sets
##' @param match_refrerences_by_name_year Whether to check citations based on first author name and year pair in addition to machich citations based on DOI
##' @return
##' @export
disambr_merge_authors_if_citing_others_papers <- function(sets
, match_refrerences_by_name_year = TRUE) {
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
reference_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_references")
citations_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_author_year_citations")
input_set <- disambr_get_last_unstrong_set(sets)
## ======================================================================
disambr_message("- checking if author sites other author's paper")
## TODO: Add papers that were already matched previously
## get paper ids
input_set[, `:=`(
paper_ids_1 = author_data_set$paper_id[author_id1]
, paper_ids_2 = author_data_set$paper_id[author_id2])]
match_list <- list()
match_list$doi_1 <-
merge(input_set, reference_data_set,
, by.x = c("paper_ids_1", "paper_ids_2")
, by.y = c("paper_id", "doi_cited_id"))[, .(author_id1, author_id2)]
match_list$doi_2 <-
merge(input_set, reference_data_set,
, by.x = c("paper_ids_2", "paper_ids_1")
, by.y = c("paper_id", "doi_cited_id"))[, .(author_id1, author_id2)]
if(match_refrerences_by_name_year) {
match_list$name_1 <-
merge(input_set, citations_data_set,
, by.x = c("paper_ids_1", "paper_ids_2")
, by.y = c("citing_id", "cited_id"))[, .(author_id1, author_id2)]
match_list$name_2 <-
merge(input_set, citations_data_set,
, by.x = c("paper_ids_2", "paper_ids_1")
, by.y = c("citing_id", "cited_id"))[, .(author_id1, author_id2)]
}
output_set <- data.table::rbindlist(match_list)
output_set <- unique(output_set)
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 1
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## partial
## dt <- readRDS(file = "my.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)
## dt.test <-
## dt %>%
## disambr_set_on_same_paper %>%
## disambr_set_similar_last_names
## dt.test %>% disambr_set_cite_others_paper %>% extract2(7)
## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:2000]
## mostattributes(dt[[4]]) <- dt_atributes
## dt.short <-
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE)
## dt.short.test <-
## dt.short %>%
## disambr_set_cite_others_paper
## dt.short.test[[5]] %>% sum
## https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
## https://www.crossref.org/blog/dois-and-matching-regular-expressions/
## "/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i"
## testing data.table
## p <- data.table(id = c(1,2,3,4)
## , b = TRUE
## , y = c(7,4,6,3)
## , n = c("o", "z", "o", "e")
## , DI = c(11,22,33,44))
## r <- data.table(id = c(6)
## , b = FALSE
## , y = c(2,6,4,8,3,1)
## , n = c("a", "c", "z", "o", "e", "o")
## , doi = c(88,22,99,55,11,55))
## merge(p,r,by = "id")
## p[r
## , on = .(y, n)
## , .(DI, doi, , paper_id, b, y, n)]
##' Make a set of matched authors bases on the cases when one author cites the others paper.
##' @param sets Sets
##' @param match_refrerences_by_name_year Whether to check citations based on first author name and year pair in addition to machich citations based on DOI
##' @return
##' @export
disambr_merge_authors_if_citing_others_papers <- function(sets
, match_refrerences_by_name_year = TRUE) {
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
reference_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_references")
citations_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_author_year_citations")
input_set <- disambr_get_last_unstrong_set(sets)
## ======================================================================
disambr_message("- checking if author sites other author's paper")
## TODO: Add papers that were already matched previously
## get paper ids
input_set[, `:=`(
paper_ids_1 = author_data_set$paper_id[author_id1]
, paper_ids_2 = author_data_set$paper_id[author_id2])]
match_list <- list()
match_list$doi_1 <-
merge(input_set, reference_data_set,
, by.x = c("paper_ids_1", "paper_ids_2")
, by.y = c("paper_id", "doi_cited_id"))[, .(author_id1, author_id2)]
match_list$doi_2 <-
merge(input_set, reference_data_set,
, by.x = c("paper_ids_2", "paper_ids_1")
, by.y = c("paper_id", "doi_cited_id"))[, .(author_id1, author_id2)]
if(match_refrerences_by_name_year) {
match_list$name_1 <-
merge(input_set, citations_data_set,
, by.x = c("paper_ids_1", "paper_ids_2")
, by.y = c("citing_id", "cited_id"))[, .(author_id1, author_id2)]
match_list$name_2 <-
merge(input_set, citations_data_set,
, by.x = c("paper_ids_2", "paper_ids_1")
, by.y = c("citing_id", "cited_id"))[, .(author_id1, author_id2)]
}
output_set <- data.table::rbindlist(match_list)
output_set <- unique(output_set)
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 1
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## partial
## dt <- readRDS(file = "my.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)
## dt.test <-
## dt %>%
## disambr_set_on_same_paper %>%
## disambr_set_similar_last_names
## dt.test %>% disambr_set_cite_others_paper %>% extract2(7)
## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:2000]
## mostattributes(dt[[4]]) <- dt_atributes
## dt.short <-
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE)
## dt.short.test <-
## dt.short %>%
## disambr_set_cite_others_paper
## dt.short.test[[5]] %>% sum
## https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
## https://www.crossref.org/blog/dois-and-matching-regular-expressions/
## "/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i"
## testing data.table
## p <- data.table(id = c(1,2,3,4)
## , b = TRUE
## , y = c(7,4,6,3)
## , n = c("o", "z", "o", "e")
## , DI = c(11,22,33,44))
## r <- data.table(id = c(6)
## , b = FALSE
## , y = c(2,6,4,8,3,1)
## , n = c("a", "c", "z", "o", "e", "o")
## , doi = c(88,22,99,55,11,55))
## merge(p,r,by = "id")
## p[r
## , on = .(y, n)
## , .(DI, doi, , paper_id, b, y, n)]
##' Make a set of matched authors that share co-authors
##'
##' - First we index dyads of papers for matched authors by indexing elements in an upper triangle in square paper by paper matrix (see `get_upper_triangle_index` function)
##' - Then we group and count matched author dyads that are associated with the same paper dyad index. If there are no duplicates in authors ids then it would be the number of co-shared co-authors but there is an issue when we try to match same author name to several author names on the other paper (next steps meant to fix this issue)
##' - Within these groups of same paper dyads we count same authors ids on each paper to access the number of open triads (when the same author is matched to two different authors from the same paper) for every author dyad in a group (Nid1 + Nid2 - 2)
##' The algorithm for matching authors based on shared co-authors is the following:
##' - Finally, we filter matched author dyads based on the difference between number of paper dyads and number of open triangles for authors (records_per_paper - open_triangles > 1). Also see `min_number_of_shared_coauthors`
##'
##' @param sets Sets of matched author names dyads
##' @param min_number_of_shared_coauthors Minimum number of co-authors that should be shared in order for author names to be cosidered as matched/merged
##' @return Original sets with table of matched author dyads appended to it
##' @export
##' @md
disambr_match_authors_if_sharing_coauthors <-
function(sets
, min_number_of_shared_coauthors = 1) {
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
paper_id_max <- max(author_data_set$paper_id)
input_set <- disambr_get_last_unstrong_set(sets)
## ======================================================================
disambr_message("- filtering cases with shared co-authors")
## get paper ids
output_set <-
input_set[
, paper_dyad_ids :=
mapply(get_upper_triangle_index
, author_data_set$paper_id[author_id1]
, author_data_set$paper_id[author_id2]
, paper_id_max)
][
, `:=`(records_per_paper = .N
, open_triangles =
as.vector(table(author_id1)[as.character(author_id1)]) +
as.vector(table(author_id2)[as.character(author_id2)]) - 2)
, keyby = paper_dyad_ids
][records_per_paper - open_triangles > min_number_of_shared_coauthors
, .(author_id1, author_id2)
]
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 1
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## TESTS
## data.table(
## a1 = c(1,1,1, 8,10, 200, 400)
## , a2 = c(5,2,3, 2, 50 , 300, 500)
## , p1 = c(1,1,10,1,3,4,5)
## , p2 = c(10,10,1,10,8,5,4))[
## , paper_dyad_ids :=
## mapply(get_upper_triangle_index
## , p1
## , p2
## , max(c(p1, p2)))
## ][
## , `:=`(records_per_paper = .N
## , open_triangles =
## as.vector(table(a1)[as.character(a1)]) +
## as.vector(table(a2)[as.character(a2)]) - 2)
## , keyby = paper_dyad_ids
## ][, criteria := records_per_paper - open_triangles - 1][]
##' Make set of authors that have number of references in common
##' @param sets Sets
##' @param references_in_common number of references in common
##' @return Sets with new set
##' @export
disambr_merge_authors_with_common_references <- function(sets
, references_in_common = 3) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
reference_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_references")
citation_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_author_year_citations")
input_set <- disambr_get_last_unstrong_set(sets)
## ======================================================================
disambr_message("- checking references in common")
## TODO: Add papers that were already matched previously
input_set[, `:=`(
paper_ids_1 = author_data_set$paper_id[author_id1]
, paper_ids_2 = author_data_set$paper_id[author_id2]
)]
paper_ids_set <- unique(input_set[,.(paper_ids_1, paper_ids_2)])
## for blade
if(.Platform$OS.type == "windows") {
cl <-
parallel::makePSOCKcluster(
round(parallel::detectCores() * .70))
output_set <-
parallel::parLapply(
cl, 1:nrow(paper_ids_set)
, function(i) {
id1 <- paper_ids_set$paper_ids_1[i]
id2 <- paper_ids_set$paper_ids_2[i]
common_refs <-
match(reference_data_set[paper_id == id1, c(doi_cited_id)]
, reference_data_set[paper_id == id2, c(doi_cited_id)]
, nomatch = 0
, incomparables = NA)
common_refs <- sum(common_refs > 0)
if(common_refs < references_in_common) {
name_common_refs <-
match(citation_data_set[citing_id == id1, c(cited_id)]
, citation_data_set[citing_id == id2, c(cited_id)]
, nomatch = 0
, incomparables = NA)
name_common_refs <- sum(name_common_refs > 0)
common_refs <- common_refs + name_common_refs
if(common_refs < references_in_common) {
return(FALSE)
} else {
return(TRUE)
}
} else return(TRUE)
}
)
parallel::stopCluster(cl)
} else {
output_set <-
pbmapply(function(id1, id2) {
common_refs <-
match(reference_data_set[paper_id == id1, c(doi_cited_id)]
, reference_data_set[paper_id == id2, c(doi_cited_id)]
, nomatch = 0
, incomparables = NA)
common_refs <- sum(common_refs > 0)
if(common_refs < references_in_common) {
name_common_refs <-
match(citation_data_set[citing_id == id1, c(cited_id)]
, citation_data_set[citing_id == id2, c(cited_id)]
, nomatch = 0
, incomparables = NA)
name_common_refs <- sum(name_common_refs > 0)
common_refs <- common_refs + name_common_refs
if(common_refs < references_in_common) return(FALSE) else return(TRUE)
} else return(TRUE)
}
, paper_ids_set$paper_ids_1
, paper_ids_set$paper_ids_2)
}
output_set <- paper_ids_set[output_set]
output_set <- merge(output_set, input_set, by = c("paper_ids_1", "paper_ids_2"))
output_set <- output_set[,.(author_id1, author_id2)]
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 1
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## partial
## dt <- readRDS(file = "my.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)
## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:2000]
## mostattributes(dt[[4]]) <- dt_atributes
## dt.test %>% disambr_set_common_references %>% extract2(7)
## dt.short <-
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE)
## dt.short.test <-
## dt.short %>%
## disambr_set_common_references
## dt.short.test[[5]] %>% sum
## https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
## https://www.crossref.org/blog/dois-and-matching-regular-expressions/
## "/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i"
## a <- data.table(a = c(1,2,3,4), b = c(11,22,33,44))
## a[a %in% c(2,3), c(b)]
## c(NA,NA,1) %in% c(32,3,1,3, NA)
##' Make a set of matched authors based on cases when one author cites others self citation. Self-citations here are detected based on DOI.
##' @param sets Sets
##' @return Sets with new appended
##' @export
disambr_merge_authors_if_citing_self_citation <- function(sets) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
reference_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_references")
strong_set <- disambr_get_strong_set(sets)
input_set <- disambr_get_last_unstrong_set(sets)
## ======================================================================
## TODO: check named citations
disambr_message("- checking if author cites a self-citation of other")
check_self_citations <- function(id1, id2) {
own_papers <-
author_data_set$paper_id[
unique(strong_set[author_id1 %in% id1 |
author_id2 %in% id1
, c(author_id1, author_id2)])]
## mach own papers to own citations
self_citations <-
match(own_papers
, reference_data_set[paper_id %in% own_papers, c(doi_cited_id)]
, nomatch = 0
, incomparables = NA) > 0
self_citations <- own_papers[self_citations]
cite_self_citations <-
match(reference_data_set[paper_id == id2, c(doi_cited_id)]
, own_papers
, nomatch = 0
, incomparables = NA) > 0
return(any(cite_self_citations))
}
## blade option
if(.Platform$OS.type == "windows") {
cl <- parallel::makePSOCKcluster(round(parallel::detectCores() * .70))
output_set <-
parallel::parLapply(
cl, 1:nrow(input_set)
, function(i){
id1 <- input_set$author_id1[i]
id2 <- input_set$author_id2[i]
if(check_self_citations(id1, id2)) {
return(TRUE)
} else if(check_self_citations(id2, id1)) {
return(TRUE)
} else {
return(FALSE)
}
})
parallel::stopCluster(cl)
} else {
output_set <-
pbmapply(function(id1, id2) {
if(check_self_citations(id1, id2)) {
return(TRUE)
} else if(check_self_citations(id2, id1)) {
return(TRUE)
} else {
return(FALSE)
}
}
, input_set$author_id1
, input_set$author_id2)
}
output_set <- input_set[output_set]
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 1
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## partial
## dt <- readRDS(file = "my.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)
## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:2000]
## mostattributes(dt[[4]]) <- dt_atributes
## dt.test.plus <-
## dt.test %>%
## disambr_set_common_references
## dt.test.plus[[7]]
## dt.test.plus %>%
## disambr_set_cite_self_citation %>% extract2(8)
## 7 out of 416 pairs matched
## dt.test.plus[[2]]$author_name[c(923
## , 2353
## , 2403
## , 2464
## , 2525
## , 4234
## , 4420
## , 7675)]
## not very accurate
## [1] "COHEN, AS" "COHEN, AS" "COHEN, AS" "COHEN, AS" "COHEN, AS" "Yap, MJ"
## [7] "Mayes, AR" "Burns, GN"
## dt.short <-
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE)
## dt.short.test <-
## dt.short %>%
## disambr_set_cite_self_citation
## dt.short.test[[5]] %>% sum
## https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
## https://www.crossref.org/blog/dois-and-matching-regular-expressions/
## "/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i"
## a <- data.table(a = c(1,2,3,4), b = c(11,22,33,44))
## a[a %in% c(2,3), c(b)]
## c(NA,NA,1) %in% c(32,3,1,3, NA)
##' Makes set of authors with number of keywords in their papers in common
##' @param sets Sets
##' @param keywords_in_common number of keywords in common
##' @return Sets with new set appended
##'
##' @export
disambr_merge_authors_with_common_keywords <- function(sets
, keywords_in_common = 3) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
publication_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_publications")
input_set <- disambr_get_last_unstrong_set(sets)
## ======================================================================
disambr_message("- checking common keywords (Author Keywords)")
keywords_1 <-
publication_data_set[author_data_set[input_set$author_id1, c(paper_id)], c(DE)]
keywords_1 <- stringi::stri_split_fixed(keywords_1, "; ")
keywords_2 <-
publication_data_set[author_data_set[input_set$author_id2, c(paper_id)], c(DE)]
keywords_2 <- stringi::stri_split_fixed(keywords_2, "; ")
keywords_matched <-
pbmapply(function(k1, k2) {
sum(match(k1, k2, incomparables = c(NA, ""), nomatch = 0) > 0)
}
, keywords_1
, keywords_2)
output_set <- input_set[keywords_matched >= keywords_in_common]
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 1
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
## ## full
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE) %>%
## disambr_set_similar_last_names(verbose = TRUE)
## partial
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)
## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:1000]
## mostattributes(dt[[4]]) <- dt_atributes
## dt.test %>% disambr_set_common_keywords %>% extract2(7)
## dt.test[[1]]$DE[c(1113
## , 8390
## , 8438
## , 8455)]
## dt <-
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE) %>%
## disambr_set_similar_initials(verbose = TRUE) %>%
## disambr_set_common_keywords(verbose = TRUE)
## dt[[7]]
## dt[[3]][26]$author_email
## dt[[3]][90]$author_email
## dt[[3]][2]$author_email
## dt[[3]][264]$author_email
## dt[[3]][406]$author_email
##' For testing. Makes so called ground truth set based on same research IDs
##' @param sets Sets
##' @return Sets with new set appended
##'
##' @export
disambr_merge_authors_with_same_researcher_ids <- function(sets) {
force(sets)
disambr_message_start()
if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
## check if output set is ready
output_set <- disambr_get_output_set(sets)
if(!is.null(output_set)) return(sets)
output_set <- disambr_read_output_set()
if(!is.null(output_set)) return(c(sets, list(output_set)))
## ----------------------------------------------------------------------
author_data_set <-
disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
ri_data_set <- author_data_set$author_researcher_id
ri_bank <- unique(unlist(ri_data_set))
ri_bank <- sort(ri_bank) # removes NAs
## ======================================================================
disambr_message("- expanding grid and cheching researcher IDs")
## this is fast combn
combi <- function(vect)
{
l <- length(vect)
if(l == 1) return()
first <- rep(vect, (l-1):0)
vectR <- rev(vect)
second <- vectR[rev(sequence(1:(l-1)))]
combi <- data.table(first, second)
return(combi)
}
output_set <-
pbapply::pblapply(ri_bank, function(ri) {
same_ri <- sapply(ri_data_set, function(i) ri %in% i)
same_ri <- which(same_ri)
combi(same_ri)
})
output_set <- data.table::rbindlist(output_set)
## ======================================================================
disambr_add_set_attr(output_set, author_data_set
, strength = 10
, collection = "dyad_table"
, reference = "wos_tsv_authors")
disambr_save_set(output_set)
disambr_message_finish()
return(c(sets, list(output_set)))
}
data.table::CJ(id1 = c(1,2,3)
, id2 = c(1,2,3)
, sorted = FALSE
, unique = TRUE)
combi <- function(vect)
{
l <- length(vect)
first <- rep(vect, (l-1):0)
vectR <- rev(vect)
second <- vectR[rev(sequence(1:(l-1)))]
combi <- data.table(first, second)
return(combi)
}
v <- 1:100
microbenchmark(combi(v), combn(v,2))
options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)
getOption("disambr_get_output_set")
my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
ts_eva_test_1 <-
disambr_read(my.file2)
my.dir.small <- '../data/Journals in Mathematical Psychology'
ts_eva_test <-
disambr_read(my.dir.small, )
ts_eva[[]]
ts_eva_full <-
disambr_read("../data/wos-slow-export-subset"
, save_sets_as = "wos-slow-export-subset.rds"
, save_sets_dir = "../data") %>%
disambr_set_tekles_bornmann %>% #59149 vs_1962896
disambr_set_on_same_paper %>%
disambr_set_similar_initials %>%
disambr_set_similar_last_names %>%
disambr_set_same_email %>%
disambr_set_same_affiliation %>%
disambr_set_cite_others_paper %>%
disambr_set_common_references %>%
disambr_set_cite_self_citation %>%
disambr_set_common_keywords %>%
disambr_set_same_researcher_ids
ts_eva <- ts_eva_test
ts_eva_sabe <- ts_eva
ts_eva %<>%
disambr_set_on_same_paper %>%
disambr_set_similar_initials %>%
disambr_set_similar_last_names %>%
disambr_set_same_email %>%
disambr_set_same_affiliation %>%
disambr_set_cite_others_paper %>%
disambr_set_common_references %>%
disambr_set_cite_self_citation %>%
disambr_set_common_keywords %>%
disambr_set_same_researcher_ids
ts_eva %>%
disambr_get_first_data_set(recipe = "wos_tsv_") %>%
names
ts_eva %>% length
ts_eva[[14]] <- NULL
disambr_set_cite_others_paper
## disambr_set_common_keywords
## ts_eva[[1]]$DE[c(ts_eva[[2]]$paper_id[c(12484, 18298
## , 1281, 1337)])]
## disambr_set_common_references
## disambr_set_same_affiliation
## ts_eva[[2]]$affiliations[c(
## 1254
## , 7361
## , 11379
## , 241
## , 7042)]
## disambr_set_same_email
## ts_eva[[2]]$author_email[c(
## 3
## , 18458
## , 18466
## , 10
## , 16
## , 24564
## , 24641)]
## disambr_set_similar_last_names
## ts_eva[[2]]$author_name[c(
## 1254
## , 1369
## , 6133
## , 7323
## , 7361
## , 7724
## , 2271
## , 2704
## , 17859)]
ts_eva_test[[2]]$author_initials[]
attr(ts_eva_test[[5]], "disambr_set_type")
disambr_set_cite_self_citation %>%
disambr_set_common_keywords
##' Make a statistics for a collection of sets (specific algorithm) and writes to the file in `sets_dir`
##'
##' @param sets Collection of sets
##' @param sets_dir If first arg is not provided search for collection of sets in this directory. The results are saved to this directory as well.
##' @param name Name of the collection of sets
##' @param remove_data whether to remove first 4 sets (assumed to be a data) when sets arg is set
##' @param save_rds save_rds
##' @return
##' @export
disambr_stats <- function(sets = NULL
, sets_dir = getOption("disambr_save_set_dir")
, name = sets_dir
, remove_data = TRUE
, save_rds = TRUE) {
if(is.null(sets)){
## reads sets from directory
sets <-
lapply(dir(sets_dir
, full.names = TRUE
, pattern = "disambr-set\\..*\\.rds$")
, readRDS)
} else if(remove_data){
## remove data from the sets
sets <- sets[-c(1:4)]
}
## get duration
sets_attr <-
lapply(sets, attributes)
sets_attr_duration <-
lapply(sets_attr, `[[`, "disambr_set_duration")
sets_attr_duration <-
unlist(sapply(sets_attr_duration, as.numeric, units = "mins"))
dur_mins <- sum(sets_attr_duration)
## get sets
truth <-
disambr_get_truth_set(sets)
strong <-
disambr_get_strong_set(sets)
input <- disambr_get_input_set(sets)
## mirror pairs for comparison (this assumes that sets are data.tables)
strong <-
unique(rbind(strong
, strong[, .(author_id2
, author_id1)]))
truth <-
unique(rbind(truth
, truth[, .(second
, first)]))
truth <- truth[, .(author_id1 = first
, author_id2 = second)]
## True positive (a)
a <- nrow(data.table::fintersect(truth, strong))
## False positive (b)
b <- nrow(data.table::fsetdiff(strong, truth))
## False negative (c)
c <- nrow(data.table::fsetdiff(truth, strong))
## True negative (d) (ref:bug)
d <- nrow(sets[[2]]) * (nrow(sets[[2]]) - 1) -
nrow(data.table::funion(truth, strong))
pw_presision <- a / (a + b)
pw_recall <- a / (a + c)
pw_f1 <- (2 * pw_presision * pw_recall) / (pw_presision + pw_recall)
pw_accuracy <- (a + d) / (a + b + c + d)
stats <-
list(name = name
, true_positives = a
, false_positives = b
, false_negatives = c
, true_negatives = d
, pw_presision = pw_presision
, pw_recall = pw_recall
, pw_f1 = pw_f1
, pw_accuracy = pw_accuracy
, dur_mins = dur_mins
, dur_sets = list(sets_attr_duration))
## save on disk
if(save_rds){
saveRDS(stats, paste0(sets_dir,"/", "disambr_stats.rds"))
}
return(stats)
}
##' Makes comparative table with statistics
##' @param sets_dir dir
##' @return
##'
##' @md
##' @export
disambr_stats_table <- function(sets_dir){
files <- dir(sets_dir
, recursive = TRUE
, full.names = TRUE
, pattern = "^disambr_stats\\.rds$")
stats_list <- lapply(files, readRDS)
data.table::rbindlist(stats_list)
}
my.dir <- '../data'
my.dir.small <- '../data/Journals in Mathematical Psychology'
my.dir.large <- '/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments'
my.dir.huge <- '/mnt/md5/data/wos'
my.file <- '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt'
my.file1 <- "/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments/LN Public NAICS records from 10001 to 10500.txt"
my.files <-
c('../data/Journals in Mathematical Psychology/Applied Measurement in Education.txt'
, '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt')
## ----------------------------------------------------------------------------
my.dat <- disambr.read(my.dir)
my.dat <- disambr.read(my.file)
attributes(my.dat[[1]])
my.dat <- disambr.read("../data/new_export")
my.dat[[1]]$RP[1:4]
my.dat[[1]]$EM[1:4]
my.dat <-
my.dat %>%
disambr_set_not_on_same_paper
my.dat[[2]]
my.dat %>%
disambr_set_not_on_same_paper %>%
disambr_set_similar_initials %>%
disambr_set_similar_last_names
dat <- disambr.read(my.file)
dat %>% extract(1) %>% disambr_set_not_on_same_paper
## new testing
d <- disambr.read("../data/wos-researchers-ids")
d <- d[[1]][1:1000,] %>% list
d.done <-
d %>%
disambr_set_not_on_same_paper %>%
disambr_set_similar_initials %>%
d.done %>% length
d.done2 %>% length
d.done2[[4]] %>% nrow
d.done2 <-
d.done %>%
disambr_set_similar_last_names
saveRDS(d.done2, "../data/d.done2.rds")
disambr.eva <- function(data) {
data %>%
disambr_set_not_on_same_paper %>%
disambr_set_similar_initials %>%
disambr_set_similar_last_names
}
## Usage
disambr.eva(data)
disambr_set_not_on_same_paper <- disambr.define.procedure(data %>%
get(publication) %>%
for.each %>%
get(person = author))
## or
disambr_set_not_on_same_paper <- disambr.define.procedure(data$
publication$
person(author))
list(data = my.dat
, similar.initials = set.similar.initials) %>%
disambr_set_similar_last_names
set.different.authors <- disambr_set_not_on_same_paper(my.dat)
set.similar.initials <-
list(data = my.dat
, different.authors = set.different.authors) %>%
disambr_set_similar_initials
## 2020-08-25
## problem with large integers
bit64
return_sets_comb <-
combn(1:100000, 2, simplify = FALSE)
10,000,000,000
return_sets_comb %>% length
install.packages("bit64")
as.integer64
library("bit64")
.Machine$integer.max
[1] 2147483647
combn(3, 2, simplify = FALSE)
my.dir.small <- '../data/Journals in Mathematical Psychology'
my.dir.large <- '/run/media/stas/ed6193a5-e4fd-4d83-9eb6-e481c39aeb8f/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments'
my.dir.wos <- "~/home/web of science"
my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
my.file2 <- "../data/wos-researchers-ids"
dt <- disambr_read(my.dir.wos) %T>%
saveRDS(file = "my.wos.rds", compress = FALSE)
dt <- readRDS(file = "my.dir.wos.rds") %>%
disambr_set_not_on_same_paper(verbose = TRUE)
dt <- disambr_read(my.dir.large) %T>%
saveRDS(file = "my.dir.large.rds", compress = FALSE) %>%
disambr_set_not_on_same_paper(verbose = TRUE)
dt <- readRDS(file = "my.dir.large.rds") %>%
disambr_set_not_on_same_paper(verbose = TRUE)
dt[[4]]
Testing help
options(browser="firefox")
help(disambr.read, help_type = "html")