Skip to content

Latest commit

 

History

History
5323 lines (4579 loc) · 191 KB

disambr.src.org

File metadata and controls

5323 lines (4579 loc) · 191 KB

disabmr - Named Entity Disambiguation in R

Description

:export_options+: author:nil :export_options+: title:nil :export_options+: tex:nil :export_options+: ^:nil

disambr is an R 📦 that provides a flexible framework for disambiguation of named entities. Currently this package implements the AEV algorithm (van den Akker et al., 2020) for Web of Science author disambiguation.

Usage

The idea of basic usage is simply piping the disambiguation procedures. Each procedure takes list of sets as input and returns list of sets either by adding a new set or modifying input list of sets.

data |> 
    disambr_set_on_same_paper() |> 
    disambr_set_similar_initials() |>
    disambr_set_similar_last_names()

Sequence (piped functions) of disambiguation procedures defines a disambiguation algorithm.

disambr_eva <- function(data) {
    data |>
        disambr_set_on_same_paper() |>
        disambr_set_similar_initials() |>
        disambr_set_similar_last_names()
}

Implementation

The EVA-algorithm: An open-source solution for the disambiguation of author names in Web of Science data Olmo R. van den Akker Sacha Epskamp Stanislav Vlasov

The creation of co-authorship networks is a valuable way to depict the social structure of scientific fields. However, these co-authorship networks often get distorted because of the problems of author name synonymy (the same author is split into two nodes because his name is spelled differently in different publications) and author name homonymy (different authors are compounded into one node because they share the same name). The practice of author name disambiguation (AND) tries to solve these problems by correctly identifying the authors of scientific articles.

Several algorithms have been put forward in the context of AND, but none of them are suitable for large datasets of the Web of Science database. Therefore, in an earlier part of this project we proposed a new unsupervised learning algorithm based on the most recent AND literature. This so-called AEV-algorithm involves two phases: a blocking phase, in which pairs of authors are selected that are sufficiently similar, and a disambiguation phase, in which similar author names are either split or combined into one node based on information retrieved from the Web of Science database. In the disambiguation phase, the algorithm uses information about co-authorship, e-mail addresses, institutional affiliations, cited references, and article keywords (van den Akker et al., 2020).

Installation

## Installs and loads disambr
devtools::install_github("stasvlasov/disambr")
library("disambr")

Sets attributes

All disambiguation procedures used in disambr package work with sets. A set is basically any R object that can represent mathematical sets (e.g., set of authors, set of companies) with special attributes that are used by disabmr functions to identify the kind of set it is working with or produced (e.g., a set of authors that are likely to be the same person or a set of companies that are definitely different companies, etc.).

The attributes that are currently used to define/describe sets as well as their values are listed below:

  • disambr_set_name
    • string name of the set
  • disambr_entity
    • either person, organization, publication
  • disambr_set_type
    • similar_entities, different_entities
  • disambr_set_coefficient
    • number between 0 and 1 indicating how strongly entities are similar or different from each other. It is used only for establishing order of sets processing (e.g., start with sets of least similar entities)
  • disambr_set_collection
    • single_set_table (first column assumed to store entity id or entity id is just row number if entity_id_reference attribute is set to self, see below),
    • list_of_sets_as_lists (each set is a list of entity ids),
    • dyads_table (first and second columns assumed to be ids for the pair of entities)
  • disambr_entity_id_reference
    • self, name of other set as in its set_name attribute
  • disambr_entity_id_reference_md5_sum
    • md5 cache sum of the object where entities ids are referring to ensure that we will get to correct data for entities in the set.
  • disambr_recipe
    • list of disambr procedures that were applied to produce given set(s)
    • if it is a named list then first item is procedure name and the rest are properties:
      • procedure
      • file_name
      • file_md5sum (to check file identity later)
      • file_header (to check for consistency between read files)

Overall design principles

  • To allow for modular design each disambiguation procedure should accept and return list of sets (e.g., same person sets, different person sets, other probability of being the same person sets)
  • List of sets from (chain of) various procedures will be then merged (using basis set algebra) according to the specific disambiguation algorithm to produce final list of sets.
  • Initial input should be in the form of a list of initial sets (the simplest input is one set with every person likely to be non unique, e.g., data.table of authors from Web of Science bibliography data).
  • When reading data the package should try to do as many sets as possible on a fly (cleaning and splitting initial data to different types of entities)
  • Try to implement lazy data loading and processing where possible

Naming convention

References

This research was supported (in part) by the Fetzer Franklin Fund of the John E. Fetzer Memorial Trust.

van den Akker, O. R., Epskamp, Sacha, & Vlasov, S. A. (2020). The AEV Algorithm—Author name disambiguation for large Web of Science datasets.

Web of Science Field Tags 2018-06-27

https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Core-Collection-List-of-field-tags-in-output?language=en_US

FNFile Name
VRVersion Number
PTPublication Type (J=Journal; B=Book; S=Series; P=Patent)
AUAuthors
AFAuthor Full Name
BABook Authors
BFBook Authors Full Name
CAGroup Authors
GPBook Group Authors
BEEditors
TIDocument Title
SOPublication Name
SEBook Series Title
BSBook Series Subtitle
LALanguage
DTDocument Type
CTConference Title
CYConference Date
CLConference Location
SPConference Sponsors
HOConference Host
DEAuthor Keywords
IDKeywords Plus®
ABAbstract
C1Author Address
RPReprint Address
EME-mail Address
RIResearcherID Number
OIORCID Identifier (Open Researcher and Contributor ID)
FUFunding Agency and Grant Number
FXFunding Text
CRCited References
NRCited Reference Count
TCWeb of Science Core Collection Times Cited Count
Z9Total Times Cited Count*
U1Usage Count (Last 180 Days)
U2Usage Count (Since 2013)
PUPublisher
PIPublisher City
PAPublisher Address
SNInternational Standard Serial Number (ISSN)
EIElectronic International Standard Serial Number (eISSN)
BNInternational Standard Book Number (ISBN)
J929-Character Source Abbreviation
JIISO Source Abbreviation
PDPublication Date
PYYear Published
VLVolume
ISIssue
SISpecial Issue
PNPart Number
SUSupplement
MAMeeting Abstract
BPBeginning Page
EPEnding Page
ARArticle Number
DIDigital Object Identifier (DOI)
D2Book Digital Object Identifier (DOI)
PGPage Count
P2Chapter Count (Book Citation Index)
WCWeb of Science Categories
SCResearch Areas
GADocument Delivery Number
UTAccession Number
PMPubMed ID
EREnd of Record
EFEnd of File

Existing tools for reading WoS data

namecomments
bibliometrixreads only plaintext format into bibliometrixDB object
wosrRequires WoS API subscription
refsplitrpackage ‘refsplitr’ is not available (for R version 4.0.1)
read.wos.RDoes not work…
metagearscrape_bibliography by DOI
hindexcalculator?

bibliometrix

https://github.com/massimoaria/bibliometrix

Site: https://bibliometrix.org/index.html

git clone https://github.com/massimoaria/bibliometrix
install.packages("bibliometrix")
library("bibliometrix")
library("magrittr")


bmdata <- convert2df(file = 'https://www.bibliometrix.org/datasets/wos_plaintext.txt', dbsource = 'wos', format = "plaintext")


bmdata %>% class
## [1] "data.frame"     "bibliometrixDB"

bmdata %>% names


bmdata <- convert2df(file = 'https://www.bibliometrix.org/datasets/wos_plaintext.txt', dbsource = 'wos', format = "csv")

metagear

CRAN docs: https://cran.r-project.org/web/packages/metagear/metagear.pdf

GitHub: https://github.com/cran/metagear/

git clone https://github.com/cran/metagear/
scrape_bibliography

BibPlots

CRAN docs: https://cran.r-project.org/web/packages/BibPlots/BibPlots.pdf

Paper: https://arxiv.org/pdf/1905.09095.pdf

hindexcalculator

CRAN docs: https://cran.r-project.org/web/packages/hindexcalculator/hindexcalculator.pdf

refsplitr

git clone https://github.com/ropensci/refsplitr
install.packages("refsplitr")
library("refsplitr")

wosr

Requires premium WoS API - https://clarivate.com/webofsciencegroup/solutions/xml-and-apis

CRAN doc: https://cran.r-project.org/web/packages/wosr/wosr.pdf

Site: https://github.com/vt-arc/wosr

GitHub: https://github.com/vt-arc/wosr

git clone https://github.com/vt-arc/wosr
install.packages("wosr")
library(wosr)

## Get session ID
sid <- auth("[email protected]", password = "")
## Error: No matches returned for Username [email protected]

## Query WoS to see how many results match your query
query <- 'TS = ("animal welfare") AND PY = (2002-2003)'
query_wos(query, sid = sid)

## Download data
pull_wos(query, sid = sid)

read.wos.R

GitHub: https://github.com/alberto-martin/read.wos.R
git clone https://github.com/alberto-martin/read.wos.R
## load functions
## --------------------------------------------------------------------------------
source("../lib/read.wos.R/read.wos.functions.R")
## --------------------------------------------------------------------------------


## test

wos.data.mp <- read.wos(dir("../data/Journals in Mathematical Psychology", no.. = TRUE, full.names = TRUE))
## Error in substring(fields, 4) : invalid multibyte string at '<ff><fe>P'
## In addition: Warning message:
## In readLines(files[1], n = 1) : line 1 appears to contain an embedded nul

wos.data.mp <- read.wos("../data/Journals in Mathematical Psychology/Psychonomic Bulletin & Review 2.txt")
## Error in substring(fields, 4) : invalid multibyte string at '<ff><fe>P'
## In addition: Warning message:
## In readLines(files[1], n = 1) : line 1 appears to contain an embedded nul

wos.data <- read.wos("/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments/LN Public NAICS records from 10001 to 10500.txt")
## Error in substring(fields, 4) : invalid multibyte string at '<ff><fe>P'
## In addition: Warning message:
## In readLines(files[1], n = 1) : line 1 appears to contain an embedded nul

Make & Deploy

(save-excursion
  (let ((calls '( "reset-working-directory"
                  "tangle-readme"
                  "tangle-buffer"
                  "generate-package-docs"
                  ))
         ;; turn off babel prompts
         org-confirm-babel-evaluate)
    (mapcar
     (lambda (name)
       (let (org-confirm-babel-evaluate)
         (save-excursion
           (org-babel-goto-named-src-block name)
           (or (org-babel-execute-src-block-maybe)
               (org-babel-lob-execute-maybe)))))
     calls)))

Initiate project

Adding github workflows and badges

https://github.com/r-lib/actions/blob/v1/examples/README.md

usethis::use_github_action("check-release")
usethis::use_github_action("test-coverage")
usethis::use_github_action("pkgdown")
usethis::use_github_actions_badge(name = "R-CMD-check")

Set up tinytest

tinytest::setup_tinytest(".")

Reset working directory

## Remove tangled and generated files
## --------------------------------------------------------------------------------
files_to_remove <- c(
    "DESCRIPTION"
  , "NAMESPACE"
  , "LICENSE.md"
  , "README.md"
)

dirs_to_remove <-c(
    "man"
  , "inst/tinytest"
  , "data"
  , "R"
)

## remove files in dirs recursively
sapply(dirs_to_remove
     , function(dir) {
         file.remove(list.files(dir , recursive = TRUE, full.names = TRUE))
     })

## remove files and empty dirs
sapply(c(files_to_remove, dirs_to_remove)
     , \(file) if(file.exists(file)) file.remove(file))

Tangling README.md

(require 'org-goto)
(save-excursion
  (org-goto--local-search-headings "Description" nil t)
  (org-pandoc-export-to-markdown nil 'subtreep))

Tangle source

(org-babel-tangle)

Set .Rprofile (developer enviroment)

CRAN Packages

  ## --------------------------------------------------------------------------------
  ## First load default packages getOption("defaultPackages")
  ## Otherwise it will add it at the end which can mask some funcitons
  .First.sys()


  ## some packages installations read .Rprofile loops the install
  ## current_wd <- getwd()
  ## setwd("~/")

  ## --------------------------------------------------------------------------------
  ## Load or Install Packages
  ## --------------------------------------------------------------------------------
  for(pkg in c('devtools'
             , 'roxygen2'
             ## , 'xml2'
             ## , 'tibble'
             , 'stringi'
             ## , 'stringr'
             , 'stringdist'
             , 'digest'
             , 'crayon'
             , 'magrittr'
             , 'lubridate'
             ## , 'plyr'
             ## , 'pipeR'
             ## , 'ggplot2'
             , 'pbapply'
             , 'testthat'
             , 'microbenchmark'
             , 'data.table'
             ## , 'dplyr'
))
      if(!require(pkg, character.only = TRUE)) {
          install.packages(pkg, repos = 'http://cloud.r-project.org')
          require(pkg, character.only = TRUE) }


  ## restore current working directore
  ## setwd(current_wd)

  ## --------------------------------------------------------------------------------
  ## library(disambr)

  ## update.packages(ask = FALSE, repos = 'http://cloud.r-project.org')

My Packages

## --------------------------------------------------------------------------------
## Load My pakcages
## --------------------------------------------------------------------------------
## detach(package:romRDS, unload = TRUE)
## remove.packages("romRDS")
if (!require("romRDS", character.only = TRUE)) {
  if(!require("devtools")) {
    install.packages("devtools"
                   , repos = 'http://cloud.r-project.org'
                   , dependencies = TRUE)
    require("devtools", character.only = TRUE)
  }
  install_github("stasvlasov/romRDS")
  require("romRDS", character.only = TRUE)
}
## --------------------------------------------------------------------------------

Package documentation

packageslink
tinytesthttps://github.com/markvanderloo/tinytest/blob/master/pkg/README.md
crayon
packagescurrent_versionensure_versionlink
data.table1.14.21.13.0
stringi1.7.61.6.0
parallel4.1.24.0.0
pbapply1.5.01.5.0
stringdist0.9.80.9.0
packageVersion(pkg)
## Generate package description
## --------------------------------------------------------------------------------
list(Package = "disambr"
   , Title  = "disambr - Named Entity Disambiguation in R"
   , Description = "disambr - Named Entity Disambiguation in R"
   , `Authors@R` = c(person(given = c("Stanislav" ,"A.") , family = "Vlasov"
                   , email = "[email protected]"
                   , role = c("aut", "cre"))
            , person(given = c("Olmo", "R."), family = "van den Akker"
                   , email = "[email protected]"
                   , role = "aut")
            , person(given = "Sacha", family = "Epskamp"
                   , email = "[email protected]"
                   , role = "aut"))
   , Imports = paste(
         paste0(dependencies$packages, " (>= ", dependencies$ensure_version, ")")
       , collapse = ", ")
     ## , Depends = "R (>= 4.0), data.table(>= 1.13.0)"
   , Depends = "R (>= 4.0)"
   , Suggests = paste(suggests$packages, collapse = ", ")
   , Version = "0.0.0.9000"
   , Date = Sys.Date()
   , URL = "https://github.com/stasvlasov/disambr"
   , BugReports = "https://github.com/stasvlasov/disambr/issues"
   , References = "This research was supported (in part) by the Fetzer Franklin Fund of the John E. Fetzer Memorial Trust. The EVA disambiguation algorithm is described in van den Akker, O. R., Epskamp, Sacha, & Vlasov, S. A. (2020). The AEV Algorithm—Author name disambiguation for large Web of Science datasets."
     ) |> usethis::use_description(check_name = TRUE
                                 , roxygen = TRUE)

usethis::use_lgpl_license()

## use_tidy_description()
## ----------------------------------------------------------------------------
## Update name spaces and documentation for functions
roxygen2::roxygenise()


#' @details
#' This package provides a framework for disambiguating named entities (e.g., authors in large bibliometric databases)
#' 
#' Package provides following main functions
#' - disambr_read - reads WoS data
#' - disambr_aev - implementation of the AEV algorithm (van den Akker et al., 2020) for Web of Science author disambiguation.
#' 
#' This is work in progress. Please, file an issues or suggestion if you have any.
#' @keywords internal
"_PACKAGE"

Install package

## Remove
## --------------------------------------------------------------------------------
detach(package:disambr, unload = TRUE)
remove.packages("disambr")

## Deploy
## --------------------------------------------------------------------------------
devtools::install(".")

Functions

disambr_options

optionbehaviour when not set
disambr_save_asNULL
disambr_save_set_prefixdisambr-set.
disambr_save_set_dirdisambr-sets
disambr_save_set_time_stampTRUE
disambr_verboseTRUE
disambr_mess_prettyFALSE
disambr_get_output_setFALSE
disambr_read_output_setFALSE
.onAttach <- function(libname, pkgname) {
    options(
        
    )
}

disambr_utils

dhms

##' Formats time difference as X days HH:MM:SS
##'
##' from https://stackoverflow.com/questions/27312292
##' @param t time diff
##' @return formatted time diff string
##' 
##' @export 
dhms <- function(t) {
    t <-  abs(as.numeric(t, units = "secs"))
    paste(if((t %/% (60*60*24)) > 0) paste(t %/% (60*60*24), "days") else NULL
         ,paste(formatC(t %/% (60*60) %% 24, width = 2, format = "d", flag = "0")
              , formatC(t %/% 60 %% 60, width = 2, format = "d", flag = "0")
              , formatC(t %% 60, width = 2, format = "d", flag = "0")
              , sep = ":"))
}

messaging

create_message

##' Creates message string for reporting during procedures
##' @param mess Message to report. If prefixed by h `h_marks` it will be ouline of level `h`
##' @param h Forse specific ouline level of message
##' @param indent Forse indentation
##' @param prefix Add overal prefix
##' @param h_marks Marks that sets outline. Default is "-". Can be many characters, e.g. "-*#".
##' @param h_prefix Character vector of prefixes for each outline level
##' @param h_prefix_sep Separator between `h_prefix` and `mess`
##' @param pretty Whether to use "crayon" package for pretty printing
##' @param mess_color Color of message
##' @param h_prefix_color Color of ouline prefix
##' @param ... Here we can pass `verbose` argument from upper functions. Default is TRUE
##' @return Message string
##' 
##' @export 
create_message <- function(mess
                         , h = integer(0)
                         , indent = integer(0)
                         , prefix = ""
                         , h_marks = "-"
                         , h_prefix = character()
                         , h_prefix_sep = " "
                         , pretty = getOption("disambr_mess_pretty")
                         , mess_color = "green"
                         , h_prefix_color = "blue"
                         , ...) {
    ## set outline
    if(isTRUE(length(h) != 1)) {
        mess.regex <- paste0("^([", h_marks, "]*)\\s*(.*)")
        mess.parsed <-
            stringi::stri_match_first_regex(mess, mess.regex)
        mess <- mess.parsed[[3]]
        h <- nchar(mess.parsed[[2]]) + 1
    }
    ## set h_prefix
    if(length(h_prefix) < h) {
        h_prefix_l <- length(h_prefix)
        ## if h_prefix is NULL
        if(h_prefix_l == 0) {
            h_prefix <- ""
            h_prefix_l <- 1
        }
        h_prefix <- c(h_prefix, rep(h_prefix[h_prefix_l], h - h_prefix_l))
    }
    ## set indentation
    if(isTRUE(length(indent) != 1)) {
        indent <-
            nchar(paste(c("", h_prefix)[1:h], collapse = "")) +
            (h-1)*nchar(h_prefix_sep)
        indent <- strrep(" ",  indent)
    } else{
        indent <-
            switch(class(indent)
                 , numeric = if(indent == 0) ""
                             else strrep(" ",  indent)
                 , character = indent)
    }
    ## create message
    h_prefix <- h_prefix[h]
    mess.plain <-
        paste0(prefix
             , indent
             , h_prefix
             , h_prefix_sep
             , mess)
    if(isTRUE(pretty) &&
       ## in case I want to move crayon to Sugests:
       requireNamespace("crayon", quietly = TRUE)) {
        h_prefix.style <-
            crayon::make_style(h_prefix_color)
        mess.style <-
            crayon::make_style(mess_color)
        mess.style <-
            crayon::combine_styles(crayon::bold, mess.style)
        mess <-
            Reduce(crayon::`%+%`
                 , list(prefix
                      , indent
                      , h_prefix.style(h_prefix)
                      , h_prefix_sep
                      , mess.style(mess)))
    } else {
        mess <- mess.plain
    }
    return(mess)
}

disambr_message

##' Report a message with message()
##' @param mess Message to report. If prefixed by h `h_marks` it will be ouline of level `h`
##' @param h_prefix Character vector of prefixes for each outline level
##' @inheritDotParams create_message
##' @return Same as `message` returns
##' 
##' @md 
##' @export
disambr_message <- function(mess
                       , h_prefix = c("disambr:", "-")
                       , ...) {
    ## skip is not verbose (verbose by default)
    if(isFALSE(list(...)$verbose)) return()
    mess <- create_message(mess, h_prefix = h_prefix, ...)
    ## post message
    message(mess)
}
expect_message(disambr_message("Hello world!"))
expect_null(disambr_message("Hello world!", verbose = FALSE))

disambr_message_start

##' Post a starting message for disambr procedure. Records time started in `disambr_start_time` variable in its `parent.frame()`
##' @param start_mess_prefix Prefix for staring message
##' @inheritDotParams disambr_message
##' @return time started
##' 
##' @export 
disambr_message_start <- function(start_mess_prefix = "Making set -"
                              , ...) {
    ## get name of running procedure
    running_procedure_name <- deparse(sys.calls()[[sys.nframe() - 1]])
    ## clean the call string
    running_procedure_name <-
        stringi::stri_replace_first_regex(running_procedure_name
                                        , c("^disambr_set_([^()]+).*")
                                        , "$1")
    mess <- paste(start_mess_prefix, running_procedure_name)
    disambr_message(mess, ...)
    ## record the time started
    assign("disambr_start_time", Sys.time(), pos = parent.frame())
}
foo <- function() {
    disambr_message_start()
    return(disambr_start_time)
}
expect_inherits(foo(), c("POSIXt", "POSIXct"))
expect_message(foo())

disambr_message_finish

##' Post a starting message for disambr procedure. Records time started in `disambr_start_time` variable in its `parent.frame()`
##' @param mess Prefix for staring message
##' @param append_running_procedure_name Whether to append running procedure name
##' @inheritDotParams disambr_message
##' @return time started
##' 
##' @export 
disambr_message_finish <- function(mess = "Finished -"
                              , append_running_procedure_name = TRUE
                              , ...) {
    if(isTRUE(append_running_procedure_name)) {
        ## get name of running procedure
        running_procedure_name <- deparse(sys.calls()[[sys.nframe() - 1]])
        ## clean the call string
        running_procedure_name <-
            stringi::stri_replace_first_regex(running_procedure_name
                                            , c("^disambr_set_([^()]+).*")
                                            , "$1")
        mess <- paste(mess, running_procedure_name)    
    } 
    ## assess procedure duration
    if(exists("disambr_start_time", where = parent.frame())) {
        disambr_duration <-
            dhms(Sys.time() - get("disambr_start_time", pos = parent.frame()))
        mess <- paste(mess, "in", disambr_duration)
    }   
    disambr_message(mess, ...)
}
foo <- function() {
    disambr_start_time <- Sys.time() - 1000
    disambr_message_finish()
}

expect_message(foo(), "foo.*in")

disambr_warning

##' Report a message with warning()
##' @param mess Message to report. If prefixed by h `h_marks` it will be ouline of level `h`
##' @param h_prefix Character vector of prefixes for each outline level
##' @param call. See `warning`
##' @param immediate. See `warning`
##' @inheritDotParams create_message
##' @return Same as `warning` returns
##' 
##' @md 
##' @export
disambr_warning <- function(mess
                       , h_prefix = c("disambr:", "-")
                       , call. = FALSE
                       , immediate. = TRUE
                       , ...) {
    mess <- create_message(mess, h_prefix = h_prefix, ...)
    ## post message
    warning(mess, call. = call. , immediate. = immediate.)
}
expect_warning(disambr_warning("Ahtung!"))

disambr_stop

##' Report a message with stop()
##' @param mess Message to report. If prefixed by h `h_marks` it will be ouline of level `h`
##' @param call. See `stop`
##' @inheritDotParams create_message
##' @return Same as `stop` returns
##' 
##' @md 
##' @export
disambr_stop <- function(mess
                       , call. = FALSE
                       , ...) {
    parent.call <- deparse(sys.calls()[[sys.nframe() - 1]])
    parent.call <- as.character(parent.call)
    mess <-
        create_message(mess
                  , h_prefix = paste0(parent.call, ":")
                  , ...)
    ## post message
    stop(mess, call. = call.)
}
expect_error(disambr_stop())

get_file_extension

##' Extention extractor. Same as tools::file_ext but for NULL input returns NULL instead of logical(0).
##' @param f file name 
##' @return extention
##' 
##' @export 
get_file_extension <- function(f) {
    if(length(f) == 0) {
        return(NULL)
    } else {
        tools::file_ext(f)
    }
}

## my.file <- '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt' 
## my.file1 <- "/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments/LN Public NAICS records from 10001 to 10500.txt"

## get_file_extension(my.file)
## get_file_extension(my.file1)
## get_file_extension("sdfsdf....")
## get_file_extension("sdf")
## get_file_extension("")
## get_file_extension(NULL)
## get_file_extension(NA)
## get_file_extension("...sdf...sdf.df...sd.")
## get_file_extension(".")
## get_file_extension(".....")

## ## build in
## tools::file_ext(my.file)
## tools::file_ext(my.file1)
## tools::file_ext("sdfsdf....")
## tools::file_ext("sdf")
## tools::file_ext("")
## tools::file_ext(NULL)
## tools::file_ext(NA)
## tools::file_ext("...sdf...sdf.df...sd.")
## tools::file_ext(".")
## tools::file_ext(".....")

stop_unless

##' Stops process unless cond is true
##' @param cond condition to test
##' @param message_if_false message_if_false
##' @param stop_if_false stop_if_false 
##' @param return_if_true return_if_true
##' @param return_if_false return_if_false 
##' @return 
##' 
##' @export 
stop_unless <- function(cond
                      , message_if_false = paste("cond in not TRUE")
                      , stop_if_false = TRUE
                      , return_if_true = TRUE
                      , return_if_false = isFALSE(return_if_true)) {
    if(isTRUE(cond)) {
        return(return_if_true)
    } else if(isTRUE(stop_if_false)){
        stop(message_if_false, call. = FALSE)
    } else {
        warning(message_if_false, call. = FALSE)
        return(return_if_false)
    }
}
expect_warning(stop_unless(FALSE, "Lala", FALSE))
expect_error(stop_unless(FALSE))
expect_true(stop_unless(TRUE))
expect_warning(stop_unless("sdfasdf", stop_if_false = FALSE))
expect_warning(stop_unless("sdfasdf", stop_if_false = FALSE, return_if_true = FALSE))

parse_files_path

##' Returns vector of file paths from path(s) recursively
##' @param files_path Path(s) where the files are
##' @param recursive Whether to look in subfolders recursively
##' @return Vector of file paths from path(s) recursively
##' 
##' @export 
parse_files_path <- function(files_path, recursive = TRUE) {
    stop_unless(is.character(files_path), "Files path shoud be a character string!")
    files_path <- 
        lapply(files_path, function(file.path) {
            if(stop_unless(file.exists(file.path)
                         , paste(file.path, " - does not exist!")
                         , stop_if_false = FALSE
                         , return_if_true = FALSE)) {
                NULL
            } else if(dir.exists(file.path)) {
                dir(file.path
                  , full.names = TRUE
                  , recursive = recursive)
            } else {
                file.path
            }
        })
    return(unique(normalizePath(unlist(files_path))))
}
expect_error(parse_files_path(3423))
expect_warning(parse_files_path(c(".", "gibirish file")))
expect_inherits(parse_files_path("."), "character")

## empty dirs
tmp.dir <- "test_dir_for_parse_files_path"
dir.create(tmp.dir, showWarnings = FALSE)
expect_equal(parse_files_path(tmp.dir), character(0))
file.remove(tmp.dir)

read_to_utf8

##' Reads file as UTF-8, convert it if other encoding is deteted
##' @param f file path
##' @param bytes_to_check how long to check for encoding (save time for large files)
##' @return file text as string
##' 
##' @export 
read_to_utf8 <- function(f, bytes_to_check = 2^14) {
    ## read file as raw bytes (not to Assume any encodings)
    bin <- readBin(f, raw(), n = file.size(f))
    ## check first 2^14 bytes for encoding
    encoding <- stringi::stri_enc_detect2(bin[1:bytes_to_check])[[1]][[1]][1]
    if(is.na(encoding)) {
        message("Could not detect encoding of file: ", f)
        s <- rawToChar(bin, multiple = FALSE)
    } else if(!(encoding %in% iconvlist())) {
        message("Does not know how to convert from ", encoding, "for file: ", f)
    } else if(encoding == "UTF8") {
        s <- rawToChar(bin, multiple = FALSE)
    } else {
        ## message("Converting to utf-8")
        s <- iconv(list(NULL, bin), from = encoding, to = "UTF-8")
    }
    return(s)
}

## stringi::stri_enc_detect2(NULL)[[1]][[1]][1]
## stringi::stri_enc_detect2(NA)[[1]][[1]][1]
## stringi::stri_enc_detect2(123)[[1]][[1]][1]
## stringi::stri_enc_detect2("")[[1]][[1]][1]
## stringi::stri_enc_detect2("sadf")[[1]][[1]][1]

recode_return_characters

##' Fixed end of line characters in wierd text
##' @param s text string
##' @param assoc.file file name where it came from
##' @param verbose Be chatty
##' @return fixed sting
##' 
##' @export 
recode_return_characters <- function(s, assoc.file = NA, verbose = FALSE) {
    has_return_chars <- function(s, test.first.n.char = 10^4) {
        s <- stringi::stri_sub(s, to = test.first.n.char)
        any(stringi::stri_detect_regex(s, "\\r"))
    }
    if(has_return_chars(s)) {
        if(verbose) message("disambr: '\\r' char in the file: ", assoc.file
                          , "\n- replacing with '\\n' to fix 'datatable::fread'")
        s <- stringi::stri_replace_all_regex(s, "\\R+", "\n")
    }
    return(s)
}

disambr_cbind_lists

##' Makes list of each element of l
##' @param l sequence or list
##' @param l.name same name will be applies to each element
##' @return list of lists
##' 
##' @export 
disambr_listify_list <- function(l, l.name = NULL) {
    if(isTRUE(l.name == "")) l.name =  NULL
    ## case when all are 1 length (vector or list of single length elements)
    lapply(l, function(x) {
        x <- list(x)
        names(x) <- l.name
        return(x)
    })
}


##' cbinds lists and names each element as name of each list in ...
##' @param ... Lists to cbin
##' @return Lists
##' @export 
disambr_cbind_lists <- function(...) {
    lists <- eval(...)
    lists_n <- length(lists)
    lists_names <- names(lists)
    cbind_list <- disambr_listify_list(lists[[1]], lists_names[1])
    for (i in 2:lists_n) {
        cbind_list <- 
            mapply(c
                 , cbind_list
                 , disambr_listify_list(lists[[i]], lists_names[i])
                 , SIMPLIFY = FALSE)
    }
    return(cbind_list)
}

match_fuzzy

##' Fuzzy match all combinations of character vector
##' @param bank bank
##' @param method see method in stringdist
##' @param max_dist see maxDist in stringdist
##' @param id_name names that will be suffixed with _1 and _2
##' @return data.table
##' 
##' @export 
match_fuzzy <- function(bank, method, max_dist, id_name) {
        id_name_1 <- paste0(id_name, "_1")
        id_name_2 <- paste0(id_name, "_2")
        match_fuzzy_x <- function(x) {
            matched <- stringdist::ain(bank, x
                                     , maxDist = max_dist
                                     , method = method
                                     , matchNA = FALSE)
            if(any(matched)) {
                matched <- bank[matched]
                matched <- data.table::data.table(x, matched)
                data.table::setnames(matched, c(id_name_1, id_name_2))
            } else {
                NULL
            }
        }
        match_x <- function(x) {
            matched <- bank %in% x
            if(any(matched)) {
                matched <- bank[matched]
                matched <- data.table::data.table(x, matched)
                data.table::setnames(matched, c(id_name_1, id_name_2))
            } else {
                NULL
            }
        }
        if(max_dist > 0) {
            matched_list <- lapply(bank, match_fuzzy_x)
        } else if(max_dist == 0) {
            matched_list <- lapply(bank, match_x)
        } else {
            stop()
        }
        return(data.table::rbindlist(matched_list))
}
expect_equal(
    match_fuzzy(c("sdfsdf", "sfawefwsd", "sdfwefad", ";sldwaf", "asdfwaf")
              , method = "lv"
              , max_dist = 3
              , id_name = "id")
, structure(list(id_1 = c("sdfsdf", "sdfsdf", "sfawefwsd", "sdfwefad", 
";sldwaf", ";sldwaf", "asdfwaf", "asdfwaf", "asdfwaf"), id_2 = c("sdfsdf", 
"asdfwaf", "sfawefwsd", "sdfwefad", ";sldwaf", "asdfwaf", "sdfsdf", 
";sldwaf", "asdfwaf")), row.names = c(NA, -9L), class = c("data.table", 
"data.frame")))

expect_equal(nrow(match_fuzzy(c("sdfsdf", "sfawefwsd", "sdfwefad", ";sldwaf", "asdfwaf")
          , method = "lv"
          , max_dist = 39
          , id_name = "id")), 25)

get_upper_triangle_index

##' Returns the index of element in upper triangle of squared `n` by `n` matrix
##'
##' inspired from https://math.stackexchange.com/questions/2134011 but modified so
##' index starts from 1 as the original answer was for index starting from 0
##' 
##' @param i row index
##' @param j column index
##' @param n the size of squared matrix
##'
##' @return An index as integer number. Index starts from 1
get_upper_triangle_index <- function(i, j, n) {
    if(j >= i) {
        n*(n - 1)/2 - (n - i)*(n - i + 1)/2 + j
    } else {
        get_upper_triangle_index(j, i, n)
    }
}
get_upper_triangle_index <- disambr:::get_upper_triangle_index

## 3x3
m <- matrix(c(1,2,3
             ,0,4,5
             ,0,0,6)
          , nrow = 3
          , byrow = TRUE)

i <- 1
j <- 2

expect_equal(
    m[i,j]
  , get_upper_triangle_index(i,j,dim(m)[1])
)


i <- 3
j <- 3

expect_equal(
    m[i,j]
  , get_upper_triangle_index(i,j,dim(m)[1])
)


i <- 2
j <- 3

expect_equal(
    m[i,j]
  , get_upper_triangle_index(j,i,dim(m)[1])
)

## 4x4
m <- matrix(c(1,2,3,4
             ,0,5,6,7
             ,0,0,8,9
             ,0,0,0,10)
          , nrow = 4
          , byrow = TRUE)


i <- 1
j <- 2

expect_equal(
    m[i,j]
  , get_upper_triangle_index(i,j,dim(m)[1])
)


i <- 3
j <- 3

expect_equal(
    m[i,j]
  , get_upper_triangle_index(i,j,dim(m)[1])
)


i <- 2
j <- 3

expect_equal(
    m[i,j]
  , get_upper_triangle_index(j,i,dim(m)[1])
)



i <- 4
j <- 2

expect_equal(
    m[j,i]
  , get_upper_triangle_index(i,j,dim(m)[1])
)


i <- 1
j <- 4

expect_equal(
    m[i,j]
  , get_upper_triangle_index(i,j,dim(m)[1])
)

disambr_sets

TEMPLATE

disambr_in_sets

##' Checks if sets with certain attribures are present
##' @param sets Sets to filter on sets attributes
##' @param match_attr_value_parcially whether attribute values can be matched partially
##' @param check_attr_names_prefix Whether to check for short names of attributes. See `attr_names_prefix`.
##' @param attr_names_prefix If name does not start with this prefix (default is 'disambr_set_'), it will add this prefix before attribute name.
##' @param ... Named sets attributes to filter `sets` on
##' @return logical vector of length `length(sets)`
##' 
##' @export 
disambr_in_sets <- function(sets
                          , ...
                          , match_attr_value_parcially = FALSE
                          , check_attr_names_prefix = TRUE
                          , attr_names_prefix = "disambr_set_") {
    ## check if sets is list
    if(!is.list(sets)) disambr_stop("'sets' should be a list of sets!")
    attrs_values <- list(...)
    ## check if ... is provided and if not return all
    attrs_values_length <- length(attrs_values)
    if(attrs_values_length == 0) return(rep(TRUE, length(sets)))
    ## check if all named
    attrs_values_names <- names(attrs_values)
    if(length(attrs_values_names) != attrs_values_length)
        disambr_stop("'...' arguments should be all named!")
    if(isTRUE(check_attr_names_prefix)) {
        ## add "disambr_set_" if attr names are short
        attrs_values_names_short <-
            !stringi::stri_detect_regex(attrs_values_names
                                      , paste0("^", attr_names_prefix))
        if(any(attrs_values_names_short)) {
            attrs_values_names[attrs_values_names_short] <-
                paste0(attr_names_prefix
                     , attrs_values_names[attrs_values_names_short])
        }
    }
    if(isTRUE(match_attr_value_parcially)) {
        filter_sets <- function(attr_name, attr_value) {
            vals <- lapply(sets, attr, attr_name, exact = TRUE)
            vals <- lapply(vals, unlist)
            vals <- lapply(vals, `[`, 1)
            vals <- unlist(lapply(vals, function(a) if(is.null(a)) NA else a))
            sapply(stringi::stri_detect_fixed(vals, attr_value), isTRUE)
        }
    } else {
        filter_sets <- function(attr_name, attr_value) {
            vals <- lapply(sets, attr, attr_name, exact = TRUE)
            vals <- lapply(vals, unlist)
            vals <- lapply(vals, `[`, 1)
            vals <- unlist(lapply(vals, function(a) if(is.null(a)) NA else a))
            vals %in% attr_value
        }
    }
    sets_filters <- mapply(filter_sets
                         , attrs_values_names
                         , attrs_values
                         , SIMPLIFY = FALSE)
    ## return overlap of sets_filters
    return(Reduce(`&`, sets_filters))
}
a <- list(disambr_set_attr(c(1,2,3), a = 1)
        , disambr_set_attr(c(1,2,3), b = 2, a = 1)
        , disambr_set_attr(c(1,2,3), c = 3, a = 2)
        , disambr_set_attr(c(1,2,3), d = 4, a = 212))

expect_true(all(disambr_in_sets(a)))

expect_false(any(disambr_in_sets(a, b= 2, a = 2)))

expect_true(any(disambr_in_sets(a, b= 2, a = 1)))

expect_true(any(disambr_in_sets(a, a = 1)))

expect_true(any(disambr_in_sets(a, disambr_set_a = 1)))

expect_equal(sum(disambr_in_sets(a, a = 1, match_attr_value_parcially = TRUE)), 3)

disambr_get_first_data_set

##' Get first data set in list sets
##' @param sets list of sets
##' @param recipe function that produced the data set (parcial match allowed)
##' @param ... other attributes
##' @param match_parcially whether to match recipe partially
##' @inheritDotParams disambr_in_sets
##' @return 
##' 
##' @export 
disambr_get_first_data_set <- function(sets, recipe, ...
                                     , match_parcially = TRUE) {
    set_num <-
        disambr_in_sets(sets, recipe = recipe, ...
                      , match_attr_value_parcially = match_parcially)
    ## get first
    set_num <- which(set_num)[1]
    if(length(set_num) != 0) {
        return(sets[[set_num]])
    } else {
        disambr_stop(paste("Data set should be available in sets:", recipe))
    }
}
a <- list(disambr_set_attr(c(1,2,3), recipe = "my_function_123")
        , disambr_set_attr(c(1,2,3,4), recipe = "my_function_23")
        , disambr_set_attr(c(1,2,3,4,5), recipe = "my_function_3a")
        , disambr_set_attr(c(1,2,3,4,5,6), recipe = "my_function_1"))

expect_equal(
    disambr_get_first_data_set(a, "3")
  , structure(c(1, 2, 3), disambr_set_recipe = "my_function_123")
)


expect_equal(
    disambr_get_first_data_set(a, "3a")
  , structure(c(1, 2, 3, 4, 5), disambr_set_recipe = "my_function_3a")
)

expect_null(disambr_get_first_data_set(a, "aaaaa"))

disambr_get_last_set

##' Gets last set in sets which strength less or equal than 0.5
##' @param sets sets
##' @param ... other attributes
##' @inheritDotParams disambr_in_sets
##' @return set or NULL if not found
##' 
##' @export 
disambr_get_last_set <- function(sets, ...) {
    set_index <- disambr_in_sets(sets, ...)
    set_index <- which(set_index)
    set_index <- set_index[length(set_index)]
    if(length(set_index) == 1) {
        return(sets[[set_index]])
    } else {
        return()
    }
}
a <- list(disambr_set_attr(c(1,2,3), strength = 0.1)
        , disambr_set_attr(c(1,2,3,4), strength = 0.6)
        , disambr_set_attr(c(1,2,3,4,5), strength = 0.5)
        , disambr_set_attr(c(1,2,3,4,5,6), strength = 1))

expect_equal(
    disambr_get_last_set(a)
  , structure(c(1, 2, 3, 4, 5, 6), disambr_set_strength = 1)
)

a <- list(disambr_set_attr(c(1,2,3,4), strength = 0.6)
        , disambr_set_attr(c(1,2), strength = 1))

expect_equal(
    disambr_get_last_set(a)
  , structure(c(1, 2), disambr_set_strength = 1)
)

disambr_get_last_weak_set

##' Gets last set in sets which strength less or equal than 0.5
##' @param sets sets
##' @param ... other attributes
##' @inheritDotParams disambr_in_sets
##' @return set or NULL if not found
##' 
##' @export 
disambr_get_last_weak_set <- function(sets, ...) {
    set_index <- disambr_in_sets(sets, ...
                               , strength = seq(from = 0.1, to = 0.5, by = 0.01))
    set_index <- which(set_index)
    set_index <- set_index[length(set_index)]
    if(length(set_index) == 1) {
        return(sets[[set_index]])    
    } else {
        disambr_stop("- can not find last weak set in sets!")
        return()
    }
}
a <- list(disambr_set_attr(c(1,2,3), strength = 0.1)
        , disambr_set_attr(c(1,2,3,4), strength = 0.6)
        , disambr_set_attr(c(1,2,3,4,5), strength = 0.5)
        , disambr_set_attr(c(1,2,3,4,5,6), strength = 1))

expect_equal(
    disambr_get_last_weak_set(a)
  , structure(c(1, 2, 3, 4, 5), disambr_set_strength = 0.5)
)



a <- list(disambr_set_attr(c(1,2,3,4), strength = 0.6)
        , disambr_set_attr(c(1,2,3,4,5,6), strength = 1))

expect_error(
    disambr_get_last_weak_set(a)
)

disambr_get_strong_set

##' Get sets with strength parameter of 1 and rbind them into single set
##' @param sets sets
##' @param ... other attributes
##' @inheritDotParams disambr_in_sets
##' @return set or NULL if none found
##' 
##' @export 
disambr_get_strong_set <- function(sets, ...) {
    sets_index <- disambr_in_sets(sets, ...
                                , type = "similar"
                                , strength = 1)
    ## if not sets return NULL
    if(!any(sets_index)) return()
    sets <- sets[sets_index]
    ## bind sets depending on class
    sets_class <- sapply(lapply(sets, class), `[`, 1)
    if(all(sets_class %in% "data.table")) {
        sets <- data.table::rbindlist(sets)
    } else if(all(sets_class %in% "list")) {
        sets <- do.call(c, sets)
    }
    return(sets)
}
a <- list(disambr_set_attr(list(1,2,3)
                         , type = "similar"
                         , strength = 1)
        , disambr_set_attr(list(1,2,3,4)
                         , type = "similar"
                         , strength = 0.6)
        , disambr_set_attr(list(1,2,3,4,5)
                         , type = "similar"
                         , strength = 0.4)
        , disambr_set_attr(list(1,2,3,4,5,6)
                         , type = "similar"
                         , strength = 1))

expect_equal(
    disambr_get_strong_set(a)
  , list(1, 2, 3, 1, 2, 3, 4, 5, 6)
)



a <- list(disambr_set_attr(data.table::data.table(c(1,2,3))
                         , type = "similar"
                         , strength = 1)
        , disambr_set_attr(data.table::data.table(c(1,2,3,4))
                         , type = "similar"
                         , strength = 0.6)
        , disambr_set_attr(data.table::data.table(c(1,2,3,4,5))
                         , type = "similar"
                         , strength = 0.4)
        , disambr_set_attr(data.table::data.table(c(1,2,3,4,5,6))
                         , type = "similar"
                         , strength = 1))

expect_inherits(disambr_get_strong_set(a), "data.table")

a <- list(disambr_set_attr(data.table::data.table(c(1,2,3))
                         , type = "similar"
                         , strength = 0.1)
        , disambr_set_attr(data.table::data.table(c(1,2,3,4))
                         , type = "similar"
                         , strength = 0.6)
        , disambr_set_attr(data.table::data.table(c(1,2,3,4,5))
                         , type = "similar"
                         , strength = 0.4)
        , disambr_set_attr(data.table::data.table(c(1,2,3,4,5,6))
                         , type = "similar"
                         , strength = 0.1))

expect_null(disambr_get_strong_set(a))

disambr_get_truth_set

##' Get sets with strength parameter of 10 (ground thruth)
##' @param sets sets
##' @param ... other attributes
##' @inheritDotParams disambr_in_sets
##' @return set or NULL if none found
##' 
##' @export 
disambr_get_truth_set <- function(sets, ...) {
    sets_index <- disambr_in_sets(sets, ...
                                , type = "similar"
                                , strength = 10)
    ## if not sets return NULL
    if(!any(sets_index)) disambr_stop("- can not find truth set!")
    if(sum(sets_index) != 1) disambr_stop("- more than one truth set found!")
    return(sets[[which(sets_index)]])
}

disambr_get_last_unstrong_set

##' Gets last set from sets with strength <= 0.5 and excludes from this set all sets with strength of 1
##' @param sets sets
##' @return set or NULL
##' @export 
disambr_get_last_unstrong_set <- function(sets) {
    weak_set <- disambr_get_last_weak_set(sets)
    strong_set <- disambr_get_strong_set(sets)
    if(is.null(strong_set) || is.null(weak_set)) {
        return(weak_set)
    } else if("data.table" %in% class(weak_set) &&
              "data.table" %in% class(strong_set)) {
        return(data.table::fsetdiff(weak_set, strong_set))
        ## comb_set <- rbind(weak_set, strong_set)
        ## comb_set <- 
        ##     comb_set[!duplicated(comb_set, fromLast = FALSE) &
        ##              !duplicated(comb_set, fromLast = TRUE)]
        ## return(comb_set) #
        ## return(weak_set[strong_set[[1]] !=  weak_set[[1]] ||
                        ## strong_set[[2]] !=  weak_set[[2]]])
    } else {
        disambr_stop("Weak and strong sets should be data.tables!")
    }
}
a <- list(disambr_set_attr(data.table::data.table(c(1,2,3), c(1,2,8)), strength = 1)
        , disambr_set_attr(data.table::data.table(c(1,2,3,4), c(1,2,3,4)), strength = 0.6)
        , disambr_set_attr(data.table::data.table(c(1,2,3,4,5,6,7,8), c(1,2,3,4,5,6,7,8)), strength = 0.4)
        , disambr_set_attr(data.table::data.table(c(1,2,7,5), c(1,2,7,5)), strength = 1))

expect_equal(nrow(disambr_get_last_unstrong_set(a)), 8)

disambr_set_attr

## disambr_entity
## disambr_set_type
## disambr_set_coefficient
## disambr_set_name
## disambr_set_collection
## disambr_entity_id_reference
## disambr_entity_id_reference_md5_sum
## disambr_recipe

##' Adds attribures to the set with data.table::setattr
##' @param focal_set Set to add attribute to
##' @param check_attr_names_prefix Whether to check for short names of attributes. See `attr_names_prefix`.
##' @param attr_names_prefix If name does not start with this prefix (default is 'disambr_set_'), it will add this prefix before attribute name.
##' @param ... Named attributes
##' @return `focal_set`
##' 
##' @export 
disambr_set_attr <- function(focal_set
                           , ...
                           , check_attr_names_prefix = TRUE
                           , attr_names_prefix = "disambr_set_") {
    attrs_values <- list(...)
    ## check if ... is provided and if not do nothing
    attrs_values_length <- length(attrs_values)
    if(attrs_values_length == 0) return()
    ## check if all attributes in ... are named
    attrs_values_names <- names(attrs_values)
    if(length(attrs_values_names) != attrs_values_length)
        disambr_stop("'...' arguments should be all named!")
    if(isTRUE(check_attr_names_prefix)) {
    ## add "disambr_set_" if attr names are short
    attrs_values_names_short <-
        !stringi::stri_detect_regex(attrs_values_names
                                  , paste0("^", attr_names_prefix))
    if(any(attrs_values_names_short)) {
        attrs_values_names[attrs_values_names_short] <-
            paste0(attr_names_prefix
                 , attrs_values_names[attrs_values_names_short])
    }
    }
    ## set attributes
    for (i in 1:length(attrs_values)) {
        ## also works for other that data.table objects
        data.table::setattr(focal_set, attrs_values_names[i], attrs_values[[i]])
    }
    return(focal_set)
}
expect_equal(
    attributes(
        disambr_set_attr(c(1,2,3)
                       , lalala = "la"
                       , disambr_set_important_attr = "Hi there"))
  , list(disambr_set_lalala = "la", disambr_set_important_attr = "Hi there")
)

disambr_add_set_attr

##' Add disambr attribures to focal set from template set and update some of them
##'
##' It adds attributes by reference (with setattr {data.table}), i.e. without making a copy
##'
##' It updates:
##' - time stamp (disambr_set_st) to current
##' - disambr_set_file to NULL
##' - adds to disambr_set_recipe the calling procedure
##' - adds disambr_set_duration 
##' @param focal_set Set
##' @param template_set Set to inhirit attributes from
##' @param ... other attributes
##' @param attr_names_prefix "disambr_set_" by default. Only prefixed by it will be copied from `template_set` 
##' @inheritDotParams disambr_set_attr
##' @return 
##' 
##' @md 
##' @export 
disambr_add_set_attr <- function(focal_set
                               , template_set = NULL
                               , ...
                               , attr_names_prefix = "disambr_set_") {
    if(is.null(focal_set)) return()
    ## copy only disambr attr from template_set
    template_attr <- attributes(template_set)
    template_attr_disambr <- 
        stringi::stri_detect_regex(names(template_attr)
                                 , paste0("^", attr_names_prefix))
    template_attr <- template_attr[template_attr_disambr]
    mapply(function(a, name) {
        data.table::setattr(focal_set, name, a)
    }
  , template_attr
  , names(template_attr))
    ## remove file attributes
    disambr_set_attr(focal_set, file = NULL)
    ## add time stamp
    disambr_set_attr(focal_set, ts = Sys.time())
    ## add duration
    if(exists("disambr_start_time", where = parent.frame())) {
        disambr_start_time <- 
            get("disambr_start_time", pos = parent.frame())
        disambr_set_attr(focal_set
                       , duration = Sys.time() - disambr_start_time)
    }
    ## add recipe (procedure call) and name
    recipe <- attributes(template_set)$disambr_set_recipe
    procedure_call <- deparse(sys.calls()[[sys.nframe() - 1]])[[1]]
    procedure_name <-
        stringi::stri_extract_first_regex(procedure_call
                                        , c("^[^()]+"))
    procedure_short_name <-
        stringi::stri_replace_first_regex(procedure_name
                                        , paste0("^", attr_names_prefix), "")
    disambr_set_attr(focal_set
                   , name = procedure_short_name
                   , recipe = c(list(list(func = procedure_name
                                        , call = procedure_call))
                              , recipe))
    ## set attributes from ...
    disambr_set_attr(focal_set, ...)
    return(focal_set)
}
a <- data.table::data.table(a = c(1,2,3,4)
               ,b = c(11,22,33,44))

b <- data.table::data.table(a = c(6,7,8)
               ,b = c(66,77,88))

disambr_set_attr(a
               , name = "a"
               , strength = 0.5
               , ts = Sys.time()
               , file = "lalala.rds"
               , recipe = list("second_procedure"
                             , "first_procedure"))

foo <- function(b, a = NULL, ...) {
    disambr_message_start()
    disambr_add_set_attr(b, a, ...)
    return(attributes(b))
}

expect_message(foo(b, a))

expect_equal(
    foo(b, a)$disambr_set_recipe
  , list(list(func = "foo", call = "foo(b, a)"), "second_procedure", "first_procedure")
)

expect_equal(foo(b,a, name = "new.name")$disambr_set_name, "new.name")
expect_equal(foo(b,a, strength = 1)$disambr_set_strength, 1)
expect_equal(foo(b,a)$disambr_set_name, "foo")

expect_equal(foo(a)$disambr_set_name, "foo")
expect_null(foo(NULL, a))

expect_equal(
    foo(b, NULL)$disambr_set_recipe
  , list(list(func = "foo", call = "foo(b, NULL)"))
)

disambr_save_set

##' Save set and adds file attribute
##' @param set_to_save set
##' @param save_set_as if TRUE the the file name is made from disambr_set_name attribute
##' @param save_set_prefix file prefix, default is "disambr-set."
##' @param save_set_dir file dir, default is "disambr-sets-rds" in current directory
##' @param use_time_stamp add time stamps at the end of file name, adds by default
##' @return file namej
##' 
##' @export 
disambr_save_set <- function(set_to_save
                           , save_set_as = getOption("disambr_save_as")
                           , save_set_prefix = getOption("disambr_save_set_prefix")
                           , save_set_dir = getOption("disambr_save_set_dir")
                           , use_time_stamp = getOption("disambr_save_set_time_stamp")) {
    ## do not save by default
    if(length(save_set_as) != 0) {
        ## make name if it is just TRUE
        if(isTRUE(save_set_as)) {
            save_set_as <- attr(set_to_save, "disambr_set_name")[[1]]
            ## use timestamps by default
            if(isTRUE(use_time_stamp) ||
               length(use_time_stamp) == 0) {
                save_set_as <-
                    paste0(save_set_as, "."
                         , format(Sys.time(), "%Y-%m-%dT%H-%M"))
            }
            save_set_as <- paste0(save_set_as, ".rds")
            if(length(save_set_prefix) == 0) {
                save_set_prefix <- "disambr-set."
            }
        }
        ## if "save as" provided use it, add prefix if it is provided as well
        if(is.character(save_set_as)) {
            if(length(save_set_prefix) != 0) {
                save_set_as <- paste0(save_set_prefix, save_set_as)
            }
            ## add directory or default
            if(length(save_set_dir) == 0) {
                save_set_dir <- "disambr-sets-rds"
            }
            dir.create(save_set_dir, showWarnings = FALSE, recursive = TRUE)
            save_set_as <- paste0(save_set_dir,"/",save_set_as)
            ## add file attribute
            disambr_set_attr(set_to_save, file = save_set_as)
            ## save
            saveRDS(set_to_save, file = save_set_as[[1]], compress = FALSE)
            disambr_message(paste0(
                "- set saved as '", save_set_as, "'"))
            return(save_set_as)
        } else {
            disambr_message(
                paste0("- do not know how to save 'set_to_save' as '"
                     , save_set_as, "'"))
            return()
        }
    } else {
        return()
    }
}
## check files manually
expect_null(disambr_save_set(disambr_set_attr(c(1,2,3)
                                            , name = "bar")
                           , save_set_as = NULL
                           , save_set_dir = "../disambr-sets-rds"))

## with making files
expect_inherits(disambr_save_set(disambr_set_attr(c(1,2,3)
                                             , name = "bar")
                            , save_set_as = TRUE
                            , save_set_dir = "disambr-save-set-test")
              , "character")
## readRDS("../disambr-sets-rds/disambr-set.bar.2020-08-30T10-25.rds")
unlink("disambr-save-set-test", recursive = TRUE)

disambr_get_output_set

##' Gets output set from sets (in case we already made it)
##' @param sets sets
##' @param get_output_set Whether to search for output set. Default is not.
##' @param attr_names_prefix prefix for attributes
##' @return NULL or output set
##' 
##' @export 
disambr_get_output_set <- function(sets
                                 , get_output_set = getOption("disambr_get_output_set")
                                 , attr_names_prefix = "disambr_set_") {
    if(isTRUE(get_output_set)) {
        procedure_call <- deparse(sys.calls()[[sys.nframe() - 1]])
        procedure_name <-
            stringi::stri_extract_first_regex(
                         procedure_call, c("^[^()]+"))
        procedure_short_name <-
            stringi::stri_replace_first_regex(
                         procedure_name
                       , paste0("^", attr_names_prefix), "")
        output_set_index <-
            which(disambr_in_sets(sets, name = procedure_short_name))
        if(length(output_set_index) == 0) {
            return()
        } else if(length(output_set_index) == 1) {
            disambr_message(paste("- reusing output set:", procedure_short_name))
            return(sets[[output_set_index]])
        } else {
            disambr_message(paste("- reusing last output set:", procedure_short_name))
            output_set_index <- output_set_index[length(output_set_index)]
            return(sets[[output_set_index]])
        }
    } else {
        return()
    }
 }
a <- list(disambr_set_attr(c(1), name = "foo1")
        , disambr_set_attr(c(1,2), name = "foo", a = 1)
        , disambr_set_attr(c(1,2,3), name = "foo0", a = 2)
        , disambr_set_attr(c(1,2,3,4), name = "bar", a = 212))


foo <- function(sets, ...) {
    return(disambr_get_output_set(sets, ...))
}

expect_message(foo(a, get_output_set = TRUE))

expect_equal(foo(a, get_output_set = TRUE)
           , structure(c(1, 2), disambr_set_name = "foo", disambr_set_a = 1)
             )

expect_null(foo(a))

disambr_read_output_set

##' Reads last output set saved on disk
##' @param read_output_set toggle. default is no
##' @param save_set_prefix file prefix
##' @param save_set_dir file dir
##' @param attr_names_prefix arrt prefix
##' @return 
##' 
##' @export 
disambr_read_output_set <- function(read_output_set = getOption("disambr_read_output_set")
                                  , save_set_prefix = getOption("disambr_save_set_prefix")
                                  , save_set_dir = getOption("disambr_save_set_dir")
                                  , attr_names_prefix = "disambr_set_") {
    if(isTRUE(read_output_set)){
        ## make defaults if not provided
        if(length(save_set_prefix) == 0) {
            save_set_prefix <- "disambr-set." 
        }
        if(length(save_set_dir) == 0) {
            save_set_dir <- "disambr-sets-rds"
        }
        ## output set name pattern
        procedure_call <- deparse(sys.calls()[[sys.nframe() - 1]])
        procedure_name <-
            stringi::stri_extract_first_regex(
                         procedure_call, c("^[^()]+"))
        procedure_short_name <-
            stringi::stri_replace_first_regex(
                         procedure_name
                       , paste0("^", attr_names_prefix), "")
        output_set_name_pattern <-
            paste0(save_set_prefix, procedure_short_name, ".*", "\\.rds")
        ## match last file
        output_set_file <- 
            list.files(save_set_dir, pattern = output_set_name_pattern)
        if(length(output_set_file) != 0) {
            ## take the last file (as they are sorted alphabetically)
            output_set_file <- output_set_file[length(output_set_file)]
            output_set_file <- file.path(save_set_dir, output_set_file)
            disambr_message(paste("- reusing saved set:", output_set_file))
            return(readRDS(file = output_set_file))
        } else {
            return()
        }
    } else {
        return()
    }
}
a <- list(disambr_set_attr(c(1), name = "foo1")
        , disambr_set_attr(c(1,2), name = "foo", a = 1)
        , disambr_set_attr(c(1,2,3), name = "foo0", a = 2)
        , disambr_set_attr(c(1,2,3,4), name = "bar", a = 212))


foo <- function(sets) {
    disambr_save_set(sets[[2]]
                   , save_set_as = TRUE
                   , save_set_dir = "disambr_read_output_set_test")
    disambr_read_output_set(read_output_set = TRUE
                          , save_set_dir = "disambr_read_output_set_test")
}

expect_equal(length(foo(a)), 2)

expect_equal(length(attributes(foo(a))), 3)

unlink("disambr_read_output_set_test", recursive = TRUE)

foo <- function(sets) {
    disambr_save_set(sets[[2]]
                   , save_set_as = "tra-la-la.rds"
                   , save_set_dir = "disambr_read_output_set_test")
    disambr_read_output_set(read_output_set = TRUE
                          , save_set_dir = "disambr_read_output_set_test")
}

expect_null(foo(a))

unlink("disambr_read_output_set_test", recursive = TRUE)

disambr_subsets

##' Filters list of sets
##' @param sets_list list of sets
##' @param attribute_value_list list of attribute values where list elements name correspond attribute names used for filtering sets
##' @param which_to_return whether to return "all", "first" or "last" set from filtered sets
##' @param negate_subsets whether to return sets that was not matched insted
##' @return list of sets or set if `which_to_return` is either "first" or "last"
##' 
##' @export 
disambr_subsets <- function(sets_list, attribute_value_list
                          , which_to_return = c("all", "first", "last")
                          , negate_subsets = FALSE) {
    if (!is.list(sets_list))
        stop("disambr: 'sets_list' should be a list!")
    if (!is.list(attribute_value_list))
        stop("disambr: 'attribute_value_list' should be a list!")
    filter_sets <- function(attr_name, attr_value) {
        sapply(lapply(sets_list, attr, attr_name), `[`, 1) %in% attr_value
    }
    sets_list_filters <- mapply(filter_sets
                              , names(attribute_value_list)
                              , attribute_value_list
                              , SIMPLIFY = FALSE
                              , USE.NAMES = TRUE)
    subsets_list <-
        if (isTRUE(negate_subsets)) {
            sets_list[!Reduce(`&`, sets_list_filters)]
        } else {
            sets_list[Reduce(`&`, sets_list_filters)]
        }
    return(switch(which_to_return[1]
                , all = subsets_list
                , first = subsets_list[[1]]
                , last = subsets_list[[length(subsets_list)]]))
}



## a <- c(1,2,3,4)
## b <- c("a","b","c")
## c <- NULL
## attributes(a)$name <- "aaa"
## attributes(b)$name <- "bbb"
## attributes(c)$name <- c("ccc", 3)
## attributes(a)$kind <- "good"
## attributes(b)$kind <- "good"
## attributes(c)$kind <- "bad"

## disambr_subsets(list(a,b,c), list(kind = "good"))

## disambr_subsets(list(a,b,c), list(kind = "good"), which_to_return = "last")

## disambr_subsets(list(a,b,c), list(name = "ccc"
##                                 , kind = "good"))

## disambr_subsets(list(a,b,c), list(name = "ccc"
##                                 , kind = "bad"))

disambr_setattr

## disambr_entity
## disambr_set_type
## disambr_set_coefficient
## disambr_set_name
## disambr_set_collection
## disambr_entity_id_reference
## disambr_entity_id_reference_md5_sum
## disambr_recipe
disambr_setattr <- function(focal_set, ...) {
    attr_value_list <- list(...)
    for (i in 1:length(attr_value_list)) {
        setattr(focal_set, names(attr_value_list)[i], attr_value_list[[i]])
    }
    return(focal_set)
}

disambr_read

disambr_read

##' Reads the data for disambiguation
##' @param files_path Path to data. You can specify almost everything
##' @param save_sets_as Save set as
##' @param save_sets_dir Directory to save set in
##' @param use_time_stamp Whether to use timestamp in the file name
##' @return 
##' 
##' @md 
##' @export 
disambr_read <- function(files_path
                       , save_sets_as = NULL
                       , save_sets_dir = "disambr-data"
                       , use_time_stamp = FALSE) {
    disambr_message_start()
    ## see if the data is available already
    if(is.character(save_sets_as) &&
       file.exists(file.path(save_sets_dir, save_sets_as))) {
        disambr_message(paste("- reusing saved sets:", save_sets_as))
        return(readRDS(file.path(save_sets_dir, save_sets_as)))
    }
    files_path <- parse_files_path(files_path)
    files_data_list <- lapply(files_path, disambr_read_file)
    sets <- disambr_make_data(files_data_list)
    ## save just in case
    if(is.character(save_sets_as)) {
        disambr_save_set(sets
                       , save_set_as =  save_sets_as
                       , save_set_dir = save_sets_dir
                       , use_time_stamp = use_time_stamp)
    }
    disambr_message_finish()
    return(sets)
}
my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")

if(file.exists(my.file)) {
    expect_equal(
        length(disambr_read(my.file)), 4
    )
}

test

## my.dir <- '../data'
  ## my.dir.large <- '/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments'
  ## my.dir.huge <- '/mnt/md5/data/wos'


  ## my.file <- '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt' 
  ## my.file1 <- "/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments/LN Public NAICS records from 10001 to 10500.txt"

  ## my.files <- 
  ## c('../data/Journals in Mathematical Psychology/Applied Measurement in Education.txt'
  ## , '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt')

  my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
  dt <- my.file2 %>% disambr_read(save_sets_as = "data-test.rds")

  my.dir.small <- '../data/Journals in Mathematical Psychology'
  dt <- my.dir.small %>% disambr_read


  dt %>% sapply(class)

disambr_read_file

##' Reads file based on file extention
##' @param f full file path name
##' @return data
##' 
##' @export 
disambr_read_file <- function(f) {
    f_extention <- tools::file_ext(f)
    switch(f_extention
         , "tsv" = disambr_read_tsv(f)
           ## here we can add reading from .txt wos files
         , "txt" = disambr_read_tsv(f)
         , message("Disambr: can not read file extention: ", f_extention
                 , "\n  - skipping file: ", f))
}

disambr_read_tsv

##' Read tsv file
##' @param f path
##' @return data
##' 
##' @export 
disambr_read_tsv <- function(f) {
    ## check tsv file type base on first line
    first_line <- readLines(f, n = 1
                          , warn = FALSE
                          , skipNul = TRUE)
    header <- parse_tsv_wos_header(first_line)
    if(!isFALSE(header)) {
        disambr_read_tsv_wos(f, header)
    } else {
        ## here we can add more tsv types
        message("Disambr: unrecognized header of tsv file: ", header
              , "\n  - skipping file: ", f)
        NULL
    }
}

parse_tsv_wos_header

parse_tsv_wos_header <- function(first_line) {
    header <- stringi::stri_split_fixed(first_line, "\t")[[1]]
    if( ## check if at least 10 fields two big letters
        sum(stringi::stri_detect_regex(header, "^[A-Z0-9]{2}$")) > 10 &&
        ## check if main fields are present
        all(c('AU', 'TI') %in% header)) {
        stringi::stri_extract_first_regex(header, "[A-Z0-9]{2}")
    } else {FALSE}
}

disambr_read_tsv_wos

##' Reads WoS tsv export file and makes disambr set out of it (just adding some attributes to the data.table)
##' 
##' @param f path
##' @param header header 
##' @return disambr set
##' 
##' @export 
disambr_read_tsv_wos <- function(f, header) {
    s <- read_to_utf8(f)
    s <- recode_return_characters(s, f)
    f_data <- data.table::fread(text = s
                              , skip = 1
                              , strip.white = TRUE
                              , header = FALSE
                              , col.names = header
                              , select = 1:length(header)
                                ## , colClasses = rep("character", length(header))
                              , quote=""
                              , keepLeadingZeros = FALSE
                              , encoding = "UTF-8"
                              , sep = "\t")
    ## set attrib (file, funcall, meanning of the fields and data scheme)
    disambr_add_set_attr(f_data, NULL
                       , unit = "publication"
                       , reference = "self"
                       , type = "different"
                       , id = "index"
                       , strength = 1
                       , name = "wos_tsv"
                       , collection = "unit_table"
                       , recipe = list(func = "disambr_read_tsv_wos"
                                     , file_name = f
                                     , file_md5sum = tools::md5sum(f)
                                     , file_header = header))
    return(f_data)
}

disambr_make_data

disambr_make_data <- function(files_data_list
                            , drop_ejected = FALSE) {
    ## TODO: add other data processing here
    ## TODO: add processing of wos data with differen headers
    ## check wos publication
    processabe_data <-
        disambr_in_sets(files_data_list, name = "wos_tsv")
    processabe_data <- files_data_list[processabe_data]
    if (length(processabe_data) != 0) {
        processabe_data_recipes <-
            lapply(processabe_data, attr, "disambr_set_recipe")
        processabe_data_headers <-
            lapply(processabe_data_recipes, `[[`, "file_header")
        ## check if all headers ate the same before rbindlist
        if (length(unique(processabe_data_headers)) == 1) {
            disambr_message("Processing wos tsv export data..")
            disambr_message("- rbinding wos publication tables..")
            wos_publication <-
                disambr_make_wos_tsv_publications(files_data_list, processabe_data_recipes)
            disambr_message("- making wos authors table..")
            wos_author <- disambr_make_wos_tsv_authors(wos_publication)
            if(drop_ejected) {
                ## remove fields that we do not need
                remove_headers <- c("AU", "AF", "C1", "RP", "EM", "RI", "OI")
                ## filter those that exists
                remove_headers <-
                    remove_headers[remove_headers %in% processabe_data_headers[[1]]]
                ## remove headers without hard copy and `[.data.table` overhead
                data.table::set(wos_publication, , remove_headers, NULL)
            }
            disambr_message("- making wos references table..")
            wos_reference <- disambr_make_wos_tsv_references(wos_publication)
            if(drop_ejected) {
                remove_headers <- c("CR")
                ## filter those that exists
                remove_headers <-
                    remove_headers[remove_headers %in% processabe_data_headers[[1]]]
                ## remove headers without hard copy and `[.data.table` overhead
                data.table::set(wos_publication, , remove_headers, NULL)
            }
            disambr_message("- making author-year citations table..")
            citation_name_table <-
                disambr_make_wos_tsv_author_year_citations(wos_publication, wos_reference)
            ## if all data is wos data return only that
            if(length(processabe_data) == length(files_data_list)) {
                return(list(wos_publication
                          , wos_author
                          , wos_reference
                          , citation_name_table
                            ))
            } else {
                return(list(
                    ## TODO implement
                    ## disambr_subsets(files_data_list
                    ## , list(disambr_set_name = "wos_records_tsv_export")
                    ## , negate_subsets = TRUE)
                    wos_publication
                  , wos_author
                  , wos_reference
                  , citation_name_table
                ))
            }
        } else {
            disambr_message("Files data has different headers. Skipping processing...")
            return(files_data_list)
        }
    } else {
        return(files_data_list)
    }
}

disambr_make_wos_tsv_publications

##' Make set of publications
##' 
##' @param tables_list list of tables
##' @param recipes recipes
##' @return set
##' 
##' @import data.table
##' @export 
disambr_make_wos_tsv_publications <- function(tables_list, recipes) {
    publication_table <-
        data.table::rbindlist(tables_list, fill=TRUE)
    ## add ids, first_author_last_name, first_author_first_initials
    publication_table[
      , `:=`(id = 1:.N
           , first_author_last_name =
                 toupper(stringi::stri_extract_first_regex(AU, "^[^,]+"))
           , first_author_first_initial =
                 toupper(stringi::stri_extract_first_regex(AU, "(?<=, )[A-Za-z]"))
           , doi = stringi::stri_match_first_regex(DI
                                                 , "10.\\d{4,9}/[-._;()/:A-Za-z0-9]+"))]
    publication_table[
      , name_year := paste(first_author_last_name
                         , first_author_first_initial
                         , PY)]
    ## set publication attributes
    disambr_add_set_attr(publication_table, NULL
                       , unit = "publication"
                       , reference = "self"
                       , type = "different"
                       , strength = 1
                       , name = "wos_tsv_publications"
                       , collection = "unit_table"
                         ## add files recipies
                       , recipe = c(list('disambr_make_wos_tsv_publications')
                                  , recipes))
    return(publication_table)
}
disambr_make_wos_tsv_publications <- disambr:::disambr_make_wos_tsv_publications

my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")

if(file.exists(my.file)) {
    dt <- list(disambr_read_file(my.file))

    expect_equal(
        attributes(disambr_make_wos_tsv_publications(dt, "recipe-lalala"))$disambr_set_recipe[[2]]
      , "recipe-lalala")
    expect_equal(
        length(disambr_make_wos_tsv_publications(dt, "recipe-lalala"))
      , 72
    )
}
stringi::stri_extract_first_regex("Vlasov, asdf;", "(?<=, )[A-Za-z]")

my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"

a <- my.file2 %>% disambr_read_file %>% list %>% disambr_make_wos_tsv_publications("recipe-lalala")

a %>% attributes

a[, .(id, first_author_last_name, first_author_first_initial)]

disambr_wos_tsv_parse_au

##' Parses AU column of WoS saved records export
##' @param author_name a record string from AU column
##' @return data.table
##' 
##' @md 
disambr_wos_tsv_parse_au <- function(author_name) {
    author_last_name <-
        stringi::stri_extract_first_regex(author_name, "^[^,]+")
    author_initials <-
        stringi::stri_extract_first_regex(author_name, "(?<=, )[A-Z]+")
    data.table::data.table(author_name = author_name
                         , author_last_name = author_last_name
                         , author_initials = author_initials
                         , author_order = 1:length(author_name))
}

## tests

## "Tilly, TB; Nelson, MT; Chakravarthy, KB; Shira, EA; Debrose, MC; Grabinski, CM; Salisbury, RL; Mattie, DR; Hussain, SM" %>% stri_split_fixed("; ") %>% 
## disambr_wos_tsv_parse_au

disambr_wos_tsv_parse_af

##' Parses AF (author full name) column of WoS saved records export
##' @param name (a record string from AF column)?
##' @return Data.table
disambr_wos_tsv_parse_af <- function(name) {
    last_name <- stringi::stri_extract_first_regex(name, "^[^,]+")
    first_names <- stringi::stri_extract_first_regex(name, "(?<=, ).*")
    first_names <-
        stringi::stri_split_fixed(first_names, " ", omit_empty = TRUE)
    ## first.full.name is first name without dot
    first_full_name <-
        lapply(first_names, function(n) {
            n[!stringi::stri_detect_regex(n, "\\.$")][1]
        })
    ## return
    data.table::data.table(
                    author_full_name = name
                  ## , author_last_name = last_name 
                  , author_first_names = first_names
                  , author_first_full_name =  first_full_name)
}


## test
## "Tilly, Trevor B.; Nelson, M. Tyler; Chakravarthy, Karthik B.; Shira, Emily A.; Debrose, Madeline C.; Grabinski, Christin M.; Salisbury, Richard L.; Mattie, David R.; Hussain, Saber M." %>%
## disambr_wos_tsv_parse_af

disambr_wos_tsv_parse_rp

##' Parses RP (reprint author) column of WoS saved records export
##' @param record_rp a record string from RP column
##' @return Data.table with two columns -  author_name and affiliations
disambr_wos_tsv_parse_rp <- function(record_rp) {
    record_rp_init <- ""
    authors_table <-
        data.table::data.table(author_name = character(0)
                             , affiliations = character(0))
    while(record_rp != record_rp_init) {
        record_rp_init <- record_rp
        record_rp_split <- 
            stringi::stri_match_first_regex(
                         record_rp
                       , "\\s*([^()]+)\\s+\\((corresponding author|reprint author)\\)([^;]+)")
        authors <-
            stringi::stri_split_fixed(record_rp_split[1,2], "; ")[[1]]
        affiliation <-
            stringi::stri_replace_first_regex(
                         record_rp_split[1,4], "^[\\s,.;]+", "")
        for (author in authors) {
            ## check if author is already in the list
            authors_table_match <-
                authors_table$author_name %in% author
            if(any(authors_table_match)) {
                ## add affiliation to affiliations of author
                ## the data.table way..
                authors_table[authors_table_match
                            , affiliations :=
                                  list(c(unlist(affiliations), affiliation))]
            } else {
                ## add new author with affiliation otherwise
                authors_table <-
                    data.table::rbindlist(
                                    list(authors_table
                                       , list(author_name = author
                                            , affiliations =
                                                  list(affiliation))))
            }
        }
        record_rp <-
            stringi::stri_replace_first_regex(
                         record_rp
                       , "[^()]+\\((corresponding author|reprint author)\\)[^;]+[;]", "")
    }
    ## results are not printed but the data.table is returned
    return(authors_table)
}

## "Guesmi, S (corresponding author), Natl Agron Inst Tunisia INAT, 43 Ave Charles Nicolle, Tunis 1082, Tunisia.; Guesmi, S; Sghaier, H (corresponding author), Sidi Thabet Technopk, Natl Ctr Nucl Sci & Technol, Lab Energy & Matter Dev Nucl Sci LR16CNSTN02, Sidi Thabet 2020, Tunisia.; Sghaier, H (corresponding author), Sidi Thabet Technopk, Lab Biotechnol & Nucl Technol LR16CNSTN01, Sidi Thabet 2020, Tunisia.; Sghaier, H (corresponding author), Sidi Thabet Technopk, Lab Biotechnol & Biogeo Resources Valorizat LR11E, Sidi Thabet 2020, Tunisia." %>%
## disambr_wos_tsv_parse_rp %>% print

## "" %>%
## disambr_wos_tsv_parse_rp %>% nrow

disambr_wos_tsv_parse_em

##' Parses EM (email) column of WoS saved records export
##' @param emails a record string from EM column
##' @param record_au_table a data_tabe after parsing AU column with disambr_wos_tsv_parse_au
##' @param record_rp_table a data_tabe after parsing RP column with disambr_wos_tsv_parse_rp
##' @return Data.table with columns - author_name, affiliations and email
disambr_wos_tsv_parse_em <- function(emails
                                   , record_au_table
                                   , record_rp_table) {
    if (isTRUE(length(emails) == 1 && emails == "")) {
        ## in case there are no emails
        record_au_table[, author_email := NA]
    } else if (isTRUE(length(emails) == nrow(record_rp_table))) {
        ## assume that emails corresponds RP authors
        record_au_table[match(record_rp_table$author_name, author_name)
                      , author_email := emails]
    } else if (isTRUE(length(emails) == nrow(record_au_table))) {
        ## assume that emails corresponds AU authors
        record_au_table[, author_email := emails]
    } else if (isTRUE(nrow(record_rp_table) != 0)) {
        ## in other cases just use first email for first RP author
        record_au_table[match(record_rp_table$author_name, author_name)[1]
                      , author_email := emails[1]]
    } else {
        ## if no RP assignt to first in AU
        record_au_table[1, author_email := emails[1]]
    }
    ## we do not need to return things as it updates record_au_table
    return(record_au_table)
}


## tests
## disambr_wos_tsv_parse_em(
## record_em = "a"
## , record_au_table = data.table(author_name = c(1,2,3,4))
## , record_rp_table = data.table(author_name = c(3))
## ) %>% print

disambr_wos_tsv_parse_c1

##' Parses C1 (author adress/affiliation) column of WoS saved records export
##' @param record_c1 a record string from RP column 
##' @param table_af Table
##' @return Data.table with two columns -  author_name and affiliations
disambr_wos_tsv_parse_c1 <- function(record_c1
                                   , table_af = NULL) {
    record_c1_init <- ""
    authors_table <-
        data.table::data.table(author_full_name = character()
                             , affiliations = list())
    while(record_c1 != record_c1_init) {
        record_c1_init <- record_c1
        record_c1_piece <- 
            stringi::stri_match_first_regex(
                         record_c1, "\\s*\\[([^\\[\\]]+)\\]\\s+([^;]+)\\s*")
        authors <-
            stringi::stri_split_fixed(record_c1_piece[1,2], "; ")[[1]]
        affiliation <- record_c1_piece[1,3]
        for (author in authors) {
            ## check if author is already in the list
            authors_table_match <-
                authors_table$author_full_name %in% author
            if(any(authors_table_match)) {
                ## add affiliation to affiliations of author
                ## the data.table way..
                authors_table[authors_table_match
                            , affiliations :=
                                  list(c(unlist(affiliations), affiliation))]
            } else {
                ## add new author with affiliation otherwise
                authors_table <-
                    data.table::rbindlist(list(authors_table
                                             , list(author_full_name = author
                                                  , affiliations = list(affiliation))))
            }
        }
        record_c1 <-
            stringi::stri_replace_first_regex(
                         record_c1, "\\s*\\[[^\\[\\]]+\\][^;]+[;]", "")
    }
    ## merge with table_af if provided
    if(length(table_af) != 0) {
        return(authors_table[table_af
                           , on = "author_full_name"
                           , .(affiliations)])
    } else {
        return(authors_table)
    }
}



## "[Wang, Menglei; Li, Shunyi; Zhu, Rencheng; Zhang, Ruiqin] Zhengzhou Univ, Sch Ecol & Environm, Zhengzhou 450001, Peoples R China; [Wang, Menglei] Zhengzhou Univ, Sch Chem Engn, Zhengzhou 450001, Peoples R China; [Zu, Lei; Wang, Yunjing; Bao, Xiaofeng] Chinese Res Inst Environm Sci, State Environm Protect Key Lab Vehicle Emiss Cont, Beijing 100012, Peoples R China" %>%
## disambr_wos_tsv_parse_c1


## "[Wang, Menglei; Li, Shunyi; Zhu, Rencheng; Zhang, Ruiqin] Zhengzhou Univ, Sch Ecol & Environm, Zhengzhou 450001, Peoples R China; [Wang, Menglei] Zhengzhou Univ, Sch Chem Engn, Zhengzhou 450001, Peoples R China; [Zu, Lei; Wang, Yunjing; Bao, Xiaofeng] Chinese Res Inst Environm Sci, State Environm Protect Key Lab Vehicle Emiss Cont, Beijing 100012, Peoples R China" %>%
## disambr_wos_tsv_parse_c1(disambr_wos_tsv_parse_af("Wang, Menglei; Zu, Lei; Wang, Yunjing; Bao, Xiaofeng"))

disambr_wos_tsv_parse_oi

##' Parses OI column of WoS saved records export
##' @param authors a record string from OI column
##' @param table_af Table
##' @return data.table
##' 
##' @md 
disambr_wos_tsv_parse_oi <- function(authors
                                   , table_af = NULL) {
    author_full_name <-
        stringi::stri_extract_first_regex(authors, "^[^/]+")
    author_orcid <-
        stringi::stri_extract_first_regex(authors, "(?<=/).+")
    authors_table <- 
        data.table::data.table(author_full_name = author_full_name
                             , author_orcid = author_orcid)
    ## take care of propable case of multiple ID for one person
    author_full_name_unique <- unique(authors_table$author_full_name)
    author_orcid_list <-
        lapply(author_full_name_unique
             , function(x) {
                 authors_table$author_orcid[authors_table$author_full_name %in% x]
             })
    authors_table <-
        data.table::data.table(author_full_name = author_full_name_unique
                             , author_orcid = author_orcid_list)
    if(length(table_af) != 0) {
        return(authors_table[table_af
                           , on = "author_full_name"
                           , .(author_orcid)])
    } else {
        return(authors_table)
    }
}



## "Estrela, Pedro/0000-0001-6956-1146; Maxted, Grace/0000-0002-6816-9107; Rainbow, Joshua/0000-0003-3911-928X; Richtera, Lukas/0000-0002-8288-3999; Moschou, Despina/0000-0001-9175-5852" %>%
    ## disambr_make_wos_tsv_authors__parse_oi



## "Estrela, Pedro/0000-0001-6956-1146; Maxted, Grace/0000-0002-6816-9107; Rainbow, Joshua/0000-0003-3911-928X; Rainbow, Joshua/0000-0003-3911-928X; Richtera, Lukas/0000-0002-8288-3999; Moschou, Despina/0000-0001-9175-5852" %>%
    ## disambr_make_wos_tsv_authors__parse_oi(
        ## table_af = data.table(author_full_name =
                                  ## c("Rainbow, Joshua", "Moschou, Despina")))                                                                                       

## "" %>% disambr_make_wos_tsv_authors__parse_oi(
           ## table_af = data.table(author_full_name = c("Rainbow, Joshua")))

## NA %>% disambr_make_wos_tsv_authors__parse_oi

disambr_wos_tsv_parse_ri

##' Parses RI column of WoS saved records export
##' @param authors a record string from RI column
##' @param table_af Table
##' @return data.table
##' 
##' @md 
##' @export 
disambr_wos_tsv_parse_ri <- function(authors
                                   , table_af = NULL) {
    author_full_name <-
        stringi::stri_extract_first_regex(authors, "^[^/]+")
    author_researcher_id <-
        stringi::stri_extract_first_regex(authors, "(?<=/).+")
    authors_table <- 
        data.table::data.table(author_full_name = author_full_name
                             , author_researcher_id = author_researcher_id)
    ## take care of propable case of multiple ID for one person
    author_full_name_unique <- unique(authors_table$author_full_name)
    author_researcher_id_list <-
        lapply(author_full_name_unique
             , function(x) {
                 authors_table$author_researcher_id[
                                   authors_table$author_full_name %in% x]
             })
    authors_table <-
        data.table::data.table(author_full_name = author_full_name_unique
                             , author_researcher_id = author_researcher_id_list)
    if(length(table_af) != 0) {
        return(authors_table[table_af
                           , on = "author_full_name"
                           , .(author_researcher_id)])
    } else {
        return(authors_table)
    }
}

disambr_make_wos_tsv_authors

##' Parses all WoS field related to authors and makes author table
##'
##' Relevant fields that are parsed (as in Web of Science Field Tags 2018-06-27):
##' - AU
##' - AF (full names)
##' - C1 adresses
##' - RP reprint address (one you contact for reprint copy)
##' - EM emails
##' - RI researcher ID
##' - OI ORCID Identifier (Open Researcher and Contributor ID)
##'
##' Fields that are not parsed
##' - BA book - not parsed
##' - BF book - not parsed
##' - CA gp group author (usually organization or group name) - not parsed
##' - BE editors - not parsed
##' @param wos_data_table WoS tsv export data.table
##' @param list_of_author_fields list of data to parse to construct (if not all selected as by default it will save time)
##' @param ... set verbose if needed here
##' @return author table
##' 
##' @md 
##' @export 
disambr_make_wos_tsv_authors <- function(wos_data_table
                                , list_of_author_fields =
                                      c("author_order"
                                      , "author_short_name"
                                      , "author_initials"
                                      , "author_last_name"
                                      , "author_full_name"
                                      , "author_first_names"
                                      , "author_first_full_name"
                                      , "author_email"
                                      , "author_researcher_id"
                                      , "author_orcid"
                                      , "author_affiliations")
                                , ...) {
    disambr_message_start(start_mess_prefix = "-- making set")
    authors_tables <- list()
    ## AU
    if(any(c("author_order"
           , "author_short_name"
           , "author_last_name"
           , "author_initials"
           , "author_email") %in% list_of_author_fields) &&
       "AU" %in% names(wos_data_table)) {
        disambr_message("-- parsing AU field", ...)
        authors_tables$au <-
            lapply(
                stringi::stri_split_fixed(wos_data_table$AU, "; ")
              , disambr_wos_tsv_parse_au)
    }
    ## AF
    if(any(c("author_full_name"
           , "author_first_names"
           , "author_first_full_name"
           , "author_researcher_id"
           , "author_orcid"
           , "author_affiliations") %in% list_of_author_fields) &&
       "AF" %in% names(wos_data_table)) {
        disambr_message("-- parsing AF field", ...)
        authors_tables$af <-
            lapply(
                stringi::stri_split_fixed(wos_data_table$AF, "; ")
              , disambr_wos_tsv_parse_af)
    }
    ## RP
    if(any(c("author_email") %in% list_of_author_fields) &&
       "RP" %in% names(wos_data_table)) {
        disambr_message("-- parsing RP field", ...)
        ## save RP separately as it is different order from AU
        rp <-
            lapply(wos_data_table$RP
                 , disambr_wos_tsv_parse_rp)
    }

    ## EM
    if(any(c("author_email") %in% list_of_author_fields) &&
       all(c("AU", "EM", "RP") %in% names(wos_data_table))) {
        disambr_message("-- parsing EM field", ...)
        ## disambr_wos_tsv_parse_em updates authors_tables$au
        ## so no need to save it
        pbapply::pbmapply(disambr_wos_tsv_parse_em
             , stringi::stri_split_fixed(wos_data_table$EM, "; ")
             , authors_tables$au
             , rp
             , SIMPLIFY = FALSE
             , USE.NAMES = FALSE)
    }
    ## C1
    if(any(c("author_affiliations") %in% list_of_author_fields) &&
       all(c("C1", "AF") %in% names(wos_data_table))) {
        disambr_message("-- parsing C1 field", ...)
        authors_tables$c1 <-
            pbapply::pbmapply(disambr_wos_tsv_parse_c1
                 , wos_data_table$C1
                 , authors_tables$af
                 , SIMPLIFY = FALSE)
    }
    ## RI
    if(any(c("author_researcher_id") %in% list_of_author_fields) &&
       all(c("RI", "AF") %in% names(wos_data_table))) {
        disambr_message("-- parsing RI field", ...)
        authors_tables$ri <-
            pbapply::pbmapply(disambr_wos_tsv_parse_ri
                 , stringi::stri_split_fixed(wos_data_table$RI, "; ")
                 , authors_tables$af
                 , SIMPLIFY = FALSE
                 , USE.NAMES = FALSE)
    }
    ## OI
    if(any(c("author_orcid") %in% list_of_author_fields) &&
       all(c("OI", "AF") %in% names(wos_data_table))) {
        disambr_message("-- parsing OI field", ...)
        authors_tables$oi <-
            pbapply::pbmapply(disambr_wos_tsv_parse_oi
                 , stringi::stri_split_fixed(wos_data_table$OI, "; ")
                 , authors_tables$af
                 , SIMPLIFY = FALSE
                 , USE.NAMES = FALSE)
    }
    ## remove duplicated columns
    disambr_message("-- stacking author fields", ...)
    authors_tables <- 
        lapply(authors_tables, data.table::rbindlist, idcol = "paper_id")
    authors_table <- do.call(cbind, authors_tables)
    authors_table_names <- 
        stringi::stri_replace_first_regex(names(authors_table), "^[^\\.]+\\.", "")
    authors_table_select <- which(!duplicated(authors_table_names))
    authors_table_new_names <- authors_table_names[authors_table_select]
    authors_table <- authors_table[, authors_table_select, with = FALSE]
    ## set names
    data.table::setnames(authors_table, authors_table_new_names)
    ## set author attributes
    disambr_add_set_attr(authors_table
                       , wos_data_table
                       , unit = "person"
                       , reference = "wos_tsv_publications"
                       , type = "similar"
                       , strength = 0.1
                       , name = "wos_tsv_authors")
    #disambr_save_set(authors_table)
    disambr_message_finish(mess = "-- finished -")
    return(authors_table)
}
my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")

if(file.exists(my.file)) {

    dt <- disambr_read_file(my.file)

    expect_equal(
        length(attributes(disambr_make_wos_tsv_authors(dt))$disambr_set_recipe)
      , 5)
}
## test
  my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")

  my.file <- "inst/testdata/wos-tsv-test-recent.txt"
  dt <- disambr_read_file(my.file)

  dt %>% attributes

  disambr_make_wos_tsv_authors(dt) %>% attributes

  ## testing dt merge
  a <- data.table(name = c("a", "b", "c"), order = c(1,2,3))
  b <- data.table(named = c("c", "b", "c"), affil = c("b-adfsa","c-sadfsd"))
  cbind(a, b, check.names = FALSE)

disambr_wos_tsv_parse_cr

##' Parses WoS CR (Cited References) record into separate talbe
##' @param references CR (Cited References) record, i.e., just one row
##' @return references table
##' 
##' @export 
disambr_wos_tsv_parse_cr <- function(references) {
    references_list <- stringi::stri_split_fixed(references, ", ")
    references_list <-
        lapply(references_list, function(ref) {
            first_author_last_name_first_initial <-
                stringi::stri_match_first_regex(ref[1], "([a-z ]*[A-Z][^ ]+)\\s+([A-Z])")
            ref_tail <- ref[-c(1:3)]
            vol <- stringi::stri_extract_first_regex(ref_tail, "(?<=^V)\\d+")
            vol <- vol[!sapply(vol, is.na)][1]
            page <- stringi::stri_extract_first_regex(ref_tail, "(?<=^P)\\d+")
            page <- page[!sapply(page, is.na)][1]
            doi <-
                stringi::stri_match_first_regex(ref_tail
                                              , "10.\\d{4,9}/[-._;()/:A-Za-z0-9]+")
            doi <- doi[!sapply(doi, is.na)][1]
            name_year <- paste(first_author_last_name_first_initial[2]
                             , first_author_last_name_first_initial[3]
                             , ref[2])
            ## combine
            list(first_author_name = ref[1]
               , first_author_last_name = first_author_last_name_first_initial[2]
               , first_author_first_initial = first_author_last_name_first_initial[3]
               , year = ref[2]
               , outlet = ref[3]
               , vol = vol
               , page = page
               , doi = doi
               , name_year = name_year)
        })
    suppressWarnings(data.table::rbindlist(references_list))
}
d <- "Allen C, 2017, ENVIRON SCI-NANO, V4, P741, DOI 10.1039/c7en90014g; Baek YW, 2011, SCI TOTAL ENVIRON, V409, P1603, DOI 10.1016/j.scitotenv.2011.01.014; Baker GL, 2008, TOXICOL SCI, V101, P122, DOI 10.1093/toxsci/kfm243; Bergstrom U, 2015, J TOXICOL ENV HEAL A, V78, P645, DOI 10.1080/15287394.2015.1017682; Bhushan B, 2011, PROG MATER SCI, V56, P1, DOI 10.1016/j.pmatsci.2010.04.003; Biswas P, 2005, J AIR WASTE MANAGE, V55, P708, DOI 10.1080/10473289.2005.10464656; Bitterle E, 2006, CHEMOSPHERE, V65, P1784, DOI 10.1016/j.chemosphere.2006.04.035; Bondarenko O, 2013, ARCH TOXICOL, V87, P1181, DOI 10.1007/s00204-013-1079-4; Bonner J. C., 2003, ENV HLTH PERSPECT, V111, P1289; Brossell D, 2013, J AEROSOL SCI, V63, P75, DOI 10.1016/j.jaerosci.2013.04.012; Clift MJD, 2011, ARCH TOXICOL, V85, P723, DOI 10.1007/s00204-010-0560-6; Cohen J, 2013, NANOTOXICOLOGY, V7, P417, DOI 10.3109/17435390.2012.666576; Cohen JM, 2014, PART FIBRE TOXICOL, V11, DOI 10.1186/1743-8977-11-20; Comouth A, 2013, J AEROSOL SCI, V63, P103, DOI 10.1016/j.jaerosci.2013.04.009"

expect_equal(nrow(disambr_wos_tsv_parse_cr(stringi::stri_split_fixed(d, "; ")[[1]])), 14)
a[[3]]$first_author_name

testing authors harmonization
c("de Sena RC"
, "Viola Roberto"
, "US Environmental Protection Agency"
, "WENG Shi-fu"
, "[Anonymous]"
, "LANGENHEIM JH"
, "van Doosselaere P"
, "USDA"
, "Kosmowska-Ceranowicz Barbara") %>%
## stri_match_first_regex("([a-z ]*[A-Z][^ ]+)\\s+([A-Z])")
stringi::stri_extract_first_regex("[a-z ]*[A-Z][^ ]+")

disambr_make_wos_tsv_references

##' Make references table from WoS tsv data
##' @param wos_data_table WoS data
##' @param ... set verbose here if neededj
##' @return references table
##' 
##' @export 
disambr_make_wos_tsv_references <- function(wos_data_table
                                          , ...) {
    ## disambr_message_start()
    if("CR" %in% names(wos_data_table)) {
        disambr_message("-- parsing references", ...)
        references_list <-
            pbapply::pblapply(stringi::stri_split_fixed(wos_data_table$CR, "; ")
                   , disambr_wos_tsv_parse_cr)
        disambr_message("-- stacking references", ...)
        references_table <-
            data.table::rbindlist(references_list, idcol = "paper_id")
        disambr_message("-- matching DOI citations", ...)
        ## assume unique doi
        doi_match <- match(references_table$doi
                         , wos_data_table$doi
                         , incomparables = NA
                         , nomatch = NA)
        if(any(!is.na(doi_match))){
            references_table[, "doi_cited_id" := wos_data_table$id[doi_match]]
        } else {
            references_table[, "doi_cited_id" := NA]
        }
        ## set references attributes
        disambr_add_set_attr(references_table, wos_data_table
          , reference = "wos_tsv_publications"
          , type = "similar"
          , strength = 0.1
          , name = "wos_tsv_references"
          , collection = "unit_table")
        ## disambr_message_finish()
        return(references_table)
    } else {
        disambr_stop("THERE IS NOT 'CR' FIELD WITH REFERENCES!")
    }
}
my.file <- system.file("testdata", "wos-tsv-test-recent.txt", package = "disambr")

if(file.exists(my.file)) {

    dt <- disambr_read_file(my.file)

    expect_equal(nrow(disambr_make_wos_tsv_references(dt)), 3760)

}
my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
dt <- disambr_read_file(my.file2)

disambr_make_wos_tsv_references(dt) %>% print

dt[[3]]$doi_cited_id %>% is.na %>% not %>% sum

my.dir.small <- '../data/Journals in Mathematical Psychology'
dt <- my.dir.small %>% disambr_read

dt[[3]]
dt[[3]]$doi_cited_id %>% is.na %>% not %>% sum
dt[[3]]$name_cited_id %>% is.na %>% not %>% sum

disambr_make_wos_tsv_author_year_citations

##' Makes citations table by matching first author - year keys. This, of course, can produce false positives matched
##' @param pub_table WoS publications data table
##' @param ref_table WoS references data table
##' @return citation table
##' 
##' @export 
disambr_make_wos_tsv_author_year_citations <- function(pub_table, ref_table) {
    ## filter before maching
    pub_table[, "year" := as.character(PY)]
    pub_table <- pub_table[!is.na(first_author_last_name) &
                           !is.na(first_author_first_initial) &
                           !is.na(year)
                         , .(first_author_last_name
                           , first_author_first_initial
                           , year
                           , cited_id = id)]
    ref_table <- ref_table[!is.na(first_author_last_name) &
                           !is.na(first_author_first_initial) &
                           !is.na(year)
                         , .(first_author_last_name
                           , first_author_first_initial
                           , year
                           , citing_id = paper_id)]
    cit_table <- 
        merge(pub_table, ref_table
            , by = c("first_author_last_name"
                   , "first_author_first_initial"
                   , "year"), allow.cartesian = TRUE)
    cit_table <- cit_table[, .(citing_id, cited_id
                               ## , first_author_last_name
                               ## , first_author_first_initial
                               ## , year
                               )]
    disambr_add_set_attr(cit_table, ref_table
                       , name = "wos_tsv_author_year_citations"
                       , collection = "dyad_table"
                       , reference = "wos_tsv_publications")
    return(cit_table)
}
is.na(c(1,2,NA))

## testing data.table
p <- data.table(c(1,2,3,4)
              , b = TRUE
              , y = c(7,4,6,3)
              , n = c("o", "z", "o", "e")
              , DI = c(11,22,33,44))
p[, id := 1:.N]
r <- data.table(paper_id = c(1,1,2,2,4,4)
              , b = FALSE
              , y = c(2,6,4,8,3,1)
              , n = c("a", "c", "z", "o", "e", "o")
              , doi = c(88,22,99,55,11,55))

p[r
, on = .(y, n)
, .(DI, doi, citation_id = id, paper_id, b, y, n)]


## test function
my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
dt <- disambr_read(my.file2)

my.dir.small <- '../data/Journals in Mathematical Psychology'
dt <- disambr_read(my.dir.small)

dt[[1]]$first_author_last_name %>% sapply(is.na) %>% which

dt[[1]] %>% nrow
dt[[3]] %>% nrow

disambr_make_wos_tsv_author_year_citations(dt[[1]], dt[[3]])

disambr_make_wos_tsv_author_year_citations(dt[[1]], dt[[3]])$first_author_last_name %>% sapply(is.na) %>% sum

## dt[[1]]$AU[3771]
## dt[[1]]$CR[3514]

disambr_eject_citations_doi

this one is not needed as doi is unique and we can add them to refs_table
disambr_eject_citations_doi <- function(pub_table, ref_table) {
    ref_table <- ref_table[!is.na(doi), .(doi, paper_id)]
    pub_table <- pub_table[DI != "", .(DI, paper_id)]
    pub_table[
      , doi := stringi::stri_match_first_regex(DI
                                             , "10.\\d{4,9}/[-._;()/:A-Za-z0-9]+")]
    pub_table <- pub_table[!is.na(doi), .(doi, id)]
    ref_table[

        cited_id := saply  doi
    ]
        pub_table[ref_table
                , on = doi
                , .(citing_id = paper_id
                  , cited_id = id)]
    return(cit_table[!is.na(citing_id)])
}


my.dir.small <- '../data/Journals in Mathematical Psychology'
dt <- my.dir.small %>% disambr_read

dt[[1]]$DI %>% head(1000)

## testing data.table
p <- data.table(c(1,2,3,4)
              , b = TRUE
              , DI = c(11,22,33,44))
p[, id := 1:.N]
r <- data.table(paper_id = c(1,1,2,2,4,4)
              , b = FALSE
              , doi = c(88,22,99,55,11,55))
p[r
, on = .(DI=doi)
, .(DI, doi, citation_id = id, paper_id, b)]


## test function
## my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
## dt <- disambr_read(my.file2)

## disambr_eject_citations(dt[[1]], dt[[3]])

## dt[[1]]$DI[11]
## dt[[1]]$DI[9]
## dt[[1]]$DI[20]

## dt[[1]]$CR[11]
## dt[[1]]$CR[9]

## dt[[1]]$CR[20]

DEPRICATED

## ##' Parses RI (researcher_id) column of WoS saved records export
## ##' @param record_ri a record string from RP column
## ##' @param table_af 
## ##' @return Data.table with columns - author_full_name and author_researcher_id 
## disambr_wos_tsv_parse_ri <- function(record_ri
##                                          , table_af = NULL) {
##     if(isTRUE(record_ri != "")) {
##         authors <- stringi::stri_split_fixed(record_ri, "; ")[[1]]
##         authors_list <- lapply(authors, function(author) {
##             author_split <- stringi::stri_split_fixed(author, "/", n = 2)[[1]]
##             list(author_full_name = author_split[1]
##                , author_researcher_id = author_split[2])
##         })
##         authors_table <- data.table::rbindlist(authors_list)
##     } else {
##         authors_table <- data.table::data.table(author_full_name = character()
##                                               , author_researcher_id = character())
##     }
##     ## merge with table_af if provided
##     if(length(table_af) != 0) {
##         return(authors_table[table_af
##                            , .(author_researcher_id)
##                            , on = "author_full_name"
##                            , roll = TRUE])
##     } else {
##         return(authors_table)
##     }
## }

## rbindlist(list("Girabent, Montserrat/B-8536-2008; Maydeu-Olivares, Alberto/B-5178-2010" %>%
## disambr_wos_tsv_parse_ri
## , "" %>%
## disambr_wos_tsv_parse_ri))


## a <- data.table(a1 = c(1,2,2,3,4), a2 = c(11,22,22,33,44))
## b <- data.table(a1 = c(3,2), b2 = c(333,111))


## a[b, on = "a1", roll = TRUE]

## b[a, on = "a1"]

## NA %>%
## disambr_wos_tsv_parse_ri



## ##' Parses OI (ORCID) column of WoS saved records export
## ##' @param record_oi a record string from OI column
## ##' @return Data.table with columns - author_full_name and author_orcid
## disambr_wos_tsv_parse_oi <- function(record_oi
##                                          , table_af = NULL) {
##     if(isTRUE(record_oi != "")) {
##         authors <- stringi::stri_split_fixed(record_oi, "; ")[[1]]
##         authors_list <- lapply(authors, function(author) {
##             author_split <- stringi::stri_split_fixed(author, "/", n = 2)[[1]]
##             list(author_full_name = author_split[1]
##                , author_orcid = author_split[2])
##         })
##         authors_table <- data.table::rbindlist(authors_list)
##     } else {
##         authors_table <- data.table::data.table(author_full_name = character()
##                                               , author_orcid = character())
##     }
##     ## merge with table_af if provided
##     if(length(table_af) != 0) {
##         return(authors_table[table_af
##                            , on = "author_full_name"
##                            , .(author_orcid)
##                            , roll = TRUE])
##     } else {
##         return(authors_table)
##     }
## }


## "Estrela, Pedro/0000-0001-6956-1146; Maxted, Grace/0000-0002-6816-9107; Rainbow, Joshua/0000-0003-3911-928X; Richtera, Lukas/0000-0002-8288-3999; Moschou, Despina/0000-0001-9175-5852" %>% disambr_wos_tsv_parse_oi

## ----------------------------------------------------------------------------






disambr_aev

disambr_filter_authors_by_researcher_ids

##' Subsets WoS authors table with Researcher IDs
##'
##' This procedure does not alter sets attributes. Just filters WoS author table.
##'
##' @param sets WoS data
##' @param file_path path to Researcher IDs. The default list of Researcher IDs is taken from Tekles & Bornmann (2019) for reproducing their test samples (see reference)
##' @return updated WoS data
##'
##' @references Tekles, A., & Bornmann, L. (2019). Author name disambiguation of bibliometric data: A comparison of several unsupervised approaches. ArXiv:1904.12746. http://arxiv.org/abs/1904.12746

##'
##' @export
disambr_filter_authors_by_researcher_ids <-
    function(sets
           , file_path =
                 system.file("testdata" 
                           , "tekles-bornmann-researcher-ids.txt"
                           , package = "disambr")) {
        force(sets)
        disambr_message_start()
        if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
        ## check if output set is ready
        output_set <- disambr_get_output_set(sets)
        if(!is.null(output_set)) return(sets)
        output_set <- disambr_read_output_set()
        if(!is.null(output_set)) return(c(sets, list(output_set)))
        ## ----------------------------------------------------------------------
        disambr_message("- Tekles, A., & Bornmann, L. (2019) researcher IDs")
        if(file.exists(file_path)) {
            researcher_ids <- readLines(file_path)
        } else {
            disambr_stop("- Can not find file with researcher IDs. Please, provide")
        }
        disambr_message("- coping 'wos_tsv_authors' data set")
        authors_data_set_index <- disambr_in_sets(sets
                                                , recipe = "wos_tsv_authors"
                                                , match_attr_value_parcially = TRUE)
        authors_data_set_index <- which(authors_data_set_index)
        if(length(authors_data_set_index) != 1) {
            disambr_stop("- can not find 'wos_tsv_authors' data set!")
            return(sets)
        }
        authors_data_set <- sets[[authors_data_set_index]]
        ## ======================================================================
        disambr_message("- filtering authors by Researcher ID")
        authors_data_set_new <- authors_data_set[author_researcher_id %in% researcher_ids]
        ## ======================================================================
        mostattributes(authors_data_set_new) <- attributes(authors_data_set)
        disambr_set_attr(authors_data_set_new, name = "tekles_bornmann")
        ## disambr_save_set(authors_data_set_new)
        sets[[authors_data_set_index]] <- authors_data_set_new
        disambr_message_finish()
        return(sets)
    }
options(disambr_mess_pretty = TRUE)

dt <- readRDS(file = "../data/wos-slow-export-subset.rds")

attributes(dt[[2]])$disambr_set_recipe[[1]][[1]]

dt %>% sapply(length)

dt.test <- 
    dt %>% disambr_set_tekles_bornmann

## dt %>% length
## dt[[3]] %>% attributes


## after disambr_set_tekles_bornmann
## dt %>% length
## nrow(dt[[3]])
## 25868 vs 834090

disambr_split_authors_if_on_the_same_paper

##' Makes sets of co-authors assuming that all authors on paper are different person.
##' @param sets Sests
##' @return Updated sets
##' 
##' @export 
disambr_split_authors_if_on_the_same_paper <- function(sets) {
    force(sets)
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    ## ======================================================================
    disambr_message("- spliting co-authors")
    output_set <- split(1:nrow(author_data_set)
                      , author_data_set$paper_id)
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , type = "different"
                       , strength = 1
                       , collection = "list_of_lists"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}

disambr_merge_authors_if_not_on_the_same_paper

## too slow... and eats all ram
disambr_set_not_on_same_paper <- function(sets
                                        , verbose = FALSE
                                        , save_set_as = TRUE
                                        , data_set_name =
                                              "wos_records_tsv_export_author_table") {
    force(sets)
    if(verbose) message("disambr: Starting disambr_set_not_on_same_paper...")
    if(!is.list(sets)) stop("disambr: 'sets' parameter should be list!")
    data_set <-
        disambr_subsets(sets
                      , list(disambr_set_name = data_set_name)
                      , which_to_return = "first")
    if(verbose) message("- spliting co-authors")
    return_sets <- data_set %>% {split(1:nrow(.), .$paper_id)}
    if(verbose) message("- making combinations of co-authors sets..")
    if(length(return_sets) > 50000) stop("--- THE NUMBER OF COMBINATIONS IS TO HIGH!")
    return_sets_comb <-
        combn(length(return_sets), 2, simplify = FALSE)
    if(verbose) message("--- made ,", length(return_sets_comb), " combinations")
    if(verbose) message("- expanding combinations")
    return_sets <-
        pbapply::pblapply(return_sets_comb
                        , function(comb) {
                            expand.grid(author_id1 = return_sets[[comb[1]]]
                                      , author_id2 = return_sets[[comb[2]]])
                        })
    if(verbose) message("- rbinding combinations..")
    return_sets <- data.table::rbindlist(return_sets)
    if(verbose) message("--- rbinded into ", nrow(return_sets), " rows")
    ## set set's attributes
    data_set_recipe <- attr(data_set, "disambr_recipe")
    disambr_setattr(return_sets
                  , disambr_entity = "person"
                  , disambr_set_type = "similar"
                  , disambr_set_coefficient = 0.5
                  , disambr_set_name = "not_on_same_paper"
                  , disambr_set_collection = "dyads_table"
                  , disambr_entity_id_reference =
                        "wos_records_tsv_export_author_table"
                  , disambr_recipe = c(list("disambr_set_not_on_same_paper")
                                     , data_set_recipe))
    if(length(save_set_as) != 0)
        return(c(sets, list(return_sets)))
}

disambr_merge_authors_with_similar_initials

##' Makes set of similar authors based on their initials
##' @param sets Sets
##' @param maxDist see max_dist in `match_fuzzy`
##' @return Sets with new appended
##' @export 
disambr_merge_authors_with_similar_initials <- function(sets
                                       , maxDist = 1) {
    force(sets)
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    initials_data_set <- toupper(author_data_set$author_initials)
    ## use only first 2 initials
    initials_data_set <- stringi::stri_sub(initials_data_set, to = 2)
    input_set <- disambr_get_last_set(sets)
    ## ======================================================================
    disambr_message("- fuzzy matching initials")
    ## assume all authors will be used in the table
    initials_bank <- unique(initials_data_set)
    ## lets leave NAs
    initials_bank <- sort(initials_bank, na.last = TRUE)
    initials_match <- 
        lapply(initials_bank
             , function(ini) {
                 matched_initials <- 
                     stringdist::ain(initials_bank
                                   , ini
                                   , maxDist = maxDist
                                   , method = "lv"
                                   , matchNA = FALSE)
                 if(any(matched_initials)) {
                     matched_initials <- initials_bank[matched_initials]
                     data.table::data.table(
                                     author_initials_1 = ini
                                   , author_initials_2 = matched_initials)
                 } else NULL
             })
    initials_match <- data.table::rbindlist(initials_match)
    ## this is the case if we apply this procedure first
    if(attr(input_set, "disambr_set_type") == "different") {
        input_set_l <- length(input_set)
        disambr_message(paste("- doing combinations on", input_set_l))
        ## try cluster
        ## cl <- parallel::makeCluster(20,type="SOCK")
        output_set <- 
            pbapply::pblapply(1:(input_set_l - 1), function(i) {
                ## combn using is data.table method
                comb <-
                    data.table::CJ(author_id1 = input_set[[i]]
                                 , author_id2 = unlist(input_set[(i+1):input_set_l])
                                 , sorted = FALSE)
                ## add initials_bank
                comb[, `:=`(
                    author_initials_1 = initials_data_set[author_id1]
                  , author_initials_2 = initials_data_set[author_id2]
                )]
                ## alternative to the above
                ## data.table::set(comb, , c("author_initials_1"
                                        ## , "author_initials_2")
                              ## , list(initials_data_set[comb$author_id1]
                                   ## , initials_data_set[comb$author_id2]))
                ## check matches
                comb <-
                    data.tables::merge(comb
                                     , initials_match
                                     , by = c("author_initials_1"
                                            , "author_initials_2"))
                return(comb[,.(author_id1, author_id2)])
            })
        ##parallel::stopCluster(cl = cl)
        disambr_message("- rbinding dyads")
        output_set <- data.table::rbindlist(output_set)
        ## other case is when follow matching last names procedure
    } else if(attr(input_set, "disambr_set_type") == "similar") {
        output_set <- input_set
        ## add names
        output_set[, `:=`(
            author_initials_1 = initials_data_set[author_id1]
          , author_initials_2 = initials_data_set[author_id2]
        )]
        ## check matches
        output_set <- merge(output_set, initials_match
                          , by = c("author_initials_1", "author_initials_2"))
        output_set <- output_set[,.(author_id1, author_id2)]
        ## } else if(attr(input_set, "disambr_set_type") == "similar") {
        ##     disambr_message("- subsetting first two initials_bank")
        ##     initials_bank1 <-
        ##         stri_sub(author_data_set$author_initials_bank[input_set$author_id1], to = 2)
        ##     initials_bank2 <-
        ##         stri_sub(author_data_set$author_initials_bank[input_set$author_id2], to = 2)
        ##     disambr_message("- calculating distance b/w initials_bank")
        ##     dist <- stringdist(initials_bank1
        ##                      , initials_bank2
        ##                      , method = "lv")
        ##     output_set <- input_set[dist < 2]
    } else disambr_stop("- UNKNOWN INPUT_SET_NAME!")
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 0.5
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}


## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:100]
## mostattributes(dt[[4]]) <- dt_atributes


## dt %>%
## disambr_set_similar_last_names(verbose = TRUE) %>% 
## disambr_set_similar_initials_bank(verbose = TRUE)

disambr_merge_authors_with_similar_last_names

##' Makes set of similar authors based on their last names
##' @param sets Sets
##' @param max_dist see max_dist in `match_fuzzy`
##' @param max_dist_short max_dist for short last names, default 0
##' @param min_length who is short names defined, default 0 which means do not condider short names
##' @return Sets with new appended
##' @export 
disambr_merge_authors_with_similar_last_names <- function(sets
                                         , max_dist = 1
                                         , max_dist_short = 0
                                         , min_length = 0) {
    force(sets)
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    last_name_data_set <- toupper(author_data_set$author_last_name)
    input_set <- disambr_get_last_set(sets)
    input_set_l <- length(input_set)
    ## ======================================================================
    disambr_message("- fuzzy matching last names")
    ## assume all authors will be used in the table
    last_names_bank <- unique(last_name_data_set)
    ## lets leave NAs
    last_names_bank <- sort(last_names_bank, na.last = TRUE)
    ## treshholds
    last_names_bank_short <- nchar(last_names_bank) <= min_length
    last_names_match <- match_fuzzy(last_names_bank[!last_names_bank_short]
                                  , method = "dl"
                                  , max_dist = max_dist
                                  , id_name = "author_last_name")
    if(any(last_names_bank_short)) {
        last_names_match <- rbind(
            last_names_match
          , match_fuzzy(last_names_bank[last_names_bank_short]
                      , method = "dl"
                      , max_dist = max_dist_short
                      , id_name = "author_last_name"))
    }
    ## ----------------------------------------------------------------------
    if(attr(input_set, "disambr_set_type") == "different") {
        disambr_message(paste("- doing combinations on", input_set_l))
        output_set <- 
            pbapply::pblapply(1:(input_set_l-1), function(i) {

                ## this is data.table method
                combs <-
                    data.table::CJ(author_id1 = input_set[[i]]
                                 , author_id2 =
                                        unlist(input_set[(i + 1) : input_set_l])
                                 , sorted = FALSE)
                ## add names
                combs[, `:=`(
                    author_last_name_1 = last_name_data_set[author_id1]
                  , author_last_name_2 = last_name_data_set[author_id2]
                )]
                ## check matches
                combs <- 
                    merge(combs, last_names_match
                        , by = c("author_last_name_1", "author_last_name_2"))
                return(combs[,.(author_id1, author_id2)])
            })
        disambr_message("- rbinding dyads")
        output_set <- data.table::rbindlist(output_set)
    } else if(attr(input_set, "disambr_set_type") == "similar") {
        disambr_message(paste("- fuzzy matching authors by last name"))
        output_set <- input_set
        ## add names
        output_set[, `:=`(
            author_last_name_1 = last_name_data_set[author_id1]
          , author_last_name_2 = last_name_data_set[author_id2]
        )]
        ## check matches
        output_set <- merge(output_set, last_names_match
                          , by = c("author_last_name_1", "author_last_name_2"))
        output_set <- output_set[,.(author_id1, author_id2)]
    } else disambr_stop("- UNKNOWN INPUT_SET_NAME!")
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 0.5
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}

tests

dt %>% length

dt[[2]]$author_initials %>% stri_sub(to = 2) %>% unique %>% length
## 562
dt[[2]]$author_initials %>% length
## 24714

24714^2
8818^2

dt[[2]]$author_last_name %>% length
## 24714
dt[[2]]$author_last_name %>% toupper %>% unique %>% length
## 8818


stringdist::amatch(c("sdfasd", NA, "sei;n;h", "sdfaij'[ksadf", "dpoj'sdf")
                 , c("sdfas", NA)
                 , maxDist = 2
                 , method = "dl"
                 , matchNA = FALSE
                 , nomatch = NA) %>% extract(!is.na(.))



sort(c("sdfasd", NA, "sei;n;h", "sdfaij'[ksadf", "dpoj'sdf"), na.last = TRUE)


maxDist
amatch


## full
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE) %>% 
## disambr_set_similar_last_names(verbose = TRUE)

## partial
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)

disambr_read("../data/new_export/savedrecs-ms-recent.txt") %>% 
    disambr_set_on_same_paper() %>%
    disambr_set_similar_last_names(max_dist = 1
                                 , max_dist_short = 1
                                 , min_length = 2) %>%
    extract2(6)

## dt_atributes <- attributes(dt[[4]])

## dt[[4]] <- dt[[4]][1:100]

## mostattributes(dt[[4]]) <- dt_atributes

## dt[[4]] %>% attributes

## dt %>% disambr_set_similar_last_names(verbose = TRUE)


rbindlist(list(list(1,2,3), list(2,3,4), NULL))

disambr_merge_authors_with_same_emails

##' Makes set of matched authors bases on same email addresses
##' @param sets Sets
##' @return Sets with new attached
##' @export 
disambr_merge_authors_with_same_emails <- function(sets) {
    force(sets)
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    ## emails case insensitive
    email_data_set <- toupper(author_data_set$author_email)
    input_set <- disambr_get_last_unstrong_set(sets)
    ## ======================================================================
    disambr_message("- checking emails")
    output_set <-
        email_data_set[input_set$author_id1] == 
        email_data_set[input_set$author_id2]
    output_set <- input_set[sapply(output_set, isTRUE)]
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 1
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}


## ## full
## dt <- readRDS(file = "my.dir.wos.rds") %>%
##     disambr_set_tekles_bornmann(verbose = TRUE) %>%
##     disambr_set_on_same_paper(verbose = TRUE) %>% 
##     disambr_set_similar_last_names(verbose = TRUE)

## partial
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)

## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:1000]
## mostattributes(dt[[4]]) <- dt_atributes

## dt <- 
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE) %>%
## disambr_set_similar_initials(verbose = TRUE) %>% 
## disambr_set_same_email(verbose = TRUE)

## dt[[7]]

## dt[[3]][26]$author_email
## dt[[3]][90]$author_email

## dt[[3]][2]$author_email
## dt[[3]][264]$author_email
## dt[[3]][406]$author_email

disambr_merge_authors_with_same_affiliation

##' Makes set of matched authors based on same affiliation
##' @param sets Sets
##' @return Sets with new attached
##' @export 
disambr_merge_authors_with_same_affiliation <- function(sets) {
    force(sets)
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    input_set <- disambr_get_last_unstrong_set(sets)
    ## ======================================================================
    disambr_message("- checking overlapping affiliations")
    affiliations1 <- author_data_set$affiliations[input_set$author_id1]
    affiliations1 <- lapply(affiliations1, toupper)
    ## affiliations1 <- ifelse(is.na(affiliations1), NULL, affiliations1)
    affiliations2 <- author_data_set$affiliations[input_set$author_id2]
    affiliations2 <- lapply(affiliations2, toupper)
    ## affiliations2 <- ifelse(is.na(affiliations2), NULL, affiliations2)
    affiliations_match <-
        mapply(function(a1, a2) {
            any(match(a1, a2, incomparables = NA, nomatch = 0) > 0)
        }
      , affiliations1
      , affiliations2)
    output_set <- input_set[affiliations_match]
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 1
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}

## full
## dt <- readRDS(file = "my.dir.wos.rds") %>%
##     disambr_set_tekles_bornmann(verbose = TRUE) %>%
##     disambr_set_on_same_paper(verbose = TRUE) %>% 
##     disambr_set_similar_last_names(verbose = TRUE)

## partial
## dt <- readRDS(file = "my.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)

## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:100]
## mostattributes(dt[[4]]) <- dt_atributes

## dt.new <- 
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE) %>%
## disambr_set_same_affiliation(verbose = TRUE, input_set_name = "similar_last_names")

## dt.new[[6]] %>% head

## dt[[3]][96]$affiliations
## dt[[3]][31]$affiliations

## dt[[3]][2]$author_email
## dt[[3]][264]$author_email
## dt[[3]][406]$author_email

disambr_merge_authors_if_citing_others_papers

##' Make a set of matched authors bases on the cases when one author cites the others paper.
##' @param sets Sets
##' @param match_refrerences_by_name_year Whether to check citations based on first author name and year pair in addition to machich citations based on DOI
##' @return 
##' @export 
disambr_merge_authors_if_citing_others_papers <- function(sets
                                        , match_refrerences_by_name_year = TRUE) {
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    reference_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_references")
    citations_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_author_year_citations")
    input_set <- disambr_get_last_unstrong_set(sets)
      ## ======================================================================
    disambr_message("- checking if author sites other author's paper")
    ## TODO: Add papers that were already matched previously
    ## get paper ids
    input_set[, `:=`(
        paper_ids_1 = author_data_set$paper_id[author_id1]
      , paper_ids_2 = author_data_set$paper_id[author_id2])]
    match_list <- list()
    match_list$doi_1 <-
        merge(input_set, reference_data_set,
            , by.x = c("paper_ids_1", "paper_ids_2")
            , by.y = c("paper_id", "doi_cited_id"))[, .(author_id1, author_id2)]
    match_list$doi_2 <-
        merge(input_set, reference_data_set,
            , by.x = c("paper_ids_2", "paper_ids_1")
            , by.y = c("paper_id", "doi_cited_id"))[, .(author_id1, author_id2)]
    if(match_refrerences_by_name_year) {
        match_list$name_1 <-
            merge(input_set, citations_data_set,
                , by.x = c("paper_ids_1", "paper_ids_2")
                , by.y = c("citing_id", "cited_id"))[, .(author_id1, author_id2)]
        match_list$name_2 <-
            merge(input_set, citations_data_set,
                , by.x = c("paper_ids_2", "paper_ids_1")
                , by.y = c("citing_id", "cited_id"))[, .(author_id1, author_id2)]
    }
    output_set <- data.table::rbindlist(match_list)
    output_set <- unique(output_set)
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 1
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}


## partial
## dt <- readRDS(file = "my.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)

## dt.test <- 
    ## dt %>%
    ## disambr_set_on_same_paper %>% 
    ## disambr_set_similar_last_names


## dt.test %>% disambr_set_cite_others_paper %>% extract2(7)

## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:2000]
## mostattributes(dt[[4]]) <- dt_atributes

## dt.short <-
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE)
## dt.short.test <-
## dt.short %>%
## disambr_set_cite_others_paper
## dt.short.test[[5]] %>% sum





## https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
## https://www.crossref.org/blog/dois-and-matching-regular-expressions/
## "/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i"


## testing data.table
## p <- data.table(id = c(1,2,3,4)
## , b = TRUE
## , y = c(7,4,6,3)
## , n = c("o", "z", "o", "e")
## , DI = c(11,22,33,44))

## r <- data.table(id = c(6)
## , b = FALSE
## , y = c(2,6,4,8,3,1)
## , n = c("a", "c", "z", "o", "e", "o")
## , doi = c(88,22,99,55,11,55))

## merge(p,r,by = "id")


## p[r
## , on = .(y, n)
## , .(DI, doi, , paper_id, b, y, n)]

disambr_merge_authors_if_citing_others_papers

##' Make a set of matched authors bases on the cases when one author cites the others paper.
##' @param sets Sets
##' @param match_refrerences_by_name_year Whether to check citations based on first author name and year pair in addition to machich citations based on DOI
##' @return 
##' @export 
disambr_merge_authors_if_citing_others_papers <- function(sets
                                        , match_refrerences_by_name_year = TRUE) {
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    reference_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_references")
    citations_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_author_year_citations")
    input_set <- disambr_get_last_unstrong_set(sets)
      ## ======================================================================
    disambr_message("- checking if author sites other author's paper")
    ## TODO: Add papers that were already matched previously
    ## get paper ids
    input_set[, `:=`(
        paper_ids_1 = author_data_set$paper_id[author_id1]
      , paper_ids_2 = author_data_set$paper_id[author_id2])]
    match_list <- list()
    match_list$doi_1 <-
        merge(input_set, reference_data_set,
            , by.x = c("paper_ids_1", "paper_ids_2")
            , by.y = c("paper_id", "doi_cited_id"))[, .(author_id1, author_id2)]
    match_list$doi_2 <-
        merge(input_set, reference_data_set,
            , by.x = c("paper_ids_2", "paper_ids_1")
            , by.y = c("paper_id", "doi_cited_id"))[, .(author_id1, author_id2)]
    if(match_refrerences_by_name_year) {
        match_list$name_1 <-
            merge(input_set, citations_data_set,
                , by.x = c("paper_ids_1", "paper_ids_2")
                , by.y = c("citing_id", "cited_id"))[, .(author_id1, author_id2)]
        match_list$name_2 <-
            merge(input_set, citations_data_set,
                , by.x = c("paper_ids_2", "paper_ids_1")
                , by.y = c("citing_id", "cited_id"))[, .(author_id1, author_id2)]
    }
    output_set <- data.table::rbindlist(match_list)
    output_set <- unique(output_set)
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 1
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}


## partial
## dt <- readRDS(file = "my.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)

## dt.test <- 
    ## dt %>%
    ## disambr_set_on_same_paper %>% 
    ## disambr_set_similar_last_names


## dt.test %>% disambr_set_cite_others_paper %>% extract2(7)

## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:2000]
## mostattributes(dt[[4]]) <- dt_atributes

## dt.short <-
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE)
## dt.short.test <-
## dt.short %>%
## disambr_set_cite_others_paper
## dt.short.test[[5]] %>% sum





## https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
## https://www.crossref.org/blog/dois-and-matching-regular-expressions/
## "/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i"


## testing data.table
## p <- data.table(id = c(1,2,3,4)
## , b = TRUE
## , y = c(7,4,6,3)
## , n = c("o", "z", "o", "e")
## , DI = c(11,22,33,44))

## r <- data.table(id = c(6)
## , b = FALSE
## , y = c(2,6,4,8,3,1)
## , n = c("a", "c", "z", "o", "e", "o")
## , doi = c(88,22,99,55,11,55))

## merge(p,r,by = "id")


## p[r
## , on = .(y, n)
## , .(DI, doi, , paper_id, b, y, n)]

disambr_match_authors_if_sharing_coauthors

##' Make a set of matched authors that share co-authors
##'
##' - First we index dyads of papers for matched authors by indexing elements in an upper triangle in square paper by paper matrix (see `get_upper_triangle_index` function)
##' - Then we group and count matched author dyads that are associated with the same paper dyad index. If there are no duplicates in authors ids then it would be the number of co-shared co-authors but there is an issue when we try to match same author name to several author names on the other paper (next steps meant to fix this issue)
##' - Within these groups of same paper dyads we count same authors ids on each paper to access the number of open triads (when the same author is matched to two different authors from the same paper) for every author dyad in a group (Nid1 + Nid2 - 2)
##' The algorithm for matching authors based on shared co-authors is the following:
##' - Finally, we filter matched author dyads based on the difference between number of paper dyads and number of open triangles for authors (records_per_paper - open_triangles > 1). Also see `min_number_of_shared_coauthors`
##' 
##' @param sets Sets of matched author names dyads
##' @param min_number_of_shared_coauthors Minimum number of co-authors that should be shared in order for author names to be cosidered as matched/merged
##' @return Original sets with table of matched author dyads appended to it
##' @export
##' @md
disambr_match_authors_if_sharing_coauthors <-
    function(sets
           , min_number_of_shared_coauthors = 1) {
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    paper_id_max <- max(author_data_set$paper_id)
    input_set <- disambr_get_last_unstrong_set(sets)
      ## ======================================================================
    disambr_message("- filtering cases with shared co-authors")
    ## get paper ids
    output_set <- 
        input_set[
          , paper_dyad_ids :=
                mapply(get_upper_triangle_index
                     , author_data_set$paper_id[author_id1]
                     , author_data_set$paper_id[author_id2]
                     , paper_id_max)
        ][
          , `:=`(records_per_paper = .N
               , open_triangles =
                     as.vector(table(author_id1)[as.character(author_id1)]) +
                     as.vector(table(author_id2)[as.character(author_id2)]) - 2)
          , keyby = paper_dyad_ids
        ][records_per_paper - open_triangles > min_number_of_shared_coauthors
        , .(author_id1, author_id2)
          ]
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 1
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}



## TESTS
## data.table(
##     a1 = c(1,1,1, 8,10, 200, 400)
##   , a2 = c(5,2,3, 2, 50 , 300, 500)
##   , p1 = c(1,1,10,1,3,4,5)
##   , p2 = c(10,10,1,10,8,5,4))[
##   , paper_dyad_ids :=
##         mapply(get_upper_triangle_index
##              , p1
##              , p2
##              , max(c(p1, p2)))
## ][
##   , `:=`(records_per_paper = .N
##        , open_triangles =
##              as.vector(table(a1)[as.character(a1)]) +
##              as.vector(table(a2)[as.character(a2)]) - 2)
##   , keyby = paper_dyad_ids
## ][, criteria := records_per_paper - open_triangles - 1][]

disambr_merge_authors_with_common_references

##' Make set of authors that have number of references in common
##' @param sets Sets
##' @param references_in_common number of references in common
##' @return Sets with new set
##' @export 
disambr_merge_authors_with_common_references <- function(sets
                                        , references_in_common = 3) {
    force(sets)
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    reference_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_references")
    citation_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_author_year_citations")
    input_set <- disambr_get_last_unstrong_set(sets)
    ## ======================================================================
    disambr_message("- checking references in common")
    ## TODO: Add papers that were already matched previously
    input_set[, `:=`(
        paper_ids_1 = author_data_set$paper_id[author_id1]
      , paper_ids_2 = author_data_set$paper_id[author_id2]
    )]
    paper_ids_set <- unique(input_set[,.(paper_ids_1, paper_ids_2)])
    ## for blade
    if(.Platform$OS.type == "windows") {
        cl <-
            parallel::makePSOCKcluster(
                          round(parallel::detectCores() * .70))
        output_set <-
            parallel::parLapply(
                          cl, 1:nrow(paper_ids_set)
                        , function(i) {
                            id1 <- paper_ids_set$paper_ids_1[i]
                            id2 <- paper_ids_set$paper_ids_2[i]
                            common_refs <- 
                                match(reference_data_set[paper_id == id1, c(doi_cited_id)]
                                    , reference_data_set[paper_id == id2, c(doi_cited_id)]
                                    , nomatch = 0
                                    , incomparables = NA)
                            common_refs <- sum(common_refs > 0)
                            if(common_refs < references_in_common) {
                                name_common_refs <- 
                                    match(citation_data_set[citing_id == id1, c(cited_id)]
                                        , citation_data_set[citing_id == id2, c(cited_id)]
                                        , nomatch = 0
                                        , incomparables = NA)
                                name_common_refs <- sum(name_common_refs > 0)
                                common_refs <- common_refs + name_common_refs
                                if(common_refs < references_in_common) {
                                    return(FALSE)
                                } else {
                                    return(TRUE)
                                }
                            } else return(TRUE)
                        }
                      )
        parallel::stopCluster(cl)
    } else {
        output_set <- 
            pbmapply(function(id1, id2) {
                common_refs <- 
                    match(reference_data_set[paper_id == id1, c(doi_cited_id)]
                        , reference_data_set[paper_id == id2, c(doi_cited_id)]
                        , nomatch = 0
                        , incomparables = NA)
                common_refs <- sum(common_refs > 0)
                if(common_refs < references_in_common) {
                    name_common_refs <- 
                        match(citation_data_set[citing_id == id1, c(cited_id)]
                            , citation_data_set[citing_id == id2, c(cited_id)]
                            , nomatch = 0
                            , incomparables = NA)
                    name_common_refs <- sum(name_common_refs > 0)
                    common_refs <- common_refs + name_common_refs
                    if(common_refs < references_in_common) return(FALSE) else return(TRUE)
                } else return(TRUE)
            }
          , paper_ids_set$paper_ids_1
          , paper_ids_set$paper_ids_2)
    }
    output_set <- paper_ids_set[output_set]
    output_set <- merge(output_set, input_set, by = c("paper_ids_1", "paper_ids_2"))
    output_set <- output_set[,.(author_id1, author_id2)]
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 1
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}


## partial
## dt <- readRDS(file = "my.wos.rds") %>%
##     disambr_set_tekles_bornmann(verbose = TRUE) %>%
##     disambr_set_on_same_paper(verbose = TRUE)

## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:2000]
## mostattributes(dt[[4]]) <- dt_atributes

## dt.test %>% disambr_set_common_references %>% extract2(7)

## dt.short <-
##     dt %>%
##     disambr_set_similar_last_names(verbose = TRUE)
## dt.short.test <-
##     dt.short %>%
##     disambr_set_common_references
## dt.short.test[[5]] %>% sum





## https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
## https://www.crossref.org/blog/dois-and-matching-regular-expressions/
## "/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i"

## a <- data.table(a = c(1,2,3,4), b = c(11,22,33,44))

## a[a %in% c(2,3), c(b)]

## c(NA,NA,1) %in% c(32,3,1,3, NA)

disambr_merge_authors_if_citing_self_citation

##' Make a set of matched authors based on cases when one author cites others self citation. Self-citations here are detected based on DOI.
##' @param sets Sets
##' @return Sets with new appended
##' @export 
disambr_merge_authors_if_citing_self_citation <- function(sets) {
    force(sets)
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    reference_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_references")
    strong_set <- disambr_get_strong_set(sets)
    input_set <- disambr_get_last_unstrong_set(sets)
    ## ======================================================================
    ## TODO: check named citations
    disambr_message("- checking if author cites a self-citation of other")
    check_self_citations <- function(id1, id2) {
        own_papers <- 
            author_data_set$paper_id[
                                unique(strong_set[author_id1 %in% id1 |
                                                  author_id2 %in% id1
                                                , c(author_id1, author_id2)])]
        ## mach own papers to own citations
        self_citations <- 
            match(own_papers
                , reference_data_set[paper_id %in% own_papers, c(doi_cited_id)]
                , nomatch = 0
                , incomparables = NA) > 0
        self_citations <- own_papers[self_citations]
        cite_self_citations <-
            match(reference_data_set[paper_id == id2, c(doi_cited_id)]
                , own_papers
                , nomatch = 0
                , incomparables = NA) > 0
        return(any(cite_self_citations))
    }
    ## blade option
    if(.Platform$OS.type == "windows") {
        cl <- parallel::makePSOCKcluster(round(parallel::detectCores() * .70))
        output_set <-
            parallel::parLapply(
                          cl, 1:nrow(input_set)
                        , function(i){
                            id1 <- input_set$author_id1[i]
                            id2 <- input_set$author_id2[i]
                            if(check_self_citations(id1, id2)) {
                                return(TRUE)
                            } else if(check_self_citations(id2, id1)) {
                                return(TRUE)
                            } else {
                                return(FALSE)
                            }
                        })
        parallel::stopCluster(cl)
    } else {
        output_set <- 
            pbmapply(function(id1, id2) {
                if(check_self_citations(id1, id2)) {
                    return(TRUE)
                } else if(check_self_citations(id2, id1)) {
                    return(TRUE)
                } else {
                    return(FALSE)
                }
            }
          , input_set$author_id1
          , input_set$author_id2)
    }
    output_set <- input_set[output_set]
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 1
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}


## partial
## dt <- readRDS(file = "my.wos.rds") %>%
##     disambr_set_tekles_bornmann(verbose = TRUE) %>%
##     disambr_set_on_same_paper(verbose = TRUE)

## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:2000]
## mostattributes(dt[[4]]) <- dt_atributes

## dt.test.plus <- 
## dt.test %>%
## disambr_set_common_references


## dt.test.plus[[7]]

## dt.test.plus %>% 
## disambr_set_cite_self_citation %>% extract2(8)


## 7 out of 416 pairs matched
## dt.test.plus[[2]]$author_name[c(923
## , 2353         
## , 2403
## , 2464
## , 2525
## , 4234
## , 4420
## , 7675)]

## not very accurate
## [1] "COHEN, AS" "COHEN, AS" "COHEN, AS" "COHEN, AS" "COHEN, AS" "Yap, MJ"  
## [7] "Mayes, AR" "Burns, GN"





## dt.short <-
##     dt %>%
##     disambr_set_similar_last_names(verbose = TRUE)
## dt.short.test <-
##     dt.short %>%
##     disambr_set_cite_self_citation
## dt.short.test[[5]] %>% sum





## https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
## https://www.crossref.org/blog/dois-and-matching-regular-expressions/
## "/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i"

## a <- data.table(a = c(1,2,3,4), b = c(11,22,33,44))

## a[a %in% c(2,3), c(b)]

## c(NA,NA,1) %in% c(32,3,1,3, NA)

disambr_merge_authors_with_common_keywords

##' Makes set of authors with number of keywords in their papers in common
##' @param sets Sets
##' @param keywords_in_common number of keywords in common
##' @return Sets with new set appended
##' 
##' @export 
disambr_merge_authors_with_common_keywords <- function(sets
                                      , keywords_in_common = 3) {
    force(sets)
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    publication_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_publications")
    input_set <- disambr_get_last_unstrong_set(sets)
    ## ======================================================================
    disambr_message("- checking common keywords (Author Keywords)")
    keywords_1 <- 
        publication_data_set[author_data_set[input_set$author_id1, c(paper_id)], c(DE)]
    keywords_1 <- stringi::stri_split_fixed(keywords_1, "; ")
    keywords_2 <- 
        publication_data_set[author_data_set[input_set$author_id2, c(paper_id)], c(DE)]
    keywords_2 <- stringi::stri_split_fixed(keywords_2, "; ")
    keywords_matched <- 
        pbmapply(function(k1, k2) {
            sum(match(k1, k2, incomparables = c(NA, ""), nomatch = 0) > 0)
        }
      , keywords_1
      , keywords_2)
    output_set <- input_set[keywords_matched >= keywords_in_common]
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 1
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}




## ## full
## dt <- readRDS(file = "my.dir.wos.rds") %>%
##     disambr_set_tekles_bornmann(verbose = TRUE) %>%
##     disambr_set_on_same_paper(verbose = TRUE) %>% 
##     disambr_set_similar_last_names(verbose = TRUE)

## partial
## dt <- readRDS(file = "my.dir.wos.rds") %>%
## disambr_set_tekles_bornmann(verbose = TRUE) %>%
## disambr_set_on_same_paper(verbose = TRUE)

## dt_atributes <- attributes(dt[[4]])
## dt[[4]] <- dt[[4]][1:1000]
## mostattributes(dt[[4]]) <- dt_atributes

## dt.test %>% disambr_set_common_keywords %>% extract2(7)

## dt.test[[1]]$DE[c(1113
## , 8390
## , 8438                  
## , 8455)]

## dt <- 
## dt %>%
## disambr_set_similar_last_names(verbose = TRUE) %>%
## disambr_set_similar_initials(verbose = TRUE) %>% 
## disambr_set_common_keywords(verbose = TRUE)

## dt[[7]]

## dt[[3]][26]$author_email
## dt[[3]][90]$author_email

## dt[[3]][2]$author_email
## dt[[3]][264]$author_email
## dt[[3]][406]$author_email

disambr_merge_authors_with_same_researcher_ids

##' For testing. Makes so called ground truth set based on same research IDs
##' @param sets Sets
##' @return Sets with new set appended
##' 
##' @export 
disambr_merge_authors_with_same_researcher_ids <- function(sets) {
    force(sets)
    disambr_message_start()
    if(!is.list(sets)) disambr_stop("- 'sets' parameter should be list!")
    ## check if output set is ready
    output_set <- disambr_get_output_set(sets)
    if(!is.null(output_set)) return(sets)
    output_set <- disambr_read_output_set()
    if(!is.null(output_set)) return(c(sets, list(output_set)))
    ## ----------------------------------------------------------------------
    author_data_set <-
        disambr_get_first_data_set(sets, recipe = "wos_tsv_authors")
    ri_data_set <- author_data_set$author_researcher_id
    ri_bank <- unique(unlist(ri_data_set))
    ri_bank <- sort(ri_bank) # removes NAs
    ## ======================================================================
    disambr_message("- expanding grid and cheching researcher IDs")
    ## this is fast combn
    combi <- function(vect)
    {
        l <- length(vect)
        if(l == 1) return()
        first <- rep(vect, (l-1):0)
        vectR <- rev(vect)
        second <- vectR[rev(sequence(1:(l-1)))]
        combi <- data.table(first, second)
        return(combi)
    }
    output_set <-
        pbapply::pblapply(ri_bank, function(ri) {
            same_ri <- sapply(ri_data_set, function(i) ri %in% i)
            same_ri <- which(same_ri)
            combi(same_ri)
        })
    output_set <- data.table::rbindlist(output_set)
    ## ======================================================================
    disambr_add_set_attr(output_set, author_data_set
                       , strength = 10
                       , collection = "dyad_table"
                       , reference = "wos_tsv_authors")
    disambr_save_set(output_set)
    disambr_message_finish()
    return(c(sets, list(output_set)))
}
data.table::CJ(id1 = c(1,2,3)
             , id2 = c(1,2,3)
             , sorted = FALSE
             , unique = TRUE)

combi <- function(vect)
    {
        l <- length(vect)
        first <- rep(vect, (l-1):0)
        vectR <- rev(vect)
        second <- vectR[rev(sequence(1:(l-1)))]
        combi <- data.table(first, second)
        return(combi)
    }

v <- 1:100
microbenchmark(combi(v), combn(v,2))

testing

options(disambr_get_output_set = TRUE)
options(disambr_read_output_set = TRUE)
options(disambr_mess_pretty = TRUE)
options(disambr_save_as = TRUE)

getOption("disambr_get_output_set")

my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"

ts_eva_test_1 <- 
    disambr_read(my.file2)

my.dir.small <- '../data/Journals in Mathematical Psychology'

ts_eva_test <- 
    disambr_read(my.dir.small, )



ts_eva[[]]



ts_eva_full <-
    disambr_read("../data/wos-slow-export-subset"
               , save_sets_as = "wos-slow-export-subset.rds"
               , save_sets_dir = "../data") %>% 
    disambr_set_tekles_bornmann %>% #59149  vs_1962896
    disambr_set_on_same_paper %>%
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids



ts_eva <- ts_eva_test

ts_eva_sabe <- ts_eva

ts_eva %<>%
    disambr_set_on_same_paper %>% 
    disambr_set_similar_initials %>% 
    disambr_set_similar_last_names %>%
    disambr_set_same_email %>% 
    disambr_set_same_affiliation %>%
    disambr_set_cite_others_paper %>%
    disambr_set_common_references %>%
    disambr_set_cite_self_citation %>%
    disambr_set_common_keywords %>%
    disambr_set_same_researcher_ids





ts_eva %>% 
    disambr_get_first_data_set(recipe = "wos_tsv_") %>%
    names


ts_eva %>% length

ts_eva[[14]] <-  NULL


disambr_set_cite_others_paper

## disambr_set_common_keywords 
## ts_eva[[1]]$DE[c(ts_eva[[2]]$paper_id[c(12484, 18298
                                      ## , 1281, 1337)])]


## disambr_set_common_references



## disambr_set_same_affiliation
## ts_eva[[2]]$affiliations[c(
                ## 1254
              ## , 7361
              ## , 11379
              ## , 241
              ## , 7042)]


## disambr_set_same_email
## ts_eva[[2]]$author_email[c(
                ## 3
               ## , 18458
               ## , 18466
               ## , 10
               ## , 16
              ## , 24564
                ## , 24641)]

## disambr_set_similar_last_names
## ts_eva[[2]]$author_name[c(
                ## 1254
               ## , 1369
               ## , 6133
               ## , 7323
               ## , 7361
               ## , 7724
               ## , 2271
               ## , 2704
               ## , 17859)]

ts_eva_test[[2]]$author_initials[]

attr(ts_eva_test[[5]], "disambr_set_type")






disambr_set_cite_self_citation %>%
disambr_set_common_keywords

analysis

##' Make a statistics for a collection of sets (specific algorithm) and writes to the file in `sets_dir`
##' 
##' @param sets Collection of sets
##' @param sets_dir If first arg is not provided search for collection of sets in this directory. The results are saved to this directory as well.
##' @param name Name of the collection of sets
##' @param remove_data whether to remove first 4 sets (assumed to be a data) when sets arg is set
##' @param save_rds save_rds 
##' @return 
##' @export 
disambr_stats <- function(sets = NULL
                        , sets_dir = getOption("disambr_save_set_dir")
                        , name = sets_dir
                        , remove_data = TRUE
                        , save_rds = TRUE) {
    if(is.null(sets)){
        ## reads sets from directory
        sets <- 
            lapply(dir(sets_dir
                     , full.names = TRUE
                     , pattern = "disambr-set\\..*\\.rds$")
                 , readRDS)
    } else if(remove_data){
        ## remove data from the sets
        sets <- sets[-c(1:4)]
    }
    ## get duration
    sets_attr <- 
        lapply(sets, attributes)
    sets_attr_duration <- 
        lapply(sets_attr, `[[`, "disambr_set_duration")
    sets_attr_duration <-
        unlist(sapply(sets_attr_duration, as.numeric, units = "mins"))
    dur_mins <- sum(sets_attr_duration)
    ## get sets
    truth <- 
        disambr_get_truth_set(sets)
    strong <- 
        disambr_get_strong_set(sets)
    input <- disambr_get_input_set(sets)
    ## mirror pairs for comparison (this assumes that sets are data.tables)
    strong <-
        unique(rbind(strong
                   , strong[, .(author_id2
                              , author_id1)]))
    truth <-
        unique(rbind(truth
                   , truth[, .(second
                                 , first)]))
    truth <- truth[, .(author_id1 = first
                     , author_id2 = second)]

    ## True positive (a)
    a <- nrow(data.table::fintersect(truth, strong))
    ## False positive (b)
    b <- nrow(data.table::fsetdiff(strong, truth))
    ## False negative (c)
    c <- nrow(data.table::fsetdiff(truth, strong))
    ## True negative (d)                                              (ref:bug)
    d <- nrow(sets[[2]]) * (nrow(sets[[2]]) - 1) -
        nrow(data.table::funion(truth, strong))

    pw_presision <- a / (a + b)
    pw_recall <- a / (a + c)
    pw_f1 <- (2 * pw_presision * pw_recall) / (pw_presision + pw_recall)
    pw_accuracy <-  (a + d) / (a + b + c + d)

    stats <- 
        list(name = name
           , true_positives = a
           , false_positives = b
           , false_negatives = c
           , true_negatives = d
           , pw_presision = pw_presision
           , pw_recall = pw_recall
           , pw_f1 = pw_f1
           , pw_accuracy = pw_accuracy
           , dur_mins = dur_mins
           , dur_sets = list(sets_attr_duration))

    ## save on disk
    if(save_rds){
        saveRDS(stats, paste0(sets_dir,"/", "disambr_stats.rds"))
    }
    return(stats)
}

##' Makes comparative table with statistics
##' @param sets_dir dir 
##' @return 
##' 
##' @md 
##' @export 
disambr_stats_table <- function(sets_dir){
    files <- dir(sets_dir
               , recursive = TRUE
               , full.names = TRUE
               , pattern = "^disambr_stats\\.rds$")
    stats_list <- lapply(files, readRDS)
    data.table::rbindlist(stats_list)
}

tests

my.dir <- '../data'
my.dir.small <- '../data/Journals in Mathematical Psychology'
my.dir.large <- '/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments'
my.dir.huge <- '/mnt/md5/data/wos'

my.file <- '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt' 
my.file1 <- "/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments/LN Public NAICS records from 10001 to 10500.txt"
my.files <-
    c('../data/Journals in Mathematical Psychology/Applied Measurement in Education.txt'
    , '../data/Journals in Mathematical Psychology/Applied Psychological Measurement.txt')


## ----------------------------------------------------------------------------

my.dat <- disambr.read(my.dir)

my.dat <- disambr.read(my.file)

attributes(my.dat[[1]])

my.dat <- disambr.read("../data/new_export")

my.dat[[1]]$RP[1:4]
my.dat[[1]]$EM[1:4]

my.dat <- 
    my.dat %>%
    disambr_set_not_on_same_paper


my.dat[[2]]


my.dat %>% 
    disambr_set_not_on_same_paper %>% 
    disambr_set_similar_initials %>%
    disambr_set_similar_last_names


dat <- disambr.read(my.file)

dat %>% extract(1) %>% disambr_set_not_on_same_paper


## new testing
d <- disambr.read("../data/wos-researchers-ids")

d <- d[[1]][1:1000,] %>% list

d.done <- 
    d %>% 
    disambr_set_not_on_same_paper %>% 
    disambr_set_similar_initials  %>%


d.done %>% length

d.done2 %>% length

d.done2[[4]] %>% nrow

d.done2 <- 
    d.done %>%
    disambr_set_similar_last_names

saveRDS(d.done2, "../data/d.done2.rds")




disambr.eva <- function(data) {
    data %>% 
        disambr_set_not_on_same_paper %>% 
        disambr_set_similar_initials %>%
        disambr_set_similar_last_names
}

## Usage
disambr.eva(data)


disambr_set_not_on_same_paper <- disambr.define.procedure(data %>% 
                                                          get(publication) %>%
                                                          for.each %>%
                                                          get(person = author))

## or
disambr_set_not_on_same_paper <- disambr.define.procedure(data$
                                                          publication$
                                                          person(author))





list(data = my.dat
   , similar.initials = set.similar.initials) %>% 
    disambr_set_similar_last_names


set.different.authors <- disambr_set_not_on_same_paper(my.dat)

set.similar.initials <- 
    list(data = my.dat
       , different.authors = set.different.authors) %>% 
    disambr_set_similar_initials




## 2020-08-25

## problem with large integers
bit64

return_sets_comb <-
        combn(1:100000, 2, simplify = FALSE)

10,000,000,000

return_sets_comb %>% length

install.packages("bit64")
as.integer64
library("bit64")

.Machine$integer.max
[1] 2147483647


combn(3, 2, simplify = FALSE)

my.dir.small <- '../data/Journals in Mathematical Psychology'

my.dir.large <- '/run/media/stas/ed6193a5-e4fd-4d83-9eb6-e481c39aeb8f/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments'


my.dir.wos <- "~/home/web of science"

my.file2 <- "../data/new_export/savedrecs-ms-recent.txt"
my.file2 <- "../data/wos-researchers-ids"


dt <- disambr_read(my.dir.wos) %T>%
    saveRDS(file = "my.wos.rds", compress = FALSE)


dt <- readRDS(file = "my.dir.wos.rds") %>% 
    disambr_set_not_on_same_paper(verbose = TRUE)



dt <- disambr_read(my.dir.large) %T>%
    saveRDS(file = "my.dir.large.rds", compress = FALSE) %>% 
    disambr_set_not_on_same_paper(verbose = TRUE)

dt <- readRDS(file = "my.dir.large.rds") %>% 
    disambr_set_not_on_same_paper(verbose = TRUE)

dt[[4]]

Testing help

options(browser="firefox")
help(disambr.read, help_type = "html")