README.Rmd

---
output:
  md_document:
    variant: gfm
bibliography: refs.bib
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

# BDD: Duplicate detection using a Bayesian partitioning approach

This R package implements a Bayesian model for duplicate detection 
as proposed in [@sadinle2014]. It resembles the classic Fellegi-Sunter 
model [@fellegi1969] in that coreference decisions are based on pairwise 
attribute-level comparisons. However, unlike the Fellegi-Sunter mdoel, 
this model targets a _partition_ of the records, rather than pairwise 
coreference predictions, thereby ensuring transitivity. Inference is 
conducted using a Gibbs sampler, with parts implemented in 
[Rcpp](http://www.rcpp.org/).

## Installation
`BDD` is not currently available on CRAN. 
The latest development version can be installed from source using `devtools` 
as follows:

```{r, eval=FALSE}
library(devtools)
install_github("cleanzr/BDD")
```

## Example
We demonstrate how to perform duplicate detection using the `RLdata500` test 
data set included with the package. `RLdata500` contains 500 synthetic 
personal records, 50 of which are duplicates with randomly-generated errors.
In the code block, below we load the package and examine the first few rows 
of `RLdata500`. 
```{r, eval=TRUE, message=FALSE, warning=FALSE}
library(magrittr)    # pipe operator (must be loaded before BDD)
library(comparator)  # normalized Levenshtein distance
library(clevr)       # evaluation functions
library(BDD)

RLdata500[['rec_id']] <- seq.int(nrow(RLdata500))
head(RLdata500)
```

The model uses agreement levels between attributes in order to assess 
whether a pair of records is a match (referring to the same entity) or not.
To compute the agreement levels, we'll first compute real-valued scores
between the attributes, then maps the scores to discrete levels of agreement.
In the code block below, we specify scoring functions for the five attributes 
we will use for matching (we don't use `fname_c2` and `lname_c2` as more than 
90% of the values are missing). 
Note that the scoring functions must accept a pair of attribute vectors `x` 
and `y` as arguments and return a vector of scores.
```{r, eval=TRUE}
scoring_fns <- list(
  fname_c1 = Levenshtein(normalize = TRUE),
  lname_c1 = Levenshtein(normalize = TRUE),
  by = function(x, y) abs(x - y),
  bm = function(x, y) abs(x - y),
  bd = function(x, y) abs(x - y)
)
```

For each scoring function above, we provide a breaks vector which 
specifies the discrete levels of agreement (from 'high' agreement to 'low').
```{r, eval=TRUE}
scoring_breaks <- list(
  fname_c1 = c(-Inf, .05, .2, .4, Inf),
  lname_c1 = c(-Inf, .05, .2, .4, Inf),
  by = c(-Inf, 0, 1, 3, Inf),
  bm = c(-Inf, 0, 1, 3, Inf),
  bd = c(-Inf, 0, 2, 7, Inf)
)
```

Now we are ready to compute the agreement levels for the record pairs. 
Since this is a small data set, we consider all pairs using the `pairs_all` 
function. 
For larger data sets, blocking/indexing is recommended using `pairs_hamming`, 
`pairs_fuzzyblock` or a custom indexing function.
```{r, eval=TRUE}
pairs <- pairs_all(RLdata500$rec_id) %>%
  compute_scores(RLdata500, scoring_fns, id_col = 'rec_id') %>% 
  discretize_scores(scoring_breaks)
```

To speed up inference, we only consider a subset of the pairs as candidate 
matches. 
Specifically, we consider pairs that have a strong agreement on name 
(accounting for missing names).
```{r, eval=TRUE}
pairs[['candidate']] <- (pairs$fname_c1 < 4) & (pairs$lname_c1 < 4) | 
                            is.na(pairs$fname_c1) | is.na(pairs$lname_c1) 
```

Next we specify the priors on the m* and u* probabilities for each 
attribute and agreement level. 
`lambda` specifies the lower truncation points for the truncated Beta priors 
on the m* probabilities. 
By default, a uniform prior is used over the truncated interval; a non-uniform 
prior can be specified using the `alpha1` and `beta1` arguments to `BDD` 
below.
The Beta priors on the u* probabilities are also uniform by default and 
can be adjusted using the `alpha0` and `beta0` arguments.
```{r, eval=TRUE, message=FALSE, warning=FALSE}
lambda <- list(
  fname_c1 = c(0.8, 0.85, 0.99),
  lname_c1 = c(0.8, 0.85, 0.99),
  by = c(0.8, 0.85, 0.99),
  bm = c(0.8, 0.85, 0.99),
  bd = c(0.8, 0.85, 0.99)
)
```

Finally, we initialize the model and run inference using Markov chain 
Monte Carlo.
```{r, eval=TRUE}
model <- BDD(pairs, lambda, id_cols = c("rec_id.x", "rec_id.y"), 
             candidate_col = "candidate")
fit <- run_inference(model, 100, thin_interval = 10, burnin_interval = 100)
```

The posterior samples of the linkage structure can be accessed by calling 
`extract(fit, "links")`. 
However, these samples only cover the records that were considered as 
candidate pairs. 
We can obtain samples of the complete linkage structure (for all records) 
using the following function.
```{r, eval=TRUE}
links_samples <- complete_links_samples(fit, RLdata500$rec_id)
```

Since RLdata500 comes with ground truth, we can evaluate the predicted 
linkage structure using supervised metrics. 
We find that the model performs well on this data set, achieving high 
pairwise precision and recall.
```{r, eval=TRUE}
n_records <- nrow(RLdata500)
true_pairs <- membership_to_pairs(identity.RLdata500)
metrics <- apply(links_samples, 1, function(links) {
  pred_pairs <- membership_to_pairs(links)
  pr <- precision_pairs(true_pairs, pred_pairs)
  re <- recall_pairs(true_pairs, pred_pairs)
  c(Precision = pr, Recall = re)
}) %>% t()
summary(metrics)
```

## Licence
GPL-3

## References