-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
162 lines (144 loc) · 5.57 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
---
output:
md_document:
variant: gfm
bibliography: refs.bib
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
# BDD: Duplicate detection using a Bayesian partitioning approach
This R package implements a Bayesian model for duplicate detection
as proposed in [@sadinle2014]. It resembles the classic Fellegi-Sunter
model [@fellegi1969] in that coreference decisions are based on pairwise
attribute-level comparisons. However, unlike the Fellegi-Sunter mdoel,
this model targets a _partition_ of the records, rather than pairwise
coreference predictions, thereby ensuring transitivity. Inference is
conducted using a Gibbs sampler, with parts implemented in
[Rcpp](http://www.rcpp.org/).
## Installation
`BDD` is not currently available on CRAN.
The latest development version can be installed from source using `devtools`
as follows:
```{r, eval=FALSE}
library(devtools)
install_github("cleanzr/BDD")
```
## Example
We demonstrate how to perform duplicate detection using the `RLdata500` test
data set included with the package. `RLdata500` contains 500 synthetic
personal records, 50 of which are duplicates with randomly-generated errors.
In the code block, below we load the package and examine the first few rows
of `RLdata500`.
```{r, eval=TRUE, message=FALSE, warning=FALSE}
library(magrittr) # pipe operator (must be loaded before BDD)
library(comparator) # normalized Levenshtein distance
library(clevr) # evaluation functions
library(BDD)
RLdata500[['rec_id']] <- seq.int(nrow(RLdata500))
head(RLdata500)
```
The model uses agreement levels between attributes in order to assess
whether a pair of records is a match (referring to the same entity) or not.
To compute the agreement levels, we'll first compute real-valued scores
between the attributes, then maps the scores to discrete levels of agreement.
In the code block below, we specify scoring functions for the five attributes
we will use for matching (we don't use `fname_c2` and `lname_c2` as more than
90% of the values are missing).
Note that the scoring functions must accept a pair of attribute vectors `x`
and `y` as arguments and return a vector of scores.
```{r, eval=TRUE}
scoring_fns <- list(
fname_c1 = Levenshtein(normalize = TRUE),
lname_c1 = Levenshtein(normalize = TRUE),
by = function(x, y) abs(x - y),
bm = function(x, y) abs(x - y),
bd = function(x, y) abs(x - y)
)
```
For each scoring function above, we provide a breaks vector which
specifies the discrete levels of agreement (from 'high' agreement to 'low').
```{r, eval=TRUE}
scoring_breaks <- list(
fname_c1 = c(-Inf, .05, .2, .4, Inf),
lname_c1 = c(-Inf, .05, .2, .4, Inf),
by = c(-Inf, 0, 1, 3, Inf),
bm = c(-Inf, 0, 1, 3, Inf),
bd = c(-Inf, 0, 2, 7, Inf)
)
```
Now we are ready to compute the agreement levels for the record pairs.
Since this is a small data set, we consider all pairs using the `pairs_all`
function.
For larger data sets, blocking/indexing is recommended using `pairs_hamming`,
`pairs_fuzzyblock` or a custom indexing function.
```{r, eval=TRUE}
pairs <- pairs_all(RLdata500$rec_id) %>%
compute_scores(RLdata500, scoring_fns, id_col = 'rec_id') %>%
discretize_scores(scoring_breaks)
```
To speed up inference, we only consider a subset of the pairs as candidate
matches.
Specifically, we consider pairs that have a strong agreement on name
(accounting for missing names).
```{r, eval=TRUE}
pairs[['candidate']] <- (pairs$fname_c1 < 4) & (pairs$lname_c1 < 4) |
is.na(pairs$fname_c1) | is.na(pairs$lname_c1)
```
Next we specify the priors on the m* and u* probabilities for each
attribute and agreement level.
`lambda` specifies the lower truncation points for the truncated Beta priors
on the m* probabilities.
By default, a uniform prior is used over the truncated interval; a non-uniform
prior can be specified using the `alpha1` and `beta1` arguments to `BDD`
below.
The Beta priors on the u* probabilities are also uniform by default and
can be adjusted using the `alpha0` and `beta0` arguments.
```{r, eval=TRUE, message=FALSE, warning=FALSE}
lambda <- list(
fname_c1 = c(0.8, 0.85, 0.99),
lname_c1 = c(0.8, 0.85, 0.99),
by = c(0.8, 0.85, 0.99),
bm = c(0.8, 0.85, 0.99),
bd = c(0.8, 0.85, 0.99)
)
```
Finally, we initialize the model and run inference using Markov chain
Monte Carlo.
```{r, eval=TRUE}
model <- BDD(pairs, lambda, id_cols = c("rec_id.x", "rec_id.y"),
candidate_col = "candidate")
fit <- run_inference(model, 100, thin_interval = 10, burnin_interval = 100)
```
The posterior samples of the linkage structure can be accessed by calling
`extract(fit, "links")`.
However, these samples only cover the records that were considered as
candidate pairs.
We can obtain samples of the complete linkage structure (for all records)
using the following function.
```{r, eval=TRUE}
links_samples <- complete_links_samples(fit, RLdata500$rec_id)
```
Since RLdata500 comes with ground truth, we can evaluate the predicted
linkage structure using supervised metrics.
We find that the model performs well on this data set, achieving high
pairwise precision and recall.
```{r, eval=TRUE}
n_records <- nrow(RLdata500)
true_pairs <- membership_to_pairs(identity.RLdata500)
metrics <- apply(links_samples, 1, function(links) {
pred_pairs <- membership_to_pairs(links)
pr <- precision_pairs(true_pairs, pred_pairs)
re <- recall_pairs(true_pairs, pred_pairs)
c(Precision = pr, Recall = re)
}) %>% t()
summary(metrics)
```
## Licence
GPL-3
## References