This folder contains files required to reproduce the experiments for the following paper:
N. G. Marchant, B. I. P. Rubinstein, R. C. Steorts, "Bayesian Graphical Entity Resolution using Exchangeable Random Partition Priors," Journal of Survey Statistics and Methodology, 2023, smac030. DOI: 10.1093/jssam/smac030. arXiv: 2301.02962.
Four of the five data sets are included in the datasets
directory:
rest
: with filenamefz-nophone.arff.gz
. Original source.cora
: with filenamecora.arff.gz
. Modified from original source to correct erroneous ground truth labels.RLdata
: with filenameRLdata10000.csv.gz
. From the RecordLinkage R package.synthdata
: with filenames matching the patterngen_link-conf-mu-*_dist-conf-*_seed-*_exp-num-recs-*_records.csv.gz
. The Python notebook used to generate the synthetic data is included.
nltcs
is available from NACDA
after signing a data usage agreement.
To run the experiments for our model, you must install the exchanger
R package. It is hosted on GitHub
and can be installed from within R using
devtools::install_github("cleanzr/exchanger")
.
Similarly, to run the experiments for the model of Sadinle (2014), you
must install the BDD
R package. It is also hosted on
GitHub and can be installed from
within R using devtools::install_github("cleanzr/BDD")
.
Other dependencies include the following R packages from CRAN:
comparator
clevr
tidyverse
ggdist
egg
The following scripts define functions that are shared across the experiments:
run_ours.R
run_sadinle.R
util.R
To run all of the experiments for one of the models, execute the following in a terminal:
$ Rscript run_<model>_all.R
replacing <model>
with:
ours
for our model,blink
for the model by Steorts (2015),ours-blinkdist
for our model with the distortion model by Steorts (2015), orsadinle
for the model by Sadinle (2014).
To run an experiment for a particular data set and model, execute the following in a terminal:
Rscript run_<model>_<dataset>.R
where <model>
is defined as above and <dataset>
can be one of
cora
, nltcs
, restaurant
, RLdata10000
or synthdata
.
Each experiment will produce several files:
<prefix>_result.rds
: the saved state of the model and Markov chain<prefix>_eval.txt
: pairwise and clustering evaluation metrics computed for a point estimate.<prefix>_trace-*.png
: various diagnostic plots. These vary for each model.
After running all experiments, tables and figures can be reproduced as follows:
- Figure 3 can be reproduced by running
plot_err-num_ents_comparison.R
- Table 2 and Figure S7 can be reproduced by running
evaluate_prior_dist_model.R
- Figure 4 can be reproduced by running
plot_dist-level_comparison.R
- Table 3 can be reproduced by running
evaluate_models.R
- Figures S5 and S6 can be reproduced by running
evaluate_synthdata.R
- Figure S8 can be reproduced by running
plot_ep-params.R
- Figure S9 can be reproduced by running
plot_m_sadinle.R
- The diagnostic plots in Appendix G can be reproduced by running
plot_geweke.R
andplot_trace.R