| Title: | Multifile Record Linkage and Duplicate Detection |
| Version: | 0.1.1 |
| Description: | Implementation of the methodology of Aleshin-Guendel & Sadinle (2022) <doi:10.1080/01621459.2021.2013242>. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates. |
| Depends: | R (≥ 3.5.0) |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.1.2 |
| URL: | https://github.com/aleshing/multilink |
| BugReports: | https://github.com/aleshing/multilink/issues |
| Imports: | igraph, RecordLinkage, Rcpp, utils, mcclust, geosphere, stringr |
| LinkingTo: | Rcpp, RcppArmadillo |
| NeedsCompilation: | yes |
| Packaged: | 2023-06-08 20:25:20 UTC; sergealeshin-guendel |
| Author: | Serge Aleshin-Guendel [aut, cre] |
| Maintainer: | Serge Aleshin-Guendel <saleshinguendel@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2023-06-09 14:20:07 UTC |
Create Comparison Data
Description
Create comparison data for all pairs of records, except for pairs of records within files which are assumed to have no duplicates.
Usage
create_comparison_data(
  records,
  types,
  breaks,
  file_sizes,
  duplicates,
  verbose = TRUE
)
Arguments
records | A data.frame containing the records from all files (stacked), where each column is a field to be compared. |
types | A character vector indicating how each field should be compared; binary, absolute difference, and normalized Levenshtein distance comparisons are supported (see Details). The Examples use "bi" for binary and "lv" for Levenshtein comparisons. |
breaks | A list, with one element for each field, giving the cut points used to discretize that field's comparisons (NA for fields compared with a binary comparison); see Details. |
file_sizes | A numeric vector indicating the size of each file. |
duplicates | A numeric vector indicating which files are assumed to have duplicates: duplicates[k] should be 1 if file k has duplicates, and 0 if file k has no duplicates. |
verbose | A logical indicator of whether progress messages should be printed. |
Details
The purpose of this function is to construct comparison vectors for each pair of records. In order to construct these vectors, one needs to specify the types and breaks arguments. The types argument specifies how each field should be compared, and the breaks argument specifies how to discretize these comparisons.
Currently, the types argument supports three types of field comparisons: binary, absolute difference, and the normalized Levenshtein distance. Please contact the package maintainer if you need a new type of comparison to be supported.
The breaks argument should be a list, with one element for each field. If a field is being compared with a binary comparison, i.e. types[f] = "bi", then the corresponding element of breaks should be NA, i.e. breaks[[f]] = NA. If a field is being compared with a numeric or string comparison, then the corresponding element of breaks should be a vector of cut points used to discretize the comparisons. To give more detail, suppose you pass in cut points breaks[[f]] = c(cut_1, ..., cut_L). These cut points discretize the range of the comparisons into L + 1 intervals: I_0 = (-\infty, cut_1], I_1 = (cut_1, cut_2], ..., I_L = (cut_L, \infty). The raw comparisons, which lie in [0, \infty) for numeric comparisons and [0, 1] for string comparisons, are then replaced with indicators of which interval the comparisons lie in. The interval I_0 corresponds to the lowest level of disagreement for a comparison, while the interval I_L corresponds to the highest level of disagreement for a comparison.
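To make the discretization concrete, here is a small base R sketch (a hypothetical helper, not the package's internal code) that maps raw comparison values to disagreement levels using cut points like those in the Examples:

# Sketch: map raw comparisons to disagreement levels given cut points.
# findInterval() with left-open intervals reproduces
# I_0 = (-Inf, cut_1], I_1 = (cut_1, cut_2], ..., I_L = (cut_L, Inf).
discretize_comparison <- function(raw, cuts) {
  findInterval(raw, cuts, left.open = TRUE)
}

# Normalized Levenshtein comparisons lie in [0, 1]; with breaks c(0, 0.25, 0.5)
# an exact agreement (0) falls in level 0 and a large distance (0.9) in level 3.
discretize_comparison(c(0, 0.1, 0.3, 0.9), cuts = c(0, 0.25, 0.5))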
Value
a list containing:
- record_pairs: A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i.
- comparisons: A logical matrix, where each row contains the comparisons for the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.
- K: The number of files, assumed to be of class numeric.
- file_sizes: A numeric vector of length K, indicating the size of each file.
- duplicates: A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.
- field_levels: A numeric vector indicating the number of disagreement levels for each field.
- file_labels: An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
- fp_matrix: An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
- rp_to_fp: A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
- ab: An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
- file_sizes_not_included: A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
- ab_not_included: A numeric vector of 0s. This element is non-zero when reduce_comparison_data is used.
- labels: NA. This element is not NA when reduce_comparison_data is used.
- pairs_to_keep: NA. This element is not NA when reduce_comparison_data is used.
- cc: 0. This element is non-zero when reduce_comparison_data is used.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
## Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = no_dup_data_small$file_sizes,
    duplicates = c(0, 0, 0))

## Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = dup_data_small$file_sizes,
    duplicates = c(1, 1, 1))
Duplicate Dataset
Description
A dataset containing 867 simulated records from 3 files with duplicate records in each file.
Usage
dup_data
Format
A list with three elements:
- records: A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
- file_sizes: The size of each file.
- IDs: The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
data(dup_data)

# There are 500 entities represented in the records
length(unique(dup_data$IDs))
Small Duplicate Dataset
Description
A dataset containing 96 simulated records from 3 files with duplicate records in each file, subset from dup_data.
Usage
dup_data_small
Format
A list with three elements:
- records: A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
- file_sizes: The size of each file.
- IDs: The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
data(dup_data_small)

# There are 96 entities represented in the records
length(unique(dup_data_small$IDs))
Find the Bayes Estimate of a Partition
Description
Find the (approximate) Bayes estimate of a partition based on MCMC samples of the partition and a specified loss function.
Usage
find_bayes_estimate(
  partitions,
  burn_in,
  L_FNM = 1,
  L_FM1 = 1,
  L_FM2 = 2,
  L_A = Inf,
  max_cc_size = nrow(partitions),
  verbose = TRUE
)
Arguments
partitions | Posterior samples of the partition, where each column is one sample and the partition is represented as an integer vector of arbitrary labels, as in the partitions element output by gibbs_sampler. |
burn_in | The number of samples to discard for burn in. |
L_FNM | Positive loss for a false non-match. Default is 1. |
L_FM1 | Positive loss for a type 1 false match. Default is 1. |
L_FM2 | Positive loss for a type 2 false match. Default is 2. |
L_A | Positive loss for abstaining from making a decision for a record. Default is Inf; when L_A is finite the output may be a partial estimate (see Value). |
max_cc_size | The maximum allowable connected component size over which the posterior expected loss is minimized. Default is nrow(partitions), the number of records. |
verbose | A logical indicator of whether progress messages should be printed. |
Value
A vector, the same length as a column of partitions, containing the (approximate) Bayes estimate of the partition. If !is.infinite(L_A) the output may be a partial estimate. A positive number l in index i indicates that record i is in the same cluster as every other record j with l in index j. A value of -1 in index i indicates that the Bayes estimate abstained from making a decision for record i.
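As a usage sketch, assuming full_estimate and partial_estimate computed as in the Examples below, the returned labels can be summarized as follows (the relabeling of abstained records mirrors the Examples):

# Records sharing a positive label are estimated to refer to the same entity
length(unique(full_estimate))            # number of estimated clusters

# In a partial estimate, -1 marks records for which the estimate abstained
abstained <- which(partial_estimate == -1)
length(abstained)                        # number of abstentions

# One way to treat abstained records as singletons before further analysis
partial_estimate[abstained] <- length(partial_estimate) + abstained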
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = no_dup_data_small$file_sizes,
    duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
    alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
    dup_count_prior_family = NA, dup_count_prior_pars = NA,
    n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
    Z_init = Z_init, seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = dup_data_small$file_sizes,
    duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
    pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
    flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
    dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
    dup_count_prior_pars = list(c(1), c(1), c(1)),
    n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
    seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
Gibbs Sampler for Posterior Inference
Description
Run a Gibbs sampler to explore the posterior distribution of partitions of records.
Usage
gibbs_sampler(
  comparison_list,
  prior_list,
  n_iter = 2000,
  Z_init = 1:sum(comparison_list$file_sizes),
  seed = 70,
  single_likelihood = FALSE,
  chaperones_info = NA,
  verbose = TRUE
)
Arguments
comparison_list | The output from a call to create_comparison_data or reduce_comparison_data. |
prior_list | The output from a call to specify_prior. |
n_iter | The number of iterations of the Gibbs sampler to run. |
Z_init | Initialization of the partition of records, represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes). The default initialization places each record in its own cluster. |
seed | The seed to use while running the Gibbs sampler. |
single_likelihood | A logical indicator of whether a single likelihood should be used for comparisons from all file pairs, rather than a separate likelihood for each file pair. Default is FALSE. |
chaperones_info | If NA (the default), the partition is updated with Gibbs updates. Otherwise, settings for the chaperones algorithm used to update the partition (see the Miller et al. and Betancourt et al. references). |
verbose | A logical indicator of whether progress messages should be printed. |
Details
Given the prior specified using specify_prior, this function runs a Gibbs sampler to explore the posterior distribution of partitions of records, conditional on the comparison data created using create_comparison_data or reduce_comparison_data.
Value
a list containing:
- m: Posterior samples of the m parameters. Each column is one sample.
- u: Posterior samples of the u parameters. Each column is one sample.
- partitions: Posterior samples of the partition. Each column is one sample. Note that the partition is represented as an integer vector of arbitrary labels of length sum(comparison_list$file_sizes).
- contingency_tables: Posterior samples of the overlap table. Each column is one sample. This incorporates counts of records determined not to be candidate matches to any other records using reduce_comparison_data.
- cluster_sizes: Posterior samples of the size of each cluster (associated with an arbitrary label from 1 to sum(comparison_list$file_sizes)). Each column is one sample.
- sampling_time: The time in seconds it took to run the sampler.
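A common use of these samples is a quick convergence check. The following base R sketch (assuming results from the Examples below, with a burn in of 100 iterations as used there) traces the number of clusters across iterations:

# Number of clusters represented in each posterior sample of the partition
n_clusters <- apply(results$partitions, 2, function(z) length(unique(z)))

# Trace plot after discarding burn in; a stable trace suggests the sampler
# has settled, while a persistent trend suggests more iterations are needed
plot(n_clusters[-(1:100)], type = "l",
     xlab = "Iteration (post burn in)", ylab = "Number of clusters")

# Time taken by the sampler, in seconds
results$sampling_time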
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, & Rebecca C. Steorts (2015). Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set. NeurIPS Bayesian Nonparametrics: The Next Generation Workshop Series. [arXiv]
Brenda Betancourt, Giacomo Zanella, Jeffrey Miller, Hanna Wallach, Abbas Zaidi, & Rebecca C. Steorts (2016). Flexible Models for Microclustering with Application to Entity Resolution. Advances in Neural Information Processing Systems. [Published] [arXiv]
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = no_dup_data_small$file_sizes,
    duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
    alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
    dup_count_prior_family = NA, dup_count_prior_pars = NA,
    n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
{
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
    Z_init = Z_init, seed = 42)
}

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = dup_data_small$file_sizes,
    duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
    pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
    flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
    dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
    dup_count_prior_pars = list(c(1), c(1), c(1)),
    n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
{
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
    seed = 42)
}
Initialize the Partition
Description
Generate an initialization for the partition in the case when it is assumed there are no duplicates in all files (so that the partition is a matching).
Usage
initialize_partition(comparison_list, pairs_to_keep, seed = NA)
Arguments
comparison_list | The output from a call to create_comparison_data. |
pairs_to_keep | A logical vector, the same length as the number of record pairs in comparison_list$record_pairs, indicating which record pairs are potential matches according to an indexing method. |
seed | The seed to use to generate the initialization. |
Details
When it is assumed that there are no duplicates in all files, and reduce_comparison_data is not used to reduce the number of potential matches, the Gibbs sampler used for posterior inference may experience slow mixing when using an initialization for the partition where each record is in its own cluster (the default option for the Gibbs sampler). The purpose of this function is to provide an alternative initialization scheme.
To use this initialization scheme, the user passes in a logical vector that indicates which record pairs are potential matches according to an indexing method (as in reduce_comparison_data). Note that this indexing is only used to generate the initialization; it is not used for inference. The initialization scheme first finds the transitive closure of the potential matches, which partitions the records into blocks. Within each block of records, the scheme randomly selects a record from each file, and these selected records are then placed in the same cluster for the partition initialization. All other records are placed in their own clusters.
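The blocking idea described above can be pictured with igraph (already among the package's imports); the following is an illustrative sketch of the transitive closure step, not the function's internal code, and assumes comparison_list and pairs_to_keep as constructed in the Examples below:

library(igraph)

# Potential matches form the edges of a graph on the records; the transitive
# closure of the potential matches corresponds to its connected components
edges <- as.matrix(comparison_list$record_pairs[pairs_to_keep, ])
g <- graph_from_edgelist(edges, directed = FALSE)
# Pad with isolated vertices so every record appears in the graph
g <- add_vertices(g, sum(comparison_list$file_sizes) - vcount(g))

blocks <- components(g)$membership   # block label for each record
table(table(blocks))                 # distribution of block sizes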
Value
an integer vector of arbitrary labels of length sum(comparison_list$file_sizes), giving an initialization for the partition.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = no_dup_data_small$file_sizes,
    duplicates = c(0, 0, 0))

# Find initialization for the matching
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)
Multifile Record Linkage and Duplicate Detection
Description
The multilink package implements the methodology of Aleshin-Guendel & Sadinle (2022). It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
# Here we demonstrate an example workflow with the small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = no_dup_data_small$file_sizes,
    duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
    alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
    dup_count_prior_family = NA, dup_count_prior_pars = NA,
    n_prior_family = "uniform", n_prior_pars = NA)

# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
    Z_init = Z_init, seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# The number of clusters in the full estimate
length(unique(full_estimate))

# The number of entities represented in the records
length(unique(no_dup_data_small$IDs))

# Find which record pairs are truly coreferent based on IDs
true_links <- no_dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==
    no_dup_data_small$IDs[comparison_list$record_pairs[, 2]]

# Find which record pairs are in the same clusters in the full estimate
full_estimate_links <- full_estimate[comparison_list$record_pairs[, 1]] ==
    full_estimate[comparison_list$record_pairs[, 2]]

# Find the number of true matches in the full estimate
true_matches <- sum(full_estimate_links & true_links)

# Precision of the full estimate
true_matches / sum(full_estimate_links)

# Recall of the full estimate
true_matches / sum(true_links)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# The partial estimate abstains from making decisions for how many records?
sum(partial_estimate == -1)

# For the records which decisions were made for in the partial estimate,
# there are how many clusters?
length(unique(partial_estimate))

# Abstain rate of partial_estimate
sum(partial_estimate == -1) / length(partial_estimate)

# Relabel records where we abstained
partial_estimate[which(partial_estimate == -1)] <- length(partial_estimate) +
    which(partial_estimate == -1)

# Find which record pairs are in the same clusters in the partial estimate
partial_estimate_links <- partial_estimate[comparison_list$record_pairs[, 1]] ==
    partial_estimate[comparison_list$record_pairs[, 2]]

# Find the number of true matches in the partial estimate
true_matches_A <- sum(partial_estimate_links & true_links)

# Precision of the partial estimate
true_matches_A / sum(partial_estimate_links)

# Here we demonstrate an example workflow with the small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = dup_data_small$file_sizes,
    duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
    pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
    flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
    dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
    dup_count_prior_pars = list(c(1), c(1), c(1)),
    n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
    seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# The number of clusters in the full estimate (including records
# determined not to be candidate matches to any other records using
# reduce_comparison_data)
length(unique(full_estimate)) +
    sum(reduced_comparison_list$file_sizes_not_included)

# The number of entities represented in the records
length(unique(dup_data_small$IDs))

# Find which record pairs are truly coreferent based on IDs
true_links <- dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==
    dup_data_small$IDs[comparison_list$record_pairs[, 2]]

# Focus on the record pairs that were candidate matches
true_links_reduced <- true_links[reduced_comparison_list$pairs_to_keep]

# Calculate the number of prior false non-matches based on the indexing
# scheme used
prior_fnm <- nrow(comparison_list$record_pairs[true_links &
    (!reduced_comparison_list$pairs_to_keep), ])

# Find which record pairs are in the same clusters in the full estimate
full_estimate_links <- full_estimate[reduced_comparison_list$record_pairs[, 1]] ==
    full_estimate[reduced_comparison_list$record_pairs[, 2]]

# Find the number of true matches in the full estimate
true_matches <- sum(full_estimate_links & true_links_reduced)

# Precision of the full estimate
true_matches / sum(full_estimate_links)

# Recall of the full estimate
true_matches / (sum(true_links_reduced) + prior_fnm)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# The partial estimate abstains from making decisions for how many records?
sum(partial_estimate == -1)

# For the records which decisions were made for in the partial estimate,
# there are how many clusters? (including records determined not to be
# candidate matches to any other records using reduce_comparison_data)
length(unique(partial_estimate)) +
    sum(reduced_comparison_list$file_sizes_not_included)

# Abstain rate of partial_estimate (excluding records determined not
# to be candidate matches to any other records using reduce_comparison_data)
sum(partial_estimate == -1) / length(partial_estimate)

# Relabel records where we abstained
partial_estimate[which(partial_estimate == -1)] <- length(partial_estimate) +
    which(partial_estimate == -1)

# Find which record pairs are in the same clusters in the partial estimate
partial_estimate_links <- partial_estimate[reduced_comparison_list$record_pairs[, 1]] ==
    partial_estimate[reduced_comparison_list$record_pairs[, 2]]

# Find the number of true matches in the partial estimate
true_matches_A <- sum(partial_estimate_links & true_links_reduced)

# Precision of the partial estimate
true_matches_A / sum(partial_estimate_links)

# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
    full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
    partial_estimate)

# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
    full_estimate_id = full_estimate_relabel$link_id,
    partial_estimate_id = partial_estimate_relabel$link_id)
No Duplicate Dataset
Description
A dataset containing 730 simulated records from 3 files with no duplicate records in each file.
Usage
no_dup_data
Format
A list with three elements:
- records: A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
- file_sizes: The size of each file.
- IDs: The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
data(no_dup_data)

# There are 500 entities represented in the records
length(unique(no_dup_data$IDs))
Small No Duplicate Dataset
Description
A dataset containing 71 simulated records from 3 files with no duplicate records in each file, subset from no_dup_data.
Usage
no_dup_data_small
Format
A list with three elements:
- records: A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
- file_sizes: The size of each file.
- IDs: The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
data(no_dup_data_small)

# There are 71 entities represented in the records
length(unique(no_dup_data_small$IDs))
Reduce Comparison Data Size
Description
Use indexing to reduce the number of record pairs that are potential matches.
Usage
reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)
Arguments
comparison_list | The output of a call to create_comparison_data. |
pairs_to_keep | A logical vector, the same length as the number of record pairs in comparison_list$record_pairs, indicating which record pairs are potential matches according to the indexing method used. |
cc | A numeric indicator of whether to take the transitive closure of the potential matches, so that the connected components of the potential matches are closed under transitivity; recommended when the indexing method is not transitive (see Details). |
Details
When using comparison-based record linkage methods, scalability is a concern, as the number of record pairs is quadratic in the number of records. In order to address these concerns, it's common to declare certain record pairs to not be potential matches a priori, using indexing methods. The user is free to index using any method they like, as long as they can produce a logical vector that indicates which record pairs are potential matches according to their indexing method. We recommend, if the user-chosen indexing method does not output potential matches that are transitive, setting the cc argument to 1. By transitive we mean, for any three records i, j, and k, if i and j are potential matches, and j and k are potential matches, then i and k are potential matches. Non-transitive indexing schemes can lead to poor mixing of the Gibbs sampler used for posterior inference, and suggest that the indexing method used may have been too stringent.
If indexing is used, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels for information on how to go between the original labeling and the new labeling.
If indexing is used, comparisons for record pairs that aren't potential matches are still used during inference, where they're used to inform the distribution of comparisons for non-matches.
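Putting this together, the usage sketch below (assuming the comparison_list built for the duplicate dataset in the Examples below) checks how much an indexing rule shrinks the set of potential matches and how many records drop out entirely:

# Keep a pair only if neither gname nor fname disagrees at the highest level
# (the same rule used throughout the Examples)
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
mean(pairs_to_keep)                  # fraction of record pairs retained

reduced_comparison_list <- reduce_comparison_data(comparison_list,
    pairs_to_keep, cc = 1)

# Records that are not potential matches to any other record were removed;
# file_sizes_not_included counts them, per file
reduced_comparison_list$file_sizes_not_included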
Value
a list containing:
- record_pairs: A data.frame, where each row contains the pair of records being compared in the corresponding row of comparisons. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e. record_pairs[i, 1] < record_pairs[i, 2] for each row i. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels).
- comparisons: A logical matrix, where each row contains the comparisons between the record pair in the corresponding row of record_pairs. Comparisons are in the same order as the columns of records, and are represented by L + 1 columns of TRUE/FALSE indicators, where L + 1 is the number of disagreement levels for the field based on breaks.
- K: The number of files, assumed to be of class numeric.
- file_sizes: A numeric vector of length K, indicating the size of each file. If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes now represents the sizes of each file after removing such records.
- duplicates: A numeric vector of length K, indicating which files are assumed to have duplicates. duplicates[k] should be 1 if file k has duplicates, and duplicates[k] should be 0 if file k has no duplicates.
- field_levels: A numeric vector indicating the number of disagreement levels for each field.
- file_labels: An integer vector of length sum(file_sizes), where file_labels[i] indicates which file record i is in.
- fp_matrix: An integer matrix, where fp_matrix[k1, k2] is a label for the file pair (k1, k2). Note that fp_matrix[k1, k2] = fp_matrix[k2, k1].
- rp_to_fp: A logical matrix that indicates which record pairs belong to which file pairs. rp_to_fp[fp, rp] is TRUE if the records record_pairs[rp, ] belong to the file pair fp, and is FALSE otherwise. Note that fp is given by the labeling in fp_matrix.
- ab: An integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
- file_sizes_not_included: If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled (see labels), and file_sizes_not_included indicates, for each file, the number of such records that were removed.
- ab_not_included: For record pairs not included according to pairs_to_keep, this is an integer vector, of length ncol(comparisons) * K * (K + 1) / 2, that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.
- labels: If according to pairs_to_keep there are records which are not potential matches to any other records, the remaining records are relabeled. labels provides a dictionary that indicates, for each of the new labels, which record in the original labeling the new label corresponds to. In particular, the first column indicates the record in the original labeling, and the second column indicates the new labeling.
- pairs_to_keep: A logical vector, the same length as comparison_list$record_pairs, indicating which record pairs were kept as potential matches. This may not be the same as the input pairs_to_keep if cc was set to 1.
- cc: A numeric indicator of whether the connected components of the potential matches are closed under transitivity.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = dup_data_small$file_sizes,
    duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
    pairs_to_keep, cc = 1)
Relabel the Bayes Estimate of a Partition
Description
Relabel the Bayes estimate of a partition, for use after using indexing to reduce the number of record pairs that are potential matches.
Usage
relabel_bayes_estimate(reduced_comparison_list, bayes_estimate)
Arguments
reduced_comparison_list | The output from a call to reduce_comparison_data. |
bayes_estimate | The output from a call to find_bayes_estimate. |
Details
When the function reduce_comparison_data is used to reduce the number of record pairs that are potential matches, it may be the case that some records are declared to not be potential matches to any other records. In this case, the indexing method has made the decision that these records have no matches, and thus we can remove them from the data set and relabel the remaining records; see the documentation for labels in reduce_comparison_data for information on how to go between the original labeling and the new labeling. The purpose of this function is to relabel the output of find_bayes_estimate when the function reduce_comparison_data is used, so that the user doesn't have to do this relabeling themselves.
Value
A data.frame, with as many rows as sum(reduced_comparison_list$file_sizes + reduced_comparison_list$file_sizes_not_included), i.e. the number of records originally input to create_comparison_data, before indexing occurred. This data.frame has two columns, "original_labels" and "link_id". Given row i of records originally input to create_comparison_data, the linkage id according to bayes_estimate is given by the ith row of the link_id column. See the documentation for find_bayes_estimate for information on how to interpret this linkage id.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = dup_data_small$file_sizes,
    duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
    pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
    flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
    dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
    dup_count_prior_pars = list(c(1), c(1), c(1)),
    n_prior_family = "uniform", n_prior_pars = NA)

# Run the Gibbs sampler
{
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
    seed = 42)

# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)

# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
    L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
    full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
    partial_estimate)

# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
    full_estimate_id = full_estimate_relabel$link_id,
    partial_estimate_id = partial_estimate_relabel$link_id)
}
Specify the Prior Distributions
Description
Specify the prior distributions for the m and u parameters of the models for comparison data among matches and non-matches, and the partition.
Usage
specify_prior(
  comparison_list,
  mus = NA,
  nus = NA,
  flat = 0,
  alphas = NA,
  dup_upper_bound = NA,
  dup_count_prior_family = NA,
  dup_count_prior_pars = NA,
  n_prior_family = NA,
  n_prior_pars = NA
)
Arguments
comparison_list | The output from a call to create_comparison_data or reduce_comparison_data. |
mus,nus | The hyperparameters of the Dirichlet priors for the m and u parameters, respectively, for the comparisons among matches and non-matches. If NA (the default), flat priors are used (recommended; see Details). |
flat | A numeric indicator of whether a flat prior for partitions should be used: 1 for the flat prior, 0 for the structured prior. |
alphas | The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K - 1 (e.g. rep(1, 7) in the Examples, where K = 3). |
dup_upper_bound | A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], and should be 1 for files assumed to have no duplicates. |
dup_count_prior_family | A character vector indicating, for each file, the prior family for the number of duplicates in each cluster. Currently "Poisson" is supported (see Details). |
dup_count_prior_pars | A list containing, for each file, the parameters of the prior for the number of duplicates in each cluster; for the "Poisson" family this is the mean (the Examples use list(c(1), c(1), c(1))). |
n_prior_family | A character string indicating the prior family for n, the number of clusters represented in the records. Currently "uniform" and "scale" are supported (see Details). |
n_prior_pars | Currently set to NA; the supported prior families for n do not require additional parameters. |
Details
The purpose of this function is to specify prior distributions for all parameters of the model. Please note that if reduce_comparison_data is used to reduce the number of record pairs that are potential matches, then the output of reduce_comparison_data (not create_comparison_data) should be used as input.
For the hyperparameters of the Dirichlet priors for the m and u parameters for the comparisons among matches and non-matches, respectively, we recommend using a flat prior. This is accomplished by setting mus = NA and nus = NA. Informative prior specifications are possible, but in practice they will be overwhelmed by the large number of comparisons.
For the prior for partitions, we do not recommend using a flat prior. Instead we recommend using our structured prior for partitions. By setting flat = 0 and the remaining arguments to NA, one obtains the default specification for the structured prior that we have found to perform well in simulation studies. The structured prior for partitions is specified as follows:
1. Specify a prior for n, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records using reduce_comparison_data. Currently, a uniform prior and a scale prior for n are supported. Our default specification uses a uniform prior.
2. Specify a prior for the overlap table (see the documentation for alphas for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to 1 (the size of this table is illustrated numerically below).
3. For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through dup_upper_bound. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to 1. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean 1.
Please contact the package maintainer if you need new prior families for n or the number of duplicates in each cluster to be supported.
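To make the overlap table component concrete: with K = 3 files, a cluster can appear in any non-empty subset of the files, so the overlap table has 2^K - 1 = 7 cells, which is why the Examples pass alphas = rep(1, 7). This reading follows the referenced paper; the snippet below is only a numeric illustration.

K <- 3
2^K - 1                 # number of non-empty subsets of 3 files

# The default specification places a hyperparameter of 1 on each cell of the
# overlap table, as in the Examples
alphas <- rep(1, 2^K - 1)
alphas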
Value
a list containing:
- mus: The hyperparameters of the Dirichlet priors for the m parameters for the comparisons among matches.
- nus: The hyperparameters of the Dirichlet priors for the u parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data.
- flat: A numeric indicator of whether a flat prior for partitions should be used. flat is 1 if a flat prior is used, and flat is 0 if a structured prior is used.
- no_dups: A numeric indicator of whether no duplicates are allowed in all of the files.
- alphas: The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive numeric vector of length 2 ^ comparison_list$K, where the first element is 0.
- alpha_0: The sum of alphas.
- dup_upper_bound: A numeric vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given file k, dup_upper_bound[k] should be between 1 and comparison_list$file_sizes[k], i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file.
- log_dup_count_prior: A list containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.
- log_n_prior: A numeric vector containing the log density of the prior distribution for the number of clusters represented in the records.
- nus_specified: The nus before data from comparisons of record pairs that were declared to not be potential matches using reduce_comparison_data are added. Used for input checking.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = no_dup_data_small$file_sizes,
    duplicates = c(0, 0, 0))

# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
    alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
    dup_count_prior_family = NA, dup_count_prior_pars = NA,
    n_prior_family = "uniform", n_prior_pars = NA)

# Example with small duplicate dataset
data(dup_data_small)

# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
    types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
    breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5), c(0, 0.25, 0.5),
                  c(0, 0.25, 0.5), NA, NA),
    file_sizes = dup_data_small$file_sizes,
    duplicates = c(1, 1, 1))

# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
    (comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
    pairs_to_keep, cc = 1)

# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
    flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
    dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
    dup_count_prior_pars = list(c(1), c(1), c(1)),
    n_prior_family = "uniform", n_prior_pars = NA)