Movatterモバイル変換


[0]ホーム

URL:


Title:Multifile Record Linkage and Duplicate Detection
Version:0.1.1
Description:Implementation of the methodology of Aleshin-Guendel & Sadinle (2022) <doi:10.1080/01621459.2021.2013242>. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.
Depends:R (≥ 3.5.0)
License:GPL-3
Encoding:UTF-8
LazyData:true
RoxygenNote:7.1.2
URL:https://github.com/aleshing/multilink
BugReports:https://github.com/aleshing/multilink/issues
Imports:igraph, RecordLinkage, Rcpp, utils, mcclust, geosphere,stringr
LinkingTo:Rcpp, RcppArmadillo
NeedsCompilation:yes
Packaged:2023-06-08 20:25:20 UTC; sergealeshin-guendel
Author:Serge Aleshin-Guendel [aut, cre]
Maintainer:Serge Aleshin-Guendel <saleshinguendel@gmail.com>
Repository:CRAN
Date/Publication:2023-06-09 14:20:07 UTC

Create Comparison Data

Description

Create comparison data for all pairs of records, except for those records infiles which are assumed to have no duplicates.

Usage

create_comparison_data(  records,  types,  breaks,  file_sizes,  duplicates,  verbose = TRUE)

Arguments

records

Adata.frame containing the records to be linked, whereeach column ofrecords is a field to be compared. If there aremultiple files,records should be obtained by stacking the files ontop of each other so thatrecords[1:file_sizes[1], ] contains therecords for file1,records[(file_sizes[1] + 1):(file_sizes[1] + file_sizes[2]), ]contains the records for file2, and so on. Missing values should becoded asNA.

types

Acharacter vector, indicating the comparison to be usedfor each field (i.e. each column ofrecords). The options are:"bi" for binary comparisons,"nu" for numeric comparisons(absolute difference),"lv" for string comparisons (normalizedLevenshtein distance),"lv_sep" for string comparisons (normalizedLevenshtein distance) where each string may contain multiple spellingsseparated by the "|" character. We assume that fields using options"bi","lv", and"lv_sep" are of classcharacter, and fields using the"nu" option are of classnumeric. For fields using the"lv_sep" option, for each recordpair the normalized Levenshtein distance is computed between each possiblespelling, and the minimum normalized Levenshtein distance between spellingsis then used as the comparison for that record pair.

breaks

Alist, the same length astypes, indicating thebreak points used to compute disagreement levels for each fields'comparisons. Iftypes[f]="bi",breaks[[f]] is ignored (and thuscan be set toNA). See Details for more information on specifying thisargument.

file_sizes

Anumeric vector indicating the size of each file.

duplicates

Anumeric vector indicating which files are assumedto have duplicates.duplicates[k] should be1 if filekhas duplicates, andduplicates[k] should be0 if filekhas no duplicates. If any files do not have duplicates, we strongly recommendthat the largest such file is organized to be the first file.

verbose

Alogical indicator of whether progress messages shouldbe print (defaultTRUE).

Details

The purpose of this function is to construct comparison vectors for each pairof records. In order to construct these vectors, one needs to specify thetypes andbreaks arguments. Thetypes argument specifieshow each field should be compared, and thebreaks argument specifieshow to discretize these comparisons.

Currently, thetypes argument supports three types of fieldcomparisons: binary, absolute difference, and the normalized Levenshteindistance. Please contact the package maintainer if you need a new type ofcomparison to be supported.

Thebreaks argument should be alist, with with one element foreach field. If a field is being compared with a binary comparison, i.e.types[f]="bi", then the corresponding element ofbreaks shouldbeNA, i.e.breaks[[f]]=NA. If a field is being compared with anumeric or string comparison, then the corresponding element ofbreaksshould be a vector of cut points used to discretize the comparisons. To givemore detail, suppose you pass in cut pointsbreaks[[f]]=c(cut_1, ...,cut_L). These cut pointsdiscretize the range of the comparisons intoL+1 intervals:I_0=(-\infty, cut_1], I_1=(cut_1, cut_2], ..., I_L=(cut_L, \infty]. Theraw comparisons, which lie in[0,\infty) for numeric comparisons and[0,1] for string comparisons, are then replaced with indicators ofwhich interval the comparisons lie in. The intervalI_0 corresponds tothe lowest level of disagreement for a comparison, while the intervalI_L corresponds to the highest level of disagreement for a comparison.

Value

a list containing:

record_pairs

Adata.frame, where each rowcontains the pair of records being compared in the corresponding row ofcomparisons. The rows are sorted in ascending order according to thefirst column, with ties broken according to the second column in ascendingorder. For any given row, the first column is less than the second column,i.e.record_pairs[i, 1] < record_pairs[i, 2] for each rowi.

comparisons

Alogical matrix, where each row containsthe comparisons for the record pair in the corresponding row ofrecord_pairs. Comparisons are in the same order as the columns ofrecords, and are represented byL + 1 columns ofTRUE/FALSE indicators, whereL + 1 is the number ofdisagreement levels for the field based onbreaks.

K

The number of files, assumed to be of classnumeric.

file_sizes

Anumeric vector of lengthK,indicating the size of each file.

duplicates

Anumeric vector of lengthK,indicating which files are assumed to have duplicates.duplicates[k]should be1 if filek has duplicates, andduplicates[k] should be0 if filek has no duplicates.If any files do not have duplicates, we strongly recommend that the largestsuch file is organized to be the first file.

field_levels

Anumeric vector indicating the number ofdisagreement levels for each field.

file_labels

Aninteger vector of lengthsum(file_sizes), wherefile_labels[i] indicates which filerecordi is in.

fp_matrix

Aninteger matrix, wherefp_matrix[k1, k2] is a label for the file pair(k1, k2). Notethatfp_matrix[k1, k2] = fp_matrix[k2, k1].

rp_to_fp

Alogical matrix that indicates which recordpairs belong to which file pairs.rp_to_fp[fp, rp] isTRUE ifthe recordsrecord_pairs[rp, ] belong to the file pairfp,and is FALSE otherwise. Note thatfp is given by the labeling infp_matrix.

ab

Aninteger vector, of lengthncol(comparisons) * K * (K + 1) / 2 that indicates how many recordpairs there are with a given disagreement level for a given field, for eachfile pair.

file_sizes_not_included

Anumeric vector of0s.This element is non-zero whenreduce_comparison_data isused.

ab_not_included

Anumeric vector of0s. Thiselement is non-zero whenreduce_comparison_data is used.

labels

NA. This element is notNA whenreduce_comparison_data is used.

pairs_to_keep

NA. This element is notNA whenreduce_comparison_data is used.

cc

0. This element is non-zero whenreduce_comparison_data is used.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

## Example with small no duplicate datasetdata(no_dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(no_dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = no_dup_data_small$file_sizes, duplicates = c(0, 0, 0))## Example with small duplicate datasetdata(dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = dup_data_small$file_sizes, duplicates = c(1, 1, 1))

Duplicate Dataset

Description

A dataset containing867 simulated records from3 files withno duplicate records in each file.

Usage

dup_data

Format

A list with three elements:

records

Adata.frame with the records, containing7fields, from all three files, in the format used for input tocreate_comparison_data.

file_sizes

The size of each file.

IDs

The true partition of the records, represented as aninteger vector of arbitrary labels of lengthsum(file_sizes).

Source

Extracted from the datasets used in the simulation study of thepaper. The datasets were generated using code from Peter Christen's grouphttps://dmm.anu.edu.au/geco/index.php.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

data(dup_data)# There are 500 entities represented in the recordslength(unique(dup_data$IDs))

Small Duplicate Dataset

Description

A dataset containing96 simulated records from3 files withno duplicate records in each file, subset fromdup_data.

Usage

dup_data_small

Format

A list with three elements:

records

Adata.frame with the records, containing7fields, from all three files, in the format used for input tocreate_comparison_data.

file_sizes

The size of each file.

IDs

The true partition of the records, represented as aninteger vector of arbitrary labels of lengthsum(file_sizes).

Source

Extracted from the datasets used in the simulation study of thepaper. The datasets were generated using code from Peter Christen's grouphttps://dmm.anu.edu.au/geco/index.php.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

data(dup_data_small)# There are 96 entities represented in the recordslength(unique(dup_data_small$IDs))

Find the Bayes Estimate of a Partition

Description

Find the (approximate) Bayes estimate of a partition based on MCMC samplesof the partition and a specified loss function.

Usage

find_bayes_estimate(  partitions,  burn_in,  L_FNM = 1,  L_FM1 = 1,  L_FM2 = 2,  L_A = Inf,  max_cc_size = nrow(partitions),  verbose = TRUE)

Arguments

partitions

Posterior samples of the partition, where each columnis one sample and the partition is represented as aninteger vector ofarbitrary labels, as produced by the output of a call togibbs_sampler.

burn_in

The number of samples to discard for burn in.

L_FNM

Positive loss for a false non-match. Default is1.

L_FM1

Positive loss for a type 1 false match. Default is1.

L_FM2

Positive loss for a type 2 false match. Default is2.

L_A

Positive loss for abstaining from making a decision for a record.Default isInf, i.e. decisions are made for all records.

max_cc_size

The maximum allowable connected component size over whichthe posterior expected loss is minimized. Default isnrow(partitions),i.e. no approximation is used. Whenis.infinite(L_A), we recommendsetting this argument to50, then increasing based on a computationalbudget. When!is.infinite(L_A), we recommend setting this argument to10-12, then increasing based on a computational budget (although anincrease of1 in this argument can in the worst case lead to adoubling in computation time).

verbose

Alogical indicator of whether progress messages shouldbe print (defaultTRUE).

Value

A vector, the same length of a column ofpartitions containing the(approximate) Bayes estimate of the partition. If!is.infinite(L_A)the output may be a partial estimate. A positive numberl in indexi indicates that recordi is in the same cluster as every otherrecordj withl in indexj. A value of-1 inindexi indicates that the Bayes estimate abstained from making adecision for recordi.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

# Example with small no duplicate datasetdata(no_dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(no_dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = no_dup_data_small$file_sizes, duplicates = c(0, 0, 0))# Specify the priorprior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0, alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1), dup_count_prior_family = NA, dup_count_prior_pars = NA, n_prior_family = "uniform", n_prior_pars = NA)# Find initialization for the matching (this step is optional)# The following line corresponds to only keeping pairs of records as# potential matches in the initialization for which neither gname nor fname# disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)# Run the Gibbs samplerresults <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000, Z_init = Z_init, seed = 42)# Find the full Bayes estimatefull_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)# Find the partial Bayes estimatepartial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)# Example with small duplicate datasetdata(dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = dup_data_small$file_sizes, duplicates = c(1, 1, 1))# Reduce the comparison data# The following line corresponds to only keeping pairs of records for which# neither gname nor fname disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)reduced_comparison_list <- reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)# Specify the priorprior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA, flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10), dup_count_prior_family = c("Poisson", "Poisson", "Poisson"), dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform", n_prior_pars = NA)# Run the Gibbs samplerresults <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000, seed = 42)# Find the full Bayes estimatefull_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)# Find the partial Bayes estimatepartial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)

Gibbs Sampler for Posterior Inference

Description

Run a Gibbs sampler to explore the posterior distribution of partitions ofrecords.

Usage

gibbs_sampler(  comparison_list,  prior_list,  n_iter = 2000,  Z_init = 1:sum(comparison_list$file_sizes),  seed = 70,  single_likelihood = FALSE,  chaperones_info = NA,  verbose = TRUE)

Arguments

comparison_list

The output from a call tocreate_comparison_data orreduce_comparison_data.

prior_list

The output from a call tospecify_prior.

n_iter

The number of iterations of the Gibbs sampler to run.

Z_init

Initialization of the partition of records, represented as aninteger vector of arbitrary labels of lengthsum(comparison_list$file_sizes). The default initialization placeseach record in its own cluster. Seeinitialize_partition for analternative initialization when there are no duplicates in each file.

seed

The seed to use while running the Gibbs sampler.

single_likelihood

Alogical indicator of whether to use asingle likelihood for comparisons for all file pairs, or whether to use aseparate likelihood for comparisons for each file pair. Whensingle_likelihood=TRUE, a single likelihood is used, and the priorhyperparameters form andu from the first file pair are used.We do not recommend using a single likelihood in general.

chaperones_info

Ifchaperones_info is set toNA, thenGibbs updates to the partition are used during the Gibbs sampler, asdescribed in Aleshin-Guendel & Sadinle (2022). Else, Chaperones updates,as described in Miller et al. (2015) and Betancourt et al. (2016), are usedandchaperones_info should be alist with five elementscontrolling Chaperones updates to the partition during the Gibbs sampler:chap_type,num_chap_iter,nonuniform_chap_type,extra_gibbs,num_restrict.chap_type is0 ifusing a uniform Chaperones distribution, and1 ifusing a nonuniform Chaperones distribution.num_chap_iter is thenumber of Chaperones updates to the partition that are made during eachiteration of the Gibbs sampler. When using a nonuniform Chaperonesdistribution,nonuniform_chap_type is0 if using the exactversion, or1 if using the partial version.extra_gibbs is alogical indicator of whether a Gibbs update to the partition should bedone after the Chaperones updates, at each iteration of the Gibbs sampler.num_restrict is the number of restricted Gibbs steps to take duringeach Chaperones update to the partition.

verbose

Alogical indicator of whether progress messages shouldbe print (defaultTRUE).

Details

Given the prior specified usingspecify_prior, this functionruns a Gibbs sampler to explore the posterior distribution of partitions ofrecords, conditional on the comparison data created usingcreate_comparison_data orreduce_comparison_data.

Value

a list containing:

m

Posterior samples of them parameters. Each columnis one sample.

u

Posterior samples of theu parameters. Each columnis one sample.

partitions

Posterior samples of the partition. Each columnis one sample. Note that the partition is represented as anintegervector of arbitrary labels of lengthsum(comparison_list$file_sizes).

contingency_tables

Posterior samples of the overlap table.Each column is one sample. This incorporates counts of records determinednot to be candidate matches to any other records usingreduce_comparison_data.

cluster_sizes

Posterior samples of the size of each cluster(associated with an arbitrary label from1 tosum(comparison_list$file_sizes)). Each column is one sample.

sampling_time

The time in seconds it took to run thesampler.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, & Rebecca C. Steorts (2015).Microclustering: When the cluster sizes grow sublinearly with the size of the data set.NeurIPS Bayesian Nonparametrics: The Next Generation Workshop Series. [arXiv]

Brenda Betancourt, Giacomo Zanella, Jeffrey Miller, Hanna Wallach, Abbas Zaidi, & Rebecca C. Steorts (2016).Flexible Models for Microclustering with Application to Entity Resolution.Advances in neural information processing systems. [Published] [arXiv]

Examples

# Example with small no duplicate datasetdata(no_dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(no_dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = no_dup_data_small$file_sizes, duplicates = c(0, 0, 0))# Specify the priorprior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0, alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1), dup_count_prior_family = NA, dup_count_prior_pars = NA, n_prior_family = "uniform", n_prior_pars = NA)# Find initialization for the matching (this step is optional)# The following line corresponds to only keeping pairs of records as# potential matches in the initialization for which neither gname nor fname# disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)# Run the Gibbs sampler{results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000, Z_init = Z_init, seed = 42)}# Example with small duplicate datasetdata(dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = dup_data_small$file_sizes, duplicates = c(1, 1, 1))# Reduce the comparison data# The following line corresponds to only keeping pairs of records for which# neither gname nor fname disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)reduced_comparison_list <- reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)# Specify the priorprior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA, flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10), dup_count_prior_family = c("Poisson", "Poisson", "Poisson"), dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform", n_prior_pars = NA)# Run the Gibbs sampler{results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000, seed = 42)}

Initialize the Partition

Description

Generate an initialization for the partition in the case when it is assumedthere are no duplicates in all files (so that the partition is a matching).

Usage

initialize_partition(comparison_list, pairs_to_keep, seed = NA)

Arguments

comparison_list

the output from a call tocreate_comparison_data orreduce_comparison_data.Note that in order to correctly specify the initialization, ifreduce_comparison_data is used to the reduce the number ofrecord pairs that are candidate matches, then the output ofreduce_comparison_data (notcreate_comparison_data) should be used for this argument.

pairs_to_keep

Alogical vector, the same length ascomparison_list$record_pairs, indicating which record pairs arepotential matches in the initialization.

seed

The seed to use to generate the initialization.

Details

When it is assumed that there are no duplicates in all files, andreduce_comparison_data is not used to reduce the number ofpotential matches, the Gibbs sampler used for posterior inference mayexperience slow mixing when using an initialization for the partition whereeach record is in its own cluster (the default option for the Gibbs sampler).The purpose of this function is to provide an alternative initializationscheme.

To use this initialization scheme, the user passes in alogical vectorthat indicates which record pairs are potential matches according to anindexing method (as inreduce_comparison_data). Note that thisindexing is only used to generate the initialization, it is not used forinference. The initialization scheme first finds the transitive closure ofthe potential matches, which partitions the records into blocks. Within eachblock of records, the scheme randomly selects a record from each file, andthese selected records are then placed in the same cluster for the partitioninitialization. All other records are placed in their own clusters.

Value

aninteger vector of arbitrary labels of lengthsum(comparison_list$file_sizes), giving an initialization for thepartition.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

# Example with small no duplicate datasetdata(no_dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(no_dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = no_dup_data_small$file_sizes, duplicates = c(0, 0, 0))# Find initialization for the matching# The following line corresponds to only keeping pairs of records as# potential matches in the initialization for which neither gname nor fname# disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)

Description

The multilink package implements the methodology of Aleshin-Guendel & Sadinle(2022). It handles the general problem of multifile record linkage andduplicate detection, where any number of files are to be linked, and any ofthe files may have duplicates.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]

Examples

# Here we demonstrate an example workflow with the small no duplicate datasetdata(no_dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(no_dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = no_dup_data_small$file_sizes, duplicates = c(0, 0, 0))# Specify the priorprior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0, alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1), dup_count_prior_family = NA, dup_count_prior_pars = NA, n_prior_family = "uniform", n_prior_pars = NA)# Find initialization for the matching (this step is optional)# The following line corresponds to only keeping pairs of records as# potential matches in the initialization for which neither gname nor fname# disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)# Run the Gibbs samplerresults <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000, Z_init = Z_init, seed = 42)# Find the full Bayes estimatefull_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)# The number of clusters in the full estimatelength(unique(full_estimate))# The number of entities represented in the recordslength(unique(no_dup_data_small$IDs))# Find which record pairs are truly coreferent based on IDstrue_links <- no_dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==no_dup_data_small$IDs[comparison_list$record_pairs[, 2]]# Find which record pairs are in the same clusters in the full estimatefull_estimate_links <- full_estimate[comparison_list$record_pairs[, 1]] ==full_estimate[comparison_list$record_pairs[, 2]]# Find the number of true matches in the full estimatetrue_matches <- sum(full_estimate_links & true_links)# Precision of the full estimatetrue_matches / sum(full_estimate_links)# Recall of the full estimatetrue_matches / sum(true_links)# Find the partial Bayes estimatepartial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)# The partial estimate abstains from making decisions for how many records?sum(partial_estimate == -1)# For the records which decisions were made for in the partial estimate,# there are how many clusters?length(unique(partial_estimate))# Abstain rate of partial_estimatesum(partial_estimate == -1) / length(partial_estimate)# Relabel records where we abstainedpartial_estimate[which(partial_estimate == -1)] <- length(partial_estimate) +which(partial_estimate == -1)# Find which record pairs are in the same clusters in the full estimatepartial_estimate_links <- partial_estimate[comparison_list$record_pairs[, 1]] == partial_estimate[comparison_list$record_pairs[, 2]]# Find the number of true matches in the partial estimatetrue_matches_A <- sum(partial_estimate_links & true_links)# Precision of the partial estimatetrue_matches_A / sum(partial_estimate_links)# Here we demonstrate an example workflow with the small duplicate datasetdata(dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = dup_data_small$file_sizes, duplicates = c(1, 1, 1))# Reduce the comparison data# The following line corresponds to only keeping pairs of records for which# neither gname nor fname disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)reduced_comparison_list <- reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)# Specify the priorprior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA, flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10), dup_count_prior_family = c("Poisson", "Poisson", "Poisson"), dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform", n_prior_pars = NA)# Run the Gibbs samplerresults <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000, seed = 42)# Find the full Bayes estimatefull_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)# The number of  clusters in the full estimate (including records records# determined not to be candidate matches to any other records using# reduce_comparison_data)length(unique(full_estimate)) +sum(reduced_comparison_list$file_sizes_not_included)# The number of entities represented in the recordslength(unique(dup_data_small$IDs))# Find which record pairs are truly coreferent based on IDstrue_links <- dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==dup_data_small$IDs[comparison_list$record_pairs[, 2]]# Focus on the record pairs that were candidate matchestrue_links_reduced <- true_links[reduced_comparison_list$pairs_to_keep]# Calculate the number of prior false non-matches based on the indexing# scheme usedprior_fnm <- nrow(comparison_list$record_pairs[true_links & (!reduced_comparison_list$pairs_to_keep), ])# Find which record pairs are in the same clusters in the full estimatefull_estimate_links <- full_estimate[reduced_comparison_list$record_pairs[, 1]] == full_estimate[reduced_comparison_list$record_pairs[, 2]]# Find the number of true matches in the full estimatetrue_matches <- sum(full_estimate_links & true_links_reduced)# Precision of the full estimatetrue_matches / sum(full_estimate_links)# Recall of the full estimatetrue_matches / (sum(true_links_reduced) + prior_fnm)# Find the partial Bayes estimatepartial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)# The partial estimate abstains from making decisions for how many records?sum(partial_estimate == -1)# For the records which decisions were made for in the partial estimate,# there are how many clusters? (including records determined not to be# candidate matches to any other records using reduce_comparison_data)length(unique(partial_estimate)) + sum(reduced_comparison_list$file_sizes_not_included)# Abstain rate of partial_estimat (excluding records determined not# to be candidate matches to any other records using reduce_comparison_data)sum(partial_estimate == -1) / length(partial_estimate)# Relabel records where we abstainedpartial_estimate[which(partial_estimate == -1)] <- length(partial_estimate) +which(partial_estimate == -1)# Find which record pairs are in the same clusters in the full estimatepartial_estimate_links <- partial_estimate[reduced_comparison_list$record_pairs[, 1]] == partial_estimate[reduced_comparison_list$record_pairs[, 2]]# Find the number of true matches in the partial estimatetrue_matches_A <- sum(partial_estimate_links & true_links_reduced)# Precision of the partial estimatetrue_matches_A / sum(partial_estimate_links)# Relabel the full and partial Bayes estimatesfull_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list, full_estimate)partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list, partial_estimate)# Add columns to the records corresponding to their full and partial# Bayes estimatesdup_data_small$records <- cbind(dup_data_small$records, full_estimate_id = full_estimate_relabel$link_id, partial_estimate_id = partial_estimate_relabel$link_id)

No Duplicate Dataset

Description

A dataset containing730 simulated records from3 files withno duplicate records in each file.

Usage

no_dup_data

Format

A list with three elements:

records

Adata.frame with the records, containing7fields, from all three files, in the format used for input tocreate_comparison_data.

file_sizes

The size of each file.

IDs

The true partition of the records, represented as aninteger vector of arbitrary labels of lengthsum(file_sizes).

Source

Extracted from the datasets used in the simulation study of thepaper. The datasets were generated using code from Peter Christen's grouphttps://dmm.anu.edu.au/geco/index.php.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]

Examples

data(no_dup_data)# There are 500 entities represented in the recordslength(unique(no_dup_data$IDs))

Small No Duplicate Dataset

Description

A dataset containing71 simulated records from3 files withno duplicate records in each file, subset fromno_dup_data.

Usage

no_dup_data_small

Format

A list with three elements:

records

Adata.frame with the records, containing7fields, from all three files, in the format used for input tocreate_comparison_data.

file_sizes

The size of each file.

IDs

The true partition of the records, represented as aninteger vector of arbitrary labels of lengthsum(file_sizes).

Source

Extracted from the datasets used in the simulation study of thepaper. The datasets were generated using code from Peter Christen's grouphttps://dmm.anu.edu.au/geco/index.php.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]

Examples

data(no_dup_data_small)# There are 71 entities represented in the recordslength(unique(no_dup_data_small$IDs))

Reduce Comparison Data Size

Description

Use indexing to reduce the number of record pairs that are potential matches.

Usage

reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)

Arguments

comparison_list

The output of a call tocreate_comparison_data.

pairs_to_keep

Alogical vector, the same length ascomparison_list$record_pairs, indicating which record pairs should bekept as potential matches. These potential matches do not have to betransitive (see the argumentcc).

cc

Anumeric indicator of whether to find the transitiveclosure ofpairs_to_keep, and use these potential matches insteadof just those frompairs_to_keep.cc should be1 if thetransitive closure is being used, andcc should be0 if thetransitive closure is not being used. We recommend settingcc to1.

Details

When using comparison-based record linkage methods, scalability is a concern,as the number of record pairs is quadratic in the number of records. Inorder to address these concerns, it's common to declare certain record pairsto not be potential matches a priori, using indexing methods. The user isfree to index using any method they like, as long as they can produce alogical vector that indicates which record pairs are potential matchesaccording to their indexing method. We recommend, if the user chosen indexingmethod does not output potential matches that are transitive, to set thecc argument to1. By transitive we mean, for any three recordsi,j, andk, ifi andj are potential matches,andj andk are potential matches, theni andk arepotential matches. Non-transitive indexing schemes can lead to poor mixing ofthe Gibbs sampler used for posterior inference, and suggests that theindexing method used may have been too stringent.

If indexing is used, it may be the case that some records are declared to notbe potential matches to any other records. In this case, the indexing methodhas made the decision that these records have no matches, and thus we canremove them from the data set and relabel the remaining records; see thedocumentation forlabels for information on how to go between theoriginal labeling and the new labeling.

If indexing is used, comparisons for record pairs that aren't potentialmatches are still used during inference, where they're used to inform thedistribution of comparisons for non-matches.

Value

a list containing:

record_pairs

Adata.frame, where each rowcontains the pair of records being compared in the corresponding row ofcomparisons. The rows are sorted in ascending order according to thefirst column, with ties broken according to the second column in ascendingorder. For any given row, the first column is less than the second column,i.e.record_pairs[i, 1] < record_pairs[i, 2] for each rowi.If according topairs_to_keep there are records which are notpotential matches to any other records, the remaining records arerelabeled (seelabels).

comparisons

Alogical matrix, where each row containsthe comparisons between the record pair in the corresponding row ofrecord_pairs. Comparisons are in the same order as the columns ofrecords, and are represented byL + 1 columns ofTRUE/FALSE indicators, whereL + 1 is the number ofdisagreement levels for the field based onbreaks.

K

The number of files, assumed to be of classnumeric.

file_sizes

Anumeric vector of lengthK,indicating the size of each file. If according topairs_to_keepthere are records which are not potential matches to any other records, theremaining records are relabeled (seelabels), andfile_sizesnow represents the sizes of each file after removing such records.

duplicates

Anumeric vector of lengthK,indicating which files are assumed to have duplicates.duplicates[k]should be1 if filek has duplicates, andduplicates[k] should be0 if filek has noduplicates.

field_levels

Anumeric vector indicating the number ofdisagreement levels for each field.

file_labels

Aninteger vector of lengthsum(file_sizes), wherefile_labels[i] indicated which filerecordi is in.

fp_matrix

Aninteger matrix, wherefp_matrix[k1, k2] is a label for the file pair(k1, k2). Notethatfp_matrix[k1, k2] = fp_matrix[k2, k1].

rp_to_fp

Alogical matrix that indicates which recordpairs belong to which file pairs.rp_to_fp[fp, rp] isTRUE ifthe recordsrecord_pairs[rp, ] belong to the file pairfp,and is FALSE otherwise. Note thatfp is given by the labeling infp_matrix.

ab

Aninteger vector, of lengthncol(comparisons) * K * (K + 1) / 2 that indicates how many recordpairs there are with a given disagreement level for a given field, for eachfile pair.

file_sizes_not_included

If according topairs_to_keepthere are records which are not potential matches to any other records, theremaining records are relabeled (seelabels), andfile_sizes_not_included indicates, for each file, the number of suchrecords that were removed.

ab_not_included

For record pairs not included according topairs_to_keep, this is aninteger vector, of lengthncol(comparisons) * K * (K + 1) / 2 that indicates how many recordpairs there are with a given disagreement level for a given field, for eachfile pair.

labels

If according topairs_to_keepthere are records which are not potential matches to any other records, theremaining records are relabeled.labels provides a dictionary thatindicates, for each of the new labels, which record in the originallabeling the new label corresponds to. In particular, the first columnindicates the record in the original labeling, and the second columnindicates the new labeling.

pairs_to_keep

Alogical vector, the same length ascomparison_list$record_pairs, indicating which record pairs werekept as potential matches. This may not be the same as the inputpairs_to_keep ifcc was set to 1.

cc

Anumeric indicator of whether the connectedcomponents of the potential matches are closed under transitivity.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

# Example with small duplicate datasetdata(dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = dup_data_small$file_sizes, duplicates = c(1, 1, 1))# Reduce the comparison data# The following line corresponds to only keeping pairs of records for which# neither gname nor fname disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)reduced_comparison_list <- reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)

Relabel the Bayes Estimate of a Partition

Description

Relabel the Bayes estimate of a partition, for use after using indexing toreduce the number of record pairs that are potential matches.

Usage

relabel_bayes_estimate(reduced_comparison_list, bayes_estimate)

Arguments

reduced_comparison_list

The output from a call toreduce_comparison_data.

bayes_estimate

The output from a call tofind_bayes_estimate.

Details

When the functionreduce_comparison_data is used to reduce thenumber of record pairs that are potential matches, it may be the case thatsome records are declared to not be potential matches to any other records.In this case, the indexing method has made the decision that these recordshave no matches, and thus we can remove them from the data set and relabelthe remaining records; see the documentation forlabels inreduce_comparison_data for information on how to go between theoriginal labeling and the new labeling. The purpose of this function is torelabel the output offind_bayes_estimate when the functionreduce_comparison_data is used, so that the user doesn't haveto do this relabeling themselves.

Value

Adata.frame, with as many rows assum(reduced_comparison_list$file_sizes +reduced_comparison_list$file_sizes_not_included), i.e. the number ofrecords originally input tocreate_comparison_data, beforeindexing occurred. Thisdata.frame has two columns,"original_labels" and"link_id". Given rowi ofrecords originally input tocreate_comparison_data,the linkage id according tobayes_estimate is given by theithrow of thelink_id column. See the documentation forfind_bayes_estimate for information on how to interpret thislinkage id.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242][arXiv]

Examples

# Example with small duplicate datasetdata(dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = dup_data_small$file_sizes, duplicates = c(1, 1, 1))# Reduce the comparison data# The following line corresponds to only keeping pairs of records for which# neither gname nor fname disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)reduced_comparison_list <- reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)# Specify the priorprior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA, flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10), dup_count_prior_family = c("Poisson", "Poisson", "Poisson"), dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform", n_prior_pars = NA)# Run the Gibbs sampler{results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000, seed = 42)# Find the full Bayes estimatefull_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)# Find the partial Bayes estimatepartial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100, L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)# Relabel the full and partial Bayes estimatesfull_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list, full_estimate)partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list, partial_estimate)# Add columns to the records corresponding to their full and partial# Bayes estimatesdup_data_small$records <- cbind(dup_data_small$records, full_estimate_id = full_estimate_relabel$link_id, partial_estimate_id = partial_estimate_relabel$link_id)}

Specify the Prior Distributions

Description

Specify the prior distributions for them andu parameters of themodels for comparison data among matches and non-matches, and the partition.

Usage

specify_prior(  comparison_list,  mus = NA,  nus = NA,  flat = 0,  alphas = NA,  dup_upper_bound = NA,  dup_count_prior_family = NA,  dup_count_prior_pars = NA,  n_prior_family = NA,  n_prior_pars = NA)

Arguments

comparison_list

the output from a call tocreate_comparison_data orreduce_comparison_data.Note that in order to correctly specify the prior, ifreduce_comparison_data is used to the reduce the number ofrecord pairs that are potential matches, then the output ofreduce_comparison_data (notcreate_comparison_data) should be used for this argument.

mus,nus

The hyperparameters of the Dirichlet priors for themandu parameters for the comparisons among matches and non-matches,respectively. These are positivenumeric vectors which have lengthequal to the number of columns ofcomparison_list$comparisons timesthe number of file pairs(comparison_list$K * (comparison_list$K + 1) / 2). If set toNA, flat priors are used. We recommend using flat priors formandu.

flat

Anumeric indicator of whether a flat prior for partitionsshould be used.flat should be1 if a flat prior is used, andflat should be0 if a structured prior is used. If a flat prioris used, the remaining arguments should be set toNA. Otherwise, theremaining arguments should be specified. We do not recommend using a flatprior for partitions in general.

alphas

The hyperparameters for the Dirichlet-multinomial overlap tableprior, a positivenumeric vector of length2 ^ comparison_list$K - 1. The indexing of these hyperparameters isbased on the thecomparison_list$K-bit binary representation of theinclusion patterns of the overlap table. To give a few examples, supposecomparison_list$K is3.1 in3-bit binary is001, soalphas[1] is the hyperparameter for the001 cell of the overlap table, representing clusters containing onlyrecords from the third file.2 in3-bit binary is010, soalphas[2] is the hyperparameter for the010 cell of the overlap table, representing clusters containing onlyrecords from the second file.3 in3-bit binary is011, soalphas[3] is the hyperparameter for the011 cell of the overlap table, representing clusters containing onlyrecords from the second and third files. If set toNA, thehyperparameters will all be set to1.

dup_upper_bound

Anumeric vector indicating the maximum numberof duplicates, from each file, allowed in each cluster. For a given filek,dup_upper_bound[k] should be between1 andcomparison_list$file_sizes[k], i.e. even if you don't want to imposean upper bound, you have to implicitly place an upper bound: the number ofrecords in a file. If set toNA, the upper bound for filekwill be set to1 if no duplicates are allowed for that file, orcomparison_list$file_sizes[k] if duplicates are allowed for that file.

dup_count_prior_family

Acharacter vector indicating theprior distribution family used for the number of duplicates in each cluster,for each file. Currently the only option is"Poisson" for a Poissonprior, truncated to lie between1 anddup_upper_bound[k]. Themean parameter of the Poisson distribution is specified using thedup_count_prior_pars argument. If set toNA, a Poisson priorwith mean1 will be used.

dup_count_prior_pars

Alist containing the parameters forthe prior distribution for the number of duplicates in each cluster, for eachfile. For filek, whendup_count_prior_family[k]="Poisson",dup_count_prior_pars[[k]] is a positive constant representing the meanof the Poisson prior.

n_prior_family

Acharacter indicating the prior distributionfamily used forn, the number of clusters represented in therecords. Note that this includes records determined not to be potentialmatches to any other records usingreduce_comparison_data.Currently the there are two options:"uniform" for a uniform priorforn, i.e.p(n) \propto 1, and"scale" for a scale priorforn, i.e.p(n) \propto 1/n. If set toNA, a uniformprior will be used.

n_prior_pars

Currently set toNA. When more prior distributionfamilies forn are implemented, this will be a vector of parametersfor those priors.

Details

The purpose of this function is to specify prior distributions for allparameters of the model. Please note that ifreduce_comparison_data is used to the reduce the number ofrecord pairs that are potential matches, then the output ofreduce_comparison_data (notcreate_comparison_data) should be used as input.

For the hyperparameters of the Dirichlet priors for themandu parameters for the comparisons among matches and non-matches,respectively, we recommend using a flat prior. This is accomplished bysettingmus=NA andnus=NA. Informative prior specificationsare possible, but in practice they will be overwhelmed by the large number ofcomparisons.

For the prior for partitions, we do not recommend using a flat prior. Insteadwe recommend using our structure prior for partitions. By settingflat=0 and the remaining arguments toNA, one obtains thedefault specification for the structured prior that we have found to performwell in simulation studies. The structured prior for partitions is specifiedas follows:

Please contact the package maintainer if you need new prior familiesforn or the number of duplicates in each cluster to be supported.

Value

a list containing:

mus

The hyperparameters of the Dirichlet priors for them parameters for the comparisons among matches.

nus

The hyperparameters of the Dirichlet priors for theu parameters for the comparisons among non-matches. Includes datafrom comparisons of record pairs that were declared to not be potentialmatches usingreduce_comparison_data.

flat

Anumeric indicator of whether a flat prior forpartitions should be used.flat is1 if a flat prior is used,andflat is0 if a structured prior is used.

no_dups

Anumeric indicator of whether no duplicatesare allowed in all of the files.

alphas

The hyperparameters for the Dirichlet-multinomialoverlap table prior, a positivenumeric vector of length2 ^ comparison_list$K, where the first element is0.

alpha_0

The sum ofalphas.

dup_upper_bound

Anumeric vector indicating themaximum number of duplicates, from each file, allowed in each cluster. Fora given filek,dup_upper_bound[k] should be between1andcomparison_list$file_sizes[k], i.e. even if you don't want toimpose an upper bound, you have to implicitly place an upper bound: thenumber of records in a file.

log_dup_count_prior

Alist containing the log densityof the prior distribution for the number of duplicates in each cluster, foreach file.

log_n_prior

Anumeric vector containing the logdensity of the prior distribution for the number of clusters represented inthe records.

nus_specified

Thenus before data from comparisons ofrecord pairs that were declared to not be potential matches usingreduce_comparison_data are added. Used for input checking.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection.Journal of theAmerican Statistical Association. [doi:10.1080/01621459.2021.2013242] [arXiv]

Examples

# Example with small no duplicate datasetdata(no_dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(no_dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = no_dup_data_small$file_sizes, duplicates = c(0, 0, 0))# Specify the priorprior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0, alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1), dup_count_prior_family = NA, dup_count_prior_pars = NA, n_prior_family = "uniform", n_prior_pars = NA)# Example with small duplicate datasetdata(dup_data_small)# Create the comparison datacomparison_list <- create_comparison_data(dup_data_small$records, types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"), breaks = list(NA,  c(0, 0.25, 0.5),  c(0, 0.25, 0.5),               c(0, 0.25, 0.5), c(0, 0.25, 0.5),  NA, NA), file_sizes = dup_data_small$file_sizes, duplicates = c(1, 1, 1))# Reduce the comparison data# The following line corresponds to only keeping pairs of records for which# neither gname nor fname disagree at the highest levelpairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) & (comparison_list$comparisons[, "fname_DL_3"] != TRUE)reduced_comparison_list <- reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)# Specify the priorprior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA, flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10), dup_count_prior_family = c("Poisson", "Poisson", "Poisson"), dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform", n_prior_pars = NA)

[8]ページ先頭

©2009-2025 Movatter.jp