Title: Cluster and Merge Similar Values Within a Character Vector
Version: 0.3.3
Description: These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool Open Refine <https://openrefine.org/>. More info on key collision and ngram fingerprint can be found here: <https://openrefine.org/docs/technical-reference/clustering-in-depth>.
Depends: R (≥ 3.0.2)
License: GPL-3
Encoding: UTF-8
Imports: Rcpp, stringdist (≥ 0.9.5.1), stringi
RoxygenNote: 7.2.3
LinkingTo: Rcpp, stringdist (≥ 0.9.5.1)
URL: https://github.com/ChrisMuir/refinr
BugReports: https://github.com/ChrisMuir/refinr/issues
Suggests: testthat, knitr, rmarkdown, dplyr
VignetteBuilder: knitr
NeedsCompilation: yes
Packaged: 2023-11-12 21:47:22 UTC; chrismuir
Author: Chris Muir [aut, cre]
Maintainer: Chris Muir <chrismuirRVA@gmail.com>
Repository: CRAN
Date/Publication: 2023-11-12 22:20:02 UTC

Cluster and Merge Similar Values Within a Character Vector

Description

These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool Open Refine.

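Documentation for Open Refine

Open Refine site: https://openrefine.org/

Details on the Open Refine clustering algorithms: https://openrefine.org/docs/technical-reference/clustering-in-depth

Development links

https://github.com/ChrisMuir/refinr

Report bugs at https://github.com/ChrisMuir/refinr/issues

refinr features the following functions

key_collision_merge

n_gram_merge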

Author(s)

Maintainer: Chris Muir <chrismuirRVA@gmail.com>

See Also

Useful links:

https://github.com/ChrisMuir/refinr

Report bugs at https://github.com/ChrisMuir/refinr/issues


Value merging based on Key Collision

Description

This function takes a character vector and makes edits and merges values that are approximately equivalent yet not identical. It clusters values based on the key collision method, described here: https://openrefine.org/docs/technical-reference/clustering-in-depth.
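As a rough illustration of the key collision idea, the base R sketch below builds a simplified fingerprint key (trim, lowercase, strip punctuation, sort and de-duplicate the tokens) and treats values that share a key as one cluster. The helper name fingerprint_key is hypothetical, and so is the choice to merge each cluster to its most frequent value. This is not the package's internal code, and it leaves out business suffix handling, the dict and ignore_strings arguments, and ASCII normalization.

```r
# Hypothetical helper: a simplified approximation of a key collision fingerprint.
fingerprint_key <- function(x) {
  x <- tolower(trimws(x))                  # trim and lowercase
  x <- gsub("[[:punct:]]", "", x)          # drop punctuation
  tokens <- strsplit(x, "[[:space:]]+")    # split on whitespace
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]                # discard empty tokens
    paste(sort(unique(tok)), collapse = " ")
  }, character(1))
}

x <- c("Acme Pizza, Inc.", "ACME PIZZA INC", "pizza, acme inc",
       "Acme Pizza, Inc.", "Nicks Pizza")
keys <- fingerprint_key(x)

# Values that share a key collide into the same cluster.
clusters <- split(x, keys)

# Assumption for this sketch: every member of a cluster is replaced by the
# cluster's most frequent original value.
merged <- ave(x, keys, FUN = function(v) names(which.max(table(v))))
merged
```

Here the first four values reduce to the key "acme inc pizza" and end up identical, while "Nicks Pizza" sits alone in its own cluster and is left unchanged.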

Usage

key_collision_merge(
  vect,
  ignore_strings = NULL,
  bus_suffix = TRUE,
  dict = NULL
)

Arguments

vect

Character vector, items to be potentially clustered and merged.

ignore_strings

Character vector, these strings will be ignored during the merging of values within vect. Default value is NULL.

bus_suffix

Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE.

dict

Character vector, meant to act as a dictionary during the merging process. If any items within vect have a match in dict, then those items will always be edited to be identical to their match in dict. Default value is NULL.

Value

Character vector with similar values merged.

Examples

library(refinr)

x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "pizza, acme llc",
       "Acme Pizza, Inc.")
key_collision_merge(vect = x)

# Use parameter "dict" to influence how clustered values are edited.
key_collision_merge(vect = x, dict = c("Nicks Pizza", "acme PIZZA inc"))

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high",
       "high school, bakersfield")
key_collision_merge(x, ignore_strings = c("high", "school", "highschool"))

Value merging based on ngram fingerprints

Description

This function takes a character vector and makes edits and merges values that are approximately equivalent yet not identical. It uses a two-step process. The first step is clustering values based on their ngram fingerprint (described here: https://openrefine.org/docs/technical-reference/clustering-in-depth). The second step is merging values based on approximate string matching of the ngram fingerprints, using the [sd_lower_tri()] C function from the package stringdist.
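The first step can be sketched in base R as follows. The helper name ngram_key is hypothetical, and the normalization (lowercase, keep only letters and digits) is a simplification of the fingerprint described in the Open Refine documentation rather than the package's own implementation. Each string is reduced to its sorted, de-duplicated n-grams, which are then joined into one key.

```r
# Hypothetical helper: a simplified n-gram fingerprint.
ngram_key <- function(x, numgram = 2) {
  x <- gsub("[^[:alnum:]]", "", tolower(x))   # keep letters and digits only
  vapply(x, function(s) {
    n <- nchar(s)
    if (n < numgram) return(s)                # too short to tokenize
    starts <- seq_len(n - numgram + 1)
    grams <- substring(s, starts, starts + numgram - 1)
    paste(sort(unique(grams)), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

ngram_key("Paris", numgram = 2)   # "arispari"
ngram_key("Paris", numgram = 1)   # "aiprs"

x <- c("Acme Pizza, Inc.", "acme pizza inc", "Acme Piza Inc")
keys <- ngram_key(x)
keys[1] == keys[2]   # TRUE: case and punctuation do not affect the key
keys[1] == keys[3]   # FALSE: the misspelling yields a different key
```

Values with identical keys can be clustered directly. Near misses such as the third value are what the second, approximate matching step is meant to catch.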

Usage

n_gram_merge(
  vect,
  numgram = 2,
  ignore_strings = NULL,
  bus_suffix = TRUE,
  edit_threshold = 1,
  weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5),
  ...
)

Arguments

vect

Character vector, items to be potentially clustered and merged.

numgram

Numeric value, indicating the number of characters that will occupy each ngram token. Default value is 2.

ignore_strings

Character vector, these strings will be ignored during the merging of values within vect. Default value is NULL.

bus_suffix

Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE.

edit_threshold

Numeric value, indicating the threshold at which a merge is performed, based on the sum of the edit values derived from param weight. Default value is 1. If this parameter is set to 0 or NA, then no approximate string matching will be done, and all merging will be based on strings that have identical ngram fingerprints.

weight

Numeric vector, indicating the weights to assign to the four edit operations (see details below), for the purpose of approximate string matching. Default values are c(d = 0.33, i = 0.33, s = 1, t = 0.5). This parameter gets passed along to the stringdist function. Must be either a numeric vector of length four, or NA.

...

Additional args to be passed along to the stringdist function. The acceptable args are identical to those of [stringdistmatrix()].

Details

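The values of arg weight are edit distance penalties that get passed to the stringdist edit distance function. The param takes four values, each one the penalty for a specific type of edit, with the following defaults:

d: deletion, default value is 0.33

i: insertion, default value is 0.33

s: substitution, default value is 1

t: transposition, default value is 0.5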
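To make the effect of the four penalties concrete, the base R sketch below computes a weighted edit distance that allows deletions, insertions, substitutions and adjacent transpositions. The function name weighted_osa is hypothetical, and the code is a plain dynamic programming illustration written for this page, not the stringdist routine the package calls. In the package the comparison is made between ngram fingerprints, whereas the sketch compares raw strings for readability.

```r
# Hypothetical helper: weighted edit distance with adjacent transpositions.
weighted_osa <- function(a, b,
                         weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5)) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  na <- length(a)
  nb <- length(b)
  # D[i + 1, j + 1] is the cost of turning the first i characters of a
  # into the first j characters of b.
  D <- matrix(0, na + 1, nb + 1)
  D[, 1] <- (0:na) * weight[["d"]]
  D[1, ] <- (0:nb) * weight[["i"]]
  for (i in seq_len(na)) {
    for (j in seq_len(nb)) {
      sub_cost <- if (a[i] == b[j]) 0 else weight[["s"]]
      best <- min(D[i, j + 1] + weight[["d"]],   # delete a[i]
                  D[i + 1, j] + weight[["i"]],   # insert b[j]
                  D[i, j] + sub_cost)            # substitute or match
      if (i > 1 && j > 1 && a[i] == b[j - 1] && a[i - 1] == b[j]) {
        best <- min(best, D[i - 1, j - 1] + weight[["t"]])  # transpose
      }
      D[i + 1, j + 1] <- best
    }
  }
  D[na + 1, nb + 1]
}

weighted_osa("piza", "pizza")   # 0.33, one insertion
weighted_osa("abc", "acb")      # 0.5, one transposition
weighted_osa("acme", "acne")    # 0.66, delete plus insert beats substitution
```

With the default weights a deletion followed by an insertion costs 0.66, which is cheaper than a single substitution at 1, so the substitution penalty only matters if it is lowered or the other two are raised. The resulting distance is the quantity compared against edit_threshold.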

Value

Character vector with similar values merged.

Examples

library(refinr)

x <- c("Acme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")
n_gram_merge(vect = x)

# The performance of the approximate string matching can be adjusted using
# parameters 'weight' or 'edit_threshold'.
n_gram_merge(vect = x,
             weight = c(d = 0.4, i = 1, s = 1, t = 1))

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high",
       "high school, bakersfield")
n_gram_merge(vect = x, ignore_strings = c("high", "school", "highschool"))
