| Title: | Cluster and Merge Similar Values Within a Character Vector |
| Version: | 0.3.3 |
| Description: | These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool Open Refine <https://openrefine.org/>. More info on key collision and ngram fingerprint can be found here: <https://openrefine.org/docs/technical-reference/clustering-in-depth>. |
| Depends: | R (≥ 3.0.2) |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| Imports: | Rcpp, stringdist (≥ 0.9.5.1), stringi |
| RoxygenNote: | 7.2.3 |
| LinkingTo: | Rcpp, stringdist (≥ 0.9.5.1) |
| URL: | https://github.com/ChrisMuir/refinr |
| BugReports: | https://github.com/ChrisMuir/refinr/issues |
| Suggests: | testthat, knitr, rmarkdown, dplyr |
| VignetteBuilder: | knitr |
| NeedsCompilation: | yes |
| Packaged: | 2023-11-12 21:47:22 UTC; chrismuir |
| Author: | Chris Muir [aut, cre] |
| Maintainer: | Chris Muir <chrismuirRVA@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2023-11-12 22:20:02 UTC |
Cluster and Merge Similar Values Within a Character Vector
Description
These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool Open Refine.
Documentation for Open Refine
Open Refine Site: https://openrefine.org/
Details on Open Refine clustering algorithms: https://openrefine.org/docs/technical-reference/clustering-in-depth
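Development links
https://github.com/ChrisMuir/refinr
https://github.com/ChrisMuir/refinr/issues
refinr features the following functions
key_collision_merge()
n_gram_merge()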
Author(s)
Maintainer: Chris Muir <chrismuirRVA@gmail.com>
See Also
Useful links:
https://github.com/ChrisMuir/refinr
Report bugs at https://github.com/ChrisMuir/refinr/issues
Value merging based on Key Collision
Description
This function takes a character vector and makes edits and merges values that are approximately equivalent yet not identical. It clusters values based on the key collision method, described here: https://openrefine.org/docs/technical-reference/clustering-in-depth.
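The idea behind key collision can be sketched in a few lines of base R. This is a simplified illustration of an OpenRefine-style fingerprint, not refinr's actual implementation: the helper name fingerprint() is hypothetical, business suffix handling (bus_suffix) and dictionaries (dict) are left out, and the rule for picking the merged value (most frequent value in the cluster) is an assumption made for the example.

```r
# Illustrative sketch only: a simplified key collision fingerprint in base R.
# `fingerprint()` is a hypothetical helper, not a refinr function.
fingerprint <- function(x) {
  x <- tolower(trimws(x))                  # normalize case and outer whitespace
  x <- gsub("[[:punct:]]", "", x)          # drop punctuation
  tokens <- strsplit(x, "[[:space:]]+")    # split into whitespace-separated tokens
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]
    # sort and de-duplicate tokens so word order and repeats do not matter
    paste(sort(unique(tok), method = "radix"), collapse = " ")
  }, character(1))
}

x <- c("Acme Pizza, Inc.", "ACME PIZZA INC", "inc pizza, acme",
       "Acme Pizza, Inc.", "Nicks Pizza")
keys <- fingerprint(x)

# Values that share a key form one cluster. As an assumption for this sketch,
# each cluster is rewritten to its most frequent original value.
rep_by_key <- vapply(split(x, keys),
                     function(v) names(which.max(table(v))),
                     character(1))
merged <- unname(rep_by_key[keys])
print(merged)
```

Because the sketch has no notion of business suffixes, a value such as "ACME PIZZA COMPANY" would get a different key here; the bus_suffix argument described below addresses that case in the package itself.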
Usage
key_collision_merge(vect, ignore_strings = NULL, bus_suffix = TRUE, dict = NULL)
Arguments
vect | Character vector, items to be potentially clustered and merged. |
ignore_strings | Character vector, these strings will be ignored during the merging of values within vect. |
bus_suffix | Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE. |
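dict | Character vector, meant to act as a dictionary during the merging process. If any items within vect have a match in dict, then those items will always be edited to be identical to their match in dict. |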
Value
Character vector with similar values merged.
Examples
x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "pizza, acme llc", "Acme Pizza, Inc.")
key_collision_merge(vect = x)

# Use parameter 'dict' to influence how clustered values are edited.
key_collision_merge(vect = x, dict = c("Nicks Pizza", "acme PIZZA inc"))

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
key_collision_merge(x, ignore_strings = c("high", "school", "highschool"))
Value merging based on ngram fingerprints
Description
This function takes a character vector and makes edits and merges values that are approximately equivalent yet not identical. It uses a two-step process. The first step clusters values based on their ngram fingerprint (described here: https://openrefine.org/docs/technical-reference/clustering-in-depth). The second step merges values based on approximate string matching of the ngram fingerprints, using the sd_lower_tri() C function from the package stringdist.
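The first step can be illustrated with a small base R sketch of an ngram fingerprint, following the description in the OpenRefine documentation linked above. The helper name ngram_fingerprint() is hypothetical and the code is a simplification for illustration, not the routine refinr runs internally.

```r
# Illustrative sketch only: an OpenRefine-style ngram fingerprint in base R.
# `ngram_fingerprint()` is a hypothetical helper, not a refinr function.
ngram_fingerprint <- function(x, numgram = 2) {
  # lowercase, then keep only letters and digits
  x <- gsub("[^[:alnum:]]", "", tolower(x))
  vapply(x, function(s) {
    n <- nchar(s)
    if (n < numgram) return(s)             # too short to form a single ngram
    starts <- seq_len(n - numgram + 1)
    grams <- substring(s, starts, starts + numgram - 1)
    # sort and de-duplicate the ngrams, then glue them back together
    paste(sort(unique(grams), method = "radix"), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

fp <- ngram_fingerprint(c("Acme Pizza, Inc.", "ACME PIZZA INC", "ACME PIZA INC"))
print(fp)
```

The first two values collapse to the same fingerprint, while the misspelled third value yields a fingerprint that is close but not identical. Closing that remaining gap is the job of the second, approximate matching step.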
Usage
n_gram_merge(vect, numgram = 2, ignore_strings = NULL, bus_suffix = TRUE, edit_threshold = 1, weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5), ...)
Arguments
vect | Character vector, items to be potentially clustered and merged. |
numgram | Numeric value, indicating the number of characters that will occupy each ngram token. Default value is 2. |
ignore_strings | Character vector, these strings will be ignored during the merging of values within vect. |
bus_suffix | Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE. |
edit_threshold | Numeric value, indicating the threshold at which a merge is performed, based on the sum of the edit values derived from param weight. |
weight | Numeric vector, indicating the weights to assign to the four edit operations (see details below), for the purpose of approximate string matching. Default values are c(d = 0.33, i = 0.33, s = 1, t = 0.5). This parameter gets passed along to the stringdist function. |
... | Additional args to be passed along to the stringdist function. |
Details
The values of arg weight are edit distance penalties that get passed to the stringdist edit distance function. The param takes four named values, one for each type of edit, listed here with their default penalty values.
d: deletion, default value is 0.33
i: insertion, default value is 0.33
s: substitution, default value is 1
t: transposition, default value is 0.5
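To make the effect of these penalties concrete, here is a minimal base R sketch of a weighted edit distance that allows transpositions of adjacent characters (an optimal string alignment distance). The function weighted_osa() is hypothetical and written only for this illustration; it is not the stringdist implementation, and it assumes that "deletion" means removing a character from the first string and "insertion" means adding one to it.

```r
# Illustrative sketch only: weighted optimal string alignment distance.
# `weighted_osa()` is a hypothetical helper, not part of stringdist or refinr.
weighted_osa <- function(a, b, weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # D[i + 1, j + 1] holds the cost of turning the first i chars of a
  # into the first j chars of b
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- (0:n) * weight[["d"]]
  D[1, ] <- (0:m) * weight[["i"]]
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      sub_cost <- if (x[i] == y[j]) 0 else weight[["s"]]
      best <- min(D[i, j + 1] + weight[["d"]],   # delete x[i]
                  D[i + 1, j] + weight[["i"]],   # insert y[j]
                  D[i, j] + sub_cost)            # substitute or match
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, D[i - 1, j - 1] + weight[["t"]])  # swap neighbors
      }
      D[i + 1, j + 1] <- best
    }
  }
  D[n + 1, m + 1]
}

print(weighted_osa("pizza", "piza"))  # one deletion
print(weighted_osa("acme", "acem"))   # one transposition
print(weighted_osa("cat", "cut"))     # deletion plus insertion beats substitution
```

In this sketch, with the default weights, a deletion followed by an insertion (0.33 + 0.33) is cheaper than a single substitution (1), so the low deletion and insertion penalties make it easier for near-identical fingerprints to fall under a given edit_threshold.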
Value
Character vector with similar values merged.
Examples
x <- c("Acme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")
n_gram_merge(vect = x)

# The performance of the approximate string matching can be adjusted using
# parameters 'weight' or 'edit_threshold'.
n_gram_merge(vect = x, weight = c(d = 0.4, i = 1, s = 1, t = 1))

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
n_gram_merge(vect = x, ignore_strings = c("high", "school", "highschool"))