
`levitate` is based on the Python package `thefuzz` (formerly `fuzzywuzzy`) for fuzzy string matching. An R port of this package already exists, but unlike `fuzzywuzzyR`, `levitate` is written entirely in R with no external dependencies on `reticulate` or Python. It also offers a couple of extra bells and whistles in the form of vectorised functions.
View the docs at <https://lewinfox.com/levitate/>.
## Why “levitate”?

A common measure of string similarity is the Levenshtein distance, and the name was available on CRAN.
**NOTE**: The default distance metric is Optimal String Alignment (OSA), not Levenshtein distance. This is the default method used by the `stringdist` package, which `levitate` uses for distance calculations. OSA allows transpositions whereas Levenshtein distance does not. To use Levenshtein distance, pass `method = "lv"` to any of the `lev_*()` functions.
``` r
lev_distance("01", "10") # Transpositions allowed by the default `method = "osa"`
#> [1] 1

lev_distance("01", "10", method = "lv") # No transpositions
#> [1] 2
```

A full list of distance metrics is available in `help("stringdist-metrics", package = "stringdist")`.
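For comparison, base R's `adist()` computes plain Levenshtein distance, so on transposed strings it agrees with `method = "lv"` rather than with the OSA default:

``` r
# Base R's adist() counts a transposition as two edits, like method = "lv"
adist("01", "10")
#>      [,1]
#> [1,]    2
```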
Install the released version from CRAN:
``` r
install.packages("levitate")
```

Alternatively, you can install the development version from GitHub:
``` r
devtools::install_github("lewinfox/levitate")
```

## lev_distance()

The edit distance is the number of insertions, deletions or substitutions needed to transform one string into another. Base R provides the `adist()` function to compute this; `levitate` provides `lev_distance()`, which is powered by the `stringdist` package.
``` r
lev_distance("cat", "bat")
#> [1] 1

lev_distance("rat", "rats")
#> [1] 1

lev_distance("cat", "rats")
#> [1] 2
```

The function can accept vectorised input. Where the inputs have a `length()` greater than 1 the results are returned as a vector, unless `pairwise = FALSE`, in which case a matrix is returned.
``` r
lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"))
#> [1] 1 1 2

lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"), pairwise = FALSE)
#>      rat log frog
#> cat    1   3    4
#> dog    3   1    2
#> clog   4   1    2
```

If one or both of the inputs is scalar (length 1) the result will be a vector, with elements named after the longer input (unless `useNames = FALSE`).
``` r
lev_distance(c("cat", "dog", "clog"), "rat")
#>  cat  dog clog 
#>    1    3    4 

lev_distance("cat", c("rat", "log", "frog", "other"))
#>   rat   log  frog other 
#>     1     3     4     5 

lev_distance("cat", c("rat", "log", "frog", "other"), useNames = FALSE)
#> [1] 1 3 4 5
```

## lev_ratio()

More useful than the raw edit distance, `lev_ratio()` makes it easier to compare similarity across different strings. Identical strings score 1 and entirely dissimilar strings score 0.
This function behaves exactly like `lev_distance()`:
``` r
lev_ratio("cat", "bat")
#> [1] 0.6666667

lev_ratio("rat", "rats")
#> [1] 0.75

lev_ratio("cat", "rats")
#> [1] 0.5

lev_ratio(c("cat", "dog", "clog"), c("rat", "log", "frog"))
#> [1] 0.6666667 0.6666667 0.5000000
```

## lev_partial_ratio()

If `a` and `b` are different lengths, this function compares all the substrings of the longer string that are the same length as the shorter string, and returns the highest `lev_ratio()` among them. For example, when comparing `"actor"` and `"tractor"` we compare `"actor"` with `"tract"`, `"racto"` and `"actor"`, and return the highest score (in this case 1).
``` r
lev_partial_ratio("actor", "tractor")
#> [1] 1

# What's actually happening is that the max() of this result is returned
lev_ratio("actor", c("tract", "racto", "actor"))
#> tract racto actor 
#>   0.2   0.6   1.0 
```

## lev_token_sort_ratio()

The inputs are tokenised and the tokens sorted alphabetically, then the resulting strings are compared.
``` r
x <- "Episode IV - Star Wars: A New Hope"
y <- "Star Wars Episode IV - New Hope"

# Because the order of words is different, the simple approach gives a low match ratio.
lev_ratio(x, y)
#> [1] 0.3529412

# The sorted token approach ignores word order.
lev_token_sort_ratio(x, y)
#> [1] 0.9354839
```

## lev_token_set_ratio()

Similar to `lev_token_sort_ratio()`, this function breaks the input down into tokens. It then identifies any tokens common to both strings and creates three new strings:
```
x <- {common_tokens}
y <- {common_tokens}{remaining_unique_tokens_from_string_a}
z <- {common_tokens}{remaining_unique_tokens_from_string_b}
```

It then performs three pairwise `lev_ratio()` calculations (`x` vs `y`, `y` vs `z` and `x` vs `z`) and returns the highest of the three ratios.
``` r
x <- "the quick brown fox jumps over the lazy dog"
y <- "my lazy dog was jumped over by a quick brown fox"

lev_ratio(x, y)
#> [1] 0.2916667

lev_token_sort_ratio(x, y)
#> [1] 0.6458333

lev_token_set_ratio(x, y)
#> [1] 0.7435897
```

## lev_weighted_token_ratio()

The `lev_weighted_*()` family of functions works slightly differently from the others. These functions always tokenise their input, and they allow you to assign different weights to specific tokens. This lets you exert some influence over the parts of the input strings that are more or less interesting to you.
For example, maybe you’re comparing company names from differentsources, trying to match them up.
``` r
lev_ratio("united widgets, ltd", "utd widgets, ltd") # Note the typo
#> [1] 0.8421053
```

These strings already score quite highly, but the `"ltd"` in each name isn't very helpful. We can use `lev_weighted_token_ratio()` to reduce the impact of `"ltd"`.
**NOTE**: Because the tokenisation affects the score, we can't compare the output of the `lev_weighted_*()` functions with the non-weighted versions. To get a baseline, call the weighted function without supplying a `weights` argument.
``` r
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd")
#> [1] 0.8125

lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0.1))
#> [1] 0.7744361
```

De-weighting `"ltd"` has reduced the similarity score of the strings, which gives a more accurate impression of their similarity.
We can remove the effect of `"ltd"` altogether by setting its weight to zero.
``` r
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0))
#> [1] 0.7692308

lev_weighted_token_ratio("united widgets", "utd widgets")
#> [1] 0.7692308
```

De-weighting also works the other way: if the token to be weighted appears in one string but not the other, then de-weighting it *increases* the similarity score:
``` r
lev_weighted_token_ratio("utd widgets", "united widgets, ltd")
#> [1] 0.625

lev_weighted_token_ratio("utd widgets", "united widgets, ltd", weights = list(ltd = 0.1))
#> [1] 0.7518797
```

`lev_weighted_token_ratio()` has a key limitation: a token will only be weighted if it appears in the same position in both strings once they have been tokenised.
This is probably easiest to see by example.
``` r
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd")
#> [1] 0.65

lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1, limited = 0.1))
#> [1] 0.65
```

In this case the weighting has had no effect. Why not? Internally, the function has tokenised the strings as follows:
| token_1 | token_2 | token_3 |
|---|---|---|
| “utd” | “widgets” | “limited” |
| “united” | “widgets” | “ltd” |
Because the token `"ltd"` doesn't appear in the same position in both strings, the function doesn't apply any weights.
This is a deliberate decision; while in the example above it’s easyto say “well, clearly ltd and limited are the same thing so we ought toweight them”, how should we handle a less clear example?
``` r
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam")
#> [1] 0.1176471

lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam", weights = list(spam = 0.1, eggs = 0.5))
#> [1] 0.1176471
```

In this case it's hard to say what the “correct” approach would be. There isn't a meaningful way of applying weights to dissimilar tokens. In situations like “ltd”/“limited”, a pre-cleaning or standardisation process might be helpful, but that is outside the scope of what `levitate` offers.
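As a rough sketch of what such a pre-cleaning step might look like (the helper name and replacement rules here are invented for illustration; they are not part of `levitate`):

``` r
# Hypothetical helper: standardise company-name suffixes before comparison,
# so that "limited" and "ltd" end up as the same token in the same position.
standardise_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)  # drop commas, full stops, etc.
  gsub("\\blimited\\b", "ltd", x)  # unify suffix spellings
}

standardise_name("utd widgets limited")
#> [1] "utd widgets ltd"

standardise_name("united widgets, ltd")
#> [1] "united widgets ltd"
```

After cleaning, the `"ltd"` tokens occupy the same position in both strings, so a `weights = list(ltd = ...)` argument can take effect.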
I recommend exploring `lev_weighted_token_sort_ratio()` and `lev_weighted_token_set_ratio()`, as they may give more useful results for some problems. Remember, weighting is most useful when compared with the unweighted output of the same function.
A common problem in this area is “given a string `x` and a set of strings `y`, which string in `y` is most / least similar to `x`?”. `levitate` provides two functions to help with this: `lev_score_multiple()` and `lev_best_match()`.
`lev_score_multiple()` returns a ranked list of candidates. By default the highest-scoring candidate comes first.
``` r
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"))
#> $frodo
#> [1] 0.2
#> 
#> $legolas
#> [1] 0.1428571
#> 
#> $gandalf
#> [1] 0
```

`lev_best_match()` returns the best-matched string without any score information.
``` r
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"))
#> [1] "frodo"
```

Both functions take a `.fn` argument which allows you to select a different ranking function. The default is `lev_ratio()` but you can pick another or write your own. See `?lev_score_multiple` for details.
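For instance, you could rank candidates with a token-based scorer so that word order is ignored (a sketch; the candidate strings are invented for illustration, and `levitate` must be attached):

``` r
library(levitate)

# Use lev_token_sort_ratio() as the ranking function instead of lev_ratio(),
# so candidates that differ only in word order still score highly.
lev_best_match("star wars a new hope",
               c("a new hope star wars", "the empire strikes back"),
               .fn = lev_token_sort_ratio)
```

The first candidate contains exactly the same tokens as the query, so the token sort approach should prefer it.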
You can also reverse the sort direction with `decreasing = FALSE`, so that lower-scoring items are preferred. This may be helpful if you're using a distance rather than a similarity measure, or if you want to return the *least* similar strings.
``` r
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), decreasing = FALSE)
#> $gandalf
#> [1] 0
#> 
#> $legolas
#> [1] 0.1428571
#> 
#> $frodo
#> [1] 0.2
```

## Differences from thefuzz or fuzzywuzzyR

Results differ between `levitate` and `thefuzz`, not least because `stringdist` offers several possible similarity measures. Be careful if you are porting code that relies on hard-coded or learned cutoffs for similarity measures.