Example of stringdist_inner_join: Correcting misspellings against a dictionary

David Robinson

2025-07-10

Often you find yourself with a set of words that you want to combine with a “dictionary”: it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.

The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):

library(dplyr)
library(fuzzyjoin)
data(misspellings)
misspellings
## # A tibble: 4,505 × 2
##    misspelling correct
##    <chr>       <chr>
##  1 abandonned  abandoned
##  2 aberation   aberration
##  3 abilties    abilities
##  4 abilty      ability
##  5 abondon     abandon
##  6 abbout      about
##  7 abotu       about
##  8 abouta      about a
##  9 aboutit     about it
## 10 aboutthe    about the
## # ℹ 4,495 more rows
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
# as_tibble() replaces tbl_df(), which was deprecated in dplyr 1.0.0
words <- tibble::as_tibble(DICTIONARY)
words
## # A tibble: 20,137 × 2
##    word  syllables
##    <chr>     <dbl>
##  1 hm            1
##  2 hmm           1
##  3 hmmm          1
##  4 hmph          1
##  5 mmhmm         2
##  6 mmhm          2
##  7 mm            1
##  8 mmm           1
##  9 mmmm          1
## 10 pff           1
## # ℹ 20,127 more rows

As an example, we’ll pick 1000 of these misspellings (you could try it on all of them though), and use stringdist_inner_join to join them against our dictionary.

set.seed(2016)
sub_misspellings <- misspellings %>%
  sample_n(1000)
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)

By default, stringdist_inner_join uses optimal string alignment distance (a restricted form of Damerau–Levenshtein distance), and we’re setting a maximum distance of 1 for a join. Notice that they’ve been joined in cases where misspelling is close to (but not equal to) word:

joined
## # A tibble: 760 × 4
##    misspelling correct    word       syllables
##    <chr>       <chr>      <chr>          <dbl>
##  1 cyclinder   cylinder   cylinder           3
##  2 beastiality bestiality bestiality         5
##  3 affilate    affiliate  affiliate          4
##  4 supress     suppress   suppress           2
##  5 intevene    intervene  intervene          3
##  6 resaurant   restaurant restaurant         3
##  7 univesity   university university         5
##  8 allegedely  allegedly  allegedly          4
##  9 emiting     emitting   smiting            2
## 10 probaly     probably   probably           3
## # ℹ 750 more rows
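To see what a maximum distance of 1 means under optimal string alignment, it can help to compute a few distances directly. The sketch below assumes the stringdist package (which fuzzyjoin builds on) is installed, and compares the "osa" method with plain Levenshtein ("lv") on two pairs taken from the misspellings table above. The variable names are only illustrative.

```r
library(stringdist)

# "abotu" differs from "about" by one swap of adjacent letters.
# Optimal string alignment counts that swap as a single edit...
osa_swap <- stringdist("abotu", "about", method = "osa")

# ...while plain Levenshtein distance needs two substitutions.
lv_swap <- stringdist("abotu", "about", method = "lv")

# A dropped letter is a single edit under both metrics.
osa_drop <- stringdist("abilty", "ability", method = "osa")

c(osa_swap = osa_swap, lv_swap = lv_swap, osa_drop = osa_drop)
```

With max_dist = 1, both pairs would fall inside the join threshold under "osa", but the transposed pair would be missed under "lv".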

Note that there are some redundancies: misspellings that are close to multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?

joined %>%
  count(misspelling, correct)
## # A tibble: 462 × 3
##    misspelling correct         n
##    <chr>       <chr>       <int>
##  1 abilty      ability         1
##  2 accademic   academic        1
##  3 accademy    academy         1
##  4 accension   accession       2
##  5 acceptence  acceptance      1
##  6 acedemic    academic        1
##  7 achive      achieve         4
##  8 acommodate  accommodate     1
##  9 acuracy     accuracy        1
## 10 addmission  admission       1
## # ℹ 452 more rows

So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?

which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))
which_correct
## # A tibble: 462 × 4
## # Groups:   misspelling [453]
##    misspelling correct     guesses one_correct
##    <chr>       <chr>         <int> <lgl>
##  1 abilty      ability           1 TRUE
##  2 accademic   academic          1 TRUE
##  3 accademy    academy           1 TRUE
##  4 accension   accession         2 TRUE
##  5 acceptence  acceptance        1 TRUE
##  6 acedemic    academic          1 TRUE
##  7 achive      achieve           4 TRUE
##  8 acommodate  accommodate       1 TRUE
##  9 acuracy     accuracy          1 TRUE
## 10 addmission  admission         1 TRUE
## # ℹ 452 more rows
# fraction of classified misspellings with at least one correct guess
mean(which_correct$one_correct)
## [1] 0.8246753
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 290

Not bad.
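The "one row per guess" behaviour and the summary above are easier to follow on a tiny example. The sketch below uses hypothetical toy tables (toy_misspellings and toy_words are made up for illustration, not drawn from the datasets above) in which one misspelling is within one edit of three dictionary words. Passing .groups = "drop" to summarize is an addition here, used only to return an ungrouped result.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical toy data, chosen so that "achive" has several candidates.
toy_misspellings <- tibble(
  misspelling = c("achive", "abilty"),
  correct     = c("achieve", "ability")
)
toy_words <- tibble(word = c("achieve", "active", "archive", "ability"))

# One output row per dictionary word within one edit of each misspelling.
toy_joined <- toy_misspellings %>%
  stringdist_inner_join(toy_words, by = c(misspelling = "word"), max_dist = 1)

# Count the guesses per misspelling and check whether any guess is right.
toy_summary <- toy_joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word), .groups = "drop")

toy_summary
```

Here "achive" should pick up three guesses ("achieve", "active", "archive") and "abilty" a single one, so only "abilty" would count as uniquely correct.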

Note that stringdist_inner_join is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have used stringdist_left_join:

left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)
left_joined
## # A tibble: 1,298 × 4
##    misspelling   correct       word       syllables
##    <chr>         <chr>         <chr>          <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>              NA
##  2 cyclinder     cylinder      cylinder           3
##  3 beastiality   bestiality    bestiality         5
##  4 consicousness consciousness <NA>              NA
##  5 affilate      affiliate     affiliate          4
##  6 repubicans    republicans   <NA>              NA
##  7 comitted      committed     <NA>              NA
##  8 emmisions     emissions     <NA>              NA
##  9 acquited      acquitted     <NA>              NA
## 10 decompositing decomposing   <NA>              NA
## # ℹ 1,288 more rows
left_joined %>%
  filter(is.na(word))
## # A tibble: 538 × 4
##    misspelling   correct       word  syllables
##    <chr>         <chr>         <chr>     <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>         NA
##  2 consicousness consciousness <NA>         NA
##  3 repubicans    republicans   <NA>         NA
##  4 comitted      committed     <NA>         NA
##  5 emmisions     emissions     <NA>         NA
##  6 acquited      acquitted     <NA>         NA
##  7 decompositing decomposing   <NA>         NA
##  8 decieved      deceived      <NA>         NA
##  9 asociated     associated    <NA>         NA
## 10 commonweath   commonwealth  <NA>         NA
## # ℹ 528 more rows

(To get just the ones without matches immediately, we could have used stringdist_anti_join.) If we increase our distance threshold, we’ll increase the fraction with a correct guess, but also get more false positive guesses:

left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)
left_joined2
## # A tibble: 8,721 × 4
##    misspelling   correct       word       syllables
##    <chr>         <chr>         <chr>          <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>              NA
##  2 cyclinder     cylinder      cylinder           3
##  3 beastiality   bestiality    bestiality         5
##  4 consicousness consciousness <NA>              NA
##  5 affilate      affiliate     affiliate          4
##  6 repubicans    republicans   <NA>              NA
##  7 comitted      committed     committee          3
##  8 emmisions     emissions     <NA>              NA
##  9 acquited      acquitted     acquire            2
## 10 acquited      acquitted     acquit             2
## # ℹ 8,711 more rows
left_joined2 %>%
  filter(is.na(word))
## # A tibble: 286 × 4
##    misspelling   correct        word  syllables
##    <chr>         <chr>          <chr>     <dbl>
##  1 Sanhedrim     Sanhedrin      <NA>         NA
##  2 consicousness consciousness  <NA>         NA
##  3 repubicans    republicans    <NA>         NA
##  4 emmisions     emissions      <NA>         NA
##  5 commonweath   commonwealth   <NA>         NA
##  6 supressed     suppressed     <NA>         NA
##  7 aproximately  approximately  <NA>         NA
##  8 Missisippi    Mississippi    <NA>         NA
##  9 lazyness      laziness       <NA>         NA
## 10 constituional constitutional <NA>         NA
## # ℹ 276 more rows

Most of the missing words here simply aren’t in our dictionary.
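The anti join mentioned earlier can be sketched on a toy example. The tables below are hypothetical and kept deliberately small; the call mirrors the left join above, but stringdist_anti_join returns only the rows of the first table that have no match within max_dist, without the extra filter(is.na(word)) step.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical toy data: one misspelling has a near match, one does not.
toy_misspellings <- tibble(misspelling = c("abilty", "Sanhedrim"))
toy_words <- tibble(word = c("ability", "about"))

# Keep only the misspellings with no dictionary word within one edit.
unmatched <- toy_misspellings %>%
  stringdist_anti_join(toy_words, by = c(misspelling = "word"), max_dist = 1)

unmatched
```

Only "Sanhedrim" should remain, since "abilty" is one edit away from "ability".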

You can try other distance thresholds, other dictionaries, and other distance metrics (see stringdist-metrics for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.
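As one illustration of switching metrics, the sketch below passes method = "jw" (the Jaro–Winkler family in stringdist, which returns a distance between 0 and 1) and records the computed distance with distance_col. The toy tables, the threshold of 0.1, and the column name "dist" are all assumptions chosen for illustration; a sensible threshold would depend on the data.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical toy data, not drawn from the datasets above.
toy_misspellings <- tibble(misspelling = "abilty")
toy_words <- tibble(word = c("ability", "about"))

# Jaro-Winkler distances lie in [0, 1], so max_dist is a fraction here
# rather than a count of edits.
jw_joined <- toy_misspellings %>%
  stringdist_inner_join(
    toy_words,
    by = c(misspelling = "word"),
    method = "jw",
    max_dist = 0.1,
    distance_col = "dist"
  )

jw_joined
```

Because the scale differs between metrics, a threshold that worked for edit distance cannot be reused as-is; inspecting the recorded distance column on a sample is one way to pick a cutoff.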
