stringdist_inner_join: Correcting misspellings against a dictionary

Often you find yourself with a set of words that you want to combine with a “dictionary”: it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.
The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):
## # A tibble: 4,505 × 2
##    misspelling correct
##    <chr>       <chr>
##  1 abandonned  abandoned
##  2 aberation   aberration
##  3 abilties    abilities
##  4 abilty      ability
##  5 abondon     abandon
##  6 abbout      about
##  7 abotu       about
##  8 abouta      about a
##  9 aboutit     about it
## 10 aboutthe    about the
## # ℹ 4,495 more rows

# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
# as_tibble() replaces tbl_df(), which was deprecated in dplyr 1.0.0
words <- tibble::as_tibble(DICTIONARY)
words

## # A tibble: 20,137 × 2
##    word  syllables
##    <chr>     <dbl>
##  1 hm            1
##  2 hmm           1
##  3 hmmm          1
##  4 hmph          1
##  5 mmhmm         2
##  6 mmhm          2
##  7 mm            1
##  8 mmm           1
##  9 mmmm          1
## 10 pff           1
## # ℹ 20,127 more rows

As an example, we’ll pick 1000 of the misspellings (you could try it on all of them though), and use stringdist_inner_join to join them against our dictionary.
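The join step can be sketched on a toy example. The two small tibbles below, `toy_misspellings` and `toy_words`, are hypothetical stand-ins for the 1000 sampled misspellings and the full dictionary, and their syllable counts are illustrative. The column names and the `by` and `max_dist` arguments mirror the `stringdist_left_join` calls that appear later in this vignette.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical stand-ins for the sampled misspellings and the dictionary
toy_misspellings <- tibble(
  misspelling = c("cyclinder", "emiting", "Sanhedrim"),
  correct     = c("cylinder", "emitting", "Sanhedrin")
)
toy_words <- tibble(
  word      = c("cylinder", "emitting", "smiting", "about"),
  syllables = c(3, 3, 2, 2)
)

# Keep every (misspelling, word) pair whose string distance is at most 1
toy_joined <- toy_misspellings %>%
  stringdist_inner_join(toy_words, by = c(misspelling = "word"), max_dist = 1)

toy_joined
```

Here `emiting` is within distance 1 of both `emitting` and `smiting`, so it contributes two rows, while `Sanhedrim` has no close dictionary entry and is dropped by the inner join.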
By default, stringdist_inner_join uses optimal string alignment (a restricted form of Damerau–Levenshtein distance), and we’re setting a maximum distance of 1 for a join. Notice that they’ve been joined in cases where misspelling is close to (but not equal to) word:
## # A tibble: 760 × 4
##    misspelling correct    word       syllables
##    <chr>       <chr>      <chr>          <dbl>
##  1 cyclinder   cylinder   cylinder           3
##  2 beastiality bestiality bestiality         5
##  3 affilate    affiliate  affiliate          4
##  4 supress     suppress   suppress           2
##  5 intevene    intervene  intervene          3
##  6 resaurant   restaurant restaurant         3
##  7 univesity   university university         5
##  8 allegedely  allegedly  allegedly          4
##  9 emiting     emitting   smiting            2
## 10 probaly     probably   probably           3
## # ℹ 750 more rows

Note that there are some redundancies: misspellings that are close to multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?
joined %>%
  count(misspelling, correct)

## # A tibble: 462 × 3
##    misspelling correct         n
##    <chr>       <chr>       <int>
##  1 abilty      ability         1
##  2 accademic   academic        1
##  3 accademy    academy         1
##  4 accension   accession       2
##  5 acceptence  acceptance      1
##  6 acedemic    academic        1
##  7 achive      achieve         4
##  8 acommodate  accommodate     1
##  9 acuracy     accuracy        1
## 10 addmission  admission       1
## # ℹ 452 more rows

So we found a match in the dictionary for about half of the misspellings (462 of the 1000 sampled rows). In how many of the ones we classified did we get at least one of our guesses right?
which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))

which_correct

## # A tibble: 462 × 4
## # Groups:   misspelling [453]
##    misspelling correct     guesses one_correct
##    <chr>       <chr>         <int> <lgl>
##  1 abilty      ability           1 TRUE
##  2 accademic   academic          1 TRUE
##  3 accademy    academy           1 TRUE
##  4 accension   accession         2 TRUE
##  5 acceptence  acceptance        1 TRUE
##  6 acedemic    academic          1 TRUE
##  7 achive      achieve           4 TRUE
##  8 acommodate  accommodate       1 TRUE
##  9 acuracy     accuracy          1 TRUE
## 10 addmission  admission         1 TRUE
## # ℹ 452 more rows

# fraction of classified misspellings with at least one correct guess
mean(which_correct$one_correct)

## [1] 0.8246753

# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)

## [1] 290

Not bad.
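To see why some guesses are wrong even at a maximum distance of 1, it can help to compute the distances directly. The sketch below assumes the stringdist package is installed (fuzzyjoin relies on it for these joins) and uses word pairs taken from the output above.

```r
library(stringdist)

# A swap of two adjacent letters counts as a single edit under optimal
# string alignment ("osa"), but as two substitutions under plain
# Levenshtein distance ("lv").
osa_swap <- stringdist("abotu", "about", method = "osa")
lv_swap  <- stringdist("abotu", "about", method = "lv")

# "emiting" is one edit away from both the intended word and an unrelated
# one, so a join with max_dist = 1 returns both as guesses.
emiting_dists <- stringdist("emiting", c("emitting", "smiting"), method = "osa")

osa_swap
lv_swap
emiting_dists
```

The distance alone cannot tell `emitting` from `smiting` here, which is why the join reports one row per guess rather than a single correction.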
Note that stringdist_inner_join is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have used stringdist_left_join:
left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)

left_joined

## # A tibble: 1,298 × 4
##    misspelling   correct       word       syllables
##    <chr>         <chr>         <chr>          <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>              NA
##  2 cyclinder     cylinder      cylinder           3
##  3 beastiality   bestiality    bestiality         5
##  4 consicousness consciousness <NA>              NA
##  5 affilate      affiliate     affiliate          4
##  6 repubicans    republicans   <NA>              NA
##  7 comitted      committed     <NA>              NA
##  8 emmisions     emissions     <NA>              NA
##  9 acquited      acquitted     <NA>              NA
## 10 decompositing decomposing   <NA>              NA
## # ℹ 1,288 more rows

# misspellings with no match in the dictionary
left_joined %>%
  filter(is.na(word))

## # A tibble: 538 × 4
##    misspelling   correct       word  syllables
##    <chr>         <chr>         <chr>     <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>         NA
##  2 consicousness consciousness <NA>         NA
##  3 repubicans    republicans   <NA>         NA
##  4 comitted      committed     <NA>         NA
##  5 emmisions     emissions     <NA>         NA
##  6 acquited      acquitted     <NA>         NA
##  7 decompositing decomposing   <NA>         NA
##  8 decieved      deceived      <NA>         NA
##  9 asociated     associated    <NA>         NA
## 10 commonweath   commonwealth  <NA>         NA
## # ℹ 528 more rows

(To get just the ones without matches immediately, we could have used stringdist_anti_join.) If we increase our distance threshold, we’ll increase the fraction with a correct guess, but also get more false positive guesses:
left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)

left_joined2

## # A tibble: 8,721 × 4
##    misspelling   correct       word       syllables
##    <chr>         <chr>         <chr>          <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>              NA
##  2 cyclinder     cylinder      cylinder           3
##  3 beastiality   bestiality    bestiality         5
##  4 consicousness consciousness <NA>              NA
##  5 affilate      affiliate     affiliate          4
##  6 repubicans    republicans   <NA>              NA
##  7 comitted      committed     committee          3
##  8 emmisions     emissions     <NA>              NA
##  9 acquited      acquitted     acquire            2
## 10 acquited      acquitted     acquit             2
## # ℹ 8,711 more rows

# misspellings that still have no match at a maximum distance of 2
left_joined2 %>%
  filter(is.na(word))

## # A tibble: 286 × 4
##    misspelling   correct        word  syllables
##    <chr>         <chr>          <chr>     <dbl>
##  1 Sanhedrim     Sanhedrin      <NA>         NA
##  2 consicousness consciousness  <NA>         NA
##  3 repubicans    republicans    <NA>         NA
##  4 emmisions     emissions      <NA>         NA
##  5 commonweath   commonwealth   <NA>         NA
##  6 supressed     suppressed     <NA>         NA
##  7 aproximately  approximately  <NA>         NA
##  8 Missisippi    Mississippi    <NA>         NA
##  9 lazyness      laziness       <NA>         NA
## 10 constituional constitutional <NA>         NA
## # ℹ 276 more rows

Most of the missing words here simply aren’t in our dictionary.
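The stringdist_anti_join alternative mentioned earlier returns the unmatched rows directly, without a left join followed by a filter. A minimal sketch on hypothetical toy data (`toy_misspellings` and `toy_words` are illustrative names, not objects from this vignette):

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical stand-ins for the sampled misspellings and the dictionary
toy_misspellings <- tibble(
  misspelling = c("cyclinder", "emiting", "Sanhedrim"),
  correct     = c("cylinder", "emitting", "Sanhedrin")
)
toy_words <- tibble(
  word      = c("cylinder", "emitting", "smiting", "about"),
  syllables = c(3, 3, 2, 2)
)

# Keep only the misspellings with no dictionary word within distance 1
toy_unmatched <- toy_misspellings %>%
  stringdist_anti_join(toy_words, by = c(misspelling = "word"), max_dist = 1)

toy_unmatched
```

Only `Sanhedrim` remains, since the toy dictionary has no entry within one edit of it.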
You can try other distance thresholds, other dictionaries, and other distance metrics (see stringdist-metrics for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.
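As a rough illustration of the survey use case, the sketch below matches free-form answers against a handful of canonical responses. The data and the names `responses`, `canonical`, and `matched` are made up for this example. `ignore_case` and `distance_col` are arguments of fuzzyjoin's stringdist joins, and a `method` argument could be added to try a different metric.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical free-form survey answers and the allowed categories
responses <- tibble(
  response = c("Strongly agre", "agree", "Disagre", "neutrall", "maybe")
)
canonical <- tibble(
  category = c("strongly agree", "agree", "neutral", "disagree")
)

# Left join so that answers with no close category are kept (as NA);
# distance_col records how far each answer is from its matched category.
matched <- responses %>%
  stringdist_left_join(
    canonical,
    by = c(response = "category"),
    max_dist = 2,
    ignore_case = TRUE,
    distance_col = "distance"
  )

matched
```

With these inputs each of the first four answers is close to exactly one category, and `maybe` is left unmatched for manual review.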