markvanderloo/stringdist
- Approximate matching, fuzzy text search, and string distance calculations for R.
- All distance and matching operations are system- and encoding-independent.
- Built for speed, using OpenMP for parallel computing.

Please cite the R-Journal article

    @article{RJ-2014-011,
      author  = {Mark P.J. van der Loo},
      title   = {{The stringdist Package for Approximate String Matching}},
      year    = {2014},
      journal = {{The R Journal}},
      doi     = {10.32614/RJ-2014-011},
      url     = {https://doi.org/10.32614/RJ-2014-011},
      pages   = {111--122},
      volume  = {6},
      number  = {1}
    }

The package offers the following main functions:
- `stringdist` computes pairwise distances between two input character vectors (the shorter one is recycled).
- `stringdistmatrix` computes the distance matrix for one or two vectors.
- `stringsim` computes a string similarity between 0 and 1, based on `stringdist`.
- `amatch` is a fuzzy matching equivalent of R's native `match` function.
- `ain` is a fuzzy matching equivalent of R's native `%in%` operator.
- `afind` finds the location of fuzzy matches of a short string in a long string.
- `seq_dist`, `seq_distmatrix`, `seq_amatch` and `seq_ain` compute distances between, and perform matching of, integer sequences (see also the hashr package).
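The main functions listed above can be tried out with a short session like the one below. This is a minimal sketch rather than an excerpt from the package documentation: the example strings are arbitrary, and `afind` is assumed to be available, which requires a reasonably recent release of the package.

```r
library(stringdist)

# Pairwise Levenshtein distances; the shorter argument ("kitten") is recycled.
d <- stringdist("kitten", c("sitting", "mitten"), method = "lv")
d  # 3 1

# Distance matrix between two character vectors (rows follow the first argument).
m <- stringdistmatrix(c("foo", "bar"), c("fo", "baz", "bar"), method = "lv")

# Similarity between 0 and 1; for Levenshtein this equals 1 - 3/7 here.
s <- stringsim("kitten", "sitting", method = "lv")

# Fuzzy counterparts of match() and %in%, allowing a distance of at most 2.
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)  # 2
hit <- ain("leia", c("uhura", "leela"), maxDist = 2)     # TRUE

# Locate the best approximate occurrence of a short pattern in a longer text.
loc <- afind("the quick brown fox", "quik")
loc$location
```

Note that `amatch` and `ain` use a very small default `maxDist`, so a tolerance usually has to be set explicitly to get fuzzy behaviour.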
These functions are built upon C code that re-implements some common (weighted) string distance functions. Distance functions include:
- Hamming distance;
- Levenshtein distance (weighted);
- Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment);
- Full Damerau-Levenshtein distance (weighted);
- Longest Common Substring distance;
- Q-gram distance;
- Cosine distance for q-gram count vectors (= 1 - cosine similarity);
- Jaccard distance for q-gram count vectors (= 1 - Jaccard similarity);
- Jaro and Jaro-Winkler distance;
- Soundex-based string distance.
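To see how the distance functions above differ in practice, the same `stringdist` call can be repeated with different `method` values. The sketch below uses arbitrary example strings; the values in the comments follow from the definitions of the distances.

```r
library(stringdist)

# Restricted (OSA) vs. full Damerau-Levenshtein: OSA never edits a substring twice.
d_osa <- stringdist("ca", "abc", method = "osa")  # 3
d_dl  <- stringdist("ca", "abc", method = "dl")   # 2

# Hamming distance counts differing positions in equal-length strings.
d_ham <- stringdist("karolin", "kathrin", method = "hamming")  # 3

# "lcs" counts the insertions and deletions needed to turn one string into the other.
d_lcs <- stringdist("kitten", "sitting", method = "lcs")  # 5

# q-gram based distances with q = 2: the 2-grams are {ab, bc} and {ab, bd}.
d_qg  <- stringdist("abc", "abd", method = "qgram",   q = 2)  # 2
d_cos <- stringdist("abc", "abd", method = "cosine",  q = 2)  # 0.5
d_jac <- stringdist("abc", "abd", method = "jaccard", q = 2)  # 2/3

# Jaro (p = 0, the default) and Jaro-Winkler (p > 0 rewards a shared prefix).
d_jaro <- stringdist("martha", "marhta", method = "jw")
d_jw   <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Soundex distance is 0 when the phonetic codes agree, 1 otherwise.
d_sx <- stringdist("Robert", "Rupert", method = "soundex")  # 0
```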
Also, there are some utility functions:
- `qgrams()` tabulates the q-grams in one or more `character` vectors.
- `seq_qgrams()` tabulates the q-grams (sometimes called n-grams) in one or more `integer` vectors.
- `phonetic()` computes phonetic codes of strings (currently only soundex).
- `printable_ascii()` is a utility function that detects non-printable ASCII or non-ASCII characters.
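A brief sketch of the utility functions in use, with arbitrary example inputs:

```r
library(stringdist)

# Tabulate 2-grams; the result has one column per distinct q-gram.
tab <- qgrams("abcabc", q = 2)
n_ab <- tab[1, "ab"]  # "ab" occurs twice

# Soundex codes: similar-sounding names map to the same code.
codes <- phonetic(c("Robert", "Rupert"))  # "R163" "R163"

# TRUE only for elements consisting solely of printable ASCII characters.
ok <- printable_ascii(c("abc", "caf\u00e9"))  # TRUE FALSE
```

Checking input with `printable_ascii()` first can be useful because soundex is only defined for a limited character set.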
As of version 0.9.5.0 you can call a number of `stringdist` functions directly from the C code of your R package. The description of the API can be found

- By typing `?stringdist_api` in the R console.
- By browsing the package's help index to "User guides, package vignettes and other documentation" and clicking on `doc/stringdist_api.pdf`.
- Or you can find the file's location as follows:

      system.file("doc/stringdist_api.pdf", package="stringdist")

Examples of packages that link to `stringdist` can be found here and here.
To install the latest release from CRAN, open an R terminal and type
install.packages('stringdist')
To obtain the package from the very latest source code, open a bash terminal (or git bash if you work under Windows with msysgit) and type

    git clone https://github.com/markvanderloo/stringdist.git
    cd stringdist
    bash ./build.bash
    R CMD INSTALL output/stringdist_*.tar.gz

Warning: the GitHub version can change at any time and may not even build properly. As most of the code is written in C, the development version may crash your R session.
The following arguments have been obsolete since 2015 and have been removed in the 0.9.5.0 release (spring 2018):

- Argument `cluster` for function `stringdistmatrix`.
- Argument `maxDist` for functions `stringdist` and `stringdistmatrix` (not `amatch`).
- Argument `ncores` for function `stringdistmatrix`.
Parallelization used to be based on R's `parallel` package, which works by spawning several R sessions in the background. As of version 0.9.0, `stringdist` uses the more efficient OpenMP protocol to parallelize everything under the hood.
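With OpenMP-based parallelization, the number of threads is controlled per call rather than through a cluster object. The sketch below assumes the `nthread` argument and the `sd_num_thread` option provided by current releases; the input vector is arbitrary, and the thread count should not change the result.

```r
library(stringdist)

x <- c("foo", "bar", "baz", "boo")

# Same computation on one thread and on two threads.
d1 <- stringdistmatrix(x, x, method = "lv", nthread = 1)
d2 <- stringdistmatrix(x, x, method = "lv", nthread = 2)

# The number of threads affects speed only, not the distances.
identical(d1, d2)

# Default thread count used when nthread is not given.
getOption("sd_num_thread")
```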
The following arguments have become obsolete and will be removed sometime in 2016:

- Argument `cluster` for function `stringdistmatrix`.
- Argument `maxDist` for functions `stringdist` and `stringdistmatrix` (not `amatch`).
- Argument `ncores` for function `stringdistmatrix`.