String distance functions for R

markvanderloo/stringdist

stringdist

  • Approximate matching, fuzzy text search, and string distance calculations for R.
  • All distance and matching operations are system- and encoding-independent.
  • Built for speed, using OpenMP for parallel computing.

Citing

Please cite the R Journal article:

@article{RJ-2014-011,
  author = {Mark P.J. van der Loo},
  title = {{The stringdist Package for Approximate String Matching}},
  year = {2014},
  journal = {{The R Journal}},
  doi = {10.32614/RJ-2014-011},
  url = {https://doi.org/10.32614/RJ-2014-011},
  pages = {111--122},
  volume = {6},
  number = {1}
}

Functionality

The package offers the following main functions:

  • stringdist computes pairwise distances between two input character vectors (the shorter one is recycled).
  • stringdistmatrix computes the distance matrix for one or two vectors.
  • stringsim computes a string similarity between 0 and 1, based on stringdist.
  • amatch is a fuzzy matching equivalent of R's native match function.
  • ain is a fuzzy matching equivalent of R's native %in% operator.
  • afind finds the location of fuzzy matches of a short string in a long string.
  • seq_dist, seq_distmatrix, seq_amatch and seq_ain compute distances between, and perform matching of, integer sequences (see also the hashr package).
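The main functions listed above can be sketched in a short R session. This is a minimal illustration, assuming the package is installed from CRAN; the input strings are made up for the example, and the Levenshtein method ("lv") is chosen only because its results are easy to check by hand.

```r
library(stringdist)

# Pairwise distances; the shorter argument ("kitten") is recycled.
d <- stringdist("kitten", c("sitting", "mitten"), method = "lv")

# Distance matrix between two vectors (rows follow a, columns follow b).
m <- stringdistmatrix(c("foo", "bar"), c("fo", "baz", "boo"), method = "lv")

# Similarity between 0 and 1, derived from the distance.
s <- stringsim("kitten", "sitting", method = "lv")

# Fuzzy counterparts of match() and %in%; maxDist sets the tolerance.
i     <- amatch("leia", c("uhura", "leela"), maxDist = 2)
found <- ain("leia", c("uhura", "leela"), maxDist = 2)

# Locate the best-matching window of a short pattern in a longer string.
hit <- afind("The quick brown fox", "brown")

# Distances between integer sequences instead of strings.
sd <- seq_dist(list(c(1L, 2L, 3L)), list(c(1L, 2L, 4L)), method = "lv")
```

Here amatch returns the index of the closest candidate within maxDist (or NA when nothing is close enough), while ain reduces the same search to TRUE or FALSE.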

These functions are built upon C code that re-implements some common (weighted) string distance functions. Distance functions include:

  • Hamming distance;
  • Levenshtein distance (weighted);
  • Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment);
  • Full Damerau-Levenshtein distance (weighted);
  • Longest Common Substring distance;
  • Q-gram distance;
  • Cosine distance for q-gram count vectors (= 1 - cosine similarity);
  • Jaccard distance for q-gram count vectors (= 1 - Jaccard similarity);
  • Jaro and Jaro-Winkler distance;
  • Soundex-based string distance.
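As a rough illustration of how these distances differ, the sketch below selects several of them through the method argument of stringdist. The example strings are arbitrary; the q and p arguments set the q-gram size and the Winkler prefix factor, respectively.

```r
library(stringdist)

# Restricted (OSA) versus full Damerau-Levenshtein: the full variant may
# edit a substring again after a transposition, so it can be smaller.
osa <- stringdist("ca", "abc", method = "osa")
dl  <- stringdist("ca", "abc", method = "dl")

# Hamming distance is only finite for strings of equal length.
ham <- stringdist("abc", "abcd", method = "hamming")

# Q-gram based distances on 2-grams: "abcd" and "abce" share ab and bc.
qg  <- stringdist("abcd", "abce", method = "qgram",   q = 2)
cs  <- stringdist("abcd", "abce", method = "cosine",  q = 2)
jac <- stringdist("abcd", "abce", method = "jaccard", q = 2)

# Jaro (p = 0) and Jaro-Winkler (p > 0 rewards a shared prefix).
jaro <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0)
jw   <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0.1)

# Soundex-based distance: 0 when both strings share a soundex code.
sx <- stringdist("Robert", "Rupert", method = "soundex")
```

For the "ca" versus "abc" pair, the restricted variant gives 3 and the full variant gives 2, which is the usual way to tell the two apart.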

Also, there are some utility functions:

  • qgrams() tabulates the q-grams in one or more character vectors.
  • seq_qgrams() tabulates the q-grams (sometimes called n-grams) in one or more integer vectors.
  • phonetic() computes phonetic codes of strings (currently only Soundex).
  • printable_ascii() is a utility function that detects non-printable ASCII or non-ASCII characters.
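A brief sketch of the utility functions, again with made-up inputs. The comments describe the shape of each result as documented in the package help pages; check ?qgrams and ?seq_qgrams for the authoritative layout.

```r
library(stringdist)

# Tabulate 2-grams of a string; the result is a matrix of counts with one
# column per q-gram ("ba", "an", "na" for this input).
tab <- qgrams("banana", q = 2)

# The same idea for an integer sequence; the result is a matrix holding
# the encountered q-grams together with their counts.
stab <- seq_qgrams(c(1L, 2L, 3L, 2L, 3L), q = 2)

# Soundex codes: similar-sounding names map to the same code.
codes <- phonetic(c("Robert", "Rupert"))

# TRUE for strings made only of printable ASCII characters.
ok <- printable_ascii(c("abc", "caf\u00e9"))
```

printable_ascii is useful as a pre-check before phonetic, since Soundex is defined for plain ASCII letters.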

C API

As of version 0.9.5.0 you can call a number of stringdist functions directly from the C code of your R package. The description of the API can be found:

  • By typing ?stringdist_api in the R console.
  • By browsing the package's help index to "User guides, package vignettes and other documentation" and clicking on doc/stringdist_api.pdf.
  • Or by finding the file's location as follows:
system.file("doc/stringdist_api.pdf", package="stringdist")

Examples of packages that link to stringdist can be found here and here.

Installation

To install the latest release from CRAN, open an R terminal and type

install.packages('stringdist')

To obtain the package from the very latest source code, open a bash terminal (or git bash if you work under Windows with msysgit) and type

git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz

Warning: the GitHub version can change at any time and may not even build properly. As most of the code is written in C, the development version may crash your R session.

Resources

  • A paper on stringdist has been published in the R Journal.
  • Slides of the useR!2014 conference.

Note to users: deprecated arguments removed as of version 0.9.5.0

The following arguments have been obsolete since 2015 and were removed in the 0.9.5.0 release (spring 2018):

  • Argument cluster for function stringdistmatrix.
  • Argument maxDist for functions stringdist and stringdistmatrix (not amatch).
  • Argument ncores for function stringdistmatrix.

Note to users: deprecated arguments as of >= 0.9.0, >= 0.9.2

Parallelization used to be based on R's parallel package, which works by spawning several R sessions in the background. As of version 0.9.0, stringdist uses the more efficient OpenMP protocol to parallelize everything under the hood.
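In current releases the number of OpenMP threads is controlled through the nthread argument, whose default is read from the sd_num_thread option. The sketch below assumes a build with OpenMP support; without it, the argument has no effect on the result. The thread counts used here are arbitrary.

```r
library(stringdist)

a <- c("kitten", "sitting", "mitten")

# Same computation with one and with two threads; only the speed on
# large inputs should differ, not the distances.
m1 <- stringdistmatrix(a, a, method = "lv", nthread = 1)
m2 <- stringdistmatrix(a, a, method = "lv", nthread = 2)

# The session-wide default can be inspected (or changed via options()).
default_threads <- getOption("sd_num_thread")
```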

The following arguments have become obsolete and will be removed sometime in 2016:

  • Argument cluster for function stringdistmatrix.
  • Argument maxDist for functions stringdist and stringdistmatrix (not amatch).
  • Argument ncores for function stringdistmatrix.
