Forked from life4/textdistance
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface.
TextDistance -- a Python library for comparing the distance between two or more sequences using many algorithms.
Features:
- 30+ algorithms
- Pure python implementation
- Simple usage
- Comparison of more than two sequences
- Some algorithms have more than one implementation in one class.
- Optional numpy usage for maximum speed.
Edit based:
Algorithm | Class | Functions |
---|---|---|
Hamming | Hamming | hamming |
MLIPNS | Mlipns | mlipns |
Levenshtein | Levenshtein | levenshtein |
Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein |
Jaro-Winkler | JaroWinkler | jaro_winkler, jaro |
Strcmp95 | StrCmp95 | strcmp95 |
Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch |
Gotoh | Gotoh | gotoh |
Smith-Waterman | SmithWaterman | smith_waterman |
Token based:
Algorithm | Class | Functions |
---|---|---|
Jaccard index | Jaccard | jaccard |
Sørensen–Dice coefficient | Sorensen | sorensen, sorensen_dice, dice |
Tversky index | Tversky | tversky |
Overlap coefficient | Overlap | overlap |
Tanimoto distance | Tanimoto | tanimoto |
Cosine similarity | Cosine | cosine |
Monge-Elkan | MongeElkan | monge_elkan |
Bag distance | Bag | bag |
Sequence based:
Algorithm | Class | Functions |
---|---|---|
Longest common subsequence similarity | LCSSeq | lcsseq |
Longest common substring similarity | LCSStr | lcsstr |
Ratcliff-Obershelp similarity | RatcliffObershelp | ratcliff_obershelp |
Compression based:
Work in progress. Currently, all algorithms compare two strings as arrays of bits, not by characters.
NCD -- normalized compression distance.
Functions:
- bz2_ncd
- lzma_ncd
- arith_ncd
- rle_ncd
- bwtrle_ncd
- zlib_ncd
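As a rough illustration of what a normalized compression distance measures, here is a minimal sketch using the standard library's `zlib` and the textbook formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The helper names `compressed_size` and `ncd` are hypothetical, and this is not necessarily how the library's `zlib_ncd` computes its result.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # Size in bytes of the zlib-compressed payload.
    return len(zlib.compress(data))


def ncd(x: str, y: str) -> float:
    # Textbook NCD formula; an assumption, not the library's exact code.
    bx, by = x.encode(), y.encode()
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


a = "the quick brown fox jumps over the lazy dog " * 20
c = "pack my box with five dozen liquor jugs " * 20

# A text concatenated with itself compresses almost as well as the text
# alone, so its NCD is lower than that of two unrelated texts.
print(ncd(a, a) < ncd(a, c))
```

Real compressors add fixed overhead, so values for very short strings are noisy; the measure is more meaningful on longer inputs.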
Phonetic:
Algorithm | Class | Functions |
---|---|---|
MRA | MRA | mra |
Editex | Editex | editex |
Simple:
Algorithm | Class | Functions |
---|---|---|
Prefix similarity | Prefix | prefix |
Postfix similarity | Postfix | postfix |
Length distance | Length | length |
Identity similarity | Identity | identity |
Matrix similarity | Matrix | matrix |
Stable:
pip install textdistance
Dev:
pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance
All algorithms have 2 interfaces:
- Class with algorithm-specific params for customizing.
- Class instance with default params for quick and simple usage.
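The two-interface pattern can be sketched without the library itself. `ToyHamming`, its `truncate` parameter, and `toy_hamming` below are illustrative stand-ins chosen for this example, not the library's actual code.

```python
class ToyHamming:
    """Illustrative stand-in, not the library's Hamming class."""

    def __init__(self, truncate=False):
        # truncate=True ignores the unpaired tail of the longer sequence.
        self.truncate = truncate

    def __call__(self, a, b):
        mismatches = sum(x != y for x, y in zip(a, b))
        if self.truncate:
            return mismatches
        # Unpaired trailing elements count as mismatches.
        return mismatches + abs(len(a) - len(b))


# Interface 1: the class, configured with algorithm-specific params.
print(ToyHamming(truncate=True)("test", "tex"))  # 1

# Interface 2: a ready-made instance with default params.
toy_hamming = ToyHamming()
print(toy_hamming("test", "text"))  # 1
print(toy_hamming("test", "tex"))   # 2
```

Exposing a pre-built default instance keeps the common case to one call, while the class stays available when parameters need to change.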
All algorithms have some common methods:
- .distance(*sequences) -- calculate distance between sequences.
- .similarity(*sequences) -- calculate similarity for sequences.
- .maximum(*sequences) -- maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
- .normalized_distance(*sequences) -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal and 1 means totally different.
- .normalized_similarity(*sequences) -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different and 1 means equal.
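How these five methods relate to each other can be shown with a small, library-free sketch built around Hamming distance. The free functions below are hypothetical and take exactly two sequences, whereas the library's methods accept any number.

```python
def distance(a, b):
    # Hamming distance: mismatched positions plus the length difference.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def maximum(a, b):
    # Largest value the distance (or similarity) can take for this pair.
    return max(len(a), len(b))


def similarity(a, b):
    # Follows from the identity distance + similarity == maximum.
    return maximum(a, b) - distance(a, b)


def normalized_distance(a, b):
    m = maximum(a, b)
    # Two empty sequences are treated as equal.
    return distance(a, b) / m if m else 0.0


def normalized_similarity(a, b):
    return 1.0 - normalized_distance(a, b)


print(distance("test", "text"))               # 1
print(similarity("test", "text"))             # 3
print(normalized_distance("test", "text"))    # 0.25
print(normalized_similarity("test", "text"))  # 0.75
```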
Most common init arguments:
- qval -- q-value for splitting sequences into q-grams. Possible values:
  - 1 (default) -- compare sequences by chars.
  - 2 or more -- transform sequences to q-grams.
  - None -- split sequences by words.
- as_set -- for token-based algorithms:
  - True -- t and ttt are equal.
  - False (default) -- t and ttt are different.
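The effect of `qval` and `as_set` can be illustrated with a small, library-free sketch. The `tokenize` and `jaccard` helpers are hypothetical; they mimic the documented behaviour of these arguments using a multiset Jaccard index and are not the library's implementation.

```python
from collections import Counter


def tokenize(seq, qval=1):
    # qval=None splits by words; otherwise build overlapping q-grams.
    if qval is None:
        return seq.split()
    return [seq[i:i + qval] for i in range(len(seq) - qval + 1)]


def jaccard(a, b, qval=1, as_set=False):
    ta, tb = tokenize(a, qval), tokenize(b, qval)
    if as_set:
        # Ignore how many times each token occurs.
        ta, tb = set(ta), set(tb)
    ca, cb = Counter(ta), Counter(tb)
    intersection = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return intersection / union if union else 1.0


print(tokenize("test", qval=2))             # ['te', 'es', 'st']
print(tokenize("to be or not", qval=None))  # ['to', 'be', 'or', 'not']
print(jaccard("t", "ttt", as_set=True))     # 1.0
print(jaccard("t", "ttt", as_set=False) < 1.0)  # True
```

With `as_set=True` repeated tokens collapse, so `t` and `ttt` become identical; with counts preserved they differ.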
For example, Hamming distance:
import textdistance
textdistance.hamming('test', 'text')  # 1
textdistance.hamming.distance('test', 'text')  # 1
textdistance.hamming.similarity('test', 'text')  # 3
textdistance.hamming.normalized_distance('test', 'text')  # 0.25
textdistance.hamming.normalized_similarity('test', 'text')  # 0.75
textdistance.Hamming(qval=2).distance('test', 'text')  # 2
All other algorithms have the same interface.