fuzzysearch

Fuzzy search: Find parts of long text or data, allowing for some changes/typos.

Highly optimized, simple to use, does one thing well.

>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

- Two simple functions to use: one for in-memory data and one for files
  - Fastest search algorithm is chosen automatically
- Levenshtein Distance metric with configurable parameters
  - Separately configure the max. allowed distance, substitutions, deletions and/or insertions
- Advanced algorithms with optional C and Cython optimizations
- Properly handles Unicode; special optimizations for binary data
- Simple installation:
  - pip install fuzzysearch just works
  - pure-Python fallbacks for compiled modules
  - only one dependency (attrs)
- Extensively tested
- Free software: MIT license

For more info, see the documentation.

How is this different than FuzzyWuzzy or RapidFuzz?

The main difference is that fuzzysearch searches for fuzzy matches through long texts or data. FuzzyWuzzy and RapidFuzz, on the other hand, are intended for fuzzy comparison of pairs of strings, identifying how closely they match according to some metric such as the Levenshtein distance.

These are very different use-cases, and the solutions are very different as well.
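
To make the distinction concrete, here is a minimal pure-Python sketch using only the standard library. It is not how fuzzysearch, FuzzyWuzzy or RapidFuzz are implemented; the function names `levenshtein` and `best_substring_distance` are made up for this illustration. The first function compares two whole strings, while the second lets a match start and end anywhere inside a longer text.

```python
def levenshtein(a, b):
    """Edit distance between two whole strings (pairwise comparison)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb),  # match or substitute
                            prev[j] + 1,               # skip a character of a
                            curr[j - 1] + 1))          # skip a character of b
        prev = curr
    return prev[-1]


def best_substring_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text.

    Returns (distance, end_index) for the first end position reaching it.
    """
    # A row of zeros means a match may begin at any position at no cost.
    prev = [0] * (len(text) + 1)
    for i, cp in enumerate(pattern, start=1):
        curr = [i]
        for j, ct in enumerate(text, start=1):
            curr.append(min(prev[j - 1] + (cp != ct),
                            prev[j] + 1,
                            curr[j - 1] + 1))
        prev = curr
    dist = min(prev)
    return dist, prev.index(dist)


print(levenshtein('PATTERN', '---PATERN---'))              # 7
print(best_substring_distance('PATTERN', '---PATERN---'))  # (1, 9)
```

Compared as whole strings, the two inputs are 7 edits apart, mostly because of the surrounding dashes. Searched as a sub-sequence, the pattern is found with a single edit, ending at index 9.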

How is this different than ElasticSearch and Lucene?

The main difference is that fuzzysearch does no indexing or other preparations; it directly searches through the given text or data for a given sub-string. Therefore, it is much simpler to use compared to systems based on text indexing.

Installation

fuzzysearch supports Python versions 3.8+, as well as PyPy 3.9 and 3.10.

$ pip install fuzzysearch

This will work even if installing the C and Cython extensions fails, using pure-Python fallbacks.
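
The general fallback pattern can be sketched as follows. This is only an illustration of the common try/except import idiom, not fuzzysearch's actual module layout; the module name `_hypothetical_fast_ext` and the function `count_mismatches` are invented for the example.

```python
try:
    # Hypothetical compiled extension; this module name is made up and
    # is not expected to exist, so the fallback below gets used.
    from _hypothetical_fast_ext import count_mismatches
except ImportError:
    def count_mismatches(a, b):
        """Pure-Python fallback: count positions where two sequences differ."""
        return sum(x != y for x, y in zip(a, b))


print(count_mismatches('PATTERN', 'PAT-ERN'))  # 1
```

Callers use the same name either way, so the choice between the compiled and the pure-Python version stays invisible to them.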

Usage

Just call find_near_matches() with the sub-sequence you're looking for, the sequence to search, and the matching parameters:

>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

To search in a file, use find_near_matches_in_file():

>>> from fuzzysearch import find_near_matches_in_file
>>> with open('data_file', 'rb') as f:
...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

Examples

fuzzysearch is great for ad-hoc searches of genetic data, such as DNA or protein sequences, before reaching for more complex tools:

>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACATTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACACAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAGGGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT'  # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

BioPython sequences are also supported:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> sequence = Seq('''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACATTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACACAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAGGGGATAGG''', IUPAC.unambiguous_dna)
>>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

Matching Criteria

The search function supports four possible match criteria, which may be supplied in any combination:

- maximum Levenshtein distance (max_l_dist)
- maximum # of substitutions (max_substitutions)
- maximum # of deletions (max_deletions; "delete" = skip a character in the sub-sequence)
- maximum # of insertions (max_insertions; "insert" = skip a character in the sequence)

Not supplying a criterion means that there is no limit for it. For this reason, one must always supply max_l_dist and/or all other criteria.

>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]

# this will not match since max-deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]

# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1, matched="PAT-ERN")]  # the Levenshtein distance is still 1

# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2, matched="PATERRN")]
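
How the separate limits interact can be sketched in plain Python. The helper below is hypothetical (the name `fits_limits` is invented here) and is not fuzzysearch's algorithm: it only checks whether a sub-sequence can be turned into one given candidate string within per-operation budgets, using "delete" and "insert" in the sense defined above. It assumes all three limits are given as integers and leaves out max_l_dist for brevity.

```python
from functools import lru_cache


def fits_limits(pattern, candidate, max_subs, max_ins, max_dels):
    """True if some alignment of pattern to candidate stays within all budgets."""

    @lru_cache(maxsize=None)
    def go(i, j, subs, ins, dels):
        # i, j: how many characters of pattern / candidate are consumed so far;
        # subs, ins, dels: remaining budget for each operation.
        if i == len(pattern) and j == len(candidate):
            return True
        if i < len(pattern) and j < len(candidate):
            if pattern[i] == candidate[j] and go(i + 1, j + 1, subs, ins, dels):
                return True  # exact character match, no cost
            if subs > 0 and go(i + 1, j + 1, subs - 1, ins, dels):
                return True  # substitution
        if dels > 0 and i < len(pattern) and go(i + 1, j, subs, ins, dels - 1):
            return True  # deletion: skip a character in the sub-sequence
        if ins > 0 and j < len(candidate) and go(i, j + 1, subs, ins - 1, dels):
            return True  # insertion: skip a character in the sequence
        return False

    return go(0, 0, max_subs, max_ins, max_dels)


print(fits_limits('PATTERN', 'PATERN', 0, 0, 1))   # True: one deletion
print(fits_limits('PATTERN', 'PATERN', 1, 1, 0))   # False: a deletion is required
print(fits_limits('PATTERN', 'PAT-ERN', 0, 1, 1))  # True: deletion + insertion
print(fits_limits('PATTERN', 'PATERRN', 0, 1, 1))  # True: deletion + insertion
```

The second call shows why forbidding deletions rules out "PATERN": the candidate is one character shorter than the sub-sequence, so no combination of substitutions and insertions can account for the difference.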