- Notifications
You must be signed in to change notification settings - Fork136
Python module (C extension and plain python) implementing Aho-Corasick algorithm
License
WojciechMula/pyahocorasick
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
pyahocorasick is a fast and memory efficient library for exact or approximatemulti-pattern string search meaning that you can find multiple key stringsoccurrences at once in some input text. The strings "index" can be built aheadof time and saved (as a pickle) to disk to reload and reuse later. The libraryprovides an ahocorasick Python module that you can use as a plain dict-likeTrie or convert a Trie to an automaton for efficient Aho-Corasick search.
pyahocorasick is implemented in C and tested on Python 3.9 and up.It works on 64 bits Linux, macOS and Windows.
Thelicense is BSD-3-Clause. Some utilities, such as tests and the pure Pythonautomaton are dedicated to the Public Domain.
Many thanks for this package. Wasn't sure where to leave a thank you note butthis package is absolutely fantastic in our application where we have a libraryof 100k+ CRISPR guides that we have to count in a stream of millions of DNAsequencing reads. This package does it faster than the previous C program weused for the purpose and helps us stick to just Python code in our pipeline.
Miika (AstraZeneca Functional Genomics Centre)#145
You can fetchpyahocorasick from:
- GitHubhttps://github.com/WojciechMula/pyahocorasick/
- Pypihttps://pypi.python.org/pypi/pyahocorasick/
- Conda-Forgehttps://github.com/conda-forge/pyahocorasick-feedstock/
Thedocumentation is published athttps://pyahocorasick.readthedocs.io/
This module is written in C. You need a C compiler installed to compile nativeCPython extensions. To install:
pip install pyahocorasick
Then create an Automaton:
>>> import ahocorasick>>> automaton = ahocorasick.Automaton()
You can use the Automaton class as a trie. Add some string keys and their associatedvalue to this trie. Here we associate a tuple of (insertion index, original string)as a value to each key string we add to the trie:
>>> for idx, key in enumerate('he her hers she'.split()):... automaton.add_word(key, (idx, key))Then check if some string exists in the trie:
>>> 'he' in automatonTrue>>> 'HER' in automatonFalse
And play with theget() dict-like method:
>>> automaton.get('he')(0, 'he')>>> automaton.get('she')(3, 'she')>>> automaton.get('cat', 'not exists')'not exists'>>> automaton.get('dog')Traceback (most recent call last): File "<stdin>", line 1, in <module>KeyErrorNow convert the trie to an Aho-Corasick automaton to enable Aho-Corasick search:
>>> automaton.make_automaton()
Then search all occurrences of the keys (the needles) in an input string (our haystack).
Here we print the results and just check that they are correct. TheAutomaton.iter() method return the results as two-tuples of the end index where atrie key was found in the input string and the associated value for this key. Herewe had stored as values a tuple with the original string and its trie insertionorder:
>>> for end_index, (insert_order, original_value) in automaton.iter(haystack):... start_index = end_index - len(original_value) + 1... print((start_index, end_index, (insert_order, original_value)))... assert haystack[start_index:start_index + len(original_value)] == original_value...(1, 2, (0, 'he'))(1, 3, (1, 'her'))(1, 4, (2, 'hers'))(4, 6, (3, 'she'))(5, 6, (0, 'he'))
You can also create an eventually large automaton ahead of time and pickle it tore-load later. Here we just pickle to a string. You would typically pickle to afile instead:
>>> import pickle>>> pickled = pickle.dumps(automaton)>>> B = pickle.loads(pickled)>>> B.get('he')(0, 'he')See also:
- FAQ and Who is using pyahocorasick?https://github.com/WojciechMula/pyahocorasick/wiki/FAQ#who-is-using-pyahocorasick
The full documentation including the API overview and reference is published onreadthedocs.
Overview
With anAho-Corasick automatonyou can efficiently search all occurrences of multiple strings (the needles) in aninput string (the haystack) making a single pass over the input string. Withpyahocorasick you can eventually build large automatons and pickle them to reusethem over and over as an indexed structure for fast multi pattern string matching.
One of the advantages of an Aho-Corasick automaton is that the typical worst-caseand best-caseruntimes are about the same and depends primarily on the sizeof the input string and secondarily on the number of matches returned. Whilethis may not be the fastest string search algorithm in all cases, it can searchfor multiple strings at once and its runtime guarantees make it rather unique.Because pyahocorasick is based on a Trie, it stores redundant keys prefixes onlyonce using memory efficiently.
A drawback is that it needs to be constructed and "finalized" ahead of timebefore you can search strings. In several applications where you search forseveral pre-defined "needles" in a variable "haystacks" this is actually anadvantage.
Aho-Corasick automatons are commonly used for fast multi-pattern matchingin intrusion detection systems (such as snort), anti-viruses and many otherapplications that need fast matching against a pre-defined set of string keys.
Internally an Aho-Corasick automaton is typically based on a Trie with extradata for failure links and an implementation of the Aho-Corasick searchprocedure.
Behind the scenes thepyahocorasick Python library implements these two datastructures: aTrie and an Aho-Corasickstring matching automaton. Both are exposed through the Automaton class.
In addition to Trie-like and Aho-Corasick methods and data structures,pyahocorasick also implements dict-like methods: The pyahocorasickAutomaton is aTrie a dict-like structure indexed by string keys eachassociated with a value object. You can use this to retrieve an associated valuein a time proportional to a string key length.
pyahocorasick is available in two flavors:
- a CPythonC-based extension, compatible with Python 3 only. Use olderversion 1.4.x for Python 2.7.x and 32 bits support.
- a simpler pure Python module, compatible with Python 2 and 3. This is onlyavailable in the source repository (not on Pypi) under the etc/py/ directoryand has a slightly different API.
The type of strings accepted and returned byAutomaton methods are eitherunicode orbytes, depending on a compile time settings (preprocessordefinition ofAHOCORASICK_UNICODE as set in setup.py).
TheAutomaton.unicode attributes can tell you how the library was built.On Python 3, unicode is the default.
Warning
When the library is built with unicode support, an Automaton will store 2 or4 bytes per letter, depending on your Python installation. When builtfor bytes, only one byte per letter is needed.
To install for common operating systems, use pip. Pre-built wheels should beavailable on Pypi at some point in the future:
pip install pyahocorasick
To build from sources you need to have a C compiler installed and configured whichshould be standard on Linux and easy to get on MacOSX.
To build from sources, clone the git repository or download and extract the sourcearchive.
Install pip (and its setuptools companion) and then run (in a virtualenv ofcourse!):
pip install .
If compilation succeeds, the module is ready to use.
Support is available through theGitHub issue tracker to report bugs or askquestions.
You can submit contributions throughGitHub pull requests.
- There is a Makefile with a default target that builds and runs tests.
- The tests can run with a pip installe -e .[testing] && pytest -vvs
- See also the .github directory for CI tests and workflow
The initial author and maintainer is Wojciech Muła.Philippe Ombredanne is Wojciech's sidekick and helps maintaining,and rewrote documentation, setup CI servers and did a some work to make thismodule more accessible to end users.
Alphabetic list of authors and contributors:
- Andrew Grigorev
- Ayan Mahapatra
- Bogdan
- David Woakes
- Edward Betts
- Frankie Robertson
- Frederik Petersen
- gladtosee
- INADA Naoki
- Jan Fan
- Pastafarianist
- Philippe Ombredanne
- Renat Nasyrov
- Sylvain Zimmer
- Xiaopeng Xu
and many others!
This library would not be possible without help of many people, who contributed invarious ways.They createdpull requests,reported bugs asGitHub issuesor via direct messages, proposed fixes, or spent their valuable time on testing.
Thank you.
This library is licensed under very liberalBSD-3-Clause license. Someportions of the code are dedicated to the public domain such as the pure Pythonautomaton and test code.
Full text of license is available in LICENSE file.
Whilepyahocorasick tries to be the finest and fastest Aho Corasick libraryfor Python you may consider these other libraries:
- py_aho_corasick by Jan
- Written in pure Python.
- Poor performance.
- ahocorapy by abusix
- Written in pure Python.
- Better performance than py-aho-corasick.
- Using pypy, ahocorapy's search performance is only slightly worse than pyahocorasick's.
- Performs additional suffix shortcutting (more setup overhead, less search overhead for suffix lookups).
- Includes visualization tool for resulting automaton (using pygraphviz).
- MIT-licensed, 100% test coverage, tested on all major python versions (+ pypy)
- noaho by Jeff Donner
- Written in C. Does not return overlapping matches.
- Does not compile on Windows (July 2016).
- No support for the pickle protocol.
- acora by Stefan Behnel
- Written in Cython.
- Large automaton may take a long time to build (July 2016)
- No support for a dict-like protocol to associate a value to a string key.
- ahocorasick by Danny Yoo
- Written in C.
- seems unmaintained (last update in 2005).
- GPL-licensed.
About
Python module (C extension and plain python) implementing Aho-Corasick algorithm
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.