pyahocorasick

Appveyor Windows Master branch tests statusGitHub Action build on test -  Master branch statusDocumentation Status

pyahocorasick is a fast and memory efficient library for exact or approximatemulti-pattern string search meaning that you can find multiple key stringsoccurrences at once in some input text. The strings “index” can be built aheadof time and saved (as a pickle) to disk to reload and reuse later. The libraryprovides anahocorasick Python module that you can use as a plain dict-likeTrie or convert a Trie to an automaton for efficient Aho-Corasick search.

pyahocorasick is implemented in C and tested on Python 3.6 and up.It works on 64 bits Linux, macOS and Windows.

Thelicense is BSD-3-Clause. Some utilities, such as tests and the pure Pythonautomaton are dedicated to the Public Domain.

Testimonials

Many thanks for this package. Wasn’t sure where to leave a thank you note butthis package is absolutely fantastic in our application where we have a libraryof 100k+ CRISPR guides that we have to count in a stream of millions of DNAsequencing reads. This package does it faster than the previous C program weused for the purpose and helps us stick to just Python code in our pipeline.

Miika (AstraZeneca Functional Genomics Centre)https://github.com/WojciechMula/pyahocorasick/issues/145

Quick start

This module is written in C. You need a C compiler installed to compile nativeCPython extensions. To install:

pipinstallpyahocorasick

Then create an Automaton:

>>>importahocorasick>>>automaton=ahocorasick.Automaton()

You can use the Automaton class as a trie. Add some string keys and their associatedvalue to this trie. Here we associate a tuple of (insertion index, original string)as a value to each key string we add to the trie:

>>>foridx,keyinenumerate('he her hers she'.split()):...automaton.add_word(key,(idx,key))

Then check if some string exists in the trie:

>>>'he'inautomatonTrue>>>'HER'inautomatonFalse

And play with theget() dict-like method:

>>>automaton.get('he')(0, 'he')>>>automaton.get('she')(3, 'she')>>>automaton.get('cat','not exists')'not exists'>>>automaton.get('dog')Traceback (most recent call last):  File"<stdin>", line1, in<module>KeyError

Now convert the trie to an Aho-Corasick automaton to enable Aho-Corasick search:

>>>automaton.make_automaton()

Then search all occurrences of the keys (the needles) in an input string (our haystack).

Here we print the results and just check that they are correct. TheAutomaton.iter() method return the results as two-tuples of theend index where atrie key was found in the input string and the associatedvalue for this key. Herewe had stored as values a tuple with the original string and its trie insertionorder:

>>>forend_index,(insert_order,original_value)inautomaton.iter(haystack):...start_index=end_index-len(original_value)+1...print((start_index,end_index,(insert_order,original_value)))...asserthaystack[start_index:start_index+len(original_value)]==original_value...(1, 2, (0, 'he'))(1, 3, (1, 'her'))(1, 4, (2, 'hers'))(4, 6, (3, 'she'))(5, 6, (0, 'he'))

You can also create an eventually large automaton ahead of time andpickle it tore-load later. Here we just pickle to a string. You would typically pickle to afile instead:

>>>importpickle>>>pickled=pickle.dumps(automaton)>>>B=pickle.loads(pickled)>>>B.get('he')(0, 'he')

See also:

Documentation

The full documentation including the API overview and reference is published onreadthedocs.

Overview

With anAho-Corasick automatonyou can efficiently search all occurrences of multiple strings (the needles) in aninput string (the haystack) making a single pass over the input string. Withpyahocorasick you can eventually build large automatons and pickle them to reusethem over and over as an indexed structure for fast multi pattern string matching.

One of the advantages of an Aho-Corasick automaton is that the typical worst-caseand best-caseruntimes are about the same and depends primarily on the sizeof the input string and secondarily on the number of matches returned. Whilethis may not be the fastest string search algorithm in all cases, it can searchfor multiple strings at once and its runtime guarantees make it rather unique.Because pyahocorasick is based on a Trie, it stores redundant keys prefixes onlyonce using memory efficiently.

A drawback is that it needs to be constructed and “finalized” ahead of timebefore you can search strings. In several applications where you search forseveral pre-defined “needles” in a variable “haystacks” this is actually anadvantage.

Aho-Corasick automatons are commonly used for fast multi-pattern matchingin intrusion detection systems (such as snort), anti-viruses and many otherapplications that need fast matching against a pre-defined set of string keys.

Internally an Aho-Corasick automaton is typically based on a Trie with extradata for failure links and an implementation of the Aho-Corasick searchprocedure.

Behind the scenes thepyahocorasick Python library implements these two datastructures: aTrie and an Aho-Corasickstring matching automaton. Both are exposed through theAutomaton class.

In addition to Trie-like and Aho-Corasick methods and data structures,pyahocorasick also implements dict-like methods: The pyahocorasickAutomaton is aTrie a dict-like structure indexed by string keys eachassociated with a value object. You can use this to retrieve an associated valuein a time proportional to a string key length.

pyahocorasick is available in two flavors:

  • a CPythonC-based extension, compatible with Python 3 only. Use olderversion 1.4.x for Python 2.7.x and 32 bits support.
  • a simpler pure Python module, compatible with Python 2 and 3. This is onlyavailable in the source repository (not on Pypi) under the etc/py/ directoryand has a slightly different API.

Unicode and bytes

The type of strings accepted and returned byAutomaton methods are eitherunicode orbytes, depending on a compile time settings (preprocessordefinition ofAHOCORASICK_UNICODE as set insetup.py).

TheAutomaton.unicode attributes can tell you how the library was built.On Python 3, unicode is the default.

Warning

When the library is built with unicode support, an Automaton will store 2 or4 bytes per letter, depending on your Python installation. When builtfor bytes, only one byte per letter is needed.

Build and install from PyPi

To install for common operating systems, use pip. Pre-built wheels should beavailable on Pypi at some point in the future:

pipinstallpyahocorasick

To build from sources you need to have a C compiler installed and configured whichshould be standard on Linux and easy to get on MacOSX.

To build from sources, clone the git repository or download and extract the sourcearchive.

Installpip (and itssetuptools companion) and then run (in avirtualenv ofcourse!):

pipinstall.

If compilation succeeds, the module is ready to use.

Support

Support is available through theGitHub issue tracker to report bugs or askquestions.

Contributing

You can submit contributions throughGitHub pull requests.

  • There is a Makefile with a default target that builds and runs tests.
  • The tests can run with apip installe -e .[testing] && pytest -vvs
  • See also the .github directory for CI tests and workflow

Authors

The initial author and maintainer is Wojciech Muła.Philippe Ombredanne is Wojciech’s sidekick and helps maintaining,and rewrote documentation, setup CI servers and did a some work to make thismodule more accessible to end users.

Alphabetic list of authors:

  • Andrew Grigorev
  • Bogdan
  • David Woakes
  • Edward Betts
  • Frankie Robertson
  • Frederik Petersen
  • gladtosee
  • INADA Naoki
  • Jan Fan
  • Pastafarianist
  • Philippe Ombredanne
  • Renat Nasyrov
  • Sylvain Zimmer
  • Xiaopeng Xu

and many others!

This library would not be possible without help of many people, who contributed invarious ways.They createdpull requests,reported bugs asGitHub issuesor via direct messages, proposed fixes, or spent their valuable time on testing.

Thank you.

License

This library is licensed under very liberalBSD-3-Clause license. Someportions of the code are dedicated to the public domain such as the pure Pythonautomaton and test code.

Full text of license is available in LICENSE file.

Other Aho-Corasick implementations for Python you can consider

Whilepyahocorasick tries to be the finest and fastest Aho Corasick libraryfor Python you may consider these other libraries:

  • Written in pure Python.
  • Poor performance.
  • Written in pure Python.
  • Better performance than py-aho-corasick.
  • Using pypy, ahocorapy’s search performance is only slightly worse than pyahocorasick’s.
  • Performs additional suffix shortcutting (more setup overhead, less search overhead for suffix lookups).
  • Includes visualization tool for resulting automaton (using pygraphviz).
  • MIT-licensed, 100% test coverage, tested on all major python versions (+ pypy)
  • Written in C. Does not return overlapping matches.
  • Does not compile on Windows (July 2016).
  • No support for the pickle protocol.
  • Written in Cython.
  • Large automaton may take a long time to build (July 2016)
  • No support for a dict-like protocol to associate a value to a string key.
  • Written in C.
  • seems unmaintained (last update in 2005).
  • GPL-licensed.

API Overview

This is a quick tour of the API for the Cahocorasick module.See the full API doc for more details. The pure Python module has a slightly different interface.

The moduleahocorasick contains a few constants and the mainAutomaton class.

Module constants

  • ahocorasick.unicode — seeUnicode and bytes
  • ahocorasick.STORE_ANY,ahocorasick.STORE_INTS,ahocorasick.STORE_LENGTH — seeAutomaton class
  • ahocorasick.KEY_STRINGahocorasick.KEY_SEQUENCE— seeAutomaton class
  • ahocorasick.EMPTY,ahocorasick.TRIE,ahocorasick.AHOCORASICK— seeAutomaton Attributes
  • ahocorasick.MATCH_EXACT_LENGTH,ahocorasick.MATCH_AT_MOST_PREFIX,ahocorasick.MATCH_AT_LEAST_PREFIX — see description of the keys method

Automaton class

Note:Automaton instances arepickle-ablemeaning that you can create ahead of time an eventually large automaton then save it to diskand re-load it later to reuse it over and over as a persistent multi-string search index.Internally, Automaton implements the__reduce__()magicmethod.

Automaton([value_type],[key_type])

Create a new empty Automaton optionally passing avalue_type to indicatewhat is the type of associated values (default to any Python object type).It can be one ofahocorasick.STORE_ANY,ahocorasick.STORE_INTS orahocorasick.STORE_LENGTH. In the last case the length of the key willbe stored in the automaton. The optional argumentkey_type can beahocorasick.KEY_STRING orahocorasick.KEY_SEQUENCE. In the lattercase keys will be tuples of integers. The size of integer depends on theversion and platform Python is running on, but for versions of Python >=3.3, it is guaranteed to be 32-bits.

Automaton Trie methods

The Automaton class has the following main trie-like methods:

add_word(key,[value])=>bool
Add akey string to the dict-like trie and associate this key with avalue.
remove_word(key)=>bool
Remove akey string from the dict-like trie.
pop(key)=>value
Remove akey string from the dict-like trie and return the associatedvalue.
exists(key)=>bool orkeyin...
Return True if the key is present in the trie.
match(key)=>bool
Return True if there is a prefix (or key) equal tokey present in the trie.

Automaton Dictionary-like methods

A pyahocorasick Automaton trie behaves more or less like a Python dictionary andimplements a subset of dict-like methods. Some of them are:

get(key[,default])
Return the value associated with thekey string. Similar todict.get().
keys([prefix,[wildcard,[how]]])=>yieldstrings
Return an iterator on keys.
values([prefix,[wildcard,[how]]])=>yieldobject
Return an iterator on values associated with each keys.
items([prefix,[wildcard,[how]]])=>yieldtuple(string,object)
Return an iterator on tuples of (key, value).

Wildcard search

The methodskeys,values anditems can be called with an optionalwildcard. A wildcard character is equivalent to a question mark used in globpatterns (?) or a dot (.) in regular expressions. You can use any character youlike as a wildcard.

Note that it is not possible to escape a wildcard to match it exactly.You need instead to select another wildcard character not present in theprovided prefix. For example:

automaton.keys("hi?","?")# would match "him", "his"automaton.keys("XX?","X")# would match "me?", "he?" or "it?"

Aho-Corasick methods

The Automaton class has the following main Aho-Corasick methods:

make_automaton()
Finalize and create the Aho-Corasick automaton.
iter(string,[start,[end]])
Perform the Aho-Corasick search procedure using the provided inputstring.Return an iterator of tuples (end_index, value) for keys found in string.
iter_long(string,[start,[end]])
Returns iterator (object of class AutomatonSearchIterLong) thatsearches for longest, non-overlapping matches.

AutomatonSearchIter class

Instances of this class are returned by theiter method of anAutomaton.This iterator can be manipulated through itsset() method.

set(string,[reset])=>None

Set a new string to search eventually keeping the current Automaton state tocontinue searching for the next chunk of a string.

For example:

>>>it=A.iter(b"")>>>whileTrue:...buffer=receive(server_address,4096)...ifnotbuffer:...break...it.set(buffer)...forindex,valueinit:...print(index,'=>',value)

Whenreset isTrue then processing is restarted. For example this code:

>>>forstringinstring_set:...forindex,valueinA.iter(string)...print(index,'=>',value)

does the same job as:

>>>it=A.iter(b"")>>>forstringinstring_set:...it.set(it,True)...forindex,valueinit:...print(index,'=>',value)

Automaton Attributes

The Automaton class has the following attributes:

kind [readonly]
Return the state of theAutomaton instance.
store [readonly]
Return the type of values stored in the Automaton as specified at creation.

Saving and loading automaton

There is support for two method of saving and loading an automaton:

  • the standardpickle protocol,
  • customsave andload methods.

While pickling is more convenient to use, it has quite high memoryrequirements. Thesave/load method try to overcome thisproblem.

Warning

Neither format of pickle nor save are safe. Although there area few sanity checks, they are not sufficient to detect allpossible input errors.

Pickle

importahocorasickimportpickle# build automatonA=ahocorasick.Automaton()# ... A.add_data, A.make_automaton# save current statewithopen(path,'wb')asf:pickle.dump(A,f)# load saved statewithopen(path,'rb')asf:B=pickle.load(f)

Save/load methods

importahocorasickimportpickle# build automatonA=ahocorasick.Automaton()# ... A.add_data, A.make_automaton# save current stateA.save(path,pickle.dumps)# load saved stateB=ahocorasick.load(path,pickle.loads)

Automaton methodsave requirespath to the file which will store data.If the automaton type isSTORE_ANY, i.e. values associated with words areany python objects, thensave requires also another argument, a callable.The callable serializes python object into bytes; in the example above weuse standard pickledumps function.

Module methodload also requirespath to file that has data previouslysaved. Because at the moment of loading data we don’t know what is the storeattribute of automaton, the second argument - a callable - is required. Thecallable must convert back given bytes object into python value, that will bestored in automaton. Similarly, standardpickle.loads function can be passed.

Other Automaton methods

The Automaton class has a few other interesting methods:

dump()=>(listofnodes,listofedges,listoffaillinks)
Return a three-tuple of lists describing the Automaton as a graph of(nodes, edges, failure links).The source repository and source package also contains theetc/dump2dot.pyscript that convertsdump() results to agraphviz dotformat for convenient visualization of the trie and Automaton data structure.
get_stats()=>dict
Return a dictionary containing Automaton statistics.Note that the real size occupied by the data structure could be larger becauseofinternal memory fragmentationthat can occur in a memory manager.
__sizeof__()=>int
Return the approximate size in bytes occupied by the Automaton instance.Also available by calling sys.getsizeof(automaton instance).

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>># add some key words to trie>>>forindex,wordinenumerate('he her hers she'.split()):...A.add_word(word,(index,word))>>># test that these key words exists in the trie all right>>>'he'inATrue>>>'HER'inAFalse>>>A.get('he')(0, 'he')>>>A.get('she')(3, 'she')>>>A.get('cat','<not exists>')'<not exists>'>>>A.get('dog')Traceback (most recent call last):  File"<stdin>", line1, in<module>KeyError>>>A.remove_word('he')True>>>A.remove_word('he')False>>>A.pop('she')(3, 'she')>>>'she'inAFalse>>># convert the trie in an Aho-Corasick automaton>>>A=ahocorasick.Automaton()>>>forindex,wordinenumerate('he her hers she'.split()):...A.add_word(word,(index,word))>>>A.make_automaton()>>># then find all occurrences of the stored keys in a string>>>foriteminA.iter('_hershe_'):...print(item)...(2, (0, 'he'))(3, (1, 'her'))(4, (2, 'hers'))(6, (3, 'she'))(6, (0, 'he'))

Example of the keys method behavior

>>>importahocorasick>>>A=ahocorasick.Automaton()>>># add some key words to trie>>>forindex,wordinenumerate('cat catastropha rat rate bat'.split()):...A.add_word(word,(index,word))>>># Search some prefix>>>list(A.keys('cat'))['cat', 'catastropha']>>># Search with a wildcard: here '?' is used as a wildcard. You can use any character you like.>>>list(A.keys('?at','?',ahocorasick.MATCH_EXACT_LENGTH))['bat', 'cat', 'rat']>>>list(A.keys('?at?','?',ahocorasick.MATCH_AT_MOST_PREFIX))['bat', 'cat', 'rat', 'rate']>>>list(A.keys('?at?','?',ahocorasick.MATCH_AT_LEAST_PREFIX))['rate']

API Reference

Automaton(value_type=ahocorasick.STORE_ANY, [key_type])

Create a new empty Automaton. Bothvalue_type andkey_type are optional.

value_type is one of these constants:

  • ahocorasick.STORE_ANY [default] : The associated value can be any Python object.
  • ahocorasick.STORE_LENGTH : The length of an added string key is automaticallyused as the associated value stored in the trie for that key.
  • ahocorasick.STORE_INTS : The associated value must be a 32-bit integer.

key_type defines the type of data that can be stored in an automaton; it is one ofthese constants and defines type of data might be stored:

  • ahocorasick.KEY_STRING [default] : string
  • ahocorasick.KEY_SEQUENCE : sequences of integers; The size of integer dependsthe version and platform Python, but for versions of Python >= 3.3, it isguaranteed to be 32-bits.

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A<ahocorasick.Automaton object at 0x7f1da1bdf7f0>>>>B=ahocorasick.Automaton(ahocorasick.STORE_ANY)>>>B<ahocorasick.Automaton object at 0x7f1da1bdf6c0>>>>C=ahocorasick.Automaton(ahocorasick.STORE_INTS,ahocorasick.KEY_STRING)>>>C<ahocorasick.Automaton object at 0x7f1da1527f10>

add_word(key, [value]) -> boolean

Add a key string to the dict-like trie and associate this key with a value.value is optional or mandatory depending how theAutomaton instance wascreated. Return True if the word key is inserted and did not exists in thetrie or False otherwise. The value associated with an existing word is replaced.

The value is either mandatory or optional:

  • If the Automaton was created without argument (the default) asAutomaton()or withAutomaton(ahocorasick.STORE_ANY) then the value is required and canbe any Python object.
  • If the Automaton was created withAutomaton(ahocorasick.STORE_INTS) then thevalue is optional. If provided it must be an integer, otherwise it defaults tolen(automaton) which is therefore the order index in which keys are addedto the trie.
  • If the Automaton was created withAutomaton(ahocorasick.STORE_LENGTH) thenassociating a value is not allowed -len(word) is saved automatically asa value instead.

Calling add_word() invalidates all iterators only if the new key did not existin the trie so far (i.e. the method returned True).

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("pyahocorasick")Traceback (most recent call last):  File"<stdin>", line1, in<module>ValueError:A value object is required as second argument.>>>A.add_word("pyahocorasick",(42,'text'))True>>>A.get("pyhocorasick")(42, 'text')>>>A.add_word("pyahocorasick",12)False>>>A.get("pyhocorasick")12
>>>importahocorasick>>>B=ahocorasick.Automaton(ahocorasick.STORE_INTS)>>>B.add_word("cat")True>>>B.get()Traceback (most recent call last):  File"<stdin>", line1, in<module>IndexError:tuple index out of range>>>B.get("cat")1>>>B.add_word("dog")True>>>B.get("dog")2>>>B.add_word("tree",42)True>>>B.get("tree")42>>>B.add_word("cat",43)False>>>B.get("cat")43

exists(key) -> boolean

Return True if thekey is present in the trie. Same as using the ‘in’ keyword.

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("cat",1)True>>>A.exists("cat")True>>>A.exists("dog")False>>>'elephant'inAFalse>>>'cat'inATrue

get(key[, default])

Return the value associated with the key string.

Raise aKeyError exception if the key is not in the trie and no default is provided.

Return the optional default value if provided and the key is not in the trie.

Example

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("cat",42)True>>>A.get("cat")42>>>A.get("dog")Traceback (most recent call last):  File"<stdin>", line1, in<module>KeyError>>>A.get("dog","good dog")'good dog'

longest_prefix(string) => integer

Return the length of the longest prefix of string that exists in the trie.

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("he",True)True>>>A.add_word("her",True)True>>>A.add_word("hers",True)True>>>A.longest_prefix("she")0>>>A.longest_prefix("herself")4

match(key) -> bool

Return True if there is a prefix (or key) equal to key present in the trie.

For example if the key ‘example’ has been added to the trie, then calls tomatch(‘e’), match(‘ex’), …, match(‘exampl’) or match(‘example’) all returnTrue. But exists() is True only when calling exists(‘example’).

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("example",True)True>>>A.match("e")True>>>A.match("ex")True>>>A.match("exa")True>>>A.match("exam")True>>>A.match("examp")True>>>A.match("exampl")True>>>A.match("example")True>>>A.match("examples")False>>>A.match("python")False

len() -> integer

Return the number of distinct keys added to the trie.

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>len(A)0>>>A.add_word("python",1)True>>>len(A)1>>>A.add_word("elephant",True)True>>>len(A)2

remove_word(word) -> bool

Remove given word from a trie. Return True if words was found, False otherwise.

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("cat",1)True>>>A.add_word("dog",2)True>>>A.remove_word("cat")True>>>A.remove_word("cat")False>>>A.remove_word("dog")True>>>A.remove_word("dog")False>>>

pop(word)

Remove given word from a trie and return associated values. Raise aKeyErrorif the word was not found.

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("cat",1)True>>>A.add_word("dog",2)True>>>A.pop("elephant")Traceback (most recent call last):  File"<stdin>", line1, in<module>KeyError>>>A.pop("cat")1>>>A.pop("dog")2>>>A.pop("cat")Traceback (most recent call last):  File"<stdin>", line1, in<module>KeyError

clear()

Remove all keys from the trie. This method invalidates all iterators.

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("cat",1)True>>>A.add_word("dog",2)True>>>A.add_word("elephant",3)True>>>len(A)3>>>A.clear()>>>len(A)0

keys([prefix, [wildcard, [how]]])

Return an iterator on keys.If the optionalprefix string is provided, only yield keys startingwith this prefix.

If the optionalwildcard is provided as a single character string,then the prefix is treated as a simple pattern using this characteras a wildcard.

The optionalhow argument is used to control how strings are matchedusing one of these possible values:

  • ahocorasick.MATCH_EXACT_LENGTH (default)Yield matches that have the same exact length as the prefix length.
  • ahocorasick.MATCH_AT_LEAST_PREFIXYield matches that have a length greater or equal to the prefix length.
  • ahocorasick.MATCH_AT_MOST_PREFIXYield matches that have a length lesser or equal to the prefix length.

items([prefix, [wildcard, [how]]])

Return an iterator on tuples of (key, value).Keys are matched optionally to the prefix using the same logicand arguments as in the keys() method.

values([prefix, [wildcard, [how]]])

Return an iterator on values associated with each keys.Keys are matched optionally to the prefix using the same logicand arguments as in the keys() method.

make_automaton()

Finalize and create the Aho-Corasick automaton based on the keys already addedto the trie. This does not require additional memory. After successful creationtheAutomaton.kind attribute is set toahocorasick.AHOCORASICK.

iter(string, [start, [end]], ignore_white_space=False)

Perform the Aho-Corasick search procedure using the provided input string.

Return an iterator of tuples (end_index,value) for keys found instring where:

  • end_index is the end index in the input string where a trie keystring was found.
  • value is the value associated with the found key string.

Thestart andend optional arguments can be used to limit the searchto an input string slice as instring[start:end].

Theignore_white_space optional arguments can be used to ignore whitespaces from input string.

iter_long(string, [start, [end]])

Perform the modified Aho-Corasick search procedure which matchesthe longest words from set.

Return an iterator of tuples (end_index,value) for keys found instring where:

  • end_index is the end index in the input string where a trie keystring was found.
  • value is the value associated with the found key string.

Thestart andend optional arguments can be used to limit the searchto an input string slice as instring[start:end].

Example

The default Aho-Corasick algorithm returns all occurrences of words storedin the automaton, including substring of other words from string. Methoditer_long reports only the longest match.

For set of words {“he”, “her”, “here”} and a needle “he here her” thedefault algorithm finds following words: “he”, “he”, “her”, “here”, “he”,“her”, while the modified one yields only: “he”, “here”, “her”.

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("he","he")True>>>A.add_word("her","her")True>>>A.add_word("here","here")True>>>A.make_automaton()>>>needle="he here her">>>list(A.iter_long(needle))[(1, 'he'), (6, 'here'), (10, 'her')]>>>list(A.iter(needle))[(1, 'he'), (4, 'he'), (5, 'her'), (6, 'here'), (9, 'he'), (10, 'her')]

find_all(string, callback, [start, [end]])

Perform the Aho-Corasick search procedure using the provided inputstringand iterate over the matching tuples (end_index,value) for keys foundin string. Invoke thecallback callable for each matching tuple.

The callback callable must accept two positional arguments:- end_index is the end index in the input string where a trie key string was found.- value is the value associated with the found key string.

The start and end optional arguments can be used to limit the search to aninput string slice as in string[start:end].

Equivalent to a loop on iter() calling a callable at each iteration.

__reduce__()

Return pickle-able data for this automaton instance.

save(path, serializer)

Save content of automaton in an on-disc file.

Serializer is a callable object that is used when automaton storetype isSTORE_ANY. This method converts a python object intobytes; it can bepickle.dumps.

load(path, deserializer) => Automaton

Load automaton previously stored on disc usingsave method.

Deserializer is a callable object which converts bytes back intopython object; it can bepickle.loads.

Return the approximate size in bytes occupied by the Automaton instance inmemory excluding the size of associated objects when the Automaton is createdwith Automaton() or Automaton(ahocorasick.STORE_ANY).

get_stats() -> dict

Return a dictionary containing Automaton statistics.

  • nodes_count - total number of nodes
  • words_count - number of distinct words (same aslen(automaton))
  • longest_word - length of the longest word
  • links_count - number of edges
  • sizeof_node - size of single node in bytes
  • total_size - total size of trie in bytes (aboutnodes_count * size_of node + links_count * size of pointer).

Examples

>>>importahocorasick>>>A=ahocorasick.Automaton()>>>A.add_word("he",None)True>>>A.add_word("her",None)True>>>A.add_word("hers",None)True>>>A.get_stats(){'nodes_count': 5, 'words_count': 3, 'longest_word': 4, 'links_count': 4, 'sizeof_node': 40, 'total_size': 232}

dump()

Return a three-tuple of lists describing the Automaton as a graphofnodes,edges,failure links.

  • nodes: each item is a pair (node id, end of word marker)
  • edges: each item is a triple (node id, label char, child node id)
  • failure links: each item is a pair (source node id, node if connectedby fail node)

For each of these, the node id is a unique number and a label isa number.

set(string, reset=False)

Set a new string to search. When the reset argument is False (default)then the Aho-Corasick procedure is continued and the internal state of theAutomaton and end index of the string being searched are not reset. This allowto search for large strings in multiple smaller chunks.