Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Python module (C extension and plain python) implementing Aho-Corasick algorithm

License

NotificationsYou must be signed in to change notification settings

WojciechMula/pyahocorasick

Repository files navigation

GitHub Action build on test -  Master branch statusDocumentation Status

pyahocorasick is a fast and memory efficient library for exact or approximatemulti-pattern string search meaning that you can find multiple key stringsoccurrences at once in some input text. The strings "index" can be built aheadof time and saved (as a pickle) to disk to reload and reuse later. The libraryprovides an ahocorasick Python module that you can use as a plain dict-likeTrie or convert a Trie to an automaton for efficient Aho-Corasick search.

pyahocorasick is implemented in C and tested on Python 3.9 and up.It works on 64 bits Linux, macOS and Windows.

Thelicense is BSD-3-Clause. Some utilities, such as tests and the pure Pythonautomaton are dedicated to the Public Domain.

Testimonials

Many thanks for this package. Wasn't sure where to leave a thank you note butthis package is absolutely fantastic in our application where we have a libraryof 100k+ CRISPR guides that we have to count in a stream of millions of DNAsequencing reads. This package does it faster than the previous C program weused for the purpose and helps us stick to just Python code in our pipeline.

Miika (AstraZeneca Functional Genomics Centre)#145

Download and source code

You can fetchpyahocorasick from:

Thedocumentation is published athttps://pyahocorasick.readthedocs.io/

Quick start

This module is written in C. You need a C compiler installed to compile nativeCPython extensions. To install:

pip install pyahocorasick

Then create an Automaton:

>>> import ahocorasick>>> automaton = ahocorasick.Automaton()

You can use the Automaton class as a trie. Add some string keys and their associatedvalue to this trie. Here we associate a tuple of (insertion index, original string)as a value to each key string we add to the trie:

>>> for idx, key in enumerate('he her hers she'.split()):...   automaton.add_word(key, (idx, key))

Then check if some string exists in the trie:

>>> 'he' in automatonTrue>>> 'HER' in automatonFalse

And play with theget() dict-like method:

>>> automaton.get('he')(0, 'he')>>> automaton.get('she')(3, 'she')>>> automaton.get('cat', 'not exists')'not exists'>>> automaton.get('dog')Traceback (most recent call last):  File "<stdin>", line 1, in <module>KeyError

Now convert the trie to an Aho-Corasick automaton to enable Aho-Corasick search:

>>> automaton.make_automaton()

Then search all occurrences of the keys (the needles) in an input string (our haystack).

Here we print the results and just check that they are correct. TheAutomaton.iter() method return the results as two-tuples of the end index where atrie key was found in the input string and the associated value for this key. Herewe had stored as values a tuple with the original string and its trie insertionorder:

>>> for end_index, (insert_order, original_value) in automaton.iter(haystack):...     start_index = end_index - len(original_value) + 1...     print((start_index, end_index, (insert_order, original_value)))...     assert haystack[start_index:start_index + len(original_value)] == original_value...(1, 2, (0, 'he'))(1, 3, (1, 'her'))(1, 4, (2, 'hers'))(4, 6, (3, 'she'))(5, 6, (0, 'he'))

You can also create an eventually large automaton ahead of time and pickle it tore-load later. Here we just pickle to a string. You would typically pickle to afile instead:

>>> import pickle>>> pickled = pickle.dumps(automaton)>>> B = pickle.loads(pickled)>>> B.get('he')(0, 'he')

See also:

Documentation

The full documentation including the API overview and reference is published onreadthedocs.

Overview

With anAho-Corasick automatonyou can efficiently search all occurrences of multiple strings (the needles) in aninput string (the haystack) making a single pass over the input string. Withpyahocorasick you can eventually build large automatons and pickle them to reusethem over and over as an indexed structure for fast multi pattern string matching.

One of the advantages of an Aho-Corasick automaton is that the typical worst-caseand best-caseruntimes are about the same and depends primarily on the sizeof the input string and secondarily on the number of matches returned. Whilethis may not be the fastest string search algorithm in all cases, it can searchfor multiple strings at once and its runtime guarantees make it rather unique.Because pyahocorasick is based on a Trie, it stores redundant keys prefixes onlyonce using memory efficiently.

A drawback is that it needs to be constructed and "finalized" ahead of timebefore you can search strings. In several applications where you search forseveral pre-defined "needles" in a variable "haystacks" this is actually anadvantage.

Aho-Corasick automatons are commonly used for fast multi-pattern matchingin intrusion detection systems (such as snort), anti-viruses and many otherapplications that need fast matching against a pre-defined set of string keys.

Internally an Aho-Corasick automaton is typically based on a Trie with extradata for failure links and an implementation of the Aho-Corasick searchprocedure.

Behind the scenes thepyahocorasick Python library implements these two datastructures: aTrie and an Aho-Corasickstring matching automaton. Both are exposed through the Automaton class.

In addition to Trie-like and Aho-Corasick methods and data structures,pyahocorasick also implements dict-like methods: The pyahocorasickAutomaton is aTrie a dict-like structure indexed by string keys eachassociated with a value object. You can use this to retrieve an associated valuein a time proportional to a string key length.

pyahocorasick is available in two flavors:

  • a CPythonC-based extension, compatible with Python 3 only. Use olderversion 1.4.x for Python 2.7.x and 32 bits support.
  • a simpler pure Python module, compatible with Python 2 and 3. This is onlyavailable in the source repository (not on Pypi) under the etc/py/ directoryand has a slightly different API.

Unicode and bytes

The type of strings accepted and returned byAutomaton methods are eitherunicode orbytes, depending on a compile time settings (preprocessordefinition ofAHOCORASICK_UNICODE as set in setup.py).

TheAutomaton.unicode attributes can tell you how the library was built.On Python 3, unicode is the default.

Warning

When the library is built with unicode support, an Automaton will store 2 or4 bytes per letter, depending on your Python installation. When builtfor bytes, only one byte per letter is needed.

Build and install from PyPi

To install for common operating systems, use pip. Pre-built wheels should beavailable on Pypi at some point in the future:

pip install pyahocorasick

To build from sources you need to have a C compiler installed and configured whichshould be standard on Linux and easy to get on MacOSX.

To build from sources, clone the git repository or download and extract the sourcearchive.

Install pip (and its setuptools companion) and then run (in a virtualenv ofcourse!):

pip install .

If compilation succeeds, the module is ready to use.

Support

Support is available through theGitHub issue tracker to report bugs or askquestions.

Contributing

You can submit contributions throughGitHub pull requests.

  • There is a Makefile with a default target that builds and runs tests.
  • The tests can run with a pip installe -e .[testing] && pytest -vvs
  • See also the .github directory for CI tests and workflow

Authors

The initial author and maintainer is Wojciech Muła.Philippe Ombredanne is Wojciech's sidekick and helps maintaining,and rewrote documentation, setup CI servers and did a some work to make thismodule more accessible to end users.

Alphabetic list of authors and contributors:

  • Andrew Grigorev
  • Ayan Mahapatra
  • Bogdan
  • David Woakes
  • Edward Betts
  • Frankie Robertson
  • Frederik Petersen
  • gladtosee
  • INADA Naoki
  • Jan Fan
  • Pastafarianist
  • Philippe Ombredanne
  • Renat Nasyrov
  • Sylvain Zimmer
  • Xiaopeng Xu

and many others!

This library would not be possible without help of many people, who contributed invarious ways.They createdpull requests,reported bugs asGitHub issuesor via direct messages, proposed fixes, or spent their valuable time on testing.

Thank you.

License

This library is licensed under very liberalBSD-3-Clause license. Someportions of the code are dedicated to the public domain such as the pure Pythonautomaton and test code.

Full text of license is available in LICENSE file.

Other Aho-Corasick implementations for Python you can consider

Whilepyahocorasick tries to be the finest and fastest Aho Corasick libraryfor Python you may consider these other libraries:

  • Written in pure Python.
  • Poor performance.
  • Written in pure Python.
  • Better performance than py-aho-corasick.
  • Using pypy, ahocorapy's search performance is only slightly worse than pyahocorasick's.
  • Performs additional suffix shortcutting (more setup overhead, less search overhead for suffix lookups).
  • Includes visualization tool for resulting automaton (using pygraphviz).
  • MIT-licensed, 100% test coverage, tested on all major python versions (+ pypy)
  • Written in C. Does not return overlapping matches.
  • Does not compile on Windows (July 2016).
  • No support for the pickle protocol.
  • Written in Cython.
  • Large automaton may take a long time to build (July 2016)
  • No support for a dict-like protocol to associate a value to a string key.
  • Written in C.
  • seems unmaintained (last update in 2005).
  • GPL-licensed.

About

Python module (C extension and plain python) implementing Aho-Corasick algorithm

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors27


[8]ページ先頭

©2009-2025 Movatter.jp