🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. (Python wrapper for daachorse)
daac-tools/python-daachorse
daachorse is a fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. This is a Python wrapper.
Run the following command:

    $ pip install daachorse

To install from the repository instead, you need to install the Rust compiler following the documentation beforehand. daachorse uses `pyproject.toml`, so you also need to upgrade pip to version 19 or later.

    $ pip install --upgrade pip

After setting up the environment, you can install daachorse as follows:

    $ pip install git+https://github.com/daac-tools/python-daachorse

Daachorse contains several search options, ranging from basic matching with the Aho-Corasick algorithm to trickier matching. All of them run very fast based on the double-array data structure and can be easily plugged into your application as shown below.
To search for all occurrences of registered patterns, allowing positional overlap in the input text, use `find_overlapping()`. When you instantiate a new automaton, unique identifiers are assigned to the patterns in input order, starting from 0. Each match is a tuple of the start position, the end position (exclusive), and the pattern identifier, where positions are counted in characters.
    >>> import daachorse
    >>> patterns = ['bcd', 'ab', 'a']
    >>> pma = daachorse.Automaton(patterns)
    >>> pma.find_overlapping('abcd')
    [(0, 1, 2), (0, 2, 1), (1, 4, 0)]
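To make the reported tuples concrete, here is a brute-force sketch in plain Python that reproduces the result above. `naive_find_overlapping` is a hypothetical helper, not part of daachorse, and it checks every end position directly instead of running an Aho-Corasick automaton. The order of matches that share an end position is an assumption.

```python
def naive_find_overlapping(patterns, text):
    """Brute-force reference (hypothetical helper, not the daachorse API).

    Returns every occurrence as (start, end, pattern_id), ordered by end position.
    """
    matches = []
    for end in range(1, len(text) + 1):
        for pattern_id, pattern in enumerate(patterns):
            start = end - len(pattern)
            # Skip empty patterns and patterns that would start before the text.
            if pattern and start >= 0 and text[start:end] == pattern:
                matches.append((start, end, pattern_id))
    return matches


print(naive_find_overlapping(['bcd', 'ab', 'a'], 'abcd'))
# [(0, 1, 2), (0, 2, 1), (1, 4, 0)]
```

The sketch is quadratic in the worst case and is only meant for checking expectations on small inputs.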
If you do not want to allow positional overlap, use `find()` instead. It performs the search on the Aho-Corasick automaton and, in each iteration, reports the first pattern found, then resumes the search from the end of that match.
    >>> import daachorse
    >>> patterns = ['bcd', 'ab', 'a']
    >>> pma = daachorse.Automaton(patterns)
    >>> pma.find('abcd')
    [(0, 1, 2), (1, 4, 0)]
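The non-overlapping behavior can be sketched in plain Python as follows. `naive_find` is a hypothetical helper, not part of daachorse: it repeatedly takes the match with the smallest end position that starts at or after the current search position, then continues from that end. Choosing the longest pattern when several end at the same position is an assumption of this sketch.

```python
def naive_find(patterns, text):
    """Brute-force reference (hypothetical helper, not the daachorse API).

    Reports non-overlapping matches as (start, end, pattern_id), always taking
    the match that ends earliest at or after the current search position.
    """
    matches = []
    pos = 0
    while pos < len(text):
        best = None
        for end in range(pos + 1, len(text) + 1):
            candidates = [
                (end - len(p), end, i)
                for i, p in enumerate(patterns)
                if p and end - len(p) >= pos and text[end - len(p):end] == p
            ]
            if candidates:
                # Assumption: among matches sharing an end, prefer the longest.
                best = min(candidates)
                break
        if best is None:
            break
        matches.append(best)
        pos = best[1]  # resume after the reported match
    return matches


print(naive_find(['bcd', 'ab', 'a'], 'abcd'))
# [(0, 1, 2), (1, 4, 0)]
```

Here `ab` is not reported because `a` ends first, and the search then resumes at position 1.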
If you want to search for the longest pattern without positional overlap in each iteration, use `MATCH_KIND_LEFTMOST_LONGEST` in the construction.
    >>> import daachorse
    >>> patterns = ['ab', 'a', 'abcd']
    >>> pma = daachorse.Automaton(patterns, daachorse.MATCH_KIND_LEFTMOST_LONGEST)
    >>> pma.find('abcd')
    [(0, 4, 2)]
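As a rough illustration of leftmost-longest semantics, the plain-Python sketch below scans from left to right and, at the first position where any pattern starts, keeps the longest one. `naive_leftmost_longest` is a hypothetical helper, not part of daachorse, and it makes no attempt to match the library's performance.

```python
def naive_leftmost_longest(patterns, text):
    """Brute-force reference (hypothetical helper, not the daachorse API).

    At the leftmost position where any pattern starts, report the longest one
    as (start, end, pattern_id), then continue after it.
    """
    matches = []
    pos = 0
    while pos < len(text):
        # (length, pattern_id) for every pattern that starts exactly at `pos`.
        here = [(len(p), i) for i, p in enumerate(patterns)
                if p and text.startswith(p, pos)]
        if here:
            length, pattern_id = max(here, key=lambda c: c[0])
            matches.append((pos, pos + length, pattern_id))
            pos += length
        else:
            pos += 1
    return matches


print(naive_leftmost_longest(['ab', 'a', 'abcd'], 'abcd'))
# [(0, 4, 2)]
```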
If you want to find the earliest registered pattern among those starting from the search position, use `MATCH_KIND_LEFTMOST_FIRST`.
This is the so-called leftmost-first match, a slightly tricky search option. For example, in the following code, `ab` is reported because it is the earliest registered pattern.
    >>> import daachorse
    >>> patterns = ['ab', 'a', 'abcd']
    >>> pma = daachorse.Automaton(patterns, daachorse.MATCH_KIND_LEFTMOST_FIRST)
    >>> pma.find('abcd')
    [(0, 2, 0)]
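The leftmost-first rule can be sketched in plain Python to show how registration order decides the result. `naive_leftmost_first` is a hypothetical helper, not part of daachorse: at the first position where any pattern starts, it keeps the pattern with the smallest identifier.

```python
def naive_leftmost_first(patterns, text):
    """Brute-force reference (hypothetical helper, not the daachorse API).

    At the leftmost position where any pattern starts, report the one that was
    registered earliest as (start, end, pattern_id), then continue after it.
    """
    matches = []
    pos = 0
    while pos < len(text):
        # (pattern_id, length) for every pattern that starts exactly at `pos`.
        here = [(i, len(p)) for i, p in enumerate(patterns)
                if p and text.startswith(p, pos)]
        if here:
            pattern_id, length = min(here)  # smallest id = earliest registered
            matches.append((pos, pos + length, pattern_id))
            pos += length
        else:
            pos += 1
    return matches


print(naive_leftmost_first(['ab', 'a', 'abcd'], 'abcd'))
# [(0, 2, 0)]
```

Reordering the patterns to `['a', 'ab', 'abcd']` makes this sketch report `(0, 1, 0)` instead, since `a` is then registered first.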
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.