- Notifications
You must be signed in to change notification settings - Fork18
🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
License
Apache-2.0, MIT licenses found
Licenses found
daac-tools/daachorse
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.
The main technical ideas behind this library appear in the following paper:
Shunsuke Kanda, Koichi Akabe, and Yusuke Oda.Engineering faster double-array Aho-Corasick automata.Software: Practice and Experience (SPE),53(6): 1332–1361, 2023(arXiv)
A Python wrapper is also availablehere.
Daachorse is a crate for fast multiple pattern matching using theAho-Corasick algorithm, running in linear time overthe length of the input text. This crate uses thecompact double-array data structure for implementingthe pattern match automaton for time and memory efficiency. The data structure not only supportsconstant-time state-to-state traversal but also represents each state in the space of only 12bytes.
For example, compared to the NFA of theaho-corasickcrate, which is the most popular Aho-Corasick implementation in Rust, Daachorse can perform patternmatching3.0–5.2 times faster while consuming56–60% smaller memory when using a worddictionary of 675K patterns. Other experimental results are available onWiki.
Rust 1.61 or higher is required to build this crate.
Daachorse contains some search options, ranging from standard matching with the Aho-Corasickalgorithm to trickier matching. They will run very fast based on the double-array data structureand can be easily plugged into your application, as shown below.
To search for all occurrences of registered patterns that allow for positional overlap in the inputtext, usefind_overlapping_iter()
. When you usenew()
for construction, the library assigns aunique identifier to each pattern in the input order. The match result has the byte positions ofthe occurrence and its identifier.
use daachorse::DoubleArrayAhoCorasick;let patterns =vec!["bcd","ab","a"];let pma =DoubleArrayAhoCorasick::new(patterns).unwrap();letmut it = pma.find_overlapping_iter("abcd");let m = it.next().unwrap();assert_eq!((0,1,2),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((0,2,1),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((1,4,0),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());
If you do not want to allow positional overlap, usefind_iter()
instead.It performs the search on the Aho-Corasick automatonand reports patterns first found in each iteration.
use daachorse::DoubleArrayAhoCorasick;let patterns =vec!["bcd","ab","a"];let pma =DoubleArrayAhoCorasick::new(patterns).unwrap();letmut it = pma.find_iter("abcd");let m = it.next().unwrap();assert_eq!((0,1,2),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((1,4,0),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());
If you want to search for the longest pattern without positional overlap in each iteration, useleftmost_find_iter()
with specifyingMatchKind::LeftmostLongest
in the construction.
use daachorse::{DoubleArrayAhoCorasickBuilder,MatchKind};let patterns =vec!["ab","a","abcd"];let pma =DoubleArrayAhoCorasickBuilder::new().match_kind(MatchKind::LeftmostLongest).build(&patterns).unwrap();letmut it = pma.leftmost_find_iter("abcd");let m = it.next().unwrap();assert_eq!((0,4,2),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());
If you want to find the earliest registered pattern among ones starting from the search position,useleftmost_find_iter()
with specifyingMatchKind::LeftmostFirst
.
This is the so-calledleftmost first match, a tricky search option supported in theaho-corasick crate. For example, in the followingcode,ab
is reported because it is the earliest registered one.
use daachorse::{DoubleArrayAhoCorasickBuilder,MatchKind};let patterns =vec!["ab","a","abcd"];let pma =DoubleArrayAhoCorasickBuilder::new().match_kind(MatchKind::LeftmostFirst).build(&patterns).unwrap();letmut it = pma.leftmost_find_iter("abcd");let m = it.next().unwrap();assert_eq!((0,2,0),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());
To build the automaton from pairs of a pattern and user-defined value, instead of assigning identifiersautomatically, usewith_values()
.
use daachorse::DoubleArrayAhoCorasick;let patvals =vec![("bcd",0),("ab",10),("a",20)];let pma =DoubleArrayAhoCorasick::with_values(patvals).unwrap();letmut it = pma.find_overlapping_iter("abcd");let m = it.next().unwrap();assert_eq!((0,1,20),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((0,2,10),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((1,4,0),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());
To build a faster automaton on multibyte characters, useCharwiseDoubleArrayAhoCorasick
instead.
The standard versionDoubleArrayAhoCorasick
handles strings as UTF-8 sequences and definestransition labels using byte values. On the other hand,CharwiseDoubleArrayAhoCorasick
usesUnicode code point values, reducing the number of transitions and faster matching.
use daachorse::CharwiseDoubleArrayAhoCorasick;let patterns =vec!["全世界","世界","に"];let pma =CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();letmut it = pma.find_iter("全世界中に");let m = it.next().unwrap();assert_eq!((0,9,0),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((12,15,2),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());
Daachorse has no dependency onstd
(but requires a global allocator with thealloc
crate).
This repository contains a command-line interface nameddaacfind
for searching patterns in textfiles.
% cat ./pat.txtfnconst fnpub fnunsafe fn% find . -name "*.rs" | xargs cargo run --release -p daacfind -- --color=auto -nf ./pat.txt......./src/errors.rs:67: fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {./src/errors.rs:81: fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {./src/lib.rs:115: fn default() -> Self {./src/lib.rs:126: pub fn base(&self) -> Option<u32> {./src/lib.rs:131: pub const fn check(&self) -> u8 {./src/lib.rs:136: pub const fn fail(&self) -> u32 {......
Does this library support data types other than
str
and[u8]
?(e.g., structures implementingEq
.)Not supported. This library uses Aho-Corasick automata built with adata structure calleddouble-array trie. The algorithm on this datastructure works with XOR operations on the input haystack. Therefore,the haystack must be a sequence of integers. This library is speciallyoptimized for
str
and[u8]
among integer sequences.Does this library provide bindings to programming languages otherthan Rust?
We are providinga Python binding.Other programming languages are not currently planned to be supported.If you are interested in writing bindings, you are welcome to do so.daachorse is free software.
We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.
- https://daac-tools.slack.com/
- Please get an invitation fromhere.
Licensed under either of
- Apache License, Version 2.0(LICENSE-APACHE orhttp://www.apache.org/licenses/LICENSE-2.0)
- MIT license(LICENSE-MIT orhttp://opensource.org/licenses/MIT)
at your option.
If you use this library in academic settings,please cite the following paper.
@article{10.1002/spe.3190, author = {Kanda, Shunsuke and Akabe, Koichi and Oda, Yusuke}, title = {Engineering faster double-array {Aho--Corasick} automata}, journal = {Software: Practice and Experience}, volume={53}, number={6}, pages={1332--1361}, year={2023}, keywords = {Aho–Corasick automata, code optimization, double-array, multiple pattern matching}, doi = {https://doi.org/10.1002/spe.3190}, url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.3190}, eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/spe.3190}}
Seethe guidelines.
About
🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
Topics
Resources
License
Apache-2.0, MIT licenses found
Licenses found
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Uh oh!
There was an error while loading.Please reload this page.
Contributors3
Uh oh!
There was an error while loading.Please reload this page.