Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
NotificationsYou must be signed in to change notification settings

daac-tools/daachorse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.

Crates.ioDocumentationRustBuild StatusSlack

The main technical ideas behind this library appear in the following paper:

Shunsuke Kanda, Koichi Akabe, and Yusuke Oda.Engineering faster double-array Aho-Corasick automata.Software: Practice and Experience (SPE),53(6): 1332–1361, 2023(arXiv)

A Python wrapper is also availablehere.

Overview

Daachorse is a crate for fast multiple pattern matching using theAho-Corasick algorithm, running in linear time overthe length of the input text. This crate uses thecompact double-array data structure for implementingthe pattern match automaton for time and memory efficiency. The data structure not only supportsconstant-time state-to-state traversal but also represents each state in the space of only 12bytes.

For example, compared to the NFA of theaho-corasickcrate, which is the most popular Aho-Corasick implementation in Rust, Daachorse can perform patternmatching3.0–5.2 times faster while consuming56–60% smaller memory when using a worddictionary of 675K patterns. Other experimental results are available onWiki.

Requirements

Rust 1.61 or higher is required to build this crate.

Example usage

Daachorse contains some search options, ranging from standard matching with the Aho-Corasickalgorithm to trickier matching. They will run very fast based on the double-array data structureand can be easily plugged into your application, as shown below.

Finding overlapped occurrences

To search for all occurrences of registered patterns that allow for positional overlap in the inputtext, usefind_overlapping_iter(). When you usenew() for construction, the library assigns aunique identifier to each pattern in the input order. The match result has the byte positions ofthe occurrence and its identifier.

use daachorse::DoubleArrayAhoCorasick;let patterns =vec!["bcd","ab","a"];let pma =DoubleArrayAhoCorasick::new(patterns).unwrap();letmut it = pma.find_overlapping_iter("abcd");let m = it.next().unwrap();assert_eq!((0,1,2),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((0,2,1),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((1,4,0),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());

Finding non-overlapped occurrences with the standard matching

If you do not want to allow positional overlap, usefind_iter() instead.It performs the search on the Aho-Corasick automatonand reports patterns first found in each iteration.

use daachorse::DoubleArrayAhoCorasick;let patterns =vec!["bcd","ab","a"];let pma =DoubleArrayAhoCorasick::new(patterns).unwrap();letmut it = pma.find_iter("abcd");let m = it.next().unwrap();assert_eq!((0,1,2),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((1,4,0),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());

Finding non-overlapped occurrences with the longest matching

If you want to search for the longest pattern without positional overlap in each iteration, useleftmost_find_iter() with specifyingMatchKind::LeftmostLongest in the construction.

use daachorse::{DoubleArrayAhoCorasickBuilder,MatchKind};let patterns =vec!["ab","a","abcd"];let pma =DoubleArrayAhoCorasickBuilder::new().match_kind(MatchKind::LeftmostLongest).build(&patterns).unwrap();letmut it = pma.leftmost_find_iter("abcd");let m = it.next().unwrap();assert_eq!((0,4,2),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());

Finding non-overlapped occurrences with the leftmost-first matching

If you want to find the earliest registered pattern among ones starting from the search position,useleftmost_find_iter() with specifyingMatchKind::LeftmostFirst.

This is the so-calledleftmost first match, a tricky search option supported in theaho-corasick crate. For example, in the followingcode,ab is reported because it is the earliest registered one.

use daachorse::{DoubleArrayAhoCorasickBuilder,MatchKind};let patterns =vec!["ab","a","abcd"];let pma =DoubleArrayAhoCorasickBuilder::new().match_kind(MatchKind::LeftmostFirst).build(&patterns).unwrap();letmut it = pma.leftmost_find_iter("abcd");let m = it.next().unwrap();assert_eq!((0,2,0),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());

Associating arbitrary values with patterns

To build the automaton from pairs of a pattern and user-defined value, instead of assigning identifiersautomatically, usewith_values().

use daachorse::DoubleArrayAhoCorasick;let patvals =vec![("bcd",0),("ab",10),("a",20)];let pma =DoubleArrayAhoCorasick::with_values(patvals).unwrap();letmut it = pma.find_overlapping_iter("abcd");let m = it.next().unwrap();assert_eq!((0,1,20),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((0,2,10),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((1,4,0),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());

Building faster automata on multibyte characters

To build a faster automaton on multibyte characters, useCharwiseDoubleArrayAhoCorasick instead.

The standard versionDoubleArrayAhoCorasick handles strings as UTF-8 sequences and definestransition labels using byte values. On the other hand,CharwiseDoubleArrayAhoCorasick usesUnicode code point values, reducing the number of transitions and faster matching.

use daachorse::CharwiseDoubleArrayAhoCorasick;let patterns =vec!["全世界","世界","に"];let pma =CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();letmut it = pma.find_iter("全世界中に");let m = it.next().unwrap();assert_eq!((0,9,0),(m.start(), m.end(), m.value()));let m = it.next().unwrap();assert_eq!((12,15,2),(m.start(), m.end(), m.value()));assert_eq!(None, it.next());

no_std

Daachorse has no dependency onstd (but requires a global allocator with thealloc crate).

CLI

This repository contains a command-line interface nameddaacfind for searching patterns in textfiles.

% cat ./pat.txtfnconst fnpub fnunsafe fn% find . -name "*.rs" | xargs cargo run --release -p daacfind -- --color=auto -nf ./pat.txt......./src/errors.rs:67:    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {./src/errors.rs:81:    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {./src/lib.rs:115:    fn default() -> Self {./src/lib.rs:126:    pub fn base(&self) -> Option<u32> {./src/lib.rs:131:    pub const fn check(&self) -> u8 {./src/lib.rs:136:    pub const fn fail(&self) -> u32 {......

FAQ

  • Does this library support data types other thanstr and[u8]?(e.g., structures implementingEq.)

    Not supported. This library uses Aho-Corasick automata built with adata structure calleddouble-array trie. The algorithm on this datastructure works with XOR operations on the input haystack. Therefore,the haystack must be a sequence of integers. This library is speciallyoptimized forstr and[u8] among integer sequences.

  • Does this library provide bindings to programming languages otherthan Rust?

    We are providinga Python binding.Other programming languages are not currently planned to be supported.If you are interested in writing bindings, you are welcome to do so.daachorse is free software.

Slack

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.

License

Licensed under either of

at your option.

If you use this library in academic settings,please cite the following paper.

@article{10.1002/spe.3190,    author = {Kanda, Shunsuke and Akabe, Koichi and Oda, Yusuke},    title = {Engineering faster double-array {Aho--Corasick} automata},    journal = {Software: Practice and Experience},    volume={53},    number={6},    pages={1332--1361},    year={2023},    keywords = {Aho–Corasick automata, code optimization, double-array, multiple pattern matching},    doi = {https://doi.org/10.1002/spe.3190},    url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.3190},    eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/spe.3190}}

Contribution

Seethe guidelines.

About

🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.

Topics

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Contributors3

  •  
  •  
  •  

Languages


[8]ページ先頭

©2009-2025 Movatter.jp