Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

C Library of Aho-Corasick Algorithm based on Coordinate Hash Trie

License

NotificationsYou must be signed in to change notification settings

dongyx/libaca

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LibACA is a C library of the Aho-Corasick algorithm/automaton.

Given a set of strings, each called a pattern, and an input string called the text, the Aho-Corasick algorithm finds every occurrence of every pattern in the text.

LibACA aims for:

  • Time-space balance

    LibACA uses an original variant of trie, calledcoordinate hash trie.This trie variant provides a good balance between time and space,without compromising simplicity.See theInternals section for detail.

    The space complexity is proportional to the number of states.Because no reallocation, resizing, or rehashing will be performed,the space consumption is stable.

    When transition from one state to another,the time complexity is constant for the average case,andproportional to the size of the alphabet for the worst case.

  • Simplicity

    The implementation contains no more than 175 LOCs of C, including header files.

  • Robustness

    LibACA carefully deals with errors and arithmetic overflows.

LibACA is designed for binary streams, including infinite ones.

Table of Contents

System Requirements

LibACA requires a C89 compiler, POSIXerrno definitions, and the following features of the target architecture:

  • The binary representation of -1 is111...11.

Most mainstream architectures, systems, and compilers meet the requirements.

To run the test and benchmark, the system should additionally implement thegetrusage() routine and support theru_maxrss field in therusage structure.Most Unix-compatible systems meet the requirements.

Getting Started

The following exampleeg.c reads symbols from the standard input and matches in a fixed pattern set.For every match, it prints the matched pattern.

#include <aca.h>...char *pattab[] = {"he", "she", "his", "hers"};int main(void){int hit(int pat); /* callback for matching */aca fgrep;/* the automaton */aca_iter it;/* iterator of the automaton */int i, ch;/* initialize an automaton with the maximum number of states to be 16 */aca_init(&fgrep, 16);/* add all patterns in pattab to the automaton */for (i = 0; i < sizeof pattab / sizeof pattab[0]; i++)aca_add(&fgrep, pattab[i], strlen(pattab[i]));/* build the automaton */aca_build(&fgrep);/* get the initial iterator */it = aca_root(&fgrep);/* read symbols from the stdin stream and feed to the automaton * each match is passed to hit() */while ((ch = getchar()) != EOF)it = aca_next(it, ch, hit);/* destroy the automaton */aca_destroy(&fgrep);return 0;}int hit(int pat){/* print the matched pattern */puts(pattab[pat]);return 0;}

The above example should behave like the following.

$ cc -l aca eg.c$ echo shers | ./a.outshehehers

Installation

To install LibACA as header files and a static library into the system:

makesudo make install

By default, LibACA is installed to/usr/local.To link LibACA, use the-l aca option on your compiler.

It's also simple to embed LibACA into your project.Just includeaca.h.Compile and linkaca.c.

Types

  • aca: Aho-Corasick Automaton

  • aca_iter: iterator of anaca instance

Functions

  • int aca_init(aca *aca, int n)

    Initialize the automaton with the specific maximum number of states.

    The safest estimation ofn is the sum of lengths of all patterns adding 1. An less than 1 is equivalent to 1.

    Upon sucess, return 0.

    Otherwise, return -1 and seterrno.

  • void aca_destroy(aca *aca)

    Destroy the automaton.

  • int aca_add(aca *aca, char *pat, int n)

    Add a pattern of lengthn to the automaton.

    A negativen is equivalent to 0.

    Upon success, return the pattern index.The first pattern has index 0;the second has index 1;and so on.Duplicated patterns would have the same index.

    Otherwise, return -1 and seterrno.The internal state of the automaton will be indeterminate.The only allowed next action isaca_destroy().

  • int aca_build(aca *aca)

    Build the automaton.This routine should not be called before all patterns are added.

    Upon success, return 0.

    Otherwise, return -1 and seterrno.

  • aca_iter aca_root(aca *aca)

    Return the initial iterator of the automaton.

    This routine should not be called before the automaton is built.

  • aca_iter aca_next(aca_iter it, char sym, int (*hit)(int))

    Feed a symbol to the automaton and return the next iterator.

    If there are matched occurrences after feeding the symbol,hit() will be called with the matched pattern index.

    Ifhit() returns 0, the matching continues, untilhit() is called on all occurrences.

    Otherwise, the matching stops.

    Thehit() function is called in the descending order of the length of the matched pattern.

Benchmarks

There are three test cases.

  • test/alice: the text ofAlice's Adventures in Wonderland and the pattern set of 10000 random English words

  • test/rand: the text of 300000 random characters and the pattern set of 10000 random strings

  • test/lfail: the text of 399999as followed by ab, and a single pattern containing 400000as

test/aliceProg        Build    Match    RSS-------     -------  -------  -------naive       86.4ms   4.9ms    95.9MBlibaca      247.3ms  8.5ms    5.3MB-------     -------  -------  -------Improv      -65%     -42%     94%test/randProg       Build    Match    RSS-------    -------  -------  -------naive      63.4ms   16.2ms   65.9MBlibaca     207.6ms  27.4ms   4.7MB-------    -------  -------  -------Improv     -69%     -41%     93%test/lfailProg        Build    Match    RSS-------     -------  -------  -------naive       404.3ms  29.3ms   398.8MBlibaca      827.2ms  20.8ms   24.5MB-------     -------  -------  -------Improv      -51%     40%      94%

You could runmake benchmark to perform the benchmark in your machine.The unit of RSS may be incorrect in some systems, e.g. Linux.

Internals

LibACA uses the coordinate hash trie to reach time-space balance and simplicity.This trie variant is documented in the following paper.

Storing a Trie with Compact and Predictable Space

Credits

LibACA was created byDONG Yuxuan in 2023.

Releases

No releases published

Packages

No packages published

[8]ページ先頭

©2009-2025 Movatter.jp