Genivia/ugrep-indexer

A monotonic indexer to speed up grepping by >10x (ugrep-indexer is now part of ugrep 6.0)
The ugrep-indexer utility recursively indexes files to speed up recursive grepping.

The contents of archives and compressed files are also indexed when specified with a command-line option. This eliminates searching them when none of their contents match the specified patterns.

ugrep is a grep-compatible fast file searcher that supports index-based searching. Index-based search can be significantly faster on slow file systems and when file system caching is ineffective: if the file system on a searched drive is not cached in RAM, i.e. it is "cold", then indexing will speed up search. Using the index of a file, ugrep quickly checks whether the file could possibly match a specified regex pattern, and searches only those files that may match instead of searching all files.

Index-based search with ugrep is safe and never skips updated files that may now match. If any files or directories are added or changed after indexing, then searching will always include these additions and changes to the file system, by comparing file and directory time stamps to the indexing time stamp.

When many files are added or changed after indexing, then we might want to re-index to bring the indexes up to date. Re-indexing is incremental, so it will not take as much time as the initial indexing process.

A typical but small example of an index-based search, on the ugrep v3.12.6 repository placed on a separate drive:

```
$ cd drive/ugrep
$ ugrep-indexer -I
12247077 bytes scanned and indexed with 19% noise on average
    1317 files indexed in 28 directories
      28 new directories indexed
    1317 new files indexed
       0 modified files indexed
       0 deleted files removed from indexes
     128 binary files ignored with --ignore-binary
       0 symbolic links skipped
       0 devices skipped
 5605227 bytes indexing storage increase at 4256 bytes/file
```

Normal searching on a cold file system without indexing takes 1.02 seconds, after unmounting the drive and mounting it again to clear the FS cache in order to record the effect of indexing:

```
$ ugrep -I -l 'std::chrono' --stats
src/ugrep.cpp
Searched 1317 files in 28 directories in 1.02 seconds with 8 threads: 1 matching (0.07593%)
```

Ripgrep 13.0.0 takes longer with 1.18 seconds for the same cold search (ripgrep skips binary files by default, so option -I is not specified):

```
$ time rg -l 'std::chrono'
src/ugrep.cpp
    1.18 real         0.01 user         0.06 sys
```

By contrast, with indexing, searching a cold file system takes only 0.0487 seconds with ugrep, which is 21 times faster, again after unmounting the drive and mounting it again to clear the FS cache:

```
$ ugrep --index -I -l 'std::chrono' --stats
src/ugrep.cpp
Searched 1317 files in 28 directories in 0.0487 seconds with 8 threads: 1 matching (0.07593%)
Skipped 1316 of 1317 files with non-matching indexes
```

There is always some variance in the elapsed time; 0.0487 seconds was the best of four search runs, which ranged from 0.0487 seconds (21x speedup) to 0.0983 seconds (10x speedup).

The speed increase may in general be significantly higher than in this small demo, depending on several factors: the size of the files indexed, the read speed of the file system, and whether most files are cold.

The indexing algorithm that I designed is provably monotonic: a higher accuracy guarantees increased search performance by reducing the false positive rate, but also increases index storage overhead. Likewise, a lower accuracy decreases search performance, but also reduces the index storage overhead. Therefore, I named my indexer a monotonic indexer.

If file storage space is at a premium, then we can dial down the index storageoverhead by specifying a lower indexing accuracy.

Indexing the example from above with level 0 (option -0) reduces the indexing storage overhead by 8.6 times, from 4256 bytes per file to a measly 490 bytes per file:

```
12247077 bytes scanned and indexed with 42% noise on average
    1317 files indexed in 28 directories
       0 new directories indexed
    1317 new files indexed
       0 modified files indexed
       0 deleted files removed from indexes
     128 binary files ignored with --ignore-binary
       0 symbolic links skipped
       0 devices skipped
  646123 bytes indexing storage increase at 490 bytes/file
```

Indexed search is still 12x faster than non-indexed search for this example, with 16 files actually searched (15 false positives):

```
Searched 1317 files in 28 directories in 0.0722 seconds with 8 threads: 1 matching (0.07593%)
Skipped 1301 of 1317 files with non-matching indexes
```

Regex patterns that are more complex than this example may naturally have a higher false positive rate, i.e. the rate of files that are considered possibly matching when they are not. A higher false positive rate may reduce search speed when the rate is large enough to be impactful.

The following table shows how indexing accuracy affects indexing storage and the average noise per file indexed. The rightmost columns show the search speed and false positive rate for ugrep --index -I -l 'std::chrono':

| acc. | index storage (KB) | average noise | false positives | search time (s) |
| ---- | ------------------ | ------------- | --------------- | --------------- |
| -0   | 631                | 42%           | 15              | 0.0722          |
| -1   | 1276               | 39%           | 1               | 0.0506          |
| -2   | 1576               | 36%           | 0               | 0.0487          |
| -3   | 2692               | 31%           | 0               | unch            |
| -4   | 2966               | 28%           | 0               | unch            |
| -5   | 4953               | 23%           | 0               | unch            |
| -6   | 5474               | 19%           | 0               | unch            |
| -7   | 9513               | 15%           | 0               | unch            |
| -8   | 10889              | 11%           | 0               | unch            |
| -9   | 13388              | 7%            | 0               | unch            |

If the specified regex matches many more possible patterns, for example with the search ugrep --index -I -l '(todo|TODO)[: ]', then we may observe a higher rate of false positives among the 1317 files searched, resulting in slightly longer search times:

| acc. | false positives | search time (s) |
| ---- | --------------- | --------------- |
| -0   | 189             | 0.292           |
| -1   | 69              | 0.122           |
| -2   | 43              | 0.103           |
| -3   | 19              | 0.101           |
| -4   | 16              | 0.097           |
| -5   | 2               | 0.096           |
| -6   | 1               | unch            |
| -7   | 0               | unch            |
| -8   | 0               | unch            |
| -9   | 0               | unch            |

Accuracy -4 is the default (changed from -5 in older releases), which tends to work very well for searches with regex patterns of modest complexity.

One word of caution: there is always a tiny bit of overhead to check the indexes. This means that if all files are already cached in RAM, because files were searched or read recently, then indexing will not necessarily speed up search; in that case a non-indexed search might be faster. Furthermore, an index-based search has a longer start-up time. This start-up time increases when Unicode character classes and wildcards are used that must be converted to hash tables.

To summarize, index-based search is most effective when searching a lot of cold files and when regex patterns do not match too much, i.e. we want to limit the use of the unlimited repeats * and + and limit the use of Unicode character classes when possible. This reduces the ugrep start-up time and limits the rate of false positive pattern matches (see also the Q&A below).

Quick examples

Recursively and incrementally index all non-binary files showing progress:

ugrep-indexer -I -v

Recursively and incrementally index all non-binary files, including non-binary files stored in archives and in compressed files, showing progress:

ugrep-indexer -z -I -v

Incrementally index all non-binary files, including archives and compressed files, show progress, follow symbolic links to files (but not to directories), but do not index files and directories matching the globs in .gitignore:

ugrep-indexer -z -I -v -S -X

Force re-indexing of all non-binary files, including archives and compressed files, follow symbolic links to files (but not to directories), but do not index files and directories matching the globs in .gitignore:

ugrep-indexer -f -z -I -v -S -X

Same, but decrease index file storage to a minimum by decreasing the indexing accuracy from 4 (the default) to 0:

ugrep-indexer -f -0 -z -I -v -S -X

Increase search performance by increasing the indexing accuracy from 4 (the default) to 7 at the cost of larger index files:

ugrep-indexer -f7zIvSX

Recursively delete all hidden ._UG#_Store index files to restore the directory tree to non-indexed:

ugrep-indexer -d

Build steps

Configure and compile with:

./build.sh

Optionally (not required), install with:

sudo make install

Future enhancements

  • Add an option to create one index file, e.g. specified explicitly to ugrep. This could further improve indexed search speed if the index file is located on a fast file system. Otherwise, do not expect much improvement or even a possible slowdown, since a single index file cannot be searched concurrently and more index entries will be checked when in fact directories are skipped (skipping their indexes too). Experiments will tell. A critical caveat of this approach is that index-based search with ugrep --index is no longer safe: new and modified files that are not indexed yet will not be searched.

  • Each N-gram Bloom filter has its own "bit tier" in the hash table to avoid hash conflicts. For example, 2-grams do not share any bits with 3-grams. This ensures that we never have any false positives with characters being falsely matched that are actually not part of the pattern. However, the 1-gram (single character) bit space is small (at most 256 bits). Therefore, we waste some bits when hash tables are larger. A possible approach to reduce waste is to combine 1-grams with 2-grams to share the same bit space. This is easy to do if we consider a 1-gram to be equal to a 2-gram with the second character set to \0 (NUL). We can lower the false positive rate with a second 2-gram hash based on a different hash method. Or we can expand the "bit tiers" from 8 to 9 to store 9-grams. That will increase the indexing accuracy for longer patterns (9 characters or longer) at no additional cost. On the other hand, that change may cause more false positives when characters are falsely matched that are not part of the pattern; we lose the advantage of perfect 1-gram accuracy.

Q&A

Q: How does it work?

Indexing adds a hidden index file ._UG#_Store to each directory indexed. Files indexed are scanned (never changed!) by ugrep-indexer to generate index files.

The size of the index files depends on the specified accuracy, with -0 the lowest (small index files) and -9 the highest (large index files). The default accuracy is -4. See the next Q for details on the impact of accuracy on indexing size versus search speed.

Indexing never follows symbolic links to directories, because symbolically linked directories may be located anywhere in a file system, or in another file system, where we do not want to add index files. You can still index symbolic links to files with ugrep-indexer option -S.

Option -v (--verbose) displays the indexing progress and the "noise" of each file indexed. Noise is a measure of entropy or randomness in the input. A higher level of noise means that indexing was less accurate in representing the contents of a file. For example, a large file with random data is hard to index accurately and will have a high level of noise.
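The README does not spell out the exact noise formula, but given the index bitmap semantics of the maybe_match() example in the Q&A below (a zero bit means "possibly present"), one plausible illustrative measure is the fraction of cleared bits in a file's index table — the following sketch is an assumption for illustration, not ugrep-indexer's actual computation:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch (not ugrep-indexer's actual formula): estimate indexing
 * "noise" as the fraction of cleared (possible-match) bits in an index hash
 * table.  A table with every bit still set carries no noise; a fully cleared
 * table matches everything and is pure noise. */
double noise_estimate(const uint8_t *hashes, size_t size)
{
  size_t zeros = 0;
  for (size_t i = 0; i < size; ++i)
    for (int k = 0; k < 8; ++k)
      if (!(hashes[i] & (1 << k)))
        ++zeros; /* cleared bit: some window hash mapped to this cell/tier */
  return (double)zeros / (8.0 * size); /* 0.0 = empty index, 1.0 = saturated */
}
```

A random-data file clears many distinct hash bits, driving this ratio up, which matches the README's observation that random files index poorly.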

The complexity of indexing is linear in the size of a given file to index. In practice it is not a fast process, not as fast as searching, and may take some time to complete a full indexing pass over a large directory tree. When indexing completes, ugrep-indexer displays the results of indexing, including the total size of the indexes added and the average indexing noise.

Scanning a file to index it produces a 64KB table of indexing hashes. The ugrep-indexer then repeatedly halves the table with bit compression, using bitwise AND, as long as the target accuracy is not exceeded. Halving is made possible by the fact that the table encodes hashes for 8 windows at offsets from the start of the pattern, corresponding to the 8 bits per index hash table cell. Combining the two halves of the table may flip some bits from one to zero, which may cause a false positive match; a zero bit hash value indicates a possible match. This proves the monotonicity of the indexer.
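The halving step can be sketched as follows (an illustrative reconstruction, not ugrep-indexer's actual code). Because the table size is a power of two and lookups mask hashes with size - 1, old slots i and i + size/2 both map to slot i of the halved table; AND-combining the halves can only turn one-bits (no match) into zero-bits (possible match), so a true match is never lost and only false positives can be added — the monotonicity property:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of index table halving.  A zero bit means "possible
 * match".  Bitwise-ANDing the two halves can only clear bits, so matches are
 * preserved and only false positives may be introduced: the compression is
 * monotonic. */
size_t halve_table(uint8_t *hashes, size_t size)
{
  size_t half = size / 2;
  for (size_t i = 0; i < half; ++i)
    hashes[i] &= hashes[i + half]; /* slot i now covers old slots i and i+half */
  return half; /* new table size; lookups now mask with half - 1 */
}
```

Repeating this step takes the table from 64KB down toward the 128-byte minimum mentioned in the Q&A, trading accuracy for storage at each halving.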

The ugrep-indexer detects "binary files", which can be ignored and not indexed with ugrep-indexer option -I (--ignore-binary). This is useful when searching with ugrep option -I (--ignore-binary) to ignore binary files, which is a typical scenario.

The ugrep-indexer obeys .gitignore file exclusions when specified with option -X (--ignore-files). Ignored files and directories will not be indexed, to save file system space. This works well when searching for files with ugrep option --ignore-files.

Indexing can be aborted, for example with CTRL-C. This will not cause a loss of search capability with ugrep, but will leave the directory structure only partially indexed.

Option -c checks indexes for stale references and non-indexed files and directories.

Indexes are deleted with ugrep-indexer option -d.

The ugrep-indexer has been extensively tested by comparing ugrep --index search results to the "slow" non-indexed ugrep search results on thousands of files with thousands of random search patterns.

Index-based search works with all ugrep options except option -v (--invert-match), --filter, -P (--perl-regexp) and -Z (--fuzzy). Option -c (--count) with --index automatically sets --min-count=1 to skip all files with zero matches.

If any files or directories were updated, added or deleted after indexing, then ugrep --index will always search these files and directories when they are present on the recursive search path. You can run ugrep-indexer again to incrementally update all indexes.

Regex patterns are converted internally by ugrep with option --index to a form of hash tables covering up to the first 16 bytes of the specified regex patterns, possibly fewer in order to reduce construction time when regex patterns are complex. Therefore, the first 8 to 16 characters of a regex pattern are most critical and should not match too much, to limit so-called false positive matches that may slow down searching.

In ugrep, a regex pattern is converted to a DFA. An indexing hash finite automaton (HFA) is constructed on top of the DFA to compactly represent hash tables as state transitions with labelled edges. This HFA consists of up to eight layers, each shifted by one byte to represent the next 8-byte window over the pattern. Each HFA layer encodes index hashes for that part of the pattern. The index hash function chosen is "additive", meaning the next byte is added when hashed with the previous hash. This is very important, as it critically reduces the HFA construction overhead. We can then encode labelled HFA transitions to states as multiple edges with 16-bit hash value ranges instead of a set of single edges each with an individual hash value. To this end, I use my open-ended ranges library reflex::ORanges<T>, derived from std::set<T>.

A very simple single-string maybe_match() function with the prime 61 index hash function is given below to demonstrate index-based searching of a single string:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// prime 61 hashing
uint16_t indexhash(uint16_t h, uint8_t b, size_t size)
{
  return ((h << 6) - h - h - h + b) & (size - 1);
}

// return possible match of string given array of hashes of size <= 64K (power of two)
bool maybe_match(const char *string, uint8_t *hashes, size_t size)
{
  size_t len = strlen(string); // practically we can and should limit len to e.g. 15 or 16
  for (const char *window = string; len > 0; ++window, --len)
  {
    uint16_t h = window[0] & (size - 1);
    if (hashes[h] & 0x01)
      return false;
    size_t k, n = len < 8 ? len : 8;
    for (k = 1; k < n; ++k)
    {
      h = indexhash(h, window[k], size);
      if (hashes[h] & (1 << k))
        return false;
    }
  }
  return true;
}
```
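To make the sketch concrete, the complete program below pairs maybe_match() with a hypothetical indexing helper, index_buffer(), which is not part of ugrep-indexer but mirrors the query logic: for each position in the input it clears bit k of the table cell addressed by the hash of the (k+1)-byte window, so every window of a real occurrence is recorded. Note that (h << 6) - h - h - h is just 61*h, hence "prime 61 hashing". For simplicity the sketch assumes 7-bit ASCII input:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* prime 61 hashing: (h << 6) - 3*h == 61*h */
static uint16_t indexhash(uint16_t h, uint8_t b, size_t size)
{
  return ((h << 6) - h - h - h + b) & (size - 1);
}

/* hypothetical indexing helper (not part of ugrep-indexer): clear bit k of
 * hashes[h] for the hash h of each window of length k+1 in the buffer;
 * assumes 7-bit ASCII input to match maybe_match()'s char-based hashing */
static void index_buffer(const uint8_t *buf, size_t len, uint8_t *hashes, size_t size)
{
  memset(hashes, 0xFF, size); /* all bits set: nothing indexed yet */
  for (size_t i = 0; i < len; ++i)
  {
    uint16_t h = buf[i] & (size - 1);
    hashes[h] &= (uint8_t)~0x01; /* 1-gram at this position */
    size_t n = len - i < 8 ? len - i : 8;
    for (size_t k = 1; k < n; ++k)
    {
      h = indexhash(h, buf[i + k], size);
      hashes[h] &= (uint8_t)~(1u << k); /* (k+1)-gram window hash */
    }
  }
}

/* maybe_match() as in the example above */
static bool maybe_match(const char *string, uint8_t *hashes, size_t size)
{
  size_t len = strlen(string);
  for (const char *window = string; len > 0; ++window, --len)
  {
    uint16_t h = window[0] & (size - 1);
    if (hashes[h] & 0x01)
      return false;
    size_t k, n = len < 8 ? len : 8;
    for (k = 1; k < n; ++k)
    {
      h = indexhash(h, window[k], size);
      if (hashes[h] & (1 << k))
        return false;
    }
  }
  return true;
}
```

With a full 64K table, 1-gram lookups are exact (a byte addresses its own cell), so a query containing any byte absent from the indexed file is rejected immediately; only the longer window hashes can collide and produce false positives.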

The prime 61 hash was chosen among many other possible hashing functions using a realistic experimental setup. A candidate hashing function was tested by repeatedly searching for a randomly drawn word from a 100MB Wikipedia file. The word was mutated with one, two or three random letters, after checking that the mutation does not correspond to an actual valid word in the Wikipedia file. Then the false positive rate was recorded whenever a mutated word matched the file. A hash function with a minimal false positive rate should be a good candidate overall.

By using a window of 8 (or shorter, depending on the pattern length), the false positive rate is lower compared to standard Bloom filters. More specifically, a hash function per window offset is used instead of the N hash functions of a standard Bloom filter; for shorter patterns, N is often too small to limit false positives, so the windowed approach is more effective. It also rejects any pattern from a match when a character anywhere in the first 8 bytes of the pattern does not actually occur anywhere in an indexed file, whereas a standard Bloom filter might have a false positive match. Furthermore, the bit addressing used to index the hashes table enables efficient table compression.

Q: What is indexing accuracy?

Indexing is a form of lossy compression. The higher the indexing accuracy, the faster ugrep search should be, by skipping more files that do not match. A higher accuracy reduces noise (less lossy); a high level of noise causes ugrep to sometimes search indexed files that do not match. We call these "false positive matches". Higher accuracy requires larger index files. Normally we expect 4K or less indexing storage per file on average. The minimum is 128 bytes of index storage per file, excluding the file name and a 4-byte index header. The maximum is 64K bytes of storage per file for very large, noisy files.

When searching indexed files with ugrep --index --stats, option --stats shows the search statistics after the index-based search completes. When many files are not skipped from searching due to indexing noise (i.e. false positives), then a higher accuracy helps to increase the effectiveness of indexing, which may speed up searching.

Q: What about UTF-16 and UTF-32 files?

UTF-16 and UTF-32 files are indexed too: the indexer internally converts them to UTF-8 before indexing.
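For BMP code points, that UTF-16-to-UTF-8 conversion can be sketched as follows — an illustrative simplification, not ugrep-indexer's actual converter, which must also handle surrogate pairs, UTF-32, and byte order marks:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of UTF-16 to UTF-8 conversion for BMP code points only
 * (no surrogate-pair, UTF-32 or BOM handling, unlike the real converter).
 * Returns the number of UTF-8 bytes written to out. */
size_t utf16_to_utf8(const uint16_t *in, size_t len, uint8_t *out)
{
  size_t n = 0;
  for (size_t i = 0; i < len; ++i)
  {
    uint16_t c = in[i];
    if (c < 0x80)
    {
      out[n++] = (uint8_t)c;                  /* 1-byte ASCII */
    }
    else if (c < 0x800)
    {
      out[n++] = (uint8_t)(0xC0 | (c >> 6));  /* 2-byte sequence */
      out[n++] = (uint8_t)(0x80 | (c & 0x3F));
    }
    else
    {
      out[n++] = (uint8_t)(0xE0 | (c >> 12)); /* 3-byte sequence */
      out[n++] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
      out[n++] = (uint8_t)(0x80 | (c & 0x3F));
    }
  }
  return n;
}
```

Converting to UTF-8 first means one index format and one set of window hashes works for all input encodings.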

Q: Why bother indexing archives and compressed files?

Disk space is saved by archiving (zip/tar/pax/cpio) and compressing files. On the other hand, searching archives and compressed files is much slower than searching regular files. Indexing archives and compressed files with ugrep-indexer -z -I and searching them with ugrep -z -I --index PATTERN speeds up searching when archives and compressed files can be skipped. The trade-off is that disk storage requirements increase with the addition of index file entries for archives and compressed files. Note that when archives and compressed files contain binaries, option -I ignores these binaries.

Q: Why is the start-up time of ugrep higher with option --index?

The start-up overhead of ugrep --index to construct indexing hash tables depends on the regex patterns. If a regex pattern is very "permissive", i.e. matches a lot of possible patterns, then the start-up time of ugrep --index increases significantly to compute the hash tables. This may happen when large Unicode character classes and wildcards are used, especially with the unlimited * and + repeats. To find out how much the start-up time increases, use ugrep --index -r PATTERN /dev/null --stats=vm to search /dev/null with your PATTERN.

Q: Why are index files not compressed?

Index files should be very dense in information content, and that is the case with this new indexing algorithm for ugrep that I designed and implemented. The denser an index file is, the more compactly and accurately it represents the original file data. That makes it hard or impossible to compress index files. This density is also a good indicator of how effective an index file will be in practice.

