
Boyer–Moore string-search algorithm

From Wikipedia, the free encyclopedia
String searching algorithm
For the majority vote algorithm, see Boyer–Moore majority vote algorithm. For the Boyer–Moore theorem prover, see Nqthm.
Boyer–Moore string search
Class: String search
Data structure: String
Worst-case performance: Θ(m) preprocessing + O(mn) matching[note 1]
Best-case performance: Θ(m) preprocessing + Ω(n/m) matching
Worst-case space complexity: Θ(k+m)[note 2]

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature.[1] It was developed by Robert S. Boyer and J Strother Moore in 1977.[2] The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.[3][4]

The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in (the text). It is thus well-suited for applications in which the pattern is much shorter than the text or where it persists across multiple searches. The Boyer–Moore algorithm uses information gathered during the preprocessing step to skip sections of the text, resulting in a lower constant factor than many other string-search algorithms. In general, the algorithm runs faster as the pattern length increases. The key features of the algorithm are to match on the tail of the pattern rather than the head, and to skip along the text in jumps of multiple characters rather than searching every single character in the text.

Definitions

ANPANMAN-
PAN------
-PAN-----
--PAN----
---PAN---
----PAN--
-----PAN-
Alignments of pattern PAN to text ANPANMAN,
from k=3 to k=8. A match occurs at k=5.
  • T denotes the input text to be searched. Its length is n.
  • P denotes the string to be searched for, called the pattern. Its length is m.
  • S[i] denotes the character at index i of string S, counting from 1.
  • S[i..j] denotes the substring of string S starting at index i and ending at j, inclusive.
  • A prefix of S is a substring S[1..i] for some i in range [1, l], where l is the length of S.
  • A suffix of S is a substring S[i..l] for some i in range [1, l], where l is the length of S.
  • An alignment of P to T is an index k in T such that the last character of P is aligned with index k of T.
  • A match or occurrence of P occurs at an alignment k if P is equivalent to T[(k-m+1)..k].
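
The definitions above use 1-based, inclusive indices, which differ from the 0-based slices of most programming languages. The following is a minimal sketch of the match definition, assuming Python strings; the function name `occurs_at` is hypothetical and not part of any library.

```python
def occurs_at(text: str, pattern: str, k: int) -> bool:
    """Return True if pattern occurs at alignment k.

    k is the 1-based index in text of the pattern's last character,
    so the test is P == T[(k-m+1)..k] in the article's notation.
    """
    m = len(pattern)
    if k < m or k > len(text):
        return False  # the pattern would hang off one end of the text
    # The 1-based inclusive range (k-m+1)..k is the 0-based slice [k-m:k].
    return text[k - m:k] == pattern


# Alignments k=3..8 of PAN against ANPANMAN, as in the figure.
matches = [k for k in range(3, 9) if occurs_at("ANPANMAN", "PAN", k)]
print(matches)  # [5]
```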

Description


The Boyer–Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments (of which there are $n - m + 1$), Boyer–Moore uses information gained by preprocessing P to skip as many alignments as possible.

Before the introduction of this algorithm, the usual way to search within text was to examine each character of the text for the first character of the pattern. Once that was found, the subsequent characters of the text would be compared to the characters of the pattern. If no match occurred, the text would again be checked character by character in an effort to find a match. Thus almost every character in the text needed to be examined.
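
For comparison, the brute-force approach can be sketched as follows. This is an illustrative baseline rather than code from any cited source; the name `naive_search` is hypothetical, and alignments are reported as 1-based indices of the pattern's last character, following the definitions above.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Try every one of the n - m + 1 alignments in turn."""
    n, m = len(text), len(pattern)
    hits = []
    if m == 0:
        return hits
    for k in range(m, n + 1):          # k = m, m+1, ..., n
        if text[k - m:k] == pattern:   # compare the whole window
            hits.append(k)
    return hits


print(naive_search("ANPANMAN", "AN"))  # [2, 5, 8]
```

Every alignment is visited here, which is the cost that the shift rules are designed to avoid.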

The key insight in this algorithm is that if the end of the pattern is compared to the text, then jumps along the text can be made rather than checking every character of the text. The reason that this works is that in lining up the pattern against the text, the last character of the pattern is compared to the character in the text aligned with it. If the characters do not match, there is no need to continue searching backwards along the text. If the character in the text does not match any of the characters in the pattern, then the next character in the text to check is located m characters farther along the text, where m is the length of the pattern. If the character in the text is in the pattern, then a partial shift of the pattern along the text is done to line up along the matching character and the process is repeated. Jumping along the text to make comparisons rather than checking every character in the text decreases the number of comparisons that have to be made, which is the key to the efficiency of the algorithm.

More formally, the algorithm begins at alignment $k = m$, so the start of P is aligned with the start of T. Characters in P and T are then compared starting at index m in P and k in T, moving backward. The strings are matched from the end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs, upon which the alignment is shifted forward (to the right) according to the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, which means no further matches will be found.

The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing ofP.
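
The control flow described above can be sketched as a loop that compares right to left and then asks a set of shift rules how far to move. This is a skeleton under stated assumptions: the name `scan_alignments` and the rule signature `rule(pattern, text, k, i)` are hypothetical, and each rule is assumed to return a safe shift. With no rules supplied, the loop falls back to a shift of 1.

```python
from typing import Callable, Sequence

# Hypothetical rule interface: i is the 1-based pattern index of the
# mismatch, or 0 if the whole pattern matched at alignment k.
ShiftRule = Callable[[str, str, int, int], int]


def scan_alignments(text: str, pattern: str,
                    rules: Sequence[ShiftRule] = ()) -> list[int]:
    n, m = len(text), len(pattern)
    hits = []
    if m == 0:
        return hits
    k = m                                  # start of P aligned with start of T
    while k <= n:
        i, h = m, k                        # 1-based indices into P and T
        while i >= 1 and pattern[i - 1] == text[h - 1]:
            i -= 1                         # move backward through both strings
            h -= 1
        if i == 0:
            hits.append(k)                 # reached the beginning of P: a match
        # Use the largest shift any rule proposes, and always advance by >= 1.
        k += max([1] + [rule(pattern, text, k, i) for rule in rules])
    return hits


print(scan_alignments("ANPANMAN", "PAN"))  # [5]
```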

Shift rules


A shift is calculated by applying two rules: the bad-character rule and the good-suffix rule. The actual shifting offset is the maximum of the shifts calculated by these rules.

The bad-character rule


Description

----X--K---
ANPANMANAM-
-NNAAMAN---
---NNAAMAN-
Demonstration of bad-character rule with pattern P = NNAAMAN. There is a mismatch between N (in the input text) and A (in the pattern) in the column marked with an X. The pattern is shifted right (in this case by 2) so that the next occurrence of the character N in P to the left of the current character (which is the middle A) lines up with the mismatched N in the text.

The bad-character rule considers the character in T at which the comparison process failed (assuming such a failure occurred). The next occurrence of that character to the left in P is found, and a shift which brings that occurrence in line with the mismatched occurrence in T is proposed. If the mismatched character does not occur to the left in P, a shift is proposed that moves the entirety of P past the point of mismatch.
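
Before any table is introduced, the rule can be sketched by scanning the pattern directly. This is a minimal illustration with a hypothetical function name; it costs O(m) per lookup, so it shows what the rule computes rather than how an implementation would make it fast.

```python
def bad_character_shift(pattern: str, i: int, c: str) -> int:
    """Shift proposed by the bad-character rule.

    i is the 1-based index in the pattern where the mismatch happened,
    c is the text character that was found there instead.
    """
    # Rightmost occurrence of c strictly to the left of position i.
    # rfind works on 0-based indices and returns -1 if c is absent,
    # so j is a 1-based index, or 0 when there is no occurrence.
    j = pattern.rfind(c, 0, i - 1) + 1
    return i - j


# The figure: pattern NNAAMAN, mismatch at the middle A (index 4), text has N.
print(bad_character_shift("NNAAMAN", 4, "N"))  # 2
# A character that does not occur to the left moves P past the mismatch.
print(bad_character_shift("NNAAMAN", 4, "P"))  # 4
```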

Preprocessing


Methods vary on the exact form the table for the bad-character rule should take, but a simple constant-time lookup solution is as follows: create a 2D table which is indexed first by the index of the character c in the alphabet and second by the index i in the pattern. This lookup returns the occurrence of c in P with the next-highest index $j < i$, or -1 if there is no such occurrence. The proposed shift is then $i - j$, with $O(1)$ lookup time and $O(km)$ space, assuming a finite alphabet of size k.
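
One possible shape for that 2D table is sketched below. It assumes 0-based pattern indices (which is why -1 can serve as the "no occurrence" value) and an alphabet passed in as a string; the name `build_bad_char_table` and the use of a dictionary of lists are illustrative choices, not taken from the source.

```python
def build_bad_char_table(pattern: str, alphabet: str) -> dict[str, list[int]]:
    """table[c][i] = largest 0-based index j < i with pattern[j] == c, else -1."""
    table = {}
    for c in alphabet:
        row = []
        last = -1                     # last index of c seen so far
        for i, ch in enumerate(pattern):
            row.append(last)          # only occurrences strictly left of i
            if ch == c:
                last = i
        table[c] = row
    return table                      # k rows of m entries: O(km) space


table = build_bad_char_table("NNAAMAN", "AMNP")
i, c = 3, "N"                         # mismatch at 0-based index 3, text has N
print(i - table[c][i])                # 2
print(i - table["P"][i])              # 4
```

Building the table takes O(km) time in this sketch; each lookup afterwards is a constant-time index operation.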

The C and Java implementations (make_delta1 and makeCharTable, respectively) use a table with $O(k)$ space complexity. This is the same as the original delta1 and the BMH bad-character table. This table maps a character at position $i$ to a shift of $\operatorname{len}(p) - 1 - i$, with the last instance (the least shift amount) taking precedence. All unused characters are set to $\operatorname{len}(p)$ as a sentinel value.
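
A Python analogue of such a one-dimensional table might look like the sketch below. It is not a transcription of the C or Java code: the names `build_delta1` and `delta1_lookup` are hypothetical, positions are 0-based, and a dictionary stands in for a fixed array of size k, so absent characters are handled by a default value rather than by pre-filling.

```python
def build_delta1(pattern: str) -> dict[str, int]:
    """Map each pattern character to len(p) - 1 - i for its last position i."""
    m = len(pattern)
    delta1 = {}
    for i, c in enumerate(pattern):
        delta1[c] = m - 1 - i      # later positions overwrite earlier ones
    return delta1


def delta1_lookup(delta1: dict[str, int], m: int, c: str) -> int:
    """Characters that never occur in the pattern get the sentinel value m."""
    return delta1.get(c, m)


d = build_delta1("NNAAMAN")
print(d)                           # {'N': 0, 'A': 1, 'M': 2}
print(delta1_lookup(d, 7, "P"))    # 7
```

How a table value is turned into an alignment shift depends on the surrounding search loop, which is not shown here.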

The good-suffix rule


Description

----X--K-----
MANPANAMANAP-
ANAMPNAM-----
----ANAMPNAM-
Demonstration of good-suffix rule with pattern P = ANAMPNAM. Here, t is T[6..8] and t′ is P[2..4].

The good-suffix rule is markedly more complex in both concept and implementation than the bad-character rule. Like the bad-character rule, it also exploits the algorithm's feature of comparisons beginning at the end of the pattern and proceeding towards the pattern's start. It can be described as follows:[5]

Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, and suppose t is the largest such substring for the given alignment.

  1. Then find, if it exists, the right-most copy t′ of t in P such that t′ is not a suffix of P and the character to the left of t′ in P differs from the character to the left of t in P. Shift P to the right so that substring t′ in P aligns with substring t in T.
  2. If t′ does not exist, then shift the left end of P to the right by the least amount (past the left end of t in T) so that a prefix of the shifted pattern matches a suffix of t in T. This includes cases where t is an exact match of P.
  3. If no such shift is possible, then shift P by m (length of P) places to the right.
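
The three cases can be checked directly against a pattern with a brute-force sketch. The function name `good_suffix_shift` is hypothetical; the code assumes that a copy of t at the very start of P (with no character to its left) is acceptable in case 1, and that an empty matched suffix carries no information and yields a shift of 1. It runs in roughly O(m²) time per call, so it illustrates the rule rather than the preprocessing.

```python
def good_suffix_shift(pattern: str, i: int) -> int:
    """Shift for a mismatch at 1-based pattern index i (0 means full match).

    The matched suffix t is P[i+1..m], i.e. pattern[i:] in 0-based terms.
    """
    m = len(pattern)
    t = pattern[i:]
    if not t:
        return 1                       # nothing matched: no information
    # Case 1: right-most other copy of t whose left neighbour differs
    # from the pattern character that just mismatched.
    for s in range(i - 1, -1, -1):     # s = 0-based start of the candidate copy
        if pattern[s:s + len(t)] == t and (s == 0 or pattern[s - 1] != pattern[i - 1]):
            return i - s
    # Case 2: longest prefix of P that is also a suffix of t.
    for length in range(min(len(t), m - 1), 0, -1):
        if pattern[:length] == pattern[m - length:]:
            return m - length
    # Case 3: nothing reusable, skip the whole pattern.
    return m


# The figure: ANAMPNAM with the mismatch at P (index 5) after matching NAM.
print(good_suffix_shift("ANAMPNAM", 5))  # 4
```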

Preprocessing


The good-suffix rule requires two tables: one for use in the general case (where a copy t′ is found), and another for use when the general case returns no meaningful result. These tables will be designated L and H respectively. Their definitions are as follows:[5]

For each i, $L[i]$ is the largest position less than m such that the string $P[i..m]$ matches a suffix of $P[1..L[i]]$ and such that the character preceding that suffix is not equal to $P[i-1]$. $L[i]$ is defined to be zero if there is no position satisfying the condition.

Let $H[i]$ denote the length of the largest suffix of $P[i..m]$ that is also a prefix of P, if one exists. If none exists, let $H[i]$ be zero.

Both of these tables are constructible in $O(m)$ time and use $O(m)$ space. The alignment shift for index i in P is given by $m - L[i]$ or $m - H[i]$. H should only be used if $L[i]$ is zero or a match has been found.
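
The definitions of L and H can be turned into tables by direct search, as in the sketch below. This is deliberately naive and does not achieve the O(m) construction time mentioned above; it only serves to make the definitions concrete. The function names are hypothetical, the lists are 1-based with unused slots left at zero, and a copy at the very start of P (with no preceding character) is assumed to satisfy the condition on L.

```python
def build_L_and_H(pattern: str) -> tuple[list[int], list[int]]:
    m = len(pattern)
    L = [0] * (m + 2)                      # meaningful for 2 <= i <= m
    H = [0] * (m + 2)
    for i in range(2, m + 1):
        t = pattern[i - 1:]                # P[i..m]
        # L[i]: try end positions e = m-1 down to len(t), largest first.
        for e in range(m - 1, len(t) - 1, -1):
            start = e - len(t)             # 0-based start of P[(e-len(t)+1)..e]
            if pattern[start:e] == t and (
                    start == 0 or pattern[start - 1] != pattern[i - 2]):
                L[i] = e
                break
        # H[i]: longest suffix of P[i..m] that is also a prefix of P.
        for length in range(len(t), 0, -1):
            if pattern[:length] == pattern[m - length:]:
                H[i] = length
                break
    return L, H


def shift_from_tables(L: list[int], H: list[int], m: int, j: int) -> int:
    """j = 1-based mismatch index in P, or 0 after a full match."""
    if j == m:
        return 1                           # nothing matched
    if j == 0:
        return m - H[2]                    # after a match only H applies
    i = j + 1                              # matched suffix is P[i..m]
    return m - L[i] if L[i] > 0 else m - H[i]


L, H = build_L_and_H("ANPANMAN")
print(L[7], H[6])                          # 5 2
print(shift_from_tables(L, H, 8, 6))       # 3
```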


Shift Example using pattern ANPANMAN

Index | Mismatch | Shift
  0   |        N |   1
  1   |       AN |   8
  2   |      MAN |   3
  3   |     NMAN |   6
  4   |    ANMAN |   6
  5   |   PANMAN |   6
  6   |  NPANMAN |   6
  7   | ANPANMAN |   6

Explanation:

Index 0: no characters matched; the character read was not an N. The good-suffix length is zero. Since there are plenty of letters in the pattern that are also not N, there is minimal information here, and the shift is the smallest possible, 1.

Index 1: the N matched, and it was preceded by something other than A. Looking at the pattern from the end, is there an N preceded by something other than A? There are two other N's, but both are preceded by A. That means no part of the good suffix can be useful, so the shift is the full pattern length, 8.

Index 2: the AN matched, and it was preceded by something other than M. In the middle of the pattern there is an AN preceded by P, so it becomes the shift candidate. Shifting that AN to the right to line up with the match is a shift of 3.

Index 3 and up: the matched suffixes do not match anything else in the pattern, but the trailing suffix AN matches the start of the pattern, so the shifts here are all 6.[6]

The Galil rule


A simple but important optimization of Boyer–Moore was put forth by Zvi Galil in 1979.[7] As opposed to shifting, the Galil rule deals with speeding up the actual comparisons done at each alignment by skipping sections that are known to match. Suppose that at an alignment k1, P is compared with T down to character c of T. Then if P is shifted to k2 such that its left end is between c and k1, in the next comparison phase a prefix of P must match the substring T[(k2-m+1)..k1]. Thus if the comparisons get down to position k1 of T, an occurrence of P can be recorded without explicitly comparing past k1. In addition to increasing the efficiency of Boyer–Moore, the Galil rule is required for proving linear-time execution in the worst case.

The Galil rule, in its original version, is only effective for versions that output multiple matches. It updates the substring range only on c = 0, i.e. a full match. A generalized version for dealing with submatches was reported in 1985 as the Apostolico–Giancarlo algorithm.[8]
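
The effect of the rule after a full match can be sketched in isolation. The code below is a simplified illustration, not the full Boyer–Moore algorithm: on a mismatch it shifts by only 1 and forgets what it knew, and on a match it shifts by the pattern's period (m minus the longest proper prefix that is also a suffix). The name `search_with_galil` and the comparison counter are hypothetical additions made for demonstration.

```python
def search_with_galil(text: str, pattern: str) -> tuple[list[int], int]:
    """Return (match alignments, number of character comparisons)."""
    n, m = len(text), len(pattern)
    hits, comparisons = [], 0
    if m == 0:
        return hits, comparisons
    # Longest proper prefix of P that is also a suffix of P.
    border = next((b for b in range(m - 1, 0, -1)
                   if pattern[:b] == pattern[m - b:]), 0)
    period = m - border
    k = m            # 1-based alignment
    known = 0        # text positions <= known already match a prefix of P
    while k <= n:
        i, h = m, k
        mismatch = False
        while i >= 1 and h > known:        # Galil: stop once we reach `known`
            comparisons += 1
            if text[h - 1] != pattern[i - 1]:
                mismatch = True
                break
            i -= 1
            h -= 1
        if mismatch:
            known = 0                      # simplified: discard knowledge
            k += 1
        else:
            hits.append(k)
            known = k                      # text up to k is now known
            k += period
    return hits, comparisons


print(search_with_galil("a" * 10, "a" * 3))  # ([3, 4, 5, 6, 7, 8, 9, 10], 10)
```

In this run each text character is compared once, whereas re-comparing the full window at every alignment would take 24 comparisons.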

Performance


The Boyer–Moore algorithm as presented in the original paper has a worst-case running time of $O(n + m)$ only if the pattern does not appear in the text. This was first proved by Knuth, Morris, and Pratt in 1977,[3] followed by Guibas and Odlyzko in 1980[9] with an upper bound of 5n comparisons in the worst case. Richard Cole gave a proof with an upper bound of 3n comparisons in the worst case in 1991.[10] There is a simple modification of the BM algorithm which improves the bound to 2n.[11]

When the pattern does occur in the text, the running time of the original algorithm is $O(nm)$ in the worst case. This is easy to see when both pattern and text consist solely of the same repeated character. However, inclusion of the Galil rule results in linear runtime across all cases.[7][10]
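
As a worked count for the repeated-character case, assume the text is n copies of one character and the pattern is m copies of the same character. Every alignment is then a match, the shift after each match is 1, and without the Galil rule each alignment costs m comparisons. The figures n = 10 and m = 3 are chosen here only for illustration.

```latex
\[
  \underbrace{(n - m + 1)}_{\text{alignments}} \cdot
  \underbrace{m}_{\text{comparisons per alignment}} = O(nm),
  \qquad \text{e.g. } n = 10,\ m = 3:\ (10 - 3 + 1) \cdot 3 = 24 .
\]
\[
  \text{With the Galil rule: } m + (n - m) = n \text{ comparisons},
  \qquad \text{e.g. } 3 + 7 = 10 .
\]
```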

Knuth, Morris and Pratt also showed that for a random text, the average number of character comparisons is bounded by $O\left(n \frac{\log_k m}{m}\right)$, where k is the alphabet size.

Implementations


Various implementations exist in different programming languages. In C++ it has been part of the Standard Library since C++17, and Boost provides a generic Boyer–Moore search implementation under the Algorithm library. In Go there is an implementation in search.go. D uses a BoyerMooreFinder for predicate-based matching within ranges as part of the Phobos Runtime Library.

The Boyer–Moore algorithm is also used in GNU's grep.[12]

Variants


The Boyer–Moore–Horspool algorithm is a simplification of the Boyer–Moore algorithm using only the bad-character rule.
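
A minimal sketch of the Horspool variant is shown below, assuming its usual formulation in which the shift is looked up from the text character aligned with the last pattern position. The name `horspool_search` is hypothetical, and the window is compared with a slice for brevity rather than character by character from the right.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Return 1-based alignments (index of the last pattern character)."""
    n, m = len(text), len(pattern)
    hits = []
    if m == 0 or m > n:
        return hits
    # Shift table built from every pattern character except the last one;
    # later occurrences overwrite earlier ones, giving the smallest shift.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    k = m
    while k <= n:
        if text[k - m:k] == pattern:
            hits.append(k)
        # Characters not in the table allow a jump of the full length m.
        k += shift.get(text[k - 1], m)
    return hits


print(horspool_search("ANPANMAN", "PAN"))  # [5]
```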

The Apostolico–Giancarlo algorithm speeds up the process of checking whether a match has occurred at the given alignment by skipping explicit character comparisons. This uses information gleaned during the preprocessing of the pattern in conjunction with suffix match lengths recorded at each match attempt. Storing suffix match lengths requires an additional table equal in size to the text being searched.

The Raita algorithm improves the performance of the Boyer–Moore–Horspool algorithm. The order in which it compares the characters of the pattern against a window of the text differs from that of the Boyer–Moore–Horspool algorithm.

Notes

  1. m is the length of the pattern string, which we are searching for in the text, which is of length n. This runtime is for finding all occurrences of the pattern, without the Galil rule.
  2. k is the size of the alphabet. This space is for the original delta1 bad-character table in the C and Java implementations and the good-suffix table.

References

  1. Hume, Andrew; Sunday, Daniel (November 1991). "Fast String Searching". Software: Practice and Experience. 21 (11): 1221–1248. doi:10.1002/spe.4380211105. S2CID 5902579.
  2. Boyer, Robert S.; Moore, J Strother (October 1977). "A Fast String Searching Algorithm". Comm. ACM. 20 (10). New York: Association for Computing Machinery: 762–772. doi:10.1145/359842.359859. ISSN 0001-0782. S2CID 15892987.
  3. Knuth, Donald E.; Morris, James H. Jr.; Pratt, Vaughan R. (1977). "Fast pattern matching in strings". SIAM Journal on Computing. 6 (2): 323–350. CiteSeerX 10.1.1.93.8147. doi:10.1137/0206024. ISSN 0097-5397.
  4. Rytter, Wojciech (1980). "A Correct Preprocessing Algorithm for Boyer–Moore String-Searching". SIAM Journal on Computing. 9 (3): 509–512. doi:10.1137/0209037. ISSN 0097-5397.
  5. Gusfield, Dan (1999) [1997], "Chapter 2 - Exact Matching: Classical Comparison-Based Methods", Algorithms on Strings, Trees, and Sequences (1 ed.), Cambridge University Press, pp. 19–21, ISBN 0-521-58519-8
  6. "Constructing a Good Suffix Table - Understanding an example". Stack Overflow. 11 December 2014. Retrieved 30 July 2024. This article incorporates text from this source, which is available under the CC BY-SA 3.0 license.
  7. Galil, Z. (September 1979). "On improving the worst case running time of the Boyer–Moore string matching algorithm". Comm. ACM. 22 (9). New York: Association for Computing Machinery: 505–508. doi:10.1145/359146.359148. ISSN 0001-0782. S2CID 1333465.
  8. Apostolico, Alberto; Giancarlo, Raffaele (February 1986). "The Boyer–Moore–Galil String Searching Strategies Revisited". SIAM Journal on Computing. 15: 98–105. doi:10.1137/0215007.
  9. Guibas, Leonidas; Odlyzko, Andrew (1977). "A new proof of the linearity of the Boyer-Moore string searching algorithm". 18th Annual Symposium on Foundations of Computer Science (SFCS 1977). IEEE Computer Society. pp. 189–195. doi:10.1109/SFCS.1977.3. S2CID 6470193.
  10. Cole, Richard (September 1991). Tight bounds on the complexity of the Boyer-Moore string matching algorithm. Society for Industrial and Applied Mathematics. pp. 224–233. ISBN 0-89791-376-0.
  11. Crochemore, Maxime; et al. (1994). "Speeding Up Two String-Matching Algorithms". Algorithmica. 12 (24): 247–267. doi:10.1007/BF01185427.
  12. Haertel, Mike (21 August 2010). "why GNU grep is fast". FreeBSD-current mailing list archive.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Boyer–Moore_string-search_algorithm&oldid=1312259676"