Movatterモバイル変換

[0]ホーム

Jump to content

Suffix array

Edit links

From Wikipedia, the free encyclopedia

Data structure for a string

Suffix array

Type

Array

Invented by

Manber & Myers (1990)

Time complexity
inbig O notation

	Average	Worst case
Space	${\mathcal {O}}(n)$	${\mathcal {O}}(n)$
Construction	${\mathcal {O}}(n)$	${\mathcal {O}}(n)$

Incomputer science, asuffix array is a sortedarray of allsuffixes of astring. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field ofbibliometrics.

Suffix arrays were introduced byManber & Myers (1990) as a simple, space efficient alternative tosuffix trees. They had independently been discovered byGaston Gonnet in 1987 under the namePAT array (Gonnet, Baeza-Yates & Snider 1992).

Li, Li & Huo (2016) gave the first in-place ${\mathcal {O}}(n)$ time suffix array construction algorithm that is optimal both in time and space, wherein-place means that the algorithm only needs ${\mathcal {O}}(1)$ additional space beyond the input string and the output suffix array.

Enhanced suffix arrays (ESAs) are suffix arrays with additional tables that reproduce the full functionality of suffix trees preserving the same time and memory complexity.^[1]A sorted array of only some (rather than all) suffixes of a string is called asparse suffix array.^[2]

Definition

[edit]

Let $S=S[1]S[2]...S[n]$ be an ${\textstyle n}$ -string and let $S[i,j]$ denote the substring of $S {\displaystyle S}$ ranging from $i {\displaystyle i}$ to $j {\displaystyle j}$ inclusive.

The suffix array $A {\displaystyle A}$ of $S {\displaystyle S}$ is now defined to be an array of integers providing the starting positions ofsuffixes of $S {\displaystyle S}$ inlexicographical order. This means, an entry $A[i]$ contains the starting position of the $i {\displaystyle i}$ -th smallest suffix in $S {\displaystyle S}$ and thus for all $1\leq i\leq n$ : $S[A[i-1],n]<S[A[i],n]$ .

Eachsuffix of $S {\displaystyle S}$ shows up in $A {\displaystyle A}$ exactly once. Suffixes are simple strings. These strings are sorted (as in a paper dictionary), before their starting positions (integer indices) are saved in $A {\displaystyle A}$ .

Example

[edit]

Consider the text $S {\displaystyle S}$ =banana$ to be indexed:

i	1	2	3	4	5	6	7
$S[i]$	b	a	n	a	n	a	$

The text ends with the special sentinel letter$ that is unique and lexicographically smaller than any other character. The text has the following suffixes:

Suffix	i
banana$	1
anana$	2
nana$	3
ana$	4
na$	5
a$	6
$	7

These suffixes can be sorted in ascending order:

Suffix	i
$	7
a$	6
ana$	4
anana$	2
banana$	1
na$	5
nana$	3

The suffix array $A {\displaystyle A}$ contains the starting positions of these sorted suffixes:

i =	1	2	3	4	5	6	7
$A[i]$ =	7	6	4	2	1	5	3

The suffix array with the suffixes written out vertically underneath for clarity:

i =	1	2	3	4	5	6	7
$A[i]$ =	7	6	4	2	1	5	3
1	$	a	a	a	b	n	n
2		$	n	n	a	a	a
3			a	a	n	$	n
4			$	n	a		a
5				a	n		$
6				$	a
7					$

So for example, $A[3]$ contains the value 4, and therefore refers to the suffix starting at position 4 within $S {\displaystyle S}$ , which is the suffixana$.

Correspondence to suffix trees

[edit]

Suffix arrays are closely related tosuffix trees:

Suffix arrays can be constructed by performing adepth-first traversal of a suffix tree. The suffix array corresponds to the leaf-labels given in the order in which these are visited during the traversal, if edges are visited in the lexicographical order of their first character.
A suffix tree can be constructed in linear time by using a combination of suffix array andLCP array. For a description of the algorithm, see thecorresponding section in theLCP array article.

It has been shown that every suffix tree algorithm can be systematically replaced with an algorithm that uses a suffix array enhanced with additional information (such as theLCP array) and solves the same problem in the same time complexity.^[1]Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms (e.g., compared toUkkonen's algorithm) and improved cache locality.^[3]

Space efficiency

[edit]

Suffix arrays were introduced byManber & Myers (1990) in order to improve over the space requirements ofsuffix trees: Suffix arrays store $n {\displaystyle n}$ integers. Assuming an integer requires $4 {\displaystyle 4}$ bytes, a suffix array requires $4n$ bytes in total. This is significantly less than the $20n$ bytes which are required by a careful suffix tree implementation.^[4]

However, in certain applications, the space requirements of suffix arrays may still be prohibitive. Analyzed in bits, a suffix array requires ${\mathcal {O}}(n\log n)$ space, whereas the original text over an alphabet of size $\sigma$ only requires ${\mathcal {O}}(n\log \sigma )$ bits.For a human genome with $\sigma =4$ and $n=3.4\times 10^{9}$ the suffix array would therefore occupy about 16 times more memory than the genome itself.

Such discrepancies motivated a trend towardscompressed suffix arrays andBWT-based compressed full-text indices such as theFM-index. These data structures require only space within the size of the text or even less.

Construction algorithms

[edit]

A suffix tree can be built in ${\mathcal {O}}(n)$ and can be converted into a suffix array by traversing the tree depth-first also in ${\mathcal {O}}(n)$ , so there exist algorithms that can build a suffix array in ${\mathcal {O}}(n)$ .

A naive approach to construct a suffix array is to use acomparison-based sorting algorithm. These algorithms require ${\mathcal {O}}(n\log n)$ suffix comparisons, but a suffix comparison runs in ${\mathcal {O}}(n)$ time, so the overall runtime of this approach is ${\mathcal {O}}(n^{2}\log n)$ .

More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related to each other. These algorithms strive to achieve the following goals:^[5]

minimal asymptotic complexity $\Theta (n)$
lightweight in space, meaning little or no working memory beside the text and the suffix array itself is needed
fast in practice

One of the first algorithms to achieve all goals is the SA-IS algorithm ofNong, Zhang & Chan (2009). The algorithm is also rather simple (< 100LOC) and can be enhanced to simultaneously construct theLCP array.^[6] The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A careful implementation by Yuta Mori^[7] outperforms most other linear or super-linear construction approaches.

Beside time and space requirements, suffix array construction algorithms are also differentiated by their supportedalphabet:constant alphabets where the alphabet size is bound by a constant,integer alphabets where characters are integers in a range depending on $n {\displaystyle n}$ andgeneral alphabets where only character comparisons are allowed.^[8]

Most suffix array construction algorithms are based on one of the following approaches:^[5]

Prefix doubling algorithms are based on a strategy ofKarp, Miller & Rosenberg (1972). The idea is to find prefixes that honor the lexicographic ordering of suffixes. The assessed prefix length doubles in each iteration of the algorithm until a prefix is unique and provides the rank of the associated suffix.
Recursive algorithms follow the approach of the suffix tree construction algorithm byFarach (1997) to recursively sort a subset of suffixes. This subset is then used to infer a suffix array of the remaining suffixes. Both of these suffix arrays are then merged to compute the final suffix array.
Induced copying algorithms are similar to recursive algorithms in the sense that they use an already sorted subset to induce a fast sort of the remaining suffixes. The difference is that these algorithms favor iteration over recursion to sort the selected suffix subset. A survey of this diverse group of algorithms has been put together byPuglisi, Smyth & Turpin (2007).

A well-known recursive algorithm for integer alphabets is theDC3 / skew algorithm ofKärkkäinen & Sanders (2003). It runs in linear time and has successfully been used as the basis for parallel^[9] andexternal memory^[10] suffix array construction algorithms.

Recent work bySalson et al. (2010) proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is ${\mathcal {O}}(n\log n)$ , it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text.

In practicalopen source work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm.^[11] This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017. It too can be modified to compute an LCP array. It uses a induced copying combined with Itoh-Tanaka.^[12] In 2021 a faster implementation of the algorithm was presented by Ilya Grebnov^[13] which in average showed 65% performance improvement over DivSufSort implementation on theSilesia corpus.^[14]

Generalized suffix array

[edit]

The concept of a suffix array can be extended to more than one string. This is called a generalized suffix array (or GSA), a suffix array that contains all suffixes for a set of strings (for example, $S=S_{1},S_{2},S_{3},...,S_{k}$ and is lexicographically sorted with all suffixes of each string.^[15]

Applications

[edit]

The suffix array of a string can be used as anindex to quickly locate every occurrence of a substring pattern $P {\displaystyle P}$ within the string $S {\displaystyle S}$ . Finding every occurrence of the pattern is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and can be found efficiently with twobinary searches. The first search locates the starting position of the interval, and the second one determines the end position:^{[citation needed]}

n=len(S)defsearch(P:str)->tuple[int,int]:"""    Return indices (s, r) such that the interval A[s:r] (including the end    index) represents all suffixes of S that start with the pattern P.    """# Find starting position of intervall=0# in Python, arrays are indexed starting at 0r=nwhilel<r:mid=(l+r)//2# division rounding down to nearest integer# suffixAt(A[i]) is the ith smallest suffixifP>suffixAt(A[mid]):l=mid+1else:r=mids=l# Find ending position of intervalr=nwhilel<r:mid=(l+r)//2ifsuffixAt(A[mid]).startswith(P):l=mid+1else:r=midreturn(s,r)

Finding the substring pattern $P {\displaystyle P}$ of length $m {\displaystyle m}$ in the string $S {\displaystyle S}$ of length $n {\displaystyle n}$ takes ${\mathcal {O}}(m\log n)$ time, given that a single suffix comparison needs to compare $m {\displaystyle m}$ characters.Manber & Myers (1990) describe how this bound can be improved to ${\mathcal {O}}(m+\log n)$ time usingLCP information. The idea is that a pattern comparison does not need to re-compare certain characters, when it is already known that these are part of the longest common prefix of the pattern and the current search interval.Abouelhoda, Kurtz & Ohlebusch (2004) improve the bound even further and achieve a search time of ${\mathcal {O}}(m)$ for constant alphabet size, as known fromsuffix trees.

Suffix sorting algorithms can be used to compute theBurrows–Wheeler transform (BWT). TheBWT requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other character (i.e., $), then the order of the sorted rotatedBWT matrix corresponds to the order of suffixes in a suffix array. TheBWT can therefore be computed in linear time by first constructing a suffix array of the text and then deducing theBWT string: $BWT[i]=S[A[i]-1]$ .

Suffix arrays can also be used to look up substrings inexample-based machine translation, demanding much less storage than a fullphrase table as used inStatistical machine translation.

Many additional applications of the suffix array require theLCP array. Some of these are detailed in theapplication section of the latter.

Enhanced suffix arrays

[edit]

Suffix trees are powerful data structures that have wide application in areas of pattern and string matching, indexing and textual statistics. However, it occupies a significant amount of space and thus has a drawback in many real-time applications that require processing a considerably large amount of data like genome analysis. To overcome this drawback, Enhanced Suffix Arrays were developed that are data structures consisting of suffix arrays and an additional table called the child table that contains the information about the parent-child relationship between the nodes in the suffix tree. The node branching data structure for this tree is a linked list. Enhanced suffix arrays are superior in terms of both space efficiency and time complexity and are easy to implement. Moreover, they can be applied to any algorithm that uses a suffix tree by using an abstract concept lcp-interval trees. The time complexity for searching a pattern in an enhanced suffix array is O(m|Σ|).

The suffix array of the string is an array of n integers in the range of 0 to n that represents the n+1 suffixes of the string including the special character #.

The suffix array is composed of two arrays:

pos array pos[1,...n]: It represents a sorted list of all S suffixes. Only the initial positions of the suffixes are stored in the array to reduce the space complexity since the suffixes are too large.
lcp array lcp[1,...n]: It is an array of n integers that maintains the lengths of the longest common prefix of two consecutive suffixes stored in the pos array.

Constructing the lcp-interval

[edit]

For a suffix array of S, the lcp-interval associated with the corresponding node of suffix tree of S can be defined as:

Interval [i,..j], 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if
lcptab[i] < l,
lcptab[k] ≥ l for all i + 1 ≤ k ≤ j,
lcptab[k] = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j,
lcptab[j + 1] < l.

The length of the longest common prefix of pos[i − 1] and pos[i] is stored in lcp[i],where 2 ≤ i ≤ n. The lcp-interval portrays the same parent-child relationship as that among the associated nodes in the suffix tree of S.This shows that if the corresponding node of [i..j] is a child of the corresponding node of [k..l], a lcp-interval [i..j] is a child interval of another lcp-interval [k..l]. If [k..l] is a child interval of [i..j], a lcp-interval [i..j] is the parent interval of a lcp-interval [k..l].

Constructing a child table

[edit]

The child tablecldtab is composed of three n arrays,up,down andnextlIndex. The information about the edges of the corresponding suffix tree is stored and maintained by theup anddown arrays. ThenextlIndex array stores the links in the linked list used for node branching the suffix tree.

Theup,down andnextlIndex array are defined as follows:

The elementup[i] records the starting index of the longest lcp-second interval’s child interval, which ends at indexi-1.
The initial index of the second child interval of the longest lcp-interval, starting at indexi is stored in the elementdown[i].
If and only if the interval is neither the first child nor the final child of its parent, the elementnextlIndex[i] contains the first index of the next sibling interval of the longest lcp-interval, starting at indexi.

By performing a bottom-up traversal of the lcp-interval of the tree, the child table can be constructed in linear time. Theup/down values and thenextlIndex values can be computed separately by using two distinct algorithms.

Constructing a suffix link table

[edit]

The suffix links for an enhanced suffix array can be computed by generating the suffix link interval [1,..,r] for each [i,..j] interval during the preprocessing. The left and right elements l and r of the interval are maintained in the first index of [i,..,j]. The table for this interval ranges from 0 to n. The suffix link table is constructed by the left-to-right breadth-first traversal of the lcp-interval tree. Every time anl-interval is computed, it is added to the list of l-intervals, which is referred to as the l-list. When the lcp-value > 0, for everyl-interval[i,..,j] in the list, link[i] is calculated. The interval [l,..,r] is computed by a binary search in(l-1)-list, wherel is the largest left boundary amongst all thel-1 intervals. The suffix link interval of [i,..j] is represented by this interval[l,..,r]. The valuesl andr are ultimately stored in the first index of [i,..,j].

Notes

[edit]

^^a ^bAbouelhoda, Kurtz & Ohlebusch 2004.
^I, Kärkkäinen & Kempa 2014.
^Abouelhoda, Kurtz & Ohlebusch 2002.
^Kurtz 1999.
^^a ^bPuglisi, Smyth & Turpin 2007.
^Fischer 2011.
^Mori, Yuta."sais". Archived fromthe original on 9 Mar 2023. Retrieved31 Aug 2023.
^Burkhardt & Kärkkäinen 2003.
^Kulla & Sanders 2007.
^Dementiev et al. 2008.
^Larsson, N. Jesper; Sadakane, Kunihiko (22 November 2007)."Faster suffix sorting".Theoretical Computer Science.387 (3):258–272.doi:10.1016/j.tcs.2007.07.017.ISSN 0304-3975.
^Fischer, Johannes; Kurpicz, Florian (5 October 2017). "Dismantling DivSufSort".Proceedings of the Prague Stringology Conference 2017.arXiv:1710.01896.
^"New saca and bwt library (libsais)".encode.su. Retrieved2021-10-03.
^Grebnov, Ilya (2021-09-22),libsais, retrieved2021-10-02
^Shi 1996.

References

[edit]

Manber, Udi;Myers, Gene (1990).Suffix arrays: a new method for on-line string searches. First Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 319–327.
Manber, Udi;Myers, Gene (1993)."Suffix arrays: a new method for on-line string searches".SIAM Journal on Computing.22 (5):935–948.doi:10.1137/0222058.S2CID 5074629.
Li, Zhize; Li, Jian; Huo, Hongwei (2016).Optimal In-Place Suffix Sorting. Proceedings of the 25th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science. Vol. 11147. Springer. pp. 268–284.arXiv:1610.08305.doi:10.1007/978-3-030-00479-8_22.ISBN 978-3-030-00478-1.
Shi, Fei (1996). "Suffix arrays for multiple strings: A method for on-line multiple string searches".Concurrency and Parallelism, Programming, Networking, and Security. Lecture Notes in Computer Science. Vol. 1179. Springer Berlin Heidelberg. pp. 11–22.doi:10.1007/BFb0027775.ISBN 978-3-540-62031-0.
Abouelhoda, Mohamed Ibrahim; Kurtz, Stefan; Ohlebusch, Enno (2002).The Enhanced Suffix Array and Its Applications to Genome Analysis. Algorithms in Bioinformatics.Lecture Notes in Computer Science. Vol. 2452.doi:10.1007/3-540-45784-4_35.ISBN 978-3-540-44211-0.
Abouelhoda, Mohamed Ibrahim; Kurtz, Stefan; Ohlebusch, Enno (March 2004)."Replacing suffix trees with enhanced suffix arrays".Journal of Discrete Algorithms.2 (1):53–86.doi:10.1016/S1570-8667(03)00065-0.ISSN 1570-8667.
Gonnet, G.H.; Baeza-Yates, R.A.; Snider, T. (1992)."New indices for text: PAT trees and PAT arrays".Information Retrieval: Data Structures and Algorithms.
Kurtz, S (1999). "Reducing the space requirement of suffix trees".Software: Practice and Experience.29 (13):1149–1171.doi:10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O.hdl:10338.dmlcz/135448.
Puglisi, Simon J.; Smyth, W. F.; Turpin, Andrew H. (2007)."A taxonomy of suffix array construction algorithms".ACM Computing Surveys.39 (2): 4.doi:10.1145/1242471.1242472.S2CID 2653529.
Nong, Ge; Zhang, Sen; Chan, Wai Hong (2009).Linear Suffix Array Construction by Almost Pure Induced-Sorting. 2009 Data Compression Conference. p. 193.doi:10.1109/DCC.2009.42.ISBN 978-0-7695-3592-0.
Fischer, Johannes (2011).Inducing the LCP-Array. Algorithms and Data Structures. Lecture Notes in Computer Science. Vol. 6844. pp. 374–385.arXiv:1101.3448.doi:10.1007/978-3-642-22300-6_32.ISBN 978-3-642-22299-3.
Salson, M.; Lecroq, T.; Léonard, M.; Mouchard, L. (2010)."Dynamic extended suffix arrays".Journal of Discrete Algorithms.8 (2): 241.doi:10.1016/j.jda.2009.02.007.
Burkhardt, Stefan; Kärkkäinen, Juha (2003).Fast Lightweight Suffix Array Construction and Checking. Combinatorial Pattern Matching. Lecture Notes in Computer Science. Vol. 2676. pp. 55–69.doi:10.1007/3-540-44888-8_5.ISBN 978-3-540-40311-1.
Karp, Richard M.; Miller, Raymond E.; Rosenberg, Arnold L. (1972).Rapid identification of repeated patterns in strings, trees and arrays. Proceedings of the fourth annual ACM symposium on Theory of computing - STOC '72. pp. 125–136.doi:10.1145/800152.804905.
Farach, M. (1997).Optimal suffix tree construction with large alphabets. Proceedings 38th Annual Symposium on Foundations of Computer Science.doi:10.1109/SFCS.1997.646102.ISBN 0-8186-8197-7.
I, Tomohiro; Kärkkäinen, Juha; Kempa, Dominik (2014).Faster Sparse Suffix Sorting. Leibniz International Proceedings in Informatics (LIPIcs). Vol. 25. Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik. pp. 386–396.doi:10.4230/LIPIcs.STACS.2014.386.ISBN 978-3-939897-65-1.
Kärkkäinen, Juha;Sanders, Peter (2003).Simple Linear Work Suffix Array Construction. Automata, Languages and Programming. Lecture Notes in Computer Science. Vol. 2719.doi:10.1007/3-540-45061-0_73.ISBN 978-3-540-40493-4.
Dementiev, Roman; Kärkkäinen, Juha; Mehnert, Jens;Sanders, Peter (2008)."Better external memory suffix array construction".Journal of Experimental Algorithmics.12:1–24.doi:10.1145/1227161.1402296.S2CID 12296500.
Kulla, Fabian;Sanders, Peter (2007). "Scalable parallel suffix array construction".Parallel Computing.33 (9):605–612.doi:10.1016/j.parco.2007.06.004.
Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. "Replacing suffix trees with enhanced suffix arrays."Journal of Discrete Algorithms, 2(1):53–86, 2004.
Dong Kyue Kim, Jeong Eun Jeon, and Heejin Park. "An efficient index data structure with the capabilities of suffix trees and suffix arrays for alphabets of non-negligible size."String Processing and Information Retrieval Lecture Notes in Computer Science, page138–149, 2004.

External links

[edit]

Wikimedia Commons has media related toSuffix array.

v t e Strings
String metric	Approximate string matching Bitap algorithm Damerau–Levenshtein distance Edit distance Gestalt pattern matching Hamming distance Jaro–Winkler distance Lee distance Levenshtein automaton Levenshtein distance Wagner–Fischer algorithm
String-searching algorithm	Apostolico–Giancarlo algorithm Boyer–Moore string-search algorithm Boyer–Moore–Horspool algorithm Knuth–Morris–Pratt algorithm Rabin–Karp algorithm Raita algorithm Trigram search Two-way string-matching algorithm Zhu–Takaoka string matching algorithm
Multiple string searching	Aho–Corasick Commentz-Walter algorithm
Regular expression	Comparison of regular-expression engines Regular grammar Thompson's construction Nondeterministic finite automaton
Sequence alignment	BLAST Hirschberg's algorithm Needleman–Wunsch algorithm Smith–Waterman algorithm
Data structure	DAFSA Substring index Suffix array Suffix automaton Suffix tree Compressed suffix array LCP array FM-index Generalized suffix tree Rope Ternary search tree Trie
Other	Parsing Pattern matching Compressed pattern matching Longest common subsequence Longest common substring Sequential pattern mining Sorting String rewriting systems String operations

Retrieved from "https://en.wikipedia.org/w/index.php?title=Suffix_array&oldid=1287002172"

Categories:

Hidden categories:

[8]ページ先頭