Suffix tree

From Wikipedia, the free encyclopedia
Tree containing all suffixes of a given text
Suffix tree for the text BANANA. Each substring is terminated with the special character $. The six paths from the root to the leaves (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links, drawn dashed, are used during construction.

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, such as locating a substring in S, locating a substring if a certain number of mistakes are allowed, and locating matches for a regular expression pattern. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem.[2] These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.

History


The concept was first introduced by Weiner (1973). Rather than the suffix S[i..n], Weiner stored in his trie[3] the prefix identifier for each position, that is, the shortest string starting at i and occurring only once in S. His Algorithm D takes an uncompressed[4] trie for S[k+1..n] and extends it into a trie for S[k..n]. This way, starting from the trivial trie for S[n..n], a trie for S[1..n] can be built by n−1 successive calls to Algorithm D; however, the overall run time is O(n²). Weiner's Algorithm B maintains several auxiliary data structures to achieve an overall run time linear in the size of the constructed trie. The latter can still have O(n²) nodes, e.g. for S = a^n b^n a^n b^n $. Weiner's Algorithm C finally uses compressed tries to achieve linear overall storage size and run time.[5] Donald Knuth subsequently characterized the latter as the "Algorithm of the Year 1973", according to his student Vaughan Pratt.[6] The textbook Aho, Hopcroft & Ullman (1974, Sect. 9.5) reproduced Weiner's results in a simplified and more elegant form, introducing the term position tree.

McCreight (1976) was the first to build a (compressed) trie of all suffixes of S. Although the suffix starting at i is usually longer than the prefix identifier, their path representations in a compressed trie do not differ in size. On the other hand, McCreight could dispense with most of Weiner's auxiliary data structures; only suffix links remained.

Ukkonen (1995) further simplified the construction.[6] He provided the first online construction of suffix trees, now known as Ukkonen's algorithm, with running time that matched the then-fastest algorithms. These algorithms are all linear-time for a constant-size alphabet and have a worst-case running time of O(n log n) in general.

Farach (1997) gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach's algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example in external memory, or in compressed and succinct form.

Definition


The suffix tree for the string S of length n is defined as a tree such that:[7]

  1. The tree has exactly n leaves numbered from 1 to n.
  2. Except for the root, every internal node has at least two children.
  3. Each edge is labelled with a non-empty substring of S.
  4. No two edges starting out of a node can have string-labels beginning with the same character.
  5. The string obtained by concatenating all the string-labels found on the path from the root to leaf i spells out the suffix S[i..n], for i from 1 to n.

If a suffix of S is also a prefix of another suffix, such a tree does not exist for the string. For example, in the string abcbc, the suffix bc is also a prefix of the suffix bcbc. In such a case, the path spelling out bc will not end in a leaf, violating the fifth rule. To fix this problem, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S.[8] Since all internal non-root nodes are branching, there can be at most n−1 such nodes, and n + (n−1) + 1 = 2n nodes in total (n leaves, n−1 internal non-root nodes, 1 root).
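To make the five conditions above concrete, here is a minimal sketch of the naive construction: append the terminal symbol $ and insert each suffix into a compressed trie, splitting an edge wherever a mismatch occurs. This illustrative sketch runs in O(n²) time, unlike the linear-time algorithms discussed in this article, and all names in it are hypothetical; real implementations also store each edge label as a reference into S rather than as a copied string.

```python
# Naive O(n^2) suffix tree construction, for illustration only.
# Edge labels are stored as explicit strings here for clarity.

class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> (label, child)
        self.leaf = None     # suffix start position, set for leaves only

def build_suffix_tree(s):
    s += "$"                          # terminal: no suffix is a prefix of another
    root = Node()
    for i in range(len(s) - 1):       # insert suffix s[i:] (skip the lone "$")
        node, suffix = root, s[i:]
        while True:
            head = suffix[0]
            if head not in node.children:
                leaf = Node()
                leaf.leaf = i + 1     # leaves numbered 1..n
                node.children[head] = (suffix, leaf)
                break
            label, child = node.children[head]
            k = 0                     # length of common prefix of label and suffix
            while k < len(label) and label[k] == suffix[k]:
                k += 1
            if k == len(label):       # edge fully matched: descend
                node, suffix = child, suffix[k:]
            else:                     # split the edge at the mismatch
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[head] = (label[:k], mid)
                node, suffix = mid, suffix[k:]
    return root

def leaves(node):
    if node.leaf is not None:
        return [node.leaf]
    return [p for _, c in node.children.values() for p in leaves(c)]

tree = build_suffix_tree("BANANA")
print(sorted(leaves(tree)))           # [1, 2, 3, 4, 5, 6], one leaf per suffix
```

Running it on BANANA yields the six leaves of the figure above, numbered by the start position of their suffix.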

Suffix links are a key feature of older linear-time construction algorithms, although most newer algorithms, which are based on Farach's algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a string (possibly empty), the node has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.

A generalized suffix tree is a suffix tree made for a set of strings instead of a single string. It represents all suffixes from this set of strings. Each string must be terminated by a different termination symbol.

Functionality


A suffix tree for a string S of length n can be built in Θ(n) time if the letters come from an alphabet of integers in a polynomial range (in particular, this is true for constant-sized alphabets).[9] For larger alphabets, the running time is dominated by first sorting the letters to bring them into a range of size O(n); in general, this takes O(n log n) time. The costs below are given under the assumption that the alphabet is constant.

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S1, S2, ..., SK} of total length n = n1 + n2 + ... + nK. Many string queries, such as locating occurrences of patterns and finding repeated substrings, can then be answered efficiently.[10]

The suffix tree can be prepared for constant-time lowest common ancestor retrieval between nodes in Θ(n) time.[18] Further queries then become possible on top of this preprocessing.

Applications


Suffix trees can be used to solve a large number of string problems that occur in text editing, free-text search, computational biology and other application areas.[26] Primary applications include:[26]

  • String search, in O(m) complexity, where m is the length of the substring (but with initial O(n) time required to build the suffix tree for the string)
  • Finding the longest repeated substring
  • Finding the longest common substring
  • Finding the longest palindrome in a string

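The O(m) search works by walking down from the root, consuming one pattern character per step. As a hedged sketch, the walk is shown below on an uncompressed suffix trie, which uses the same search logic as the compressed tree but O(n²) space; the names are illustrative:

```python
# Substring search in O(m) via a suffix trie (uncompressed, for brevity;
# a compressed suffix tree supports the same walk in O(n) space).

def build_suffix_trie(s):
    s += "$"
    root = {}
    for i in range(len(s)):          # insert every suffix, character by character
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    node = trie
    for ch in pattern:               # O(m): one step per pattern character
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("BANANA")
print(contains(trie, "NAN"))         # True: "NAN" occurs in "BANANA"
print(contains(trie, "NAB"))         # False
```

A pattern occurs in the text exactly when it labels a path starting at the root, which is why no per-query scan of the text is needed.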
Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZW compression schemes use suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines.[27]

Implementation


If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of all the strings on all of the edges in the tree is O(n²), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a Fibonacci word, giving the full 2n nodes.
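The position-and-length trick can be sketched as follows; `Edge` is a hypothetical name, and the point is only that an edge occupies Θ(1) words regardless of how long its label is:

```python
# Storing an edge label as (start, length) into S instead of a copied string:
# the label is recovered on demand from the text.

S = "BANANA$"

class Edge:
    def __init__(self, start, length, child):
        self.start, self.length = start, length   # O(1) words per edge
        self.child = child

    def label(self):                              # materialised only when needed
        return S[self.start:self.start + self.length]

e = Edge(1, 3, child=None)    # refers to S[1:4]
print(e.label())              # "ANA"
```

Because every label is a substring of S, the text itself serves as shared storage for all Θ(n) edges.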

An important choice when making a suffix tree implementation is the representation of the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Other implementations with efficient running time properties use hash maps, sorted or unsorted arrays (with array doubling), or balanced search trees. We are interested in:

  • The cost of finding the child on a given character.
  • The cost of inserting a child.
  • The cost of enlisting all children of a node (divided by the number of children in the table below).

Let σ be the size of the alphabet. Then you have the following costs:[citation needed]

                                 Lookup    Insertion  Traversal
Sibling lists / unsorted arrays  O(σ)      Θ(1)       Θ(1)
Bitwise sibling trees            O(log σ)  Θ(1)       Θ(1)
Hash maps                        Θ(1)      Θ(1)       O(σ)
Balanced search tree             O(log σ)  O(log σ)   O(1)
Sorted arrays                    O(log σ)  O(σ)       O(1)
Hash maps + sibling lists        O(1)      O(1)       O(1)

The insertion costs are amortised, and the costs for hashing are given for perfect hashing.
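As an illustration of the first table row, a sibling list can be sketched as follows (hypothetical names): lookup scans the siblings in O(σ), while insertion merely prepends in Θ(1):

```python
# Sibling-list child representation: each node stores only a pointer to its
# first child and to its next sibling.

class Node:
    def __init__(self, char):
        self.char = char          # first character of the incoming edge label
        self.first_child = None
        self.next_sibling = None

def find_child(node, char):       # O(sigma): scan the sibling list
    c = node.first_child
    while c is not None and c.char != char:
        c = c.next_sibling
    return c

def insert_child(node, char):     # Theta(1): prepend to the sibling list
    c = Node(char)
    c.next_sibling = node.first_child
    node.first_child = c
    return c

root = Node(None)
for ch in "BAN":
    insert_child(root, ch)
print(find_child(root, "A").char)   # "A"
print(find_child(root, "X"))        # None
```

Replacing the sibling list with a hash map per node trades this Θ(1) insertion for Θ(1) lookup, at the cost of O(σ) traversal, as in the table above.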

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to 20 times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of 8 (for an array including LCP values, built within a 32-bit address space with 8-bit characters). This factor depends on the properties of the text and may reach 2 with the use of 4-byte-wide characters (needed to contain any symbol in some UNIX-like systems, see wchar_t) on 32-bit systems.[citation needed] Researchers have continued to find smaller indexing structures.

Parallel construction


Various parallel algorithms to speed up suffix tree construction have been proposed.[28][29][30][31][32] Recently, a practical parallel algorithm for suffix tree construction with O(n) work (sequential time) and O(log² n) span has been developed. The algorithm achieves good parallel scalability on shared-memory multicore machines and can index the human genome – approximately 3 GB – in under 3 minutes using a 40-core machine.[33]

External construction


Though linear, the memory usage of a suffix tree is significantly higher than the actual size of the sequence collection. For a large text, construction may require external-memory approaches.

There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach-Colton, Ferragina & Muthukrishnan (2000) is theoretically optimal, with an I/O complexity equal to that of sorting. However, the overall intricacy of this algorithm has so far prevented its practical implementation.[34]

On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (a few) GB/hours. The state-of-the-art methods are TDD,[35] TRELLIS,[36] DiGeST,[37] and B2ST.[38]

TDD and TRELLIS scale up to the entire human genome, resulting in a disk-based suffix tree of a size in the tens of gigabytes.[35][36] However, these methods cannot efficiently handle collections of sequences exceeding 3 GB.[37] DiGeST performs significantly better and is able to handle collections of sequences on the order of 6 GB in about 6 hours.[37]

All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the input does. The most recent method, B2ST,[38] scales to handle inputs that do not fit in main memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16 GB RAM. On a simple Linux cluster with 16 nodes (4 GB RAM per node), ERA can index the entire human genome in less than 9 minutes.[39]


Notes

  1. ^ Donald E. Knuth; James H. Morris; Vaughan R. Pratt (Jun 1977). "Fast Pattern Matching in Strings" (PDF). SIAM Journal on Computing. 6 (2): 323–350. doi:10.1137/0206024. Here: p. 339 bottom.
  2. ^ Knuth conjectured in 1970 that the problem could not be solved in linear time.[1] In 1973, this was refuted by Weiner's suffix-tree algorithm, Weiner (1973).
  3. ^ This term is used here to distinguish Weiner's precursor data structures from proper suffix trees as defined above, which were not considered before McCreight (1976).
  4. ^ i.e., with each branch labelled by a single character
  5. ^ See File:WeinerB aaaabbbbaaaabbbb.gif and File:WeinerC aaaabbbbaaaabbbb.gif for an uncompressed example tree and its compressed counterpart.
  6. ^ a b Giegerich & Kurtz (1997).
  7. ^ Gusfield (1999), p. 90.
  8. ^ Gusfield (1999), pp. 90–91.
  9. ^ Farach (1997).
  10. ^ Gusfield (1999), p. 92.
  11. ^ Gusfield (1999), p. 123.
  12. ^ Baeza-Yates & Gonnet (1996).
  13. ^ Gusfield (1999), p. 132.
  14. ^ Gusfield (1999), p. 125.
  15. ^ Gusfield (1999), p. 144.
  16. ^ Gusfield (1999), p. 166.
  17. ^ Such substrings correspond to internal nodes of maximal depth within the suffix tree, hence can be found in linear time using depth-first search.
  18. ^ Gusfield (1999), Chapter 8.
  19. ^ Gusfield (1999), p. 196.
  20. ^ Gusfield (1999), p. 200.
  21. ^ Gusfield (1999), p. 198.
  22. ^ Gusfield (1999), p. 201.
  23. ^ Gusfield (1999), p. 204.
  24. ^ Gusfield (1999), p. 205.
  25. ^ Gusfield (1999), pp. 197–199.
  26. ^ a b Allison, L. "Suffix Trees". Archived from the original on 2008-10-13. Retrieved 2008-10-14.
  27. ^ First introduced by Zamir & Etzioni (1998).
  28. ^ Apostolico et al. (1988).
  29. ^ Hariharan (1994).
  30. ^ Sahinalp & Vishkin (1994).
  31. ^ Farach & Muthukrishnan (1996).
  32. ^ Iliopoulos & Rytter (2004).
  33. ^ Shun & Blelloch (2014).
  34. ^ Smyth (2003).
  35. ^ a b Tata, Hankins & Patel (2003).
  36. ^ a b Phoophakdee & Zaki (2007).
  37. ^ a b c Barsky et al. (2008).
  38. ^ a b Barsky et al. (2009).
  39. ^ Mansour et al. (2011).
