Compressed suffix array

From Wikipedia, the free encyclopedia

Compressed data structure for pattern matching

In computer science, a compressed suffix array^[1]^[2]^[3] is a compressed data structure for pattern matching. Compressed suffix arrays are a general class of data structures that improve on the suffix array.^[1]^[2] These data structures enable quick search for an arbitrary string with a comparatively small index.

Given a text $T$ of $n$ characters from an alphabet $\Sigma$, a compressed suffix array supports searching for arbitrary patterns in $T$. For an input pattern $P$ of $m$ characters, the search time is typically $O(m)$ or $O(m + \log n)$. The space used is typically $O(nH_{k}(T)) + o(n)$, where $H_{k}(T)$ is the $k$-th order empirical entropy of the text $T$. The time and space to construct a compressed suffix array are normally $O(n)$.
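For comparison, the query that a compressed suffix array supports can be illustrated on a plain, uncompressed suffix array. The following Python code is a minimal sketch of that baseline, not of the compressed structure itself. The names `build_suffix_array` and `find_occurrences` are illustrative, the construction is a naive sort, and the binary search uses $O(m \log n)$ character comparisons rather than the bounds quoted above.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by sorting the suffixes directly; for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, in increasing order."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with pattern.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "ana"))
```

All suffixes that begin with the pattern occupy one contiguous range of the suffix array, which is why two binary searches are enough. A compressed suffix array answers the same query while storing far fewer bits than the explicit list `sa`.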

The original presentation of a compressed suffix array^[1] solved a long-standing open problem by showing that fast pattern matching was possible using only a linear-space data structure, namely, one proportional to the size of the text $T$, which takes $O(n\,{\log |\Sigma |})$ bits. The conventional suffix array and suffix tree use $\Omega (n\,{\log n})$ bits, which is substantially larger. The basis for the data structure is a recursive decomposition using the "neighbor function," which allows a suffix array to be represented by one of half its length. The construction is repeated multiple times until the resulting suffix array uses a linear number of bits. Subsequent work showed that the actual storage space was related to the zeroth-order entropy and that the index supports self-indexing.^[4] The space bound was further improved, achieving the ultimate goal of higher-order entropy; the compression is obtained by partitioning the neighbor function by high-order contexts, and compressing each partition with a wavelet tree.^[3] The space usage is extremely competitive in practice with other state-of-the-art compressors,^[5] and it also supports fast in-situ pattern matching.
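One level of the halving step can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the construction from the cited papers: positions are 1-based, the text length is assumed to be even, and the names `decompose` and `lookup` are hypothetical. The neighbor function is stored here as a plain list, so the sketch saves no space; it only demonstrates that every suffix array entry can be recovered from a bit vector, the neighbor function, and an array of half the length.

```python
def suffix_array_1based(text):
    """Naive suffix array whose entries are 1-based start positions."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [i + 1 for i in order]


def decompose(sa):
    """Split sa into (is_even, psi, half). Assumes len(sa) is even."""
    n = len(sa)
    assert n % 2 == 0, "this sketch assumes an even text length"
    index_of = {pos: idx for idx, pos in enumerate(sa)}
    # Bit vector: True where the stored position is even.
    is_even = [pos % 2 == 0 for pos in sa]
    # Neighbor function: an odd position points to the index holding its
    # successor position (which is even); an even position points to itself.
    psi = [idx if is_even[idx] else index_of[sa[idx] + 1] for idx in range(n)]
    # Half-length array: the even positions, in suffix order, divided by two.
    half = [pos // 2 for pos in sa if pos % 2 == 0]
    return is_even, psi, half


def lookup(i, is_even, psi, half):
    """Recover sa[i] without access to sa itself."""
    j = psi[i]
    # Rank query: number of even entries at indices 0..j. A linear scan is
    # used here; a real index would use a succinct rank structure.
    rank = sum(is_even[:j + 1])
    value = 2 * half[rank - 1]
    return value if is_even[i] else value - 1


text = "abracadabra$"  # 12 characters, '$' as an end marker
sa = suffix_array_1based(text)
is_even, psi, half = decompose(sa)
recovered = [lookup(i, is_even, psi, half) for i in range(len(sa))]
print(recovered == sa, len(half), len(sa))
```

Applying the same step to `half` repeatedly gives the recursive structure described above. In this sketch the space saving would have to come from a compact encoding of `psi` and `is_even`, which is not attempted here.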

The memory accesses made by compressed suffix arrays and other compressed data structures for pattern matching are typically not localized, and thus these data structures have been notoriously hard to design efficiently for use in external memory. Recent progress using geometric duality takes advantage of the block access provided by disks to speed up the I/O time significantly.^[6] In addition, potentially practical search performance for a compressed suffix array in external memory has been demonstrated.^[7]

Open source implementations

There are several open source implementations of compressed suffix arrays available (see External links below). Bowtie and Bowtie2 are open-source compressed suffix array implementations of read alignment for use in bioinformatics. The Succinct Data Structure Library (SDSL) is a library containing a variety of compressed data structures including compressed suffix arrays. FEMTO is an implementation of compressed suffix arrays for external memory. In addition, a variety of implementations, including the original FM-index implementations, are available from the Pizza & Chili Website (see external links).

References

1. R. Grossi and J. S. Vitter, Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching, SIAM Journal on Computing, 35(2), 2005, 378–407. An earlier version appeared in Proceedings of the 32nd ACM Symposium on Theory of Computing, May 2000, 397–406.
2. P. Ferragina and G. Manzini, Opportunistic Data Structures with Applications, Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000, p. 390.
3. R. Grossi, A. Gupta, and J. S. Vitter, High-Order Entropy-Compressed Text Indexes, Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms, January 2003, 841–850.
4. K. Sadakane, Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Arrays, Proceedings of the International Symposium on Algorithms and Computation, Lecture Notes in Computer Science, vol. 1969, Springer, December 2000, 410–421.
5. L. Foschini, R. Grossi, A. Gupta, and J. S. Vitter, Indexing Equals Compression: Experiments on Suffix Arrays and Trees, ACM Transactions on Algorithms, 2(4), 2006, 611–639.
6. W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter, On Entropy-Compressed Text Indexing in External Memory, Proceedings of the Conference on String Processing and Information Retrieval, August 2009.
7. M. P. Ferguson, FEMTO: fast search of large sequence collections, Proceedings of the 23rd Annual Conference on Combinatorial Pattern Matching, July 2012.

External links

Implementations:
