Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

FM-index

From Wikipedia, the free encyclopedia
Compressed full-text substring index

Incomputer science, anFM-index is a compressed full-textsubstring index based on theBurrows–Wheeler transform, with some similarities to thesuffix array. It was created by Paolo Ferragina and Giovanni Manzini,[1] who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for Full-text index in Minute space.[2]

It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. The query time, as well as the requiredstorage space, has asublinear complexity with respect to the size of the input data.

The original authors have devised improvements to their original approach and dubbed it "FM-Index version 2".[3] A further improvement, the alphabet-friendly FM-index, combines the use of compression boosting andwavelet trees[4] to significantly reduce the space usage for large alphabets.

The FM-index has found use in, among other places,bioinformatics.[5]

Background

[edit]

Using an index is a common strategy to efficiently search a large body of text. When the text is larger than what reasonably fits within a computer's main memory, there is a need to compress not only the text but also the index. When the FM-index was introduced, there were several suggested solutions that were based on traditional compression methods and tried to solve the compressed matching problem. In contrast, the FM-index is a compressed self-index, which means that it compresses the data and indexes it at the same time.

FM-index data structure

[edit]

An FM-index is created by first taking theBurrows–Wheeler transform (BWT) of the input text. For example, the BWT of the stringT ="abracadabra$" is "ard$rcaaaabb", and here it is represented by the matrixM where each row is a rotation of the text, and the rows have been sorted lexicographically. The transform corresponds to the concatenation of the characters from the last column (labeledL).

IFL
1$abracadabra
2a$abracadabr
3abra$abracad
4abracadabra$
5acadabra$abr
6adabra$abrac
7bra$abracada
8bracadabra$a
9cadabra$abra
10dabra$abraca
11ra$abracadab
12racadabra$ab

The BWT in itself allows for some compression with, for instance,move to front andHuffman encoding, but the transform has even more uses. The rows in the matrix are essentially the sorted suffixes of the text and the first column F of the matrix shares similarities withsuffix arrays. How the suffix array relates to the BWT lies at the heart of the FM-index.

C[c] of "ard$rcaaaabb"
c$abcdr
C[c]0168910
Occ(c, k) of "ard$rcaaaabb"
ard$rcaaaabb
123456789101112
$000111111111
a111111234555
b000000000012
c000001111111
d001111111111
r011122222222

It is possible to make a last-to-first column mappingLF(i) from an indexi to an indexj, such thatF[j] =L[i], with the help of a tableC[c] and a functionOcc(c, k).

  • C[c] is a table that, for each characterc in the alphabet, contains the number of occurrences of lexically (strictly) smaller characters in the text.
  • The functionOcc(c, k) is the number of occurrences of characterc in the prefixL[1..k]. Ferragina and Manzini showed[1] that it is possible to computeOcc(c, k) in constant time.

The last-to-first mapping can now be defined asLF(i) = C[L[i]] + Occ(L[i], i). For instance, on row 9,L isa and the samea can be found on row 5 in the first columnF, soLF(9) should be 5 andLF(9) = C[a] + Occ(a, 9) = 5. For any rowi of the matrix, the character in the last columnL[i] precedes the character in the first columnF[i] also in T. Finally, ifL[i] = T[k], thenL[LF(i)] = T[k - 1], and using the equality it is possible to extract a string ofT fromL.

The FM-index itself is a compression of the stringL together withC andOcc in some form, as well as information that maps a selection of indices inL to positions in the original stringT.

Count

[edit]

The operationcount takes a patternP[1..p] and returns the number of occurrences of that pattern in the original textT. Since the rows of matrixM are sorted, and it contains every suffix ofT, the occurrences of patternP will be next to each other in a single continuous range. The operation iterates backwards over the pattern. For every character in the pattern, the range that has the character as a suffix is found. For example, the count of the pattern "bra" in "abracadabra" follows these steps:

  1. The first character we look for isa, the last character in the pattern. The initial range is set to[C[a] + 1 .. C[a+1]] = [2..6]. This range overL represents every character ofT that has a suffix beginning witha.
  2. The next character to look for isr. The new range is[C[r] + Occ(r, start-1) + 1 .. C[r] + Occ(r, end)] =[10 + 0 + 1 .. 10 + 2] =[11..12], ifstart is the index of the beginning of the range andend is the end. This range overL is all the characters ofT that have suffixes beginning withra.
  3. The last character to look at isb. The new range is[C[b] + Occ(b, start-1) + 1 .. C[b] + Occ(b, end)] =[6 + 0 + 1 .. 6 + 2] =[7..8]. This range overL is all the characters that have a suffix that begins withbra. Now that the whole pattern has been processed, the count is the same as the size of the range:8 - 7 + 1 = 2.

If the range becomes empty or the range boundaries cross each other before the whole pattern has been looked up, the pattern does not occur inT. BecauseOcc(c, k) can be performed in constant time, count can complete in linear time in the length of the pattern:O(p) time.

Locate

[edit]

The operationlocate takes as input an index of a character inL and returns its positioni inT. For instancelocate(7) = 8. To locate every occurrence of a pattern, first the range of character is found whose suffix is the pattern in the same way thecount operation found the range. Then the position of every character in the range can be located.

To map an index inL to one inT, a subset of the indices inL are associated with a position inT. IfL[j] has a position associated with it,locate(j) is trivial. If it's not associated, the string is followed withLF(i) until an associated index is found. By associating a suitable number of indices, an upper bound can be found.Locate can be implemented to findocc occurrences of a patternP[1..p] in a textT[1..u] inO(p +occ logεu) time withO(Hk(T)+loglogulogϵu){\displaystyle O\left(H_{k}(T)+{{\log \log u} \over {\log ^{\epsilon }u}}\right)} bits per input symbol for anyk ≥ 0.[1]

Applications

[edit]

DNA read mapping

[edit]

FM index with backtracking has been successfully (>2000 citations) applied to approximate string matching/sequence alignment, See Bowtiehttps://bowtie-bio.sourceforge.net/index.shtml

Implementations

[edit]

The FM-Index is available in multiple languages, includingC++,[6]Java[7] andRust.[8] Despite all of them being an implementation of the FM-Index, some feature different data structures for compressing the input text, e.g. with the Rust version being available as both a standard FM-Index and a run-length encoded[9] FM-Index or the Java version using fixed block boosting compression[10] wavelet trees.

See also

[edit]

References

[edit]
  1. ^abcPaolo Ferragina and Giovanni Manzini (2000)."Opportunistic Data Structures with Applications".[permanent dead link] Proceedings of the 41st Annual Symposium on Foundations of Computer Science. p.390.
  2. ^Paolo Ferragina and Giovanni Manzini (2005)."Indexing Compressed Text". Journal of the ACM, 52, 4 (Jul. 2005). p. 553
  3. ^Ferragina, Paolo; Venturini, Rossano (September 2005)."Fm-index version 2".www.di.unipi.it. Dipartimento di Informatica, University of Pisa, Italy. Retrieved2018-08-15.
  4. ^P. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. An Alphabet-Friendly FM-index.In Proc. SPIRE'04, pages 150-160. LNCS 3246.
  5. ^Simpson, Jared T.; Durbin, Richard (2010-06-15)."Efficient construction of an assembly string graph using the FM-index".Bioinformatics.26 (12):i367–i373.doi:10.1093/bioinformatics/btq217.ISSN 1367-4803.PMC 2881401.PMID 20529929.
  6. ^"FM-Index in C++". Retrieved25 September 2025.
  7. ^"FM-Index in Java". Retrieved25 September 2025.
  8. ^"FM-Index in Rust". Retrieved25 September 2025.
  9. ^Mäkinen V, Navarro G (2004)."Run-length FM-index"(PDF).Proc. DIMACS Workshop: "The Burrows-Wheeler Transform: Ten Years Later":17–19. Retrieved2025-09-25.
  10. ^Kärkkäinen J, Puglisi SJ (2011)."Fixed block compression boosting in FM-indexes".International Symposium on String Processing and Information Retrieval:174–184.doi:10.1007/978-3-642-24583-1_18. Retrieved2025-09-25.
Lossless
type
Entropy
Dictionary
Other
Hybrid
Lossy
type
Transform
Predictive
Audio
Concepts
Codec
parts
Image
Concepts
Methods
Video
Concepts
Codec
parts
Theory
Community
People
Retrieved from "https://en.wikipedia.org/w/index.php?title=FM-index&oldid=1335096884"
Categories:
Hidden categories:

[8]ページ先頭

©2009-2026 Movatter.jp