Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

Longest common subsequence

From Wikipedia, the free encyclopedia
Algorithmic problem on pairs of sequences
Not to be confused withLongest common substring.
Comparison of two revisions of an example file, based on their longest common subsequence (black)

Alongest common subsequence (LCS) is the longestsubsequence common to all sequences in a set of sequences (often just two sequences). It differs from thelongest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classiccomputer science problem. Because it ispolynomial and has an efficient algorithm to solve it, it is employed tocompare data andmerge changes to files in programs such as thediff utility andrevision control systems such asGit. It has similar applications incomputational linguistics andbioinformatics.

For example, consider the sequences (ABCD) and (ACBAD). They have five length-2 common subsequences: (AB), (AC), (AD), (BD), and (CD); two length-3 common subsequences: (ABD) and (ACD); and no longer common subsequences. So (ABD) and (ACD) are their longest common subsequences.

Complexity

[edit]

For the general case of an arbitrary number of input sequences, the problem isNP-hard.[1] When the number of sequences is constant, the problem is solvable in polynomial time bydynamic programming.

GivenN{\displaystyle N} sequences of lengthsn1,...,nN{\displaystyle n_{1},...,n_{N}}, a naive search would test each of the2n1{\displaystyle 2^{n_{1}}} subsequences of the first sequence to determine whether they are also subsequences of the remaining sequences; each subsequence may be tested in time linear in the lengths of the remaining sequences, so the time for this algorithm would be

O(2n1i>1ni).{\displaystyle O\left(2^{n_{1}}\sum _{i>1}n_{i}\right).}

For the case of two sequences ofn andm elements, the running time of the dynamic programming approach isO(n ×m).[2] For an arbitrary number of input sequences, the dynamic programming approach gives a solution in

O(Ni=1Nni).{\displaystyle O\left(N\prod _{i=1}^{N}n_{i}\right).}

There exist methods with lower complexity,[3]which often depend on the length of the LCS, the size of the alphabet, or both.

The LCS is not necessarily unique; in the worst case, the number of common subsequences is exponential in the lengths of the inputs, so the algorithmic complexity of listingall common subsequences must be at least exponential.[4]

Solution for two sequences

[edit]

The LCS problem has anoptimal substructure: the problem can be broken down into smaller, simpler subproblems, which can, in turn, be broken down into simpler subproblems, and so on, until, finally, the solution becomes trivial. LCS in particular hasoverlapping subproblems: the solutions to high-level subproblems often reuse solutions to lower level subproblems. Problems with these two properties are amenable todynamic programming approaches, in which subproblem solutions arememoized, that is, the solutions of subproblems are saved for reuse.

Prefixes

[edit]

The prefixSn ofS is defined as the firstn characters ofS.[5] For example, the prefixes ofS = (AGCA) are

S0 = ()
S1 = (A)
S2 = (AG)
S3 = (AGC)
S4 = (AGCA).

LetLCS(X,Y) be a function that computes a longest subsequence common toX andY. Such a function has two interesting properties.

First property

[edit]

LCS(X^A,Y^A) =LCS(X,Y)^A, for all stringsX,Y and all symbolsA, where ^ denotes string concatenation. This allows one to simplify theLCS computation for two sequences ending in the same symbol.For example,LCS("BANANA","ATANA") =LCS("BANAN","ATAN")^"A", Continuing for the remaining common symbols,LCS("BANANA","ATANA") =LCS("BAN","AT")^"ANA".

Second property

[edit]

IfA andB are distinct symbols (AB), thenLCS(X^A,Y^B) is one of the maximal-length strings in the set {LCS(X^A,Y),LCS(X,Y^B) }, for all stringsX,Y.

For example,LCS("ABCDEFG","BCDGK") is the longest string amongLCS("ABCDEFG","BCDG") andLCS("ABCDEF","BCDGK"); if both happened to be of equal length, one of them could be chosen arbitrarily.

To realize the property, distinguish two cases:

  • IfLCS("ABCDEFG","BCDGK") ends with a "G", then the final "K" cannot be in the LCS, henceLCS("ABCDEFG","BCDGK") =LCS("ABCDEFG","BCDG").
  • IfLCS("ABCDEFG","BCDGK") does not end with a "G", then the final "G" cannot be in the LCS, henceLCS("ABCDEFG","BCDGK") =LCS("ABCDEF","BCDGK").

LCS function defined

[edit]

Let two sequences be defined as follows:X=(x1x2xm){\displaystyle X=(x_{1}x_{2}\cdots x_{m})} andY=(y1y2yn){\displaystyle Y=(y_{1}y_{2}\cdots y_{n})}. The prefixes ofX{\displaystyle X} areX0,X1,X2,,Xm{\displaystyle X_{0},X_{1},X_{2},\dots ,X_{m}}; the prefixes ofY{\displaystyle Y} areY0,Y1,Y2,,Yn{\displaystyle Y_{0},Y_{1},Y_{2},\dots ,Y_{n}}. LetLCS(Xi,Yj){\displaystyle {\mathit {LCS}}(X_{i},Y_{j})} represent the set of longest common subsequence of prefixesXi{\displaystyle X_{i}} andYj{\displaystyle Y_{j}}. This set of sequences is given by the following.

LCS(Xi,Yj)={ϵif i=0 or j=0LCS(Xi1,Yj1)^xiif i,j>0 and xi=yjmax{LCS(Xi,Yj1),LCS(Xi1,Yj)}if i,j>0 and xiyj.{\displaystyle {\mathit {LCS}}(X_{i},Y_{j})={\begin{cases}\epsilon &{\mbox{if }}i=0{\mbox{ or }}j=0\\{\mathit {LCS}}(X_{i-1},Y_{j-1}){\hat {}}x_{i}&{\mbox{if }}i,j>0{\mbox{ and }}x_{i}=y_{j}\\\operatorname {\max } \{{\mathit {LCS}}(X_{i},Y_{j-1}),{\mathit {LCS}}(X_{i-1},Y_{j})\}&{\mbox{if }}i,j>0{\mbox{ and }}x_{i}\neq y_{j}.\end{cases}}}

To find the LCS ofXi{\displaystyle X_{i}} andYj{\displaystyle Y_{j}}, comparexi{\displaystyle x_{i}} andyj{\displaystyle y_{j}}. If they are equal, then the sequenceLCS(Xi1,Yj1){\displaystyle {\mathit {LCS}}(X_{i-1},Y_{j-1})} is extended by that element,xi{\displaystyle x_{i}}. If they are not equal, then the longest among the two sequences,LCS(Xi,Yj1){\displaystyle {\mathit {LCS}}(X_{i},Y_{j-1})}, andLCS(Xi1,Yj){\displaystyle {\mathit {LCS}}(X_{i-1},Y_{j})}, is retained. (If they are the same length, but not identical, then both are retained.) The base case, when eitherXi{\displaystyle X_{i}} orYi{\displaystyle Y_{i}} is empty, is theempty string,ϵ{\displaystyle \epsilon }.

Worked example

[edit]

The longest subsequence common toR = (GAC), andC = (AGCAT) will be found. Because theLCS function uses a "zeroth" element, it is convenient to define zero prefixes that are empty for these sequences:R0 = ε; andC0 = ε. All the prefixes are placed in a table withC in the first row (making it acolumn header) andR in the first column (making it arow header).

LCS Strings
εAGCAT
εεεεεεε
Gε
Aε
Cε

This table is used to store the LCS sequence for each step of the calculation. The second column and second row have been filled in with ε, because when an empty sequence is compared with a non-empty sequence, the longest common subsequence is always an empty sequence.

LCS(R1,C1) is determined by comparing the first elements in each sequence. G and A are not the same, so this LCS gets (using the "second property") the longest of the two sequences,LCS(R1,C0) andLCS(R0,C1). According to the table, both of these are empty, soLCS(R1,C1) is also empty, as shown in the table below. The arrows indicate that the sequence comes from both the cell above,LCS(R0,C1) and the cell on the left,LCS(R1,C0).

LCS(R1,C2) is determined by comparing G and G. They match, so G is appended to the upper left sequence,LCS(R0,C1), which is (ε), giving (εG), which is (G).

ForLCS(R1,C3), G and C do not match. The sequence above is empty; the one to the left contains one element, G. Selecting the longest of these,LCS(R1,C3) is (G). The arrow points to the left, since that is the longest of the two sequences.

LCS(R1,C4), likewise, is (G).

LCS(R1,C5), likewise, is (G).

"G" Row Completed
εAGCAT
εεεεεεε
Gε  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}ε {\displaystyle {\overset {\nwarrow }{\ }}}(G) {\displaystyle {\overset {\ }{\leftarrow }}}(G) {\displaystyle {\overset {\ }{\leftarrow }}}(G) {\displaystyle {\overset {\ }{\leftarrow }}}(G)
Aε
Cε

ForLCS(R2,C1), A is compared with A. The two elements match, so A is appended to ε, giving (A).

ForLCS(R2,C2), A and G do not match, so the longest ofLCS(R1,C2), which is (G), andLCS(R2,C1), which is (A), is used. In this case, they each contain one element, so this LCS is given two subsequences: (A) and (G).

ForLCS(R2,C3), A does not match C.LCS(R2,C2) contains sequences (A) and (G); LCS(R1,C3) is (G), which is already contained inLCS(R2,C2). The result is thatLCS(R2,C3) also contains the two subsequences, (A) and (G).

ForLCS(R2,C4), A matches A, which is appended to the upper left cell, giving (GA).

ForLCS(R2,C5), A does not match T. Comparing the two sequences, (GA) and (G), the longest is (GA), soLCS(R2,C5) is (GA).

"G" & "A" Rows Completed
εAGCAT
εεεεεεε
Gε  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}ε {\displaystyle {\overset {\nwarrow }{\ }}}(G) {\displaystyle {\overset {\ }{\leftarrow }}}(G) {\displaystyle {\overset {\ }{\leftarrow }}}(G) {\displaystyle {\overset {\ }{\leftarrow }}}(G)
Aε {\displaystyle {\overset {\nwarrow }{\ }}}(A)  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G)  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G) {\displaystyle {\overset {\nwarrow }{\ }}}(GA) {\displaystyle {\overset {\ }{\leftarrow }}}(GA)
Cε

ForLCS(R3,C1), C and A do not match, soLCS(R3,C1) gets the longest of the two sequences, (A).

ForLCS(R3,C2), C and G do not match. BothLCS(R3,C1) andLCS(R2,C2) have one element. The result is thatLCS(R3,C2) contains the two subsequences, (A) and (G).

ForLCS(R3,C3), C and C match, so C is appended toLCS(R2,C2), which contains the two subsequences, (A) and (G), giving (AC) and (GC).

ForLCS(R3,C4), C and A do not match. CombiningLCS(R3,C3), which contains (AC) and (GC), andLCS(R2,C4), which contains (GA), gives a total of three sequences: (AC), (GC), and (GA).

Finally, forLCS(R3,C5), C and T do not match. The result is thatLCS(R3,C5) also contains the three sequences, (AC), (GC), and (GA).

Completed LCS Table
εAGCAT
εεεεεεε
Gε  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}ε {\displaystyle {\overset {\nwarrow }{\ }}}(G) {\displaystyle {\overset {\ }{\leftarrow }}}(G) {\displaystyle {\overset {\ }{\leftarrow }}}(G) {\displaystyle {\overset {\ }{\leftarrow }}}(G)
Aε {\displaystyle {\overset {\nwarrow }{\ }}}(A)  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G)  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G) {\displaystyle {\overset {\nwarrow }{\ }}}(GA) {\displaystyle {\overset {\ }{\leftarrow }}}(GA)
Cε  {\displaystyle {\overset {\ \uparrow }{\ }}}(A)  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(A) & (G) {\displaystyle {\overset {\nwarrow }{\ }}}(AC) & (GC)  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(AC) & (GC) & (GA)  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}(AC) & (GC) & (GA)

The final result is that the last cell contains all the longest subsequences common to (AGCAT) and (GAC); these are (AC), (GC), and (GA). The table also shows the longest common subsequences for every possible pair of prefixes. For example, for (AGC) and (GA), the longest common subsequence are (A) and (G).

Traceback approach

[edit]

Calculating the LCS of a row of the LCS table requires only the solutions to the current row and the previous row. Still, for long sequences, these sequences can get numerous and long, requiring a lot of storage space. Storage space can be saved by saving not the actual subsequences, but the length of the subsequence and the direction of the arrows, as in the table below.

Storing length, rather than sequences
εAGCAT
ε000000
G0  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}0 {\displaystyle {\overset {\nwarrow }{\ }}}1 {\displaystyle {\overset {\ }{\leftarrow }}}1 {\displaystyle {\overset {\ }{\leftarrow }}}1 {\displaystyle {\overset {\ }{\leftarrow }}}1
A0 {\displaystyle {\overset {\nwarrow }{\ }}}1  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 {\displaystyle {\overset {\nwarrow }{\ }}}2 {\displaystyle {\overset {\ }{\leftarrow }}}2
C0  {\displaystyle {\overset {\ \uparrow }{\ }}}1  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 {\displaystyle {\overset {\nwarrow }{\ }}}2  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}2  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}2

The actual subsequences are deduced in a "traceback" procedure that follows the arrows backwards, starting from the last cell in the table. When the length decreases, the sequences must have had a common element. Several paths are possible when two arrows are shown in a cell. Below is the table for such an analysis, with numbers colored in cells where the length is about to decrease. The bold numbers trace out the sequence, (GA).[6]

Traceback example
εAGCAT
ε000000
G0  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}0 {\displaystyle {\overset {\nwarrow }{\ }}}1 {\displaystyle {\overset {\ }{\leftarrow }}}1 {\displaystyle {\overset {\ }{\leftarrow }}}1 {\displaystyle {\overset {\ }{\leftarrow }}}1
A0 {\displaystyle {\overset {\nwarrow }{\ }}}1  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 {\displaystyle {\overset {\nwarrow }{\ }}}2 {\displaystyle {\overset {\ }{\leftarrow }}}2
C0  {\displaystyle {\overset {\ \uparrow }{\ }}}1  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}1 {\displaystyle {\overset {\nwarrow }{\ }}}2  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}2  {\displaystyle {\overset {\ \ \uparrow }{\leftarrow }}}2

Relation to other problems

[edit]

For two stringsX1m{\displaystyle X_{1\dots m}} andY1n{\displaystyle Y_{1\dots n}}, the length of theshortest common supersequence is related to the length of the LCS by[3]

|SCS(X,Y)|=n+m|LCS(X,Y)|.{\displaystyle \left|SCS(X,Y)\right|=n+m-\left|LCS(X,Y)\right|.}

Theedit distance when only insertion and deletion is allowed (no substitution), or when the cost of the substitution is the double of the cost of an insertion or deletion, is:

d(X,Y)=n+m2|LCS(X,Y)|.{\displaystyle d'(X,Y)=n+m-2\cdot \left|LCS(X,Y)\right|.}

Code for the dynamic programming solution

[edit]
icon
This sectiondoes notcite anysources. Please helpimprove this section byadding citations to reliable sources. Unsourced material may be challenged andremoved.(March 2013) (Learn how and when to remove this message)

Computing the length of the LCS

[edit]

The function below takes as input sequencesX[1..m] andY[1..n], computes the LCS betweenX[1..i] andY[1..j] for all1 ≤ i ≤ m and1 ≤ j ≤ n, and stores it inC[i,j].C[m,n] will contain the length of the LCS ofX andY.[7]

function LCSLength(X[1..m], Y[1..n])    C = array(0..m, 0..n)for i := 0..m        C[i,0] = 0for j := 0..n        C[0,j] = 0for i := 1..mfor j := 1..nif X[i] = Y[j]                C[i,j] := C[i-1,j-1] + 1else                C[i,j] := max(C[i,j-1], C[i-1,j])return C[m,n]

Alternatively,memoization could be used.

Reading out a LCS

[edit]

The following functionbacktracks the choices taken when computing theC table. If the last characters in the prefixes are equal, they must be in an LCS. If not, check what gave the largest LCS of keepingxi{\displaystyle x_{i}} andyj{\displaystyle y_{j}}, and make the same choice. Just choose one if they were equally long. Call the function withi=m andj=n.

function backtrack(C[0..m,0..n], X[1..m], Y[1..n], i, j)if i = 0or j = 0return ""if X[i] = Y[j]return backtrack(C, X, Y, i-1, j-1) + X[i]if C[i,j-1] > C[i-1,j]return backtrack(C, X, Y, i, j-1)return backtrack(C, X, Y, i-1, j)

Reading out all LCSs

[edit]

If choosingxi{\displaystyle x_{i}} andyj{\displaystyle y_{j}} would give an equally long result, read out both resulting subsequences. This is returned as a set by this function. Notice that this function is not polynomial, as it might branch in almost every step if the strings are similar.

function backtrackAll(C[0..m,0..n], X[1..m], Y[1..n], i, j)if i = 0or j = 0return {""}if X[i] = Y[j]return {Z + X[i]for all Zin backtrackAll(C, X, Y, i-1, j-1)}    R := {}if C[i,j-1] ≥ C[i-1,j]        R := backtrackAll(C, X, Y, i, j-1)if C[i-1,j] ≥ C[i,j-1]        R := R ∪ backtrackAll(C, X, Y, i-1, j)return R

Print the diff

[edit]

This function will backtrack through the C matrix, and print thediff between the two sequences. Notice that you will get a different answer if you exchange and<, with> and below.

function printDiff(C[0..m,0..n], X[1..m], Y[1..n], i, j)if i >= 0and j >= 0and X[i] = Y[j]        printDiff(C, X, Y, i-1, j-1)        print "  " + X[i]else if j > 0and (i = 0or C[i,j-1] ≥ C[i-1,j])        printDiff(C, X, Y, i, j-1)        print "+ " + Y[j]else if i > 0and (j = 0or C[i,j-1] < C[i-1,j])        printDiff(C, X, Y, i-1, j)        print "- " + X[i]else        print ""

Example

[edit]

LetX{\displaystyle X} be "XMJYAUZ" andY{\displaystyle Y} be "MZJAWXU". The longest common subsequence betweenX{\displaystyle X} andY{\displaystyle Y} is "MJAU". The tableC shown below, which is generated by the functionLCSLength, shows the lengths of the longest common subsequences between prefixes ofX{\displaystyle X} andY{\displaystyle Y}. Thei{\displaystyle i}th row andj{\displaystyle j}th column shows the length of the LCS betweenX1..i{\displaystyle X_{1..i}} andY1..j{\displaystyle Y_{1..j}}.

01234567
εMZJAWXU
0ε00000000
1X00000011
2M01111111
3J01122222
4Y01122222
5A01123333
6U01123334
7Z01223334

Thehighlighted numbers show the path the functionbacktrack would follow from the bottom right to the top left corner, when reading out an LCS. If the current symbols inX{\displaystyle X} andY{\displaystyle Y} are equal, they are part of the LCS, and we go both up and left (shown inbold). If not, we go up or left, depending on which cell has a higher number. This corresponds to either taking the LCS betweenX1..i1{\displaystyle X_{1..i-1}} andY1..j{\displaystyle Y_{1..j}}, orX1..i{\displaystyle X_{1..i}} andY1..j1{\displaystyle Y_{1..j-1}}.

Code optimization

[edit]

Several optimizations can be made to the algorithm above to speed it up for real-world cases.

Reduce the problem set

[edit]

The C matrix in the naive algorithmgrows quadratically with the lengths of the sequences. For two 100-item sequences, a 10,000-item matrix would be needed, and 10,000 comparisons would need to be done. In most real-world cases, especially source code diffs and patches, the beginnings and ends of files rarely change, and almost certainly not both at the same time. If only a few items have changed in the middle of the sequence, the beginning and end can be eliminated. This reduces not only the memory requirements for the matrix, but also the number of comparisons that must be done.

function LCS(X[1..m], Y[1..n])    start := 1    m_end := m    n_end := ntrim off the matching items at the beginningwhile start ≤ m_endand start ≤ n_endand X[start] = Y[start]        start := start + 1trim off the matching items at the endwhile start ≤ m_endand start ≤ n_endand X[m_end] = Y[n_end]        m_end := m_end - 1        n_end := n_end - 1    C = array(start-1..m_end, start-1..n_end)only loop over the items that have changedfor i := start..m_endfor j := start..n_endthe algorithm continues as before ...

In the best-case scenario, a sequence with no changes, this optimization would eliminate the need for the C matrix. In the worst-case scenario, a change to the very first and last items in the sequence, only two additional comparisons are performed.

Reduce the comparison time

[edit]

Most of the time taken by the naive algorithm is spent performing comparisons between items in the sequences. For textual sequences such as source code, you want to view lines as the sequence elements instead of single characters. This can mean comparisons of relatively long strings for each step in the algorithm. Two optimizations can be made that can help to reduce the time these comparisons consume.

Reduce strings to hashes

[edit]

Ahash function orchecksum can be used to reduce the size of the strings in the sequences. That is, for source code where the average line is 60 or more characters long, the hash or checksum for that line might be only 8 to 40 characters long. Additionally, the randomized nature of hashes and checksums would guarantee that comparisons would short-circuit faster, as lines of source code will rarely be changed at the beginning.

There are three primary drawbacks to this optimization. First, an amount of time needs to be spent beforehand to precompute the hashes for the two sequences. Second, additional memory needs to be allocated for the new hashed sequences. However, in comparison to the naive algorithm used here, both of these drawbacks are relatively minimal.

The third drawback is that ofcollisions. Since the checksum or hash is not guaranteed to be unique, there is a small chance that two different items could be reduced to the same hash. This is unlikely in source code, but it is possible. A cryptographic hash would therefore be far better suited for this optimization, as its entropy is going to be significantly greater than that of a simple checksum. However, the benefits may not be worth the setup and computational requirements of a cryptographic hash for small sequence lengths.

Reduce the required space

[edit]

If only the length of the LCS is required, the matrix can be reduced to a2×min(n,m){\displaystyle 2\times \min(n,m)} matrix, or to amin(m,n)+1{\displaystyle \min(m,n)+1} vector as the dynamic programming approach requires only the current and previous columns of the matrix.Hirschberg's algorithm allows the construction of the optimal sequence itself in the same quadratic time and linear space bounds.[8]

Reduce cache misses

[edit]

Chowdhury and Ramachandran devised a quadratic-time linear-space algorithm[9][10] for finding the LCS length along with an optimal sequence which runs faster than Hirschberg's algorithm in practice due to its superior cache performance.[9] The algorithm has an asymptotically optimal cache complexity under theIdeal cache model.[11] Interestingly, the algorithm itself iscache-oblivious[11] meaning that it does not make any choices based on the cache parameters (e.g., cache size and cache line size) of the machine.

Further optimized algorithms

[edit]

Several algorithms exist that run faster than the presented dynamic programming approach. One of them isHunt–Szymanski algorithm, which typically runs inO((n+r)log(n)){\displaystyle O((n+r)\log(n))} time (forn>m{\displaystyle n>m}), wherer{\displaystyle r} is the number of matches between the two sequences.[12] For problems with a bounded alphabet size, theMethod of Four Russians can be used to reduce the running time of the dynamic programming algorithm by a logarithmic factor.[13]

Behavior on random strings

[edit]
Main article:Chvátal–Sankoff constants

Beginning withChvátal & Sankoff (1975),[14] a number of researchers have investigated the behavior of the longest common subsequence length when the two given strings are drawn randomly from the same alphabet. When the alphabet size is constant, the expected length of the LCS is proportional to the length of the two strings, and the constants of proportionality (depending on alphabet size) are known as theChvátal–Sankoff constants. Their exact values are not known, but upper and lower bounds on their values have been proven,[15] and it is known that they grow inversely proportionally to the square root of the alphabet size.[16] Simplified mathematical models of the longest common subsequence problem have been shown to be controlled by theTracy–Widom distribution.[17]

Computing the longest palindromic subsequence of a string

[edit]

For decades, it had been considered folklore that the longest palindromic subsequence of a string could be computed by finding the longest common subsequence between the string and its reversal, using the classical dynamic programming approach introduced by Wagner and Fischer. However, a formal proof of the correctness of this method was only established in 2024 by Brodal, Fagerberg, and Moldrup Rysgaard.[18]

See also

[edit]

References

[edit]
  1. ^David Maier (1978)."The Complexity of Some Problems on Subsequences and Supersequences".J. ACM.25 (2). ACM Press:322–336.doi:10.1145/322063.322075.S2CID 16120634.
  2. ^Wagner, Robert; Fischer, Michael (January 1974). "The string-to-string correction problem".Journal of the ACM.21 (1):168–173.CiteSeerX 10.1.1.367.5281.doi:10.1145/321796.321811.S2CID 13381535.
  3. ^abL. Bergroth and H. Hakonen and T. Raita (7–29 September 2000).A survey of longest common subsequence algorithms. Proceedings Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000. A Curuna, Spain: IEEE Computer Society. pp. 39–48.doi:10.1109/SPIRE.2000.878178.ISBN 0-7695-0746-8.S2CID 10375334.
  4. ^Ronald I. Greenberg (2003-08-06). "Bounds on the Number of Longest Common Subsequences".arXiv:cs.DM/0301030.
  5. ^Xia, Xuhua (2007).Bioinformatics and the Cell: Modern Computational Approaches in Genomics, Proteomics and Transcriptomics. New York: Springer. p. 24.ISBN 978-0-387-71336-6.
  6. ^Thomas H. Cormen,Charles E. Leiserson,Ronald L. Rivest andClifford Stein (2001). "15.4".Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 350–355.ISBN 0-262-53196-8.{{cite book}}: CS1 maint: multiple names: authors list (link)
  7. ^Cormen, Thomas H.;Leiserson, Charles E.;Rivest, Ronald L.;Stein, Clifford (2009) [1990]. "Dynamic Programming".Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. p. 394.ISBN 0-262-03384-4.
  8. ^Hirschberg, D. S. (1975)."A linear space algorithm for computing maximal common subsequences".Communications of the ACM.18 (6):341–343.doi:10.1145/360825.360861.S2CID 207694727.
  9. ^abChowdhury, Rezaul; Ramachandran, Vijaya (January 2006)."Cache-oblivious dynamic programming".Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm - SODA '06. pp. 591–600.doi:10.1145/1109557.1109622.ISBN 0-89871-605-5.S2CID 9650418.
  10. ^Chowdhury, Rezaul; Le, Hai-Son; Ramachandran, Vijaya (July 2010). "Cache-oblivious dynamic programming for bioinformatics".IEEE/ACM Transactions on Computational Biology and Bioinformatics.7 (3):495–510.Bibcode:2010ITCBB...7..495C.doi:10.1109/TCBB.2008.94.PMID 20671320.S2CID 2532039.
  11. ^abFrigo, Matteo; Leiserson, Charles E.; Prokop, Harald; Ramachandran, Sridhar (January 2012)."Cache-oblivious algorithms".ACM Transactions on Algorithms.8 (1):1–22.doi:10.1145/2071379.2071383.
  12. ^Apostolico, Alberto; Galil, Zvi (1997-05-29).Pattern Matching Algorithms. Oxford University Press.ISBN 978-0-19-535434-8.
  13. ^Masek, William J.;Paterson, Michael S. (1980), "A faster algorithm computing string edit distances",Journal of Computer and System Sciences,20 (1):18–31,doi:10.1016/0022-0000(80)90002-1,hdl:1721.1/148933,MR 0566639.
  14. ^Chvátal, Václáv;Sankoff, David (1975), "Longest common subsequences of two random sequences",Journal of Applied Probability,12 (2):306–315,doi:10.2307/3212444,JSTOR 3212444,MR 0405531,S2CID 250345191.
  15. ^Lueker, George S. (2009), "Improved bounds on the average length of longest common subsequences",Journal of the ACM,56 (3), A17,doi:10.1145/1516512.1516519,MR 2536132,S2CID 7232681.
  16. ^Kiwi, Marcos; Loebl, Martin;Matoušek, Jiří (2005), "Expected length of the longest common subsequence for large alphabets",Advances in Mathematics,197 (2):480–498,arXiv:math/0308234,doi:10.1016/j.aim.2004.10.012,MR 2173842.
  17. ^Majumdar, Satya N.; Nechaev, Sergei (2005), "Exact asymptotic results for the Bernoulli matching model of sequence alignment",Physical Review E,72 (2): 020901, 4,arXiv:q-bio/0410012,Bibcode:2005PhRvE..72b0901M,doi:10.1103/PhysRevE.72.020901,MR 2177365,PMID 16196539,S2CID 11390762.
  18. ^Brodal, G. S.; Fagerberg, R.; Rysgaard, C. M. (2024).On Finding Longest Palindromic Subsequences Using Longest Common Subsequences. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. pp. 35:1–35:16.doi:10.4230/lipics.esa.2024.35.

External links

[edit]
The WikibookAlgorithm implementation has a page on the topic of:Longest common subsequence


String metric
String-searching algorithm
Multiple string searching
Regular expression
Sequence alignment
Data structure
Other
Retrieved from "https://en.wikipedia.org/w/index.php?title=Longest_common_subsequence&oldid=1338335273"
Categories:
Hidden categories:

[8]ページ先頭

©2009-2026 Movatter.jp