Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

Hirschberg's algorithm

From Wikipedia, the free encyclopedia
Algorithm for aligning two sequences

Incomputer science,Hirschberg's algorithm, named after its inventor,Dan Hirschberg, is adynamic programmingalgorithm that finds the optimalsequence alignment between twostrings. Optimality is measured with theLevenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of theNeedleman–Wunsch algorithm that usesdynamic programming.[1] Hirschberg's algorithm is commonly used incomputational biology to find maximal global alignments ofDNA andprotein sequences.

Algorithm information

[edit]

Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment.BLAST andFASTA are suboptimalheuristics. IfX{\displaystyle X} andY{\displaystyle Y} are strings, wherelength(X)=n{\displaystyle \operatorname {length} (X)=n} andlength(Y)=m{\displaystyle \operatorname {length} (Y)=m}, theNeedleman–Wunsch algorithm finds an optimal alignment inO(nm){\displaystyle O(nm)} time, usingO(nm){\displaystyle O(nm)} space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch Algorithm, which still takesO(nm){\displaystyle O(nm)} time, but needs onlyO(min{n,m}){\displaystyle O(\min\{n,m\})} space and is much faster in practice.[2]One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate thelongest common subsequence between two sets of data such as with the commondiff tool.

The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that:[3]

  1. one can compute the optimal alignment score by only storing the current and previous row of the Needleman–Wunsch score matrix;
  2. if(Z,W)=NW(X,Y){\displaystyle (Z,W)=\operatorname {NW} (X,Y)} is the optimal alignment of(X,Y){\displaystyle (X,Y)}, andX=Xl+Xr{\displaystyle X=X^{l}+X^{r}} is an arbitrary partition ofX{\displaystyle X}, there exists a partitionYl+Yr{\displaystyle Y^{l}+Y^{r}} ofY{\displaystyle Y} such thatNW(X,Y)=NW(Xl,Yl)+NW(Xr,Yr){\displaystyle \operatorname {NW} (X,Y)=\operatorname {NW} (X^{l},Y^{l})+\operatorname {NW} (X^{r},Y^{r})}.

Algorithm description

[edit]

Xi{\displaystyle X_{i}} denotes thei-th character ofX{\displaystyle X}, where1ilength(X){\displaystyle 1\leqslant i\leqslant \operatorname {length} (X)}.Xi:j{\displaystyle X_{i:j}} denotes a substring of sizeji+1{\displaystyle j-i+1}, ranging from thei-th to thej-th character ofX{\displaystyle X}.rev(X){\displaystyle \operatorname {rev} (X)} is the reversed version ofX{\displaystyle X}.

X{\displaystyle X} andY{\displaystyle Y} are sequences to be aligned. Letx{\displaystyle x} be a character fromX{\displaystyle X}, andy{\displaystyle y} be a character fromY{\displaystyle Y}. We assume thatDel(x){\displaystyle \operatorname {Del} (x)},Ins(y){\displaystyle \operatorname {Ins} (y)} andSub(x,y){\displaystyle \operatorname {Sub} (x,y)} are well defined integer-valued functions. These functions represent the cost of deletingx{\displaystyle x}, insertingy{\displaystyle y}, and replacingx{\displaystyle x} withy{\displaystyle y}, respectively.

We defineNWScore(X,Y){\displaystyle \operatorname {NWScore} (X,Y)}, which returns the last line of the Needleman–Wunsch score matrixScore(i,j){\displaystyle \mathrm {Score} (i,j)}:

function NWScore(X, Y)    Score(0, 0) = 0 // 2 * (length(Y) + 1) arrayfor j = 1to length(Y)        Score(0, j) = Score(0, j - 1) + Ins(Yj)for i = 1to length(X) // Init array        Score(1, 0) = Score(0, 0) + Del(Xi)for j = 1to length(Y)            scoreSub = Score(0, j - 1) + Sub(Xi, Yj)            scoreDel = Score(0, j) + Del(Xi)            scoreIns = Score(1, j - 1) + Ins(Yj)            Score(1, j) = max(scoreSub, scoreDel, scoreIns)end        // Copy Score[1] to Score[0]        Score(0, :) = Score(1, :)endfor j = 0to length(Y)        LastLine(j) = Score(1, j)return LastLine

Note that at any point,NWScore{\displaystyle \operatorname {NWScore} } only requires the two most recent rows of the score matrix. Thus,NWScore{\displaystyle \operatorname {NWScore} } is implemented inO(min{length(X),length(Y)}){\displaystyle O(\min\{\operatorname {length} (X),\operatorname {length} (Y)\})} space.

The Hirschberg algorithm follows:

function Hirschberg(X, Y)    Z = ""    W = ""if length(X) == 0for i = 1to length(Y)            Z = Z + '-'            W = W + Yiendelse if length(Y) == 0for i = 1to length(X)            Z = Z + Xi            W = W + '-'endelse if length(X) == 1or length(Y) == 1        (Z, W) = NeedlemanWunsch(X, Y)else        xlen = length(X)        xmid = length(X) / 2        ylen = length(Y)        ScoreL = NWScore(X1:xmid, Y)        ScoreR = NWScore(rev(Xxmid+1:xlen), rev(Y))        ymid =arg max ScoreL + rev(ScoreR)        (Z,W) = Hirschberg(X1:xmid, y1:ymid) + Hirschberg(Xxmid+1:xlen, Yymid+1:ylen)endreturn (Z, W)

In the context of observation (2), assume thatXl+Xr{\displaystyle X^{l}+X^{r}} is a partition ofX{\displaystyle X}. Indexymid{\displaystyle \mathrm {ymid} } is computed such thatYl=Y1:ymid{\displaystyle Y^{l}=Y_{1:\mathrm {ymid} }} andYr=Yymid+1:length(Y){\displaystyle Y^{r}=Y_{\mathrm {ymid} +1:\operatorname {length} (Y)}}.

Example

[edit]

Let

X=AGTACGCA,Y=TATGC,Del(x)=2,Ins(y)=2,Sub(x,y)={+2,if x=y1,if xy.{\displaystyle {\begin{aligned}X&={\text{AGTACGCA}},\\Y&={\text{TATGC}},\\\operatorname {Del} (x)&=-2,\\\operatorname {Ins} (y)&=-2,\\\operatorname {Sub} (x,y)&={\begin{cases}+2,&{\text{if }}x=y\\-1,&{\text{if }}x\neq y.\end{cases}}\end{aligned}}}

The optimal alignment is given by

 W = AGTACGCA Z = --TATGC-

Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:

T   A   T   G   C0  -2  -4  -6  -8 -10A-2  -1   0  -2  -4  -6G-4  -3  -2  -1   0  -2T  -6-2  -4   0  -2  -1A  -8  -40  -2  -1  -3C -10  -6  -2-1  -3   1G -12  -8  -4  -31  -1C -14 -10  -6  -5  -13A -16 -12  -8  -7  -31

One starts with the top level call toHirschberg(AGTACGCA,TATGC){\displaystyle \operatorname {Hirschberg} ({\text{AGTACGCA}},{\text{TATGC}})}, which splits the first argument in half:X=AGTA+CGCA{\displaystyle X={\text{AGTA}}+{\text{CGCA}}}. The call toNWScore(AGTA,Y){\displaystyle \operatorname {NWScore} ({\text{AGTA}},Y)} produces the following matrix:

T   A   T   G   C    0  -2  -4  -6  -8 -10A -2  -1   0  -2  -4  -6G -4  -3  -2  -1   0  -2T -6  -2  -4   0  -2  -1A -8  -4   0  -2  -1  -3

Likewise,NWScore(rev(CGCA),rev(Y)){\displaystyle \operatorname {NWScore} (\operatorname {rev} ({\text{CGCA}}),\operatorname {rev} (Y))} generates the following matrix:

C   G   T   A   T    0 -2  -4  -6  -8 -10A -2 -1  -3  -5  -4  -6C -4  0  -2  -4  -6  -5G -6 -2   2   0  -2  -4C -8 -4   0   1  -1  -3

Their last lines (after reversing the latter) and sum of those are respectively

 ScoreL      = [ -8 -4  0 -2 -1 -3 ] rev(ScoreR) = [ -3 -1  1  0 -4 -8 ] Sum         = [-11 -51 -2 -5 -11]

The maximum (shown in bold) appears atymid = 2, producing the partitionY=TA+TGC{\displaystyle Y={\text{TA}}+{\text{TGC}}}.

The entire Hirschberg recursion (which we omit for brevity) produces the following tree:

               (AGTACGCA,TATGC)               /               \        (AGTA,TA)             (CGCA,TGC)         /     \              /        \     (AG, )   (TA,TA)      (CG,TG)     (CA,C)              /   \        /   \                  (T,T) (A,A)  (C,T) (G,G)

The leaves of the tree contain the optimal alignment.

See also

[edit]

References

[edit]
  1. ^Hirschberg's algorithm.
  2. ^"The Algorithm".
  3. ^Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences".Communications of the ACM.18 (6):341–343.CiteSeerX 10.1.1.348.4774.doi:10.1145/360825.360861.MR 0375829.S2CID 207694727.
String metric
String-searching algorithm
Multiple string searching
Regular expression
Sequence alignment
Data structure
Other
Retrieved from "https://en.wikipedia.org/w/index.php?title=Hirschberg%27s_algorithm&oldid=1286379838"
Categories:
Hidden categories:

[8]ページ先頭

©2009-2026 Movatter.jp