Movatterモバイル変換


[0]ホーム

URL:


CRAN statusCRAN downloads

edlibR

edlibR: R integration foredlib

The R packageedlibR provides bindings toedlib, a performant C/C++library for calculating the exact pairwise sequence alignment using theedit distance(Levenshteindistance). The package is modeled after the API of thePython package edlib.

The original Python bindings to the C/C++ library using Cython can befoundhere.Much of the background information below is directly from theoriginalREADME for edlib.

Main Features

Usage

For details and examples of the edlibR functionality, see thevignettes: *HTMLversion *Markdownversion

There are three functions within edlibR:

align()

align(query, target, [mode], [task], [k], [cigarFormat], [additionalEqualities])

getNiceAlignment()

getNiceAlignment(alignResult, query, target, [gapSymbol])

nice_print()

nice_print(niceAlignment)

Background

Alignment Methods

Edlib supports three alignment methods: *global(NW) - This is the standard method, when we say “edit distance”this is the method that is assumed. (Note that ‘NW’ stands for‘Needleman-Wunsch’.) It tells us the smallest number of operationsneeded to transform the first sequence into the second sequence.This method is appropriate when you want to find out how similar thefirst sequence is to the second sequence. *prefix(SHW) - Similar to the global method, but with a small twist -the gap at the query end is not penalized. What this means is thatdeleting elements from the end of the second sequence is “free”! (Notethat ‘HW’ stands for ‘Hybrid Wunsch’.) For example, if we had query asAACT and target asAACTGGC, the edit distancewould be 0, because removingGGC from the end of the secondsequence is “free” and does not count into the total edit distance.This method is appropriate when you want to find out how well thefirst sequence fits at the beginning of the second sequence. *infix (HW): Similar as prefix method, but with one moretwist - gaps at query endand start are not penalized.What this means is that deleting elements from both the start and theend of the second sequence is “free”! (Note that ‘SHW’ stands for‘Semi-Hybrid Wunsch’.) For example, if we had query asACTand target asCGACTGAC, the edit distance would be 0,because removingCG from the start andGACfrom the end of the second sequence is “free” and does not count intothe total edit distance.This method is appropriate when you want tofind out how well the first sequence fits at any part of the secondsequence. For example, if your second sequence was a long text andyour first sequence was a sentence from that text, but slightlyscrambled, you could use this method to discover how scrambled it is andwhere it fits in that text.In bioinformatics, this method isappropriate for aligning a read to a sequence.

Algorithmic Design

Edlib is based on Myers’s bit-vector algorithm[1]and extends from it.

It calculates a dynamic programming matrix of dimensionsQ x T, whereQ is the length of the firstsequence (query), andT is the length of the secondsequence (target). It uses Ukkonen’s banded algorithm[2] to reduce the space of search, and there is alsoparallelization from Myers’s algorithm, however time complexity is stillquadratic.

Edlib uses Hirschberg’s algorithm[3] to find thealignment path, therefore space complexity is linear.

More details can be found within the edlib publication[4].

Installation

To install the stable version fromCRAN, use:

install.packages('edlibR')

To install the latest version, use:

install.packages('devtools')devtools::install_github('evanbiederstedt/edlibR')

References

R package citation

The R package can be cited as follows:

Martin Šošić and Evan Biederstedt (2022). edlibR: R wrapper to edlib,the C/C++ library for fast exact sequence alignment using edit(Levenshtein) distance. R package version 1.0.2.https://github.com/evanbiederstedt/edlibR

edlib publication

Please also cite the original edlib publication, which can be foundhereand should be cited as follows:

Martin Šošić, Mile Šikić (2017). Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics, Volume 33, Issue 9, 1 May 2017, Pages 1394–1395, https://doi.org/10.1093/bioinformatics/btw753

README references

[1] Myers G. (1999) A fast bit-vector algorithm forapproximate string matching based on dynamic programming. J. ACM, 46,395–415. https://doi.org/10.1145/316542.316550

[2] Ukkonen E. (1985) Algorithms for approximate stringmatching. Inform. Control, 64, 100–118.https://doi.org/10.1016/S0019-9958(85)80046-2

[3] Hirschberg D.S. (1975) A linear space algorithm forcomputing maximal common subsequences. Commun. ACM, 18, 341–343.https://doi.org/10.1145/360825.360861

[4] Martin Šošić, Mile Šikić. Edlib: a C/C++ libraryfor fast, exact sequence alignment using edit distance, Bioinformatics,Volume 33, Issue 9, 1 May 2017, Pages 1394–1395,https://doi.org/10.1093/bioinformatics/btw753


[8]ページ先頭

©2009-2025 Movatter.jp