Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings
NotificationsYou must be signed in to change notification settings

traversc/seqtrie

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

94 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Basic usage

results<- dist_search(x,y,max_distance=2,nthreads=1)

The above code will find all similar sequences/strings betweenx andy. This will generally be significantly faster than calculatingpairwise distance or pairwise alignment.

Background

seqtrie is a collection of Radix Tree algorithms. These include someclassic algorithms like prefix lookup and more bioinformatics focusedalgorithms such as alignment algorithms and sequence distancecalculations (Hamming and Levenshtein distances).

A trie (aka Prefix Tree) is a data structure used in many applications.It is a space efficient data structure where a collection of sequencesare stored in a tree structure, where each leaf represents one sequence,and each node holds one character representing a shared prefix of allsequences descending from it. A Radix Tree (aka Compact Prefix Tree) isan improvement on a trie, using less memory and being generally faster.In a Radix Tree, each node is able to represent multiple charactersinstead of just one.

Tries and Radix Trees have similar complexity to a hashmap. Storing asequence within the tree or looking it up are O(k) where k is the lengthof the sequence. The main advantage of a trie is that it can be used toquickly find similar sequences by various algorithms and metrics,whereas a hashmap does not contain any sequence similarity information.

See also:https://en.wikipedia.org/wiki/Radix_tree

Inseqtrie, there are twoR6 classes:

RadixTree is the primary class in this package. There are three mainmethods. The$insert() method is used to store sequences on the tree,$erase() for erasing sequences from the tree and$search() forfinding similar sequences stored on the tree.

The secondR6 class isRadixForest, a derivative data structurewhere separate trees are constructed for each sequence length. This datastructure has advantages and disadvantages, discussed later.

Finally, there’s a simple convenience functiondist_search to findsimilar sequences using aRadixTree orRadixForest object. This is awrapper around the$new,$insert and$search() methods.

Install

install.packages("seqtrie")

More examples

Below is a simple example where we insert some sequences (strings),erase one and then plot out the tree.

tree<-RadixTree$new()tree$insert(c("cargo","cart","carburetor","carbuncle","bar","zebra"))tree$erase("zebra")# tree$graph requires igraph packageset.seed(1);tree$graph()

Levenshtein “edit distance” search

Below is an example using COVID19 T-cell data from AdaptiveBiotechnologies. (doi: 10.21203/rs.3.rs-51964/v1. This data is licensedunder CC 4.0.)

Here, we find highly similar sequences within a fixed edit distance. Forthe purpose of this vignette, we sample a small selection of sequences.

On the full dataset, if you tried to calculate an edit distance matrix,it would take a pretty long time, not to mention requiring a lot ofmemory. A trie-based method could be used to find similar sequences in afraction of the time. Approximate running times on the full datasetusing 8 threads are listed in the comments.

# 130,000 "CDR3" sequencesset.seed(1)data(covid_cdr3)covid_cdr3<- sample(covid_cdr3,1000)tree<-RadixTree$new()tree$insert(covid_cdr3)# Full data: 1 minresults<-tree$search(covid_cdr3,max_distance=2,mode="levenshtein",nthreads=2)# Alternatively, instead of using the RadixTree object directly, you can use the# dist_search function, which is a wrapper around the RadixTree object.results<- dist_search(covid_cdr3,covid_cdr3,max_distance=2)# The output is a data.frame mapping query (search sequences)# and target (sequences inserted into the tree).dplyr::filter(results,query!=target)
##                                           query## 1    TGTGCCAGCAGTCACGGAACTAGCACAGATACGCAGTATTTT## 2          TGCAGCGTTGATCTGCCGGGAGAGACCCAGTACTTC## 3 TGTGCCAGTACTATGGGACAGGGGATGAACACTGAAGCTTTCTTT## 4 TGTGCCAGTAGTATGGGACAGGGAATGAACACTGAAGCTTTCTTT## 5    TGTGCCAGCAGTGACAGAACTAGCACAGATACGCAGTATTTT## 6          TGCAGCGTTGATCTGACGGGAGAGACCCAGTACTTC##                                          target distance## 1    TGTGCCAGCAGTGACAGAACTAGCACAGATACGCAGTATTTT        2## 2          TGCAGCGTTGATCTGACGGGAGAGACCCAGTACTTC        1## 3 TGTGCCAGTAGTATGGGACAGGGAATGAACACTGAAGCTTTCTTT        2## 4 TGTGCCAGTACTATGGGACAGGGGATGAACACTGAAGCTTTCTTT        2## 5    TGTGCCAGCAGTCACGGAACTAGCACAGATACGCAGTATTTT        2## 6          TGCAGCGTTGATCTGCCGGGAGAGACCCAGTACTTC        1

Search parameters

The$search() function contains two mutually exclusive parameters:max_distance andmax_fraction.The former parameter sets an absolutethreshold that limits the distance between pairs of sequences in theoutput, and the latter parameter sets a threshold relative to the querysequence length.

The search time is monotonically increasing with the distance threshold,logarithmically increasing with tree size and linearly increasing withthe number of query sequences and the length of each sequence. Overall,the algorithm is significantly faster than a pairwise/matrix editdistance calculation for finding similar sequences. However,care stillneeds to be taken when setting parameters for searching a large numberof sequences (~100,000+).

Some additional examples using the max_fraction parameter.

# Full data: several secondsresults<-tree$search(covid_cdr3,max_fraction=0.035,mode="levenshtein",nthreads=2)# Full data: 1 minuteresults<-tree$search(covid_cdr3,max_fraction=0.06,mode="levenshtein",nthreads=2)# Full data: 15-20 minutesresults<-tree$search(covid_cdr3,max_fraction=0.15,mode="levenshtein",nthreads=2)

Hamming distance search

Hamming distance is similar to Levenshtein distance, but does not allowinsertions or deletions. Sequences must be the same length. Because ofthis restriction, Hamming distance is generally a lot faster.

# Full data: 1 secondresults<-tree$search(covid_cdr3,max_fraction=0.035,mode="hamming",nthreads=2)# Full data: several secondsresults<-tree$search(covid_cdr3,max_fraction=0.06,mode="hamming",nthreads=2)# Full data: 1.5 minutesresults<-tree$search(covid_cdr3,max_fraction=0.15,mode="hamming",nthreads=2)

Anchored alignment searches

An anchored alignment is a form of semi-global alignment, where thequery sequence is “anchored” (global) to the beginning of both the queryand target sequences, but is semi-global in that the end of the eitherthe query sequence or target sequence (but not both) can be unaligned.This type of alignment is sometimes called an “extension” alignment inliterature.

tree<-RadixTree$new()tree$insert("CARTON")tree$insert("CAR")tree$insert("CARBON")tree$search("CART",max_distance=0,mode="anchored")
##   query target distance query_size target_size## 1  CART    CAR        0          3           3## 2  CART CARTON        0          4           4

Because the alignment is semi-global at the end of the alignment, thequery of “CART” finds “CAR” and “CARTON” but not “CARBON” given a maxdistance of 0. Additionally, the output of an anchored search alsoreturns the position of the query and target at the ends. Either thequery or the target must fully align, so at least one of the endpositions will be the full length of the sequence. This type ofalignment is frequently useful in biology e.g. if you are trying toalign multiple reads that are variable in length but start at the samegenomic position or primer site.

Custom distance searches and affine gap alignment

seqtrie supports custom substitution costs and affine gap penalties.Note: we are calculating distance (higher is worse) and not alignmentscore (higher is better).

tree<-RadixTree$new()tree$insert(covid_cdr3)# Define a custom substitution matrix. Use generate_cost_matrix for convenience.cost_mat<- generate_cost_matrix("ACGT",match=0,mismatch=5)print(cost_mat)
##   A C G T## A 0 5 5 5## C 5 0 5 5## G 5 5 0 5## T 5 5 5 0
# Set gap penalties via parameters (not in the matrix):# - Linear gaps: set gap_cost only# - Affine gaps: set both gap_cost and gap_open_cost# Linear exampleresults_linear<-tree$search(covid_cdr3,max_distance=8,mode="global",cost_matrix=cost_mat,gap_cost=2,nthreads=2)# Affine exampleresults_affine<-tree$search(covid_cdr3,max_distance=8,mode="global",cost_matrix=cost_mat,gap_cost=2,gap_open_cost=5,nthreads=2)dplyr::filter(results_linear,query!=target)
##                                            query## 1        TGTGCCAGCAGCTTAGGACAGTCCTACGAGCAGTACTTC## 2        TGTGCCAGCAGCTTAGGACAGTCCTACGAGCAGTACTTC## 3     TGTGCCAGCAGTCACGGAACTAGCACAGATACGCAGTATTTT## 4           TGCAGCGTTGATCTGCCGGGAGAGACCCAGTACTTC## 5     TGTGCCAGCAGCTCGGCGGGGTCCTACAATGAGCAGTTCTTC## 6        TGTGCCAGCAGTTACGGGCAGTCCTACGAGCAGTACTTC## 7        TGTGCCAGCAGTTACGGGCAGTCCTACGAGCAGTACTTC## 8  TGTGCCAGTACTATGGGACAGGGGATGAACACTGAAGCTTTCTTT## 9  TGTGCCAGTAGTATGGGACAGGGAATGAACACTGAAGCTTTCTTT## 10    TGTGCCAGCAGCTCAGGGGGCTCCTACAATGAGCAGTTCTTC## 11          TGTGCCAGCAGTTGGGGGGGCTACGAGCAGTACTTC## 12    TGTGCCAGCAGTTCCCCTAATAGCAATCAGCCCCAGCATTTT## 13       TGTGCCAGCAGTTTATCGGGGTCCTACGAGCAGTACTTC## 14       TGTGCCAGCAGTTTATCGGGGTCCTACGAGCAGTACTTC## 15 TGTGCCAGCAGCCTTAGCGGGGTGAGCACAGATACGCAGTATTTT## 16          TGTGCCAGCAGTCCCTCAGGGGAGACCCAGTACTTC## 17       TGTGCCAGCAGCCTAGCAGGGGCCGGGGAGCTGTTTTTT## 18       TGTGCCAGCAGGCCAGGACAGTCCTACGAGCAGTACTTC## 19       TGTGCCAGCAGCTTAGAGGCGGCCGGGGAGCTGTTTTTT## 20    TGTGCCAGCAGTCCCATAGATAGCAATCAGCCCCAGCATTTT## 21    TGTGCCAGCAGTGACAGAACTAGCACAGATACGCAGTATTTT## 22       TGTGCCAGCAGTTTAGGGGGTGGCTACGAGCAGTACTTC## 23          TGTGCCAGCAGTTTCGGGGCCTACGAGCAGTACTTC## 24    TGTGCCAGCAGCCTTAGCGGTAGCACAGATACGCAGTATTTT## 25          TGTGCCAGCAGTCCTAGCGGCGAGACCCAGTACTTC## 26          TGCAGCGTTGATCTGACGGGAGAGACCCAGTACTTC##                                           target distance## 1        TGTGCCAGCAGTTACGGGCAGTCCTACGAGCAGTACTTC        8## 2        TGTGCCAGCAGGCCAGGACAGTCCTACGAGCAGTACTTC        8## 3     TGTGCCAGCAGTGACAGAACTAGCACAGATACGCAGTATTTT        8## 4           TGCAGCGTTGATCTGACGGGAGAGACCCAGTACTTC        4## 5     TGTGCCAGCAGCTCAGGGGGCTCCTACAATGAGCAGTTCTTC        8## 6        TGTGCCAGCAGTTTATCGGGGTCCTACGAGCAGTACTTC        8## 7        TGTGCCAGCAGCTTAGGACAGTCCTACGAGCAGTACTTC        8## 8  TGTGCCAGTAGTATGGGACAGGGAATGAACACTGAAGCTTTCTTT        8## 9  TGTGCCAGTACTATGGGACAGGGGATGAACACTGAAGCTTTCTTT        8## 10    TGTGCCAGCAGCTCGGCGGGGTCCTACAATGAGCAGTTCTTC        8## 11       TGTGCCAGCAGTTTAGGGGGTGGCTACGAGCAGTACTTC        6## 12    TGTGCCAGCAGTCCCATAGATAGCAATCAGCCCCAGCATTTT        8## 13          TGTGCCAGCAGTTTCGGGGCCTACGAGCAGTACTTC        6## 14       TGTGCCAGCAGTTACGGGCAGTCCTACGAGCAGTACTTC        8## 15    TGTGCCAGCAGCCTTAGCGGTAGCACAGATACGCAGTATTTT        6## 16          TGTGCCAGCAGTCCTAGCGGCGAGACCCAGTACTTC        8## 17       TGTGCCAGCAGCTTAGAGGCGGCCGGGGAGCTGTTTTTT        8## 18       TGTGCCAGCAGCTTAGGACAGTCCTACGAGCAGTACTTC        8## 19       TGTGCCAGCAGCCTAGCAGGGGCCGGGGAGCTGTTTTTT        8## 20    TGTGCCAGCAGTTCCCCTAATAGCAATCAGCCCCAGCATTTT        8## 21    TGTGCCAGCAGTCACGGAACTAGCACAGATACGCAGTATTTT        8## 22          TGTGCCAGCAGTTGGGGGGGCTACGAGCAGTACTTC        6## 23       TGTGCCAGCAGTTTATCGGGGTCCTACGAGCAGTACTTC        6## 24 TGTGCCAGCAGCCTTAGCGGGGTGAGCACAGATACGCAGTATTTT        6## 25          TGTGCCAGCAGTCCCTCAGGGGAGACCCAGTACTTC        8## 26          TGCAGCGTTGATCTGCCGGGAGAGACCCAGTACTTC        4

Radix Forest for faster Levenshtein searches

TheRadixForest class is a data structure holding a collection ofRadix Trees, where a separate tree is constructed for each sequencelength. The primary advantage ofRadixForest is significantly fasterLevenshtein searches, because you can know sequence length up front. Thedisadvantages are higher memory usage, and no support for customdistance searches.

Below is a brief comparison:

# RadixTree, full data: 45 secondstree<-RadixTree$new()tree$insert(covid_cdr3)results_tree<-tree$search(covid_cdr3,max_distance=2,mode="levenshtein",nthreads=2)# RadixForest, full data: 19 secondsfrst<-RadixForest$new()frst$insert(covid_cdr3)results_frst<-frst$search(covid_cdr3,max_distance=2,mode="levenshtein",nthreads=2)# The results are the same, but order is not guaranteedidentical(dplyr::arrange(results_tree,query,target),dplyr::arrange(results_frst,query,target) )
## [1] TRUE

Finding strings that start with a pattern

The$prefix_search() function can be used to find similar sequencesthat start with a pattern. This is one of the classic use cases of triedata structures, for use as a database lookup and predictive text.

tree<-RadixTree$new()tree$insert(c("cargo","cart","carburetor","carbuncle","bar"))tree$prefix_search("car")
##   query     target## 1   car      cargo## 2   car       cart## 3   car carburetor## 4   car  carbuncle

References and literature

  • “Fast string correction with Levenshtein automata” (2002)<doi:10.1007/s10032-002-0082-8>
  • “A survey of sequence alignment algorithms for next-generationsequencing” (2010) <doi:10.1093/bib/bbq015>
  • “Fast and Easy Levenshtein distance using a Trie” (2011)<https://stevehanov.ca/blog/index.php?id=114>
  • “Spell Checker Application Based on Levenshtein Automaton” (2021)<doi:10.1007/978-3-030-91608-4_5>

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

[8]ページ先頭

©2009-2025 Movatter.jp