- Notifications
You must be signed in to change notification settings - Fork12
ksahlin/strobemers
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
A strobemer is a seed extracted from text and used for text similarity searches describedhere. Strobemers arefuzzy seeds and can match over mutations in text. They were initially designed for biological sequence analysis where point mutations and small shifts (insertions/deletions) are common, but they are agnostic to the alphabet of text.
A small illustration below of two biological sequencesT
andT'
with substitutions and insertions/deletions between them, and four strobemer seeds (s_1
-s_4
ands'_1
-s'_4
) extracted from each sequence. Mutations destroy some seeds (s_1
/s'_1
), but some remain intact, allowing retrieval of similar regions. The seeds have total length of 24 letters (two 12-letter strings separated by a pseudorandom spacing), but no consecutive 24 letter string (k-mer) is found betweenT
andT'
.
The repository consists of
- functions to generate strobemers in C++
- functions to generate strobemers in Python
- a tool
StrobeMap
implemented in both C++ and Python - scripts used for the evaluations in thepaper
Other strobemer implementations are found here
The construction time is dependent on the implementation. The times reported in the preprint are for the C++/Python implementations in this repository.
The C++ librarystrobemers_cpp/index.[cpp/hpp]
contains functions for creating randstobes (order 2 and 3), hybridstrobes (order 2 and 3) and minstrobes (order 2).
You can copy theindex.cpp
andindex.hpp
files in thestrobemers_cpp
folder in this repository if you want to use either randstrobes (order 2 and 3), hybridstrobes (order 2), or minstrobes (order 2) in your project. The implementation of these functions uses bitpacking and some other clever tricks (inspired bythis repo) to be fast. Because of bitpacking, the limitation is that any single strobe cannot be lager than 32, which means that themaximum strobemer length allowed in this implementation is3*32 = 96
for strobemers of order 3, and2*32 = 64
for order 2. This should be large enough for most applications.
The functions in theindex.cpp
file can be used as follows:
#include "index.hpp"typedef std::vector< std::tuple<uint64_t, unsigned int, unsigned int, unsigned int, unsigned int>> strobes_vector;strobes_vector randstrobes3; // (kmer hash value, seq_id, strobe1_pos, strobe2_pos, strobe3_pos)seq = "ACGCGTACGAATCACGCCGGGTGTGTGTGATCGGGGCTATCAGCTACGTACTATGCTAGCTACGGACGGCGATTTTTTTTCATATCGTACGCTAGCTAGCTAGCTGCGATCGATTCG";n=3;k=15;w_min=16;w_max=30;seq_id = 0; // using integers for compactness, you can store a vector with accessions v = [acc_chr1, acc_chr2,...] then seq_id = 0 means v[0].randstrobes3 = seq_to_randstrobes3(n, k, w_min, w_max, seq, seq_id);for (auto &t : randstrobes3) // iterate over the strobemer tuples{strobemer_hash = std::get<0>(t);strobe1_pos = std::get<2>(t);strobe2_pos = std::get<3>(t);strobe3_pos = std::get<4>(t);// if you want the actual strobemer sequences:randstrobe = seq.substr(strobe1_pos, k) + seq.substr(strobe2_pos, k)+ seq.substr(strobe3_pos, k);}
If you are using some ofseq_to_randstrobes2
,seq_to_hybridstrobes2
, orseq_to_minstrobes3
they return the same vector tuples but position of strobe 2 copied twice, i.e.,(kmer hash value, seq_id, strobe1_pos, strobe2_pos, strobe2_pos)
. There is no reason for this and for any high performance application the function could be modified to return the minimal needed information.
My benchmarking in Table S3 in thesupplemental methods found that randstrobes is roughly as fast as hybridstrobes and minstrobes. Furthermore, randstrobes is unexpectedly fast in this implementation in general, about 1.5-3 times slower than generating k-mers for randstrobes of (n=2, s=15, w_min=16,w_max=70). What takes time is pushing the tuples to a vector, not computing the strobemers. Bare construction time could be further compared if preallocating an array of fixed size to remove the resizing when pushing to vectors. Nevertheless, the takehome message is that the generation of strobemers could be implemented so that it is fast, and will likely not be a bottleneck in most algorithms using them.
The preprint describes shrinking the sampling windows[w_min, w_max]
at ends of sequences to assure that a similar number of strobemers and k-mers created. However, in, e.g., read mapping, there is little to no gain in shrinking windows. This is because if we shrink windows at the ends of reads, the strobemer extracted from the read in those windows cannot be guaranteed to (but may) be the same as in the reference, as first described inthis issue. The more the window(s) are shrunk, the less likely the strobers are to match between the sequences, and the probability of matching a strobemer after the last window (original size) is completely outside the read is 0 (if disregarding false matches). After noting this, my implementation only shrink the last strobemer window regardless of the number of strobes (i.e., there is a positive probablility of a match even if window is shrunk). This means that there will ben - (k + w_min) +1
strobemers of order 2 generated form a sequence of lengthn
, andn - (k + w_max + w_min) +1
strobemers of order 3. In other words, we will only slide last strobe's window outside the sequence. Once it is fully outside the sequence we stop (illustrated in approach B for order 2 inhere.
The python libraryindexing.py
contains functions and generators for creating all strobemers of order 2 and 3.
Theindexing.py
module located in themodules
folder contains functions for generating k-mers, spaced k-mers, minimizers, and strobemers (minstrobes, hybridstrobes, and randstrobes) of order 2 and 3. For randstrobes, there are two ways to create them. The first way is with the functionrandstrobes
, which takes a string, k-mer size, and upper and lower window limits and returns a dictionary with positions of the strobes as keys and the hash value of the randstrobe sequence (strings) as values. For example
from modules import indexingall_mers = defaultdict(list)for (p1,p2,p3), h in indexing.randstrobes(seq, k_size, w_min, w_max, order = 3).items(): # all_mers is a dictionary with hash values as keys and # a list with position-tuples of where the strobemer is sampled from all_mers[h].append( (p1,p2,p3) )
Functionsminstrobes
andhybridstrobes
have the same interface.
The second way is to callrandstrobes_iter
which is a generator. Similarly torandstrobes
,randstrobes_iter
takes a string, k-mer size, and upper and lower window size, but instead yields randstrobes from the sequence and is not as memmory requiring as therandstrobes
function which store and returns all the strobes in a dictionary.randstrobes_iter
generating randpers of order 2 can be used as follows
from modules import indexingall_mers = defaultdict(list)for (p1,p2), s in indexing.randstrobes_iter(seq, k_size, w_min, w_max, order = 2, buffer_size = 1000000): all_mers[s].append( (p1,p2) )
Functionsminstrobes_iter
andhybridstrobes_iter
have the same interface.
The toolStrobeMap
is a program which roughly has the same interface asMUMmer
.StrobeMap
takes a reference and query file in multi-fasta or fastq format. It produces NAMs (Non-overlapping Approximate Matches) between the queries and references and outputs them in a format simular to nucmer/MUMmer. Seesupplementary material Section A in the paper for definition of NAMs.
StrobeMap
is available throughbioconda. You can also acquire precompiled binaries for Linux and Mac OSx fromhere. For example, for linux, simply do
wget https://github.com/ksahlin/strobemers/raw/main/strobemers_cpp/binaries/Linux/StrobeMap-0.0.2chmod +x StrobeMap-0.0.2./StrobeMap-0.0.2 # test program
If you want to compile from source, you need to have a newerg++
andzlib
installed. Then do the following:
git clone https://github.com/ksahlin/strobemerscd strobemers/strobemers_cpp/# Needs a newer g++ version. Tested with version 8 and upwards.g++ -std=c++14 main.cpp index.cpp -lz -fopenmp -o StrobeMap -O3 -mavx2
If zlib is not already installed on your system, it can be installed through, e.g., conda by
conda install -c anaconda zlib
If you dont have conda, download and installhere.
$ ./StrobeMap StrobeMap VERSION 0.0.2StrobeMap [options] <references.fasta> <queries.fast[a/q]>options: -n INT number of strobes [2] -k INT strobe length, limited to 32 [20] -v INT strobe w_min offset [k+1] -w INT strobe w_max offset [70] -t INT number of threads [3] -o name of output tsv-file [output.tsv] -c Choice of protocol to use; kmers, minstrobes, hybridstrobes, randstrobes [randstrobes]. -s Split output into one file per thread and forward/reverse complement mappings. This option is used to generate format compatible with uLTRA long-read RNA aligner and requires option -o to be specified as a folder path to uLTRA output directory, e.g., -o /my/path/to/uLTRA_output/
# randstrobes (3,30,31,60)StrobeMap -k 30 -n 3 -v 31 -w 60 -c randstrobes -o mapped.tsv ref.fa query.fa
If you havezlib
installed, and thezlib.h
file is in folder/path/to/zlib/include
and thelibz.so
file in/path/to/zlib/lib
but you get
main.cpp:12:10: fatal error: zlib.h: No such file or directory #include <zlib.h> ^~~~~~~~compilation terminated.
add-I/path/to/zlib/include -L/path/to/zlib/lib
to the compilation, that is
g++ -std=c++14 -I/path/to/zlib/include -L/path/to/zlib/lib main.cpp index.cpp -lz -fopenmp -o StrobeMap -O3 -mavx2
StrobeMap
implements order 2 and 3 hybridstrobes (default), randstrobes, minstrobes, as well as kmers. The tool produces NAMs (Non-overlapping Approximate Matches; see explanation in preprint) for both strobemers and kmers. Test data is found in the folderdata
in this repository.Here are some example uses:
# Generate hybridstrobe matches (hybridstrobe parametrization (2,15,20,70)) # between ONT SIRV reads and the true reference sequences./StrobeMap --queries data/sirv_transcripts.fasta \ --references data/ONT_sirv_cDNA_seqs.fasta \ --outfolder strobemer_output/ --k 15 --strobe_w_min_offset 20 --strobe_w_max_offset 70# Generate kmer matches (k=30) # between ONT SIRV reads and the true reference sequences./StrobeMap --queries data/sirv_transcripts.fasta \ --references data/ONT_sirv_cDNA_seqs.fasta \ --outfolder kmer_output/ --k 30 --kmer_index# Reads vs reads matching using randstrobes./StrobeMap --queries data/ONT_sirv_cDNA_seqs.fasta \ --references data/ONT_sirv_cDNA_seqs.fasta \ --outfolder strobemer_output/ --k 15 \ --strobe_w_min_offset 20 --strobe_w_max_offset 70 \ --randstrobe_index
Minstrobes has the same parameters as hybridstrobes and randstrobes but are invoked with parameter--minstrobe_index
The output is a filematches.tsv
in the output folder. You can se a custom outfile name with the parameter--prefix
.Output format is a tab separated file on the same format as MUMmer, with identical fields except the last one which is approximate reference sequence match length instead of what MUMmer produce:
>query_accessionref_id ref_pos query_pos match_length_on_reference
Small example output from aligning sirv reads to transcripts (from the commands above) which also highlights the stobemers strength compared to kmers. While kmers can give a more nuanced differentiation (compare read hits toSIRV606
andSIRV616
) both the sequences are good candidates for downstream processing. In this small example, the strobemers produce fewer hits/less output needed for post clustering of matches, e.g., for downstream clustering/alignment/mapping. Notice that randstobe hit positions are currently not deterministic due to hash seed is set at each new pyhon instantiation. I will fix the hash seed in future implementations.
Randstrobes (2,15,20,70)
>41:650|d00e6247-9de6-485c-9b44-806023c51f13SIRV606 35 92 487SIRV616 35 92 473>56:954|a23755a1-d138-489e-8efb-f119e679daf4SIRV509 3 3 515SIRV509 520 529 214SIRV509 762 767 121>106:777|0f79c12f-efed-4548-8fcc-49657f97a126SIRV404 53 131 535
kmers (k=30)
>41:650|d00e6247-9de6-485c-9b44-806023c51f13SIRV606 33 90 46SIRV606 92 150 125SIRV606 219 275 81SIRV606 349 408 70SIRV606 420 479 47SIRV606 481 540 42SIRV616 33 90 46SIRV616 92 150 125SIRV616 219 275 81SIRV616 349 408 60SIRV616 409 482 44SIRV616 467 540 42>56:954|a23755a1-d138-489e-8efb-f119e679daf4SIRV509 68 72 141SIRV509 230 233 100SIRV509 331 335 105SIRV509 435 442 40SIRV509 475 483 36SIRV509 579 585 41SIRV509 621 627 46SIRV509 695 701 44SIRV509 812 815 53>106:777|0f79c12f-efed-4548-8fcc-49657f97a126SIRV404 53 131 58SIRV404 128 208 127SIRV404 283 364 30SIRV404 422 494 142
Kristoffer Sahlin, Effective sequence similarity detection with strobemers, Genome Res. November 2021 31: 2080-2094; doi:https://doi.org/10.1101/gr.275648.121
About
A repository for generating strobemers and evalaution