Fast approximate string searching
RagnarGrootKoerkamp/sassy
WARNING: Versions up to 0.1.9 had a shameful bug in the `sassy grep` CLI where after every 1MB of input it would skip one record. Please update.
Sassy is a library and tool for searching short strings in texts, a problem that goes by many names:
- approximate string matching,
- pattern matching,
- fuzzy searching.
The motivating application is searching short (length 20 to 100) DNA sequences in a human genome or e.g. in a set of reads. Sassy generally works well for patterns/queries up to length 1000, and supports the ASCII, DNA, and IUPAC alphabets.
It has a `grep`-like mode for quick human inspection, as well as `search` to report locations of matches, and `filter` to only output (non-)matching records.
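To make the problem statement concrete, here is a brute-force reference for approximate string matching under edit distance. It is a plain dynamic-programming sketch, not how Sassy computes matches (Sassy uses bitpacking and SIMD), and the function name `approx_search` is made up for this illustration.

```python
def approx_search(pattern: bytes, text: bytes, k: int) -> list[tuple[int, int]]:
    """Return (text_end, cost) for every text position where the pattern
    matches some substring ending there with at most k edits."""
    m = len(pattern)
    # col[i] = minimal number of edits to align pattern[:i] with a
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # col[i-1] of the previous text position
        col[0] = 0          # a match may start anywhere in the text for free
        for i in range(1, m + 1):
            cur = min(
                prev_diag + (pattern[i - 1] != c),  # match or mismatch
                col[i] + 1,                         # text character not in pattern
                col[i - 1] + 1,                     # pattern character not in text
            )
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits


hits = approx_search(b"ATCG", b"AAAATTGAAA", 1)
print(hits)
```

This runs in time proportional to the pattern length times the text length, which is the cost that bit-parallel methods are designed to cut down.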
Highlights:
- Sassy uses bitpacking and SIMD (both AVX2 and NEON supported). Its main novelty is tiling these in the text direction.
- Support for overhang alignments where the pattern extends beyond the text.
- Support for (case-insensitive) ASCII, DNA (`ACGT`), and IUPAC (=`ACGT+NYR...`) alphabets.
- Rust library (`cargo add sassy`), binary (`cargo install sassy`, see details below), Python bindings (`pip install sassy-rs`), and C bindings (see below).
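As an illustration of what the IUPAC alphabet adds over plain `ACGT`, the sketch below encodes each IUPAC code as a 4-bit mask over the bases it stands for. It assumes that two characters are compatible when their base sets overlap and that case is ignored; this is the usual convention, but Sassy's exact matching rule may differ. `IUPAC_MASK` and `iupac_match` are names invented for this example.

```python
# One bit per concrete base.
BASE_BITS = {"A": 1, "C": 2, "G": 4, "T": 8}

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# 4-bit mask per code, e.g. R (A or G) -> 0b0101.
IUPAC_MASK = {
    code: sum(BASE_BITS[base] for base in bases)
    for code, bases in IUPAC_SETS.items()
}


def iupac_match(a: str, b: str) -> bool:
    """True when the two codes share at least one base (case-insensitive)."""
    return (IUPAC_MASK[a.upper()] & IUPAC_MASK[b.upper()]) != 0


print(iupac_match("N", "g"), iupac_match("Y", "R"))
```

With this encoding a character comparison becomes a single bitwise AND, which is the kind of operation that packs well into machine words.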
See the paper, and corresponding evals in `evals/`:
Rick Beeloo and Ragnar Groot Koerkamp.
Sassy: Searching Short DNA Strings in the 2020s.
bioRxiv, July 2025. https://doi.org/10.1101/2025.07.22.666207.
See the latest release for prebuilt binaries.
You can also get these via

```bash
cargo binstall sassy
```

or via conda/mamba/pixi:

```bash
conda install -c bioconda sassy
```
Alternatively, build from source:

```bash
RUSTFLAGS="-C target-cpu=native" cargo install sassy
```

Sassy uses AVX2 or NEON instructions for performance reasons, which requires either `target-cpu=native` or `target-cpu=x86-64-v3` on x64 machines. See this README for details and this blog for background. The same restrictions apply when using the sassy library in a larger project.
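When Sassy is a dependency of a larger Rust project, one way to apply the flag to every build is Cargo's per-project configuration file, so that `RUSTFLAGS` does not have to be set by hand. A minimal sketch, assuming the resulting binary only needs to run on the machine that builds it:

```toml
# .cargo/config.toml in the root of the project that depends on sassy
[build]
rustflags = ["-C", "target-cpu=native"]
```

For binaries that must run on other x64 machines, replacing `native` with `x86-64-v3` is the alternative named above.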
Sassy requires Rust 1.91 or newer. Get it via `rustup update`. (Switch to `rustup` if your system installation is too old.)
Sassy can be used via the CLI, or as a Rust, Python, or C library.
The library can be used to search for ASCII or DNA strings. A larger example can be found in `src/lib.rs`.
```rust
// cargo add sassy
use sassy::{Searcher, Match, profiles::Iupac, Strand};

let pattern = b"ATCG";
let text = b"AAAATTGAAA";
let k = 1;
// The Iupac profile supports N and YR... characters.
// If you are sure you only have ACGT input, then `profiles::Dna` is slightly faster.
let mut searcher = Searcher::<Iupac>::new_fwd();
let matches = searcher.search(pattern, &text, k);

assert_eq!(matches.len(), 1);
assert_eq!(matches[0].text_start, 3);
assert_eq!(matches[0].text_end, 7);
assert_eq!(matches[0].cost, 1);
assert_eq!(matches[0].strand, Strand::Fwd);
assert_eq!(matches[0].cigar.to_string(), "2=1X1=");
```
The CLI can be used via:
- `sassy grep`: to show nicely coloured output.
- `sassy search`: to write a `.tsv` of matching locations.
- `sassy filter`: to write a `.fasta`/`.fastq` of (non-)matching records.
- `sassy crispr`: to search for CRISPR guides.

`grep`, `search`, and `filter` all take the same arguments, and are implemented by forwarding to `grep`. Thus, they can all be combined via e.g.

```bash
sassy grep -p ACGTCAAACCTA -k 3 --matches matches.tsv --output filtered.fastq reads.fastq.gz
```

Search a pattern `ATGAGCA` in `text.fasta` with ≤1 edit:

```bash
sassy search --pattern ATGAGCA -k 1 text.fasta
```

or search all records of a fasta file with `--pattern-fasta <fasta-file>` instead of `--pattern`.
The `grep` output is coloured:
- green shows matching characters,
- orange shows mismatches,
- red shows deleted characters (in pattern but not in text),
- blue shows inserted characters (in text but not in pattern).
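These four categories correspond to the operations in the `cigar` strings that Sassy reports, such as `2=1X1=`. The sketch below tallies a cigar string, assuming `=` is a match, `X` a mismatch, `D` a character in the pattern but not in the text, and `I` a character in the text but not in the pattern. `cigar_stats` is a hypothetical helper written for this example, not part of Sassy.

```python
import re


def cigar_stats(cigar: str) -> dict[str, int]:
    """Summarise a cigar string: edit cost and the number of pattern and
    text characters it covers."""
    cost = pattern_len = text_len = 0
    for count, op in re.findall(r"(\d+)([=XID])", cigar):
        n = int(count)
        if op != "=":
            cost += n         # every non-match operation is one edit per character
        if op in "=XD":
            pattern_len += n  # operations that consume pattern characters
        if op in "=XI":
            text_len += n     # operations that consume text characters
    return {"cost": cost, "pattern_len": pattern_len, "text_len": text_len}


print(cigar_stats("14=2D26="))
```

Under these assumptions, `text_len` should equal `end - start` of the reported match and `pattern_len` should equal the pattern length, which gives a quick consistency check on any output row.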

```bash
sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa > matches.tsv
# or
sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv
# or
sassy grep -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv
```

gives `.tsv` output like this:
| pat_id | text_id | cost | strand | start | end | match_region | cigar |
|---|---|---|---|---|---|---|---|
| pattern | AC_000001.1__1_1 | 0 | + | 6 | 48 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
| pattern | AC_000001.1__1_35 | 0 | + | 897 | 939 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
| pattern | AC_000001.1__1_49 | 1 | + | 866 | 908 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCGCGCG | 37=1X4= |
| pattern | AC_000001.1__1_64 | 0 | - | 1267 | 1309 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
| pattern | AC_000001.1__1_67 | 0 | + | 600 | 642 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
| pattern | AC_000001.1__1_68 | 0 | - | 1826 | 1868 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
| pattern | AC_000001.1__1_78 | 3 | - | 4381 | 4425 | GTACAGAAACGAGCGGATGGAAAATAGTAGTGAGCGGCCTCGCG | 23=1X1I10=1I8= |
| pattern | AC_000001.1__1_92 | 0 | - | 6554 | 6596 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
| pattern | AC_000001.1__1_94 | 0 | - | 6413 | 6455 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
| pattern | AC_000001.1__1_115 | 2 | + | 2091 | 2131 | GTACAGAAACGAGCATGGAAAGAGTAGTGAGCGCCTCGCG | 14=2D26= |
| pattern | AC_000001.1__1_118 | 0 | - | 3062 | 3104 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
| pattern | AC_000001.1__1_123 | 0 | + | 1416 | 1458 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
| pattern | AC_000001.1__1_127 | 0 | + | 27 | 69 | GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG | 42= |
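Output in this shape is easy to post-process. The sketch below reads it with Python's standard `csv` module, assuming the file is tab-delimited and starts with the header row `pat_id`, `text_id`, `cost`, `strand`, `start`, `end`, `match_region`, `cigar`. `read_matches` is a hypothetical helper, and the sample rows are copied from the example output so that the block runs without an input file.

```python
import csv
import io


def read_matches(handle):
    """Parse `sassy search` TSV output into dicts with integer cost/start/end."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("cost", "start", "end"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


HEADER = ["pat_id", "text_id", "cost", "strand", "start", "end", "match_region", "cigar"]
SAMPLE_ROWS = [
    ["pattern", "AC_000001.1__1_1", "0", "+", "6", "48",
     "GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG", "42="],
    ["pattern", "AC_000001.1__1_49", "1", "+", "866", "908",
     "GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCGCGCG", "37=1X4="],
    ["pattern", "AC_000001.1__1_64", "0", "-", "1267", "1309",
     "GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG", "42="],
]
sample_tsv = "\n".join("\t".join(fields) for fields in [HEADER, *SAMPLE_ROWS]) + "\n"

matches = read_matches(io.StringIO(sample_tsv))
# Records that contain the pattern only with at least one edit.
inexact = [m["text_id"] for m in matches if m["cost"] > 0]
print(inexact)
```

To process a real result, pass `open("matches.tsv", newline="")` instead of the `io.StringIO` object.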
```bash
sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq > filtered.fq
# or
sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq
# or
sassy grep -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq
```

Writes a file containing only matching records. Use `--invert` to only write non-matching records.
Search for one or more guides in `guides.txt`:
```bash
sassy crispr --threads 8 --guide guides.txt --k 5 --max-n-frac 0.1 --output hits.tsv hg38.fasta
```

Allows `<= k` edits in the sgRNA, and the PAM (the last 3 characters of each guide) has to match exactly, unless `--allow-pam-edits` is given.
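The PAM rule can be pictured as follows. This is only an illustration of the sentence above, not Sassy's implementation: it assumes a 3-character PAM at the end of the guide, treats `N` in the PAM as matching any base, and ignores case. Both function names are hypothetical.

```python
def split_guide(guide: str, pam_len: int = 3) -> tuple[str, str]:
    """Split a guide into (sgRNA part, PAM), taking the PAM from the end."""
    return guide[:-pam_len], guide[-pam_len:]


def pam_matches_exactly(pam: str, site: str) -> bool:
    """True when every PAM position equals the site, with N as a wildcard."""
    if len(pam) != len(site):
        return False
    return all(p == "N" or p == s for p, s in zip(pam.upper(), site.upper()))


spacer, pam = split_guide("GAGTCCGAGCAGAAGAAGAANGG")
print(spacer, pam, pam_matches_exactly(pam, "agg"))
```

Edits are then counted only in the first part of the guide, while a candidate site whose last three characters fail this check is rejected outright.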
Output of the `crispr` command is a tab-delimited file with one row per hit, e.g.:
| guide | text_id | cost | strand | start | end | match_region | cigar |
|---|---|---|---|---|---|---|---|
| GAGTCCGAGCAGAAGAAGAANGG | chr21 | 5 | + | 5024135 | 5024154 | GAGGCCACAGAGAAGAGGG | 3=1X2=1D1=1D3=1D5=1D4= |
| GAGTCCGAGCAGAAGAAGAANGG | chr21 | 3 | + | 21087337 | 21087359 | gagaccgaggagaagaaaaagg | 3=1X5=1X7=1D5= |
| GAGTCCGAGCAGAAGAAGAANGG | chr21 | 3 | - | 9701297 | 9701320 | GACTCGAGCATGAAGAAGAAAGG | 2=1X1=1D6=1I12= |
| GAGTCCGAGCAGAAGAAGAANGG | chr21 | 5 | - | 46396975 | 46396998 | CAGTCCCAGCAGACGACGGACGG | 1X5=1X6=1X2=1X1=1X4= |

The `start` and `end` are 0-based and half-open (i.e. inclusive of the start, but exclusive of the end), and `start` is always less than `end` (regardless of the strand). The `match_region` reported will be the sequence from the target file when `strand` is `+`, or the reverse complement of the sequence from the target file when `strand` is `-`, so that it matches the guide sequence. The `cigar` is always oriented to read left-to-right with the provided guide and `match_region` sequences.
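The coordinate convention maps directly onto Python slicing. The sketch below shows how a `match_region` could be recomputed from a target sequence under the rules just described; `reverse_complement` and `match_region` are hypothetical helpers, and the complement table only covers `ACGT` in either case, leaving other characters such as `N` unchanged.

```python
# Complement table for the four bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def match_region(target: str, start: int, end: int, strand: str) -> str:
    """Recover the reported region from 0-based, half-open coordinates."""
    region = target[start:end]  # includes start, excludes end
    return region if strand == "+" else reverse_complement(region)


target = "TTTACGGGAAA"
print(match_region(target, 3, 7, "+"), match_region(target, 3, 7, "-"))
```

The same `reverse_complement` helper is one way to reverse-complement a guide before searching.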
Note that this searches for approximate occurrences of the guide sequence itself, and not for reverse-complement binding sites. If binding sites are to be found, please reverse-complement the input or output manually.
PyPI wheels can be installed with:
```bash
pip install sassy-rs
```

```python
import sassy

pattern = b"ACTG"
text = b"ACGGCTACGCAGCATCATCAGCAT"
searcher = sassy.Searcher("dna")  # ascii / dna / iupac
matches = searcher.search(pattern, text, k=1)
for m in matches:
    print(m)
```

See `python/README.md` for more details.
See `c/README.md` for details. Quick example:
```c
#include <math.h>     // NAN
#include <stdbool.h>  // true
#include <string.h>   // strlen

#include "sassy.h"

int main() {
  const char* pattern = "ACTG";
  const char* text = "ACGGCTACGCAGCATCATCAGCAT";
  // DNA alphabet, with reverse complement, without overhang.
  sassy_SearcherType* searcher = sassy_searcher("dna", true, NAN);
  sassy_Match* out_matches = NULL;
  size_t n_matches = search(searcher,
                            pattern, strlen(pattern),
                            text, strlen(text),
                            1,  // k=1
                            &out_matches);
  sassy_matches_free(out_matches, n_matches);
  sassy_searcher_free(searcher);
  return 0;
}
```
