Fast approximate string searching


RagnarGrootKoerkamp/sassy


WARNING: Versions up to 0.1.9 had a shameful bug in the sassy grep CLI where after every 1MB of input it would skip one record. Please update.

Sassy: SIMD-accelerated Approximate String Matching

Sassy is a library and tool for searching short strings in texts, a problem that goes by many names:

  • approximate string matching,
  • pattern matching,
  • fuzzy searching.

The motivating application is searching short (length 20 to 100) DNA sequences in a human genome or e.g. in a set of reads. Sassy generally works well for patterns/queries up to length 1000, and supports the ASCII, DNA, and IUPAC alphabets.

It has a grep-like mode for quick human inspection, as well as search to report locations of matches, and filter to only output (non)-matching records.

gif of sassy grep

Highlights:

  • Sassy uses bitpacking and SIMD (both AVX2 and NEON supported). Its main novelty is tiling these in the text direction.
  • Support for overhang alignments where the pattern extends beyond the text.
  • Support for (case-insensitive) ASCII, DNA (ACGT), and IUPAC (=ACGT+NYR...) alphabets.
  • Rust library (cargo add sassy), binary (cargo install sassy, see details below), Python bindings (pip install sassy-rs), and C bindings (see below).

See the paper, and corresponding evals in evals/:

Rick Beeloo and Ragnar Groot Koerkamp.
Sassy: Searching Short DNA Strings in the 2020s.
bioRxiv, July 2025. https://doi.org/10.1101/2025.07.22.666207.

Installation

Prebuilt binaries

See the latest release.

You can also get these via

cargo binstall sassy

or via conda/mamba/pixi:

conda install -c bioconda sassy

Build from source

RUSTFLAGS="-C target-cpu=native" cargo install sassy

Sassy uses AVX2 or NEON instructions for performance reasons, which requires either target-cpu=native or target-cpu=x86-64-v3 on x64 machines. See this README for details and this blog for background. The same restrictions apply when using the sassy library in a larger project.

Sassy requires Rust 1.91 or newer. Get it via rustup update. (Switch to rustup when your system installation is too old.)

Usage

Sassy can be used via the CLI, or as Rust, Python, or C library.

0. Rust library

The library can be used to search for ASCII or DNA strings. A larger example can be found in src/lib.rs.

// cargo add sassy
use sassy::{Searcher, Match, profiles::Iupac, Strand};

let pattern = b"ATCG";
let text = b"AAAATTGAAA";
let k = 1;
// The Iupac profile supports N and YR... characters.
// If you are sure you only have ACGT input, then `profiles::Dna` is slightly faster.
let mut searcher = Searcher::<Iupac>::new_fwd();
let matches = searcher.search(pattern, &text, k);

assert_eq!(matches.len(), 1);
assert_eq!(matches[0].text_start, 3);
assert_eq!(matches[0].text_end, 7);
assert_eq!(matches[0].cost, 1);
assert_eq!(matches[0].strand, Strand::Fwd);
assert_eq!(matches[0].cigar.to_string(), "2=1X1=");
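To make the notion of "a match with at most k edits" concrete, here is a minimal reference sketch in Python of the textbook quadratic-time dynamic program for approximate matching. It is not how Sassy works internally (Sassy uses bitpacking and SIMD), and the function name end_costs is made up for this illustration. It only covers plain byte equality on the forward strand, without IUPAC, reverse complement, or overhang.

```python
def end_costs(pattern, text):
    """Minimum edit cost of a match of `pattern` ending at each text position.

    Entry j of the result is the cheapest way to align the whole pattern
    against some substring of `text` that ends just before index j.
    """
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    costs = [col[m]]
    for t in text:
        new = [0]  # a match may start at any text position for free
        for i in range(1, m + 1):
            substitute = col[i - 1] + (pattern[i - 1] != t)
            new.append(min(substitute, col[i] + 1, new[i - 1] + 1))
        col = new
        costs.append(col[m])
    return costs


pattern, text, k = b"ATCG", b"AAAATTGAAA", 1
hits = [(end, cost) for end, cost in enumerate(end_costs(pattern, text)) if cost <= k]
print(hits)  # [(7, 1)]
```

The single hit ends at text position 7 with cost 1, which agrees with the text_end and cost asserted in the Rust example.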

1. Command-line interface (CLI)

The CLI can be used via:

  1. sassy grep: to show nicely coloured output.
  2. sassy search: to write a .tsv of matching locations.
  3. sassy filter: to write a .fasta/.fastq of (non)-matching records.
  4. sassy crispr: to search for CRISPR guides.

grep, search, and filter all take the same arguments, and are implemented by forwarding to grep. Thus, they can all be combined via e.g.

sassy grep -p ACGTCAAACCTA -k 3 --matches matches.tsv --output filtered.fastq reads.fastq.gz

1.1: Grep for a pattern

Search a pattern ATGAGCA in text.fasta with ≤1 edit:

sassy grep --pattern ATGAGCA -k 1 text.fasta

or search all records of a fasta file with --pattern-fasta <fasta-file> instead of --pattern.

The grep output is coloured:

  • green shows matching characters,
  • orange shows mismatches,
  • red shows deleted characters (in pattern but not in text),
  • blue shows inserted characters (in text but not in pattern).

screenshot of sassy grep output

1.2: TSV output for matches

sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa > matches.tsv
# or
sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv
# or
sassy grep   -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv

gives .tsv output like this:

pat_id   text_id             cost  strand  start  end   match_region                                  cigar
pattern  AC_000001.1__1_1    0     +       6      48    GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
pattern  AC_000001.1__1_35   0     +       897    939   GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
pattern  AC_000001.1__1_49   1     +       866    908   GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCGCGCG    37=1X4=
pattern  AC_000001.1__1_64   0     -       1267   1309  GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
pattern  AC_000001.1__1_67   0     +       600    642   GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
pattern  AC_000001.1__1_68   0     -       1826   1868  GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
pattern  AC_000001.1__1_78   3     -       4381   4425  GTACAGAAACGAGCGGATGGAAAATAGTAGTGAGCGGCCTCGCG  23=1X1I10=1I8=
pattern  AC_000001.1__1_92   0     -       6554   6596  GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
pattern  AC_000001.1__1_94   0     -       6413   6455  GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
pattern  AC_000001.1__1_115  2     +       2091   2131  GTACAGAAACGAGCATGGAAAGAGTAGTGAGCGCCTCGCG      14=2D26=
pattern  AC_000001.1__1_118  0     -       3062   3104  GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
pattern  AC_000001.1__1_123  0     +       1416   1458  GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
pattern  AC_000001.1__1_127  0     +       27     69    GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG    42=
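For downstream processing, the matches file can be read with Python's standard csv module. The sketch below assumes the file is tab-delimited with a header line using the column names shown in the example output; read_matches is a hypothetical helper, not part of Sassy, and the two sample rows are transcribed from the example so that the snippet runs on its own.

```python
import csv
import io


def read_matches(handle):
    """Parse a matches .tsv into dicts, converting the numeric columns to int."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("cost", "start", "end"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# Two rows from the example output; in practice, pass open("matches.tsv").
example = (
    "pat_id\ttext_id\tcost\tstrand\tstart\tend\tmatch_region\tcigar\n"
    "pattern\tAC_000001.1__1_1\t0\t+\t6\t48\t"
    "GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG\t42=\n"
    "pattern\tAC_000001.1__1_115\t2\t+\t2091\t2131\t"
    "GTACAGAAACGAGCATGGAAAGAGTAGTGAGCGCCTCGCG\t14=2D26=\n"
)
matches = read_matches(io.StringIO(example))

# Sanity check: the reported region spans exactly end - start characters.
for m in matches:
    assert m["end"] - m["start"] == len(m["match_region"])

exact = [m["text_id"] for m in matches if m["cost"] == 0]
print(exact)
```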

1.3: Filter matching records

sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq > filtered.fq
# or
sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq
# or
sassy grep   -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq

Writes a file containing only matching records. Use --invert to only write non-matching records.

1.4: CRISPR off-target search

Search for one or more guides in guides.txt:

sassy crispr --threads 8 --guide guides.txt --k 5 --max-n-frac 0.1 --output hits.tsv hg38.fasta

Allows <= k edits in the sgRNA, and the PAM (the last 3 characters of each guide) has to match exactly, unless --allow-pam-edits is given.
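As an illustration of what an exact PAM match can mean when the guide ends in an ambiguity code such as N, the sketch below encodes IUPAC codes as 4-bit sets over A, C, G, T. It assumes that two symbols match when their sets overlap and that comparison is case-insensitive; both are assumptions for this example, not a statement of Sassy's internals. The helpers iupac_match and pam_intact are hypothetical names.

```python
# IUPAC nucleotide codes as bit sets over A, C, G, T.
A, C, G, T = 1, 2, 4, 8
IUPAC = {
    "A": A, "C": C, "G": G, "T": T,
    "R": A | G, "Y": C | T, "S": C | G, "W": A | T,
    "K": G | T, "M": A | C,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T,
}


def iupac_match(a, b):
    """True if the base sets of the two symbols overlap (assumed semantics)."""
    return (IUPAC[a.upper()] & IUPAC[b.upper()]) != 0


def pam_intact(guide, match_region, pam_len=3):
    """Compare the last `pam_len` characters of guide and region position by position."""
    pam, site = guide[-pam_len:], match_region[-pam_len:]
    return len(site) == pam_len and all(iupac_match(p, s) for p, s in zip(pam, site))


guide = "GAGTCCGAGCAGAAGAAGAANGG"
print(pam_intact(guide, "GAGGCCACAGAGAAGAGGG"))  # True: NGG vs GGG
print(pam_intact(guide, "ACGTGAG"))              # False: NGG vs GAG
```

Comparing only the trailing characters is valid here because an edit-free PAM aligns one-to-one with the end of the match region.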

Output of the crispr command is a tab-delimited file with one row per hit, e.g.:

guide                    text_id  cost  strand  start     end       match_region             cigar
GAGTCCGAGCAGAAGAAGAANGG  chr21    5     +       5024135   5024154   GAGGCCACAGAGAAGAGGG      3=1X2=1D1=1D3=1D5=1D4=
GAGTCCGAGCAGAAGAAGAANGG  chr21    3     +       21087337  21087359  gagaccgaggagaagaaaaagg   3=1X5=1X7=1D5=
GAGTCCGAGCAGAAGAAGAANGG  chr21    3     -       9701297   9701320   GACTCGAGCATGAAGAAGAAAGG  2=1X1=1D6=1I12=
GAGTCCGAGCAGAAGAAGAANGG  chr21    5     -       46396975  46396998  CAGTCCCAGCAGACGACGGACGG  1X5=1X6=1X2=1X1=1X4=

The start and end are 0-based and half-open (i.e. inclusive of the start, but exclusive of the end), and start is always less than end (regardless of the strand). The match_region reported will be the sequence from the target file when strand is +, or the reverse complement of the sequence from the target file when strand is -, so that it matches the guide sequence. The cigar is always oriented to read left-to-right with the provided guide and match_region sequences.

Note that this searches for approximate occurrences of the guide sequence itself, and not for reverse-complement binding sites. If binding sites are to be found, please reverse-complement the input or output manually.
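The left-to-right orientation of the cigar can be checked by walking it over a guide and its match_region. The sketch below assumes the operation letters seen in the example rows: = (match) and X (mismatch) consume both sequences, D consumes only the guide (in pattern but not in text), and I consumes only the match_region (in text but not in pattern). walk_cigar is a hypothetical helper written for this illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([=XID])")


def walk_cigar(cigar, guide, region):
    """Return (aligned_guide, aligned_region, cost), using '-' for gaps."""
    i = j = cost = 0  # positions in guide and region, and number of edits
    top, bottom = [], []
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "=X":  # consumes both sequences
            top.append(guide[i:i + n])
            bottom.append(region[j:j + n])
            i += n
            j += n
        elif op == "D":  # in the guide but not in the text
            top.append(guide[i:i + n])
            bottom.append("-" * n)
            i += n
        else:  # "I": in the text but not in the guide
            top.append("-" * n)
            bottom.append(region[j:j + n])
            j += n
        if op != "=":
            cost += n
    if i != len(guide) or j != len(region):
        raise ValueError("cigar does not span both sequences")
    return "".join(top), "".join(bottom), cost


guide = "GAGTCCGAGCAGAAGAAGAANGG"
top, bottom, cost = walk_cigar("2=1X1=1D6=1I12=", guide, "GACTCGAGCATGAAGAAGAAAGG")
print(top)
print(bottom)
print(cost)  # 3
```

Printing the two aligned strings one above the other gives a quick visual check of a hit, and the recomputed cost can be compared against the cost column.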
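A manual reverse complement takes only a few lines of Python. The sketch below uses the standard IUPAC complement pairs (A/T, C/G, R/Y, K/M, B/V, D/H, with S, W, and N unchanged) and preserves case; reverse_complement is a hypothetical helper, not part of Sassy, and can be applied to guides before searching or to match_region values afterwards.

```python
COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq):
    """Complement every base (IUPAC-aware) and reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


guides = ["GAGTCCGAGCAGAAGAAGAANGG"]
for g in guides:
    print(reverse_complement(g))
```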

2. Python bindings

PyPI wheels can be installed with:

pip install sassy-rs

import sassy

pattern = b"ACTG"
text = b"ACGGCTACGCAGCATCATCAGCAT"

searcher = sassy.Searcher("dna")  # ascii / dna / iupac
matches = searcher.search(pattern, text, k=1)
for m in matches:
    print(m)

See python/README.md for more details.

3. C library

See c/README.md for details. Quick example:

#include <math.h>    // NAN
#include <stdbool.h> // true
#include <stddef.h>  // size_t
#include <string.h>  // strlen
#include "sassy.h"

int main() {
    const char* pattern = "ACTG";
    const char* text = "ACGGCTACGCAGCATCATCAGCAT";
    // DNA alphabet, with reverse complement, without overhang.
    sassy_SearcherType* searcher = sassy_searcher("dna", true, NAN);
    sassy_Match* out_matches = NULL;
    size_t n_matches = search(searcher,
                              pattern, strlen(pattern),
                              text, strlen(text),
                              1, // k=1
                              &out_matches);
    // Release the matches and the searcher when done.
    sassy_matches_free(out_matches, n_matches);
    sassy_searcher_free(searcher);
}
