Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up

Rust bindings to minimap2 library

License

NotificationsYou must be signed in to change notification settings

jguhlin/minimap2-rs

Repository files navigation

A rust FFI library forminimap2. In development! Feedback appreciated!

https://crates.io/crates/minimap2https://docs.rs/minimap2/latest/minimap2/CircleCIcodecov

Structure

minimap2-sys is the raw FFI bindings to minimap2. minimap2 is the more opinionated, rusty version.

How to use

Requirements

minimap2 ="0.1.23+minimap2.2.28"

Also seeFeatures

Tested with rustc 1.82.0 and nightly. So probably a good idea to upgrade before running. But let me know if you run into pain points with older versions and will try to fix.

Minimap2 Version Table

minimap2-rsminimap2
0.1.232.28
0.1.222.28
0.1.212.28
0.1.202.28
0.1.192.28
0.1.182.28
0.1.172.27
0.1.162.26

Usage

Create an Aligner

letmut aligner =Aligner::builder().map_ont().with_threads(8).with_cigar().with_index("ReferenceFile.fasta",None).expect("Unable to build index");

Align a sequence:

let seq:Vec<u8> =b"ACTGACTCACATCGACTACGACTACTAGACACTAGACTATCGACTACTGACATCGA";let alignment = aligner.map(&seq,false,false,None,None,Some(b"My Sequence Name")).expect("Unable to align");

Presets

All minimap2 presets should be available (seefunctions section):

let aligner =map_ont();let aligner =asm20();

Note Each preset overwrites different arguments. Using multiple at a time is not technically supported, but will work. Results unknown. So be careful!It's equivalent to running minimap2 -x map_ont -x short ...

Customization

MapOpts andIdxOpts can be customized with Rust's struct pattern, as well as applying mapping settings. Inspired bybevy.

letmut aligner:Aligner<PresetSet> =Aligner::builder().map_ont();aligner.mapopt.seed =42;aligner.mapopt.best_n =1;aligner.idxopt.k =21;self.mapopt.flag |=MM_F_COPY_COMMENTasi64;// Setting a flag. If you do this frequently, open an [issue](https://github.com/jguhlin/minimap2-rs/issues/new) asking for an ergonomic function!self.idxopt.flag |=MM_I_HPCasi32;

Seefull list of options below.

Working Example

Examples Directory

There are two working examples directly in this repo. In both instances below, 64 is the number of threads to allocate.

Channel-based multi-threading

cargo run --example channels -- reference_genome.fasta reads.fasta 64

Rayon-based multi-threading

cargo run --example rayon -- reference_genome.fasta reads.fasta 64

Depending on your needs you can probably do just fine with Rayon. But for very large implementations, interactivity, or limited memory, using channels may be the way to go.

Fakeminimap2

There is a binary called "fakeminimap2" which demonstrates basic usage and multithreading using channels or rayon. You can find itin this repo for an example. It it much more fully featured example, with an output interface, some mouse support, and interaction.

Code Examples

Alignment functions return aMapping struct. TheAlignment struct is only returned when theAligner is created using.with_cigar().

A very simple example would be:

letmut file = std::fs::File::open(query_file);letmut reader =BufReader::new(reader);letmut fasta =Fasta::from_buffer(&mut reader)for seqin reader{let seq = seq.unwrap();let alignment:Vec<Mapping> = aligner.map(&seq.sequence.unwrap(),false,false,None,None,None).expect("Unable to align");println!("{:?}", alignment);}

There is a map_file function that works on an entire file, but it is not-lazy and thus not suitable for large files. It may be removed in the future or moved to a separate lib.

let mappings:Result<Vec<Mapping>> = aligner.map_file("query.fa",false,false);

Multithreading

Multithreading is supported, for implementation example seefakeminimap2. Minimap2 also supports threading itself, and will use a minimum of 3 cores for building the index. Multithreading for mapping is left to the end-user.

Adjust the number of threads used to build the index:

letmut aligner =Aligner::builder().map_ont().with_index_threads(8);

Experimental Rayon support

Thisappears to work. Seefakeminimap2 for full implementation.

use rayon::prelude::*;let results = sequences.par_iter().map(|seq|{    aligner.map(seq.as_bytes(),false,false,None,None,None).unwrap()}).collect::<Vec<_>>();

Arc cloning the Aligner

Also works. Otherwise directly cloning the aligner will Arc clone the internal index.

Features

The following crate features are available:

  • map-file - Enables the ability to map a file directly to a reference. Enabled by deafult
  • htslib - Provides an interface to minimap2 that returns rust_htslib::Records
  • simde - Enables SIMD Everywhere library in minimap2
  • zlib-ng - Enables the use of zlib-ng for faster compression
  • curl - Enables curl for htslib
  • static - Builds minimap2 as a static library
  • sse2only - Builds minimap2 with only SSE2 support

Map-file is adefault feature and enabled unless otherwise specified.

Missing Features

Create anissue if you need any of the following:

Potentially others. Please create an issue!

Building for MUSL

Follow theseinstructions.

In brief, using bash shell:

docker pull messense/rust-musl-cross:x86_64-muslalias rust-musl-builder='docker run --rm -it -v "$(pwd)":/home/rust/src messense/rust-musl-cross:x86_64-musl'rust-musl-builder cargo build --release

Minimap2 is tested on x86_64 and aarch64 (arm64). Other platforms may work, please open an issue if minimap2 compiles but minimap2-rs does not.

Features tested with MUSL

  • htslib -Success
  • simde -Success

Tools using this binding

  • Chopper - Long read trimming and filtering
  • mappy-rs - Drop-in multi-threaded replacement for python's mappy
  • HiFiHLA - HLA star-calling tool for PacBio HiFi data
  • STRdust - Tandem repeat genotyper for long reads
  • oarfish - transcript quantification from long-read RNA-seq data
  • lrge - Long Read-based Genome size Estimation from overlaps

Next things todo

  • Iterator interface for map_file
  • -sys Possible to decouple from pthread?

Citation

Please cite the appropriate minimap2 papers if you use this in your work, as well as this library.

DOI for this library

... coming soon ...

Minimap2 Papers

Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences.Bioinformatics,34:3094-3100. [doi:10.1093/bioinformatics/bty191][doi]

and/or:

Li, H. (2021). New strategies to improve minimap2 alignment accuracy.Bioinformatics,37:4572-4574. [doi:10.1093/bioinformatics/btab705][doi2]

Minimap2 Mapping and Indexing Options

Seecustomization for how to use these.

Mapping Options (MapOpt in rust, alias formm_mapopt_t)

Field NameTypeDescription
flagi64Flags to control mapping behavior (bitwise flags).
seedc_intRandom seed for mapping.
sdust_thresc_intThreshold for masking low-complexity regions using SDUST.
max_qlenc_intMaximum query length.
bwc_intBandwidth for alignment of short reads.
bw_longc_intBandwidth for alignment of long reads.
max_gapc_intMaximum gap allowed in mapping.
max_gap_refc_intMaximum gap allowed on the reference.
max_frag_lenc_intMaximum fragment length for paired-end reads.
max_chain_skipc_intMaximum number of seeds to skip in chaining.
max_chain_iterc_intMaximum number of chaining iterations.
min_cntc_intMinimum number of seeds required for a chain.
min_chain_scorec_intMinimum score for a chain to be considered.
chain_gap_scalef32Scaling factor for chain gap penalty.
chain_skip_scalef32Scaling factor for chain skipping.
rmq_size_capc_intSize cap for RMQ (Range Minimum Query).
rmq_inner_distc_intInner distance for RMQ rescue.
rmq_rescue_sizec_intSize threshold for RMQ rescue.
rmq_rescue_ratiof32Rescue ratio for RMQ.
mask_levelf32Level at which to mask repetitive seeds.
mask_lenc_intLength of sequences to mask.
pri_ratiof32Ratio threshold for primary alignment selection.
best_nc_intMaximum number of best alignments to retain.
alt_dropf32Score drop ratio for alternative mappings.
ac_intMatch score.
bc_intMismatch penalty.
qc_intGap open penalty.
ec_intGap extension penalty.
q2c_intGap open penalty for long gaps.
e2c_intGap extension penalty for long gaps.
transitionc_intPenalty for transitions in spliced alignment.
sc_ambic_intScore for ambiguous bases.
noncanc_intAllow non-canonical splicing (boolean flag).
junc_bonusc_intBonus score for junctions.
zdropc_intZ-drop score for alignment extension stopping.
zdrop_invc_intInverse Z-drop score.
end_bonusc_intBonus score for aligning to the ends of sequences.
min_dp_maxc_intMinimum score to consider a DP alignment valid.
min_ksw_lenc_intMinimum length for performing Smith-Waterman alignment.
anchor_ext_lenc_intLength for anchor extension.
anchor_ext_shiftc_intShift for anchor extension.
max_clip_ratiof32Maximum allowed clipping ratio.
rank_min_lenc_intMinimum length for rank filtering.
rank_fracf32Fraction for rank filtering.
pe_oric_intExpected orientation of paired-end reads.
pe_bonusc_intBonus score for proper paired-end alignment.
mid_occ_fracf32Fraction for mid-occurrence filtering.
q_occ_fracf32Fraction for query occurrence filtering.
min_mid_occi32Minimum mid-occurrence threshold.
max_mid_occi32Maximum mid-occurrence threshold.
mid_occi32Mid-occurrence cutoff value.
max_occi32Maximum occurrence cutoff value.
max_max_occi32Maximum allowed occurrence value.
occ_disti32Distribution of occurrences for filtering.
mini_batch_sizei64Size of mini-batches for processing.
max_sw_mati64Maximum size of Smith-Waterman matrices.
cap_kalloci64Memory allocation cap for kalloc.
split_prefix*const c_charPrefix for splitting output files.

Mapping Flags (MM_F_*)

These can be set with helper functions. Please see the docs for IdxOpt and MapOpt.

Flag ConstantValueDescription
MM_F_NO_DIAG1Skip seed pairs on the same diagonal.
MM_F_NO_DUAL2Do not compute reverse complement of seeds.
MM_F_CIGAR4Compute CIGAR string.
MM_F_OUT_SAM8Output alignments in SAM format.
MM_F_NO_QUAL16Do not output base quality in SAM.
MM_F_OUT_CG32Output CIGAR in CG format (Compact CIGAR).
MM_F_OUT_CS64Output cs tag (difference string) in SAM/PAF.
MM_F_SPLICE128Enable spliced alignment (for RNA-seq).
MM_F_SPLICE_FOR256Only consider the forward strand for spliced alignment.
MM_F_SPLICE_REV512Only consider the reverse strand for spliced alignment.
MM_F_NO_LJOIN1024Disable long join for gapped alignment.
MM_F_OUT_CS_LONG2048Output cs tag in long format.
MM_F_SR4096Perform split read alignment (for short reads).
MM_F_FRAG_MODE8192Fragment mode for paired-end reads.
MM_F_NO_PRINT_2ND16384Do not output secondary alignments.
MM_F_2_IO_THREADS32768Use two I/O threads during mapping.
MM_F_LONG_CIGAR65536Use long CIGAR (>65535 operations).
MM_F_INDEPEND_SEG131072Map segments independently in multiple mapping.
MM_F_SPLICE_FLANK262144Add flanking bases for spliced alignment.
MM_F_SOFTCLIP524288Perform soft clipping at ends.
MM_F_FOR_ONLY1048576Only map the forward strand of the query.
MM_F_REV_ONLY2097152Only map the reverse complement of the query.
MM_F_HEAP_SORT4194304Use heap sort for mapping.
MM_F_ALL_CHAINS8388608Output all chains (may include suboptimal chains).
MM_F_OUT_MD16777216Output MD tag in SAM.
MM_F_COPY_COMMENT33554432Copy comment from FASTA/Q to SAM output.
MM_F_EQX67108864Use =/X instead of M in CIGAR.
MM_F_PAF_NO_HIT134217728Output unmapped reads in PAF format.
MM_F_NO_END_FLT268435456Disable end flanking region filtering.
MM_F_HARD_MLEVEL536870912Hard mask low-complexity regions.
MM_F_SAM_HIT_ONLY1073741824Output only alignments in SAM (no headers).
MM_F_RMQ2147483648Use RMQ for read mapping quality estimation.
MM_F_QSTRAND4294967296Consider query strand in mapping.
MM_F_NO_INV8589934592Disable inversion in alignment.
MM_F_NO_HASH_NAME17179869184Do not hash read names (for reproducibility).
MM_F_SPLICE_OLD34359738368Use old splice alignment model.
MM_F_SECONDARY_SEQ68719476736Output sequence of secondary alignments.
MM_F_OUT_DS137438953472Output detailed alignment score (ds tag).

Index Options (IdxOpt in rust, alias formm_idxopt_t)

Field NameTypeDescription
kc_shortK-mer size (mer length).
wc_shortMinimizer window size.
flagc_shortFlags to control indexing behavior (bitwise flags).
bucket_bitsc_shortNumber of bits for the size of hash table buckets.
mini_batch_sizei64Size of mini-batches for indexing (number of bases).
batch_sizeu64Total batch size for indexing (number of bases).

Indexing Flags (MM_I_*)

These can be set with helper functions. Please see the docs for IdxOpt and MapOpt.

Flag ConstantValueDescription
MM_I_HPC1Use homopolymer-compressed k-mers for indexing.
MM_I_NO_SEQ2Do not store sequences in the index.
MM_I_NO_NAME4Do not store sequence names in the index.

Changelog

SeeCHANGELOG.md

Funding

Genomics Aotearoa


[8]ページ先頭

©2009-2025 Movatter.jp