mr-c/minimap2

forked from lh3/minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences

lh3.github.io/minimap2

Getting Started

    git clone https://github.com/lh3/minimap2
    cd minimap2 && make
    # long sequences against a reference genome
    ./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
    # create an index first and then map
    ./minimap2 -x map-ont -d MT-human-ont.mmi test/MT-human.fa
    ./minimap2 -a MT-human-ont.mmi test/MT-orang.fa > test.sam
    # use presets (no test data)
    ./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam       # PacBio CLR genomic reads
    ./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam         # Oxford Nanopore genomic reads
    ./minimap2 -ax map-hifi ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.19 or later)
    ./minimap2 -ax asm20 ref.fa pacbio-ccs.fq.gz > aln.sam    # PacBio HiFi/CCS genomic reads (v2.18 or earlier)
    ./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam      # short genomic paired-end reads
    ./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam       # spliced long reads (strand unknown)
    ./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam  # noisy Nanopore Direct RNA-seq
    ./minimap2 -ax splice:hq -uf ref.fa query.fa > aln.sam    # Final PacBio Iso-seq or traditional cDNA
    ./minimap2 -ax splice --junc-bed anno.bed12 ref.fa query.fa > aln.sam  # prioritize on annotated junctions
    ./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf             # intra-species asm-to-asm alignment
    ./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf     # PacBio read overlap
    ./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf    # Nanopore read overlap
    # man page for detailed command line options
    man ./minimap2.1

Users' Guide

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy read sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignments ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 paper or the preprint.

Installation

Minimap2 is optimized for x86-64 CPUs. You can acquire precompiled binaries from the release page with:

    curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
    ./minimap2-2.26_x64-linux/minimap2

If you want to compile from the source, you need to have a C compiler, GNU make and zlib development files installed. Then type make in the source code directory to compile. If you see compilation errors, try make sse2only=1 to disable SSE4 code, which will make minimap2 slightly slower.

Minimap2 also works with ARM CPUs supporting the NEON instruction sets. To compile for 32 bit ARM architectures (such as ARMv7), use make arm_neon=1. To compile for 64 bit ARM architectures (such as ARMv8), use make arm_neon=1 aarch64=1.

Minimap2 can use the SIMD Everywhere (SIMDe) library to port its implementation to different SIMD instruction sets. To compile using SIMDe, use make -f Makefile.simde. To compile for ARM CPUs, use Makefile.simde with the ARM related command lines given above.

General usage

Without any options, minimap2 takes a reference database and a query sequence file as input and produces approximate mapping, without base-level alignment (i.e. coordinates are only approximate and no CIGAR in output), in the PAF format:

    minimap2 ref.fa query.fq > approx-mapping.paf

You can ask minimap2 to generate CIGAR at the cg tag of PAF with:

    minimap2 -c ref.fa query.fq > alignment.paf

or to output alignments in the SAM format:

    minimap2 -a ref.fa query.fq > alignment.sam

Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You don't need to convert between FASTA and FASTQ or decompress gzip'd files first.
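For downstream scripting, a PAF line can be split into its twelve mandatory columns plus optional tags. The following is a minimal sketch, not part of minimap2 or paftools.js; the names PafRecord and parse_paf_line are hypothetical, and the sample line in the usage note is made up for illustration.

```python
from typing import Dict, NamedTuple, Union


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns plus a dict of optional tags."""
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    n_match: int
    block_len: int
    mapq: int
    tags: Dict[str, Union[int, float, str]]


def parse_paf_line(line: str) -> PafRecord:
    """Parse one tab-delimited PAF line (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    tags: Dict[str, Union[int, float, str]] = {}
    for field in cols[12:]:
        # Optional fields use the SAM-like TAG:TYPE:VALUE layout.
        key, typ, value = field.split(":", 2)
        if typ == "i":
            tags[key] = int(value)
        elif typ == "f":
            tags[key] = float(value)
        else:
            tags[key] = value
    return PafRecord(
        cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4],
        cols[5], int(cols[6]), int(cols[7]), int(cols[8]),
        int(cols[9]), int(cols[10]), int(cols[11]), tags,
    )
```

With option -c, the CIGAR would then be available as rec.tags["cg"]; without it, the cg key is simply absent.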

For the human reference genome, minimap2 takes a few minutes to generate a minimizer index for the reference before mapping. To reduce indexing time, you can optionally save the index with option -d and replace the reference sequence file with the index file on the minimap2 command line:

    minimap2 -d ref.mmi ref.fa                     # indexing
    minimap2 -a ref.mmi reads.fq > alignment.sam   # alignment

Note that once you build the index, indexing parameters such as -k, -w, -H and -I can't be changed during mapping. If you are running minimap2 for different data types, you will probably need to keep multiple indexes generated with different parameters. This makes minimap2 different from BWA, which always uses the same index regardless of query data types.
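To see why -k and -w are tied to the index, it helps to look at how a (k,w)-minimizer set depends on both values. The sketch below is a toy illustration only: it picks the lexicographically smallest k-mer in every window of w consecutive k-mers on one strand, which is an assumption made for readability and not minimap2's actual selection rule. The function name minimizers is hypothetical.

```python
from typing import List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Toy (k,w)-minimizers: in each window of w consecutive k-mers,
    keep the lexicographically smallest one (leftmost on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Tie-break on position so the leftmost smallest k-mer wins.
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]
```

Calling minimizers("ACGTACGTGA", 3, 3) and minimizers("ACGTACGTGA", 4, 2) selects different positions and different k-mers from the same sequence, so seeds stored under one (k,w) cannot be looked up with another.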

Use cases

Minimap2 uses the same base algorithm for all applications. However, due to the different data types it supports (e.g. short vs long reads; DNA vs mRNA reads), minimap2 needs to be tuned for optimal performance and accuracy. It is usually recommended to choose a preset with option -x, which sets multiple parameters at the same time. The default setting is the same as map-ont.

Map long noisy genomic reads

    minimap2 -ax map-pb  ref.fa pacbio-reads.fq > aln.sam   # for PacBio CLR reads
    minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam      # for Oxford Nanopore reads
    minimap2 -ax map-iclr ref.fa iclr-reads.fq > aln.sam    # for Illumina Complete Long Reads

The difference between map-pb and map-ont is that map-pb uses homopolymer-compressed (HPC) minimizers as seeds, while map-ont uses ordinary minimizers as seeds. Empirical evaluation suggests HPC minimizers improve performance and sensitivity when aligning PacBio CLR reads, but hurt when aligning Nanopore reads. map-iclr uses an adjusted alignment scoring matrix that accounts for the low overall error rate in the reads, with transversion errors being less frequent than transitions.
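Homopolymer compression itself is simple to illustrate: every run of identical bases is collapsed to a single base before k-mers are taken. The sketch below shows only that compression step on a plain string; the function name hpc is hypothetical and this is not how minimap2 implements it internally.

```python
from itertools import groupby


def hpc(seq: str) -> str:
    """Collapse each run of identical bases into one base."""
    return "".join(base for base, _run in groupby(seq))
```

For example, hpc("GATTACCCA") gives "GATACA". Two reads that differ only in homopolymer run lengths compress to the same string and therefore share the same seeds.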

Map long mRNA/cDNA reads

    minimap2 -ax splice:hq -uf ref.fa iso-seq.fq > aln.sam        # PacBio Iso-seq/traditional cDNA
    minimap2 -ax splice ref.fa nanopore-cdna.fa > aln.sam         # Nanopore 2D cDNA-seq
    minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam   # Nanopore Direct RNA-seq
    minimap2 -ax splice --splice-flank=no SIRV.fa SIRV-seq.fa     # mapping against SIRV control

There are different long-read RNA-seq technologies, including traditional full-length cDNA, EST, PacBio Iso-seq, Nanopore 2D cDNA-seq and Direct RNA-seq. They produce data of varying quality and properties. By default, -x splice assumes the read orientation relative to the transcript strand is unknown. It tries two rounds of alignment to infer the orientation and writes the strand to the ts SAM/PAF tag if possible. For Iso-seq, Direct RNA-seq and traditional full-length cDNAs, it is advisable to apply -u f to force minimap2 to consider the forward transcript strand only. This speeds up alignment with a slight improvement in accuracy. For noisy Nanopore Direct RNA-seq reads, it is recommended to use a smaller k-mer size for increased sensitivity to the first or the last exons.

Minimap2 rates an alignment by the score of the max-scoring sub-segment, excluding introns, and marks the best alignment as primary in SAM. When a spliced gene also has unspliced pseudogenes, minimap2 does not intentionally prefer spliced alignment, though in practice it more often marks the spliced alignment as the primary. By default, minimap2 outputs up to five secondary alignments (i.e. likely pseudogenes in the context of RNA-seq mapping). This can be tuned with option -N.

For long RNA-seq reads, minimap2 may produce chimeric alignments potentially caused by gene fusions/structural variations or by an intron longer than the max intron length -G (200k by default). For now, it is not recommended to apply an excessively large -G as this slows down minimap2 and sometimes leads to false alignments.

By default, -x splice prefers GT[A/G]..[C/T]AG over GT[C/T]..[A/G]AG, and then over other splicing signals. Considering one additional base improves the junction accuracy for noisy reads, but reduces the accuracy when aligning against the widely used SIRV control data. This is because SIRV does not honor the evolutionarily conservative splicing signal. If you are studying SIRV, you may apply --splice-flank=no to let minimap2 only model GT..AG, ignoring the additional base.
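The three-tier preference described above can be made concrete with a small classifier over the first and last three bases of an intron. This is a literal reading of the paragraph, not minimap2's scoring code: the function name, the returned labels and the choice to put mixed cases (for example GTA..AAG) into the "other" tier are all assumptions of this sketch.

```python
def splice_signal_class(intron: str, splice_flank: bool = True) -> str:
    """Classify an intron by its splice signal on the transcript strand.

    Returns "preferred" for GT[A/G]..[C/T]AG, "second" for
    GT[C/T]..[A/G]AG and "other" for everything else. With
    splice_flank=False only GT..AG is modelled and returned as "GT..AG".
    """
    s = intron.upper()
    if len(s) < 6:
        raise ValueError("intron too short to carry both signals")
    donor, acceptor = s[:3], s[-3:]
    if donor[:2] != "GT" or acceptor[1:] != "AG":
        return "other"
    if not splice_flank:
        return "GT..AG"  # the additional flanking base is ignored
    if donor[2] in "AG" and acceptor[0] in "CT":
        return "preferred"
    if donor[2] in "CT" and acceptor[0] in "AG":
        return "second"
    # Mixed flanks: the text does not say how these rank (assumption).
    return "other"
```

With splice_flank=False, the sketch treats every GT..AG intron alike, which mirrors what --splice-flank=no is described as doing for SIRV data.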

Since v2.17, minimap2 can optionally take annotated genes as input and prioritize annotated splice junctions. To use this feature, you can run:

    paftools.js gff2bed anno.gff > anno.bed
    minimap2 -ax splice --junc-bed anno.bed ref.fa query.fa > aln.sam

Here, anno.gff is the gene annotation in the GTF or GFF3 format (gff2bed automatically tests the format). The output of gff2bed is in the 12-column BED format, or the BED12 format. With the --junc-bed option, minimap2 adds a bonus score (tuned by --junc-bonus) if an aligned junction matches a junction in the annotation. Option --junc-bed also takes 5-column BED, including the strand field. In this case, each line indicates an oriented junction.
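To make the notion of a junction in a BED12 line concrete, the sketch below derives intron intervals from the standard BED12 block columns (blockCount, blockSizes, blockStarts). It is an illustration of the file format only, not code from minimap2 or paftools.js; the function name bed12_junctions and the sample transcript in the usage note are hypothetical.

```python
from typing import List, Tuple


def bed12_junctions(line: str) -> List[Tuple[str, int, int, str]]:
    """Return (chrom, intron_start, intron_end, strand) for each gap
    between consecutive blocks of one BED12 line (0-based, half-open)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected 12 BED columns")
    chrom, chrom_start, strand = f[0], int(f[1]), f[5]
    n_blocks = int(f[9])
    # Block lists are comma-separated and may end with a trailing comma.
    sizes = [int(x) for x in f[10].rstrip(",").split(",")][:n_blocks]
    starts = [int(x) for x in f[11].rstrip(",").split(",")][:n_blocks]
    junctions = []
    for i in range(n_blocks - 1):
        intron_start = chrom_start + starts[i] + sizes[i]
        intron_end = chrom_start + starts[i + 1]
        junctions.append((chrom, intron_start, intron_end, strand))
    return junctions
```

A three-exon transcript on chr1 starting at 1000 with blockSizes 100,200,300 and blockStarts 0,1500,3700 yields the two introns chr1:1100-2500 and chr1:2700-4700.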

Find overlaps between long reads

    minimap2 -x ava-pb  reads.fq reads.fq > ovlp.paf   # PacBio CLR read overlap
    minimap2 -x ava-ont reads.fq reads.fq > ovlp.paf   # Oxford Nanopore read overlap

Similarly, ava-pb uses HPC minimizers while ava-ont uses ordinary minimizers. It is usually not recommended to perform base-level alignment in the overlapping mode because it is slow and may produce false positive overlaps. However, if performance is not a concern, you may try to add -a or -c anyway.

Map short accurate genomic reads

    minimap2 -ax sr ref.fa reads-se.fq > aln.sam            # single-end alignment
    minimap2 -ax sr ref.fa read1.fq read2.fq > aln.sam      # paired-end alignment
    minimap2 -ax sr ref.fa reads-interleaved.fq > aln.sam   # paired-end alignment

When two read files are specified, minimap2 reads from each file in turn and merges them into an interleaved stream internally. Two reads are considered to be paired if they are adjacent in the input stream and have the same name (with the /[0-9] suffix trimmed if present). Single- and paired-end reads can be mixed.
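The pairing rule can be sketched on read names alone. The code below is a minimal illustration under the assumptions stated in the paragraph above (adjacent reads, same name after trimming a /[0-9] suffix); the function names interleave, trim_name and group_reads are hypothetical and do not correspond to minimap2 internals.

```python
import re
from typing import List, Sequence, Tuple

_SUFFIX = re.compile(r"/[0-9]$")


def trim_name(name: str) -> str:
    """Drop a trailing /[0-9] suffix such as /1 or /2 if present."""
    return _SUFFIX.sub("", name)


def interleave(file1: Sequence[str], file2: Sequence[str]) -> List[str]:
    """Take one read from each input in turn."""
    return [name for pair in zip(file1, file2) for name in pair]


def group_reads(names: Sequence[str]) -> List[Tuple[str, ...]]:
    """Group an input stream into pairs and singletons."""
    groups: List[Tuple[str, ...]] = []
    i = 0
    while i < len(names):
        # Adjacent reads with the same trimmed name form a pair.
        if i + 1 < len(names) and trim_name(names[i]) == trim_name(names[i + 1]):
            groups.append((names[i], names[i + 1]))
            i += 2
        else:
            groups.append((names[i],))
            i += 1
    return groups
```

A stream such as r1/1, r1/2, r2, r3/1, r3/2 is grouped into a pair, a single-end read and another pair, which is how mixed single- and paired-end input can coexist.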

Minimap2 does not work well with short spliced reads. There are many capable RNA-seq mappers for short reads.

Full genome/assembly alignment

    minimap2 -ax asm5 ref.fa asm.fa > aln.sam   # assembly to assembly/ref alignment

For cross-species full-genome alignment, the scoring system needs to be tuned according to the sequence divergence.

Advanced features

Working with >65535 CIGAR operations

Due to a design flaw, BAM does not work with CIGAR strings with >65535 operations (SAM and CRAM work). However, for ultra-long nanopore reads minimap2 may align ~1% of read bases with long CIGARs beyond the capability of BAM. If you convert such SAM/CRAM to BAM, Picard and recent samtools will throw an error and abort. Older samtools and other tools may create corrupted BAM.

To avoid this issue, you can add option -L at the minimap2 command line. This option moves a long CIGAR to the CG tag and leaves a fully clipped CIGAR at the SAM CIGAR column. Current tools that don't read CIGAR (e.g. merging and sorting) still work with such BAM records; tools that read CIGAR will effectively ignore these records. It has been decided that future tools will seamlessly recognize long-cigar records generated by option -L.

TL;DR: if you work with ultra-long reads and use tools that only process BAM files, please add option -L.
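If you want to check whether an existing SAM file contains records that would hit this limit, counting CIGAR operations is enough. The following is a small sketch rather than a replacement for option -L; the function names are hypothetical and the 65535 threshold is the one quoted above.

```python
import re

BAM_MAX_CIGAR_OPS = 65535  # limit quoted in the text above

_CIGAR_OP = re.compile(r"[0-9]+[MIDNSHP=X]")


def count_cigar_ops(cigar: str) -> int:
    """Count operations in a SAM CIGAR string ('*' means no CIGAR)."""
    if cigar == "*":
        return 0
    ops = _CIGAR_OP.findall(cigar)
    # Reject strings with characters outside valid length/op pairs.
    if "".join(ops) != cigar:
        raise ValueError("malformed CIGAR: %r" % cigar)
    return len(ops)


def exceeds_bam_limit(cigar: str) -> bool:
    """True if the CIGAR has more operations than the limit above."""
    return count_cigar_ops(cigar) > BAM_MAX_CIGAR_OPS
```

Applied to column 6 of each SAM record, exceeds_bam_limit flags the alignments that would need the treatment described above before conversion to BAM.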

The cs optional tag

The cs SAM/PAF tag encodes bases at mismatches and INDELs. It matches the regular expression /(:[0-9]+|\*[a-z][a-z]|[=\+\-][A-Za-z]+)+/. Like CIGAR, cs consists of a series of operations. Each leading character specifies the operation; the following sequence is the one involved in the operation.

The cs tag is enabled by command line option --cs. The following alignment, for example:

    CGATCGATAAATAGAGTAG---GAATAGCA
    ||||||   ||||||||||   |||| |||
    CGATCG---AATAGAGTAGGTCGAATtGCA

is represented as :6-ata:10+gtc:4*at:3, where :[0-9]+ represents an identical block, -ata represents a deletion, +gtc an insertion and *at indicates that reference base a is substituted with query base t. It is similar to the MD SAM tag but is standalone and easier to parse.
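Because cs follows the regular expression given above, it can be tokenized with a few lines of code. The sketch below splits a cs string into (operation, argument) pairs and sums the reference and query lengths it covers. The function names parse_cs and cs_lengths are hypothetical, and the intron operations described in the manpage are deliberately not handled here.

```python
import re
from typing import List, Tuple

_CS_VALID = re.compile(r"(:[0-9]+|\*[a-z][a-z]|[=+\-][A-Za-z]+)+")
_CS_OP = re.compile(r":[0-9]+|\*[a-z][a-z]|[=+\-][A-Za-z]+")


def parse_cs(cs: str) -> List[Tuple[str, str]]:
    """Split a cs string into (leading character, argument) pairs."""
    if not _CS_VALID.fullmatch(cs):
        raise ValueError("not a valid cs string: %r" % cs)
    return [(tok[0], tok[1:]) for tok in _CS_OP.findall(cs)]


def cs_lengths(cs: str) -> Tuple[int, int]:
    """Return (reference bases, query bases) covered by the alignment."""
    ref = qry = 0
    for op, arg in parse_cs(cs):
        if op == ":":            # identical block given as a length
            ref += int(arg)
            qry += int(arg)
        elif op == "=":          # identical block given as sequence
            ref += len(arg)
            qry += len(arg)
        elif op == "*":          # substitution: one base on each side
            ref += 1
            qry += 1
        elif op == "-":          # deletion: bases present in reference only
            ref += len(arg)
        elif op == "+":          # insertion: bases present in query only
            qry += len(arg)
    return ref, qry
```

For the example string :6-ata:10+gtc:4*at:3, the parser yields seven operations and both sequences span 27 bases, which matches the ungapped lengths of the two rows in the alignment shown above.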

If --cs=long is used, the cs string also contains identical sequences in the alignment. The above example will become =CGATCG-ata=AATAGAGTAG+gtc=GAAT*at=GCA. The long form of cs encodes both reference and query sequences in one string. The cs tag also encodes intron positions and splicing signals (see the minimap2 manpage for details).
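Since the long form carries both sequences, the aligned reference and query segments can be rebuilt from it directly. The following sketch does that for the operations covered by the regular expression in this section; the function name sequences_from_long_cs is hypothetical, and intron operations are not handled.

```python
import re
from typing import Tuple

_CS_OP = re.compile(r"=[A-Za-z]+|:[0-9]+|\*[a-z][a-z]|[+\-][A-Za-z]+")


def sequences_from_long_cs(cs: str) -> Tuple[str, str]:
    """Rebuild (reference, query) segments from a long-form cs string."""
    ref, qry = [], []
    pos = 0
    for m in _CS_OP.finditer(cs):
        if m.start() != pos:
            raise ValueError("unexpected characters in cs string")
        pos = m.end()
        op, arg = m.group()[0], m.group()[1:]
        if op == "=":        # identical bases on both sequences
            ref.append(arg)
            qry.append(arg)
        elif op == "*":      # first letter is reference, second is query
            ref.append(arg[0])
            qry.append(arg[1])
        elif op == "-":      # deletion: present in the reference only
            ref.append(arg)
        elif op == "+":      # insertion: present in the query only
            qry.append(arg)
        else:                # ':' stores a length, not the bases
            raise ValueError("short-form ':' block; use --cs=long output")
    if pos != len(cs):
        raise ValueError("unexpected characters in cs string")
    return "".join(ref).upper(), "".join(qry).upper()
```

Applied to =CGATCG-ata=AATAGAGTAG+gtc=GAAT*at=GCA, this returns the two ungapped sequences of the example alignment.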

Working with the PAF format

Minimap2 also comes with a (java)script paftools.js that processes alignments in the PAF format. It calls variants from assembly-to-reference alignment, lifts over BED files based on alignment, converts between formats and provides utilities for various evaluations. For details, please see misc/README.md.

Algorithm overview

In the following, minimap2 command line options are written with a leading dash. The description may help to tune minimap2 parameters.

1. Read -I [=4G] reference bases, extract (-k,-w)-minimizers and index them in a hash table.
2. Read -K [=200M] query bases. For each query sequence, do step 3 through 7:
3. For each (-k,-w)-minimizer on the query, check against the reference index. If a reference minimizer is not among the top -f [=2e-4] most frequent, collect its occurrences in the reference, which are called seeds.
4. Sort seeds by position in the reference. Chain them with dynamic programming. Each chain represents a potential mapping. For read overlapping, report all chains and then go to step 8. For reference mapping, do step 5 through 7:
5. Let P be the set of primary mappings, which is an empty set initially. For each chain from the best to the worst according to their chaining scores: if on the query, the chain overlaps with a chain in P by --mask-level [=0.5] or higher fraction of the shorter chain, mark the chain as secondary to the chain in P; otherwise, add the chain to P.
6. Retain all primary mappings. Also retain up to -N [=5] top secondary mappings if their chaining scores are higher than -p [=0.8] of their corresponding primary mappings.
7. If alignment is requested, filter out an internal seed if it potentially leads to both a long insertion and a long deletion. Extend from the left-most seed. Perform global alignments between internal seeds. Split the chain if the accumulative score along the global alignment drops by -z [=400], disregarding long gaps. Extend from the right-most seed. Output chains and their alignments.
8. If there are more query sequences in the input, go to step 2 until no more queries are left.
9. If there are more reference sequences, reopen the query file from the start and go to step 1; otherwise stop.
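The primary/secondary selection in the steps on P and on -N/-p can be sketched on query intervals and chaining scores alone. This is a minimal illustration, not minimap2's implementation. It assumes that a chain overlapping several primaries is attached to the highest-scoring one and that the -N cap applies to the whole read; neither detail is spelled out above. Chain, overlap_fraction and select_mappings are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Chain(NamedTuple):
    qstart: int  # start on the query
    qend: int    # end on the query (exclusive)
    score: int   # chaining score


def overlap_fraction(a: Chain, b: Chain) -> float:
    """Query overlap as a fraction of the shorter chain."""
    overlap = min(a.qend, b.qend) - max(a.qstart, b.qstart)
    shorter = min(a.qend - a.qstart, b.qend - b.qstart)
    return max(overlap, 0) / shorter if shorter > 0 else 0.0


def select_mappings(
    chains: List[Chain],
    mask_level: float = 0.5,   # --mask-level
    max_secondary: int = 5,    # -N
    min_ratio: float = 0.8,    # -p
) -> Tuple[List[Chain], List[Chain]]:
    """Return (primary chains, retained secondary chains)."""
    primaries: List[Chain] = []
    secondaries: List[Tuple[Chain, Chain]] = []  # (chain, its primary)
    for c in sorted(chains, key=lambda c: c.score, reverse=True):
        parent = next(
            (p for p in primaries if overlap_fraction(c, p) >= mask_level),
            None,
        )
        if parent is None:
            primaries.append(c)
        else:
            secondaries.append((c, parent))
    # Secondaries are already in descending score order.
    kept = [c for c, p in secondaries if c.score > min_ratio * p.score]
    return primaries, kept[:max_secondary]
```

With chains scoring 900, 800 and 500 on the same query region and one scoring 400 elsewhere, the 900 and 400 chains become primary, the 800 chain is kept as secondary (800 > 0.8 x 900) and the 500 chain is dropped.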

Getting help

The manpage minimap2.1 provides a detailed description of minimap2 command line options and optional tags. The FAQ page answers several frequently asked questions. If you encounter bugs or have further questions or requests, you can raise an issue at the issue page. There is no dedicated mailing list for the time being.

Citing minimap2

If you use minimap2 in your work, please cite:

Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191

and/or:

Li, H. (2021). New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705

Developers' Guide

Minimap2 is not only a command line tool, but also a programming library. It provides C APIs to build/load an index and to align sequences against the index. File example.c demonstrates typical uses of the C APIs. Header file minimap.h gives more detailed API documentation. Minimap2 aims to keep APIs in this header stable. File mmpriv.h contains additional private APIs which may be subject to frequent changes.

This repository also provides Python bindings to a subset of the C APIs. File python/README.rst gives the full documentation; python/minimap2.py shows an example. This Python extension, mappy, is also available from PyPI via pip install mappy or from BioConda via conda install -c bioconda mappy.

Limitations

Minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal. This should not be a big concern because even the optimal alignment may be wrong in such regions.
Minimap2 requires SSE2 instructions on x86 CPUs or NEON on ARM CPUs. It is possible to add non-SIMD support, but it would make minimap2 slower by several times.
Minimap2 does not work with a single query or database sequence ~2 billion bases or longer (2,147,483,647 to be exact). The total length of all sequences can well exceed this threshold.
Minimap2 often misses small exons.
