kdm9/miniprot

Forked from lh3/miniprot

Align proteins to genomes with splicing and frameshift

 
 


Getting Started

# download and compile
git clone https://github.com/lh3/miniprot
cd miniprot && make

# test file
./miniprot test/DPP3-hs.gen.fa.gz test/DPP3-mm.pep.fa.gz > aln.paf  # PAF output
./miniprot --gff test/DPP3-hs.gen.fa.gz test/DPP3-mm.pep.fa.gz > aln.gff  # GFF3+PAF output

# general command line: index and align in one go (-I sets max intron size based on genome size)
./miniprot -Iut16 --gff genome.fna protein.faa > aln.gff

# general command line: index first and then align (recommended)
./miniprot -t16 -d genome.mpi genome.fna
./miniprot -Iut16 --gff genome.mpi protein.faa > aln.gff

# output format
man ./miniprot.1


Introduction

Miniprot aligns a protein sequence against a genome with affine gap penalty, splicing and frameshift. It is primarily intended for annotating protein-coding genes in a new species using known genes from other species. Miniprot is similar to GeneWise and Exonerate in functionality, but it can map proteins to whole genomes and is much faster at the residue alignment step.

Miniprot is not optimized for mapping distant homologs because distant homologs are less informative for gene annotation. Nonetheless, it is still possible to tune seeding parameters to achieve higher sensitivity at the cost of performance.

Users' Guide

Installation

Miniprot requires SSE2 or NEON instructions and only works on x86_64 or ARM CPUs. It depends on zlib for parsing gzip'd input files. To compile miniprot, type make in the source code directory. This will produce a standalone executable miniprot. This executable is all you need to invoke miniprot.

For some unknown reason, the default gcc-4.8.5 on CentOS 7 may compile a binary that is very slow on certain sequences, but gcc-10.3.0 has more stable performance. If possible, use a more recent gcc to compile miniprot.

Usage

To run miniprot, use

miniprot -t8 ref-file protein.faa > output.paf

where ref-file can either be a genome in the FASTA format or a pre-built index generated by

miniprot -t8 -d ref.mpi ref.fna

Because miniprot indexing is slow and memory intensive, it is recommended to pre-build the index. FASTA input files can be optionally compressed with gzip.
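The index-then-align workflow can also be driven from a script. The following is a minimal Python sketch, not part of miniprot: the function names miniprot_commands and run_pipeline and the default file names are hypothetical, and only the -t, -d and --gff options shown in this README are used. It assumes a miniprot executable is on PATH.

```python
import shlex
import subprocess

def miniprot_commands(genome, proteins, index="ref.mpi", threads=8, gff=False):
    """Return two argv lists: one to build the index, one to align against it."""
    build = ["miniprot", f"-t{threads}", "-d", index, genome]
    align = ["miniprot", f"-t{threads}"]
    if gff:
        align.append("--gff")  # GFF3 output instead of plain PAF
    align += [index, proteins]
    return build, align

def run_pipeline(genome, proteins, out_path, **kwargs):
    """Build the index once, then align and write stdout to out_path."""
    build, align = miniprot_commands(genome, proteins, **kwargs)
    subprocess.run(build, check=True)
    with open(out_path, "w") as out:
        subprocess.run(align, check=True, stdout=out)

# Show the commands without running them.
build, align = miniprot_commands("ref.fna", "protein.faa")
print(shlex.join(build))
print(shlex.join(align))
```

Keeping the index file around means that later runs against the same genome only need the second command.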

Miniprot outputs alignment in the protein PAF format. Different from the more common nucleotide PAF format, miniprot uses more CIGAR operators to encode introns and frameshifts. Please refer to the manpage for detailed explanation.
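As a rough illustration of reading such output, the sketch below parses the 12 standard PAF columns and splits a CIGAR string into (length, operator) pairs without interpreting the operators; their meaning is defined in the manpage. The sample line is made up for this example and is not real miniprot output, and the assumption that the CIGAR is stored in a cg:Z: tag follows the usual PAF convention.

```python
import re
from typing import NamedTuple

class PafRecord(NamedTuple):
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int
    tags: dict

def parse_paf_line(line):
    """Parse one tab-delimited PAF line into a PafRecord."""
    f = line.rstrip("\n").split("\t")
    tags = {}
    for t in f[12:]:
        key, typ, val = t.split(":", 2)  # e.g. "cg:Z:50M100N50M"
        tags[key] = int(val) if typ == "i" else float(val) if typ == "f" else val
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                     int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]), tags)

def cigar_ops(cigar):
    """Split a CIGAR string into (length, operator) pairs, uninterpreted."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([A-Za-z])", cigar)]

# Made-up example line, for illustration only.
sample = "prot1\t100\t0\t100\t+\tchr1\t5000\t1000\t1400\t90\t300\t0\tcg:Z:50M100N50M"
rec = parse_paf_line(sample)
print(rec.query, rec.target, cigar_ops(rec.tags["cg"]))
```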

For convenience, miniprot can also output GFF3 with option --gff:

miniprot -t8 --gff ref.mpi protein.faa > out.gff

The detailed alignment is embedded in ##PAF lines in the GFF3 output. You can also get detailed residue alignment with --aln.
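Since the GFF3 output mixes ##PAF lines with ordinary feature lines, a downstream script usually has to separate the two. The sketch below is one way to do that in Python. It assumes each embedded record starts with the literal prefix ##PAF followed by whitespace, and that feature lines use the nine standard tab-delimited GFF3 columns; the function name split_gff_paf and the sample lines are hypothetical.

```python
def split_gff_paf(lines):
    """Separate embedded ##PAF records from GFF3 feature rows."""
    paf, features = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##PAF"):
            # Keep only the PAF payload after the prefix.
            paf.append(line[len("##PAF"):].lstrip("\t "))
        elif line and not line.startswith("#"):
            # seqid, source, type, start, end, score, strand, phase, attributes
            features.append(line.split("\t"))
    return paf, features

# Made-up example input, for illustration only.
example = [
    "##gff-version 3",
    "##PAF\tprot1\t100\t0\t100\t+\tchr1\t5000\t1000\t1400\t90\t300\t0",
    "chr1\tminiprot\tmRNA\t1001\t1400\t.\t+\t.\tID=MP000001",
]
paf, features = split_gff_paf(example)
print(len(paf), "PAF records;", len(features), "feature rows")
```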

If you are aligning proteins to a whole genome, it is recommended to add option -I to let miniprot automatically set the maximum intron size. You can also use -G to explicitly specify the max intron size.

Algorithm overview

  1. Translate the reference genome to amino acids in six phases and filter out ORFs shorter than 45bp. Reduce 20 amino acids to 13 distinct integers and extract random open syncmers of 6aa in length. By default, miniprot selects 20% of 6-mers on average. For a reduced 6-mer at reference position x, keep the 6-mer and floor(x/256) in a dense hash table. This concludes the indexing step.

  2. Given a protein sequence as query, extract 6-mer syncmers on the protein, look up the index for seed matches and apply minimap2-like chaining. This first round of chaining is approximate as the reference positions have been binned during indexing.

  3. For each chain in step 2, redo seeding and chaining with sliding 5-mers from both the reference and the protein in the original chain. Miniprot uses all reduced 5-mers for this second round of chaining.

  4. Choose top 100 (see -N) chains. Filter out anchors around potential introns or long gaps. Perform striped dynamic programming between remaining anchors and also extend from the first or last anchors. This gives the final alignment.
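The indexing and seed lookup in steps 1 and 2 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not miniprot's implementation: the 13-group amino acid reduction, the sub-mer length S = 2 (chosen so that roughly 1 in K-S+1 = 5 k-mers is kept, ignoring ties), the multiplicative hash and the coordinate convention x = 3*i + frame are all hypothetical. Six-frame translation, ORF filtering, chaining and the dense hash table are omitted.

```python
from collections import defaultdict

K = 6      # syncmer length in amino acids (from the text)
S = 2      # sub-mer length (assumption, gives about 20% density)
BIN = 256  # reference positions are stored as floor(x / 256) (from the text)

# Hypothetical 13-group reduction; the grouping miniprot uses is not given here.
GROUPS = ["A", "ST", "C", "DE", "NQ", "FY", "G", "H", "IV", "LM", "KR", "P", "W"]
CODE = {aa: g for g, grp in enumerate(GROUPS) for aa in grp}

def encode(mer):
    """Pack residues into one base-13 integer; None if a residue is unknown."""
    v = 0
    for aa in mer:
        c = CODE.get(aa)
        if c is None:
            return None
        v = v * len(GROUPS) + c
    return v

def mix(x):
    """Cheap integer scrambler standing in for a random hash (assumption)."""
    return (x * 2654435761) & 0xFFFFFFFF

def is_open_syncmer(kmer):
    """True if the smallest hashed S-mer of the K-mer sits at offset 0."""
    hashes = []
    for j in range(K - S + 1):
        code = encode(kmer[j:j + S])
        if code is None:
            return False
        hashes.append(mix(code))
    return hashes.index(min(hashes)) == 0

def build_index(translated, frame=0):
    """Map each selected reduced 6-mer to the set of binned positions."""
    index = defaultdict(set)
    for i in range(len(translated) - K + 1):
        kmer = translated[i:i + K]
        if is_open_syncmer(kmer):
            x = 3 * i + frame  # nucleotide coordinate (assumed convention)
            index[encode(kmer)].add(x // BIN)
    return index

def seed_hits(index, protein):
    """Return (query offset, reference bin) pairs for matching syncmers."""
    hits = []
    for i in range(len(protein) - K + 1):
        kmer = protein[i:i + K]
        if is_open_syncmer(kmer):
            for b in sorted(index.get(encode(kmer), ())):
                hits.append((i, b))
    return hits
```

Because only the bin floor(x/256) is stored, a seed hit locates the match to within 256 reference positions, which is why the first round of chaining is approximate and a second round with exact 5-mer positions follows.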

Citing miniprot

If you use miniprot, please cite:

Li, H. (2023) Protein-to-genome alignment with miniprot. Bioinformatics, 39, btad014 [PMID: 36648328].

The preprint is available at arXiv:2210.08052, which additionally shows metrics on MetaEuk. Please note that the published paper evaluated miniprot-0.7. The latest version may report different numbers.

Limitations

  • The initial conditions of dynamic programming are not technically correct, which may result in suboptimal residue alignment in rare cases.

  • Support for non-splicing alignment needs to be improved.

  • More manual inspection is required for improved accuracy. For example, tandem copies in segmental duplications could be handled more carefully.
