A versatile toolkit for k-mers with taxonomic information
shenwei356/unikmer
Documents: https://bioinf.shenwei.me/unikmer/
unikmer is a toolkit for nucleic acid k-mer analysis, providing functions including set operations on k-mers (or k-mer sketches), optionally with TaxIds but without count information.
K-mers are either encoded (k<=32) or hashed (k<=64, using ntHash v1) into `uint64`, and serialized in binary files with the extension `.unik`.
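Why k<=32 is the limit for encoded k-mers can be seen from a small sketch: with 2 bits per base, 32 bases fill exactly 64 bits. The Python below is an illustration only, not unikmer's code; the base-to-bits mapping (A=0, C=1, G=2, T=3) and the function names are assumptions made for this example.

```python
# Illustrative 2-bit k-mer encoding (not unikmer's implementation).
# ASSUMPTION: A=0, C=1, G=2, T=3; the tool's real mapping may differ.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer (k <= 32) into an integer that fits in 64 bits."""
    if not 0 < len(kmer) <= 32:
        raise ValueError("k must be in 1..32 to fit in 64 bits")
    code = 0
    for base in kmer.upper():
        # Shift previous bases left by 2 bits and append the new base.
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode_kmer(code: int, k: int) -> str:
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[code & 3])  # lowest 2 bits hold the last base
        code >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    code = encode_kmer("ACGT")
    print(code, decode_kmer(code, 4))
```

Note that k must be known to decode, because leading `A`s encode as zero bits.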
TaxIds can be assigned when counting k-mers from genome sequences, and the LCA (Lowest Common Ancestor) is computed during set operations, including computing the union, intersection, set difference, unique and repeated k-mers.
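The idea of combining TaxIds by LCA when the same k-mer appears in several inputs can be sketched as follows. This is a minimal illustration under stated assumptions, not unikmer's implementation: the taxonomy is a made-up toy tree given as a child-to-parent map (the root points to itself), and all function names are hypothetical.

```python
# Toy LCA merge for k-mer -> TaxId maps (illustration only).
# ASSUMPTION: `parent` maps every TaxId to its parent; the root maps to itself.


def lineage(taxid, parent):
    """Return the path from taxid up to the root, inclusive."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two TaxIds."""
    ancestors = set(lineage(a, parent))
    for node in lineage(b, parent):  # walk upward from b
        if node in ancestors:
            return node
    raise ValueError("TaxIds do not share a root")


def union_with_lca(x, y, parent):
    """Union of two k-mer -> TaxId maps; shared k-mers get the LCA of both TaxIds."""
    merged = dict(x)
    for kmer, taxid in y.items():
        merged[kmer] = lca(merged[kmer], taxid, parent) if kmer in merged else taxid
    return merged


if __name__ == "__main__":
    # Made-up TaxIds: 1 is the root, 10 and 11 its children, 100 and 101 children of 10.
    toy_parent = {1: 1, 10: 1, 11: 1, 100: 10, 101: 10}
    a = {"ACGT": 100, "CCCC": 100}
    b = {"ACGT": 101, "GGGG": 11}
    print(union_with_lca(a, b, toy_parent))
```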
Related projects:
- kmers provides bit-packed k-mers methods for this tool.
- unik provides k-mer serialization methods for this tool.
- sketches provides generators/iterators for k-mer sketches (Minimizer, Scaled MinHash, Closed Syncmers).
- taxdump provides querying and manipulation of NCBI Taxonomy taxdump files.
Use cases:
- Finding conserved regions in all genomes of a species.
- Finding species/strain-specific sequences for designing probes/primers.
Downloading executable binary files.
conda install -c bioconda unikmer
Counting

    count           Generate k-mers (sketch) from FASTA/Q sequences

Information

    info            Information of binary files
    num             Quickly inspect the number of k-mers in binary files

Format conversion

    view            Read and output binary format to plain text
    dump            Convert plain k-mer text to binary format
    encode          Encode plain k-mer texts to integers
    decode          Decode encoded integers to k-mer texts

Set operations

    concat          Concatenate multiple binary files without removing duplicates
    inter           Intersection of k-mers in multiple binary files
    common          Find k-mers shared by most of the binary files
    union           Union of k-mers in multiple binary files
    diff            Set difference of k-mers in multiple binary files

Split and merge

    sort            Sort k-mers to reduce the file size and accelerate downstream analysis
    split           Split k-mers into sorted chunk files
    tsplit          Split k-mers according to TaxId
    merge           Merge k-mers from sorted chunk files

Subset

    head            Extract the first N k-mers
    sample          Sample k-mers from binary files
    grep            Search k-mers from binary files
    filter          Filter out low-complexity k-mers
    rfilter         Filter k-mers by taxonomic rank

Searching on genomes

    locate          Locate k-mers in genome
    map             Mapping k-mers back to the genome and extract successive regions/subsequences

Misc

    autocompletion  Generate shell autocompletion script
    version         Print version information and check for update
K-mers (represented as `uint64` in RAM) are serialized in 8-byte arrays (or fewer bytes for shorter k-mers in compact format, or far fewer bytes for sorted k-mers) and optionally compressed in gzip format, with the extension `.unik`.
TaxIds are optionally stored next to k-mers in 4 or fewer bytes.
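The saving from the compact format follows from simple arithmetic: a k-mer needs 2k bits, so it fits in the smallest whole number of bytes that holds 2k bits. The sketch below illustrates this under assumptions: the function names are hypothetical, and the big-endian byte order is chosen only for the example and is not taken from the `.unik` format.

```python
# Bytes needed per encoded k-mer (illustration only, not the .unik format itself).


def bytes_per_kmer(k: int, compact: bool) -> int:
    """Return the serialized width of one 2-bit-encoded k-mer."""
    if not 0 < k <= 32:
        raise ValueError("k must be in 1..32 for 2-bit encoding in 64 bits")
    if not compact:
        return 8  # a full uint64
    return (2 * k + 7) // 8  # smallest number of bytes holding 2k bits


def serialize(code: int, k: int, compact: bool) -> bytes:
    """Write one encoded k-mer as a fixed-length byte string.

    ASSUMPTION: big-endian order, picked for this example only.
    """
    return code.to_bytes(bytes_per_kmer(k, compact), "big")


if __name__ == "__main__":
    for k in (15, 23, 32):
        print(k, bytes_per_kmer(k, compact=False), bytes_per_kmer(k, compact=True))
```

For the 23-mers used in the benchmark, this arithmetic gives 6 bytes per k-mer instead of 8 before gzip compression.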
No TaxIds stored in this test.
label | encoded-kmer (a) | gzip-compressed (b) | compact-format (c) | sorted (d) | comment |
---|---|---|---|---|---|
plain | | | | | plain text |
gzip | | ✔ | | | gzipped plain text |
unik.default | ✔ | ✔ | | | gzipped encoded k-mers in fixed-length byte array |
unik.compat | ✔ | ✔ | ✔ | | gzipped encoded k-mers in shorter fixed-length byte array |
unik.sorted | ✔ | ✔ | | ✔ | gzipped sorted encoded k-mers |
- (a) One k-mer is encoded as `uint64` and serialized in 8 bytes.
- (b) K-mer files are compressed in gzip format by default; users can switch on the global option `-C/--no-compress` to output non-compressed files.
- (c) One k-mer is encoded as `uint64` and serialized in 8 bytes by default. However, fewer bytes are needed for short k-mers, e.g., 4 bytes are enough for 15-mers (30 bits). This makes the file more compact and smaller, and is controlled by the global option `-c/--compact`.
- (d) One k-mer is encoded as `uint64`; all k-mers are sorted and compressed using the varint-GB algorithm.
- In all tests, the flag `--canonical` is ON when running `unikmer count`.
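Why sorting shrinks the file can be illustrated with delta encoding: once the integers are sorted, neighbouring values are close, so storing the differences in a variable-length code takes far fewer than 8 bytes each. The sketch below is a simplified stand-in and is not the varint-GB layout used by unikmer: it uses a plain byte-wise varint (7 data bits per byte) instead of grouped varints, and all names are hypothetical.

```python
# Delta + byte-wise varint encoding of sorted integers.
# NOTE: a simplified stand-in for illustration; this is NOT the varint-GB format.


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_sorted(codes) -> bytes:
    """Sort, deduplicate, and store each value as the gap to its predecessor."""
    out = bytearray()
    prev = 0
    for code in sorted(set(codes)):
        out += encode_varint(code - prev)
        prev = code
    return bytes(out)


def decode_sorted(data: bytes):
    """Inverse of encode_sorted."""
    codes, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current  # undo the delta
            codes.append(prev)
            current, shift = 0, 0
    return codes


if __name__ == "__main__":
    codes = [1_000_010, 1_000_000, 1_000_003]
    packed = encode_sorted(codes)
    print(len(packed), "bytes instead of", 8 * len(codes))
    print(decode_sorted(packed))
```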
    # memusg is for compute time and RAM usage: https://github.com/shenwei356/memusg

    # counting (only keep the canonical k-mers and compact output)
    # memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23 --canonical --compact
    $ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23 --canonical --compact
    elapsed time: 0.897s
    peak rss: 192.41 MB

    # counting (only keep the canonical k-mers and sort k-mers)
    # memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23.sorted --canonical --sort
    $ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23.sorted --canonical --sort
    elapsed time: 1.136s
    peak rss: 227.28 MB

    # counting and assigning global TaxIds
    $ unikmer count -k 23 -K -s Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23.sorted -t 585057
    $ unikmer count -k 23 -K -s Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23.sorted -t 511145
    $ unikmer count -k 23 -K -s A.muciniphila-ATCC_BAA-835.fasta.gz -o A.muciniphila-ATCC_BAA-835.fasta.gz.sorted -t 349741

    # counting minimizers and outputting in linear order
    $ unikmer count -k 23 -W 5 -H -K -l A.muciniphila-ATCC_BAA-835.fasta.gz -o A.muciniphila-ATCC_BAA-835.fasta.gz.m

    # view
    $ unikmer view Ecoli-MG1655.fasta.gz.k23.sorted.unik --show-taxid | head -n 3
    AAAAAAAAACCATCCAAATCTGG 511145
    AAAAAAAAACCGCTAGTATATTC 511145
    AAAAAAAAACCTGAAAAAAACGG 511145

    # view (hashed k-mers need the original FASTA/Q file)
    $ unikmer view --show-code --genome A.muciniphila-ATCC_BAA-835.fasta.gz A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik | head -n 3
    CATCCGCCATCTTTGGGGTGTCG 1210726578792
    AGCGCAAAATCCCCAAACATGTA 2286899379883
    AACTGATTTTTGATGATGACTCC 3542156397282

    # find the positions of k-mers
    $ unikmer locate -g A.muciniphila-ATCC_BAA-835.fasta.gz A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik | head -n 5
    NC_010655.1 2 25 ATCTTATAAAATAACCACATAAC 0 .
    NC_010655.1 5 28 TTATAAAATAACCACATAACTTA 0 .
    NC_010655.1 6 29 TATAAAATAACCACATAACTTAA 0 .
    NC_010655.1 9 32 AAAATAACCACATAACTTAAAAA 0 .
    NC_010655.1 13 36 TAACCACATAACTTAAAAAGAAT 0 .

    # info
    $ unikmer info *.unik -a -j 10
    file                                             k   canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version     number  description
    A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik       23  ✓          ✓       ✕       ✕                            ✕       ✕        ✓        v5.0       860,900
    A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik  23  ✓          ✕       ✕       ✕              349741        ✓       ✕        ✓        v5.0     2,630,905
    Ecoli-IAI39.fasta.gz.k23.sorted.unik             23  ✓          ✕       ✕       ✕              585057        ✓       ✕        ✓        v5.0     4,902,266
    Ecoli-IAI39.fasta.gz.k23.unik                    23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,902,266
    Ecoli-MG1655.fasta.gz.k23.sorted.unik            23  ✓          ✕       ✕       ✕              511145        ✓       ✕        ✓        v5.0     4,546,632
    Ecoli-MG1655.fasta.gz.k23.unik                   23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,546,632

    # concat
    $ memusg -t unikmer concat *.k23.sorted.unik -o concat.k23 -c
    elapsed time: 1.020s
    peak rss: 25.86 MB

    # union
    $ memusg -t unikmer union *.k23.sorted.unik -o union.k23 -s
    elapsed time: 3.991s
    peak rss: 590.92 MB

    # or sorting with limited memory.
    # note that the taxonomy database needs some memory.
    $ memusg -t unikmer sort *.k23.sorted.unik -o union2.k23 -u -m 1M
    elapsed time: 3.538s
    peak rss: 324.2 MB

    $ unikmer view -t union.k23.unik | md5sum
    4c038832209278840d4d75944b29219c  -
    $ unikmer view -t union2.k23.unik | md5sum
    4c038832209278840d4d75944b29219c  -

    # duplicate k-mers
    # memusg -t unikmer sort *.k23.sorted.unik -o dup.k23 -d -m 1M # limit memory usage
    $ memusg -t unikmer sort *.k23.sorted.unik -o dup.k23 -d
    elapsed time: 1.143s
    peak rss: 240.18 MB

    # intersection
    $ memusg -t unikmer inter *.k23.sorted.unik -o inter.k23
    elapsed time: 1.481s
    peak rss: 399.94 MB

    # difference
    $ memusg -t unikmer diff -j 10 *.k23.sorted.unik -o diff.k23 -s
    elapsed time: 0.793s
    peak rss: 338.06 MB

    $ ls -lh *.unik
    -rw-r--r-- 1 shenwei shenwei 6.6M Sep 9 17:24 A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik
    -rw-r--r-- 1 shenwei shenwei 9.5M Sep 9 17:24 A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik
    -rw-r--r-- 1 shenwei shenwei 46M Sep 9 17:25 concat.k23.unik
    -rw-r--r-- 1 shenwei shenwei 9.2M Sep 9 17:27 diff.k23.unik
    -rw-r--r-- 1 shenwei shenwei 11M Sep 9 17:26 dup.k23.unik
    -rw-r--r-- 1 shenwei shenwei 18M Sep 9 17:23 Ecoli-IAI39.fasta.gz.k23.sorted.unik
    -rw-r--r-- 1 shenwei shenwei 29M Sep 9 17:24 Ecoli-IAI39.fasta.gz.k23.unik
    -rw-r--r-- 1 shenwei shenwei 17M Sep 9 17:23 Ecoli-MG1655.fasta.gz.k23.sorted.unik
    -rw-r--r-- 1 shenwei shenwei 27M Sep 9 17:25 Ecoli-MG1655.fasta.gz.k23.unik
    -rw-r--r-- 1 shenwei shenwei 11M Sep 9 17:27 inter.k23.unik
    -rw-r--r-- 1 shenwei shenwei 26M Sep 9 17:26 union2.k23.unik
    -rw-r--r-- 1 shenwei shenwei 26M Sep 9 17:25 union.k23.unik

    $ unikmer stats *.unik -a -j 10
    file                                             k   canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version     number  description
    A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik       23  ✓          ✓       ✕       ✕                            ✕       ✕        ✓        v5.0       860,900
    A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik  23  ✓          ✕       ✕       ✕              349741        ✓       ✕        ✓        v5.0     2,630,905
    concat.k23.unik                                  23  ✓          ✕       ✕       ✓                            ✕       ✓        ✓        v5.0            -1
    diff.k23.unik                                    23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     2,326,096
    dup.k23.unik                                     23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     2,576,170
    Ecoli-IAI39.fasta.gz.k23.sorted.unik             23  ✓          ✕       ✕       ✕              585057        ✓       ✕        ✓        v5.0     4,902,266
    Ecoli-IAI39.fasta.gz.k23.unik                    23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,902,266
    Ecoli-MG1655.fasta.gz.k23.sorted.unik            23  ✓          ✕       ✕       ✕              511145        ✓       ✕        ✓        v5.0     4,546,632
    Ecoli-MG1655.fasta.gz.k23.unik                   23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,546,632
    inter.k23.unik                                   23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     2,576,170
    union2.k23.unik                                  23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     6,872,728
    union.k23.unik                                   23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     6,872,728

    # -----------------------------------------------------------------------------------------

    # mapping k-mers to genome
    seqkit seq Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta
    g=Ecoli-IAI39.fasta
    f=inter.k23.unik

    # mapping k-mers back to the genome and extract successive regions/subsequences
    unikmer map -g $g $f -a | more

    # using bwa
    # to fasta
    unikmer view $f -a -o $f.fa.gz

    # make index
    bwa index $g; samtools faidx $g

    ncpu=12
    ls $f.fa.gz \
        | rush -j 1 -v ref=$g -v j=$ncpu \
            'bwa aln -o 0 -l 17 -k 0 -t {j} {ref} {} \
                | bwa samse {ref} - {} \
                | samtools view -bS > {}.bam; \
             samtools sort -T {}.tmp -@ {j} {}.bam -o {}.sorted.bam; \
             samtools index {}.sorted.bam; \
             samtools flagstat {}.sorted.bam > {}.sorted.bam.flagstat; \
             /bin/rm {}.bam '
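The k-mer numbers reported above can be cross-checked with basic set arithmetic. The glob `*.k23.sorted.unik` matches only the two E. coli files, so the union, intersection, and difference counts should satisfy inclusion-exclusion. The short Python check below uses only numbers copied from the output above; the helper name is hypothetical, and it assumes that `diff` subtracts the later files from the first one (here IAI39 minus MG1655, following the glob order).

```python
# Consistency check of the reported k-mer counts (numbers copied from the output above).


def counts_are_consistent(a: int, b: int, inter: int, union: int, diff: int) -> bool:
    """Check |A ∪ B| = |A| + |B| - |A ∩ B| and |A \\ B| = |A| - |A ∩ B|."""
    return a + b - inter == union and a - inter == diff


REPORTED = {
    "a": 4_902_266,      # Ecoli-IAI39.fasta.gz.k23.sorted.unik
    "b": 4_546_632,      # Ecoli-MG1655.fasta.gz.k23.sorted.unik
    "inter": 2_576_170,  # inter.k23.unik
    "union": 6_872_728,  # union.k23.unik
    "diff": 2_326_096,   # diff.k23.unik (ASSUMED to be first file minus the rest)
}

if __name__ == "__main__":
    print(counts_are_consistent(**REPORTED))

    # The same identities on a toy example with Python sets.
    x, y = {"AAA", "AAC", "ACG"}, {"AAC", "ACG", "TTT"}
    print(len(x | y), len(x & y), len(x - y))
```

The count for `dup.k23.unik` equals the intersection count, which is what one would expect when each input file holds every k-mer only once.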
Please open an issue to report bugs, propose new functions, or ask for help.