A versatile toolkit for k-mers with taxonomic information
shenwei356/unikmer
Documents: https://bioinf.shenwei.me/unikmer/
unikmer is a toolkit for nucleic acid k-mer analysis. It provides functions including set operations on k-mers (or k-mer sketches), optionally with TaxIds but without count information. K-mers are either encoded (k<=32) or hashed (k<=64, using ntHash v1) into `uint64`, and serialized in binary files with the extension `.unik`.
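To make the encoding step concrete, here is a minimal sketch of packing a k-mer (k<=32) into a 64-bit integer with 2 bits per base. The bit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration and may differ from what the tool does internally.

```python
# Assumed 2-bit codes; the mapping used by the tool itself is not stated in this README.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer (1 <= k <= 32) into an integer that fits in 64 bits."""
    if not 0 < len(kmer) <= 32:
        raise ValueError("k must be between 1 and 32 for 2-bit encoding")
    code = 0
    for base in kmer.upper():
        # Shift previous bases left and append the 2 bits of this base.
        code = (code << 2) | BASE_TO_BITS[base]  # KeyError on non-ACGT bases
    return code


def decode_kmer(code: int, k: int) -> str:
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[code & 3])  # lowest 2 bits hold the last base
        code >>= 2
    return "".join(reversed(bases))


print(encode_kmer("ACGT"))  # 27 (binary 00 01 10 11)
print(decode_kmer(27, 4))   # ACGT
```

Because 32 bases need exactly 64 bits, longer k-mers cannot be stored this way, which is why hashing is used for larger k.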
TaxIds can be assigned when counting k-mers from genome sequences, and the LCA (Lowest Common Ancestor) is computed during set operations, including computing the union, intersection, set difference, unique and repeated k-mers.
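The LCA idea can be sketched as follows: when the same k-mer appears with two different TaxIds, the merged record keeps their lowest common ancestor. The taxonomy below is a toy tree with made-up IDs, and the function names are hypothetical; this is not the tool's implementation.

```python
def lineage(taxid, parent):
    """Return the path from taxid up to the root (the node that is its own parent)."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two nodes in a parent-pointer tree."""
    ancestors_of_a = set(lineage(a, parent))
    for node in lineage(b, parent):  # walk upward from b; first hit is the LCA
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def union_with_lca(kmers_a, kmers_b, parent):
    """Union of two {kmer_code: taxid} maps; shared k-mers get the LCA of their TaxIds."""
    merged = dict(kmers_a)
    for code, taxid in kmers_b.items():
        merged[code] = lca(merged[code], taxid, parent) if code in merged else taxid
    return merged


# Toy taxonomy with made-up IDs: 1 is the root, 10 and 20 are its children,
# 100 and 101 are children of 10.
parent = {1: 1, 10: 1, 20: 1, 100: 10, 101: 10}

print(lca(100, 101, parent))  # 10
print(union_with_lca({7: 100, 8: 100}, {8: 101, 9: 20}, parent))
# {7: 100, 8: 10, 9: 20}
```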
Related projects:
- kmers provides bit-packed k-mers methods for this tool.
- unik provides k-mer serialization methods for this tool.
- sketches provides generators/iterators for k-mer sketches (Minimizer, Scaled MinHash, Closed Syncmers).
- taxdump provides querying and manipulation of NCBI Taxonomy taxdump files.

Use cases:
- Finding conserved regions in all genomes of a species.
- Finding species/strain-specific sequences for designing probes/primers.
Download executable binary files, or install with conda:
conda install -c bioconda unikmer
Counting

    count           Generate k-mers (sketch) from FASTA/Q sequences

Information

    info            Information of binary files
    num             Quickly inspect the number of k-mers in binary files

Format conversion

    view            Read and output binary format to plain text
    dump            Convert plain k-mer text to binary format
    encode          Encode plain k-mer texts to integers
    decode          Decode encoded integers to k-mer texts

Set operations

    concat          Concatenate multiple binary files without removing duplicates
    inter           Intersection of k-mers in multiple binary files
    common          Find k-mers shared by most of the binary files
    union           Union of k-mers in multiple binary files
    diff            Set difference of k-mers in multiple binary files

Split and merge

    sort            Sort k-mers to reduce the file size and accelerate downstream analysis
    split           Split k-mers into sorted chunk files
    tsplit          Split k-mers according to TaxId
    merge           Merge k-mers from sorted chunk files

Subset

    head            Extract the first N k-mers
    sample          Sample k-mers from binary files
    grep            Search k-mers from binary files
    filter          Filter out low-complexity k-mers
    rfilter         Filter k-mers by taxonomic rank

Searching on genomes

    locate          Locate k-mers in genome
    map             Mapping k-mers back to the genome and extract successive regions/subsequences

Misc

    autocompletion  Generate shell autocompletion script
    version         Print version information and check for update
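The semantics of the set-operation subcommands can be illustrated with plain Python sets, where each set stands in for the k-mers of one binary file. This is only a conceptual sketch: the example k-mers are made up, the "first file minus the others" reading of `diff` and the `min_files` threshold used for `common` are assumptions, and the real commands work on `.unik` files rather than in-memory sets.

```python
from collections import Counter

# Each set stands in for the k-mers stored in one file (made-up 3-mers).
files = [
    {"ACG", "CGT", "GTA"},
    {"CGT", "GTA", "TAC"},
    {"GTA", "TAC", "ACC"},
]

# concat: keep everything, duplicates included.
concat = [kmer for kmers in files for kmer in sorted(kmers)]

# union / inter: k-mers present in any file / in every file.
union = set().union(*files)
inter = set.intersection(*files)

# diff (assumed semantics): k-mers of the first file absent from all other files.
diff = files[0].difference(*files[1:])

# common (assumed semantics): k-mers present in at least `min_files` files.
min_files = 2  # hypothetical threshold
counts = Counter(kmer for kmers in files for kmer in kmers)
common = {kmer for kmer, n in counts.items() if n >= min_files}

print(len(concat), sorted(union), sorted(inter), sorted(diff), sorted(common))
```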
K-mers (represented as `uint64` in RAM) are serialized in 8-byte arrays (or fewer bytes for shorter k-mers in compact format, or far fewer bytes for sorted k-mers) and optionally compressed in gzip format, with the extension `.unik`. TaxIds are optionally stored next to k-mers in 4 or fewer bytes.
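As a rough illustration of why shorter k-mers need fewer bytes, the sketch below computes the number of bytes required at 2 bits per base and serializes a code at that width. The rounding rule, the big-endian byte order, and the function names are assumptions, not a description of the actual `.unik` layout.

```python
def bytes_per_kmer(k: int) -> int:
    """Bytes needed to hold a k-mer at 2 bits per base, rounded up to whole bytes."""
    if not 0 < k <= 32:
        raise ValueError("k must be between 1 and 32")
    return (2 * k + 7) // 8


def serialize(code: int, k: int, compact: bool = False) -> bytes:
    """Write one encoded k-mer as a fixed-length byte string."""
    width = bytes_per_kmer(k) if compact else 8
    return code.to_bytes(width, "big")  # byte order is an assumption


print(bytes_per_kmer(15))  # 4  (30 bits)
print(bytes_per_kmer(23))  # 6  (46 bits)
print(len(serialize(27, 15)), len(serialize(27, 15, compact=True)))  # 8 4
```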
No TaxIds are stored in this test.
label | encoded-kmer (a) | gzip-compressed (b) | compact-format (c) | sorted (d) | comment |
---|---|---|---|---|---|
plain | | | | | plain text |
gzip | | ✔ | | | gzipped plain text |
unik.default | ✔ | ✔ | | | gzipped encoded k-mers in fixed-length byte array |
unik.compat | ✔ | ✔ | ✔ | | gzipped encoded k-mers in shorter fixed-length byte array |
unik.sorted | ✔ | ✔ | | ✔ | gzipped sorted encoded k-mers |
- (a) One k-mer is encoded as `uint64` and serialized in 8 bytes.
- (b) The k-mer file is compressed in gzip format by default; users can switch on the global option `-C/--no-compress` to output a non-compressed file.
- (c) One k-mer is encoded as `uint64` and serialized in 8 bytes by default. However, fewer bytes are needed for short k-mers, e.g., 4 bytes are enough for 15-mers (30 bits). This makes the file more compact, with a smaller file size, and is controlled by the global option `-c/--compact`.
- (d) One k-mer is encoded as `uint64`; all k-mers are sorted and compressed using the varint-GB algorithm.
- In all tests, the flag `--canonical` is ON when running `unikmer count`.
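To show why sorting shrinks the file, the sketch below delta-encodes sorted k-mer codes and writes each gap as a variable-length integer. Note that it uses a simple LEB128-style varint (7 payload bits per byte) as a stand-in; it is not the varint-GB layout mentioned above, and the function names and sample values are made up.

```python
def encode_varint(n: int) -> bytes:
    """LEB128-style varint: 7 payload bits per byte, high bit set if more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_sorted(codes) -> bytes:
    """Sort the codes and store the gap between neighbours instead of full values."""
    out = bytearray()
    prev = 0
    for code in sorted(codes):
        out += encode_varint(code - prev)
        prev = code
    return bytes(out)


def decode_sorted(data: bytes) -> list:
    """Inverse of encode_sorted: rebuild the sorted codes from the varint gaps."""
    codes, prev, shift, value = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit: more bytes belong to this gap
            shift += 7
        else:                    # gap complete: add it to the running total
            prev += value
            codes.append(prev)
            shift, value = 0, 0
    return codes


codes = [1000, 1003, 1010, 5000]
packed = encode_sorted(codes)
print(len(packed), "bytes instead of", 8 * len(codes))  # 6 bytes instead of 32
print(decode_sorted(packed) == codes)                   # True
```

Small gaps between neighbouring sorted values fit in one or two bytes, whereas unsorted values each need the full fixed width.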
    # memusg is for compute time and RAM usage: https://github.com/shenwei356/memusg

    # counting (only keep the canonical k-mers and compact output)
    # memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23 --canonical --compact
    $ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23 --canonical --compact
    elapsed time: 0.897s
    peak rss: 192.41 MB

    # counting (only keep the canonical k-mers and sort k-mers)
    # memusg -t unikmer count -k 23 Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23.sorted --canonical --sort
    $ memusg -t unikmer count -k 23 Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23.sorted --canonical --sort
    elapsed time: 1.136s
    peak rss: 227.28 MB

    # counting and assigning global TaxIds
    $ unikmer count -k 23 -K -s Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta.gz.k23.sorted -t 585057
    $ unikmer count -k 23 -K -s Ecoli-MG1655.fasta.gz -o Ecoli-MG1655.fasta.gz.k23.sorted -t 511145
    $ unikmer count -k 23 -K -s A.muciniphila-ATCC_BAA-835.fasta.gz -o A.muciniphila-ATCC_BAA-835.fasta.gz.sorted -t 349741

    # counting minimizers and outputting in linear order
    $ unikmer count -k 23 -W 5 -H -K -l A.muciniphila-ATCC_BAA-835.fasta.gz -o A.muciniphila-ATCC_BAA-835.fasta.gz.m

    # view
    $ unikmer view Ecoli-MG1655.fasta.gz.k23.sorted.unik --show-taxid | head -n 3
    AAAAAAAAACCATCCAAATCTGG 511145
    AAAAAAAAACCGCTAGTATATTC 511145
    AAAAAAAAACCTGAAAAAAACGG 511145

    # view (hashed k-mers need the original FASTA/Q file)
    $ unikmer view --show-code --genome A.muciniphila-ATCC_BAA-835.fasta.gz A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik | head -n 3
    CATCCGCCATCTTTGGGGTGTCG 1210726578792
    AGCGCAAAATCCCCAAACATGTA 2286899379883
    AACTGATTTTTGATGATGACTCC 3542156397282

    # find the positions of k-mers
    $ unikmer locate -g A.muciniphila-ATCC_BAA-835.fasta.gz A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik | head -n 5
    NC_010655.1   2  25  ATCTTATAAAATAACCACATAAC  0  .
    NC_010655.1   5  28  TTATAAAATAACCACATAACTTA  0  .
    NC_010655.1   6  29  TATAAAATAACCACATAACTTAA  0  .
    NC_010655.1   9  32  AAAATAACCACATAACTTAAAAA  0  .
    NC_010655.1  13  36  TAACCACATAACTTAAAAAGAAT  0  .

    # info
    $ unikmer info *.unik -a -j 10
    file                                             k   canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version     number  description
    A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik       23  ✓          ✓       ✕       ✕                            ✕       ✕        ✓        v5.0       860,900
    A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik  23  ✓          ✕       ✕       ✕              349741        ✓       ✕        ✓        v5.0     2,630,905
    Ecoli-IAI39.fasta.gz.k23.sorted.unik             23  ✓          ✕       ✕       ✕              585057        ✓       ✕        ✓        v5.0     4,902,266
    Ecoli-IAI39.fasta.gz.k23.unik                    23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,902,266
    Ecoli-MG1655.fasta.gz.k23.sorted.unik            23  ✓          ✕       ✕       ✕              511145        ✓       ✕        ✓        v5.0     4,546,632
    Ecoli-MG1655.fasta.gz.k23.unik                   23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,546,632

    # concat
    $ memusg -t unikmer concat *.k23.sorted.unik -o concat.k23 -c
    elapsed time: 1.020s
    peak rss: 25.86 MB

    # union
    $ memusg -t unikmer union *.k23.sorted.unik -o union.k23 -s
    elapsed time: 3.991s
    peak rss: 590.92 MB

    # or sorting with limited memory.
    # note that the taxonomy database needs some memory.
    $ memusg -t unikmer sort *.k23.sorted.unik -o union2.k23 -u -m 1M
    elapsed time: 3.538s
    peak rss: 324.2 MB

    $ unikmer view -t union.k23.unik | md5sum
    4c038832209278840d4d75944b29219c  -
    $ unikmer view -t union2.k23.unik | md5sum
    4c038832209278840d4d75944b29219c  -

    # duplicate k-mers
    # memusg -t unikmer sort *.k23.sorted.unik -o dup.k23 -d -m 1M # limit memory usage
    $ memusg -t unikmer sort *.k23.sorted.unik -o dup.k23 -d
    elapsed time: 1.143s
    peak rss: 240.18 MB

    # intersection
    $ memusg -t unikmer inter *.k23.sorted.unik -o inter.k23
    elapsed time: 1.481s
    peak rss: 399.94 MB

    # difference
    $ memusg -t unikmer diff -j 10 *.k23.sorted.unik -o diff.k23 -s
    elapsed time: 0.793s
    peak rss: 338.06 MB

    $ ls -lh *.unik
    -rw-r--r-- 1 shenwei shenwei 6.6M Sep  9 17:24 A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik
    -rw-r--r-- 1 shenwei shenwei 9.5M Sep  9 17:24 A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik
    -rw-r--r-- 1 shenwei shenwei  46M Sep  9 17:25 concat.k23.unik
    -rw-r--r-- 1 shenwei shenwei 9.2M Sep  9 17:27 diff.k23.unik
    -rw-r--r-- 1 shenwei shenwei  11M Sep  9 17:26 dup.k23.unik
    -rw-r--r-- 1 shenwei shenwei  18M Sep  9 17:23 Ecoli-IAI39.fasta.gz.k23.sorted.unik
    -rw-r--r-- 1 shenwei shenwei  29M Sep  9 17:24 Ecoli-IAI39.fasta.gz.k23.unik
    -rw-r--r-- 1 shenwei shenwei  17M Sep  9 17:23 Ecoli-MG1655.fasta.gz.k23.sorted.unik
    -rw-r--r-- 1 shenwei shenwei  27M Sep  9 17:25 Ecoli-MG1655.fasta.gz.k23.unik
    -rw-r--r-- 1 shenwei shenwei  11M Sep  9 17:27 inter.k23.unik
    -rw-r--r-- 1 shenwei shenwei  26M Sep  9 17:26 union2.k23.unik
    -rw-r--r-- 1 shenwei shenwei  26M Sep  9 17:25 union.k23.unik

    $ unikmer stats *.unik -a -j 10
    file                                             k   canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version     number  description
    A.muciniphila-ATCC_BAA-835.fasta.gz.m.unik       23  ✓          ✓       ✕       ✕                            ✕       ✕        ✓        v5.0       860,900
    A.muciniphila-ATCC_BAA-835.fasta.gz.sorted.unik  23  ✓          ✕       ✕       ✕              349741        ✓       ✕        ✓        v5.0     2,630,905
    concat.k23.unik                                  23  ✓          ✕       ✕       ✓                            ✕       ✓        ✓        v5.0            -1
    diff.k23.unik                                    23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     2,326,096
    dup.k23.unik                                     23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     2,576,170
    Ecoli-IAI39.fasta.gz.k23.sorted.unik             23  ✓          ✕       ✕       ✕              585057        ✓       ✕        ✓        v5.0     4,902,266
    Ecoli-IAI39.fasta.gz.k23.unik                    23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,902,266
    Ecoli-MG1655.fasta.gz.k23.sorted.unik            23  ✓          ✕       ✕       ✕              511145        ✓       ✕        ✓        v5.0     4,546,632
    Ecoli-MG1655.fasta.gz.k23.unik                   23  ✓          ✕       ✕       ✕                            ✕       ✓        ✓        v5.0     4,546,632
    inter.k23.unik                                   23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     2,576,170
    union2.k23.unik                                  23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     6,872,728
    union.k23.unik                                   23  ✓          ✕       ✕       ✓                            ✓       ✕        ✓        v5.0     6,872,728

    # -----------------------------------------------------------------------------------------

    # mapping k-mers to genome
    seqkit seq Ecoli-IAI39.fasta.gz -o Ecoli-IAI39.fasta
    g=Ecoli-IAI39.fasta
    f=inter.k23.unik

    # mapping k-mers back to the genome and extract successive regions/subsequences
    unikmer map -g $g $f -a | more

    # using bwa
    # to fasta
    unikmer view $f -a -o $f.fa.gz

    # make index
    bwa index $g; samtools faidx $g

    ncpu=12
    ls $f.fa.gz \
        | rush -j 1 -v ref=$g -v j=$ncpu \
            'bwa aln -o 0 -l 17 -k 0 -t {j} {ref} {} \
                | bwa samse {ref} - {} \
                | samtools view -bS > {}.bam; \
             samtools sort -T {}.tmp -@ {j} {}.bam -o {}.sorted.bam; \
             samtools index {}.sorted.bam; \
             samtools flagstat {}.sorted.bam > {}.sorted.bam.flagstat; \
             /bin/rm {}.bam '
Please open an issue to report bugs, propose new functions or ask for help.