- Notifications
You must be signed in to change notification settings - Fork176
A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
License
shenwei356/seqkit
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
- Try SeqKit in your browser (Tutorials and Exercises provided bysandbox.bio)
- Documents:http://bioinf.shenwei.me/seqkit(Usage,FAQs,Tutorial,andBenchmark)
- Source code:https://github.com/shenwei356/seqkit
- Latest version:
- Please cite:
- Others:
- Easy to install (download)
- Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
- Light weight and out-of-the-box, no dependencies, no compilation, no configuration
conda install -c bioconda seqkit
- Easy to use
- Ultrafast (seetechnical-details andbenchmark)
- Seamlessly parsing both FASTA and FASTQ formats
- Supporting (
gzip/xz/zstd/bzip2/lz4compressed) STDIN/STDOUT and input/output file, easily integrated in pipe - Reproducible results (configurable rand seed in
sampleandshuffle) - Supporting custom sequence ID via regular expression
- SupportingBash/Zsh autocompletion
- Versatile commands (usages and examples)
- Practical functions supported by38 subcommands
Go toDownload Page, where you can find download links to various platforms.
pixi global install -c bioconda seqkitconda install -c bioconda seqkitbrew install seqkit| Category | Command | Function | Input | Strand-sensitivity | Multi-threads |
|---|---|---|---|---|---|
| Basic operation | seq | Transform sequences: extract ID/seq, filter by length/quality, remove gaps… | FASTA/Q | ||
| stats | Simple statistics: #seqs, min/max_len, N50, Q20%, Q30%… | FASTA/Q | ✓ | ||
| subseq | Get subsequences by region/gtf/bed, including flanking sequences | FASTA/Q | + or/and - | ||
| sliding | Extract subsequences in sliding windows | FASTA/Q | + only | ||
| faidx | Create the FASTA index file and extract subsequences (with more features than samtools faidx) | FASTA | + or/and - | ||
| translate | translate DNA/RNA to protein sequence | FASTA/Q | + or/and - | ||
| watch | Monitoring and online histograms of sequence features | FASTA/Q | |||
| scat | Real time concatenation and streaming of fastx files | FASTA/Q | ✓ | ||
| Format conversion | fq2fa | Convert FASTQ to FASTA format | FASTQ | ||
| fx2tab | Convert FASTA/Q to tabular format | FASTA/Q | |||
| fa2fq | Retrieve corresponding FASTQ records by a FASTA file | FASTA/Q | + only | ||
| tab2fx | Convert tabular format to FASTA/Q format | TSV | |||
| convert | Convert FASTQ quality encoding between Sanger, Solexa and Illumina | FASTA/Q | |||
| Searching | grep | Search sequences by ID/name/sequence/sequence motifs, mismatch allowed | FASTA/Q | + and - | partly, -m |
| locate | Locate subsequences/motifs, mismatch allowed | FASTA/Q | + and - | partly, -m | |
| amplicon | Extract amplicon (or specific region around it), mismatch allowed | FASTA/Q | + and - | partly, -m | |
| fish | Look for short sequences in larger sequences | FASTA/Q | + and - | ||
| Set operation | sample | Sample sequences by number or proportion | FASTA/Q | ||
| sample2 | Sample sequences by number or proportion (version 2) | FASTA/Q | |||
| rmdup | Remove duplicated sequences by ID/name/sequence | FASTA/Q | + and - | ||
| common | Find common sequences of multiple files by id/name/sequence | FASTA/Q | + and - | ||
| duplicate | Duplicate sequences N times | FASTA/Q | |||
| split | Split sequences into files by id/seq region/size/parts (mainly for FASTA) | FASTA preffered | |||
| split2 | Split sequences into files by size/parts (FASTA, PE/SE FASTQ) | FASTA/Q | |||
| head | print the first N FASTA/Q records, or leading records whose total length >= L | FASTA/Q | |||
| head-genome | Print sequences of the first genome with common prefixes in name | FASTA/Q | |||
| range | Print FASTA/Q records in a range (start:end) | FASTA/Q | |||
| pair | Patch up paired-end reads from two fastq files | FASTA/Q | |||
| Edit | replace | Replace name/sequence by regular expression | FASTA/Q | + only | |
| rename | Rename duplicated IDs | FASTA/Q | |||
| concat | Concatenate sequences with same ID from multiple files | FASTA/Q | + only | ||
| restart | Reset start position (rotate) for circular genomes | FASTA/Q | + only | ||
| mutate | Edit sequence (point mutation, insertion, deletion) | FASTA/Q | + only | ||
| sana | Sanitize broken single line FASTQ files | FASTQ | |||
| Ordering | sort | Sort sequences by id/name/sequence/length | FASTA preffered | ||
| shuffle | Shuffle sequences | FASTA preffered | |||
| BAM processing | bam | Monitoring and online histograms of BAM record features | BAM | ||
| Miscellaneous | sum | Compute message digest for all sequences in FASTA/Q files | FASTA/Q | ✓ | |
| merge-slides | Merge sliding windows generated from seqkit sliding | TSV |
Notes:
- Strand-sensitivity:
+ only: only processing on the positive/forward strand.+ and -: searching on both strands.+ or/and -: depends on users' flags/options/arguments.
- Multiple-threads: Using the default 4 threads is fast enough for most commands, some commands can benefit from extra threads.
Wei Shen*, Botond Sipos, and Liuyang Zhao. 2024. SeqKit2: A Swiss Army Knife for Sequence and Alignment Processing.iMeta e191.doi:10.1002/imt2.191.
- Wei Shen
- Botond Sipos:
bam,scat,fish,sana,watch. - others
We thank all users for their valuable feedback and suggestions. We thank all contributors for improving the code and documentation.
We appreciateKlaus Post for his fantastic packages (compress andpgzip) which accelerate gzip file reading and writing.
Create an issue to report bugs,propose new functions or ask for help.
About
A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.
