- Notifications
You must be signed in to change notification settings - Fork2
🔬🐛 Rapid 16s rRNA identification from isolate FASTQ files
License
tseemann/sixess
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Rapid 16s rDNA from isolate FASTQ files
sixess
is a command-line software tool to identifybacterial species based on 16S rDNA sequence directlyfrom WGS FASTQ data. It includes databases fromNCBI (default), RDP and SILVA.
# just give it sequences!% sixess R1.fastq.gzStaphylococcus epidermidis# sometimes there is no match% sixess /dev/nullNo matches# give it as many sequence files as needed% sixess R1.fq R2.fqEnterococcus faecium# we provide different databases you can choose% sixess -d RDP contigs.faBacillus cereus# you can pipe to stdin too% bzcat chernobyl.fq.bz2 | sixess -Deinococcus radiodurans
cd $HOMEgit clone https://github.com/tseemann/sixessexport PATH=$HOME/sixess/bin:$PATH
brew install brewsci/bio/sixess # COMING SOON
conda install -c bioconda -c conda-forge sixess # COMING SOON
The input can be one or more sequence files, or-
denotingstdin
.The input data can be FASTQ or FASTA, and may be.gz
compressed.Any read length is accepted, even whole chromosomes.
The output is asingle line tostdout
.If a match was found, it will beGenus species
.If no prediction could be made, it will beNo matches
.
-q Quiet mode, no output -p DIR Database folder (/home/tseemann/git/sixess/db) -d FILE Database {NCBI RDP SILVA.gz} (NCBI) -t NUM CPU threads (1) -m FILE Save alignments to FILE in PAF format -V Print version and exit
-q
enables "quiet mode" which only prints to stderr for errors-p
is the location of the sequence databases-d
selects the database; they can be.gz
compressed (seeDatabases-t
increases threads; 3 is the suggested value forminimap2
-m
allows you to save the PAF output ofminimap2
-V
prints the version and exitse.g.sixess 1.0
TheNCBI 16S ribosomal RNA projectcontains curated 16S ribosomal RNA bacteria and archaea RefSeq entries.It has ~20,000 entries.
esearch -db nucleotide -query '33175[BioProject] OR 33317[BioProject]' \ | efetch -db nuccore -format fasta \ > $(which sixess)/../db/NCBI
Bacterial 16S rDNA sequences for "type strains"from theRDP databaseare included. These are denoted with(T)
in theFASTA headers. It contains ~10,000 entries.
wget --no-check-certificate https://rdp.cme.msu.edu/download/current_Bacteria_unaligned.fa.gzgunzip -c current_Bacteria_unaligned.fa.gz \ | bioawk -cfastx '/\(T\)/{print ">" $name " " $comment "\n" toupper($seq)}' \ > $(which sixess)/../db/RDP
SILVAis a comprehensive on-line resource for quality checked andaligned ribosomal RNA sequence data.The filtered version of the aligned 16S/18S/SSU databasecontains ~100,000 entries.
# replace "132" with latest version as neededwget https://www.arb-silva.de/fileadmin/silva_databases/release_132/Exports/SILVA_132_SSURef_Nr99_tax_silva.fasta.gzgunzip -v SILVA_132_SSURef_Nr99_tax_silva.fasta.gz \ | bioawk -cfastx \ '$comment ~ /^Bacteria;|^Archaea;/ \ && $comment !~ /(;unidentified|Mitochondria;|;Chloroplast|;uncultured| sp\.)/ \ { sub(/^.*;/,"",$comment); gsub("U","T",$seq); print ">" $name " " $comment "\n" $seq }' \ | seqtk seq -l 60 -U \ > SILVA.tmp1cd-hit-est -i SILVA.tmp1 -o SILVA.tmp2 -c 1.0 -T 0 -M 2000 -d 250cp SILVA.tmp2 $(which sixess)/../db/SILVArm -f SILVA.tmp1 SILVA.tmp2 SILVA.tmp2.clstr
Assuming you have a FASTA file of 16S DNA sequencescalled/home/alex/GG.fa
say, you can do this:
cp /home/alex/GG.fa $(which sixess)/../db/GGsixess -d GG R1.fastq.gz
sixess -p /home/alex/data -d GG.fa R1.fastq.gz
- Identify reads which look like 16S (
minimap2
) - Count up how many reads hit each 16S sequence (possibly weighted)
- Choose the top hit and report it
Report bugs and give suggesions on theIssues page