Package to extract Unannotated Regions from prokaryotic genomes and report coding and pseudogenised genes delimited by stop codons, named StORFs (Stop - Open Reading Frames).
NickJD/StORF-Reporter
StORF-Reporter has now been published in NAR: https://doi.org/10.1093/nar/gkad814
StORF-Reporter is a toolkit that returns missed CDS genes from the Unannotated Regions (URs) of prokaryotic genomes.
Installing StORF-Reporter will also install the Python libraries NumPy (>=1.22.0,<1.24.0) and Pyrodigal (https://github.com/althonos/pyrodigal).
Consider using '--no-cache-dir' with pip to ensure the download of the newest version of StORF-Reporter.
#############################################################
Supplement a current annotation from a tool such as Prokka or Bakta. A new GFF file will be created compatible with downstream pangenome analysis tools such as Roary and Panaroo.
For use on a single Prokka/Bakta output directory - Will also create a new fasta file with Prokka/Bakta genes and StORF sequences.
StORF-Reporter -anno Prokka Out_Dir -p .../Test_Datasets/Prokka_E-coli/
For use on multiple Prokka/Bakta output directories - Will also create a new fasta file with Prokka/Bakta genes and StORF sequences.
StORF-Reporter -anno Prokka Multiple_Out_Dirs -p ../Test_Datasets/Multi_Prokka_Outs
StORF-Reporter -anno Prokka Multiple_GFFs -p .../Test_Datasets/Prokka_Outputs/
For use on a GFF file from a CDS prediction tool such as Prodigal - Provide a GFF file and StORF-Reporter will find the matching .fa/.fasta/.fna (must have the same name).
StORF-Reporter -anno Feature_Types Single_Genome -p .../Test_Datasets/Matching_GFF_FASTA/Myco.gff
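The name-matching rule described above (a GFF paired with a .fa/.fasta/.fna file of the same base name) can be checked before a run with a small helper. This is only an illustrative sketch: `find_matching_fasta` is a hypothetical function written for this example and is not part of StORF-Reporter, and the order in which extensions are tried is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions the text above lists as accepted for the accompanying FASTA.
# The search order is an assumption made for this sketch.
FASTA_EXTENSIONS = (".fna", ".fa", ".fasta")


def find_matching_fasta(gff_path: str) -> Optional[Path]:
    """Return the FASTA file that shares the GFF's base name, or None.

    Hypothetical helper: it only mirrors the "must have the same name"
    rule and does not call StORF-Reporter itself.
    """
    gff = Path(gff_path)
    for ext in FASTA_EXTENSIONS:
        candidate = gff.with_suffix(ext)  # e.g. Myco.gff -> Myco.fna
        if candidate.is_file():
            return candidate
    return None
```

If the helper returns None for a GFF, the file has no same-named FASTA beside it and would need renaming before being passed with -p.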
For use on a directory containing multiple GFF files from a CDS prediction tool such as Prodigal - StORF-Reporter will find the matching .fa/.fasta/.fna (must have the same name).
StORF-Reporter -anno Feature_Types Multiple_Genomes -p .../Test_Datasets/Matching_GFF_FASTA/
StORF-Reporter -anno Feature_Types Multiple_Combined_GFFs -p .../Test_Datasets/Combined_GFFs/
To perform a fresh end-to-end annotation of a genome without an annotation, StORF-Reporter will use Pyrodigal to predict CDS genes and then supplement with StORFs.
StORF-Reporter -anno Pyrodigal Single_FASTA -p .../Test_Datasets/Pyrodigal/E-coli.fa
StORF-Reporter -anno Ensembl Single_Genome -p .../Test_Datasets/Matching_GFF_FASTA/E-coli.gff
usage: StORF_Reporter.py [-h]
  [-anno [{Prokka,Bakta,Out_Dir,Multiple_Out_Dirs,Single_GFF,Multiple_GFFs,Ensembl,Feature_Types,Single_Genome,Multiple_Genomes,Single_Combined_GFF,Multiple_Combined_GFFs,Pyrodigal,Single_FASTA,Multiple_FASTA} ...]]
  [-p PATH] [-af ALT_FILENAME] [-oname O_NAME] [-odir O_DIR] [-sout {True,False}]
  [-lw {True,False}] [-aa {True,False}] [-gz {True,False}]
  [-py_train [{longest,individual,meta}]] [-py_fasta {True,False}] [-py_unstorfed {True,False}]
  [-gene_ident GENE_IDENT] [-min_len MINLEN] [-max_len MAXLEN] [-ex_len EXLEN]
  [-spos {True,False}] [-rs {True,False}] [-con_storfs {True,False}] [-con_only {True,False}]
  [-ps {True,False}] [-wc {True,False}] [-short_storfs {False,Nolap,Olap}]
  [-short_storfs_only {True,False}] [-minorf MIN_ORF] [-maxorf MAX_ORF] [-codons STOP_CODONS]
  [-olap_filt [{none,single-strand,both-strand}]] [-start_filt {True,False}]
  [-so [{start_pos,strand}]] [-f_type [{StORF,CDS,ORF}]] [-non_standard NON_STANDARD]
  [-olap OVERLAP_NT] [-ao ALLOWED_OVERLAP] [-overwrite {True,False}] [-verbose {True,False}] [-v]

StORF-Reporter v1.4.3: StORF-Reporter Run Parameters.

Required Options:
  -anno [{Prokka,Bakta,Out_Dir,Multiple_Out_Dirs,Single_GFF,Multiple_GFFs,Ensembl,Feature_Types,Single_Genome,Multiple_Genomes,Single_Combined_GFF,Multiple_Combined_GFFs,Pyrodigal,Single_FASTA,Multiple_FASTA} ...]
      Select Annotation and Input options for one of the 3 options listed below
      ### Prokka/Bakta Annotation Option 1:
        Prokka = Report StORFs for a Prokka annotation;
        Bakta = Report StORFs for a Bakta annotation;
      --- Prokka/Bakta Input Options:
        Out_Dir = To provide the output directory of either a Prokka or Bakta run (will produce a new GFF and FASTA file containing original and extended annotations);
        Multiple_Out_Dirs = To provide a directory containing multiple Prokka/Bakta standard output directories - Will run on each sequentially;
        Single_GFF = To provide a single Prokka or Bakta GFF - searches for accompanying ".fna" file (will provide a new extended GFF);
        Multiple_GFFs = To provide a directory containing multiple Prokka or Bakta GFF files - searches for accompanying ".fna" files (will provide a new extended GFF);
      ### Standard GFF Annotation Option 2:
        Ensembl = Report StORFs for an Ensembl Bacteria annotation (ID=gene);
        Feature_Types = Used in conjunction with -gene_ident to define features such as CDS,rRNA,tRNA for UR extraction (default CDS);
      --- Standard GFF Input Options:
        Single_Genome = To provide a single Genome - accompanying FASTA must share same name as given gff file (can be .fna, .fa or .fasta);
        Multiple_Genomes = To provide a directory containing multiple accompanying GFF and FASTA files - files must share the same name (fasta can be .fna, .fa or .fasta);
        Single_Combined_GFF = To provide a GFF file with embedded FASTA at the bottom;
        Multiple_Combined_GFFs = To provide a directory containing multiple GFF files with embedded FASTA at the bottom;
      ### Complete Annotation Option 3:
        Pyrodigal = Run Pyrodigal then Report StORFs (provide path to single FASTA or directory of multiple FASTA files);
      --- Complete Annotation Input Options:
        Single_FASTA = To provide a single FASTA file;
        Multiple_FASTA = To provide a directory containing multiple FASTA files (will detect .fna,.fa,.fasta);
  -p PATH
      Provide input file or directory path

StORF-Reporter Options:
  -af ALT_FILENAME
      Default - Prokka/Bakta output directory share the same prefix with their gff/fna files - Use this option when Prokka/Bakta output directory name is different from the gff/fna files within and StORF-Reporter will search for the gff/fna with the given prefix (MyProkkaDir/"altname".gff) - Does not work with "Multiple_Out_Dirs" option
  -oname O_NAME
      Default - Appends '_StORF-Reporter_Extended' to end of input filename - Takes the directory name of Prokka/Bakta output if given as input or the input for -af if given - Multiple_* runs will be numbered
  -odir O_DIR
      Default - Same directory as input
  -sout {True,False}
      Default - False: Print out StORF sequences separately from Prokka/Bakta annotations
  -lw {True,False}
      Default - True: Line wrap FASTA sequence output at 60 chars
  -aa {True,False}
      Default - False: Report StORFs as amino acid sequences
  -gz {True,False}
      Default - False: Output as .gz

Pyrodigal Options:
  -py_train [{longest,individual,meta}]
      Default - longest: Type of model training to be done for Pyrodigal CDS prediction: Options: longest = Trains on longest contig; individual = Trains on each contig separately - runs in meta mode if contig is < 20KB; meta = Runs in meta mode for all sequences
  -py_fasta {True,False}
      Default - False: Output Pyrodigal+StORF predictions in FASTA format
  -py_unstorfed {True,False}
      Default - False: Provide GFF containing original Pyrodigal predictions

UR-Extractor Options:
  -gene_ident GENE_IDENT
      Default: "CDS". Specifies feature types to exclude from Unannotated Region extraction. Provide a comma-separated list of feature types, e.g., "misc_RNA,gene,mRNA,CDS,rRNA,tRNA,tmRNA,CRISPR,ncRNA,regulatory_region,oriC,pseudo", to identify annotated regions. - To be used with "-anno Feature_Types" - "-gene_ident Prokka" will select "most" features present in Prokka/Bakta annotations - Providing "ID=gene" will check the attribute column for features assigned as genes (compatible with Ensembl annotations). All regions without these feature types will be extracted as unannotated.
  -min_len MINLEN
      Default - 30: Minimum UR Length
  -max_len MAXLEN
      Default - 100,000: Maximum UR Length
  -ex_len EXLEN
      Default - 50: UR Extension Length

StORF-Finder Options:
  -spos {True,False}
      Default - False: Output StORF sequences and GFF positions inclusive of first stop codon - This can break some downstream tools if changed to True.
  -rs {True,False}
      Default - True: Remove stop "*" from StORF amino acid sequences
  -con_storfs {True,False}
      Default - False: Output Consecutive StORFs
  -con_only {True,False}
      Default - False: Only output Consecutive StORFs
  -ps {True,False}
      Default - False: Partial StORFs reported
  -wc {True,False}
      Default - False: StORFs reported across entire sequence
  -short_storfs {False,Nolap,Olap}
      Default - False: Run StORF-Finder in "Short-StORF" mode. Will only return StORFs between 30 and 120 nt that do not overlap longer StORFs - Only works with StORFs for now. "Nolap" will filter Short-StORFs which are overlapped by StORFs and Olap will report Short-StORFs which do overlap StORFs. Overlap is defined by "-olap".
  -short_storfs_only {True,False}
      Default - True. Only report Short-StORFs?
  -minorf MIN_ORF
      Default - 99: Minimum StORF size in nt
  -maxorf MAX_ORF
      Default - 60kb: Maximum StORF size in nt
  -codons STOP_CODONS
      Default - ('TAG,TGA,TAA'): List Stop Codons to use
  -olap_filt [{none,single-strand,both-strand}]
      Default - "both-strand": Filtering level "none" is not recommended, "single-strand" for single strand filtering and both-strand for both-strand longest-first tiling
  -start_filt {True,False}
      Default - False: Filter out StORFs without at least one of the 3 common start codons (best used for short-storfs).
  -so [{start_pos,strand}]
      Default - Start Position: How should StORFs be ordered when >1 reported in a single UR.
  -f_type [{StORF,CDS,ORF}]
      Default - "CDS": Which GFF feature type for StORFs to be reported as in GFF - "CDS" is probably needed for use in tools such as Roary and Panaroo
  -non_standard NON_STANDARD
      Default - 0.20: Reject StORFs with >=20% non-standard nucleotides (A,T,G,C) - Provide % as decimal
  -olap OVERLAP_NT
      Default - 50: Maximum number of nt of a StORF which can overlap another StORF.
  -ao ALLOWED_OVERLAP
      Default - 50 nt: Maximum overlap between a StORF and an original gene.

Misc:
  -overwrite {True,False}
      Default - False: Overwrite StORF-Reporter output if already present
  -verbose {True,False}
      Default - False: Print out runtime messages
  -v
      Print out version number and exit
###################################
UR-Extractor -f .../Test_Datasets/Matching_GFF_FASTA/E-coli.fa -gff .../Test_Datasets/Matching_GFF_FASTA/E-coli.gff
usage: UR_Extractor.py [-h] -gff GFF [-f FASTA] [-ident IDENT] [-min_len MINLEN] [-max_len MAXLEN]
  [-ex_len EXLEN] [-gene_ident GENE_IDENT] [-oname O_NAME] [-odir O_DIR] [-gz {True,False}]
  [-verbose {True,False}] [-v]

StORF-Reporter v1.4.3: UR-Extractor Run Parameters.

Required Arguments:
  -gff GFF
      GFF file containing genome annotation

Optional Arguments:
  -f FASTA
      Accompanying FASTA file if GFF file does not contain sequence data
  -ident IDENT
      Identifier given for Unannotated Region output sequences - Do not modify if output is to be used by StORF-Finder: Default "Sequence-ID"_UR
  -min_len MINLEN
      Minimum UR Length: Default 30
  -max_len MAXLEN
      Maximum UR Length: Default 100,000
  -ex_len EXLEN
      UR Extension Length on 5' and 3': Default 50
  -gene_ident GENE_IDENT
      Default: "CDS". Specifies feature types to exclude from Unannotated Region extraction. Provide a comma-separated list of feature types, e.g., "misc_RNA,gene,mRNA,CDS,rRNA,tRNA,tmRNA,CRISPR,ncRNA,regulatory_region,oriC,pseudo", to identify annotated regions. "-gene_ident Prokka" will select "most" features present in Prokka/Bakta annotations - Providing "ID=gene" will check the attribute column for features assigned as genes (compatible with Ensembl annotations). All regions without these feature types will be extracted as unannotated.

Output:
  -oname O_NAME
      Default - Appends '_UR' to end of input GFF filename
  -odir O_DIR
      Default - Same directory as input GFF
  -gz {True,False}
      Default - False: Output as .gz

Misc:
  -verbose {True,False}
      Default - False: Print out runtime messages
  -v
      Default - False: Print out version number and exit
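To make the -min_len, -max_len and -ex_len options concrete, the following is a minimal sketch of extracting unannotated regions from a list of annotated feature coordinates. It is not UR-Extractor's code: the function name `unannotated_regions`, the 1-based inclusive coordinates, and the choice to apply the length filter before the extension are all assumptions made for illustration.

```python
from typing import List, Tuple


def unannotated_regions(
    features: List[Tuple[int, int]],
    seq_len: int,
    min_len: int = 30,       # mirrors the documented -min_len default
    max_len: int = 100_000,  # mirrors the documented -max_len default
    ex_len: int = 50,        # mirrors the documented -ex_len default
) -> List[Tuple[int, int]]:
    """Return (start, end) gaps between features, 1-based inclusive.

    Assumption: gaps are length-filtered first and then extended by
    ex_len on both sides, clipped to the sequence boundaries.
    """
    gaps = []
    prev_end = 0
    for start, end in sorted(features):
        if start > prev_end + 1:
            gaps.append((prev_end + 1, start - 1))
        prev_end = max(prev_end, end)  # tolerate overlapping features
    if prev_end < seq_len:
        gaps.append((prev_end + 1, seq_len))

    regions = []
    for start, end in gaps:
        if min_len <= end - start + 1 <= max_len:
            regions.append((max(1, start - ex_len), min(seq_len, end + ex_len)))
    return regions
```

With features at 101-200 and 300-400 on a 1000 nt sequence, the three gaps (1-100, 201-299, 401-1000) all pass the length filter, and the extension lets each region reach up to 50 nt into the neighbouring features.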
StORF-Finder -f .../Test_Datasets/Matching_GFF_FASTA/E-coli_UR.fa
usage: StORF_Finder.py [-h] -f FASTA [-ua {True,False}] [-wc {True,False}] [-ps {True,False}]
  [-olap_filt [{none,single-strand,both-strand}]] [-start_filt {True,False}]
  [-con_storfs {True,False}] [-con_only {True,False}] [-short_storfs {False,Nolap,Olap}]
  [-short_storfs_only {True,False}] [-f_type [{StORF,CDS,ORF}]] [-minorf MIN_ORF]
  [-maxorf MAX_ORF] [-codons STOP_CODONS] [-non_standard NON_STANDARD] [-olap OVERLAP_NT]
  [-s SUFFIX] [-so [{start_pos,strand}]] [-oname O_NAME] [-odir O_DIR] [-gff {True,False}]
  [-aa {True,False}] [-aa_only {True,False}] [-lw {True,False}] [-spos {True,False}]
  [-stop_ident {True,False}] [-gff_fasta {True,False}] [-gz {True,False}]
  [-verbose {True,False}] [-v]

StORF-Reporter v1.4.3: StORF-Finder Run Parameters.

Required Arguments:
  -f FASTA
      Input FASTA File - (UR_Extractor output)

Optional Arguments:
  -ua {True,False}
      Default - Treat input as Unannotated: Use "-ua False" for standard fasta
  -wc {True,False}
      Default - False: StORFs reported across entire sequence
  -ps {True,False}
      Default - False: Partial StORFs reported
  -olap_filt [{none,single-strand,both-strand}]
      Default - "both-strand": Filtering level "none" is not recommended, "single-strand" for single strand filtering and both-strand for both-strand longest-first tiling
  -start_filt {True,False}
      Default - False: Filter out StORFs without at least one of the 3 common start codons (best used for short-storfs).
  -con_storfs {True,False}
      Default - False: Output Consecutive StORFs
  -con_only {True,False}
      Default - False: Only output Consecutive StORFs
  -short_storfs {False,Nolap,Olap}
      Default - False: Run StORF-Finder in "Short-StORF" mode. Will only return StORFs between 30 and 120 nt that do not overlap longer StORFs - Only works with StORFs for now. "Nolap" will filter Short-StORFs which are overlapped by StORFs and Olap will report Short-StORFs which do overlap StORFs. Overlap is defined by "-olap".
  -short_storfs_only {True,False}
      Default - True. Only report Short-StORFs?
  -f_type [{StORF,CDS,ORF}]
      Default - "StORF": Which GFF feature type for StORFs to be reported as in GFF
  -minorf MIN_ORF
      Default - 99: Minimum StORF size in nt
  -maxorf MAX_ORF
      Default - 60kb: Maximum StORF size in nt
  -codons STOP_CODONS
      Default - ('TAG,TGA,TAA'): List Stop Codons to use
  -non_standard NON_STANDARD
      Default - 0.20: Reject StORFs with >=20% non-standard nucleotides (A,T,G,C) - Provide % as decimal
  -olap OVERLAP_NT
      Default - 50: Maximum number of nt of a StORF which can overlap another StORF.
  -s SUFFIX
      Default - Do not append suffix to genome ID
  -so [{start_pos,strand}]
      Default - Start Position: How should StORFs be ordered when >1 reported in a single UR.

Output:
  -oname O_NAME
      Default - Appends '_StORF-Finder' to end of input FASTA filename
  -odir O_DIR
      Default - Same directory as input FASTA
  -gff {True,False}
      Default - True: Output a GFF file
  -aa {True,False}
      Default - False: Report StORFs as amino acid sequences
  -aa_only {True,False}
      Default - False: Only output Amino Acid Fasta
  -lw {True,False}
      Default - True: Line wrap FASTA sequence output at 60 chars
  -spos {True,False}
      Default - False: Output StORF sequences and GFF positions inclusive of first stop codon - This can break some downstream tools if changed to True.
  -stop_ident {True,False}
      Default - True: Identify Stop Codon positions with '*'
  -gff_fasta {True,False}
      Default - False: Report all gene sequences (nt) at the bottom of GFF files in Prokka output mode
  -gz {True,False}
      Default - False: Output as .gz

Misc:
  -verbose {True,False}
      Default - False: Print out runtime messages
  -v
      Default - False: Print out version number and exit
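The core idea behind a StORF, a region delimited by two in-frame stop codons, can be illustrated with a deliberately simplified scan. This sketch is not StORF-Finder's implementation: it only searches the forward strand, skips overlap filtering, and the function name `find_stop_to_stop` and the way length is measured (excluding the first stop, including the closing stop) are assumptions. The defaults mirror the documented -minorf, -maxorf and -codons values.

```python
from typing import List, Sequence, Tuple

STOP_CODONS = ("TAG", "TGA", "TAA")  # documented -codons default


def find_stop_to_stop(
    seq: str,
    min_orf: int = 99,       # documented -minorf default
    max_orf: int = 60_000,   # documented -maxorf default (60kb)
    stops: Sequence[str] = STOP_CODONS,
) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) tuples, 0-based half-open, forward strand.

    Each region starts just after one in-frame stop codon and ends at the
    end of the next in-frame stop codon (an assumption of this sketch).
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        prev_stop = None
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in stops:
                if prev_stop is not None:
                    start, end = prev_stop + 3, i + 3
                    if min_orf <= end - start <= max_orf:
                        found.append((start, end, frame))
                prev_stop = i
    return found
```

A full tool would also scan the reverse complement and resolve overlaps between candidates, which is what options such as -olap_filt and -olap control.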
Subpackage to extract sequences reported by StORF-Reporter from a genome annotation.
StORF-Extractor -storf_input Combined -p .../Test_Datasets/Combined_GFFs/E-coli_Combined_StORF-Reporter_Extended.gff
usage: StORF_Extractor.py [-h] [-storf_input {Combined,Separate}] [-p PATH] [-gff_out {True,False}]
  [-oname O_NAME] [-odir O_DIR] [-gz {True,False}] [-verbose {True,False}] [-v]

StORF-Reporter v1.4.3: StORF-Extractor Run Parameters.

Required Arguments:
  -storf_input {Combined,Separate}
      Are StORFs to be extracted from Combined GFF/FASTA or Separate GFF/FASTA files?
  -p PATH
      Provide input file or directory path

Output:
  -gff_out {True,False}
      Default - False: Output StORFs in GFF format
  -oname O_NAME
      Default - Appends '_Extracted_StORFs' to end of input GFF filename
  -odir O_DIR
      Default - Same directory as input FASTA
  -gz {True,False}
      Default - False: Output as .gz

Misc:
  -verbose {True,False}
      Default - False: Print out runtime messages
  -v
      Default - False: Print out version number and exit
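A "Combined" GFF is one with the FASTA sequence embedded at the bottom. In GFF3 this section is introduced by a `##FASTA` directive line, so the two parts can be separated as sketched below. This is an illustration only: `split_combined_gff` is a hypothetical helper, and whether StORF-Extractor splits files exactly this way is an assumption.

```python
from typing import List, Tuple


def split_combined_gff(text: str) -> Tuple[List[str], List[str]]:
    """Split combined GFF text into (annotation lines, FASTA lines).

    Assumes the GFF3 convention that everything after a line reading
    '##FASTA' is sequence data. The directive line itself is dropped.
    """
    gff_lines: List[str] = []
    fasta_lines: List[str] = []
    in_fasta = False
    for line in text.splitlines():
        if line.strip() == "##FASTA":
            in_fasta = True
            continue
        (fasta_lines if in_fasta else gff_lines).append(line)
    return gff_lines, fasta_lines
```

A file with no `##FASTA` line yields an empty FASTA list, which corresponds to the "Separate" case where the sequence lives in its own file.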
Subpackage to remove sequences reported by StORF-Reporter without a Blast/Diamond hit (any alignment in BLAST 6 format).
StORF-Remover -gff .../Test_Datasets/StORF_Extractor_And_Remover/Myco_UR_StORF-R.gff -blast .../Test_Datasets/StORF_Extractor_And_Remover/Myco_URs_StORFs_aa_Swiss.tab
usage: StORF_Remover.py [-h] [-gff GFF] [-blast BLAST] [-min_score MINSCORE] [-oname O_NAME]
  [-odir O_DIR] [-gz {True,False}] [-verbose {True,False}] [-v]

StORF-Reporter v1.4.1: UR-Remover Run Parameters.

Required Arguments:
  -gff GFF
      GFF annotation file for the FASTA
  -blast BLAST
      BLAST format 6 annotation file

Optional Arguments:
  -min_score MINSCORE
      Minimum BitScore to keep StORF: Default 30

Output:
  -oname O_NAME
      Default - Appends '_UR' to end of input GFF filename
  -odir O_DIR
      Default - Same directory as input GFF
  -gz {True,False}
      Default - False: Output as .gz

Misc:
  -verbose {True,False}
      Default - False: Print out runtime messages
  -v
      Default - False: Print out version number and exit
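The -min_score filter can be pictured as selecting query IDs whose best alignment reaches the bit score threshold. The sketch below assumes the default 12-column BLAST tabular layout (format 6), where the query ID is column 1 and the bit score is column 12. `ids_passing_bitscore` is a hypothetical helper for illustration, not StORF-Remover's code, and how the tool maps hit IDs back to GFF entries is not shown.

```python
import csv
import io
from typing import Set


def ids_passing_bitscore(blast_tab: str, min_score: float = 30.0) -> Set[str]:
    """Return query IDs with at least one hit at or above min_score.

    Assumes default BLAST format 6 columns: qseqid is column 1 and
    bitscore is column 12. Comment and short lines are skipped.
    """
    keep: Set[str] = set()
    for row in csv.reader(io.StringIO(blast_tab), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue
        if float(row[11]) >= min_score:
            keep.add(row[0])
    return keep
```

Any StORF whose ID is absent from the returned set has no hit meeting the threshold and would be a candidate for removal.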