NickJD/StORF-Reporter


Package to extract Unannotated Regions from prokaryotic genomes and report coding and pseudogenised genes delimited by stop codons - named StORFs (Stop - Open Reading Frames).

License

GPL-3.0 license


StORF-Reporter, a toolkit that returns missed CDS genes from the Unannotated Regions (URs) of prokaryotic genomes.

Please use `pip install StORF-Reporter` to install StORF-Reporter.

This will also install the Python dependencies NumPy (>=1.22.0,<1.24.0) and Pyrodigal (https://github.com/althonos/pyrodigal).

Consider using '--no-cache-dir' with pip to ensure the download of the newest version of StORF-Reporter.

The directory "Test_Datasets" is provided to confirm functionality of StORF-Reporter.
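A quick way to confirm which version pip actually installed is to query the package metadata from Python. The sketch below assumes only that the distribution is registered under the name used in the pip command above (`StORF-Reporter`); the helper name `installed_version` is hypothetical and not part of the package.

```python
from importlib import metadata


def installed_version(distribution="StORF-Reporter"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        # Not installed in the current environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "StORF-Reporter is not installed")
```

If the reported version is older than expected, reinstalling with '--no-cache-dir' as suggested above may help.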

#############################################################

StORF-Reporter:

Most common use cases -

Supplement a current annotation from a tool such as Prokka or Bakta. A new GFF file will be created compatible with downstream pangenome analysis tools such as Roary and Panaroo.

For use on a single Prokka/Bakta output directory - Will also create a new fasta file with Prokka/Bakta genes and StORF sequences.

StORF-Reporter -anno Prokka Out_Dir -p .../Test_Datasets/Prokka_E-coli/

For use on multiple Prokka/Bakta output directories - Will also create a new fasta file with Prokka/Bakta genes and StORF sequences.

StORF-Reporter -anno Prokka Multiple_Out_Dirs -p ../Test_Datasets/Multi_Prokka_Outs

For use on a directory containing multiple Prokka/Bakta output GFFs - Only produces new GFF files.

StORF-Reporter -anno Prokka Multiple_GFFs -p .../Test_Datasets/Prokka_Outputs/

For use on a GFF file from a CDS prediction tool such as Prodigal - Provide a GFF file and StORF-Reporter will find the matching .fa/.fasta/.fna (must have the same name).

StORF-Reporter -anno Feature_Types Single_Genome -p .../Test_Datasets/Matching_GFF_FASTA/Myco.gff

For use on a directory containing multiple GFF files from a CDS prediction tool such as Prodigal - StORF-Reporter will find the matching .fa/.fasta/.fna (must have the same name).

StORF-Reporter -anno Feature_Types Multiple_Genomes -p .../Test_Datasets/Matching_GFF_FASTA/
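The "must have the same name" rule for GFF and FASTA pairs can be checked before a run. The following is a minimal sketch, not StORF-Reporter's own lookup code: the function name `find_matching_fasta` and the order in which the extensions are tried are assumptions made for illustration.

```python
from pathlib import Path

# Extensions named in the README; the search order here is an assumption.
FASTA_EXTENSIONS = (".fna", ".fa", ".fasta")


def find_matching_fasta(gff_path):
    """Return the FASTA file sharing the GFF's name, or None if there is none."""
    gff_path = Path(gff_path)
    for ext in FASTA_EXTENSIONS:
        candidate = gff_path.with_suffix(ext)
        if candidate.is_file():
            return candidate
    return None
```

Running this over a directory of GFF files before calling the Multiple_Genomes mode gives an early list of genomes whose FASTA file is missing or misnamed.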

For use on a directory containing multiple GFF files with embedded FASTA.

StORF-Reporter -anno Feature_Types Multiple_Combined_GFFs -p .../Test_Datasets/Combined_GFFs/

To perform a fresh end-to-end annotation of a genome without an annotation, StORF-Reporter will use Pyrodigal to predict CDS genes and then supplement with StORFs.

StORF-Reporter -anno Pyrodigal Single_FASTA -p .../Test_Datasets/Pyrodigal/E-coli.fa
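When scripting many runs, it can help to validate the '-anno' pairing before launching anything. The sketch below builds the argument list for the commands shown above; the pairing table is taken from the three option groups in the help menu, while the helper name `build_command` is hypothetical.

```python
# Annotation type -> input options, as grouped in the StORF-Reporter help menu.
VALID_INPUTS = {
    "Prokka": {"Out_Dir", "Multiple_Out_Dirs", "Single_GFF", "Multiple_GFFs"},
    "Bakta": {"Out_Dir", "Multiple_Out_Dirs", "Single_GFF", "Multiple_GFFs"},
    "Ensembl": {"Single_Genome", "Multiple_Genomes",
                "Single_Combined_GFF", "Multiple_Combined_GFFs"},
    "Feature_Types": {"Single_Genome", "Multiple_Genomes",
                      "Single_Combined_GFF", "Multiple_Combined_GFFs"},
    "Pyrodigal": {"Single_FASTA", "Multiple_FASTA"},
}


def build_command(annotation, input_type, path):
    """Return the argument list for one StORF-Reporter run, or raise ValueError."""
    allowed = VALID_INPUTS.get(annotation)
    if allowed is None:
        raise ValueError(f"Unknown annotation type: {annotation}")
    if input_type not in allowed:
        raise ValueError(f"{input_type} is not an input option for {annotation}")
    return ["StORF-Reporter", "-anno", annotation, input_type, "-p", str(path)]
```

The returned list can be passed to `subprocess.run(..., check=True)` once StORF-Reporter is installed and on the PATH.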

Menu - (StORF-Reporter -h):

StORF-Reporter -anno Ensembl Single_Genome -p .../Test_Datasets/Matching_GFF_FASTA/E-coli.gff

usage: StORF_Reporter.py [-h]
                         [-anno [{Prokka,Bakta,Out_Dir,Multiple_Out_Dirs,Single_GFF,Multiple_GFFs,Ensembl,Feature_Types,Single_Genome,Multiple_Genomes,Single_Combined_GFF,Multiple_Combined_GFFs,Pyrodigal,Single_FASTA,Multiple_FASTA} ...]]
                         [-p PATH] [-af ALT_FILENAME] [-oname O_NAME]
                         [-odir O_DIR] [-sout {True,False}] [-lw {True,False}]
                         [-aa {True,False}] [-gz {True,False}]
                         [-py_train [{longest,individual,meta}]]
                         [-py_fasta {True,False}] [-py_unstorfed {True,False}]
                         [-gene_ident GENE_IDENT] [-min_len MINLEN]
                         [-max_len MAXLEN] [-ex_len EXLEN]
                         [-spos {True,False}] [-rs {True,False}]
                         [-con_storfs {True,False}] [-con_only {True,False}]
                         [-ps {True,False}] [-wc {True,False}]
                         [-short_storfs {False,Nolap,Olap}]
                         [-short_storfs_only {True,False}] [-minorf MIN_ORF]
                         [-maxorf MAX_ORF] [-codons STOP_CODONS]
                         [-olap_filt [{none,single-strand,both-strand}]]
                         [-start_filt {True,False}] [-so [{start_pos,strand}]]
                         [-f_type [{StORF,CDS,ORF}]]
                         [-non_standard NON_STANDARD] [-olap OVERLAP_NT]
                         [-ao ALLOWED_OVERLAP] [-overwrite {True,False}]
                         [-verbose {True,False}] [-v]

StORF-Reporter v1.4.3: StORF-Reporter Run Parameters.

Required Options:
  -anno [{Prokka,Bakta,Out_Dir,Multiple_Out_Dirs,Single_GFF,Multiple_GFFs,Ensembl,Feature_Types,Single_Genome,Multiple_Genomes,Single_Combined_GFF,Multiple_Combined_GFFs,Pyrodigal,Single_FASTA,Multiple_FASTA} ...]
                        Select Annotation and Input options for one of the 3 options listed below
                        ### Prokka/Bakta Annotation Option 1:
                        Prokka = Report StORFs for a Prokka annotation;
                        Bakta = Report StORFs for a Bakta annotation;
                        --- Prokka/Bakta Input Options:
                        Out_Dir = To provide the output directory of either a Prokka or Bakta run (will produce a new GFF and FASTA file containing original and extended annotations);
                        Multiple_Out_Dirs = To provide a directory containing multiple Prokka/Bakta standard output directories - Will run on each sequentially;
                        Single_GFF = To provide a single Prokka or Bakta GFF - searches for accompanying ".fna" file (will provide a new extended GFF);
                        Multiple_GFFs = To provide a directory containing multiple Prokka or Bakta GFF files - searches for accompanying ".fna" files (will provide a new extended GFF);
                        ### Standard GFF Annotation Option 2:
                        Ensembl = Report StORFs for an Ensembl Bacteria annotation (ID=gene);
                        Feature_Types = Used in conjunction with -gene_ident to define features such as CDS,rRNA,tRNA for UR extraction (default CDS);
                        --- Standard GFF Input Options:
                        Single_Genome = To provide a single Genome - accompanying FASTA must share same name as given gff file (can be .fna, .fa or .fasta);
                        Multiple_Genomes = To provide a directory containing multiple accompanying GFF and FASTA files - files must share the same name (fasta can be .fna, .fa or .fasta);
                        Single_Combined_GFF = To provide a GFF file with embedded FASTA at the bottom;
                        Multiple_Combined_GFFs = To provide a directory containing multiple GFF files with embedded FASTA at the bottom;
                        ### Complete Annotation Option 3:
                        Pyrodigal = Run Pyrodigal then Report StORFs (provide path to single FASTA or directory of multiple FASTA files);
                        --- Complete Annotation Input Options:
                        Single_FASTA = To provide a single FASTA file;
                        Multiple_FASTA = To provide a directory containing multiple FASTA files (will detect .fna,.fa,.fasta);
  -p PATH               Provide input file or directory path

StORF-Reporter Options:
  -af ALT_FILENAME      Default - Prokka/Bakta output directory share the same
                        prefix with their gff/fna files - Use this option when
                        Prokka/Bakta output directory name is different from
                        the gff/fna files within and StORF-Reporter will
                        search for the gff/fna with the given prefix
                        (MyProkkaDir/"altname".gff) - Does not work with
                        "Multiple_Out_Dirs" option
  -oname O_NAME         Default - Appends '_StORF-Reporter_Extended' to end of
                        input filename - Takes the directory name of
                        Prokka/Bakta output if given as input or the input for
                        -af if given - Multiple_* runs will be numbered
  -odir O_DIR           Default - Same directory as input
  -sout {True,False}    Default - False: Print out StORF sequences separately
                        from Prokka/Bakta annotations
  -lw {True,False}      Default - True: Line wrap FASTA sequence output at 60
                        chars
  -aa {True,False}      Default - False: Report StORFs as amino acid sequences
  -gz {True,False}      Default - False: Output as .gz

Pyrodigal Options:
  -py_train [{longest,individual,meta}]
                        Default - longest: Type of model training to be done
                        for Pyrodigal CDS prediction: Options: longest =
                        Trains on longest contig; individual = Trains on each
                        contig separately - runs in meta mode if contig is <
                        20KB; meta = Runs in meta mode for all sequences
  -py_fasta {True,False}
                        Default - False: Output Pyrodigal+StORF predictions in
                        FASTA format
  -py_unstorfed {True,False}
                        Default - False: Provide GFF containing original
                        Pyrodigal predictions

UR-Extractor Options:
  -gene_ident GENE_IDENT
                        Default: "CDS". Specifies feature types to exclude
                        from Unannotated Region extraction. Provide a
                        comma-separated list of feature types, e.g.,
                        "misc_RNA,gene,mRNA,CDS,rRNA,tRNA,tmRNA,CRISPR,ncRNA,regulatory_region,oriC,pseudo",
                        to identify annotated regions. - To be used with
                        "-anno Feature_Types" - "-gene_ident Prokka" will
                        select "most" features present in Prokka/Bakta
                        annotations - Providing "ID=gene" will check the
                        attribute column for features assigned as genes
                        (compatible with Ensembl annotations). All regions
                        without these feature types will be extracted as
                        unannotated.
  -min_len MINLEN       Default - 30: Minimum UR Length
  -max_len MAXLEN       Default - 100,000: Maximum UR Length
  -ex_len EXLEN         Default - 50: UR Extension Length

StORF-Finder Options:
  -spos {True,False}    Default - False: Output StORF sequences and GFF
                        positions inclusive of first stop codon - This can
                        break some downstream tools if changed to True.
  -rs {True,False}      Default - True: Remove stop "*" from StORF amino acid
                        sequences
  -con_storfs {True,False}
                        Default - False: Output Consecutive StORFs
  -con_only {True,False}
                        Default - False: Only output Consecutive StORFs
  -ps {True,False}      Default - False: Partial StORFs reported
  -wc {True,False}      Default - False: StORFs reported across entire
                        sequence
  -short_storfs {False,Nolap,Olap}
                        Default - False: Run StORF-Finder in "Short-StORF"
                        mode. Will only return StORFs between 30 and 120 nt
                        that do not overlap longer StORFs - Only works with
                        StORFs for now. "Nolap" will filter Short-StORFs which
                        are overlapped by StORFs and Olap will report
                        Short-StORFs which do overlap StORFs. Overlap is
                        defined by "-olap".
  -short_storfs_only {True,False}
                        Default - True. Only report Short-StORFs?
  -minorf MIN_ORF       Default - 99: Minimum StORF size in nt
  -maxorf MAX_ORF       Default - 60kb: Maximum StORF size in nt
  -codons STOP_CODONS   Default - ('TAG,TGA,TAA'): List Stop Codons to use
  -olap_filt [{none,single-strand,both-strand}]
                        Default - "both-strand": Filtering level "none" is not
                        recommended, "single-strand" for single strand
                        filtering and both-strand for both-strand
                        longest-first tiling
  -start_filt {True,False}
                        Default - False: Filter out StORFs without at least
                        one of the 3 common start codons (best used for
                        short-storfs).
  -so [{start_pos,strand}]
                        Default - Start Position: How should StORFs be ordered
                        when >1 reported in a single UR.
  -f_type [{StORF,CDS,ORF}]
                        Default - "CDS": Which GFF feature type for StORFs to
                        be reported as in GFF - "CDS" is probably needed for
                        use in tools such as Roary and Panaroo
  -non_standard NON_STANDARD
                        Default - 0.20: Reject StORFs with >=20% non-standard
                        nucleotides (A,T,G,C) - Provide % as decimal
  -olap OVERLAP_NT      Default - 50: Maximum number of nt of a StORF which
                        can overlap another StORF.
  -ao ALLOWED_OVERLAP   Default - 50 nt: Maximum overlap between a StORF and
                        an original gene.

Misc:
  -overwrite {True,False}
                        Default - False: Overwrite StORF-Reporter output if
                        already present
  -verbose {True,False}
                        Default - False: Print out runtime messages
  -v                    Print out version number and exit
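The '-non_standard' threshold is given as a decimal fraction, which is easy to misread as a percentage. The sketch below illustrates the stated rule (reject at >=20% characters other than A, T, G, C); it is an illustration of the documented behaviour, not StORF-Reporter's code, and both function names are hypothetical.

```python
def non_standard_fraction(seq):
    """Fraction of characters in seq that are not A, T, G or C (case-insensitive)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    standard = sum(upper.count(base) for base in "ATGC")
    return (len(upper) - standard) / len(upper)


def passes_non_standard_filter(seq, max_fraction=0.20):
    """Keep a sequence only if its non-standard fraction is below the threshold."""
    return non_standard_fraction(seq) < max_fraction
```

Under this reading, a 10 nt sequence with a single N (0.10) is kept, while a 5 nt sequence with a single N (0.20) is rejected.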

###################################

UR-Extractor:

Subpackage to extract Unannotated Regions from DNA sequences using FASTA and GFF files as input.

Menu - (UR-Extractor -h):

UR-Extractor -f .../Test_Datasets/Matching_GFF_FASTA/E-coli.fa -gff .../Test_Datasets/Matching_GFF_FASTA/E-coli.gff

usage: UR_Extractor.py [-h] -gff GFF [-f FASTA] [-ident IDENT]
                       [-min_len MINLEN] [-max_len MAXLEN] [-ex_len EXLEN]
                       [-gene_ident GENE_IDENT] [-oname O_NAME] [-odir O_DIR]
                       [-gz {True,False}] [-verbose {True,False}] [-v]

StORF-Reporter v1.4.3: UR-Extractor Run Parameters.

Required Arguments:
  -gff GFF              GFF file containing genome annotation

Optional Arguments:
  -f FASTA              Accompanying FASTA file if GFF file does not contain
                        sequence data
  -ident IDENT          Identifier given for Unannotated Region output
                        sequences - Do not modify if output is to be used by
                        StORF-Finder: Default "Sequence-ID"_UR
  -min_len MINLEN       Minimum UR Length: Default 30
  -max_len MAXLEN       Maximum UR Length: Default 100,000
  -ex_len EXLEN         UR Extension Length on 5' and 3': Default 50
  -gene_ident GENE_IDENT
                        Default: "CDS". Specifies feature types to exclude
                        from Unannotated Region extraction. Provide a
                        comma-separated list of feature types, e.g.,
                        "misc_RNA,gene,mRNA,CDS,rRNA,tRNA,tmRNA,CRISPR,ncRNA,regulatory_region,oriC,pseudo",
                        to identify annotated regions. "-gene_ident Prokka"
                        will select "most" features present in Prokka/Bakta
                        annotations - Providing "ID=gene" will check the
                        attribute column for features assigned as genes
                        (compatible with Ensembl annotations). All regions
                        without these feature types will be extracted as
                        unannotated.

Output:
  -oname O_NAME         Default - Appends '_UR' to end of input GFF filename
  -odir O_DIR           Default - Same directory as input GFF
  -gz {True,False}      Default - False: Output as .gz

Misc:
  -verbose {True,False}
                        Default - False: Print out runtime messages
  -v                    Default - False: Print out version number and exit
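To make the roles of '-min_len', '-max_len' and '-ex_len' concrete, here is a minimal sketch of gap extraction between annotated features. It is not UR-Extractor's implementation: the function name, the 0-based half-open coordinates, and the choice to apply the length filter before the extension are all assumptions made for illustration.

```python
def unannotated_regions(seq_len, features, min_len=30, max_len=100_000, ex_len=50):
    """Return extended (start, end) gaps between annotated features.

    Coordinates are 0-based and half-open. `features` is a list of
    (start, end) intervals already selected by feature type.
    """
    gaps = []
    cursor = 0
    for start, end in sorted(features):
        if start > cursor:
            gaps.append((cursor, start))
        # Overlapping or nested features only move the cursor forward.
        cursor = max(cursor, end)
    if cursor < seq_len:
        gaps.append((cursor, seq_len))

    regions = []
    for start, end in gaps:
        # Assumption: the length limits apply to the gap before it is extended.
        if min_len <= end - start <= max_len:
            regions.append((max(0, start - ex_len), min(seq_len, end + ex_len)))
    return regions
```

For a 1,000 nt contig with features at (100, 400), (350, 600) and (620, 900), this yields (0, 150) and (850, 1000); the 20 nt gap between 600 and 620 falls below the minimum length and is dropped. The extension lets a StORF that begins or ends inside a flanking gene still be captured.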

StORF-Finder:

Subpackage to extract StORFs from Fasta sequences - Works directly with the output of UR-Extractor.

Menu - (StORF-Finder -h):

StORF-Finder -f .../Test_Datasets/Matching_GFF_FASTA/E-coli_UR.fa

usage: StORF_Finder.py [-h] -f FASTA [-ua {True,False}] [-wc {True,False}]
                       [-ps {True,False}]
                       [-olap_filt [{none,single-strand,both-strand}]]
                       [-start_filt {True,False}] [-con_storfs {True,False}]
                       [-con_only {True,False}]
                       [-short_storfs {False,Nolap,Olap}]
                       [-short_storfs_only {True,False}]
                       [-f_type [{StORF,CDS,ORF}]] [-minorf MIN_ORF]
                       [-maxorf MAX_ORF] [-codons STOP_CODONS]
                       [-non_standard NON_STANDARD] [-olap OVERLAP_NT]
                       [-s SUFFIX] [-so [{start_pos,strand}]] [-oname O_NAME]
                       [-odir O_DIR] [-gff {True,False}] [-aa {True,False}]
                       [-aa_only {True,False}] [-lw {True,False}]
                       [-spos {True,False}] [-stop_ident {True,False}]
                       [-gff_fasta {True,False}] [-gz {True,False}]
                       [-verbose {True,False}] [-v]

StORF-Reporter v1.4.3: StORF-Finder Run Parameters.

Required Arguments:
  -f FASTA              Input FASTA File - (UR_Extractor output)

Optional Arguments:
  -ua {True,False}      Default - Treat input as Unannotated: Use "-ua False"
                        for standard fasta
  -wc {True,False}      Default - False: StORFs reported across entire
                        sequence
  -ps {True,False}      Default - False: Partial StORFs reported
  -olap_filt [{none,single-strand,both-strand}]
                        Default - "both-strand": Filtering level "none" is not
                        recommended, "single-strand" for single strand
                        filtering and both-strand for both-strand
                        longest-first tiling
  -start_filt {True,False}
                        Default - False: Filter out StORFs without at least
                        one of the 3 common start codons (best used for
                        short-storfs).
  -con_storfs {True,False}
                        Default - False: Output Consecutive StORFs
  -con_only {True,False}
                        Default - False: Only output Consecutive StORFs
  -short_storfs {False,Nolap,Olap}
                        Default - False: Run StORF-Finder in "Short-StORF"
                        mode. Will only return StORFs between 30 and 120 nt
                        that do not overlap longer StORFs - Only works with
                        StORFs for now. "Nolap" will filter Short-StORFs which
                        are overlapped by StORFs and Olap will report
                        Short-StORFs which do overlap StORFs. Overlap is
                        defined by "-olap".
  -short_storfs_only {True,False}
                        Default - True. Only report Short-StORFs?
  -f_type [{StORF,CDS,ORF}]
                        Default - "StORF": Which GFF feature type for StORFs
                        to be reported as in GFF
  -minorf MIN_ORF       Default - 99: Minimum StORF size in nt
  -maxorf MAX_ORF       Default - 60kb: Maximum StORF size in nt
  -codons STOP_CODONS   Default - ('TAG,TGA,TAA'): List Stop Codons to use
  -non_standard NON_STANDARD
                        Default - 0.20: Reject StORFs with >=20% non-standard
                        nucleotides (A,T,G,C) - Provide % as decimal
  -olap OVERLAP_NT      Default - 50: Maximum number of nt of a StORF which
                        can overlap another StORF.
  -s SUFFIX             Default - Do not append suffix to genome ID
  -so [{start_pos,strand}]
                        Default - Start Position: How should StORFs be ordered
                        when >1 reported in a single UR.

Output:
  -oname O_NAME         Default - Appends '_StORF-Finder' to end of input
                        FASTA filename
  -odir O_DIR           Default - Same directory as input FASTA
  -gff {True,False}     Default - True: Output a GFF file
  -aa {True,False}      Default - False: Report StORFs as amino acid sequences
  -aa_only {True,False}
                        Default - False: Only output Amino Acid Fasta
  -lw {True,False}      Default - True: Line wrap FASTA sequence output at 60
                        chars
  -spos {True,False}    Default - False: Output StORF sequences and GFF
                        positions inclusive of first stop codon - This can
                        break some downstream tools if changed to True.
  -stop_ident {True,False}
                        Default - True: Identify Stop Codon positions with '*'
  -gff_fasta {True,False}
                        Default - False: Report all gene sequences (nt) at the
                        bottom of GFF files in Prokka output mode
  -gz {True,False}      Default - False: Output as .gz

Misc:
  -verbose {True,False}
                        Default - False: Print out runtime messages
  -v                    Default - False: Print out version number and exit
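The core idea of a StORF, a region delimited by two in-frame stop codons, can be sketched in a few lines. This is a simplified illustration rather than StORF-Finder's algorithm: it scans the forward strand only, applies no overlap filtering, and measures length as the nucleotides strictly between the two stops, which is an assumption about how '-minorf' and '-maxorf' are counted. The function name is hypothetical.

```python
STOP_CODONS = ("TAG", "TGA", "TAA")


def find_stop_to_stop(seq, min_len=99, max_len=60_000, stops=STOP_CODONS):
    """Return (start, end, frame) for regions between consecutive in-frame stops.

    `start` is the first nt after the opening stop and `end` is the first nt
    of the closing stop (0-based, half-open). Forward strand only.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        previous_stop = None
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in stops:
                if previous_stop is not None:
                    start = previous_stop + 3
                    length = pos - start
                    if min_len <= length <= max_len:
                        found.append((start, pos, frame))
                previous_stop = pos
    return sorted(found)
```

A sequence built as `"TAA" + "GCA" * 33 + "TAG"` contains one candidate of exactly 99 nt in frame 0, so it is reported with the default minimum and dropped when `min_len` is raised to 100.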

StORF-Extractor:

Subpackage to extract sequences reported by StORF-Reporter from a genome annotation.

Menu - (StORF-Extractor -h):

StORF-Extractor -storf_input Combined -p .../Test_Datasets/Combined_GFFs/E-coli_Combined_StORF-Reporter_Extended.gff

usage: StORF_Extractor.py [-h] [-storf_input {Combined,Separate}] [-p PATH]
                          [-gff_out {True,False}] [-oname O_NAME]
                          [-odir O_DIR] [-gz {True,False}]
                          [-verbose {True,False}] [-v]

StORF-Reporter v1.4.3: StORF-Extractor Run Parameters.

Required Arguments:
  -storf_input {Combined,Separate}
                        Are StORFs to be extracted from Combined GFF/FASTA or
                        Separate GFF/FASTA files?
  -p PATH               Provide input file or directory path

Output:
  -gff_out {True,False}
                        Default - False: Output StORFs in GFF format
  -oname O_NAME         Default - Appends '_Extracted_StORFs' to end of input
                        GFF filename
  -odir O_DIR           Default - Same directory as input FASTA
  -gz {True,False}      Default - False: Output as .gz

Misc:
  -verbose {True,False}
                        Default - False: Print out runtime messages
  -v                    Default - False: Print out version number and exit
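A "Combined" input is a GFF file with the FASTA sequences embedded at the bottom; in GFF3 that section is conventionally introduced by a `##FASTA` directive. The sketch below shows one way to split such a file into feature rows and sequences for inspection. It is not StORF-Extractor's parser, and the function name is hypothetical.

```python
def split_combined_gff(text):
    """Split GFF3 text with an embedded ##FASTA section.

    Returns (features, sequences): a list of tab-split feature rows and a
    dict mapping each sequence ID to its full sequence.
    """
    features = []
    parts = {}
    in_fasta = False
    current = None
    for line in text.splitlines():
        if line.startswith("##FASTA"):
            in_fasta = True
            continue
        if in_fasta:
            if line.startswith(">"):
                header = line[1:].split()
                current = header[0] if header else ""
                parts[current] = []
            elif current is not None and line.strip():
                parts[current].append(line.strip())
        elif line and not line.startswith("#"):
            features.append(line.split("\t"))
    sequences = {name: "".join(chunks) for name, chunks in parts.items()}
    return features, sequences
```

Which rows correspond to StORFs depends on how they were written (for example the '-f_type' setting), so any selection on top of this should be checked against the actual output files in Test_Datasets.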

StORF-Remover:

Subpackage to remove sequences reported by StORF-Reporter without a Blast/Diamond hit (any alignment in BLAST 6 format).

Menu - (StORF-Remover -h):

StORF-Remover -gff .../Test_Datasets/StORF_Extractor_And_Remover/Myco_UR_StORF-R.gff -blast .../Test_Datasets/StORF_Extractor_And_Remover/Myco_URs_StORFs_aa_Swiss.tab

usage: StORF_Remover.py [-h] [-gff GFF] [-blast BLAST] [-min_score MINSCORE]
                        [-oname O_NAME] [-odir O_DIR] [-gz {True,False}]
                        [-verbose {True,False}] [-v]

StORF-Reporter v1.4.1: UR-Remover Run Parameters.

Required Arguments:
  -gff GFF              GFF annotation file for the FASTA
  -blast BLAST          BLAST format 6 annotation file

Optional Arguments:
  -min_score MINSCORE   Minimum BitScore to keep StORF: Default 30

Output:
  -oname O_NAME         Default - Appends '_UR' to end of input GFF filename
  -odir O_DIR           Default - Same directory as input GFF
  -gz {True,False}      Default - False: Output as .gz

Misc:
  -verbose {True,False}
                        Default - False: Print out runtime messages
  -v                    Default - False: Print out version number and exit
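The bit-score filter can be previewed on the alignment file before running StORF-Remover. The sketch below collects the query IDs that have at least one hit at or above the minimum bit score, assuming the default 12-column BLAST tabular layout (format 6), in which the bit score is the last column. It is an illustration only; how StORF-Remover matches these IDs to GFF entries is not shown here, and the function name is hypothetical.

```python
def ids_with_hits(blast6_text, min_score=30.0):
    """Return query IDs with at least one hit whose bit score is >= min_score."""
    keep = set()
    for line in blast6_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            # Not a default format 6 row; skip rather than guess the layout.
            continue
        if float(fields[11]) >= min_score:
            keep.add(fields[0])
    return keep
```

Comparing the size of this set with the number of StORFs in the GFF gives a rough idea of how many entries a given '-min_score' would retain.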

Test Datasets:

The directory 'Test_Datasets' contains GFF and FASTA files to test the installation and use of StORF-Reporter - Example output files are also provided for comparison.


Latest release (Oct 23, 2024) - v1.4.2: Cleaned up some error messages, removed ORForise as a dependency, fixed the -spos parameter and corrected GFF positional information in some situations.


StORF-Reporter has now been published in NAR: https://doi.org/10.1093/nar/gkad814