DBTC is an R implementation of Dada2 and a BLAST approach to metabarcoding analysis.
DBTC provides molecular analysis of high-throughput metabarcode sequence data (fastq files) and taxonomic assignment of unique reads (fasta format files) using the Basic Local Alignment Search Tool (BLAST), with R used to reduce the obtained results (see the additional documentation at DBTC). The Dada-BLAST-TaxonAssign-Condense ('DBTC') package contains the foundational 'DBTC' functions run through the R command line (but also see 'DBTCShiny' and the DBTCShinyTutorial). The 'DBTC' functions have four main outcomes…
NOTE: While the DBTC package has been built for the analysis of high-throughput sequencing results, the BLAST, taxonomic assignment, and taxonomic condense functions can also be used with single-specimen Sanger sequencing data.
Also see DBTCShiny and the associated DBTCShinyTutorial.
DBTC can be installed three ways.
    install.packages('DBTC')
Run the following commands in your R terminal…
    if(!require(devtools)) install.packages('devtools')
    library('devtools')
    devtools::install_github('rgyoung6/DBTC')
    library('DBTC')

Note: the first command to install "devtools" may not be necessary if it is already installed.
Navigate to the DBTC GitHub page. Download the files associated with this page to your local computer and place them somewhere in the main file folder named DBTC. Then run the following command, replacing HERE with the path to that location on your local computer…
library("DBTC", lib.loc="HERE")Follow the instructions on the NCBIBLAST+executables page to obtain a local version of the BLAST tools. The listof the latest installation files can be foundhere.
Note: It is best to download and install the most recent versions of the BLAST+ suite to your computer and place the programs in your computer's PATH so you can access the programs from any folder. However, the program files for both blastn and makeblastdb have been included on the DBTCShinyTutorial GitHub page for ease of use (please note that these may not be the most recent versions).
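To confirm that the BLAST+ programs are reachable from R once installed, a quick check such as the following can be used (a minimal sketch; base R only):

```
# Check whether the BLAST+ programs are visible on the system PATH.
# An empty string ("") means the program was not found and the full path to
# the executable will need to be supplied to the DBTC functions instead.
Sys.which("blastn")
Sys.which("makeblastdb")
```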
The R package taxonomizr is used to establish an NCBI taxaID database (NOTE: this package is also required when using the taxon assignment elements in the DBTC pipeline).
    install.packages('taxonomizr')
    library('taxonomizr')

NCBI BLASTable databases can be established through two methods.
Download your desired preformatted NCBI database by using 'update_blastdb.pl' (found in the NCBI BLAST+ local install folder). NOTE: the Perl programming language needs to be installed on your local machine. Instructions can be found at Get NCBI BLAST databases.
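As an illustration of this first method, a preformatted database could be fetched from within R as follows (a sketch only, assuming Perl and the BLAST+ bin directory are on the PATH; the 'nt' database name is just an example):

```
# Example only: download and decompress the preformatted NCBI 'nt' database
# using the update_blastdb.pl script shipped with BLAST+ (requires Perl).
# Run from the directory where the database files should be stored.
system2("perl", args = c("update_blastdb.pl", "--decompress", "nt"))
```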
You can also download your desired preformatted NCBI database manually with instructions at the BLAST FTP Site and a list of the available databases at Index of /blast/db.
In addition to the NCBI resources, DBTC can also use custom databases. To establish these databases you will require a fasta file with the desired records with MACER formatted headers. The MACER R package and instructions can be found at either of the two locations:
MACER GitHub (will have the most recent version and development versions)
In the 'Preparation' section of the taxonomizr website, use the instructions and the prepareDatabase('accessionTaxa.sql', getAccessions = FALSE) taxonomizr command to establish a local taxonomy database.
    prepareDatabase('accessionTaxa.sql', getAccessions = FALSE)

Note: Along with accessionTaxa.sql, two other files, nodes.dmp and names.dmp, are downloaded. These two files are not necessary for downstream analyses and can be deleted.
The ShortRead package is required to run elements of the DBTC pipeline and can be obtained through Bioconductor.
The dada2 package is the main package used to process the raw fastq files and can be obtained from Bioconductor. There is also a dada2 GitHub resource.
    if (!require('BiocManager', quietly = TRUE)) install.packages('BiocManager')
    BiocManager::install('ShortRead')
    BiocManager::install('dada2')
    library(dada2)

Each of the below CRAN packages and their dependencies are required for the DBTC package.
    install.packages(c('ggplot2', 'parallel', 'pbapply', 'plyr', 'stats', 'taxonomizr', 'utils'))
    lapply(c('ggplot2', 'parallel', 'pbapply', 'plyr', 'stats', 'taxonomizr', 'utils'), library, character.only = TRUE)

After installation of DBTC and all of its dependencies you need to load the package using the following commands…
    library(DBTC)

The dada_implement() function takes fastq files as input, analyses them, and produces amplicon sequence variant (ASV) files. This function requires a main directory containing folder(s) representing sequencing runs which in turn contain fastq files (the location of one of the fastq files in one of the sequencing run folders is used as an input argument). A run is a group of results processed at the same time on the same machine representing the same molecular methods. All sequencing folders in the main directory need to represent data from sequencing runs that have used the same primers and protocols. Output from this function includes all processing files and final main output files in the form of fasta files and ASV tables.
The DBTC combine_dada_output() function uses ASV output files ('YYYY_MM_DD_HH_MM_UserInputRunName_Merge' and/or 'YYYY_MM_DD_HH_MM_UserInputRunName_MergeFwdRev') and combines them into a single ASV table with an accompanying fasta file. This function also produces a file containing the processing information for the function. The main input argument for this function is the location of a file in a folder containing all of the ASV tables to be combined. Output files are generated with the naming convention 'YYYY_MM_DD_HH_MM_combinedDada'.
This function takes a fasta file with headers in the MACER format (Young et al. 2021) and establishes a database upon which a BLAST search can be completed. However, if an NCBI sequence database is desired, it is advisable to use, where applicable, NCBI preformatted databases and skip the make_BLAST_DB() function (https://www.ncbi.nlm.nih.gov/books/NBK62345/#blast_ftp_site.The_blastdb_subdirectory). The outcome of the function is a folder with a BLASTable NCBI-formatted sequence database.
The MACER fasta header format: `>UniqueID|OtherInformation|Genus|species|OtherInformation|Marker`
## seq_BLAST()

[Fasta](https://en.wikipedia.org/wiki/FASTA_format) file(s) are used as input along with a user-selected NCBI formatted database upon which query sequences will be searched using BLAST. The outcome of the function is two files, a BLAST run file and a single file containing all of the BLAST results in tab-delimited format. There are no headers in the BLAST results file but the columns (in order left to right) are: query sequence ID, search sequence ID, search taxonomic ID, query to sequence coverage, percent identity, search scientific name, search common name, query start, query end, search start, search end, e-value.

## taxon_assign()

This function takes a BLAST result file and associated [fasta](https://en.wikipedia.org/wiki/FASTA_format) files (either on their own or with accompanying [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) files generated from the dada_implement() function) and collapses the multiple BLAST results into a single result for each query sequence. When an [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) table is present the taxonomic results will be combined with the [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) table.

## combine_assign_output()

The combine_assign_output() function takes a file selection and then uses all DBTC [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) taxon assign files ('_taxaAssign_YYYY_MM_DD_HHMM.tsv') in a selected directory and combines them into a single output 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' file. The files being combined should represent different samples whose data have all come from analyses using the same molecular methods, the same analysis arguments, and the same molecular sequence databases.

## reduce_taxa()

To reduce [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) results to unique taxa per file, the reduce_taxa() function takes a file selection and then uses all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' and/or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files in that directory. This function then reduces all [ASVs](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) with the same taxonomic assignment into a single result and places these results in a '_taxaReduced_YYYY_MM_DD_HHMM.tsv' file for each of the target files in the directory.

## combine_reduced_output()

This function takes a file selection and then uses all '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files in that directory and combines them into a single 'YYYY_MM_DD_HHMM_CombineTaxaReduced.txt' taxa table file with presence-absence results. The files being combined should represent the same biological samples but with different molecular marker information. The output [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) table can include read numbers or be reduced to presence-absence results.

([Back to Top](#table-of-contents))

***

# Naming convention rules

WARNING - NO WHITESPACE!

When running DBTC functions the paths for the files selected cannot have white space! File folder locations should be as short as possible (close to the [root](https://en.wikipedia.org/wiki/Root_directory) directory) as some functions do not process long naming conventions. Also, special characters should be avoided (including question mark, number sign, exclamation mark). It is recommended that dashes be used for separations in naming conventions while retaining underscores for use as information delimiters (this is how DBTC functions use underscores).
There are several key character strings used in the DBTC pipeline; the presence of these strings in file or folder names will cause errors when running DBTC functions. The following strings are those used in DBTC and should not be used in file or folder naming:

- _BLAST
- _combinedDada
- _taxaAssign
- _taxaAssignCombined
- _taxaReduced
- _CombineTaxaReduced

([Back to Top](#table-of-contents))

***

# Package Function Details

## Dada Implement

dada_implement() - Process metabarcode raw fastq files by [runs](#runs) using Dada2 (Note: molecular markers are independently analysed and combined at the end of the analysis pipeline using the [combine_reduced_output()](#combine-reduced-output) function).

### Input

Two files are required as input for the dada_implement() function. The first is the fastq files in the appropriate folder structure (see below) and the second is a file containing the primers used for the amplification of the sequence reads (tab separated file, see [DBTCShinyTutorial](https://github.com/rgyoung6/DBTCShinyTutorial) for file examples). To select all of the fastq files simply select one file in one of the [Run](#runs) directories to point to the desired data files, as long as the file and folder structure is correct.

**Fastq File Folder Structure**

    Parent Directory
        |
        |--- Run1 Directory
        |        - Fastq
        |        - Fastq
        |        ...
        |
        |--- Run2 Directory
                 - Fastq
                 - Fastq
                 ...

**Format of the primer file**

| Forward | Reverse |
| :------------- | :------------- |
| AGTGTGTAGTGATTG | CGCATCGCTCAGACTGACTGC |
| GAGCCCTCGATCGCT | GGTCGATAGCTACGCGCGCATACGACT |
| | GGTTCACATCGCATTCAT |

### Arguments

- <strong>runFolderLoc -</strong> Select a file in one of the [run](#runs) folders with the fastq files of interest (Default NULL).
- <strong>primerFile -</strong> Select a file with the primers for this analysis (Default NULL).
- <strong>fwdIdent -</strong> Forward identifier naming string (Default '_R1_001').
- <strong>revIdent -</strong> Reverse identifier naming string (Default '_R2_001').
- <strong>bidirectional -</strong> Selection to process paired forward and reverse sequences for analysis (Default TRUE).
- <strong>unidirectional -</strong> Selection to process files independently (Default FALSE).
- <strong>printQualityPdf -</strong> Selection to save image files showing quality metrics (Default TRUE).
- <strong>maxPrimeMis -</strong> Maximum number of mismatches allowed when pattern matching trimming the primers from the ends of the reads for the ShortRead trimLRPatterns() function (Default 2).
- <strong>fwdTrimLen -</strong> Select a forward trim length for the Dada filterAndTrim() function (Default 0).
- <strong>revTrimLen -</strong> Select a reverse trim length for the Dada filterAndTrim() function (Default 0).
- <strong>maxEEVal -</strong> Maximum number of expected errors allowed in a read for the Dada filterAndTrim() function (Default 2).
- <strong>truncQValue -</strong> Truncation value used to trim the ends of reads; nucleotides with quality values less than this value will be used to trim the remainder of the reads for the Dada filterAndTrim() function (Default 2).
- <strong>truncLenValueF -</strong> Dada forward length trim value for the Dada filterAndTrim() function. This value is set to 0 when the pattern matching trim function is enabled (Default 0).
- <strong>truncLenValueR -</strong> Dada reverse length trim value for the Dada filterAndTrim() function. This value is set to 0 when the pattern matching trim function is enabled (Default 0).
- <strong>error -</strong> Percent of fastq files used to assess error rates for the Dada learnErrors() function (Default 0.1).
- <strong>nbases -</strong> The total number of bases used to assess errors for the Dada learnErrors() function (Default 1e80). NOTE: this value is set very high to use all nucleotides in the error assessment file subset. If the error is to be assessed using total reads and not specific fastq files, then set error to 1 and set this value to the desired number of reads.
- <strong>maxMismatchValue -</strong> Maximum number of mismatches allowed when merging two reads for the Dada mergePairs() function (Default 2).
- <strong>minOverlapValue -</strong> Minimum number of overlapping nucleotides for the forward and reverse reads for the Dada mergePairs() function (Default 12).
- <strong>trimOverhang -</strong> Trim merged reads past the start of the complementary primer regions for the Dada mergePairs() function (Default FALSE).
- <strong>minFinalSeqLen -</strong> The minimum final desired length of the read (Default 100).
- <strong>verbose -</strong> If set to TRUE then there will be output to the R console; if FALSE then this reporting data is suppressed (Default TRUE).

### Output

The output files from this function appear in four folders. See the diagram below for the saved file structure.

```
Parent Directory
    |
    |--- Run1 Directory
    |        - Fastq
    |        - Fastq
    |        ...
    |        |
    |        |--- YYYY_MM_DD_HHMM_Run1_A_Qual
    |        |        - fwdQual.pdf
    |        |        - revQual.pdf
    |        |        ...
    |        |--- YYYY_MM_DD_HHMM_Run1_B_Filt
    |        |        - fwdFilt.fastq
    |        |        - revFilt.fastq
    |        |        ...
    |        |        |--- Primer_Trim
    |        |                 - primeTrim.fastq
    |        |                 ...
    |        |--- YYYY_MM_DD_HHMM_Run1_C_FiltQual
    |        |        - filtFwdQual.pdf
    |        |        - filtRevQual.pdf
    |        |        ...
    |        |--- YYYY_MM_DD_HHMM_Run1_D_Output
    |                 - dadaSummary.txt
    |                 - dadaSummaryTable.tsv
    |                 - ErrorForward.pdf
    |                 - ErrorReverse.pdf
    |                 - Merge.tsv
    |                 - Merge.fas
    |                 - MergeFwdRev.tsv
    |                 - MergeFwdRev.fas
    |                 - TotalTable.tsv
    |
    |--- Run2 Directory
             - Fastq
             - Fastq
             ...
```
Quality PDFs in the A_Qual folder represent the quality metrics for the raw fastq files (this folder may not be present if printQualityPdf is set to FALSE). Files in the B_Filt folder represent the primer-trimmed reads (in the Primer_Trim subfolder) and the trimmed and cleaned reads in the larger folder. Quality PDFs in the C_FiltQual folder represent the quality metrics for the trimmed and cleaned fastq files (this folder may not be present if printQualityPdf is set to FALSE). There are numerous files in the D_Output folder, including the dadaSummary.txt and dadaSummaryTable.tsv processing summaries, the ErrorForward.pdf and ErrorReverse.pdf error assessment plots, and the ASV tables (Merge, MergeFwdRev, TotalTable) with their accompanying fasta files (see the directory structure above).
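As a minimal sketch of running the function with the arguments described above (the paths are placeholders; all arguments not shown keep the defaults listed above):

```
library(DBTC)

# Point to one fastq file inside one of the run folders and to the
# tab-separated primer file; the remaining arguments keep their defaults.
# The paths below are placeholders and must be replaced with real locations.
dada_implement(runFolderLoc = "Parent-Directory/Run1-Directory/Sample-01_R1_001.fastq",
               primerFile   = "Parent-Directory/primer-file.tsv")
```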
([Back to Top](#table-of-contents))

***
combine_dada_output() - Combine Dada2 ASV tables (YYYY_MM_DD_HHMM_FileName_MergeFwdRev.tsv OR YYYY_MM_DD_HHMM_FileName_Merge.tsv) into a single ASV output file.
Two or more files to be combined are required as input for this function. These files need to be ASV files as outputted from dada_implement() and can include Merge, MergeFwdRev, or TotalTable.tsv files.
The output from this function includes three files:
1. YYYY_MM_DD_HHMM_combinedDada.tsv - combined ASV table
2. YYYY_MM_DD_HHMM_combinedDada.fas - combined fasta file
3. YYYY_MM_DD_HHMM_combinedDada.txt - summary file from the combine_dada_output() function
Outputted data files will come in the same ASV table format as the output dada_implement() ASV files.
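A minimal sketch of a call, assuming the single input is the location of one of the ASV files in the folder to be combined (the fileLoc argument name is an illustrative assumption; check ?combine_dada_output for the actual signature):

```
library(DBTC)

# Select one of the Merge/MergeFwdRev/TotalTable ASV files in the folder
# containing all of the ASV tables to be combined.
# NOTE: the argument name fileLoc is an assumption used for illustration only.
combine_dada_output(fileLoc = "Dada-Output/2024_01_01_1200_Run1_Merge.tsv")
```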
([Back to Top](#table-of-contents))

***
make_BLAST_DB() - Create a local database to BLAST against.
This function takes a fasta file (in MACER format) and establishes a database upon which a BLAST search can be completed. The outcome of the function is a folder with an NCBI-formatted database.
- The MACER fasta header format -> `>UniqueID|OtherInformation|Genus|species|OtherInformation|Marker`
- An example of the header format output from the MACER program is `>GenBankAccessionOrBOLDID|GenBankAccession|Genus|species|UniqueID|Marker`
The output from this function includes a folder with the BLAST database named according to the submitted dbName.
The constructed database can then be used with the seq_BLAST() function.
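A sketch of building a custom database from a MACER-formatted fasta file (dbName is referenced above; the other argument names and paths are illustrative assumptions, so check ?make_BLAST_DB for the actual signature):

```
library(DBTC)

# Build a BLASTable database from a MACER-formatted fasta file.
# NOTE: fileLoc, makeblastdbPath and taxaDBLoc are assumed argument names used
# for illustration only; dbName is referenced in the documentation above.
make_BLAST_DB(fileLoc         = "Custom-DB/my-records.fas", # MACER-formatted fasta
              makeblastdbPath = "makeblastdb",              # BLAST+ program (on the PATH)
              taxaDBLoc       = "accessionTaxa.sql",        # taxonomizr database
              dbName          = "my-custom-DB")
```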
seq_BLAST() - Search fasta files of unknown sequences against a BLAST formatted database.
Provide a location for the BLAST database you would like to use by selecting a file in the target directory (this can be a database built using the make_BLAST_DB() function or a preformatted NCBI BLAST database). Then provide the location of the query sequence files by indicating a file in a directory that contains the fasta files. Provide the path for the BLAST+ blastn program. Finally, provide the minimum query sequence length to BLAST (Default = 100), the maximum depth of the BLAST returned results (Default = 200), and the number of cores with which to process the function (Default = 1; the Windows implementation can only use a single core and will default to this value when running on Windows).
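As a sketch of a call built from the inputs described above (only the BLASTResults argument name appears elsewhere in this document; the other argument names and paths are illustrative assumptions, so check ?seq_BLAST for the actual signature):

```
library(DBTC)

# Search query fasta file(s) against a local BLAST-formatted database.
# NOTE: databasePath, querySeqPath, blastnPath, minLen and cores are assumed
# argument names used for illustration only; BLASTResults is referenced in the
# taxon_assign() documentation below.
seq_BLAST(databasePath = "my-custom-DB/my-custom-DB.nsq", # a file in the database folder
          querySeqPath = "Dada-Output/2024_01_01_1200_combinedDada.fas",
          blastnPath   = "blastn",  # BLAST+ blastn program (on the PATH)
          minLen       = 100,       # minimum query sequence length to BLAST
          BLASTResults = 200,       # maximum depth of the returned results
          cores        = 1)         # Windows can only use a single core
```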
Two files are produced from this function, a BLAST run file and a BLAST results file, for each of the fasta files in the target directory.
The BLAST run file contains the command used to run the BLAST search. The BLAST results file includes all results in a tab-delimited .tsv file format with the columns qseqid, sseqid, staxid, qcovs, pident, ssciname, scomname, qstart, qend, sstart, send, evalue.
taxon_assign() - Using BLAST results to construct a table with taxonomic assignments for each unique sequence.
This function requires a BLAST output file and an associated fasta file. In addition, if present, an ASV file will also be combined with the taxonomic results. The function will take the BLAST results and reduce the taxonomic assignment to a single result for each read.
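A sketch of a call based on the description above (every argument name shown is an illustrative assumption rather than the confirmed interface; check ?taxon_assign for the actual signature):

```
library(DBTC)

# Collapse the BLAST results for each read into a single taxonomic assignment.
# NOTE: all argument names below are assumptions used for illustration only.
taxon_assign(fileLoc   = "BLAST-Output/2024_01_01_1200_combinedDada_BLAST.tsv",
             taxaDBLoc = "accessionTaxa.sql", # taxonomizr database (see above)
             coverage  = 95,                  # assumed coverage threshold
             ident     = 95)                  # assumed identity threshold
```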
A single taxonomic assignment file is created for each BLAST output file with the naming convention '_taxaAssign_YYYY_MM_DD_HHMM.tsv'. In addition, a run file is also generated which contains the run details, with the naming convention 'YYYY_MM_DD_HHMM_taxaAssign.txt'.
The number of returned BLAST results is dictated by the seq_BLAST() BLASTResults argument. The taxon_assign() function takes into account all returned BLAST results for each read. At each taxonomic level, assignments have quality metrics in parentheses after the name. These values ("Num_Rec", "Coverage", "Identity", "Max_eVal") represent the number of records with this taxonomic placement, the minimum coverage and identity, and the maximum eValue for the reported taxa.
Column headers for the resulting taxonomic assignments include…
uniqueID, superkingdom, phylum, class, order, family, genus, species, Top_BLAST, Lowest_Single_Rank, Lowest_Single_Taxa, Lowest_Single_Rank_Above_Thres, Lowest_Single_Taxa_Above_Thres, Final_Common_Names, Final_Rank, Final_Taxa, Final_Rank_Taxa_Thres, Result_Code, Sequence, Length, Results, followed by columns of samples containing ASV values.
There are three columns that deserve special explanation.
The Final_Rank_Taxa_Thres column contains the threshold values (Coverage, Identity) applied to the final rank and taxonomic values for the associated records.
The Results column contains the reference to the record indicating whether it was from a merged, forward, or reverse analysis result.
The Result_Code column contains flags placed on the results to better understand the quality of the resulting taxonomic assignments. The codes used include BIRT, BCRT, TBAT, SANF, and SFAT, discussed in the note below.
Note: Records with BIRT and BCRT flags should be highly scrutinized. TBAT flags are also concerning in that they may represent a less specific taxonomic placement due to the moving of the result to a higher taxonomic placement; the taxonomic rank directly below the final reported rank should be reviewed as there may be potential to adjust the final taxonomic assignment. SANF results should be explored, and the size of the database, the trust placed in the records in the database, and the depth of the BLAST results should be considered when assessing records with this flag. Records with SFAT are among the least concerning as the BLAST results were saturated, but this taxonomic assignment saturation occurred above your set quality coverage and identity threshold. Concerns with records with this result could be that the depth of the BLAST analysis was not deep enough for very large databases, or that the database is not complete (taxonomic breadth) when using smaller databases.
combine_assign_output() - Using results from the taxon_assign() function, combines all files with the string '_taxaAssign_YYYY_MM_DD_HHMM.tsv' into a single .tsv file.
Select a file in a folder with the taxa assigned files you would like to combine (extension '_taxaAssign_YYYY_MM_DD_HHMM.tsv'). NOTE: all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files in the folder location should originate from the same dada output file but have outputs from different BLAST sequence libraries and therefore contain the same ASVs.
This function produces a 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' and a 'YYYY_MM_DD_HHMM_taxaAssignCombined.txt' file in the selected target directory.
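A minimal sketch of a call, assuming the single required input is the location of one of the taxa assign files in the target folder (the fileLoc argument name is an illustrative assumption; check ?combine_assign_output for the actual signature):

```
library(DBTC)

# Combine all '_taxaAssign_...' files in the selected folder into one table.
# NOTE: the argument name fileLoc is an assumption used for illustration only.
combine_assign_output(fileLoc = "Taxa-Assign/combinedDada_taxaAssign_2024_01_01_1300.tsv")
```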
The interpretation of the output file for the combine_assign_output() 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files is the same as that for the taxon_assign() '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files.
reduce_taxa() - Using results from taxon_assign() and/or combine_assign_output(), this function combines all reads with the same taxonomic assignment into a single result for each of the files submitted.
This function requires a file in a directory where all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' and/or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files in that directory will be combined. All records with the same taxonomic result will be combined. The BLAST values in parentheses ("Num_Rec", "Coverage", "Identity", "Max_eVal") are combined as the mean number of records, the mean of the minimum coverage and identity values, and the mean of the maximum eValues.
This function produces a '_taxaReduced_YYYY_MM_DD_HHMM.tsv' file for every '_taxaAssign_YYYY_MM_DD_HHMM.tsv' or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' file present in the target directory.
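A minimal sketch of a call, again assuming a single file-location argument (the fileLoc name is an illustrative assumption; check ?reduce_taxa for the actual signature):

```
library(DBTC)

# Reduce all ASVs with the same taxonomic assignment to a single result for
# every taxaAssign/taxaAssignCombined file in the selected folder.
# NOTE: the argument name fileLoc is an assumption used for illustration only.
reduce_taxa(fileLoc = "Taxa-Assign/2024_01_01_1300_taxaAssignCombined.tsv")
```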
Reduced taxonomic assignment files have fewer columns in the main taxa_reduced.tsv file than the '_taxaAssign_YYYY_MM_DD_HHMM.tsv' or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files, as columns are collapsed. In addition, the values in parentheses in the taxonomic columns represent the average values across all of the results with the same taxonomic assignment (see the taxon_assign() interpretation).
The columns include: superkingdom, phylum, class, order, family, genus, species, Top_BLAST, Final_Common_Names, Final_Rank, Final_Taxa, Result_Code, RepSequence, Number_ASV, Average_ASV_Length, Number_Occurrences, Average_ASV_Per_Sample, Median_ASV_Per_Sample, and Results. NOTE: The representative sequence (RepSequence) is the longest sequence for each of the collapsed taxa assigned.
combine_reduced_output() - This function takes '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files generated from the same biological samples but representing different amplified molecular markers and collapses these data into a single file. The outcome of this process is a presence/absence matrix for all taxa and markers.
Select a file in a folder with '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files representing data for the same biological samples but representing different amplified molecular markers.
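A minimal sketch of a call, assuming a single file-location argument (the fileLoc name is an illustrative assumption; check ?combine_reduced_output for the actual signature):

```
library(DBTC)

# Combine reduced-taxa files for different markers from the same samples into
# a single presence/absence taxa table.
# NOTE: the argument name fileLoc is an assumption used for illustration only.
combine_reduced_output(fileLoc = "Taxa-Reduced/COI_taxaReduced_2024_01_01_1400.tsv")
```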
Two files, a 'YYYY_MM_DD_HHMM_CombineTaxaReduced.tsv' result file and a 'YYYY_MM_DD_HHMM_CombineTaxaReduced.txt' run summary file, are generated from this function. The result file contains presence/absence data in a matrix that associates the data with samples, taxa, and molecular markers. The column headers in the results file include the following: superkingdom, phylum, class, order, family, genus, species, markers (n columns), and samples (n columns).
The interpretation of the output file is the same as that of the taxon_assign() '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files.
Young RG, et al., Hanner RH (2024) A Scalable, Open Source, Cross Platform, MetaBarcode Analysis Method using Dada2 and BLAST. Biodiversity Data Journal (In progress).