DBTC is an R implementation of Dada2 and a BLAST approach to metabarcoding analysis.
DBTC provides molecular analysis of high-throughput metabarcode sequence data (fastq files) and taxonomic assignment of unique reads (fasta format files) using the Basic Local Alignment Search Tool (BLAST), with R used to reduce the obtained results (see the additional documentation at DBTC). The Dada-BLAST-TaxonAssign-Condense ('DBTC') package contains the foundational 'DBTC' functions run through the R command line (but also see 'DBTCShiny' and the DBTCShinyTutorial). The 'DBTC' functions have four main outcomes…
NOTE: While the DBTC package has been built for the analysis of high-throughput sequencing results, the BLAST, taxonomic assignment, and taxonomic condense functions can also be used with single-specimen Sanger sequencing data.
Also see DBTCShiny and the associated DBTCShinyTutorial.
DBTC can be installed three ways.
    install.packages('DBTC')
Run the following commands in your R terminal…
    if(!require(devtools)) install.packages('devtools')
    library('devtools')
    devtools::install_github('rgyoung6/DBTC')
    library('DBTC')

Note: the first command to install "devtools" may not be necessary if it is already installed.
Navigate to the DBTC GitHub page. Download the files associated with this page to your local computer and place them somewhere in the main file folder named DBTC. Then run the following command, replacing HERE with the path to that location on your local computer…
library("DBTC", lib.loc="HERE")Follow the instructions on the NCBIBLAST+executables page to obtain a local version of the BLAST tools. The listof the latest installation files can be foundhere.
Note: It is best to download and install the most recent versions of the BLAST+ suite to your computer and place the programs in your computer's PATH so you can access the programs from any folder. However, the program files for both blastn and makeblastdb have been included on the DBTCShinyTutorial GitHub page for ease of use (please note that these may not be the most recent versions).
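To confirm that the BLAST+ programs are reachable from R once installed, a quick check such as the following can be used (a minimal sketch; base R only):

```
# Check whether the BLAST+ programs are visible on the system PATH.
# An empty string ("") means the program was not found and the full path to
# the executable will need to be supplied to the DBTC functions instead.
Sys.which("blastn")
Sys.which("makeblastdb")
```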
The R package taxonomizr is used to establish an NCBI taxaID database (NOTE: this package is also required when using the taxon assignment elements in the DBTC pipeline).
    install.packages('taxonomizr')
    library('taxonomizr')

NCBI BLASTable databases can be established through two methods.
Download your desired preformatted NCBI database by using 'update_blastdb.pl' (found in the NCBI BLAST+ local install folder). NOTE: the Perl programming language needs to be installed on your local machine. Instructions can be found at Get NCBI BLAST databases.
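As an illustration of this first method, a preformatted database could be fetched from within R as follows (a sketch only, assuming Perl and the BLAST+ bin directory are on the PATH; the 'nt' database name is just an example):

```
# Example only: download and decompress the preformatted NCBI 'nt' database
# using the update_blastdb.pl script shipped with BLAST+ (requires Perl).
# Run from the directory where the database files should be stored.
system2("perl", args = c("update_blastdb.pl", "--decompress", "nt"))
```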
You can also download your desired preformatted NCBI database manually with instructions at the BLAST FTP Site and a list of the available databases at Index of /blast/db.
In addition to the NCBI resources, DBTC can also use custom databases. To establish these databases you will require a fasta file with the desired records with MACER formatted headers. The MACER R package and instructions can be found at either of the two locations:
MACER GitHub (will have the most recent version and development versions)
In the 'Preparation' section of the taxonomizr website, use the instructions and the prepareDatabase('accessionTaxa.sql', getAccessions = FALSE) taxonomizr command to establish a local taxonomy database.
    prepareDatabase('accessionTaxa.sql', getAccessions = FALSE)

Note: Along with accessionTaxa.sql, two other files, nodes.dmp and names.dmp, are downloaded. These two files are not necessary for downstream analyses and can be deleted.
The ShortRead package is required to run elements of the DBTC pipeline and can be obtained through Bioconductor.
The dada2 package is the main package used to process the raw fastq files and can be obtained from Bioconductor. There is also a dada2 GitHub resource.
    if (!require('BiocManager', quietly = TRUE)) install.packages('BiocManager')
    BiocManager::install('ShortRead')
    BiocManager::install('dada2')
    library(dada2)

Each of the below CRAN packages and their dependencies are required for the DBTC package.
    install.packages(c('ggplot2', 'parallel', 'pbapply', 'plyr', 'stats', 'taxonomizr', 'utils'))
    lapply(c('ggplot2', 'parallel', 'pbapply', 'plyr', 'stats', 'taxonomizr', 'utils'), library, character.only = TRUE)

After installation of DBTC and all of its dependencies you need to load the package using the following commands…
    library(DBTC)

The dada_implement() function takes fastq files as input, analyses them, and produces amplicon sequence variant (ASV) files. This function requires a main directory containing folder(s) representing sequencing runs which in turn contain fastq files (the location of one of the fastq files in one of the sequencing run folders is used as an input argument). A run is a group of results processed at the same time on the same machine representing the same molecular methods. All sequencing folders in the main directory need to represent data from sequencing runs that have used the same primers and protocols. Output from this function includes all processing files and final main output files in the form of fasta files and ASV tables.
The DBTC combine_dada_output() function uses ASV output files ('YYYY_MM_DD_HH_MM_UserInputRunName_Merge' and/or 'YYYY_MM_DD_HH_MM_UserInputRunName_MergeFwdRev') and combines them into a single ASV table with an accompanying fasta file. This function also produces a file containing the processing information for the function. The main input argument for this function is the location of a file in a folder containing all of the ASV tables to be combined. Output files are generated with the naming convention 'YYYY_MM_DD_HH_MM_combinedDada'.
This function takes a fasta file with headers in the MACER format (Young et al. 2021) and establishes a database upon which a BLAST search can be completed. However, if an NCBI sequence database is desired, it is advisable to use, where applicable, NCBI preformatted databases and skip the make_BLAST_DB() function (https://www.ncbi.nlm.nih.gov/books/NBK62345/#blast_ftp_site.The_blastdb_subdirectory). The outcome of the function is a folder with a BLASTable NCBI-formatted sequence database.
The MACER fasta header format: `>UniqueID|OtherInformation|Genus|species|OtherInformation|Marker`
## seq_BLAST()

[Fasta](https://en.wikipedia.org/wiki/FASTA_format) file(s) are used as input along with a user-selected NCBI formatted database upon which query sequences will be searched using BLAST. The outcome of the function is two files, a BLAST run file and a single file containing all of the BLAST results in tab-delimited format. There are no headers in the BLAST results file but the columns (in order left to right) are: query sequence ID, search sequence ID, search taxonomic ID, query to sequence coverage, percent identity, search scientific name, search common name, query start, query end, search start, search end, e-value.

## taxon_assign()

This function takes a BLAST result file and associated [fasta](https://en.wikipedia.org/wiki/FASTA_format) files (either on their own or with accompanying [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) files generated from the dada_implement() function) and collapses the multiple BLAST results into a single result for each query sequence. When an [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) table is present the taxonomic results will be combined with the [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) table.

## combine_assign_output()

The combine_assign_output() function takes a file selection and then uses all DBTC [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) taxon assign files ('_taxaAssign_YYYY_MM_DD_HHMM.tsv') in a selected directory and combines them into a single output 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' file. The files being combined should represent different samples whose data have all come from analyses using the same molecular methods, the same analysis arguments, and the same molecular sequence databases.

## reduce_taxa()

To reduce [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) results to unique taxa per file, the reduce_taxa() function takes a file selection and then uses all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' and/or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files in that directory. This function then reduces all [ASVs](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) with the same taxonomic assignment into a single result and places these results in a '_taxaReduced_YYYY_MM_DD_HHMM.tsv' file for each of the target files in the directory.

## combine_reduced_output()

This function takes a file selection and then uses all '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files in that directory and combines them into a single 'YYYY_MM_DD_HHMM_CombineTaxaReduced.txt' taxa table file with presence-absence results. The files being combined should represent the same biological samples but with different molecular marker information. The output [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) table can include read numbers or be reduced to presence-absence results.

([Back to Top](#table-of-contents))

***

# Naming convention rules

WARNING - NO WHITESPACE!

When running DBTC functions the paths for the files selected cannot have white space! File folder locations should be as short as possible (close to the [root](https://en.wikipedia.org/wiki/Root_directory) directory) as some functions do not process long naming conventions. Also, special characters should be avoided (including question mark, number sign, exclamation mark). It is recommended that dashes be used for separations in naming conventions while retaining underscores for use as information delimiters (this is how DBTC functions use underscores).
There are several key character strings used in the DBTC pipeline; the presence of these strings in file or folder names will cause errors when running DBTC functions. The following strings are those used in DBTC and should not be used in file or folder naming:

- _BLAST
- _combinedDada
- _taxaAssign
- _taxaAssignCombined
- _taxaReduced
- _CombineTaxaReduced

([Back to Top](#table-of-contents))

***

# Package Function Details

## Dada Implement

dada_implement() - Process metabarcode raw fastq files by [runs](#runs) using Dada2 (Note: molecular markers are independently analysed and combined at the end of the analysis pipeline using the [combine_reduced_output()](#combine-reduced-output) function).

### Input

Two files are required as input for the dada_implement() function. The first is the fastq files in the appropriate folder structure (see below) and the second is a file containing the primers used for the amplification of the sequence reads (tab separated file, see [DBTCShinyTutorial](https://github.com/rgyoung6/DBTCShinyTutorial) for file examples). To select all of the fastq files simply select one file in one of the [Run](#runs) directories to point to the desired data files, as long as the file and folder structure is correct.

**Fastq File Folder Structure**

    Parent Directory
        |
        |--- Run1 Directory
        |        - Fastq
        |        - Fastq
        |        ...
        |
        |--- Run2 Directory
                 - Fastq
                 - Fastq
                 ...

**Format of the primer file**

| Forward | Reverse |
| :------------- | :------------- |
| AGTGTGTAGTGATTG | CGCATCGCTCAGACTGACTGC |
| GAGCCCTCGATCGCT | GGTCGATAGCTACGCGCGCATACGACT |
| | GGTTCACATCGCATTCAT |

### Arguments

- <strong>runFolderLoc -</strong> Select a file in one of the [run](#runs) folders with the fastq files of interest (Default NULL).
- <strong>primerFile -</strong> Select a file with the primers for this analysis (Default NULL).
- <strong>fwdIdent -</strong> Forward identifier naming string (Default '_R1_001').
- <strong>revIdent -</strong> Reverse identifier naming string (Default '_R2_001').
- <strong>bidirectional -</strong> Selection to process paired forward and reverse sequences for analysis (Default TRUE).
- <strong>unidirectional -</strong> Selection to process files independently (Default FALSE).
- <strong>printQualityPdf -</strong> Selection to save image files showing quality metrics (Default TRUE).
- <strong>maxPrimeMis -</strong> Maximum number of mismatches allowed when pattern matching trimming the primers from the ends of the reads for the ShortRead trimLRPatterns() function (Default 2).
- <strong>fwdTrimLen -</strong> Select a forward trim length for the Dada filterAndTrim() function (Default 0).
- <strong>revTrimLen -</strong> Select a reverse trim length for the Dada filterAndTrim() function (Default 0).
- <strong>maxEEVal -</strong> Maximum number of expected errors allowed in a read for the Dada filterAndTrim() function (Default 2).
- <strong>truncQValue -</strong> Truncation value used to trim the ends of reads; nucleotides with quality values less than this value will be used to trim the remainder of the reads for the Dada filterAndTrim() function (Default 2).
- <strong>truncLenValueF -</strong> Dada forward length trim value for the Dada filterAndTrim() function. This value is set to 0 when the pattern matching trim function is enabled (Default 0).
- <strong>truncLenValueR -</strong> Dada reverse length trim value for the Dada filterAndTrim() function. This value is set to 0 when the pattern matching trim function is enabled (Default 0).
- <strong>error -</strong> Percent of fastq files used to assess error rates for the Dada learnErrors() function (Default 0.1).
- <strong>nbases -</strong> The total number of bases used to assess errors for the Dada learnErrors() function (Default 1e80). NOTE: this value is set very high to use all nucleotides in the error assessment file subset. If the error is to be assessed using total reads and not specific fastq files, then set error to 1 and set this value to the desired number of reads.
- <strong>maxMismatchValue -</strong> Maximum number of mismatches allowed when merging two reads for the Dada mergePairs() function (Default 2).
- <strong>minOverlapValue -</strong> Minimum number of overlapping nucleotides for the forward and reverse reads for the Dada mergePairs() function (Default 12).
- <strong>trimOverhang -</strong> Trim merged reads past the start of the complementary primer regions for the Dada mergePairs() function (Default FALSE).
- <strong>minFinalSeqLen -</strong> The minimum final desired length of the read (Default 100).
- <strong>verbose -</strong> If set to TRUE then there will be output to the R console; if FALSE then this reporting data is suppressed (Default TRUE).

### Output

The output files from this function appear in four folders. See the diagram below for the saved file structure.

```
Parent Directory
    |
    |--- Run1 Directory
    |        - Fastq
    |        - Fastq
    |        ...
    |        |
    |        |--- YYYY_MM_DD_HHMM_Run1_A_Qual
    |        |        - fwdQual.pdf
    |        |        - revQual.pdf
    |        |        ...
    |        |--- YYYY_MM_DD_HHMM_Run1_B_Filt
    |        |        - fwdFilt.fastq
    |        |        - revFilt.fastq
    |        |        ...
    |        |        |--- Primer_Trim
    |        |                 - primeTrim.fastq
    |        |                 ...
    |        |--- YYYY_MM_DD_HHMM_Run1_C_FiltQual
    |        |        - filtFwdQual.pdf
    |        |        - filtRevQual.pdf
    |        |        ...
    |        |--- YYYY_MM_DD_HHMM_Run1_D_Output
    |                 - dadaSummary.txt
    |                 - dadaSummaryTable.tsv
    |                 - ErrorForward.pdf
    |                 - ErrorReverse.pdf
    |                 - Merge.tsv
    |                 - Merge.fas
    |                 - MergeFwdRev.tsv
    |                 - MergeFwdRev.fas
    |                 - TotalTable.tsv
    |
    |--- Run2 Directory
             - Fastq
             - Fastq
             ...
```
Quality PDFs in the A_Qual folder represent the quality metrics for the raw fastq files (this folder may not be present if printQualityPdf is set to FALSE). Files in the B_Filt folder represent the primer-trimmed reads (in the Primer_Trim subfolder) and the trimmed and cleaned reads in the larger folder. Quality PDFs in the C_FiltQual folder represent the quality metrics for the trimmed and cleaned fastq files (this folder may not be present if printQualityPdf is set to FALSE). There are numerous files in the D_Output folder, including the dadaSummary.txt and dadaSummaryTable.tsv processing summaries, the ErrorForward.pdf and ErrorReverse.pdf error assessment plots, and the ASV tables (Merge, MergeFwdRev, TotalTable) with their accompanying fasta files (see the directory structure above).
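As a minimal sketch of running the function with the arguments described above (the paths are placeholders; all arguments not shown keep the defaults listed above):

```
library(DBTC)

# Point to one fastq file inside one of the run folders and to the
# tab-separated primer file; the remaining arguments keep their defaults.
# The paths below are placeholders and must be replaced with real locations.
dada_implement(runFolderLoc = "Parent-Directory/Run1-Directory/Sample-01_R1_001.fastq",
               primerFile   = "Parent-Directory/primer-file.tsv")
```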
([Back to Top](#table-of-contents))

***
combine_dada_output() - Combine Dada2 ASV tables (YYYY_MM_DD_HHMM_FileName_MergeFwdRev.tsv OR YYYY_MM_DD_HHMM_FileName_Merge.tsv) into a single ASV output file.
Two or more files to be combined are required as input for this function. These files need to be ASV files as outputted from dada_implement() and can include Merge, MergeFwdRev, or TotalTable.tsv files.
The output from this function includes three files:
1. YYYY_MM_DD_HHMM_combinedDada.tsv - combined ASV table
2. YYYY_MM_DD_HHMM_combinedDada.fas - combined fasta file
3. YYYY_MM_DD_HHMM_combinedDada.txt - summary file from the combine_dada_output() function
Outputted data files will come in the same ASV table format as the output dada_implement() ASV files.
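A minimal sketch of a call, assuming the single input is the location of one of the ASV files in the folder to be combined (the fileLoc argument name is an illustrative assumption; check ?combine_dada_output for the actual signature):

```
library(DBTC)

# Select one of the Merge/MergeFwdRev/TotalTable ASV files in the folder
# containing all of the ASV tables to be combined.
# NOTE: the argument name fileLoc is an assumption used for illustration only.
combine_dada_output(fileLoc = "Dada-Output/2024_01_01_1200_Run1_Merge.tsv")
```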
([Back to Top](#table-of-contents))

***
make_BLAST_DB() - Create a local database to BLAST against.
This function takes a fasta file (in MACER format) and establishes a database upon which a BLAST search can be completed. The outcome of the function is a folder with an NCBI-formatted database.
- The MACER fasta header format -> `>UniqueID|OtherInformation|Genus|species|OtherInformation|Marker`
- An example of the header format output from the MACER program is `>GenBankAccessionOrBOLDID|GenBankAccession|Genus|species|UniqueID|Marker`
The output from this function includes a folder with the BLAST database named according to the submitted dbName.
The constructed database can then be used with the seq_BLAST() function.
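A sketch of building a custom database from a MACER-formatted fasta file (dbName is referenced above; the other argument names and paths are illustrative assumptions, so check ?make_BLAST_DB for the actual signature):

```
library(DBTC)

# Build a BLASTable database from a MACER-formatted fasta file.
# NOTE: fileLoc, makeblastdbPath and taxaDBLoc are assumed argument names used
# for illustration only; dbName is referenced in the documentation above.
make_BLAST_DB(fileLoc         = "Custom-DB/my-records.fas", # MACER-formatted fasta
              makeblastdbPath = "makeblastdb",              # BLAST+ program (on the PATH)
              taxaDBLoc       = "accessionTaxa.sql",        # taxonomizr database
              dbName          = "my-custom-DB")
```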
seq_BLAST() - Search fasta files of unknown sequences against a BLAST formatted database.
Provide a location for the BLAST database you would like to use by selecting a file in the target directory (this can be a database built using the make_BLAST_DB() function or a preformatted NCBI BLAST database). Then provide the location of the query sequence files by indicating a file in a directory that contains the fasta files. Provide the path for the BLAST+ blastn program. Finally, provide the minimum query sequence length to BLAST (Default = 100), the maximum depth of the BLAST returned results (Default = 200), and the number of cores with which to process the function (Default = 1; the Windows implementation can only use a single core and will default to this value when running on Windows).
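As a sketch of a call built from the inputs described above (only the BLASTResults argument name appears elsewhere in this document; the other argument names and paths are illustrative assumptions, so check ?seq_BLAST for the actual signature):

```
library(DBTC)

# Search query fasta file(s) against a local BLAST-formatted database.
# NOTE: databasePath, querySeqPath, blastnPath, minLen and cores are assumed
# argument names used for illustration only; BLASTResults is referenced in the
# taxon_assign() documentation below.
seq_BLAST(databasePath = "my-custom-DB/my-custom-DB.nsq", # a file in the database folder
          querySeqPath = "Dada-Output/2024_01_01_1200_combinedDada.fas",
          blastnPath   = "blastn",  # BLAST+ blastn program (on the PATH)
          minLen       = 100,       # minimum query sequence length to BLAST
          BLASTResults = 200,       # maximum depth of the returned results
          cores        = 1)         # Windows can only use a single core
```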
Two files are produced from this function, a BLAST run file and a BLAST results file, for each of the fasta files in the target directory.
The BLAST run file contains the command used to run the BLAST search. The BLAST results file includes all results in a tab-delimited .tsv file format with the columns qseqid, sseqid, staxid, qcovs, pident, ssciname, scomname, qstart, qend, sstart, send, evalue.
taxon_assign() - Using BLAST results to construct a table with taxonomic assignments for each unique sequence.
This function requires a BLAST output file and an associated fasta file. In addition, if present, an ASV file will also be combined with the taxonomic results. The function will take the BLAST results and reduce the taxonomic assignment to a single result for each read.
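A sketch of a call based on the description above (every argument name shown is an illustrative assumption rather than the confirmed interface; check ?taxon_assign for the actual signature):

```
library(DBTC)

# Collapse the BLAST results for each read into a single taxonomic assignment.
# NOTE: all argument names below are assumptions used for illustration only.
taxon_assign(fileLoc   = "BLAST-Output/2024_01_01_1200_combinedDada_BLAST.tsv",
             taxaDBLoc = "accessionTaxa.sql", # taxonomizr database (see above)
             coverage  = 95,                  # assumed coverage threshold
             ident     = 95)                  # assumed identity threshold
```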
A single taxonomic assignment file is created for each BLAST output file with the naming convention '_taxaAssign_YYYY_MM_DD_HHMM.tsv'. In addition, a run file is also generated which contains the run details, with the naming convention 'YYYY_MM_DD_HHMM_taxaAssign.txt'.
The number of returned BLAST results is dictated by the seq_BLAST() BLASTResults argument. The taxon_assign() function takes into account all returned BLAST results for each read. At each taxonomic level, assignments have quality metrics in parentheses after the name. These values ("Num_Rec", "Coverage", "Identity", "Max_eVal") represent the number of records with this taxonomic placement, the minimum coverage and identity, and the maximum eValue for the reported taxa.
Column headers for the resulting taxonomic assignments include…
uniqueID, superkingdom, phylum, class, order, family, genus, species, Top_BLAST, Lowest_Single_Rank, Lowest_Single_Taxa, Lowest_Single_Rank_Above_Thres, Lowest_Single_Taxa_Above_Thres, Final_Common_Names, Final_Rank, Final_Taxa, Final_Rank_Taxa_Thres, Result_Code, Sequence, Length, Results, followed by columns of samples containing ASV values.
There are three columns that deserve special explanation.
The Final_Rank_Taxa_Thres column contains the threshold values (Coverage, Identity) applied to the final rank and taxonomic values for the associated records.
The Results column contains the reference to the record indicating whether it was from a merged, forward, or reverse analysis result.
The Result_Code column contains flags placed on the results to better understand the quality of the resulting taxonomic assignments. The codes used include BIRT, BCRT, TBAT, SANF, and SFAT, discussed in the note below.
Note: Records with BIRT and BCRT flags should be highly scrutinized. TBAT flags are also concerning in that they may represent a less specific taxonomic placement due to the moving of the result to a higher taxonomic placement; the taxonomic rank directly below the final reported rank should be reviewed as there may be potential to adjust the final taxonomic assignment. SANF results should be explored, and the size of the database, the trust placed in the records in the database, and the depth of the BLAST results should be considered when assessing records with this flag. Records with SFAT are among the least concerning as the BLAST results were saturated, but this taxonomic assignment saturation occurred above your set quality coverage and identity threshold. Concerns with records with this result could be that the depth of the BLAST analysis was not deep enough for very large databases, or that the database is not complete (taxonomic breadth) when using smaller databases.
combine_assign_output() - Using results from the taxon_assign() function, combines all files with the string '_taxaAssign_YYYY_MM_DD_HHMM.tsv' into a single .tsv file.
Select a file in a folder with the taxa assigned files you would like to combine (extension '_taxaAssign_YYYY_MM_DD_HHMM.tsv'). NOTE: all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files in the folder location should originate from the same dada output file but have outputs from different BLAST sequence libraries and therefore contain the same ASVs.
This function produces a 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' and a 'YYYY_MM_DD_HHMM_taxaAssignCombined.txt' file in the selected target directory.
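A minimal sketch of a call, assuming the single required input is the location of one of the taxa assign files in the target folder (the fileLoc argument name is an illustrative assumption; check ?combine_assign_output for the actual signature):

```
library(DBTC)

# Combine all '_taxaAssign_...' files in the selected folder into one table.
# NOTE: the argument name fileLoc is an assumption used for illustration only.
combine_assign_output(fileLoc = "Taxa-Assign/combinedDada_taxaAssign_2024_01_01_1300.tsv")
```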
The interpretation of the output file for the combine_assign_output() 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files is the same as that for the taxon_assign() '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files.
reduce_taxa() - Using results from taxon_assign() and/or combine_assign_output(), this function combines all reads with the same taxonomic assignment into a single result for each of the files submitted.
This function requires a file in a directory where all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' and/or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files in that directory will be combined. All records with the same taxonomic result will be combined. The BLAST values in parentheses ("Num_Rec", "Coverage", "Identity", "Max_eVal") are combined as the mean number of records, the mean of the minimum coverage and identity values, and the mean of the maximum eValues.
This function produces a '_taxaReduced_YYYY_MM_DD_HHMM.tsv' file for every '_taxaAssign_YYYY_MM_DD_HHMM.tsv' or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' file present in the target directory.
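A minimal sketch of a call, again assuming a single file-location argument (the fileLoc name is an illustrative assumption; check ?reduce_taxa for the actual signature):

```
library(DBTC)

# Reduce all ASVs with the same taxonomic assignment to a single result for
# every taxaAssign/taxaAssignCombined file in the selected folder.
# NOTE: the argument name fileLoc is an assumption used for illustration only.
reduce_taxa(fileLoc = "Taxa-Assign/2024_01_01_1300_taxaAssignCombined.tsv")
```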
Reduced taxonomic assignment files have fewer columns in the main taxa_reduced.tsv file than the '_taxaAssign_YYYY_MM_DD_HHMM.tsv' or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files, as columns are collapsed. In addition, the values in parentheses in the taxonomic columns represent the average values across all of the results with the same taxonomic assignment (see the taxon_assign() interpretation).
The columns include: superkingdom, phylum, class, order, family, genus, species, Top_BLAST, Final_Common_Names, Final_Rank, Final_Taxa, Result_Code, RepSequence, Number_ASV, Average_ASV_Length, Number_Occurrences, Average_ASV_Per_Sample, Median_ASV_Per_Sample, and Results. NOTE: The representative sequence (RepSequence) is the longest sequence for each of the collapsed taxa assigned.
combine_reduced_output() - This function takes '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files generated from the same biological samples but representing different amplified molecular markers and collapses these data into a single file. The outcome of this process is a presence/absence matrix for all taxa and markers.
Select a file in a folder with '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files representing data for the same biological samples but representing different amplified molecular markers.
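A minimal sketch of a call, assuming a single file-location argument (the fileLoc name is an illustrative assumption; check ?combine_reduced_output for the actual signature):

```
library(DBTC)

# Combine reduced-taxa files for different markers from the same samples into
# a single presence/absence taxa table.
# NOTE: the argument name fileLoc is an assumption used for illustration only.
combine_reduced_output(fileLoc = "Taxa-Reduced/COI_taxaReduced_2024_01_01_1400.tsv")
```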
Two files, a 'YYYY_MM_DD_HHMM_CombineTaxaReduced.tsv' result file and a 'YYYY_MM_DD_HHMM_CombineTaxaReduced.txt' run summary file, are generated from this function. The result file contains presence/absence data in a matrix that associates the data with samples, taxa, and molecular markers. The column headers in the results file include the following: superkingdom, phylum, class, order, family, genus, species, markers (n columns), and samples (n columns).
The interpretation of the output file is the same as that of the taxon_assign() '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files.
Young RG, et al., Hanner RH (2024) A Scalable, Open Source, Cross Platform, MetaBarcode Analysis Method using Dada2 and BLAST. Biodiversity Data Journal (In progress).