Documentation

Source: vignettes/articles/Documentation.Rmd

Introduction

This documentation page provides additional information on how to use the demulticoder R package for processing and analyzing metabarcode sequencing data. Specifically, it provides more detail on input directory and file requirements and on key parameters.

Package workflow

Prepare your input files (metadata.csv, primerinfo_params.csv, unformatted reference databases, and paired-end (PE) Illumina read files).
Place all input files in a single directory.
Ensure your file names comply with the specified format.
Run the four steps/functions of the pipeline with default settings, or adjust parameters as needed.

Data directory structure

Place all your input files into a single directory. The directory should contain the following files:

PE Illumina read files
metadata.csv
primerinfo_params.csv
Reference databases
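
As a quick sanity check before running the pipeline, the directory contents can be inspected with base R. The sketch below is illustrative only: `check_input_dir` and the demo directory are hypothetical names, not part of demulticoder, and reference databases are not checked because their file names are chosen by the user.

```r
# Hypothetical helper (not a demulticoder function): report which of the
# expected input files are present in a data directory.
check_input_dir <- function(data_dir) {
  files <- list.files(data_dir)
  list(
    has_metadata   = "metadata.csv" %in% files,
    has_primerinfo = "primerinfo_params.csv" %in% files,
    # Count files ending in R1.fastq.gz or R2.fastq.gz
    n_read_files   = sum(grepl("R[12]\\.fastq\\.gz$", files))
  )
}

# Demonstration on a throwaway directory holding empty placeholder files
demo_dir <- file.path(tempdir(), "demulticoder_input_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo_dir,
  c("metadata.csv", "primerinfo_params.csv",
    "S1_R1.fastq.gz", "S1_R2.fastq.gz")
)))

status <- check_input_dir(demo_dir)
print(status)
```

In practice, `data_dir` would point at your own input directory rather than a temporary one.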

Read Name Format

To avoid errors, sample names may contain only letters and numbers. Groups of characters can be separated by underscores, but no other symbols are allowed. The files must end with the suffix R1.fastq.gz or R2.fastq.gz.

Examples of permissible sample names are as follows:

Sample1_R1.fastq.gz
Sample1_R2.fastq.gz

Other permissible names are:

Sample1_001_R1.fastq.gz
Sample1_001_R2.fastq.gz

What is not permissible is:

Sample1_001_R1_001.fastq.gz
Sample1_001_R2_001.fastq.gz
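
The naming rule above can be checked with a regular expression before running the pipeline. This is a minimal sketch, not the check demulticoder itself performs: `is_valid_read_name` is a hypothetical helper, and the pattern assumes an underscore always precedes R1/R2, as in the examples shown.

```r
# Hypothetical helper: TRUE when a file name uses only letters and digits,
# separated by single underscores, and ends in _R1.fastq.gz or _R2.fastq.gz.
is_valid_read_name <- function(x) {
  grepl("^[A-Za-z0-9]+(_[A-Za-z0-9]+)*_R[12]\\.fastq\\.gz$", x)
}

read_files <- c(
  "Sample1_R1.fastq.gz",          # permissible
  "Sample1_001_R2.fastq.gz",      # permissible
  "Sample1_001_R1_001.fastq.gz",  # not permissible: extra _001 after R1
  "Sample-1_R1.fastq.gz"          # not permissible: contains a hyphen
)

valid <- is_valid_read_name(read_files)
print(data.frame(file = read_files, valid = valid))
```

Files flagged as invalid would need to be renamed before they are placed in the input directory.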

Metadata file components

The metadata.csv file contains information about the samples and primers (and associated metabarcodes) used in the experiment. It has the following two required columns:

sample_name: Identifier for each sample (e.g., S1, S2)
primer_name: Name of the primer used (applicable options: rps10, its, r16S, other1, other2)

Add any associated metadata to the file after these two required columns. These columns can then be used in downstream exploratory or diversity analyses, because the sample data will be incorporated into the final phyloseq and taxmap objects.

Example file (with optional third column):

sample_name,primer_name,organism
S1,rps10,Cry
S2,rps10,Cin
S1,its,Cry
S2,its,Cin
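
A metadata file of this shape can be checked in base R before use. The sketch below is an illustration under stated assumptions, not demulticoder's own validation: it reads the example rows from an inline string (in practice you would read your own metadata.csv), and the variable names are hypothetical.

```r
# Example metadata, inlined so the snippet is self-contained
metadata_text <- "sample_name,primer_name,organism
S1,rps10,Cry
S2,rps10,Cin
S1,its,Cry
S2,its,Cin"

metadata <- read.csv(text = metadata_text, stringsAsFactors = FALSE)

required_cols   <- c("sample_name", "primer_name")
allowed_primers <- c("rps10", "its", "r16S", "other1", "other2")

# Required columns that are absent, and primer names outside the allowed set
missing_cols <- setdiff(required_cols, names(metadata))
bad_primers  <- setdiff(unique(metadata$primer_name), allowed_primers)

if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}
if (length(bad_primers) > 0) {
  stop("Unrecognized primer names: ", paste(bad_primers, collapse = ", "))
}

# Any remaining columns are the optional, user-supplied metadata
extra_cols <- setdiff(names(metadata), required_cols)
print(extra_cols)
```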

Primer and parameter file components

The primerinfo_params.csv file contains information about the primer sequences used in the experiment, along with optional additional parameters that are part of the DADA2 pipeline. If a parameter is not specified, its default value is used.

Required columns:

primer_name: Name of the primer/barcode (e.g., its, rps10)
forward: Forward primer sequence
reverse: Reverse primer sequence

Below are the parameters that can be included in the primerinfo_params.csv file, along with their defaults. Refer to the DADA2 documentation and manual for additional information.

DADA2 filterAndTrim function parameters:

already_trimmed: Boolean indicating whether primers are already trimmed (TRUE/FALSE) (default: FALSE)
minCutadaptlength: Cutadapt parameter; filters out processed reads that are shorter than the specified length (default: 0)
multithread: Boolean for multithreading (TRUE/FALSE) (default: FALSE)
verbose: Boolean for verbose output (TRUE/FALSE) (default: FALSE)
maxN: Maximum number of N bases allowed (default: 0)
maxEE_forward: Maximum expected errors for forward reads (default: Inf)
maxEE_reverse: Maximum expected errors for reverse reads (default: Inf)
truncLen_forward: Truncation length for forward reads (default: 0)
truncLen_reverse: Truncation length for reverse reads (default: 0)
truncQ: Truncation quality threshold (default: 2)
minLen: Minimum length of reads after processing (default: 20)
maxLen: Maximum length of reads after processing (default: Inf)
minQ: Minimum quality score (default: 0)
trimLeft: Number of bases to trim from the start of reads (default: 0)
trimRight: Number of bases to trim from the end of reads (default: 0)
rm.lowcomplex: Boolean for removing low complexity sequences (default: TRUE)
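
To make the maxEE_forward and maxEE_reverse thresholds concrete: the DADA2 documentation defines the expected errors of a read as EE = sum(10^(-Q/10)) over its per-base quality scores Q. The snippet below is a small worked example of that formula in base R; `expected_errors` and the quality scores are made up for illustration and are not part of either package.

```r
# Expected errors for one read, following the DADA2 definition
# EE = sum(10^(-Q/10)), where Q are per-base Phred quality scores.
expected_errors <- function(q) {
  sum(10^(-q / 10))
}

# Hypothetical quality scores for a four-base read
q_scores <- c(30, 30, 20, 10)
ee <- expected_errors(q_scores)
print(ee)

# A read is kept only if its expected errors do not exceed the threshold
max_ee <- 5
print(ee <= max_ee)
```

With the default of Inf, no read is discarded on this criterion; the example file further down uses a threshold of 5 for both read directions.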

DADA2 learnErrors function parameters:

nbases: Number of bases to use for error rate learning (default: 1e+08)
randomize: Randomize reads for error rate learning (default: FALSE)
MAX_CONSIST: Maximum number of self-consistency iterations (default: 10)
OMEGA_C: Convergence threshold for the error rates (default: 0)
qualityType: Quality score type ("Auto", "FastqQuality", or "ShortRead") (default: "Auto")

DADA2 plotErrors parameters:

err_out: Return the error rates used for inference (default: TRUE)
err_in: Use input error rates instead of learning them (default: FALSE)
nominalQ: Use nominal Q-scores (default: FALSE)
obs: Return the observed error rates (default: TRUE)

DADA2 dada function parameters:

OMP: Use OpenMP multi-threading if available (default: TRUE)
n: Number of reads to use for error rate estimation (default: 1e+05)
id.sep: Character separating sample ID from sequence name (default: "\\s")
orient.fwd: NULL or TRUE/FALSE to orient sequences (default: NULL)
pool: Pool samples for error rate estimation (default: FALSE)
selfConsist: Perform self-consistency iterations (default: FALSE)

DADA2 mergePairs function parameters:

minOverlap: Minimum overlap for merging paired-end reads (default: 12)
maxMismatch: Maximum mismatches allowed in the overlap region (default: 0)

DADA2 removeBimeraDenovo function parameters:

method: Method for chimera (bimera) identification ("consensus" or "pooled") (default: "consensus")

DADA2 assignTaxonomy function parameters:

minBoot: The minimum bootstrap confidence for assigning a taxonomic level (default: 0)
tryRC: If TRUE, the reverse complement of each sequence will be used for classification if it is a better match to the reference sequences than the forward sequence (default: FALSE)

Other parameters to include in the CSV input file:

min_asv_length: Minimum length of Amplicon Sequence Variants (ASVs) after the core dada ASV inference steps (default: 0)
seed: For greater reproducibility, the user can specify an integer to use as a seed when the following DADA2 functions are run: plotQualityProfile, learnErrors, dada, makeSequenceTable, and assignTaxonomy (default: NULL)

Example file (with select optional columns after the forward and reverse primer sequence columns):

primer_name,forward,reverse,already_trimmed,minCutadaptlength,multithread,verbose,maxN,maxEE_forward,maxEE_reverse,truncLen_forward,truncLen_reverse,truncQ,minLen,maxLen,minQ,trimLeft,trimRight,rm.lowcomplex,minOverlap,maxMismatch,min_asv_length
rps10,GTTGGTTAGAGYARAAGACT,ATRYYTAGAAAGAYTYGAACT,FALSE,100,TRUE,FALSE,1.00E+05,5,5,0,0,5,150,Inf,0,0,0,0,15,0,50
its,CTTGGTCATTTAGAGGAAGTAA,GCTGCGTTCTTCATCGATGC,FALSE,50,TRUE,FALSE,1.00E+05,5,5,0,0,5,50,Inf,0,0,0,0,15,0,50
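
Because unspecified parameters fall back to their defaults, a working file needs only the three required columns plus whichever parameters you want to override. The sketch below builds such a file in base R and reads it back to check it; the choice of optional columns, the temporary output path, and the IUPAC check are illustrative assumptions rather than package requirements.

```r
# Required columns plus a few optional overrides taken from the example above
primer_info <- data.frame(
  primer_name   = c("rps10", "its"),
  forward       = c("GTTGGTTAGAGYARAAGACT", "CTTGGTCATTTAGAGGAAGTAA"),
  reverse       = c("ATRYYTAGAAAGAYTYGAACT", "GCTGCGTTCTTCATCGATGC"),
  maxEE_forward = c(5, 5),
  maxEE_reverse = c(5, 5),
  minOverlap    = c(15, 15),
  stringsAsFactors = FALSE
)

# Written to a temporary location here; in practice the file belongs in
# the same directory as the reads and metadata.csv
out_file <- file.path(tempdir(), "primerinfo_params.csv")
write.csv(primer_info, out_file, row.names = FALSE, quote = FALSE)

# Read the file back and confirm the required columns are present
check <- read.csv(out_file, stringsAsFactors = FALSE)
required <- c("primer_name", "forward", "reverse")
stopifnot(all(required %in% names(check)))

# Illustrative check: primers use only IUPAC nucleotide codes
iupac_ok <- grepl("^[ACGTRYSWKMBDHVN]+$", c(check$forward, check$reverse))
print(all(iupac_ok))
```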

Reference Databases

Databases will be copied into the user-specified data folder where the raw data files and CSV files are located. The database names are passed as parameters to the assignTax function.

For now, the package is compatible with the following databases:

oomyceteDB from: https://grunwaldlab.github.io/OomyceteDB/
SILVA 16S database with species assignments: https://www.arb-silva.de/
- An easily accessible download is found here: https://zenodo.org/records/14169026
UNITE database from https://unite.ut.ee/repository.php
Up to two other reference databases. The user will need to reformat headers exactly as outlined here. The user can then specify the path to the database in the input file. The database should be in FASTA format.
