Example bioinformatic methodology to generate OTUs
An operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. The term was originally introduced in 1963 by Robert R. Sokal and Peter H. A. Sneath in the context of numerical taxonomy, where an "operational taxonomic unit" is simply the group of organisms currently being studied.[1] In this sense, an OTU is a pragmatic definition to group individuals by similarity, equivalent to but not necessarily in line with classical Linnaean taxonomy or modern evolutionary taxonomy.
Nowadays, however, the term "OTU" is commonly used in a different context and refers to clusters of (uncultivated or unknown) organisms, grouped by DNA sequence similarity of a specific taxonomic marker gene (originally coined as mOTU; molecular OTU).[2] In other words, OTUs are pragmatic proxies for "species" (microbial or metazoan) at different taxonomic levels, in the absence of traditional systems of biological classification as are available for macroscopic organisms. For several years, OTUs have been the most commonly used units of diversity, especially when analysing small subunit 16S (for prokaryotes) or 18S rRNA (for eukaryotes[3]) marker gene sequence datasets.
Sequences can be clustered according to their similarity to one another, and operational taxonomic units are defined based on a similarity threshold set by the researcher (usually 97%, although 100% similarity is also common, in which case the units are also known as single variants[4]). It remains debatable how well this commonly used method recapitulates true microbial species phylogeny or ecology. Although OTUs can be calculated differently when using different algorithms or thresholds, research by Schmidt et al. (2014) demonstrated that microbial OTUs were generally ecologically consistent across habitats and several OTU clustering approaches.[5] The number of OTUs defined may be inflated due to errors in DNA sequencing.[6]
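As a rough illustration of threshold-based clustering, the following Python sketch assigns each sequence to the first existing cluster whose representative it matches at or above the chosen similarity, and otherwise starts a new OTU. It is a toy example under strong assumptions rather than the method of any particular tool: the function names `identity` and `greedy_cluster` are hypothetical, sequences are assumed to be pre-aligned and of equal length, and similarity is taken as the plain fraction of matching positions.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences.

    Assumption: real analyses compute identity from an alignment; this
    position-by-position comparison is a simplification for illustration.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.97):
    """Group sequences into OTUs using a greedy, order-dependent strategy.

    The first sequence of each cluster acts as its representative. A new
    sequence joins the first cluster whose representative it matches at
    >= threshold; if none matches, it founds a new cluster.
    """
    clusters = []  # each cluster is a list; clusters[i][0] is its representative
    for seq in seqs:
        for members in clusters:
            if identity(seq, members[0]) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new OTU.
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    base = "ACGT" * 25                # 100 bases
    one_diff = "T" + base[1:]         # 1 mismatch  -> 99% identity to base
    five_diff = "TGCAT" + base[5:]    # 5 mismatches -> 95% identity to base
    reads = [base, one_diff, five_diff]

    print(len(greedy_cluster(reads, threshold=0.97)))  # OTUs at 97%
    print(len(greedy_cluster(reads, threshold=1.0)))   # OTUs at 100%
```

Raising the threshold to 100% splits every distinct sequence into its own unit, which also shows how a single sequencing error can create an extra OTU. Because the procedure is greedy, the result can depend on the order in which sequences are processed.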
There are three main approaches to clustering OTUs:[7]
De novo, for which the clustering is based on similarities between sequencing reads.
Closed-reference, for which the clustering is performed against a reference database of sequences.
Open-reference, where clustering is first performed against a reference database of sequences, then any remaining sequences that could not be mapped to the reference are clustered de novo.
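The relationship between the three approaches can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the implementation used by any real pipeline: the function names, the dictionary-based "reference database", the `denovo` labels, and the position-by-position identity measure on pre-aligned, equal-length sequences are all hypothetical simplifications.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes pre-aligned, equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def de_novo(reads, threshold=0.97):
    """Cluster reads against each other only; no reference is consulted."""
    clusters = []  # clusters[i][0] is the representative of cluster i
    for read in reads:
        for members in clusters:
            if identity(read, members[0]) >= threshold:
                members.append(read)
                break
        else:
            clusters.append([read])
    return clusters


def closed_reference(reads, reference, threshold=0.97):
    """Assign each read to its best-matching reference sequence.

    `reference` maps a reference ID to its sequence. Reads that match no
    reference at >= threshold are returned separately as unmatched.
    """
    otus, unmatched = {}, []
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otus.setdefault(best_id, []).append(read)
        else:
            unmatched.append(read)
    return otus, unmatched


def open_reference(reads, reference, threshold=0.97):
    """Closed-reference step first, then de novo clustering of the leftovers."""
    otus, unmatched = closed_reference(reads, reference, threshold)
    for i, members in enumerate(de_novo(unmatched, threshold)):
        otus[f"denovo{i}"] = members  # hypothetical label for new clusters
    return otus


if __name__ == "__main__":
    reference = {"ref1": "A" * 100}
    reads = ["A" * 100, "A" * 99 + "C", "C" * 100, "C" * 99 + "G"]
    print(closed_reference(reads, reference)[1])  # reads with no reference match
    print(sorted(open_reference(reads, reference)))  # OTU labels after both steps
```

In this sketch, closed-reference clustering leaves reads that resemble nothing in the reference unassigned, whereas the open-reference variant passes exactly those reads to the de novo step, so novel sequences still end up in an OTU.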