Read trimming has minimal effect on bacterial SNP-calling accuracy
- PMID:33332257
- PMCID: PMC8116680
- DOI: 10.1099/mgen.0.000434
Abstract
Read alignment is the central step of many analytic pipelines that perform variant calling. To reduce error, it is common practice to pre-process raw sequencing reads to remove low-quality bases and residual adapter contamination, a procedure collectively known as 'trimming'. Trimming is widely assumed to increase the accuracy of variant calling, although there are relatively few systematic evaluations of its effects and no clear consensus on its efficacy. As sequencing datasets increase both in number and size, it is worthwhile reappraising computational operations of ambiguous benefit, particularly when the scope of many analyses now routinely incorporates thousands of samples, increasing the time and cost required. Using a curated set of 17 Gram-negative bacterial genomes, this study initially evaluated the impact of four read-trimming utilities (Atropos, fastp, Trim Galore and Trimmomatic), each used with a range of stringencies, on the accuracy and completeness of three bacterial SNP-calling pipelines. It was found that read trimming made only small, and statistically insignificant, increases in SNP-calling accuracy even when using the highest-performing pre-processor in this study, fastp. To extend these findings, >6500 publicly archived sequencing datasets from Escherichia coli, Mycobacterium tuberculosis and Staphylococcus aureus were re-analysed using a common analytic pipeline. Of the approximately 125 million SNPs and 1.25 million indels called across all samples, the same bases were called in 98.8 and 91.9 % of cases, respectively, irrespective of whether raw reads or trimmed reads were used. Nevertheless, the proportion of mixed calls (i.e. calls where <100 % of the reads support the variant allele; considered a proxy of false positives) was significantly reduced after trimming, which suggests that while trimming rarely alters the set of variant bases, it can affect the proportion of reads supporting each call.
It was concluded that read quality- and adapter-trimming add relatively little value to a SNP-calling pipeline and may only be necessary if small differences in the absolute number of SNP calls, or the false call rate, are critical. Broadly similar conclusions can be drawn about the utility of trimming to an indel-calling pipeline. Read trimming remains routinely performed prior to variant calling, likely out of concern that doing otherwise would typically have negative consequences. While historically this may have been the case, the data in this study suggest that read trimming is not always a practical necessity.
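The two summary measures used above (the share of calls that are identical with and without trimming, and the share of mixed calls) can be illustrated with a minimal Python sketch. This is not the study's pipeline: the data layout, the function names `concordance` and `mixed_fraction`, and the choice of the union of called sites as the denominator are all assumptions made for illustration, and the toy values are invented rather than taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical layout: a call is keyed by (sample, position) and stores
# (variant_base, reads_supporting_variant, total_reads_at_site).
Call = Tuple[str, int, int]
CallSet = Dict[Tuple[str, int], Call]


def concordance(raw: CallSet, trimmed: CallSet) -> float:
    """Fraction of sites called in either run where both runs call the same base.

    Assumption: the denominator is the union of called sites; the abstract
    does not state which denominator the study used.
    """
    sites = set(raw) | set(trimmed)
    if not sites:
        return 0.0
    same = sum(
        1
        for site in sites
        if site in raw and site in trimmed and raw[site][0] == trimmed[site][0]
    )
    return same / len(sites)


def mixed_fraction(calls: CallSet) -> float:
    """Fraction of calls where fewer than 100 % of reads support the variant."""
    if not calls:
        return 0.0
    mixed = sum(1 for _, alt_reads, total in calls.values() if alt_reads < total)
    return mixed / len(calls)


# Invented toy data for one sample, not values from the study.
raw_calls: CallSet = {
    ("s1", 100): ("A", 30, 30),
    ("s1", 250): ("T", 18, 20),  # mixed: 18 of 20 reads support the variant
    ("s1", 400): ("G", 25, 25),
    ("s1", 900): ("C", 9, 12),   # mixed, and absent after trimming
}
trimmed_calls: CallSet = {
    ("s1", 100): ("A", 29, 29),
    ("s1", 250): ("T", 18, 18),  # same base, no longer mixed
    ("s1", 400): ("G", 24, 24),
}

print(concordance(raw_calls, trimmed_calls))
print(mixed_fraction(raw_calls), mixed_fraction(trimmed_calls))
```

In this toy case most sites keep the same base after trimming while the mixed fraction drops, which mirrors the pattern the abstract describes: trimming rarely changes which base is called but can change the proportion of reads supporting it.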
Keywords: SNP calling; read pre-processing; read trimming; variant calling.
Conflict of interest statement
The author declares that there are no conflicts of interest.