Microb Genom. 2020 Dec;6(12):mgen000434.
doi: 10.1099/mgen.0.000434. Epub 2020 Dec 11.

Read trimming has minimal effect on bacterial SNP-calling accuracy

Stephen J Bush. Microb Genom. 2020 Dec.

Abstract

Read alignment is the central step of many analytic pipelines that perform variant calling. To reduce error, it is common practice to pre-process raw sequencing reads to remove low-quality bases and residual adapter contamination, a procedure collectively known as 'trimming'. Trimming is widely assumed to increase the accuracy of variant calling, although there are relatively few systematic evaluations of its effects and no clear consensus on its efficacy. As sequencing datasets increase both in number and size, it is worthwhile reappraising computational operations of ambiguous benefit, particularly when the scope of many analyses now routinely incorporates thousands of samples, increasing the time and cost required. Using a curated set of 17 Gram-negative bacterial genomes, this study initially evaluated the impact of four read-trimming utilities (Atropos, fastp, Trim Galore and Trimmomatic), each used with a range of stringencies, on the accuracy and completeness of three bacterial SNP-calling pipelines. It was found that read trimming made only small, and statistically insignificant, increases in SNP-calling accuracy even when using the highest-performing pre-processor in this study, fastp. To extend these findings, >6500 publicly archived sequencing datasets from Escherichia coli, Mycobacterium tuberculosis and Staphylococcus aureus were re-analysed using a common analytic pipeline. Of the approximately 125 million SNPs and 1.25 million indels called across all samples, the same bases were called in 98.8 and 91.9 % of cases, respectively, irrespective of whether raw reads or trimmed reads were used. Nevertheless, the proportion of mixed calls (i.e. calls where <100 % of the reads support the variant allele; considered a proxy of false positives) was significantly reduced after trimming, which suggests that while trimming rarely alters the set of variant bases, it can affect the proportion of reads supporting each call.
It was concluded that read quality- and adapter-trimming add relatively little value to a SNP-calling pipeline and may only be necessary if small differences in the absolute number of SNP calls, or the false call rate, are critical. Broadly similar conclusions can be drawn about the utility of trimming to an indel-calling pipeline. Read trimming remains routinely performed prior to variant calling, likely out of concern that doing otherwise would typically have negative consequences. While historically this may have been the case, the data in this study suggest that read trimming is not always a practical necessity.
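
The abstract defines a mixed call as one where <100 % of the reads support the variant allele. The following is a minimal Python sketch of that definition, not the study's own code: the `SnpCall` record, its field names and the helper functions are hypothetical, and the read counts are assumed to have been extracted from a variant caller's output beforehand.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SnpCall:
    """Hypothetical record for one SNP call (field names are assumptions)."""
    position: int      # 1-based coordinate on the reference genome
    alt_base: str      # called variant base
    alt_reads: int     # reads supporting the variant allele
    total_reads: int   # all reads covering the position


def is_mixed(call: SnpCall) -> bool:
    # A call is "mixed" when fewer than 100 % of covering reads support it.
    return call.alt_reads < call.total_reads


def mixed_proportion(calls: Iterable[SnpCall]) -> float:
    # Fraction of calls that are mixed; 0.0 for an empty call set.
    calls = list(calls)
    if not calls:
        return 0.0
    return sum(is_mixed(c) for c in calls) / len(calls)


example_calls = [
    SnpCall(10, "A", 20, 20),   # fully supported
    SnpCall(50, "T", 18, 20),   # mixed
    SnpCall(90, "G", 30, 30),   # fully supported
    SnpCall(120, "C", 5, 10),   # mixed
]
print(mixed_proportion(example_calls))
```

Computing this proportion once for the raw-read calls and once for the trimmed-read calls of the same sample gives the paired values that the abstract compares.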

Keywords: SNP calling; read pre-processing; read trimming; variant calling.

Conflict of interest statement

The author declares that there are no conflicts of interest.

Figures

Fig. 1.
Effect of read trimming upon F-score, a measure of overall SNP-calling accuracy, in a curated Gram-negative dataset. Median difference in F-score per trimmer relative to untrimmed data, across a range of trimming stringencies (i.e. varying the Phred score threshold for trimming 3′ bases). Boxes represent the interquartile range of the F-score, with midlines representing the median. Upper and lower whiskers extend, respectively, to the largest and smallest values no further than 1.5× the interquartile range. Data beyond the ends of each whisker are outliers and plotted individually. Columns are ordered according to median F-score and coloured according to the trimmer used. The dashed line y=0 is marked in black. The raw data for this figure are available in Table S5. Boxplots showing the effect of read trimming upon precision and recall are shown, respectively, in Figs S1 and S2. Note that fastp implements quality filters other than 3′ trimming by default, which for the data in this figure were retained. A version of this figure with these filters disabled is available in Fig. S3.
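
The caption above refers to precision, recall and F-score without stating their formulas. The sketch below assumes the standard definitions (F-score as the harmonic mean of precision and recall); the function names and the example counts are illustrative and are not taken from the study.

```python
def precision(tp: int, fp: int) -> float:
    # Fraction of called SNPs that are true SNPs.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    # Fraction of true SNPs that were called.
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(tp: int, fp: int, fn: int) -> float:
    # Harmonic mean of precision and recall (assumed standard F1 definition).
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def f_score_difference(trimmed: tuple, raw: tuple) -> float:
    # Difference in F-score relative to untrimmed data; each argument is a
    # (tp, fp, fn) tuple. Positive values mean trimming improved accuracy.
    return f_score(*trimmed) - f_score(*raw)


# Made-up counts for one genome, purely for illustration.
print(f_score_difference(trimmed=(95, 5, 5), raw=(90, 10, 10)))
```

A value above zero corresponds to a point above the dashed line y=0 in a plot of this kind.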
Fig. 2.
Effect of read trimming upon SNP calls made using publicly archived E. coli sequencing data. Trimming marginally increases the proportion of successfully aligned reads (a), although the interpretation of those alignments (i.e. SNP calling) is not substantially altered, with the vast majority of SNPs (>99 %) called irrespective of trimming (b). This value is 100 % for a number of samples containing very few SNPs (approximately 15) relative to the E. coli reference genome. A relatively small number of SNPs (in the majority of cases <200) are only called when using either raw or trimmed data, but not both (c). The proportion of mixed SNP calls, considered a proxy of false-positive calling, decreases when using trimmed data (d). The raw data for this figure are available in Table S6 and represent 1606 E. coli samples, with a mean of 64 476 SNPs per sample. The red lines denote y=x.
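
Panels (b) and (c) amount to a set comparison between the SNPs called from raw reads and those called from trimmed reads. The following is a rough sketch under stated assumptions: each call is keyed by a hypothetical `(position, base)` tuple, and the shared percentage is taken over the union of both call sets, a denominator chosen here for illustration because the caption does not specify one.

```python
def compare_call_sets(raw_calls, trimmed_calls):
    """Summarise agreement between two collections of (position, base) calls."""
    raw, trimmed = set(raw_calls), set(trimmed_calls)
    shared = raw & trimmed
    union = raw | trimmed
    return {
        "shared": len(shared),                 # called irrespective of trimming
        "raw_only": len(raw - trimmed),        # called only from raw reads
        "trimmed_only": len(trimmed - raw),    # called only from trimmed reads
        # Assumed denominator: all distinct calls made in either run.
        "percent_shared": 100 * len(shared) / len(union) if union else 100.0,
    }


# Toy call sets, purely for illustration.
raw = {(1, "A"), (2, "C"), (3, "G")}
trimmed = {(2, "C"), (3, "G"), (4, "T")}
print(compare_call_sets(raw, trimmed))
```

Keying on the base as well as the position means a site called with different alleles in the two runs counts as a disagreement rather than a shared call.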
Fig. 3.
Effect of read trimming upon SNP calls made using publicly archived M. tuberculosis sequencing data. This figure recapitulates patterns seen in Fig. 2 and illustrates the effect of read trimming upon SNP calls made in a clonal species, M. tuberculosis, for which relatively high alignment accuracy is expected and the impact of misalignment (i.e. false-positive SNP calls) accordingly greater. Trimming marginally increases the proportion of successfully aligned reads, albeit from a high baseline value, >98 % (a). The vast majority of SNPs (>98 %) are nevertheless called irrespective of any trimming (b). A relatively small number of SNPs (often <40) are only called when using either raw or trimmed data, but not both (c). The proportion of mixed SNP calls, considered a proxy of false-positive calling, decreases when using trimmed data (d). The raw data for this figure are available in Table S7 and represent 3946 M. tuberculosis samples, with a mean of 1238 SNPs per sample. The red lines denote y=x.
