This is a preprint.

It has not yet been peer reviewed by a journal.
The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.
[Preprint]. 2024 Jan 6:2024.01.02.573821.
doi: 10.1101/2024.01.02.573821.

Comprehensive and accurate genome analysis at scale using DRAGEN accelerated algorithms

Sairam Behera et al. bioRxiv.

Abstract

Research and medical genomics require comprehensive and scalable solutions to drive the discovery of novel disease targets, evolutionary drivers, and genetic markers with clinical significance. This necessitates a framework to identify all types of variants independent of their size (e.g., SNV/SV) or location (e.g., repeats). Here we present DRAGEN, which uses novel methods based on multigenomes, hardware acceleration, and machine learning-based variant detection to provide novel insights into individual genomes with ~30 min of computation time (from raw reads to variant detection). DRAGEN outperforms all other state-of-the-art methods in speed and accuracy across all variant types (SNV, indel, STR, SV, CNV) and further incorporates specialized methods to obtain key insights in medically relevant genes (e.g., HLA, SMN, GBA). We showcase DRAGEN across 3,202 genomes and demonstrate its scalability, accuracy, and innovations to further advance the integration of comprehensive genomics for research and medical applications.

Conflict of interest statement

Competing interests: FJS receives research support from Genentech, Illumina, PacBio and Oxford Nanopore. SC, MR, ST, ZH, MR, AV, GP, CR, VO, SM, JH and RM are employees of Illumina.

Figures

Figure 1: Overview of DRAGEN variant calling pipeline.
DRAGEN improves variant identification for alleles ranging from a single bp up to multiple Mbp. This is achieved by implementing multiple optimized novel concepts: i) mapping utilizes a multigenome (graph) including 64 haplotypes; ii) SV calling is significantly improved over local assemblies based on breakpoint graphs; iii) SNV calling is improved using multiple novelties including machine learning-based scoring and filtering; iv) CNV calling utilizes the multigenome (graph) and the SV calling information to make informed decisions; v) nine additional tools targeting specific difficult regions of the genome are included, four of them not reported before; vi) STR calling is integrated based on ExpansionHunter; and vii) a gVCF genotyper implementation provides a population-level, fully genotyped VCF file.
Figure 2: Performance overview of DRAGEN based on GIAB benchmarks.
A) Length distribution of small and large variants discovered by DRAGEN (bin sizes used for the plot, from left to right, are: 500, 250, 150, 50, 150, 250, 500); B) SNV comparison based on GIAB SNV 4.2.1; C) SNV call comparisons based on CMRG v1.0; D) comparison of SV call performance (INS and DEL types) based on GIAB SV v0.6; E) comparison of CMRG SV call performance (INS and DEL types) based on GIAB CMRG SV v1.0; F) CNV caller comparison of DRAGEN and CNVnator across different sizes of deletions based on GIAB SV v0.6; and G) benchmarking of short tandem repeats using GIAB v1.0. The recall and F-measure were calculated using the GIAB catalog, and the recall* and F-measure* were calculated using the catalogs of DRAGEN and GangSTR.
Figure 3: Performance overview of DRAGEN for HG001–07 samples.
A) Length distribution of different variants for all samples (bin sizes used for the plot, from left to right, are: 500, 250, 150, 50300, 150, 250, 500); B) the recall, precision, and F-measure of DRAGEN for HG001–07 samples; C) comparison of false negative (FN) and false positive (FP) numbers among DeepVariant with BWA, DeepVariant with Giraffe, and DRAGEN for HG001–07 SNV calls; D) comparison of recall, precision, and F-measure of these samples for four different tools, i.e., DRAGEN, GATK, DeepVariant-BWA, and DeepVariant-Giraffe; and E) the average F-measures and errors (false positives and false negatives) for the different tools.
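The recall, precision, and F-measure reported in the benchmark figures follow the standard definitions based on true positive, false positive, and false negative counts. The sketch below illustrates those definitions only; the `BenchmarkCounts` class, the function names, and the example counts are hypothetical and are not taken from the preprint or from any DRAGEN or GIAB tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCounts:
    """Hypothetical container for counts from a truth-set comparison."""
    true_positives: int   # calls that match the truth set
    false_positives: int  # calls absent from the truth set
    false_negatives: int  # truth-set variants that were not called


def precision(counts: BenchmarkCounts) -> float:
    """Fraction of called variants that are correct: TP / (TP + FP)."""
    called = counts.true_positives + counts.false_positives
    return counts.true_positives / called if called else 0.0


def recall(counts: BenchmarkCounts) -> float:
    """Fraction of truth-set variants that were found: TP / (TP + FN)."""
    truth = counts.true_positives + counts.false_negatives
    return counts.true_positives / truth if truth else 0.0


def f_measure(counts: BenchmarkCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(counts), recall(counts)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts for illustration only; not results from the preprint.
example = BenchmarkCounts(true_positives=990, false_positives=10, false_negatives=20)
print(f"precision={precision(example):.4f}")
print(f"recall={recall(example):.4f}")
print(f"F-measure={f_measure(example):.4f}")
```

Because the F-measure is a harmonic mean, a caller cannot score well by maximizing only one of precision or recall, which is likely why the captions report it alongside the raw FP and FN counts.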
Figure 4: DRAGEN SNV calls for 1kGP sample.
A) PCA plot of principal components 1 and 2 for SNVs across the population; B) distribution of SNV counts and C) distribution of indel counts at the super-population level; D) singleton (allele count = 1), rare (allele frequency <= 1%), and common (allele frequency > 1%) variant counts of the GATK v4.1 and DRAGEN v4.0.3 callsets for SNVs and E) indels at the cohort level, with Known and Novel variants based on the dbSNP 155 database; F) distribution of SNVs based on their functional annotations (upper plot) and the fraction of Known and Novel variants (lower plot); and G) distribution of small insertions and deletions based on their functional annotations.
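The frequency classes in the Figure 4 caption (singleton, rare, common) can be expressed as a small classification rule. This is a minimal sketch under stated assumptions: the function name `classify_variant` is hypothetical, allele frequency is taken as allele count divided by allele number, and singletons are checked first so that the three classes do not overlap (the caption does not say how a singleton, which also has a frequency below 1% in a large cohort, is assigned).

```python
def classify_variant(allele_count: int, allele_number: int) -> str:
    """Assign a frequency class using the thresholds from the caption.

    allele_count  -- copies of the alternate allele observed in the cohort
    allele_number -- total called alleles at the site (assumed to be
                     2 x samples for a diploid autosomal site with no
                     missing genotypes)
    """
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    if not 0 < allele_count <= allele_number:
        raise ValueError("allele_count must be between 1 and allele_number")

    # Assumption: singletons take precedence over the 'rare' class.
    if allele_count == 1:
        return "singleton"

    allele_frequency = allele_count / allele_number
    return "rare" if allele_frequency <= 0.01 else "common"


# 3,202 diploid samples give 6,404 alleles at a fully genotyped autosomal site.
cohort_alleles = 2 * 3202
for count in (1, 2, 64, 65):
    print(count, classify_variant(count, cohort_alleles))
```

With 6,404 alleles, the 1% boundary falls between allele counts of 64 and 65, so the loop shows one value from each side of the threshold.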
Figure 5: DRAGEN SV calls for 1kGP sample.
A) PCA of merged STR, SV and CNV for deletions >5% on chromosome; B) distribution of insertion and C) deletion type structural variants (>= 50 bp) among populations; D) distribution of SV, STR (for ~50K loci) and CNV variants based on average count, i.e., total variants of a population / population count; E) distribution of variant numbers among all 3,202 samples for the 12 challenging medically relevant gene (CMRG) regions (in GRCh38) that are affected by false duplication and false collapse errors. DRAGEN uses the corrected reference as part of its multigenome approach to correctly identify more variants in duplicated and collapsed regions; and F) Class I HLA allele frequency distributions among all 1kGP populations.
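The "average count" in panel D of Figure 5 is described as the total variants of a population divided by the population count. A minimal sketch of that calculation follows, assuming "population count" means the number of samples in the population; the function name, sample IDs, population labels, and counts are all hypothetical.

```python
from collections import defaultdict
from typing import Dict


def average_count_per_population(
    sample_population: Dict[str, str],
    sample_variant_count: Dict[str, int],
) -> Dict[str, float]:
    """Return total variants in a population divided by its sample count."""
    totals: Dict[str, int] = defaultdict(int)
    sizes: Dict[str, int] = defaultdict(int)
    for sample, population in sample_population.items():
        # Samples with no recorded calls contribute zero variants.
        totals[population] += sample_variant_count.get(sample, 0)
        sizes[population] += 1
    return {population: totals[population] / sizes[population] for population in totals}


# Made-up sample IDs, population labels, and counts for illustration only.
populations = {"S1": "POP_A", "S2": "POP_A", "S3": "POP_B"}
variant_counts = {"S1": 10, "S2": 20, "S3": 7}
averages = average_count_per_population(populations, variant_counts)
print(averages)
```

Normalizing by the number of samples makes populations of different sizes comparable, which a raw total would not.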