Movatterモバイル変換


[0]ホーム

URL:


Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
Thehttps:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

NIH NLM Logo
Log inShow account info
Access keysNCBI HomepageMyNCBI HomepageMain ContentMain Navigation
pubmed logo
Advanced Clipboard
User Guide

Full text links

Atypon full text link Atypon Free PMC article
Full text links

Actions

Share

.2020 Aug 4;117(31):18489-18496.
doi: 10.1073/pnas.2004821117. Epub 2020 Jul 16.

HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints

Affiliations

HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints

William H Press et al. Proc Natl Acad Sci U S A..

Abstract

Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed-Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine-cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.

Keywords: DNA; Reed–Solomon; error-correcting code; indel; information storage.

Copyright © 2020 the Author(s). Published by PNAS.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
(A) Distribution of insertion and deletion errors (indels) in a typical DNA storage pipeline (Table 1); ins, insertion; del, deletion; sub, substitution. (B) (Left) Existing DNA-based encoding methods require sequence-level redundancy, strand alignment, and consensus calling to reduce indel errors. (Right) HEDGES corrects indel and substitution errors from a single read. (C) Overview of the interleaved encoding pipeline used throughout this paper. (D) HEDGES encoding algorithm in the simplest case: half-rate code, no sequence constraints. The HEDGES encoding algorithm is a variant of plaintext auto-key, but with redundancy introduced because (in the case of a half-rate code, for example) 1 bit of input generates 2 bits of output. Hashing each bit value with its strand ID, bit index, and a few previous bits “poisons” bad decoding hypotheses, allowing for correction of indels. (E) An example HEDGES encode, encoding bit 9 of the shown data strand (red box). As inD, half-rate code, no sequence constraints. (F) The HEDGES decoding algorithm is a greedy search on an expanding tree of hypotheses. Each hypothesis simultaneously guesses one or more message bitsvi, its bit position indexi, and its corresponding DNA character position indexk. A “greediness parameter”Pok (seeSI Appendix, Supplementary Text) limits exponential tree growth: Most spawned nodes are never revisited. (G) Illustration of a simplified HEDGES decode. The example bit strand message is encoded and then sequenced with an insertion error. Blue squares give decoding action order: 1, Initialize Start node; 2 to 5, explore best hypothesis at each step; and 6, traceback and output the best hypothesis message. DNA image credit:freepik.com.
Fig. 2.
Fig. 2.
In silico performance of the HEDGES algorithm. (A) The in silico byte error rate for the HEDGES algorithm as a function of code rate,r, shown for a range of simulated DNA error ratesPerr. (B) The mean number of bytes to an uncorrectable error, assuming the interleaved RS(255,223) design discussed in the text.
See this image and copyright information in PMC

Similar articles

See all similar articles

Cited by

See all "Cited by" articles

References

    1. Church G. M., Gao Y., Kosuri S., Next-generation digital information storage in DNA. Science 337, 1628 (2012) - PubMed
    1. Goldman N., et al. , Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013). - PMC - PubMed
    1. Grass R. N., Heckel R., Pudda M., Paunescu D., Stark W. J., Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. 54, 2552–2555 (2015). - PubMed
    1. Bornholt J., et al. , A DNA-based archival storage system. Comput. Architect. News 44, 637–649 (2016).
    1. Erlich Y., Zielinski D., DNA Fountain enables a robust and efficient storage architecture. Science 255, 950–954 (2017). - PubMed

Publication types

MeSH terms

Substances

Related information

Grants and funding

LinkOut - more resources

Full text links
Atypon full text link Atypon Free PMC article
Cite
Send To

NCBI Literature Resources

MeSHPMCBookshelfDisclaimer

The PubMed wordmark and PubMed logo are registered trademarks of the U.S. Department of Health and Human Services (HHS). Unauthorized use of these marks is strictly prohibited.


[8]ページ先頭

©2009-2025 Movatter.jp