2020 Aug 4;117(31):18489-18496.

doi: 10.1073/pnas.2004821117. Epub 2020 Jul 16.

HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints

William H Press¹ ², John A Hawkins³ ⁴ ⁵, Stephen K Jones Jr⁴ ⁵, Jeffrey M Schaub⁴ ⁵, Ilya J Finkelstein⁴ ⁵

Affiliations

¹ Department of Computer Science, The University of Texas at Austin, Austin, TX 78712; wpress@cs.utexas.edu.
² Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712.
³ Oden Institute of Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712.
⁴ Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX 78712.
⁵ Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712.

PMID: 32675237
PMCID: PMC7414044
DOI: 10.1073/pnas.2004821117

William H Press et al. Proc Natl Acad Sci U S A. 2020.

Abstract

Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed-Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine-cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.

Keywords: DNA; Reed–Solomon; error-correcting code; indel; information storage.
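
The sequence constraints named in the abstract (no excess repeats, windowed GC content neither too high nor too low) can be illustrated with a small checker. This is a minimal sketch and not the HEDGES implementation: the function names, the run limit of 3, the window of 12, and the 30 to 70% GC bounds are all assumptions chosen for illustration.

```python
from itertools import groupby


def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty string)."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def windowed_gc_range(seq: str, window: int) -> tuple[float, float]:
    """Lowest and highest GC fraction over all sliding windows.

    If the sequence is shorter than the window, the whole sequence
    is treated as a single window.
    """
    w = min(window, len(seq))
    if w == 0:
        return (0.0, 0.0)
    fractions = [
        sum(base in "GC" for base in seq[i:i + w]) / w
        for i in range(len(seq) - w + 1)
    ]
    return (min(fractions), max(fractions))


def satisfies_constraints(seq: str, max_run: int = 3, window: int = 12,
                          gc_min: float = 0.3, gc_max: float = 0.7) -> bool:
    """True if no run exceeds max_run and every window stays inside the GC bounds.

    All default thresholds are hypothetical, not values from the paper.
    """
    lo, hi = windowed_gc_range(seq, window)
    return max_homopolymer_run(seq) <= max_run and gc_min <= lo and hi <= gc_max
```

An encoder that supports such constraints would have to avoid emitting any base that makes a check of this kind fail; how HEDGES does that is described in the full paper, not in this sketch.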

Figures

**Fig. 1.**
(A) Distribution of insertion and deletion errors (indels) in a typical DNA storage pipeline (Table 1); ins, insertion; del, deletion; sub, substitution. (B) (*Left*) Existing DNA-based encoding methods require sequence-level redundancy, strand alignment, and consensus calling to reduce indel errors. (*Right*) HEDGES corrects indel and substitution errors from a single read. (C) Overview of the interleaved encoding pipeline used throughout this paper. (D) HEDGES encoding algorithm in the simplest case: half-rate code, no sequence constraints. The HEDGES encoding algorithm is a variant of plaintext auto-key, but with redundancy introduced because (in the case of a half-rate code, for example) 1 bit of input generates 2 bits of output. Hashing each bit value with its strand ID, bit index, and a few previous bits “poisons” bad decoding hypotheses, allowing for correction of indels. (E) An example HEDGES encode, encoding bit 9 of the shown data strand (red box). As in D, half-rate code, no sequence constraints. (F) The HEDGES decoding algorithm is a greedy search on an expanding tree of hypotheses. Each hypothesis simultaneously guesses one or more message bits $v_i$, its bit position index $i$, and its corresponding DNA character position index $k$. A “greediness parameter” $P_{ok}$ (see SI Appendix, Supplementary Text) limits exponential tree growth: Most spawned nodes are never revisited. (G) Illustration of a simplified HEDGES decode. The example bit strand message is encoded and then sequenced with an insertion error. Blue squares give decoding action order: 1, Initialize Start node; 2 to 5, explore best hypothesis at each step; and 6, traceback and output the best hypothesis message. DNA image credit: freepik.com.
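
The half-rate scheme in panels D to G can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' code: SHA-256 stands in for the paper's hash, `PREV_BITS`, the score values `P_OK` and `P_ERR`, and all function names are hypothetical, and the stopping rule is simplified to "first hypothesis that reaches the message length". Each input bit is added (mod 4) to a hashed key and emitted as one base, and the decoder runs a best-first search over hypotheses `(score, i, k, bits)`.

```python
import hashlib
import heapq

BASES = "ACGT"
PREV_BITS = 8              # assumed number of previous bits fed to the hash
P_OK, P_ERR = -0.1, 1.0    # assumed reward per match / penalty per guessed error


def symbol(strand_id, history, v):
    """Base emitted for bit v, given all earlier bits of the strand."""
    i = len(history)                                   # bit index
    recent = "".join(str(b) for b in history[-PREV_BITS:])
    digest = hashlib.sha256(f"{strand_id}:{i}:{recent}".encode()).digest()
    return BASES[(digest[0] + v) % 4]                  # key + bit, mod 4


def encode(bits, strand_id):
    """Half-rate encode: one input bit becomes one base (2 bits of output)."""
    return "".join(symbol(strand_id, bits[:i], b) for i, b in enumerate(bits))


def decode(dna, n_bits, strand_id, max_nodes=100_000):
    """Best-first search; returns the guessed bits, or None if the budget runs out."""
    heap = [(0.0, 0, 0, ())]   # (score, bit index i, DNA position k, bits so far)
    popped = 0
    while heap and popped < max_nodes:
        score, i, k, bits = heapq.heappop(heap)        # lowest score first
        popped += 1
        if i == n_bits:
            return list(bits)
        for v in (0, 1):
            c = symbol(strand_id, bits, v)
            nxt = bits + (v,)
            if k < len(dna):   # dna[k] is either correct or a substitution
                step = P_OK if dna[k] == c else P_ERR
                heapq.heappush(heap, (score + step, i + 1, k + 1, nxt))
            # deletion hypothesis: the expected base is missing, consume nothing
            heapq.heappush(heap, (score + P_ERR, i + 1, k, nxt))
            if k + 1 < len(dna):   # insertion hypothesis: skip dna[k], compare dna[k+1]
                step = P_ERR + (P_OK if dna[k + 1] == c else P_ERR)
                heapq.heappush(heap, (score + step, i + 1, k + 2, nxt))
    return None


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
    strand = encode(message, strand_id=7)
    print(strand, decode(strand, len(message), strand_id=7) == message)
```

Because the key depends on earlier bits, a wrong guess changes every later key, so a bad hypothesis keeps collecting penalties and sinks in the heap. For an error-free read the correct hypothesis always has the lowest score, so the round trip is exact. Recovery from a corrupted read depends on the penalties and the node budget, and this sketch makes no guarantee about it.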

**Fig. 2.**
In silico performance of the HEDGES algorithm. (A) The in silico byte error rate for the HEDGES algorithm as a function of code rate, $r$, shown for a range of simulated DNA error rates $P_{err}$. (B) The mean number of bytes to an uncorrectable error, assuming the interleaved RS(255,223) design discussed in the text.
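
The quantity in panel B can be approximated with a short calculation. An RS(255,223) codeword carries 223 message bytes and 32 parity bytes, so it corrects up to 16 byte errors at unknown positions. The sketch below assumes that byte errors left by the inner code are independent with probability `p_byte` (interleaving across strands is meant to approach this), and it defines "mean bytes to an uncorrectable error" as message bytes per failed codeword. Both are simplifying assumptions of this example, not the paper's statistical model.

```python
from math import comb

N, K = 255, 223        # RS(255,223): codeword bytes, message bytes
T = (N - K) // 2       # correctable byte errors per codeword (16)


def codeword_failure_prob(p_byte: float) -> float:
    """Probability that more than T of the N bytes in a codeword are wrong.

    The binomial tail is summed directly so that very small
    probabilities are not lost to rounding in 1 - (sum of the head).
    """
    return sum(
        comb(N, e) * p_byte**e * (1.0 - p_byte) ** (N - e)
        for e in range(T + 1, N + 1)
    )


def mean_bytes_to_failure(p_byte: float) -> float:
    """Expected message bytes decoded per uncorrectable codeword (assumed definition)."""
    p_fail = codeword_failure_prob(p_byte)
    return float("inf") if p_fail <= 0.0 else K / p_fail


if __name__ == "__main__":
    for p in (0.001, 0.01, 0.03):
        print(p, mean_bytes_to_failure(p))
```

The steep dependence on `p_byte` is why a modest improvement in the inner code's byte error rate translates into many orders of magnitude in recoverable data volume.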

Cited by

Composite Hedges Nanopores codec system for rapid and portable DNA data readout with high INDEL-Correction.
Zhao X, Li J, Fan Q, Dai J, Long Y, Liu R, Zhai J, Pan Q, Li Y. Nat Commun. 2024 Oct 30;15(1):9395. doi: 10.1038/s41467-024-53455-3. PMID: 39477940. Free PMC article.

FrameD: framework for DNA-based data storage design, verification, and validation.
Volkel KD, Lin KN, Hook PW, Timp W, Keung AJ, Tuck JM. Bioinformatics. 2023 Oct 3;39(10):btad572. doi: 10.1093/bioinformatics/btad572. PMID: 37713474. Free PMC article.

An Analysis of Algebraic Codes over Lattice Valued Intuitionistic Fuzzy Type-3R-Submodules.
Riaz A, Kousar S, Kausar N, Pamucar D, Addis GM. Comput Intell Neurosci. 2022 Jun 23;2022:8148284. doi: 10.1155/2022/8148284. eCollection 2022. PMID: 35785082. Free PMC article.

RepairNatrix: a Snakemake workflow for processing DNA sequencing data for DNA storage.
Schwarz PM, Welzel M, Heider D, Freisleben B. Bioinform Adv. 2023 Aug 26;3(1):vbad117. doi: 10.1093/bioadv/vbad117. eCollection 2023. PMID: 38496344. Free PMC article.

Turbo autoencoders for the DNA data storage channel with Autoturbo-DNA.
Welzel M, Dreßler H, Heider D. iScience. 2024 Mar 27;27(5):109575. doi: 10.1016/j.isci.2024.109575. eCollection 2024 May 17. PMID: 38638577. Free PMC article.
