A Large-Scale Comparison of Genomic Sequences: One Promising Approach

Abstract

We introduce a novel, linguistics-like method of genome analysis. It characterizes a genomic sequence by the occurrences of fixed-length words drawn from a predefined, sufficiently large set of words (strings over the alphabet {A, C, G, T}). The resulting measure, called the compositional spectrum, is a histogram of imperfect word occurrences, that is, occurrences counted under inexact matching. Our results indicate that the compositional spectrum is an overall characteristic of a long sequence, i.e., of a complete genome or an uninterrupted part of a chromosome. This property shows in the similarity of spectra obtained from different stretches of the same genome and, at the same time, in a broad range of dissimilarities between the spectra of different genomes. Imperfect matching makes the approach highly flexible, so that sets of relatively long words can be considered. The proposed approach may have various applications in intra- and intergenomic sequence comparisons.
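As a rough illustration of the idea, the sketch below counts imperfect occurrences of each word from a small word set and normalizes the counts into a histogram. The abstract does not specify the word length, the mismatch tolerance, or how spectra are compared, so the Hamming-distance matching rule, the `max_mismatches` parameter, and the sum-of-absolute-differences distance are all assumptions made for this example, and every function name here is hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def spectrum_counts(sequence, words, max_mismatches=1):
    """Count, for each word, the windows of the sequence that match it
    with at most `max_mismatches` mismatches (assumed matching rule)."""
    if not words:
        return []
    length = len(words[0])
    if any(len(w) != length for w in words):
        raise ValueError("all words must have the same length")
    counts = [0] * len(words)
    # Slide a window of the word length along the sequence.
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        for i, word in enumerate(words):
            if hamming(window, word) <= max_mismatches:
                counts[i] += 1
    return counts


def compositional_spectrum(sequence, words, max_mismatches=1):
    """Normalize the imperfect-occurrence counts into a histogram."""
    counts = spectrum_counts(sequence, words, max_mismatches)
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def spectrum_distance(spec_a, spec_b):
    """Sum of absolute differences between two spectra
    (an assumed comparison measure, chosen only for illustration)."""
    return sum(abs(a - b) for a, b in zip(spec_a, spec_b))


if __name__ == "__main__":
    words = ["ACG", "TTT"]
    print(spectrum_counts("ACGTACGT", words, max_mismatches=2))
    print(compositional_spectrum("ACGTACGT", words, max_mismatches=2))
```

Under these assumptions, two stretches of one genome would be compared by computing a spectrum for each with the same word set and passing both to `spectrum_distance`. This brute-force loop costs time proportional to sequence length times the number of words, so it is suitable only for small illustrations.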

Author information

Authors and Affiliations

Institute of Evolution, University of Haifa, Mount Carmel, Haifa, 31905, Israel
Valery Kirzhner, Eviatar Nevo, Abraham Korol & Alexander Bolshoy

About this article

Cite this article

Kirzhner, V., Nevo, E., Korol, A. et al. A Large-Scale Comparison of Genomic Sequences: One Promising Approach. Acta Biotheor 51, 73–89 (2003). https://doi.org/10.1023/A:1024553109779

Issue Date: June 2003
DOI: https://doi.org/10.1023/A:1024553109779