Abstract
We introduce a novel, linguistics-like method of genome analysis. We propose characterizing genomic sequences by the occurrences of fixed-length words from a predefined, sufficiently large set of words (strings over the alphabet {A, C, G, T}). The resulting measure, called the compositional spectrum, is a histogram of imperfect word occurrences. Our results indicate that the compositional spectrum is an overall characteristic of a long sequence, i.e., a complete genome or an uninterrupted part of a chromosome. This is manifested in the similarity of spectra obtained from different stretches of the same genome and, simultaneously, in a broad range of dissimilarities between the spectra of different genomes. Imperfect matching makes the approach highly flexible and, as a result, allows sets of relatively long words to be considered. The proposed approach may have various applications in intra- and intergenomic sequence comparisons.
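The idea of a histogram of imperfect word occurrences can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define "imperfect", so the sketch assumes a window counts as an occurrence when it differs from the word by at most `max_mismatch` substitutions (Hamming distance). The function names, the normalization to relative frequencies, and the `spectrum_distance` measure are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_imperfect_occurrences(
    seq: str, words: Sequence[str], max_mismatch: int
) -> List[int]:
    """Count, for each word, the windows of seq that match it imperfectly.

    Assumption: a window matches a word when their Hamming distance
    is at most max_mismatch. All words must share one fixed length.
    """
    length = len(words[0])
    if any(len(w) != length for w in words):
        raise ValueError("all words must have the same length")
    counts = [0] * len(words)
    # Slide a window of the word length along the sequence.
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        for j, word in enumerate(words):
            if hamming(window, word) <= max_mismatch:
                counts[j] += 1
    return counts


def compositional_spectrum(
    seq: str, words: Sequence[str], max_mismatch: int
) -> List[float]:
    """Histogram of imperfect occurrences, normalized to sum to 1."""
    counts = count_imperfect_occurrences(seq, words, max_mismatch)
    total = sum(counts)
    if total == 0:
        return [0.0] * len(words)
    return [c / total for c in counts]


def spectrum_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Hypothetical dissimilarity: sum of absolute frequency differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    words = ["AAAA", "AAAC"]  # toy word set; a real set would be far larger
    print(count_imperfect_occurrences("AAAAC", words, 0))
    print(count_imperfect_occurrences("AAAAC", words, 1))
```

In this toy setup, spectra computed on two stretches of one sequence could be compared with `spectrum_distance`, and the same could be done for stretches taken from different sequences. The brute-force loop over every window and every word is quadratic in practice and is meant only to make the definition concrete.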
Author information
Authors and Affiliations
Institute of Evolution, University of Haifa, Mount Carmel, Haifa, 31905, Israel
Valery Kirzhner, Eviatar Nevo, Abraham Korol & Alexander Bolshoy
About this article
Cite this article
Kirzhner, V., Nevo, E., Korol, A. et al. A Large-Scale Comparison of Genomic Sequences: One Promising Approach. Acta Biotheor 51, 73–89 (2003). https://doi.org/10.1023/A:1024553109779