Pronunciation Modeling for Spontaneous Mandarin Speech Recognition

International Journal of Speech Technology

Abstract

Pronunciation variations in spontaneous speech can be classified into complete changes and partial changes. A complete change is the replacement of a canonical phoneme by an alternate phone, such as 'b' being pronounced as 'p'. Partial changes are variations within the phoneme, such as nasalization, centralization, and voicing. Most current work in pronunciation modeling for spontaneous Mandarin speech remains at the phone level and can model only complete changes, not partial changes. In this paper, we show that partial changes are much less clear-cut than previously assumed and cannot be modeled simply by representing them with alternate phone units. We present a solution for modeling both complete changes and partial changes in spontaneous Mandarin speech.
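As a rough illustration of what phone-level modeling of complete changes involves, the sketch below estimates substitution probabilities P(surface | canonical) from aligned phone pairs. The function name `substitution_probs` and the toy alignment are assumptions made for this example and do not come from the paper. A partial change such as nasalization has no slot in this table, which is the limitation described above.

```python
from collections import Counter, defaultdict

def substitution_probs(aligned_pairs):
    """Estimate P(surface | canonical) from aligned (canonical, surface) phone pairs."""
    counts = defaultdict(Counter)
    for canonical, surface in aligned_pairs:
        counts[canonical][surface] += 1
    probs = {}
    for canonical, surface_counts in counts.items():
        total = sum(surface_counts.values())
        probs[canonical] = {s: c / total for s, c in surface_counts.items()}
    return probs

# Toy, made-up alignment: 'b' is realized as 'p' once in four occurrences.
pairs = [("b", "b"), ("b", "b"), ("b", "p"), ("b", "b"), ("a", "a")]
probs = substitution_probs(pairs)
print(probs["b"])
```

In practice the aligned pairs would come from a phonetically transcribed corpus aligned against canonical dictionary pronunciations.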

To model complete changes, we adapted decision tree-based pronunciation modeling from English to Mandarin to predict alternate pronunciations. To address data sparseness, we used cross-domain data to estimate pronunciation variability. To discard unreliable alternative pronunciations, we proposed a likelihood ratio test as a confidence measure of the degree of phonetic confusion. To model partial changes, we proposed partial change phone models (PCPMs) with acoustic model reconstruction. PCPMs are regarded as extended units of standard phoneme or initial/final subword units, and can represent partial changes efficiently. To avoid model confusion, we generated auxiliary decision trees for PCPM triphones and used decision tree merging to perform acoustic model reconstruction. The effectiveness of these approaches was evaluated on the 1997 Hub4NE Mandarin Broadcast News corpus, which contains different styles of speech. Our phone-level pronunciation modeling provided an absolute 0.9% reduction in syllable error rate. The acoustic model reconstruction approach was more efficient than phone-level modeling at covering pronunciation variations, yielding a significant 2.39% absolute reduction in syllable error rate for spontaneous speech. In addition, our proposed method deals with partial changes at the acoustic model level and can be applied to any automatic speech recognition system based on subword units.
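The abstract does not give the exact form of the likelihood ratio test. One common form is the binomial log-likelihood ratio (Dunning's test), sketched below as a confidence measure for a candidate substitution of canonical phone c by surface phone s. The function names, the count arguments, and the 3.84 threshold (the chi-square critical value at 0.05 with one degree of freedom) are assumptions for illustration, not the authors' settings.

```python
import math

def _log_binom(k, n, p):
    """Log of p**k * (1 - p)**(n - k), treating 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def llr(c12, c1, c2, n):
    """Return -2 log(lambda) for a canonical -> surface phone confusion.

    c12: times canonical phone c was realized as surface phone s
    c1:  occurrences of canonical phone c
    c2:  occurrences of surface phone s (any canonical source)
    n:   total aligned phone tokens (must exceed c1)
    """
    p = c2 / n                    # H1: s is independent of c
    p1 = c12 / c1                 # H2: P(s | c)
    p2 = (c2 - c12) / (n - c1)    # H2: P(s | not c)
    log_lambda = (
        _log_binom(c12, c1, p) + _log_binom(c2 - c12, n - c1, p)
        - _log_binom(c12, c1, p1) - _log_binom(c2 - c12, n - c1, p2)
    )
    return -2.0 * log_lambda

def is_reliable(c12, c1, c2, n, threshold=3.84):
    """Keep an alternative only if s is significantly more likely after c."""
    positive = c12 / c1 > (c2 - c12) / (n - c1)
    return positive and llr(c12, c1, c2, n) >= threshold

# Made-up counts: a strong confusion versus a chance-level one.
print(is_reliable(40, 100, 50, 1000))   # strong association
print(is_reliable(10, 100, 100, 1000))  # no association beyond chance
```

Alternatives that fail the test would be dropped from the pronunciation lexicon, so that rare or accidental confusions do not add competing entries.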

Author information

Authors and Affiliations

  1. Human Language Technology Center, Department of Electrical and Electronic Engineering, University of Science and Technology, Hong Kong, China

    Yi Liu & Pascale Fung
