Speech processing

From Wikipedia, the free encyclopedia

Study of speech signals and the processing methods of these signals

This article is about electronic speech processing. For speech processing in the human brain, see Language processing in the brain.

Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement and speaker recognition, among others.^[1]

History

Early attempts at speech processing and recognition were primarily focused on understanding a handful of simple phonetic elements such as vowels. In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.^[2] Pioneering work in the field of speech recognition based on analysis of the speech spectrum was reported in the 1940s.^[3]

Linear predictive coding (LPC), a speech processing algorithm, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966.^[4] Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s.^[4] LPC was the basis for voice-over-IP (VoIP) technology,^[4] as well as speech synthesizer chips, such as the Texas Instruments LPC Speech Chips used in the Speak & Spell toys from 1978.^[5]

One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary.^[6]

By the early 2000s, the dominant speech processing strategy started to shift away from hidden Markov models towards more modern neural networks and deep learning.^[7]

In 2012, Geoffrey Hinton and his team at the University of Toronto demonstrated that deep neural networks could significantly outperform traditional HMM-based systems on large vocabulary continuous speech recognition tasks. This breakthrough led to widespread adoption of deep learning techniques in the industry.^[8]^[9]

By the mid-2010s, companies like Google, Microsoft, Amazon, and Apple had integrated advanced speech recognition systems into their virtual assistants such as Google Assistant, Cortana, Alexa, and Siri.^[10] These systems utilized deep learning models to provide more natural and accurate voice interactions.

The development of Transformer-based models, like Google's BERT (Bidirectional Encoder Representations from Transformers) and OpenAI's GPT (Generative Pre-trained Transformer), further pushed the boundaries of natural language processing and speech recognition. These models enabled more context-aware and semantically rich understanding of speech.^[8] In recent years, end-to-end speech recognition models have gained popularity. These models simplify the speech recognition pipeline by directly converting audio input into text output, bypassing intermediate steps like feature extraction and acoustic modeling. This approach has streamlined the development process and improved performance.^[11]

Techniques

Dynamic time warping

Main article: Dynamic time warping

Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) subject to certain restrictions and rules. The optimal match is the one that satisfies all the restrictions and rules and has the minimal cost, where the cost is computed as the sum, over all matched pairs of indices, of the absolute differences between their values.^[citation needed]
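
The cost minimisation described above can be illustrated with a short dynamic-programming sketch. It is only a minimal illustration, not a reference implementation: the function name `dtw_distance` is hypothetical, and the restrictions are assumed to be the commonly used ones (the first and last samples of the two sequences are matched to each other, and the alignment moves forward one step at a time in either or both sequences).

```python
def dtw_distance(a, b):
    """Return the minimal cumulative cost of aligning sequences a and b.

    Assumed rules: the alignment starts at the first samples, ends at the
    last samples, and advances by one step in a, in b, or in both.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: absolute difference of the matched values
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


# The second sequence is a slowed-down copy of the first, so the cost is zero.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The table has (n + 1) × (m + 1) entries, so time and memory grow with the product of the two sequence lengths; practical systems often restrict the search to a band around the diagonal.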

Hidden Markov models

Main article: Hidden Markov model

A hidden Markov model can be represented as the simplest dynamic Bayesian network. The goal of the algorithm is to estimate a hidden variable x(t) given a list of observations y(t). By applying the Markov property, the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all times, depends only on the value of the hidden variable x(t − 1). Similarly, the value of the observed variable y(t) only depends on the value of the hidden variable x(t) (both at time t).^[citation needed]
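
One common way to estimate the hidden sequence x(t) from the observations y(t) is the Viterbi algorithm, which finds the single most probable state path. The sketch below is illustrative only: the two states ("silence", "speech"), the two observation symbols ("low", "high" frame energy) and all probabilities are made-up assumptions, and the function name `viterbi` is hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path for the observations."""
    # best[t][s]: probability of the likeliest path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that likeliest path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Markov property: only the state at time t - 1 matters
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            # The observation at time t depends only on the state at time t
            best[t][s] = (best[t - 1][prev] * trans_p[prev][s]
                          * emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace the best path backwards from the likeliest final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model (all numbers are assumptions chosen for illustration)
states = ["silence", "speech"]
start_p = {"silence": 0.6, "speech": 0.4}
trans_p = {"silence": {"silence": 0.7, "speech": 0.3},
           "speech": {"silence": 0.4, "speech": 0.6}}
emit_p = {"silence": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.2, "high": 0.8}}

print(viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p))
# prints ['silence', 'speech', 'speech']
```

For long observation sequences the products of probabilities underflow, so implementations usually work with log-probabilities and sums instead.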

Artificial neural networks

Main article: Artificial neural network

An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.^[citation needed]
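
The computation performed by a single artificial neuron can be sketched in a few lines. This is a minimal illustration under stated assumptions: the inputs are combined as a weighted sum plus a bias term (a common choice), the non-linear function is the logistic sigmoid (one of many possible choices), and the names `neuron` and `layer` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic non-linearity, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Output of one artificial neuron: non-linear function of a weighted sum."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def layer(inputs, weight_rows, biases):
    """Outputs of several neurons that all receive the same input signals."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Two input signals feeding a layer of three neurons (arbitrary example weights)
hidden = layer([0.5, -1.0],
               [[1.0, 2.0], [0.0, 0.0], [-1.5, 0.5]],
               [0.0, 0.0, 1.0])
print(hidden)
```

Feeding the outputs of one `layer` call into another stacks layers into a network; in practice the weights and biases are learned from data rather than set by hand.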

Phase-aware processing

Phase is often assumed to be random, but it contains useful information. Phase wrapping^[12] can be introduced by periodic jumps of $2\pi$. After phase unwrapping (see^[13] Chapter 2.3; Instantaneous phase and frequency), the phase can be expressed as:^[12]^[14] $\phi(h,l)=\phi_{lin}(h,l)+\Psi(h,l)$, where $\phi_{lin}(h,l)=\omega_{0}(l')\,\Delta t$ is the linear phase ($\Delta t$ is the temporal shift at each frame of analysis) and $\Psi(h,l)$ is the phase contribution of the vocal tract and phase source.^[14] The obtained phase estimates can be used for noise reduction: temporal smoothing of the instantaneous phase^[15] and of its derivatives with respect to time (instantaneous frequency) and frequency (group delay),^[16] and smoothing of phase across frequency.^[16] Joint amplitude and phase estimators can recover speech more accurately, based on the assumption of a von Mises distribution of phase.^[14]
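
The wrapping and unwrapping steps mentioned above can be illustrated with a small sketch. It is a generic textbook-style procedure, not the method of the cited works: it assumes the true phase changes by less than $\pi$ between consecutive samples, and the names `wrap` and `unwrap` as well as the example frequency are hypothetical.

```python
import math


def wrap(phase):
    """Map a phase value to the principal interval [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi


def unwrap(wrapped):
    """Remove 2*pi jumps, assuming the true phase moves by less than pi per sample."""
    out = [wrapped[0]]
    for p in wrapped[1:]:
        # Smallest phase increment consistent with the observed wrapped value
        step = wrap(p - out[-1])
        out.append(out[-1] + step)
    return out


# Linear phase omega_0 * t sampled at unit time steps (omega_0 is an assumed value)
omega_0 = 1.0
true_phase = [omega_0 * t for t in range(20)]
observed = [wrap(p) for p in true_phase]   # contains periodic 2*pi jumps
recovered = unwrap(observed)               # follows the linear trend again

print(max(abs(r - p) for r, p in zip(recovered, true_phase)))
```

If the phase advances by more than $\pi$ between samples, the jumps become ambiguous and this simple procedure can no longer recover the original trend.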

Applications

References

[1] Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiking; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). "The Speed Submission to DIHARD II: Contributions & Lessons Learned". arXiv:1911.02388 [eess.AS].
[2] Juang, B.-H.; Rabiner, L.R. (2006), "Speech Recognition, Automatic: History", Encyclopedia of Language & Linguistics, Elsevier, pp. 806–819, doi:10.1016/b0-08-044854-2/00906-8, ISBN 9780080448541
[3] Myasnikov, L. L.; Myasnikova, Ye. N. (1970). Automatic recognition of sound pattern (in Russian). Leningrad: Energiya.
[4] Gray, Robert M. (2010). "A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol" (PDF). Found. Trends Signal Process. 3 (4): 203–303. doi:10.1561/2000000036. ISSN 1932-8346.
[5] "VC&G - VC&G Interview: 30 Years Later, Richard Wiggins Talks Speak & Spell Development".
[6] Huang, Xuedong; Baker, James; Reddy, Raj (2014-01-01). "A historical perspective of speech recognition". Communications of the ACM. 57 (1): 94–103. doi:10.1145/2500887. ISSN 0001-0782. S2CID 6175701.
[7] Furui, Sadaoki (2005). "50 Years of Progress in Speech and Speaker Recognition Research". ECTI Transactions on Computer and Information Technology. 1 (2): 64–74. doi:10.37936/ecti-cit.200512.51834. ISSN 2286-9131.
[8] "Deep Neural Networks for Acoustic Modeling in Speech Recognition" (PDF). 2019-07-23. Retrieved 2024-11-05.
[9] "SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS" (PDF). 2019-07-23. Retrieved 2024-11-05.
[10] Hoy, Matthew B. (2018). "Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants". Medical Reference Services Quarterly. 37 (1): 81–88. doi:10.1080/02763869.2018.1404391. ISSN 1540-9597. PMID 29327988.
[11] Hagiwara, Masato (2021-12-21). Real-World Natural Language Processing: Practical applications with deep learning. Simon and Schuster. ISBN 978-1-63835-039-2.
[12] Mowlaee, Pejman; Kulmer, Josef (August 2015). "Phase Estimation in Single-Channel Speech Enhancement: Limits-Potential". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 23 (8): 1283–1294. Bibcode:2015ITASL..23.1283M. doi:10.1109/TASLP.2015.2430820. ISSN 2329-9290. S2CID 13058142.
[13] Mowlaee, Pejman; Kulmer, Josef; Stahl, Johannes; Mayer, Florian (2017). Single channel phase-aware signal processing in speech communication: theory and practice. Chichester: Wiley. ISBN 978-1-119-23882-9.
[14] Kulmer, Josef; Mowlaee, Pejman (April 2015). "Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR". Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE. pp. 5063–5067.
[15] Kulmer, Josef; Mowlaee, Pejman (May 2015). "Phase Estimation in Single Channel Speech Enhancement Using Phase Decomposition". IEEE Signal Processing Letters. 22 (5): 598–602. Bibcode:2015ISPL...22..598K. doi:10.1109/LSP.2014.2365040. ISSN 1070-9908. S2CID 15503015.
[16] Mowlaee, Pejman; Saeidi, Rahim; Stylianou, Yannis (July 2016). "Advances in phase-aware signal processing in speech communication". Speech Communication. 81: 1–29. doi:10.1016/j.specom.2016.04.002. ISSN 0167-6393. S2CID 17409161. Retrieved 2017-12-03.
