Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, and speaker recognition.[1]
Early attempts at speech processing and recognition focused primarily on understanding a handful of simple phonetic elements such as vowels. In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.[2] Pioneering work in the field of speech recognition based on analysis of the speech spectrum was reported in the 1940s.[3]
Linear predictive coding (LPC), a speech processing algorithm, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966.[4] Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s.[4] LPC was the basis for voice-over-IP (VoIP) technology,[4] as well as speech synthesizer chips, such as the Texas Instruments LPC Speech Chips used in the Speak & Spell toys from 1978.[5]
One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in its Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary.[6]
By the early 2000s, the dominant speech processing strategy started to shift away from hidden Markov models towards more modern neural networks and deep learning.[7]
In 2012, Geoffrey Hinton and his team at the University of Toronto demonstrated that deep neural networks could significantly outperform traditional HMM-based systems on large-vocabulary continuous speech recognition tasks. This breakthrough led to widespread adoption of deep learning techniques in the industry.[8][9]
By the mid-2010s, companies like Google, Microsoft, Amazon, and Apple had integrated advanced speech recognition systems into their virtual assistants such as Google Assistant, Cortana, Alexa, and Siri.[10] These systems used deep learning models to provide more natural and accurate voice interactions.
The development of Transformer-based models, like Google's BERT (Bidirectional Encoder Representations from Transformers) and OpenAI's GPT (Generative Pre-trained Transformer), further pushed the boundaries of natural language processing and speech recognition. These models enabled more context-aware and semantically rich understanding of speech.[8] In recent years, end-to-end speech recognition models have gained popularity. These models simplify the speech recognition pipeline by directly converting audio input into text output, bypassing intermediate steps like feature extraction and acoustic modeling. This approach has streamlined the development process and improved performance.[11]
Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) subject to certain restrictions and rules. The optimal match is the match that satisfies all the restrictions and rules and has the minimal cost, where the cost is computed as the sum of absolute differences, for each matched pair of indices, between their values.[citation needed]
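The cost minimization described above can be illustrated with a minimal dynamic-programming sketch. The function name `dtw_distance` is hypothetical, and the step pattern (each index may advance in the first sequence, the second, or both) together with the requirement that the first and last elements be matched are assumptions standing in for the "restrictions and rules", which the text does not spell out.

```python
def dtw_distance(a, b):
    """Return the minimal cumulative cost of aligning sequences a and b.

    Assumed rules (not stated in the article): the first and last elements
    of the two sequences are matched, and indices never move backwards.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: absolute difference of the matched pair of values.
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


# The same shape spoken "more slowly" aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

This sketch runs in time proportional to the product of the two sequence lengths; practical systems often add a window constraint to limit how far the alignment may drift from the diagonal.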
A hidden Markov model can be represented as the simplest dynamic Bayesian network. The goal is to estimate a hidden variable x(t) given a sequence of observations y(t). By applying the Markov property, the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all times, depends only on the value of the hidden variable x(t − 1). Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t) (both at time t).[citation needed]
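One way to see how these two independence assumptions are used is the standard forward (filtering) recursion, sketched below for a discrete model. The function name `forward_filter`, the table layout, and the example probabilities are all hypothetical; the article does not specify a particular estimation algorithm.

```python
def forward_filter(initial, transition, emission, observations):
    """Return P(x(t) | y(1), ..., y(t)) for every t as a list of distributions.

    initial[i]        -- assumed prior probability of state i at the first step
    transition[i][j]  -- P(x(t) = j | x(t - 1) = i)
    emission[j][k]    -- P(y(t) = k | x(t) = j)
    Assumes every observation has non-zero probability under the model.
    """
    n_states = len(initial)
    belief = list(initial)
    filtered = []
    for t, y in enumerate(observations):
        if t > 0:
            # Predict: x(t) depends only on x(t - 1) (Markov property).
            belief = [sum(belief[i] * transition[i][j] for i in range(n_states))
                      for j in range(n_states)]
        # Update: y(t) depends only on x(t).
        unnormalized = [belief[j] * emission[j][y] for j in range(n_states)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
        filtered.append(belief)
    return filtered


# Hypothetical two-state model with two possible observation symbols.
beliefs = forward_filter(
    initial=[0.5, 0.5],
    transition=[[0.7, 0.3], [0.3, 0.7]],
    emission=[[0.9, 0.1], [0.2, 0.8]],
    observations=[0, 0],
)
print(beliefs)
```

Each step only needs the previous belief and the current observation, which is exactly what the two conditional-independence statements permit.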
An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.[citation needed]
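The computation performed by a single artificial neuron can be sketched as follows. The connection weights, the bias term, and the choice of the logistic sigmoid as the non-linear function are common conventions assumed here for illustration; the text above only states that the output is a non-linear function of the summed inputs. The names `neuron_output` and `layer_output` are hypothetical.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of real-valued input signals passed through a sigmoid."""
    # Each connection scales its signal by a weight (an assumed convention).
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Non-linear function of the sum; the sigmoid is one common choice.
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weight_rows, biases):
    """Apply several neurons to the same inputs, one weight row per neuron."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Two inputs feeding a layer of two neurons; the outputs could in turn be
# passed on as input signals to further neurons.
print(layer_output([1.0, 2.0], [[0.5, -0.25], [1.0, 1.0]], [0.0, -3.0]))
```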
Phase is often assumed to be random, but it contains useful information. Phase wrapping[12] arises from periodic jumps of 2π. After phase unwrapping (see,[13] Chapter 2.3; Instantaneous phase and frequency), the phase can be expressed[12][14] as the sum of a linear phase term, determined by the temporal shift at each frame of analysis, and the phase contribution of the vocal tract and the phase source.[14] The resulting phase estimates can be used for noise reduction: temporal smoothing of the instantaneous phase[15] and of its derivatives with respect to time (instantaneous frequency) and frequency (group delay),[16] and smoothing of phase across frequency.[16] Joint amplitude and phase estimators can recover speech more accurately, based on the assumption that phase follows a von Mises distribution.[14]
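Phase wrapping and unwrapping can be illustrated with a short one-dimensional sketch. It assumes the true phase changes by less than π between consecutive samples, which is the usual condition under which 2π jumps can be removed unambiguously; the function names `wrap_phase` and `unwrap_phase` are hypothetical, and this is not the specific procedure of the cited references.

```python
import math


def wrap_phase(phi):
    """Map a phase value into the principal interval [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi


def unwrap_phase(wrapped):
    """Remove 2*pi jumps between consecutive samples of a wrapped phase track.

    Assumes the underlying phase moves by less than pi per sample.
    """
    if not wrapped:
        return []
    unwrapped = [wrapped[0]]
    for p in wrapped[1:]:
        # The smallest step consistent with the new wrapped value.
        step = wrap_phase(p - unwrapped[-1])
        unwrapped.append(unwrapped[-1] + step)
    return unwrapped


# A linearly growing phase is wrapped (as an arctangent would return it)
# and then recovered by unwrapping.
true_phase = [0.5 * k for k in range(20)]
wrapped = [wrap_phase(p) for p in true_phase]
recovered = unwrap_phase(wrapped)
print(max(abs(r - t) for r, t in zip(recovered, true_phase)))
```

Once the phase track is continuous, operations such as temporal smoothing or differencing to obtain the instantaneous frequency are no longer disturbed by the artificial jumps.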