Our invention relates to speech processing and more particularly to digital speech coding arrangements.
Digital speech communication systems including voice storage and voice response facilities utilize signal compression to reduce the bit rate needed for storage and/or transmission. As is well known in the art, a speech pattern contains redundancies that are not essential to its apparent quality. Removal of redundant components of the speech pattern significantly lowers the number of digital codes required to construct a replica of the speech. The subjective quality of the speech replica, however, is dependent on the compression and coding techniques.
One well known digital speech coding system such as disclosed in U.S. Pat. No. 3,624,302 issued Nov. 30, 1971 includes linear prediction analysis of an input speech signal. The speech signal is partitioned into successive intervals and a set of parameters representative of the interval speech is generated. The parameter set includes linear prediction coefficient signals representative of the spectral envelope of the speech in the interval, and pitch and voicing signals corresponding to the speech excitation. These parameter signals may be encoded at a much lower bit rate than the speech signal waveform itself. A replica of the input speech signal is formed from the parameter signal codes by synthesis. The synthesizer arrangement generally comprises a model of the vocal tract in which the excitation pulses are modified by the spectral envelope representative prediction coefficients in an all pole predictive filter.
The foregoing pitch excited linear predictive coding is very efficient. The produced speech replica, however, exhibits a synthetic quality that is often difficult to understand. In general, the low speech quality results from the lack of correspondence between the speech pattern and the linear prediction model used. Errors in the pitch code or errors in determining whether a speech interval is voiced or unvoiced cause the speech replica to sound disturbed or unnatural. Similar problems are also evident in formant coding of speech. Alternative coding arrangements in which the speech excitation is obtained from the residual after prediction, e.g., ADPCM or APC, provide a marked improvement because the excitation is not dependent upon an inexact model. The excitation bit rate of these systems, however, is at least an order of magnitude higher than the linear predictive model. Attempts to lower the excitation bit rate in the residual type systems have generally resulted in a substantial loss in quality. It is an object of the invention to provide improved speech coding of high quality at lower bit rates than residual coding schemes.
BRIEF SUMMARY OF THE INVENTION
We have found that the foregoing residual encoding problems may be solved by forming a pattern predictive of a pattern (e.g. speech pattern) to be encoded and comparing the pattern to be encoded with the predictive pattern on a frame by frame basis. The differences between the pattern to be encoded and the predictive pattern over each frame are utilized to form a coded signal of a prescribed format which coded signal modifies the predictive pattern to minimize the frame differences. The bit rate of the prescribed format coded signal is selected so that the modified predictive pattern approximates the speech pattern to a desired level consistent with coding requirements.
The invention is directed to a sequential pattern processing arrangement in which the sequential pattern is partitioned into successive time intervals. In each time interval, a set of signals representative of the interval sequential pattern and a signal representative of the differences between the interval sequential pattern and the interval representative signal set are generated. A first signal corresponding to the interval pattern is formed responsive to said interval pattern representative signals and said interval differences representative signal and a second interval corresponding signal is generated responsive to said interval pattern representative signals. A signal corresponding to the differences between the first and second interval corresponding signals is formed and a third signal is produced responsive to said interval differences corresponding signal that alters the second signal to reduce the differences between said first and second interval corresponding signals.
According to one aspect of the invention, a speech pattern is partitioned into successive time intervals. In each interval, a set of signals representative of the speech pattern in each time interval and a signal representative of the differences between said interval speech pattern and the interval speech pattern representative signal set are generated. A first signal corresponding to the interval speech pattern is formed responsive to said interval speech representative signals and differences representative signal and a second interval corresponding signal is generated responsive to the interval speech pattern representative signals. A signal corresponding to the differences between the first and second interval representative signals is formed and a third signal is produced responsive to the interval differences corresponding signal that alters said second interval corresponding signal to reduce the differences corresponding signal.
According to another aspect of the invention, the third signal is utilized to construct a replica of the interval pattern.
In an embodiment of the invention, a set of predictive parameter signals is generated for each time frame from a speech signal. A prediction residual signal is formed responsive to the time frame speech signal and the time frame predictive parameters. The prediction residual signal is passed through a first predictive filter to produce a first speech representative signal for the time frame. A second speech representative signal is generated for the time frame in a second predictive filter from the frame prediction parameters. Responsive to the first speech representative and second speech representative signals of the time frame, a coded excitation signal is formed and applied to the second predictive filter to minimize the perceptually weighted mean squared difference between the frame first and second speech representative signals. The coded excitation signal and the predictive parameter signals are utilized to construct a replica of the time frame speech pattern.
DESCRIPTION OF THE DRAWING
FIG. 1 depicts a block diagram of a speech processor circuit illustrative of the invention;
FIG. 2 depicts a block diagram of an excitation signal forming processor that may be used in the circuit of FIG. 1;
FIG. 3 shows a flow chart that illustrates the operation of the excitation signal forming circuit of FIG. 1;
FIGS. 4 and 5 show flow charts that illustrate the operation of the circuit of FIG. 2;
FIG. 6 shows a timing diagram that is illustrative of the operation of the excitation signal forming circuit of FIG. 1 and of FIG. 2; and
FIG. 7 shows waveforms illustrating the speech processing of the invention.
DETAILED DESCRIPTION
FIG. 1 shows a general block diagram of a speech processor illustrative of the invention. In FIG. 1, a speech pattern such as a spoken message is received by microphone transducer 101. The corresponding analog speech signal therefrom is bandlimited and converted into a sequence of pulse samples in filter and sampler circuit 113 of prediction analyzer 110. The filtering may be arranged to remove frequency components of the speech signal above 4.0 KHz and the sampling may be at an 8.0 KHz rate as is well known in the art. The timing of the samples is controlled by sample clock CL from clock generator 103. Each sample from circuit 113 is transformed into an amplitude representative digital code in analog-to-digital converter 115.
The sequence of speech samples is supplied to predictive parameter computer 119 which is operative, as is well known in the art, to partition the speech signals into 10 to 20 ms intervals and to generate a set of linear prediction coefficient signals ak, k=1,2, . . . , p representative of the predicted short time spectrum of the N>>p speech samples of each interval. The speech samples from A/D converter 115 are delayed in delay 117 to allow time for the formation of signals ak. The delayed samples are supplied to the input of prediction residual generator 118. The prediction residual generator, as is well known in the art, is responsive to the delayed speech samples and the prediction parameters ak to form a signal corresponding to the difference therebetween. The formation of the predictive parameters and the prediction residual signal for each frame shown in prediction analyzer 110 may be performed according to the arrangement disclosed in U.S. Pat. No. 3,740,476 issued to B. S. Atal June 19, 1973 and assigned to the same assignee or in other arrangements well known in the art.
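The operation of a prediction residual generator of this general kind can be sketched as follows. This is a minimal illustration and not the arrangement of the cited patent: the function name `prediction_residual`, the sign convention d[n] = s[n] - Σ ak·s[n-k], and the treatment of samples before the start of the sequence as zero are all assumptions made for the example.

```python
def prediction_residual(samples, a):
    """Return d[n] = s[n] - sum_{k=1..p} a[k] * s[n-k].

    `samples` is a list of speech samples s[0..N-1]; `a` holds the p
    prediction coefficients a_1..a_p (a[0] is a_1). Samples before the
    start of the list are assumed to be zero (an assumption of this
    sketch, not of the source).
    """
    p = len(a)
    residual = []
    for n, s_n in enumerate(samples):
        # Predict the current sample from up to p preceding samples.
        predicted = sum(a[k - 1] * samples[n - k]
                        for k in range(1, min(p, n) + 1))
        residual.append(s_n - predicted)
    return residual


# A decaying exponential is predicted exactly by a one-tap predictor,
# so only the first residual sample is non-zero.
d = prediction_residual([1.0, 0.5, 0.25, 0.125], [0.5])
```

When the predictor matches the signal well, the residual carries far less energy than the speech samples themselves, which is why it is the residual rather than the waveform that is treated as the excitation.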
While the predictive parameter signals ak form an efficient representation of the short time speech spectrum, the residual signal generally varies widely from interval to interval and exhibits a high bit rate that is unsuitable for many applications. In the pitch excited vocoder, only the peaks of the residual are transmitted as pitch pulse codes. The resulting quality, however, is generally poor. Waveform 701 of FIG. 7 illustrates a typical speech pattern over two time frames. Waveform 703 shows the predictive residual signal derived from the pattern of waveform 701 and the predictive parameters of the frames. As is readily seen, waveform 703 is relatively complex so that encoding pitch pulses corresponding to peaks therein does not provide an adequate approximation of the predictive residual. In accordance with the invention, excitation code processor 120 receives the residual signal dk and the prediction parameters ak of the frame and generates an interval excitation code which has a predetermined number of bit positions. The resulting excitation code shown in waveform 705 exhibits a relatively low bit rate that is constant. A replica of the speech pattern of waveform 701 constructed from the excitation code and the prediction parameters of the frames is shown in waveform 707. As seen by a comparison of waveforms 701 and 707, higher quality speech characteristic of adaptive predictive coding is obtained at much lower bit rates.
The prediction residual signal dk and the predictive parameter signals ak for each successive frame are applied from circuit 110 to excitation signal forming circuit 120 at the beginning of the succeeding frame. Circuit 120 is operative to produce a multielement frame excitation code EC having a predetermined number of bit positions for each frame. Each excitation code corresponds to a sequence of 1≦i≦I pulses representative of the excitation function of the frame. The amplitude βi and location mi of each pulse within the frame is determined in the excitation signal forming circuit so as to permit construction of a replica of the frame speech signal from the excitation signal and the predictive parameter signals of the frame. The βi and mi signals are encoded in coder 131 and multiplexed with the prediction parameter signals of the frame in multiplexer 135 to provide a digital signal corresponding to the frame speech pattern.
In excitation signal forming circuit 120, the predictive residual signal dk and the predictive parameter signals ak of a frame are supplied to filter 121 via gates 122 and 124, respectively. At the beginning of each frame, frame clock signal FC opens gates 122 and 124 whereby the dk signals are supplied to filter 121 and the ak signals are applied to filters 121 and 123. Filter 121 is adapted to modify signal dk so that the quantizing spectrum of the error signal is concentrated in the formant regions thereof. As disclosed in U.S. Pat. No. 4,133,976 issued to B. S. Atal et al, Jan. 9, 1979 and assigned to the same assignee, this filter arrangement is effective to mask the error in the high signal energy portions of the spectrum.
The transfer function of filter 121 is expressed in z transform notation as ##EQU1## where B(z) is controlled by the frame predictive parameters ak.
Predictive filter 123 receives the frame predictive parameter signals from computer 119 and an artificial excitation signal EC from excitation signal processor 127. Filter 123 has the transfer function of Equation 1. Filter 121 forms a weighted frame speech signal y responsive to the predictive residual dk while filter 123 generates a weighted artificial speech signal ŷ responsive to the excitation signal from signal processor 127. Signals y and ŷ are correlated in correlation processor 125 which generates a signal E corresponding to the weighted difference therebetween. Signal E is applied to signal processor 127 to adjust the excitation signal EC so that the differences between the weighted speech representative signal from filter 121 and the weighted artificial speech representative signal from filter 123 are reduced.
The excitation signal is a sequence of 1≦i≦I pulses. Each pulse has an amplitude βi and a location mi. Processor 127 is adapted to successively form the βi, mi signals which reduce the differences between the weighted frame speech representative signal from filter 121 and the weighted frame artificial speech representative signal from filter 123. The weighted frame speech representative signal may be expressed as: ##EQU2## and the weighted artificial speech representative signal of the frame may be expressed as ##EQU3## where hn is the impulse response of filter 121 or filter 123.
The excitation signal formed in circuit 120 is a coded signal having elements βi, mi, i=1,2, . . . , I. Each element represents a pulse in the time frame. βi is the amplitude of the pulse and mi is the location of the pulse in the frame. Correlation signal generator circuit 125 is operative to successively generate a correlation signal for each element. Each element may be located at time 1≦q≦Q in the time frame. Consequently, the correlation processor circuit forms Q possible candidates for element i in accordance with Equation 4: ##EQU4## Excitation signal generator 127 receives the Ciq signals from the correlation signal generator circuit, selects the Ciq signal having the maximum absolute value and forms the ith element of the coded signal ##EQU5## where q* is the location of the correlation signal having the maximum absolute value. The index i is incremented to i+1 and signal ŷn at the output of predictive filter 123 is modified. The process in accordance with Equations 4, 5 and 6 is repeated to form element βi+1, mi+1. After the formation of element βI, mI, the signal having elements β1 m1, β2 m2, . . . , βI mI is transferred to coder 131. As is well known in the art, coder 131 is operative to quantize the βi mi elements and to form a coded signal suitable for transmission to network 140.
Each of filters 121 and 123 in FIG. 1 may comprise a transversal filter of the type described in aforementioned U.S. Pat. No. 4,133,976. Each of processors 125 and 127 may comprise one of the processor arrangements well known in the art adapted to perform the processing required by Equations 4 and 6 such as the C.S.P., Inc. Macro Arithmetic Processor System 100 or other processor arrangements well known in the art. Processor 125 includes a read-only memory which permanently stores programmed instructions to control the Ciq signal formation in accordance with Equation 4 and processor 127 includes a read-only memory which permanently stores programmed instructions to select the βi, mi signal elements according to Equation 6 as is well known in the art. The program instructions in processor 125 are set forth in FORTRAN language form in Appendix A and the program instructions in processor 127 are listed in FORTRAN language form in Appendix B.
FIG. 3 depicts a flow chart showing the operation of processors 125 and 127 for each time frame. Referring to FIG. 3, the hk impulse response signals are generated in box 305 responsive to the frame predictive parameters for the transfer function of Equation 1. This occurs after receipt of the FC signal from clock 103 in FIG. 1 as per wait box 303. The element index i and the excitation pulse location index q are initially set to 1 in box 307. Upon receipt of signals yn and ŷn,i-1 from predictive filters 121 and 123, signal Ciq is formed as per box 309. The location index q is incremented in box 311 and the formation of the next location Ciq signal is initiated.
After the CiQ signal is formed for excitation signal element i in processor 125, processor 127 is activated. The q index in processor 127 is initially set to 1 in box 315 and the i index as well as the Ciq signals formed in processor 125 are transferred to processor 127. Signal Ciq*, which represents the Ciq signal having the maximum absolute value, and its location q* are set to zero in box 317. The absolute values of the Ciq signals are compared to signal Ciq* and the maximum of these absolute values is stored as signal Ciq* in the loop including boxes 319, 321, 323, and 325.
After the CiQ signal from processor 125 has been processed, box 327 is entered from box 325. The excitation code element location mi is set to q* and the magnitude of the excitation code element βi is generated in accordance with Equation 6. The βi mi element is output to predictive filter 123 as per box 328 and index i is incremented as per box 329. Upon formulation of the βI mI element of the frame, wait box 303 is reentered from decision box 331. Processors 125 and 127 are then placed in wait states until the FC frame clock pulse of the next frame.
The excitation code in processor 127 is also supplied to coder 131. The coder is operative to transform the excitation code from processor 127 into a form suitable for use in network 140. The prediction parameter signals ak for the frame are supplied to an input of multiplexer 135 via delay 133 as prediction signals a'k. The excitation coded signal ECS from coder 131 is applied to the other input of the multiplexer. The multiplexed excitation and predictive parameter codes for the frame are then sent to network 140.
Network 140 may be a communication system, the message store of a voice storage arrangement, or apparatus adapted to store a complete message or vocabulary of prescribed message units, e.g., words, phonemes, etc., for use in speech synthesizers. Whatever the message unit, the resulting sequence of frame codes from circuit 120 is forwarded via network 140 to speech synthesizer 150. The synthesizer, in turn, utilizes the frame excitation codes from circuit 120 as well as the frame predictive parameter codes to construct a replica of the speech pattern.
Demultiplexer 152 in synthesizer 150 separates the excitation code EC of a frame from the prediction parameters ak thereof. The excitation code, after being decoded into an excitation pulse sequence in decoder 153, is applied to the excitation input of speech synthesizer filter 154. The ak codes are supplied to the parameter inputs of filter 154. Filter 154 is operative in response to the excitation and predictive parameter signals to form a coded replica of the frame speech signal as is well known in the art. D/A converter 156 is adapted to transform the coded replica into an analog signal which is passed through low-pass filter 158 and transformed into a speech pattern by transducer 160.
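The synthesis step may be illustrated by the following sketch, in which the decoded pulses are placed in an otherwise empty frame and passed through an all-pole filter. The function name `synthesize_frame`, the zero-based pulse locations, the optional `history` argument carrying filter memory from the preceding frame, and the filter form s[n] = e[n] + Σ ak·s[n-k] are assumptions of this example rather than details taken from the description.

```python
def synthesize_frame(pulses, a, frame_length, history=None):
    """Rebuild one frame of speech samples from excitation pulses.

    `pulses` is a list of (location, amplitude) pairs with zero-based
    locations; `a` holds prediction coefficients a_1..a_p; `history`
    optionally holds the last output samples of the preceding frame,
    most recent last (hypothetical helper argument).
    """
    # Decode the excitation code into a sparse pulse sequence.
    excitation = [0.0] * frame_length
    for location, amplitude in pulses:
        excitation[location] += amplitude

    p = len(a)
    # Pad with zeros so that at least p past samples are always available.
    past = [0.0] * p + list(history or [])
    output = []
    for n in range(frame_length):
        # All-pole recursion: s[n] = e[n] + sum_k a_k * s[n-k].
        s_n = excitation[n] + sum(a[k - 1] * past[-k] for k in range(1, p + 1))
        output.append(s_n)
        past.append(s_n)
    return output


# A single unit pulse excites the filter's impulse response.
replica = synthesize_frame([(0, 1.0)], [0.5], 4)
```

Passing the tail of one frame's output as `history` for the next frame keeps the filter state continuous across frame boundaries, so a pulse near the end of a frame continues to ring into the following one.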
An alternative arrangement to perform the excitation code formation operations of circuit 120 may be based on the weighted mean squared error between signals yn and ŷn. This weighted mean squared error upon forming βi and mi for the ith excitation signal pulse is ##EQU6## where hn is the nth sample of the impulse response of H(z), mj is the location of the jth pulse in the excitation code signal, and βj is the magnitude of the jth pulse.
The pulse locations and amplitudes are generated sequentially. The ith element of the excitation is determined by minimizing Ei in Equation 7. Equation 7 may be rewritten as ##EQU7## so that the known excitation code elements preceding βi,mi appear only in the first term.
As is well known, the value of βi which minimizes Ei can be determined by differentiating Equation 8 with respect to βi and setting ##EQU8##
Consequently, the optimum value of βi is ##EQU9## are the autocorrelation coefficients of the predictive filter impulse response signal hk.
βi in Equation 10 is a function of the pulse location and is determined for each possible value thereof. The maximum of the |βi| values over the possible pulse locations is then selected. After βi and mi values are obtained, βi+1 mi+1 values are generated by solving Equation 10 in similar fashion. The first term of Equation 10, i.e., ##EQU10## corresponds to the speech representative signal of the frame at the output of predictive filter 121. The second term of Equation 10, i.e., ##EQU11## corresponds to the artificial speech representative signal of the frame at the output of predictive filter 123. βi is the amplitude of an excitation pulse at location mi which minimizes the difference between the first and second terms.
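The sequential pulse determination described above can be sketched in code. This is an illustrative reading only: since the equations are not reproduced in this text, the sketch assumes that the optimum amplitude at a candidate location is the correlation of the remaining error with the shifted impulse response divided by the energy of that shifted response within the frame. The name `multipulse_search` and the zero-based locations are likewise assumptions.

```python
def multipulse_search(y, h, num_pulses):
    """Greedy search for (location, amplitude) excitation pulses.

    `y` is the weighted speech representative signal of the frame,
    `h` the impulse response of the predictive filter, `num_pulses`
    the number I of pulses. Locations are zero-based here.
    """
    n_samples = len(y)
    # Difference between y and the artificial signal built so far.
    error = list(y)
    pulses = []
    for _ in range(num_pulses):
        best_q, best_beta = 0, 0.0
        for q in range(n_samples):
            # Part of the impulse response that fits inside the frame.
            span = min(len(h), n_samples - q)
            num = sum(error[q + k] * h[k] for k in range(span))
            den = sum(h[k] * h[k] for k in range(span))
            beta = num / den if den > 0.0 else 0.0
            # Keep the location giving the largest |beta|, as in the text.
            if abs(beta) > abs(best_beta):
                best_q, best_beta = q, beta
        pulses.append((best_q, best_beta))
        # Remove the new pulse's contribution before the next search.
        for k in range(min(len(h), n_samples - best_q)):
            error[best_q + k] -= best_beta * h[k]
    return pulses


# A target made from one pulse of amplitude 2 at location 3 is recovered.
found = multipulse_search([0.0, 0.0, 0.0, 2.0, 1.0, 0.0], [1.0, 0.5], 1)
```

Each pass subtracts the contribution of the pulse just chosen, so later pulses are fitted to whatever the earlier ones failed to explain. This greedy ordering avoids a joint search over all I locations at once.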
The data processing circuit depicted in FIG. 2 provides an alternative arrangement to excitation signal forming circuit 120 of FIG. 1. The circuit of FIG. 2 yields the excitation code for each frame of the speech pattern in response to the frame prediction residual signal dk and the frame prediction parameter signals ak in accordance with Equation 10 and may comprise the previously mentioned C.S.P., Inc. Macro Arithmetic Processor System 100 or other processor arrangements well known in the art.
Referring to FIG. 2, processor 210 receives the predictive parameter signals ak and the prediction residual signals dn of each successive frame of the speech pattern from circuit 110 via store 218. The processor is operative to form the excitation code signal elements β1 m1, β2 m2, . . . , βI mI under control of permanently stored instructions in predictive filter subroutine read-only memory 201 and excitation processing subroutine read-only memory 205. The predictive filter subroutine of ROM 201 is set forth in Appendix C and the excitation processing subroutine in ROM 205 is set forth in Appendix D.
Processor 210 comprises common bus 225, data memory 230, central processor 240, arithmetic processor 250, controller interface 220 and input-output interface 260. As is well known in the art, central processor 240 is adapted to control the sequence of operations of the other units of processor 210 responsive to coded instructions from controller 215. Arithmetic processor 250 is adapted to perform the arithmetic processing on coded signals from data memory 230 responsive to control signals from central processor 240. Data memory 230 stores signals as directed by central processor 240 and provides such signals to arithmetic processor 250 and input-output interface 260. Controller interface 220 provides a communication link for the program instructions in ROM 201 and ROM 205 to central processor 240 via controller 215, and input-output interface 260 permits the dk and ak signals to be supplied to data memory 230 and supplies output signals βi and mi from the data memory to coder 131 in FIG. 1.
The operation of the circuit of FIG. 2 is illustrated in the filter parameter processing flow chart of FIG. 4, the excitation code processing flow chart of FIG. 5, and the timing chart of FIG. 6. At the start of the speech signal, box 410 in FIG. 4 is entered via box 405 and the frame count r is set to the first frame by a single pulse ST from clock generator 103. FIG. 6 illustrates the operation of the circuit of FIGS. 1 and 2 for two successive frames. Between times t0 and t7 in the first frame, prediction analyzer 110 forms the speech pattern samples of frame r+2 as in waveform 605 under control of the sample clock pulses of waveform 601. Analyzer 110 generates the ak signals corresponding to frame r+1 between times t0 and t3 and forms predictive residual signal dk between times t3 and t6 as indicated in waveform 607. Signal FC (waveform 603) occurs between times t0 and t1. The signals dk from residual signal generator 118 previously stored in store 218 during the preceding frame are placed in data memory 230 via input-output interface 260 and common bus 225 under control of central processor 240. As indicated in operation box 415 of FIG. 4, these operations are responsive to frame clock signal FC. The frame prediction parameter signals ak from prediction parameter computer 119 previously placed in store 218 during the preceding frame are also inserted in memory 230 as per operation box 420. These operations occur between times t0 and t1 on FIG. 6.
After insertion of the frame dk and ak signals into memory 230, box 425 is entered and the predictive filter coefficients bk corresponding to the transfer function of Equation 1:
b_k = α^k a_k,  k=1,2, . . . ,p  (12)
are generated in arithmetic processor 250 and placed in data memory 230. p is typically 16 and α is typically 0.85 for a sampling rate of 8 KHz. The predictive filter impulse response signals hk ##EQU12## are then generated in arithmetic processor 250 and stored in data memory 230. When the hk impulse response signal is stored, box 435 is entered and the predictive filter autocorrelation signals of Equation 11 are generated and stored.
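The three filter parameter computations just described can be sketched as follows. Equation 12 is implemented directly. The impulse response recursion assumes an all-pole form H(z) = 1/(1 - Σ bk·z^-k), and the autocorrelation is taken as a plain sum of lagged products; both are assumptions of this example because Equations 1 and 11 are not reproduced in this text. The function names are hypothetical.

```python
def bandwidth_expand(a, alpha=0.85):
    """Equation 12: b_k = alpha**k * a_k for k = 1..p (a[0] is a_1)."""
    return [alpha ** k * a_k for k, a_k in enumerate(a, start=1)]


def impulse_response(b, length):
    """First `length` samples of h for the assumed all-pole filter
    H(z) = 1 / (1 - sum_k b_k z^-k), i.e. h[0] = 1 and
    h[n] = sum_k b_k * h[n-k] for n > 0."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, min(len(b), n) + 1):
            acc += b[k - 1] * h[n - k]
        h.append(acc)
    return h


def autocorrelation(h, max_lag):
    """Autocorrelation of h for lags 0..max_lag (assumed definition)."""
    return [sum(h[n] * h[n + lag] for n in range(len(h) - lag))
            for lag in range(max_lag + 1)]


# Example with a two-coefficient predictor and alpha = 0.5.
b = bandwidth_expand([1.0, 1.0], alpha=0.5)
h = impulse_response([0.5], 4)
r = autocorrelation([1.0, 0.5], 1)
```

Because α is less than 1, the factor α^k shrinks the higher order coefficients most, which makes the impulse response decay faster than that of the unweighted predictor and keeps the truncated response short.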
At time t2 in FIG. 6, controller 215 disconnects ROM 201 from interface 220 and connects excitation processing subroutine ROM 205 to the interface. The formation of the βi, mi excitation pulse codes shown in the flow chart of FIG. 5 is then initiated. Between times t2 and t4 in FIG. 6, the excitation pulse sequence is formed. Excitation pulse index i is initially set to 1 and pulse location index q is set to 1 in box 505. β1 is set to zero in box 510 and operation box 515 is entered to determine βiq = β11. β11 is the optimum excitation pulse at location q=1 of the frame. The absolute value of β11 is then compared to the previously stored β1 in decision box 520. Since β1 is initially zero, the mi code is set to q=1 and the βi code is set to β11 in box 525.
Location index q is then incremented in box 530 and box 515 is entered via decision box 535 to generate signal β12. The loop including boxes 515, 520, 525, 530 and 535 is iterated for all pulse location values 1≦q≦Q. After the Qth iteration, the first excitation pulse amplitude β1 = βiq* and its location in the frame m1 = q* are stored in memory 230. In this manner, the first of the I excitation pulses is determined. Referring to waveform 705 in FIG. 7, frame r occurs between times t0 and t1. The excitation code for the frame consists of 8 pulses. The first pulse of amplitude β1 and location m1 occurs at time tm1 in FIG. 7 as determined in the flow chart of FIG. 5 for index i=1.
Index i is incremented to the succeeding excitation pulse in box 545 and operation box 515 is entered via box 550 and box 510. Upon completion of each iteration of the loop between boxes 510 and 550, the excitation signal is modified to further reduce the signal of Equation 7. Upon completion of the second iteration, pulse β2 m2 (time tm2 in waveform 705) is formed. Excitation pulses β3 m3 (time tm3), β4 m4 (time tm4), β5 m5 (time tm5), β6 m6 (time tm6), β7 m7 (time tm7), and β8 m8 (time tm8), are then successively formed as index i is incremented.
After the Ith iteration (waveform 609 at t4), box 555 is entered from decision box 550 and the current frame excitation code β1 m1, β2 m2, . . . , βI mI is generated therein. The frame index is incremented in box 560 and the predictive filter operations of FIG. 4 for the next frame are started in box 415 at time t7 in FIG. 6. Upon the occurrence of the FC clock signal for the next frame at t7 in FIG. 6, the predictive parameter signals for frame r+3 are formed (waveform 605 between times t7 and t14), the ak and dk signals are generated for frame r+2 (waveform 607 between times t7 and t13), and the excitation code for frame r+1 is produced (waveform 609 between times t7 and t12).
The frame excitation code from the processor of FIG. 2 is supplied via input-output interface 260 to coder 131 in FIG. 1 as is well known in the art. Coder 131 is operative as previously mentioned to quantize and format the excitation code for application to network 140. The ak prediction parameter signals for the frame are applied to one input of multiplexer 135 through delay 133 so that the frame excitation code from coder 131 may be appropriately multiplexed therewith.
The invention has been described with reference to particular illustrative embodiments. It is apparent to those skilled in the art that various modifications may be made without departing from the scope and the spirit of the invention. For example, the embodiments described herein have utilized linear predictive parameters and a predictive residual. The linear predictive parameters may be replaced by formant parameters or other speech parameters well known in the art. The predictive filters are then arranged to be responsive to the speech parameters that are utilized and to the speech signal so that the excitation signal formed in circuit 120 of FIG. 1 is used in combination with the speech parameter signals to construct a replica of the speech pattern of the frame in accordance with the invention. The encoding arrangement of the invention may be extended to sequential patterns such as biological and geological patterns to obtain efficient representations thereof. ##SPC1##