United States Patent
Berkley et al.
Jan. 8, 1974
[54] SPEECH SUPPRESSION BY PREDICTIVE FILTERING
[75] Inventors: David Arthur Berkley, New York, N.Y.; Olga Mary Mracek Mitchell, Summit, N.J.
[73] Assignee: Bell Telephone Laboratories, Incorporated, Murray Hill, N.J.
[22] Filed: Dec. 3, 1971
[21] Appl. No.: 204,509
[52] U.S. Cl.: 179/1 HF, 179/1 P, 179/1 FS
[51] Int. Cl.: H04m 1/20, H04b 15/00
[58] Field of Search: 179/1 P, 1 VC, 1 HF, 179/1 F, 1 FS, 1 SA, 15.55 R, 100.2 K, 81 B, 170.8; 324/473-476
[56] References Cited
UNITED STATES PATENTS
3,177,489  4/1965  Saltzberg  325/476
3,631,520  12/1971  Atal  179/1 SA
3,133,990  5/1964  Seeley  179/1 P
3,644,674  2/1972  Mitchell  179/1 P
3,601,549  8/1971  Mitchell  179/1 FS
3,603,744  Krasin  179/81 B
Primary Examiner: William C. Cooper
Assistant Examiner: Jon Bradford Leaheey
Attorney: [illegible]
[57] ABSTRACT
Speech signal energy from an undesired source is suppressed by extracting from the undesired signal a delay parameter and a gain parameter. These parameters control a delay and gain network through which both the desired and undesired signals are routed. The delayed, amplified signal, when applied to a subtractor along with the undelayed signal, suppresses the undesired signal. The invention is applied to several hands-free telephony situations and to the suppression of one of two speakers in a room.
11 Claims, 10 Drawing Figures
[Drawing sheets: FIG. 1, hands-free telephone station with low pass filters, samplers, parameter extractor, parameter control, parameter delay and delay duration adjustment; FIG. 2, predictive filter with D/A converter; FIG. 3, predictor with delay network (k samples) and amplifier.]
[FIG. 5: spectral magnitude vs. frequency for the predictive filter of FIG. 3.]
[FIG. 6: delay parameter k for a typical voiced segment vs. time (msec).]
SPEECH SUPPRESSION BY PREDICTIVE FILTERING
FIELD OF THE INVENTION
This invention relates to speech signal processing, and in particular to reducing the energy content of that part of a composite speech signal attributable to an undesired source.
BACKGROUND OF THE INVENTION
In telephony and elsewhere, it often happens that speech from a source the listener wishes to hear is seriously impaired in intelligibility by speech from a second, undesired source. Numerous expedients to reduce the effects of the second source have been proposed. These involve relative enhancement of the desired speech signal, rendering the undesired signal relatively unintelligible, or reducing the energy of the undesired signal. Regardless of the approach, the result typically has been that the desired signal is more intelligible than it would be in the absence of the processing. The "hands-free" telephone well exemplifies this problem of conflicting speech sources, because its electroacoustic speaker constitutes a potential source of undesired signal at the microphone of the same station.
Accordingly, a general object of the invention is to reduce the energy from an undesired speech source in a composite signal containing desired speech.
Another object of the invention is to suppress an undesired speech signal in an electronic communications channel.
A specific object of the invention is to render a desired speech signal relatively more intelligible despite the presence of undesired speech energy.
A particular inventive object is to achieve the foregoing objects in the hands-free telephony situation.
Another specific object of the invention is to avoid voice switching functions and thus enable full-time duplex operation of a hands-free telephone channel.
Yet another inventive object is to distinguish one talker from another nearby talker and to suppress speech signals from one of them.
SUMMARY OF THE INVENTION
The invention is grounded in the general recognition that an unwanted speech signal can be rejected on the basis of its speech parameters.
A discussion of certain speech parameters is found in the patent application of B. S. Atal, Ser. No. 753,408, filed Aug. 19, 1968, now Pat. No. 3,631,520, and assigned to applicants' assignee. A predictive coding technique for reducing transmission bandwidth needs is therein disclosed by Atal in which an estimate of the present value of a speech sample is made based on a known corresponding past value. From these data, a difference or error signal is generated and transmitted to a remote receiving station along with certain predictor parameters. At the remote receiving station the entire signal is reconstituted from the error signal, using the predictor parameters.
It has been realized that the generic process as represented by the Atal disclosure can be rearranged so as to substantially eliminate a given undesired voice signal.
The basic concept contemplated by the present invention is to extract, from the undesired signal, a gain parameter and a delay parameter. These parameters control a delay and gain network through which both desired speech signals and the undesired signal are routed. The delay is approximately equal to the current duration of the pitch period of the undesired speech. The gain is calculated, in accordance with one of several possible formulas, so as to bring the delayed unwanted signal to the amplitude level of the present value of the unwanted signal. Alternatively, in a technically less complex embodiment, the gain is set equal to 1. In either case, the network output is then subtractively applied to the unwanted signal or to any composite signal containing the unwanted signal. The process may be carried out in analog or digital fashion.
Advantageously, however, the process is carried out by sampling techniques in which the signal is sampled at a rate of, for example, 6 kHz, which results in 30 to 60 samples per pitch period. The number of samples in a pitch period will vary in accordance with the pitch frequency.
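The relation between sampling rate and samples per pitch period can be checked with a short calculation. The sketch below is illustrative only: the function name and the 100, 150, and 200 Hz pitch frequencies are assumptions, chosen because at 6 kHz they span the 30-to-60 range quoted above.

```python
def samples_per_pitch_period(sampling_rate_hz: float, pitch_hz: float) -> int:
    """Number of samples in one pitch period, rounded to the nearest integer."""
    if sampling_rate_hz <= 0 or pitch_hz <= 0:
        raise ValueError("rates must be positive")
    return round(sampling_rate_hz / pitch_hz)


# Assumed pitch frequencies (hypothetical examples, not taken from the text).
for pitch in (100.0, 150.0, 200.0):
    print(pitch, samples_per_pitch_period(6000.0, pitch))
```

A lower-pitched voice therefore needs a longer delay, in samples, than a higher-pitched one at the same sampling rate.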
In one embodiment pursuant to the invention, speech from the loudspeaker of a hands-free telephone set impinging either directly or reverberatively on the set's microphone can be largely removed from the microphone output. The reverberant signal as well as the direct signal is suppressed because the unwanted speech parameters do not vary rapidly during voicing.
In another embodiment pursuant to the invention, speech from, for example, two talkers in the same room is detected by a multiplicity of microphones, and the speech of one talker is suppressed using speech parameters determined by combining the outputs of the microphones.
It will be apparent that the rearrangement of the Atal process constitutes in one aspect a filter; and more specifically, a comb filter with minima at the pitch frequency (and harmonics thereof) of the undesired speech. This distinguishes the predictive filter of the present invention from a conventional echo canceler which merely replicates a reverberant signal and subtractively applies the replica to the composite signal.
The invention and its further objects, features, and advantages will be readily discerned in detail from a reading of the description to follow of illustrative embodiments.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is a communications network schematic block diagram containing a hands-free telephone and an inventive embodiment;
FIG. 2 is a schematic block diagram of the inventive predictive filter;
FIG. 3 is a schematic block diagram further delineating the inventive predictor;
FIGS. 4-6 are graphs depicting various characteristics of the predictor;
FIGS. 7 and 8 are two further embodiments of the invention in a communications network containing hands-free telephones; and
FIGS. 9 and 10 are schematic diagrams of the invention as applied to suppression of speech from talkers in a room.
DETAILED DESCRIPTION OF INVENTIVE EMBODIMENTS
Hands-free Telephone Situations
In the first inventive embodiment, a hands-free telephone loudspeaker 1 and microphone 2 present in a reverberative enclosure 3 are shown in FIG. 1 connected to the speech processor of the present invention. Usually, the desired speech signal input to microphone 2 is from source 4, the near-end talker, whose signal, denoted a, travels mainly the direct path 5 and also reverberative paths not shown. Loudspeaker 1, which broadcasts the far-end talker signal, is a source of undesired input to microphone 2 either via the direct path denoted 6 or reverberative paths illustrated by path 7. The far-end talker direct path speech signal is denoted c.
The speech processing network 8 in FIG. 1 consists of what will be called a predictive filter 9 connected in the microphone 2 output circuit. As seen in FIGS. 2 and 3, filter 9 consists of two parallel legs. The first leg is a predictor 11, which may be a network consisting of a delay network 12 and an amplifier 13. The second leg is a direct shunt path. Both legs are connected to a subtractor 10. The predictor 11 is controlled, in a manner to be described, by a parameter extractor 14 connected in the loudspeaker 1 circuit.
Pursuant to one embodiment, the invention is carried out digitally. A low pass filter 15, advantageously 3 kHz, and a 6 kHz sampler 16 are serially connected in the output circuit of microphone 2. Similarly, a low pass filter 17 and a sampler 18 are in shunt relation to the loudspeaker 1 input circuit and serially connected to parameter extractor 14.
A waveform representing the far-end talker signal c is illustrated in FIG. 4. Because a speech signal is redundant (i.e., the signal changes little in shape and length of pitch period from one pitch period to the next), the present form or value of signal c can be estimated by a linear prediction based on a past value of signal c.
The signal c of FIG. 4 is shown made up of speech in consecutive pitch periods $l_1$, $l_2$, etc. Inherently, the speech signals in adjacent pitch periods of signal c are of unequal amplitude. Thus, a gain denoted b can be calculated (in a manner to be described) that, when applied to the sampled signal of the pitch period $l_1$, will cause the latter to approximate the sampled signal in the next pitch period $l_2$. If then the amplified signal of period $l_1$ is subtractively combined with the signal value of period $l_2$, the result is the substantial filtering out of the signal c. In like manner, if a composite signal a + c containing signal c is amplified during period $l_1$ and subtractively combined with the composite signal a + c during period $l_2$, the same result obtains.
Thus, in mathematical terms, $W_n$ (the amplitude of sample n reaching subtractor 10 via the direct path in predictive filter 9) is subtractively combined with $W_{n-k}$ (the amplitude of the delayed sample), where k is the number of samples in a pitch period. Advantageously, the time window over which the parameters are evaluated is of the order of the pitch period to ensure that sufficient energy is present. A time window of 30 samples at a sampling rate of 6 kHz will include between one-half and all the samples in a given pitch period.
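The subtraction of a delayed, amplified sample from the present sample can be sketched in a few lines. This is a minimal illustration, not the circuit of FIG. 3: the function name, the zero-valued history before the first sample, and the 4-sample test waveform are all assumptions made for the example.

```python
def predictive_filter(samples, k, b):
    """Return y[n] = w[n] - b * w[n - k]; samples before n = 0 are taken as zero."""
    if k < 0:
        raise ValueError("delay must be non-negative")
    out = []
    for n, w in enumerate(samples):
        delayed = samples[n - k] if n >= k else 0.0
        out.append(w - b * delayed)
    return out


# Hypothetical test signal: a waveform that repeats exactly every 4 samples.
period = [1.0, 3.0, -2.0, 0.5]
signal = period * 3
residual = predictive_filter(signal, k=4, b=1.0)
print(residual)
```

With a perfectly periodic input, a delay equal to the period, and unity gain, everything after the first period cancels. Real voiced speech is only approximately periodic, so the residual would be reduced rather than eliminated.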
Since speech is only quasi-stationary during voicing, the gain parameter b and delay parameter k have to be periodically calculated. This is accomplished in the digital parameter extractor 14 pursuant to the teaching of the aforementioned Atal patent application Ser. No. 753,408.
As taught therein, input speech samples from sampler 18 are stored as frames of signals. The store content is then fed to an arithmetic unit which is part of parameter extractor 14, wherein for 30 samples, computational values of correlation $X_j$ are computed as follows:
$$X_j = \frac{\sum_{n=1}^{N} W_n W_{n-j}}{\left(\sum_{n=1}^{N} W_n^2\right)^{1/2} \left(\sum_{n=1}^{N} W_{n-j}^2\right)^{1/2}}$$
where N can advantageously be in the range 30-60 samples.
The computed values of $X_j$ are then inspected in a peak locating network, also part of extractor 14, to determine the largest value of $X_j$. The value of j is found such that $X_j$ is the maximum of all values of $X_j$. This particular value of j is the delay parameter, k, which is supplied to predictive filter 9 as one parameter. It is seen that k is a variable delay and that the maximum value of $X_j$ is $X_k$. The delay parameter k for a typical voiced segment is shown in FIG. 6.
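The peak search for the delay parameter can be sketched as follows. The correlation formula did not survive in this copy of the text, so the code assumes a normalized correlation, sum(w[n]*w[n-j]) divided by the square root of the product of the two window energies; that form, the function names, the lag range, and the synthetic test signal are all assumptions for illustration.

```python
import math


def normalized_correlation(w, start, lag, count):
    """X_j under the assumed normalized form, over `count` samples from `start`."""
    idx = range(start, start + count)
    num = sum(w[n] * w[n - lag] for n in idx)
    energy_now = sum(w[n] ** 2 for n in idx)
    energy_past = sum(w[n - lag] ** 2 for n in idx)
    denom = math.sqrt(energy_now * energy_past)
    return num / denom if denom > 0.0 else 0.0


def find_delay(w, start, count, lag_min, lag_max):
    """Return (k, X_k): the lag in [lag_min, lag_max] with the largest X_j."""
    if start - lag_max < 0 or start + count > len(w):
        raise ValueError("window or lag range falls outside the stored samples")
    best_lag, best_x = lag_min, -2.0  # correlations never fall below -1
    for lag in range(lag_min, lag_max + 1):
        x = normalized_correlation(w, start, lag, count)
        if x > best_x:
            best_lag, best_x = lag, x
    return best_lag, best_x


# Hypothetical voiced-like test signal: 40-sample period (150 Hz at 6 kHz).
signal = [math.sin(2 * math.pi * n / 40) + 0.5 * math.sin(4 * math.pi * n / 40)
          for n in range(200)]
k, x_k = find_delay(signal, start=100, count=30, lag_min=20, lag_max=60)
print(k, round(x_k, 3))
```

Restricting the search to lags between 20 and 60 samples is a choice made for this example; it keeps the window of 30 samples and the 30-to-60 samples-per-period range mentioned in the text in view without claiming to reproduce the extractor's actual search range.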
The gain parameter b is calculated by computing circuitry, also in parameter extractor 14, that solves:
$$b = \frac{\sum_{n=1}^{N} W_n W_{n-k}}{\sum_{n=1}^{N} W_{n-k}^2}$$
The gain parameter b likewise is supplied to predictive filter 9.
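A gain of this kind can be sketched as a least-squares fit of the present window to the window one delay earlier. The gain formula is not legible in this copy, so the ratio sum(w[n]*w[n-k]) / sum(w[n-k]^2) used below is an assumption (it is the standard least-squares solution and agrees with the fragments surviving in claim 6); the function name and test signal are likewise hypothetical.

```python
def gain_parameter(w, start, k, count):
    """Assumed least-squares gain: sum(w[n]*w[n-k]) / sum(w[n-k]**2)."""
    if start - k < 0 or start + count > len(w):
        raise ValueError("window falls outside the stored samples")
    idx = range(start, start + count)
    num = sum(w[n] * w[n - k] for n in idx)
    den = sum(w[n - k] ** 2 for n in idx)
    return num / den if den > 0.0 else 0.0


# Hypothetical signal: a 4-sample shape whose amplitude grows 20% per period.
base = [1.0, 3.0, -2.0, 0.5]
signal = [base[n % 4] * 1.2 ** (n // 4) for n in range(20)]
b = gain_parameter(signal, start=8, k=4, count=8)
print(round(b, 6))
```

When each period is a scaled copy of the previous one, the fitted gain recovers the scale factor, which is what lets the delayed, amplified signal match the present one before subtraction.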
The described calculation of delay parameter k and gain parameter b is but one of several systems by which, from an analysis of the speech energy content in adjacent or substantially adjacent signal segments, parameters may be calculated that when applied to a past signal segment will render the latter closely similar to the shape of the present signal segment.
Thus, incoming speech to loudspeaker 1 is continuously analyzed to extract therefrom an optimum delay parameter and a gain factor. These parameters are periodically updated, for example, every 5 ms. When no incoming signal to loudspeaker 1 is present, the delay and gain are zero. With incoming signal, the calculated present signal value output of predictor 11 is subtracted from the undelayed, unamplified signal sample representing signals a + c.
Reconversion to analog form of the signal in the microphone 2 output circuit is achieved in D/A converter 10A.
The filter depicted in FIG. 3 and described above has a transfer function in Z transform notation:
$$H(Z) = 1 - bZ^{-k}$$
The magnitude of the frequency response of a typical embodiment of filter 9 using predictor 11 is shown in FIG. 5, where T is the sampling period.
The frequency response for gain parameter b < 1 is shown by the solid curves and for gain parameter b ≈ 1 by the broken line. Since speech is dynamic during voicing, the parameters b and k have to be optimized as stated above, and readjusted periodically as, for example, every 5 ms.
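The comb-like shape of the response can be checked numerically. The sketch below evaluates the magnitude of a filter of the form 1 - b*z^(-k) on the unit circle, which is the form implied by subtracting a sample delayed by k and scaled by b; the function name and the choice of k = 40 at 6 kHz (a 150 Hz pitch) are assumptions for the example.

```python
import cmath
import math


def response_magnitude(freq_hz, b, k, sampling_rate_hz):
    """|H| for H(z) = 1 - b * z**(-k), evaluated at z = exp(j*2*pi*f*T)."""
    T = 1.0 / sampling_rate_hz  # sampling period
    z = cmath.exp(2j * math.pi * freq_hz * T)
    return abs(1.0 - b * z ** (-k))


# Hypothetical settings: k = 40 samples at 6 kHz, i.e. a 150 Hz pitch.
for f in (75.0, 150.0, 225.0, 300.0):
    print(f, round(response_magnitude(f, b=1.0, k=40, sampling_rate_hz=6000.0), 6))
```

The minima fall at multiples of 1/(kT), that is, at the pitch frequency and its harmonics, with maxima midway between them. A gain below unity makes the minima shallower, since the magnitude there is 1 - b rather than zero.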
Filtering of the input speech by the calculated parameters results in suppression of up to 30 dB during voiced segments of the undesired signal c, and an average suppression of about 14 dB of the undesired signal c.
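Suppression figures of this kind can be related to sample data as an energy ratio in decibels. The text does not state how the quoted figures were measured, so the definition below (ten times the base-10 logarithm of input energy over output energy for one segment) is an assumption, as are the function name and the sample values.

```python
import math


def suppression_db(before, after):
    """10*log10 of the energy ratio between unfiltered and filtered samples."""
    e_before = sum(x * x for x in before)
    e_after = sum(y * y for y in after)
    if e_before <= 0.0 or e_after <= 0.0:
        raise ValueError("both segments need nonzero energy")
    return 10.0 * math.log10(e_before / e_after)


# Hypothetical segment whose amplitude is reduced to one fifth by filtering.
unfiltered = [1.0, -2.0, 3.0, -1.0]
filtered = [0.2 * x for x in unfiltered]
print(round(suppression_db(unfiltered, filtered), 2))
```

Under this definition, a suppression near 14 dB corresponds to the residual amplitude being roughly one fifth of the original.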
The parameters b and k calculated do not vary smoothly with time. The optimum delay occasionally doubles during voiced segments. Also, during unvoiced segments, the optimum delay varies rapidly over a wide range while the correlation remains relatively low. However, the gains calculated are not negligible during these unvoiced portions. Desired speech a, uncorrelated with the undesired signal c which is to be rejected, is degraded when passed through a filter with these rapidly varying filter parameters, while under such conditions no additional suppression of the unwanted source is accomplished.
To avoid this difficulty, logic is introduced, pursuant to one facet of the invention, to prevent undesirable variation of the filter parameters b and k. It was determined that not much suppression was obtained when the correlation X was less than 0.85. Consequently, parameter control circuit 19 (FIG. 1) sets gain b equal to zero for X < 0.85. This threshold value of X is a compromise between one as great as possible and one low enough so that all of the voiced segments of speech are suppressed. The resulting suppression during voicing is unchanged while degradation of a second speech is reduced. FIG. 6 shows the variation of delay parameter k during a typical voiced segment.
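The gating rule amounts to a one-line comparison. The sketch below assumes the comparison is "less than" (the operator is not legible in this copy, but the preceding sentence speaks of correlation less than 0.85); the function and constant names are hypothetical.

```python
CORRELATION_THRESHOLD = 0.85  # value quoted in the text


def control_gain(b, x_k, threshold=CORRELATION_THRESHOLD):
    """Pass b through when the peak correlation reaches the threshold, else 0."""
    return b if x_k >= threshold else 0.0


# Hypothetical (gain, peak correlation) pairs for three successive frames.
frames = [(0.9, 0.95), (1.1, 0.60), (0.8, 0.85)]
print([control_gain(b, x) for b, x in frames])
```

Setting the gain to zero leaves only the direct path through the filter, so weakly correlated (typically unvoiced) frames pass unaltered instead of being filtered with erratic parameters.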
Since the parameters b and k vary relatively slowly during voiced segments, the predictive filter 9 will be effective in removing part of the reverberant signal as well as the direct sound. Specifically, that part of the reverberant signal that has parameters not greatly different from the filter parameters will be reduced in amplitude.
The foregoing discussion of the invention as applied to hands-free telephony has assumed no separation between loudspeaker 1 and microphone 2. In practice, however, a significant transit time for the signal c to travel path 6 to microphone 2 is required. It is therefore necessary to compensate in speech processing network 8 for the loudspeaker-microphone transit time.
This is achieved by parameter delay circuit 19A, which is serially connected between the output of parameter control 19 and predictive filter 9. Parameter delay circuit 19A advantageously is provided with a delay duration adjustment circuit 19B with which the delay duration may be set to correspond to the transit time which characterizes each given hands-free telephone.
A combination of a predictive filter with a center-clipping echo suppressor of the type taught in D. A. Berkley-O. M. M. Mitchell-J. R. Pierce U.S. Pat. No. 3,699,271, which is hereby incorporated by reference, is shown in FIGS. 7 and 8. This combination is a possible replacement for voice switching presently used for echo and feedback suppression.
FIG. 7 shows a network denoted 50 for eliminating the echo of the far-end talker in a 4-wire hands-free telephone. Like numerals denote items which correspond to counterparts in FIGS. 1-3. The far-end echo picked up by the microphone 2 from loudspeaker 1 is first reduced in amplitude during voiced segments by a speech processor 8 in the manner described previously. Gain and delay parameters b and k of the far-end speech are measured on the received loudspeaker signal, and the far-end echo component of the microphone signal is reduced by filtering. The remaining far-end signal at the output of the speech processor 8 is then removed by the center-clipping echo suppressor.
As taught in D. A. Berkley et al. U.S. Pat. No. 3,699,271, the received signal is used to set the clipping levels by means of clipping control 22 so as just to remove the echo. The output of D/A converter 10A is fed to filter bank 40, which comprises plural contiguous band filters in the voice frequency range. In center clipper 41 the signal in each subband from filter bank 40 is center clipped at a level determined by clipping control 22, which measures in effect the energy level in the received signal within each of the subbands. The output of clipper 41 is filtered in bank 42, which is similar to bank 40.
In this embodiment, the clipping control 22 is advantageously controlled also by the parameter extractor 14. Since the echo is reduced by the predictive filter 9 during voicing, the clipping levels can be reduced by substantially the same amount during voicing. Consequently in FIG. 7, a control signal is shown (dashed line) between the parameter extractor 14 and the clipping control 22, which causes an attenuation of the input to clipping level control 22 that is equal to the suppression achieved by speech processing network 8. It will be recognized that optimum performance of clipping level control 22 will be realized by inserting a delay in its input path to compensate for the already mentioned signal transit time between loudspeaker 1 and microphone 2. With the clipping levels thus reduced during voiced segments, there will be less mutilation of the near-end speech by the center-clipping process.
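The effect of lowering the clipping level can be illustrated for a single band. The clipping law of the referenced U.S. Pat. No. 3,699,271 is not reproduced in this text, so the sketch assumes one common form of center clipper (samples inside the level are zeroed, the rest are shifted toward zero by the level) and assumes the level is an amplitude that is lowered by the suppression expressed in decibels. All names and values are hypothetical, and the per-subband filter banks are not modeled.

```python
def center_clip(samples, level):
    """Zero samples within +/-level; shift the rest toward zero by level."""
    out = []
    for x in samples:
        if x > level:
            out.append(x - level)
        elif x < -level:
            out.append(x + level)
        else:
            out.append(0.0)
    return out


def reduced_level(level, suppression_db):
    """Lower an amplitude clipping level by suppression_db decibels."""
    return level * 10.0 ** (-suppression_db / 20.0)


# Hypothetical band signal and clipping level.
band = [0.5, -0.2, 1.5, -2.0]
print(center_clip(band, 1.0))
print(center_clip(band, reduced_level(1.0, 20.0)))
```

With the lower level, small samples that would otherwise be zeroed survive, which is the sense in which near-end speech suffers less mutilation once the predictive filter has already removed part of the echo.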
FIG. 8 shows a circuit for eliminating both the far-end echo (echo of far-end talker caused by acoustic coupling through room acoustics) and near-end echo (echo of near-end talker caused by imperfect hybrid junction) in a 2-wire hands-free telephone. The far-end echo is eliminated by network 50 as described above for FIG. 7. The near-end echo is eliminated by a similar circuit denoted 51 introduced on the receive side of the local 4-wire network as shown.
An alternative method of adjusting the clipping level control by the parameter extractor 14 via the parameter delay 19A is shown in circuit 51. A second predictive filter designated 9a is used in circuit 51 to attenuate the clipping level control signal during voiced segments. Thus the clipping levels follow the signal at the input to the narrow band center clipper, i.e., at the output of the predictive filter 9a.
Suppression of One of Two Room Speakers
A further embodiment allows the suppression of the speech signal from one of two talkers in a room. FIG. 9 shows the desired speech source 23 and an undesired source 24, both of whose speech signals form the input to microphones 25 and 26. The undesired source 24 is positioned so that the time delays for direct sound transmission to microphones 25 and 26 are equal. In the output of microphone 25 is predictive filter 9 [...] as microphone 26, enter the parameter extractor 30, wherein an arithmetic unit within the extractor calculates the computational values $X_j$ and $Y_j$. A peak picking network within the extractor then selects the peaks from $X_j$ and $Y_j$, and a comparator finds the largest value peak which occurs in both sequences $X_j$ and $Y_j$ for the same value of j. This value of j is the delay parameter k for the undesired speech supplied to the predictive filter 9.
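The peak picking and comparison step can be sketched on two already-computed sequences. The formulas for the two sequences are not legible in this copy, so the code takes them as given lists indexed by lag j. Two further assumptions are made: a peak is a value strictly greater than both neighbours, and the "largest" common peak is ranked by the smaller of its two values. The function names and sample data are hypothetical.

```python
def local_peaks(values):
    """Indices j where values[j] is strictly greater than both neighbours."""
    return {j for j in range(1, len(values) - 1)
            if values[j - 1] < values[j] > values[j + 1]}


def common_peak_delay(x, y):
    """Lag that is a peak in both sequences, ranked by the smaller of the two
    peak values; None when the sequences share no peak."""
    shared = local_peaks(x) & local_peaks(y)
    if not shared:
        return None
    return max(shared, key=lambda j: min(x[j], y[j]))


# Hypothetical correlation sequences from two microphones, indexed by lag j.
x_seq = [0.1, 0.3, 0.2, 0.9, 0.4, 0.6, 0.5]
y_seq = [0.2, 0.1, 0.3, 0.2, 0.1, 0.8, 0.3]
print(common_peak_delay(x_seq, y_seq))
```

In this example the largest peak of the first sequence (at j = 3) is ignored because the second sequence has no peak there; only a lag that peaks in both sequences is accepted as the delay parameter.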
An alternative method of extracting the parameters is shown in FIG. 10. Two additional microphones 27 and 28 are positioned so that time delays from desired speaker 24 for direct sound transmission to microphones 27 and 28 are equal to the time delays to microphones 25 and 26. The outputs of all microphones 25-28 are processed by a non-linear processor 31 as described in O. M. M. Mitchell-C. A. Ross-R. L. Wallace, Jr. U.S. Pat. No. 3,644,671, which is hereby incorporated by reference. The output of processor 31 contains the undesired signal and an attenuated and disturbed component of the desired signal. (The outputs may alternatively be added to merely attenuate the desired signal.) The output of the non-linear processor 31 enters the speech processing network 8. The output of microphone 25 is processed by speech processing network 8, which filters out the undesired talker 24 in the manner already described. The presence in the output of the non-linear processor 31 of a small amount of the desired talker does not significantly affect the delay parameter k but will cause a small error in the evaluation of X and b.
It is to be understood that the embodiments described herein are merely illustrative of the principles of the invention. Various modifications may be made thereto by persons skilled in the art without departing from the spirit and scope of the invention.
What is claimed is:
1. Speech processing apparatus for suppressing voiced segments of an undesired speech signal while leaving a desired speech signal intelligible, comprising:
means for deriving an electronic waveform representing the undesired speech signal;
means for deriving an electronic waveform representing a composite signal containing a reverberant version of the undesired speech signal and the desired signal;
means for deriving from the waveform of said undesired speech signal a delay parameter determined from the signal values during an interval embracing a substantial portion of a pitch period of said undesired speech signal;
means for applying said composite speech signal waveform to a summer over a first path;
means for delaying in a second path said composite speech signal waveform by an amount of said delay parameter; and
means for subtractively applying to said summer the delayed said composite speech waveform.
2. Apparatus pursuant to claim 1, further comprising means responsive to the absence of voiced segments of said undesired speech signal for interrupting said second path.
3. Apparatus in accordance with claim 1 further comprising means for deriving from the waveform of said undesired speech signal a gain parameter specifying the amount by which the amplitudes of corresponding values of said undesired speech signal in a past said interval must be respectively adjusted so as to produce a substantial duplicate of the undesired said speech signal of a present said interval; and which further comprises means controlled by said gain parameter for amplifying said delayed composite speech waveform prior to its being subtractively applied to said summer.
4. A communications network comprising:
a hands-free telephone station including a direct acoustic coupling path between the station loudspeaker and microphone, a second remote telephone station, and transmission means interconnecting said stations;
means for deriving from the incoming signal waveform to said loudspeaker from said second station a delay parameter representing the duration of an interval embracing a substantial portion of the present pitch period of speech from the remote station; and a gain parameter specifying the amount by which the waveform in a past said interval must be changed in amplitude to substantially correspond to the undesired speech waveform of the present interval;
means for applying the composite signal, consisting of the desired near-end talker signal and the acoustically coupled far-end talker signal from said loudspeaker, in said microphone output to a summer over a first path;
means disposed in a second path for delaying said composite signal by the amount of said delay parameter and for amplifying the delayed composite signal an amount determined by said gain parameter; and
means for subtractively applying the delayed, amplified composite signal to said summer.
5. A communications network pursuant to claim 4 wherein said deriving means comprises:
a signal sampler connected to the circuit of said loudspeaker and operating at a set sampling rate; and
means for computing values of a term $X_j$ in accordance with the relationship
$$X_j = \frac{\sum_{n=1}^{N} W_n W_{n-j}}{\left(\sum_{n=1}^{N} W_n^2\right)^{1/2} \left(\sum_{n=1}^{N} W_{n-j}^2\right)^{1/2}}$$
where $W_n$ is the amplitude of a sample n reaching said sampler, and means for finding that value of j such that $X_j$ is the maximum of all values of $X_j$, the found value of j constituting said delay parameter.
6. A communications network pursuant to claim 5 wherein said gain parameter deriving means comprises means for computing the value b in accordance with the relationship
$$b = \frac{\sum_{n=1}^{N} W_n W_{n-k}}{\sum_{n=1}^{N} W_{n-k}^2}$$
where $W_n$ is the amplitude of a sample n reaching said sampler, and k is a delay parameter for a voiced segment.
7. A communications network pursuant to claim 6, further comprising means for rendering said gain parameter equal to zero in the absence of voiced segments of the signal in said loudspeaker path from said remote station.
8. A communications network pursuant to claim 7, further comprising means for setting said gain parameter equal to zero in response to values of $X_k$ corresponding to the maximum computed values of $X_j$ which are less than a critical predetermined value.
9. A communications network pursuant to claim 8, further comprising means for adjustably delaying arrival of said delay and gain parameters at said second path by an amount that compensates for the transit time delay over said direct acoustic coupling path of speech from said remote station.
10. A communications network pursuant to claim 4, further comprising:
filter bank means connected to the output of said summer and comprising plural contiguous subbands;
means for producing from said remote station speech signal control signals representative of the incoming speech energy level in each said subband;
means for center-clipping the output of each said filter bank subband a varying amount in response to the concurrent said energy level value; and
means connecting said control signal producing means and said deriving means responsive to voiced portions of signal from said remote station for reducing all said clipping levels.
11. Speech processing apparatus for suppressing speech from one of two talkers in a room comprising:
first and second microphones located equidistant from the first, desired said talker but at unequal distances from the second, undesired said talker;
means for deriving from the two said microphone outputs a first parameter $X_j$ in accordance with the relationship
[equation not legible in source]
and a second parameter $Y_j$, respectively, calculated in accordance with the relationship
[equation not legible in source]
where $W^{(1)}$ is the speech signal received by said first microphone and $W^{(2)}$ the speech signal received by said second microphone;
means for selecting the largest value peak from the composite peak values of said parameters $X_j$ and $Y_j$ for the same value of the term j;
means for applying the desired and the undesired said signals from one of said microphones to a summer directly over a first path and alternately over a second path through a network including delay means; and
means for adjusting said delay means as a function of the value of the term j.