Noise Reduction in Voice Communications
Technical Field of the Invention
The present invention relates to noise reduction in voice communications, in particular to noise reduction in reproduction of voices captured as part of a voice communication.
Background to the Invention
In voice communications, an acoustic signal captured at a first device is transmitted to a second device and reproduced. Typically, the second device is also operable to capture an acoustic signal and transmit this to the first device for reproduction. For convenience, the acoustic signal is usually converted to or encoded in another form for transmission. The captured acoustic signals generally comprise a speaker’s voice and background noise. Furthermore, in transmission of the signal there may be significant channel noise introduced to the signal. If the overall noise level is low, this will not be a significant issue. If the overall noise level is high, whether resulting from background noise, channel noise or both, this can have a significant impact on the intelligibility and/or recognisability of the captured voice reproduced at the other device.
This problem can be addressed by amplifying or filtering the captured voice signals either on capture or on reproduction or by applying similar techniques to the converted form of the signal for transmission. Such simple techniques typically only provide very limited success. It is also possible to address this problem using noise cancellation technology. This requires the provision of noise cancellation microphones to capture the background noise independently of the voice allowing this background noise to subsequently be cancelled from the captured signal either by emitting an opposing acoustic signal or by deleting said noise from the captured signal including the voice. This technique relies upon the provision of additional hardware and on there being suitable places to mount said additional hardware. Furthermore, whilst this system may have impact on reducing background noise, it will not have an impact on channel noise.
It is therefore an object of the present invention to provide a method and system for at least partially overcoming or alleviating the above problems.
Summary of the Invention
According to a first aspect of the present invention there is provided a method of noise reduction in voice communications, the method comprising the steps of: comparing an initial acoustic signal including a voice to a stored model of the voice; identifying elements of the initial acoustic signal corresponding to words or phonemes uttered by the voice; parsing the identified elements into an ordered data stream of said words or phonemes; retrieving data from the stored model of the voice corresponding to the words or phonemes of the ordered data stream; and utilising the retrieved data to generate a secondary acoustic signal corresponding to the parsed words or phonemes.
Identifying the voiced words or phonemes in this manner allows the subsequent reconstruction of a secondary acoustic signal corresponding to the voiced words or phonemes without or with reduced noise. Since the method concentrates on identifying elements within the voice of interest, it can perform more effectively than simple filtering or amplification techniques applied to the initial acoustic signal as a whole. This method can further be applied without the provision of additional microphones to cancel background noise.
The above method may be applied in systems wherein the initial signal is captured by a first voice communication device and the secondary acoustic signal is reproduced by a second voice communication device. In such instances, the method may be applied by the first device or second device as desired or as appropriate. The transmission may take place using any suitable communication networks including but not limited to: public telephone systems, either cellular or fixed line as desired or required, internet connections, Wi-Fi (Registered Trade Mark) networks or other data networks. For transmission the initial or secondary acoustic signal may be converted or encoded in any suitable manner according to the standards of the communication network.
The method may include the step of capturing the initial acoustic signal using a suitable microphone or a device comprising a suitable microphone. The method may include the step of outputting the secondary signal using a suitable loudspeaker or a device comprising a suitable loudspeaker.
The or each voice communication device may be a fixed line or cellular telephone; desktop, laptop or tablet computer; audio or audiovisual recording device or the like.
The method may include the step of identifying the voice. The identification can be achieved by direct consideration of the initial acoustic signal. This consideration may involve comparing the captured acoustic signal to one or more stored voice models. Preferably, where possible, a specific speech model is stored for each speaker. Using individual models for each speaker in this way can significantly increase the effectiveness of the method. Additionally or alternatively, the identification may be made by identifying the voice communication device or a physical or network location of the voice communication device used to capture the acoustic signal. For example, a telephone handset may be identified by a phone number, SIM or handset IMEI.
The method may be applied on all possible occasions. Alternatively, the method may only be applied in response to a user request, or when the noise exceeds a particular threshold. In the last case, the method may include the step of measuring the background noise and/or channel noise and comparing it to a predetermined threshold.
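By way of illustration only, the threshold test described above might be sketched as follows in Python. The function names, the use of an RMS level in dB relative to full scale, and the -40 dB default are assumptions made for this sketch; the description does not specify how the noise is measured or what the predetermined threshold is.

```python
import math


def rms_db(samples):
    """RMS level of a block of samples in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def should_apply_noise_reduction(noise_samples, threshold_db=-40.0,
                                 user_requested=False):
    """Apply the method on user request, or when measured noise exceeds the threshold.

    `noise_samples` is assumed to be a stretch of signal containing only
    background and/or channel noise (for example, a pause between words).
    """
    return user_requested or rms_db(noise_samples) > threshold_db
```

In such a sketch, a quiet line would bypass the processing entirely, while an explicit user request would always enable it.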
Identifying and parsing the words or phonemes in the initial acoustic signal can be achieved directly by comparing the acoustic signal to the stored model. Additionally or alternatively, the identification and parsing may include a probabilistic prediction based on the syntax of other identified words or phonemes. Using a probabilistic approach can also allow for the identification of phonemes previously missing from a particular voice model.
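One simple way such a probabilistic prediction could work is to weight each candidate's acoustic match by the likelihood of that candidate following the previously identified phoneme. The following Python sketch assumes a bigram transition table and a multiplicative combination; both are illustrative choices and not taken from the description, and the names `pick_phoneme`, `acoustic_scores` and `bigram` are hypothetical.

```python
def pick_phoneme(acoustic_scores, previous, bigram, floor=1e-6):
    """Choose the candidate maximising acoustic score x transition probability.

    acoustic_scores: dict mapping candidate phoneme -> match score against
                     the stored voice model (higher is better).
    previous:        the phoneme identified immediately before this one.
    bigram:          dict of dicts, bigram[prev][cand] = assumed probability
                     that `cand` follows `prev`.
    floor:           small probability used for transitions missing from the
                     table, so unseen sequences remain possible.
    """
    def combined(candidate):
        prior = bigram.get(previous, {}).get(candidate, floor)
        return acoustic_scores[candidate] * prior

    return max(acoustic_scores, key=combined)
```

With acoustic scores that slightly favour one phoneme, a strong contextual prior can tip the decision towards another, which is the effect the probabilistic approach relies upon when noise makes the acoustic evidence ambiguous.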
The stored model may comprise samples of the voice uttering words or phonemes. Additionally or alternatively, the stored model may comprise data indicating how characteristics of the voice differ from reference samples of the same words or phonemes. The voice characteristics may include accent, cadence, tone, excitation, inflexion, spectral characteristics, sound/pause duration or the like.
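The second form of stored model, holding differences from reference samples rather than the samples themselves, can be illustrated with a minimal Python sketch. It assumes each word or phoneme is represented by a fixed-length list of numeric characteristics and that the difference is a simple element-wise subtraction; the description does not prescribe either, and the function names are hypothetical.

```python
def characterise(voice_features, reference_features):
    """Return the per-characteristic difference between a voice and the reference."""
    return [v - r for v, r in zip(voice_features, reference_features)]


def reconstruct(reference_features, delta):
    """Recover the voice's characteristics from the reference plus the stored difference."""
    return [r + d for r, d in zip(reference_features, delta)]
```

Under these assumptions, only the difference need be stored for each speaker, with the reference samples shared between all voice models.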
The method may include the step of updating a stored model on an ongoing basis and/or the step of building up and storing a model of any unidentified voices. This may be achieved by capturing samples of the voice, analysing the samples to identify corresponding words or phonemes and storing said samples or data indicating how the voice characteristics differ from reference samples of the same words or phonemes.
According to a second aspect of the present invention there is provided a noise reduction system for use in voice communication comprising: a library of stored voice models; a speech detection engine operable to identify elements of an initial acoustic signal corresponding to words or phonemes uttered by a voice and parse the identified elements into an ordered data stream of said words or phonemes; a speech reconstruction engine operable to retrieve data from the library of stored voice models corresponding to the words or phonemes of the ordered data stream and to utilise the retrieved data to generate a secondary acoustic signal corresponding to the parsed words or phonemes.
The noise reduction system of the second aspect of the present invention may incorporate any or all features of the first aspect of the present invention, as desired or as appropriate.
According to a third aspect of the present invention there is provided a voice communications device incorporating a noise reduction system according to the second aspect of the present invention.
The voice communications device may be a fixed line or cellular telephone, desktop, laptop or tablet computer, audio or audiovisual recording device or the like.
Detailed Description of the Invention
In order that the invention may be more clearly understood an embodiment/embodiments thereof will now be described, by way of example only, with reference to the accompanying drawings, of which:
Figure 1 is a schematic illustration of a voice communication situation in which the present invention might be implemented;
Figure 2 is a flow diagram illustrating the steps involved in creating or updating a stored voice model in the present invention;
Figure 3 is a flow diagram illustrating the steps involved in processing an initial acoustic signal to reduce background noise in the present invention; and
Figure 4 is a schematic block diagram of a mobile telephone handset adapted to implement the present invention.
In a conventional voice communication system, a first voice communication device A (such as a telephone handset) captures an acoustic signal including a speaker’s voice. This captured acoustic signal is then transmitted, in suitably encoded form, via a communication network N to a second voice communication device B. The captured signal is subsequently reproduced by device B for the benefit of a listener. Should the listener at device B wish to reply, device B is also operable to capture an acoustic signal and transmit the suitably encoded signal to device A for reproduction. On occasions where the voice is captured alongside significant amounts of background noise, this background noise forms part of the acoustic signal reproduced for the listener. Additionally or alternatively, there can be significant channel noise encountered upon transmission of a signal. These noise contributions can significantly reduce the intelligibility and/or recognisability of the voice communication.
In the present invention, the acoustic signal captured by device A is subjected to noise reduction processing before reproduction by device B. This processing can take place either at device A before transmission or at device B after receipt but before reproduction. The processing involves an initial step of analysing the captured acoustic signal with respect to a stored model of the speaker’s voice. By way of this analysis, elements of the captured acoustic signal corresponding to words or phonemes uttered by the speaker can be identified and parsed into an ordered data stream of said words or phonemes. Subsequently, data from the stored model of the voice corresponding to the words or phonemes of the ordered data stream can be retrieved and used to generate a new acoustic signal corresponding to the parsed words or phonemes. This new acoustic signal can then be reproduced for the listener. By identifying the voiced words or phonemes in this manner, the subsequent reconstruction of a new acoustic signal corresponding to the voiced words or phonemes can substantially exclude noise. This increases the intelligibility of the voice communications considerably on occasions where the voice is captured alongside significant amounts of background noise or is subject to significant amounts of channel noise.
In order for the method to operate, it is necessary to have a viable model of a speaker’s voice. Such a model may be created by processing samples of the speaker’s voice. These samples may be acquired by the speaker submitting a predetermined range of voice samples. More typically, these samples can be acquired by capturing and analysing voice samples on occasions where the noise reduction is not required. Ideally, ongoing sampling allows for each speaker’s voice model to be continuously adapted. This can significantly improve models over time, as the noise contribution in the averaged samples will tend to zero as more samples are collected.
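The averaging effect noted above can be illustrated with a short Python simulation. This is a sketch under stated assumptions only: the voice parameters are modelled as a fixed vector, the noise as independent zero-mean Gaussian noise, and the model as a running mean of the collected samples. None of these modelling choices, nor the function names, are taken from the description.

```python
import random


def update_running_mean(mean, count, sample):
    """Fold one new sample vector into a running mean; returns (new_mean, new_count)."""
    new_count = count + 1
    new_mean = [m + (x - m) / new_count for m, x in zip(mean, sample)]
    return new_mean, new_count


def rms_error(estimate, truth):
    """Root-mean-square difference between the averaged model and the clean values."""
    return (sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)) ** 0.5


def simulate(num_samples, dims=200, noise_std=1.0, seed=0):
    """Average `num_samples` noisy observations and return the residual error."""
    rng = random.Random(seed)
    truth = [0.5] * dims          # hypothetical clean voice parameters
    mean, count = [0.0] * dims, 0
    for _ in range(num_samples):
        noisy = [t + rng.gauss(0.0, noise_std) for t in truth]
        mean, count = update_running_mean(mean, count, noisy)
    return rms_error(mean, truth)
```

For zero-mean noise of this kind, the residual error of the averaged model falls roughly in proportion to the inverse square root of the number of samples, so a model built from several hundred samples is markedly cleaner than one built from a handful.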
Turning now to figure 2, this analysis is illustrated schematically. Following detection of a voice at S1, a determination is made at S2 as to whether the speaker has an existing voice model. This may be achieved by comparing the initially detected voice against existing speech models. Alternatively, the speaker may be identified by another method (for instance by their phone number or by direct input of an identity). If the speaker does have an existing model, then at S3, the voice sample is analysed to determine its likely content and characteristic parameters of the voice. These parameters may include accent, cadence, tone, excitation, inflexion, spectral characteristics, sound/pause duration or the like. At S4, the voice model is then updated with any additional or revised parameters.
If the speaker does not have an existing model, a choice may be made at step S5 whether to create a new model or not. If a new model is to be created, this model is assigned to the speaker identity at S6. The voice sample can then be analysed and updated as set out above in steps S3 & S4.
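The decision flow of figure 2 can be summarised in a short Python sketch. It assumes the library of voice models is a dictionary keyed by speaker identity, that each model is a dictionary of named parameters, and that the analysis of step S3 has already produced the `parameters` argument. These data structures and the function name are illustrative assumptions only.

```python
def process_voice_sample(library, speaker_id, parameters, create_if_missing=True):
    """Update or create a speaker's voice model; returns True if a model was updated.

    library:    dict mapping speaker identity -> dict of voice parameters.
    parameters: dict of parameters extracted from the sample (the output of
                the S3 analysis, which is outside the scope of this sketch).
    """
    if speaker_id not in library:        # S2: no existing model for this speaker
        if not create_if_missing:        # S5: choose not to create a new model
            return False
        library[speaker_id] = {}         # S6: assign a new model to the speaker
    library[speaker_id].update(parameters)  # S4: add or revise parameters
    return True
```

A sample from an unknown speaker therefore either seeds a new model or is discarded, depending on the choice made at S5.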
Turning now to figure 3, there is presented a flow chart illustrating the steps involved in a preferred implementation of noise reduction processing according to the present invention. The steps are performed on a captured or received acoustic signal. Initially, at step S11, it is determined whether a voice is detected. If a voice is detected, at S12, an attempt is made to identify the voice. This attempt may involve comparing the voice against existing voice models and/or analysing the source of the acoustic signal. For instance, an acoustic signal received from a particular phone could be directly identified as containing a voice corresponding to the user of the phone. If a voice model exists, at step S13, an assessment is made as to whether the model contains sufficient data to make use of the present method viable.
In the event that use of the voice model is viable, the model parameters are retrieved from the library of voice models at S14. At S14, the acoustic signal is analysed probabilistically based on the word/phoneme recognition, syntax considerations and the specific parameters of the voice model. At S15, this analysis is processed into an ordered data stream corresponding to the predicted words or phonemes uttered by the voice. Subsequently, at S16, the voice model can be used to generate a new acoustic signal corresponding to the successive words or phonemes of the data stream. By applying the voice model, the new acoustic signal will correspond substantially to the voice elements within the original signal excluding noise. If desired, for a more natural sound, the new acoustic signal may be mixed with a low level of background noise.
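The overall flow of steps S13 to S16 might be sketched in Python as follows. This is a deliberately simplified toy under stated assumptions: the signal is assumed to be pre-divided into frames of equal length, the voice model is a dictionary mapping each phoneme label to one clean sample frame, and identification is by nearest squared distance rather than the probabilistic, syntax-aware analysis described above. The function names and the `min_phonemes` viability test are hypothetical.

```python
def nearest_phoneme(frame, model):
    """Return the label of the model entry closest to the noisy frame."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(model, key=lambda label: squared_distance(frame, model[label]))


def reduce_noise(frames, model, min_phonemes=2):
    """Return (ordered_stream, new_signal) for a framed acoustic signal.

    If the model is judged not viable, the stream is None and the original
    samples are passed through unchanged.
    """
    if len(model) < min_phonemes:                            # S13: viability check
        return None, [s for frame in frames for s in frame]
    stream = [nearest_phoneme(f, model) for f in frames]     # S14-S15: identify and parse
    signal = [s for label in stream for s in model[label]]   # S16: regenerate from model
    return stream, signal
```

Because the output is assembled solely from the stored model, noise present in the input frames does not carry over into the regenerated signal, provided each frame is identified correctly.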
To facilitate processing, one implementation of the invention may involve delaying the signal for a processing interval. In view of the low latency of contemporary networks, a delay of a few milliseconds may prove adequate for processing whilst having minimal impact on a user.
Turning now to figure 4, an exemplary device incorporating a system for implementing the method of the present invention is shown. The device in this example is a cellular telephone handset 10, albeit that the skilled man will appreciate that this method may be applied to or implemented by any other device useable for voice communication including but not limited to fixed line telephones, desktop, laptop or tablet computers and the like.
The handset 10 incorporates a communication unit 11 adapted to enable data, in particular encoded acoustic signals, to be transmitted and received via a cellular telephone network. The handset is also provided with a microphone 12 for capturing an acoustic signal including the voice of a phone user and a loudspeaker 13 for reproducing an acoustic signal received via the communication unit 11.
Within the phone 10 is provided a noise reduction system 100 according to the present invention for implementing the above discussed method. The system 100 comprises a data storage means 110, a speech detection engine 120 and a speech reconstruction engine 130.
The data storage means 110 contains a library of stored voice models. The speech detection engine 120 is operable to retrieve data from the library and use this in the analysis of an acoustic signal. The acoustic signal may be a signal captured by the microphone 12 or may be an acoustic signal received via the communication unit 11. The analysis can allow the speech detection engine to identify elements of the acoustic signal as corresponding to words or phonemes uttered by the modelled voice and to parse the identified elements into an ordered data stream of said words or phonemes. The ordered data stream can then be passed to the speech reconstruction engine 130. Subsequently, the speech reconstruction engine 130 is operable to retrieve data from the library corresponding to the words or phonemes of the ordered data stream and to utilise the retrieved data to generate a new acoustic signal corresponding to the parsed words or phonemes. This new acoustic signal may be output by the loudspeaker 13 or may be passed to the communication unit for transmission to another device via the cellular telephone network.
In further implementations of the invention, it is possible for the ordered data stream or the acoustic signal recreated from the ordered data stream to be fed to additional voice processing units. The data stream/reconstructed audio signal can provide a high quality input for such systems to undertake further processing before generating an output audio signal. In a particular example, the additional voice processing unit may include a translation engine. In such an example, the captured acoustic signal may be translated into a separate language for regeneration as text or an acoustic signal in a different language.
It is of course to be understood that the invention is not to be restricted to the details of the above embodiment/embodiments, which are described by way of example only.