The present disclosure claims priority to U.S. non-provisional patent application serial No. 15/229,429, filed on August 5, 2016, which in turn claims priority to U.S. provisional patent application serial No. 62/202,303, filed on August 7, 2015, U.S. provisional patent application serial No. 62/237,868, filed on October 6, 2015, and U.S. provisional patent application serial No. 62/351,499, filed on June 17, 2016, each of which is incorporated herein by reference.
Detailed Description
According to embodiments of the present disclosure, systems and methods are presented that may use at least three different audio event detectors in an automatic playback management framework. Such audio event detectors of an audio device may include: a near-field detector that may detect near-field sounds of the audio device, such as when a user of the audio device (e.g., a user wearing or otherwise using the audio device) is speaking; a proximity detector that may detect sounds in proximity to the audio device, such as when another person near the user of the audio device speaks; and a tone alarm detector that may detect an acoustic alarm originating in the vicinity of the audio device. Fig. 1 illustrates an example of a use case scenario in which such detectors may be used in conjunction with a playback management system to enhance a user experience, in accordance with an embodiment of the present disclosure.
Fig. 2 illustrates an exemplary playback management system that modifies the playback signal based on the decision of the event detector 2 according to an embodiment of the present disclosure. The signal processing functions in the processor 50 may include an acoustic echo canceller 1, which may cancel acoustic echoes received at a microphone 52 due to echo coupling between an output audio transducer 51 (e.g., a speaker) and the microphone 52. The echo-reduced signal may be passed to an event detector 2, which may detect one or more different ambient events, including but not limited to near-field events detected by a near-field detector 3 (e.g., including but not limited to speech from a user of the audio device), proximity events detected by a proximity detector 4 (e.g., including but not limited to speech or other ambient sounds other than near-field sounds), and/or tonal alarm events detected by an alarm detector 5. If an audio event is detected, the event-based playback control 6 may modify the characteristics of the audio information (shown as "playback content" in FIG. 2) reproduced to the output audio transducer 51. The audio information may include any information that may be reproduced at the output audio transducer 51, including, but not limited to, downlink speech associated with a telephone conversation received via a communication network (e.g., a cellular network) and/or internal audio from an internal audio source (e.g., a music file, a video file, etc.).
Fig. 3 illustrates an example event detector in accordance with an embodiment of the present disclosure. As shown in fig. 3, an exemplary event detector may include a voice activity detector 10, a music detector 9, a direction of arrival estimator 7, a near-field spatial information extractor 8, a background noise sound pressure level estimator 11, and a decision fusion logic device 12. The decision fusion logic device 12 uses information from the voice activity detector 10, the music detector 9, the direction of arrival estimator 7, the near-field spatial information extractor 8, and the background noise sound pressure level estimator 11 to detect audio events including, but not limited to, near-field sounds, proximity sounds other than near-field sounds, and tone alarms.
The near-field detector 3 may detect near-field sounds including voices. When such near-field sounds are detected, it may be desirable to modify the audio information reproduced to the output audio transducer 51, since the detection of near-field sounds may indicate that the user is participating in a conversation. Such near-field detection may need to detect near-field sounds in noisy acoustic conditions and to be robust against false detection of near-field sounds in very diverse background noise conditions (e.g., background noise in restaurants, noise while driving a car, etc.). As explained in more detail below, near-field detection may require spatial sound processing using multiple microphones 52. In some embodiments, such near-field sound detection may be implemented in the same or similar manner as described in U.S. patent No. 8,565,446 and/or U.S. application serial No. 13/199,593.
The proximity detector 4 may detect ambient sounds other than near-field sounds (e.g., speech from a person near the user, background music, etc.). As explained in more detail below, because it may be difficult to distinguish proximity sounds from non-stationary background noise and background music, the proximity detector 4 may use the music detector and the noise sound pressure level estimate to disable proximity detection, thereby avoiding a poor user experience due to false detection of proximity sounds. In some embodiments, such proximity sound detection may be accomplished in the same or similar manner as described in U.S. patent No. 8,126,706, U.S. patent No. 8,565,446, and/or U.S. application serial No. 13/199,593.
The tone alarm detector 5 may detect a tone alarm (e.g., a siren) near the audio device. To provide the best user experience, it may be desirable for the tone alarm detector 5 to ignore certain alarms (e.g., weak or low-volume alarms). As described in more detail below, tone alarm detection may require spatial sound processing using multiple microphones 52. In some embodiments, such tone alarm detection may be accomplished in the same or similar manner as described in U.S. patent No. 8,126,706 and/or U.S. application serial No. 13/199,593.
FIG. 4 shows functional blocks of a system for obtaining near-field spatial statistics that may be used to detect audio events, according to an embodiment of the present disclosure. Sound pressure level analysis 41 may be performed on the signals from microphones 52 by estimating an inter-microphone sound pressure level difference imd between the near and far microphones (e.g., as described in U.S. application serial No. 13/199,593). Cross-correlation analysis 13 may be performed on the signals received by microphones 52 to obtain direction of arrival information DOA of ambient sound impinging on microphones 52 (e.g., as described in U.S. patent No. 8,565,446). In the cross-correlation analysis 13, a maximum normalized correlation value normMaxCorr (e.g., as described in U.S. application serial No. 13/199,593) may also be obtained. The voice activity detector 10 may detect the presence of speech and generate a signal speechDet indicative of the presence or absence of speech in the ambient sound (e.g., as described in the probabilistic speech presence/absence method of U.S. patent No. 7,492,889). The beamformer 15 may generate near-field signal estimates and interference signal estimates based on the signals from the microphones 52, which may be used by the noise analysis 14 to determine the noise sound pressure level noiseLevel and the interference-to-near-field signal ratio idr in the ambient sound. U.S. patent No. 8,565,446 describes an example method of estimating the interference-to-near-field signal ratio idr using a pair of beamformers 15. The voice activity detector 36 may use the interference estimate to detect any voice signals that do not originate from the desired signal direction (proxSpeechDet). As long as the direction of arrival estimate DOA of the ambient sound is outside the acceptance angle of the near-field sound, the noise analysis 14 may be performed by updating the interfering signal energy based on the direction of arrival estimate DOA.
The direction of arrival of near-field sound may be known a priori for a given microphone array configuration in the industrial design of a personal audio device.
The presence of near-field sound may then be detected using a variety of statistics generated by the system of fig. 4. Fig. 5 illustrates exemplary fusion logic for detecting near-field sounds according to embodiments of the present disclosure. As shown in fig. 5, near-field speech may be detected when all of the following criteria are met:
the direction of arrival estimate DOA of the ambient sound is within the acceptance angle of the near-field sound (block 16);
the maximum normalized cross-correlation statistic normMaxCorr is greater than the threshold normMaxCorrThres1 (block 17);
the interference-to-near-field desired signal ratio idr is less than the threshold idrThres1 (block 18);
voice activity is detected, as represented by the signal speechDet (block 19);
the inter-microphone pressure level difference statistic imd is greater than the threshold imdTh (block 42).
In some embodiments, the thresholds idrThres1 and imdTh may be dynamically adjusted based on the background noise sound pressure level estimate.
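The five criteria of fig. 5 amount to a logical AND over per-frame statistics. The following is a minimal C sketch of that fusion, not the disclosed implementation: the struct layouts, the field names, and the function name detectNearField are assumptions made for illustration, and the direction-of-arrival test is assumed to have already been reduced to a boolean by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-frame statistics from the analysis blocks of Fig. 4.
 * The struct layout and field names are illustrative assumptions. */
typedef struct {
    bool  doaInNearFieldAngle; /* block 16: DOA inside the near-field acceptance angle */
    float normMaxCorr;         /* block 17: maximum normalized cross-correlation */
    float idr;                 /* block 18: interference-to-near-field signal ratio */
    bool  speechDet;           /* block 19: voice activity decision */
    float imd;                 /* block 42: inter-microphone level difference */
} NearFieldStats;

/* Thresholds named in the text; their values are left to the caller. */
typedef struct {
    float normMaxCorrThres1;
    float idrThres1;
    float imdTh;
} NearFieldThresholds;

/* Near-field speech is declared only when all five criteria hold. */
bool detectNearField(const NearFieldStats *s, const NearFieldThresholds *t)
{
    assert(s != NULL && t != NULL);
    return s->doaInNearFieldAngle
        && s->normMaxCorr > t->normMaxCorrThres1
        && s->idr < t->idrThres1
        && s->speechDet
        && s->imd > t->imdTh;
}
```

Keeping the thresholds in a separate struct would let a caller adjust them from the background noise estimate on each frame without touching the fusion function itself.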
Proximity detection by the proximity detector 4 may differ from near-field sound detection by the near-field detector 3, because the signal characteristics of proximity speech may be very similar to ambient signals such as music and noise. The proximity detector 4 must therefore avoid false detection of proximity speech to achieve an acceptable user experience. Accordingly, the music detector 9 may be used to disable proximity detection whenever music is present in the background. Likewise, the proximity detector 4 may be disabled whenever the background noise sound pressure level is above a certain threshold. The background noise threshold may be determined a priori such that false detections below the threshold sound pressure level are very unlikely. Fig. 6 illustrates exemplary fusion logic for detecting proximity sounds (e.g., speech) in accordance with embodiments of the present disclosure. Furthermore, there may be many sources of ambient noise that produce transient acoustic stimuli. These noise types may be erroneously detected as voice signals by the voice activity detector. To reduce the likelihood of false detections, spectral flatness measure (SFM) statistics from the music detector 9 may be used to distinguish speech from transient noise. For example, the SFM may be tracked over a period of time, and the difference between the maximum SFM value and the minimum SFM value over that period may be calculated, the difference being defined as sfmSwing. The value of sfmSwing may typically be small for transient noise signals because the spectral content of these signals is broadband and tends to remain steady over short time intervals (300 ms to 500 ms). The value of sfmSwing may be higher for a voice signal because the spectral content of a voice signal may change faster than that of a transient signal. As shown in fig. 6, a proximity sound (e.g., speech) may be detected when all of the following criteria are met:
no music is detected in the background (block 20);
the direction of arrival estimate DOA is within the acceptance angle of the proximity sound (block 21);
the maximum normalized cross-correlation statistic normMaxCorr is greater than the threshold normMaxCorrThres2 (block 22);
the background noise sound pressure level noiseLevel is below the threshold noiseLevelTh (block 23);
proximity speech activity is detected, as represented by the signal proxSpeechDet (block 19);
the SFM change statistic sfmSwing is greater than the threshold sfmSwingTh (block 37);
the interference-to-near-field desired signal ratio idr is greater than the threshold idrThres2 (block 40);
the inter-microphone pressure level difference statistic imd is close to 0 dB (block 43).
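The sfmSwing statistic and the eight criteria of fig. 6 can be sketched in C as follows. This is an illustrative sketch under stated assumptions: the function names, struct layouts, and the caller-supplied SFM window are hypothetical, and "close to 0 dB" is interpreted as an absolute value below a tolerance imdNearZeroTol, a parameter that does not appear in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* sfmSwing: maximum minus minimum SFM over a window of n tracked values. */
float computeSfmSwing(const float *sfm, size_t n)
{
    assert(sfm != NULL || n == 0);
    if (n == 0) return 0.0f;
    float lo = sfm[0], hi = sfm[0];
    for (size_t i = 1; i < n; ++i) {
        if (sfm[i] < lo) lo = sfm[i];
        if (sfm[i] > hi) hi = sfm[i];
    }
    return hi - lo;
}

/* Illustrative container for the statistics tested in Fig. 6. */
typedef struct {
    bool  musicDet;            /* block 20 */
    bool  doaInProximityAngle; /* block 21 */
    float normMaxCorr;         /* block 22 */
    float noiseLevel;          /* block 23 */
    bool  proxSpeechDet;       /* proximity voice activity */
    float sfmSwing;            /* block 37 */
    float idr;                 /* block 40 */
    float imd;                 /* block 43, in dB */
} ProximityStats;

typedef struct {
    float normMaxCorrThres2;
    float noiseLevelTh;
    float sfmSwingTh;
    float idrThres2;
    float imdNearZeroTol; /* assumption: tolerance defining "close to 0 dB" */
} ProximityThresholds;

/* Proximity sound is declared only when all eight criteria hold. */
bool detectProximity(const ProximityStats *s, const ProximityThresholds *t)
{
    assert(s != NULL && t != NULL);
    return !s->musicDet
        && s->doaInProximityAngle
        && s->normMaxCorr > t->normMaxCorrThres2
        && s->noiseLevel < t->noiseLevelTh
        && s->proxSpeechDet
        && s->sfmSwing > t->sfmSwingTh
        && s->idr > t->idrThres2
        && s->imd < t->imdNearZeroTol && s->imd > -t->imdNearZeroTol;
}
```

In use, computeSfmSwing would be called on the SFM values collected over the tracking period, and its result stored in the sfmSwing field before detectProximity is evaluated.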
In some embodiments, the music detector 9 used to detect the presence of background music may be implemented using a music detector as taught in U.S. patent No. 8,126,706. Another embodiment of a proximity speech detector according to an embodiment of the present disclosure is shown in fig. 7. According to this embodiment, proximity speech may be detected if the following conditions are satisfied:
the interference-to-near-field desired signal ratio idr is greater than the threshold idrThres2 (block 39);
proximity voice activity is detected (block 27);
the maximum normalized cross-correlation statistic normMaxCorr is greater than the threshold normMaxCorrThres3 (block 28);
the direction of arrival estimate DOA is within the acceptance angle of the proximity sound (block 29);
no music is detected in the background (block 30);
background noise is absent or has a low or medium sound pressure level (block 31). This condition may be verified by comparing the estimated background noise sound pressure level with a threshold noiseLevelThLo. If a low noise sound pressure level is detected, the following two conditions are also tested to confirm the presence of proximity speech:
the SFM change statistic sfmSwing is greater than the threshold sfmSwingTh (block 38);
the inter-microphone pressure level difference statistic imd is close to 0 dB (block 44).
If the above-described background noise sound pressure level condition is not met at block 31, then the following conditions may indicate proximity speech, improving the detection rate of proximity speech without increasing the occurrence of false alarms (e.g., due to background noise conditions):
stationary background noise is present (block 32). Stationary background noise may be detected by calculating the peak-to-root-mean-square ratio of the SFM generated by the music detector (block 9) over a period of time. In particular, if this ratio is high, non-stationary noise may be present, because the spectral flatness measure of non-stationary noise tends to vary faster than that of stationary noise;
a high noise sound pressure level is present (block 32). A high noise condition may be detected if the estimated background noise is greater than the threshold noiseLevelThLo and less than the threshold noiseLevelThHi. If the stationary noise and direction of arrival conditions are not met at block 32, then the presence of the following set of two conditions may indicate the presence of proximity speech:
a close talker is present (block 33). A close talker may be detected when the maximum normalized cross-correlation statistic normMaxCorr is greater than a threshold normMaxCorrThres4 (the threshold normMaxCorrThres4 may be greater than normMaxCorrThres3 to indicate the presence of a close talker);
background noise is absent or has a low, medium, or high sound pressure level (block 34). This condition may be detected if the estimated background noise sound pressure level is less than the threshold noiseLevelThHi.
If the above direction-of-arrival condition is not met at block 29, then the presence of the following conditions may indicate proximity speech:
music is not present (block 35);
a close talker is present (block 33). A close talker may be detected when the maximum normalized cross-correlation statistic normMaxCorr is greater than a threshold normMaxCorrThres4 (the threshold normMaxCorrThres4 may be greater than normMaxCorrThres3 to indicate the presence of a close talker);
background noise is absent or has a low, medium, or high sound pressure level (block 34). This condition may be detected if the estimated background noise sound pressure level is less than the threshold noiseLevelThHi.
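The stationary-noise test of block 32 relies on the peak-to-root-mean-square ratio of the tracked SFM. A minimal C sketch of that test is given below. The function name isStationaryNoise and the threshold parameter peakToRmsThres are assumptions, since the text gives neither a name nor a value for the decision threshold. The sketch compares squared quantities so that no square root is needed, which is valid because SFM values are non-negative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the peak-to-RMS ratio of the SFM track is at or below
 * peakToRmsThres, i.e. the noise looks stationary. A high ratio suggests
 * non-stationary noise. peakToRmsThres is a hypothetical tuning parameter.
 *
 * peak / sqrt(sumSq / n) <= T   is tested as   peak^2 * n <= T^2 * sumSq. */
bool isStationaryNoise(const float *sfm, size_t n, float peakToRmsThres)
{
    assert(sfm != NULL || n == 0);
    if (n == 0) return false;          /* no data: make no stationarity claim */

    float peak = 0.0f;
    float sumSq = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (sfm[i] > peak) peak = sfm[i];
        sumSq += sfm[i] * sfm[i];
    }
    if (sumSq <= 0.0f) return false;   /* all-zero track: ratio is undefined */

    return peak * peak * (float)n <= peakToRmsThres * peakToRmsThres * sumSq;
}
```

A constant SFM track has a peak-to-RMS ratio of exactly 1, so any threshold above 1 classifies it as stationary, whereas a track with one isolated spike yields a much larger ratio.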
The tonal alarm detector 5 may be configured to detect tonal alarm signals, where the acoustic bandwidth of such alarm signals is narrow (e.g., a siren or a beep). In some embodiments, the tonality of the ambient sound may be detected by dividing the time-domain signal into a plurality of sub-bands by a time-frequency transformation, and a spectral flatness measure, shown in fig. 6 as the signal sfm[] generated by the music detector 9, may be calculated in each sub-band. The spectral flatness measure sfm[] may be estimated for all sub-bands, and a tone alarm may be detected if the spectrum is flat in most but not all sub-bands. Furthermore, in a playback management system, it may not be necessary to detect far-field alarm signals. Thus, the near-field spatial statistics 8 of FIG. 3 may be used to distinguish far-field alarm signals from near-field signals. Fig. 8 illustrates exemplary fusion logic for detecting tone alarm events (e.g., a siren or a beep), in accordance with embodiments of the present disclosure. As shown in FIG. 8, a tone alarm event may be detected when all of the following criteria are met:
the direction of arrival estimate DOA is within the acceptance angle of the alarm signal (block 24);
the maximum normalized cross-correlation statistic normMaxCorr is greater than the threshold normMaxCorrThres5 (block 25);
the spectral flatness measure sfm[] indicates that the noise spectrum is flat in most but not all sub-bands (block 26).
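The three criteria of fig. 8 can be sketched in C as shown below. Several details are assumptions rather than statements from the text: a sub-band is treated as flat when its SFM exceeds a hypothetical threshold sfmFlatTh, "most" is interpreted as strictly more than half of the sub-bands, and the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when the spectrum is flat in most but not all sub-bands.
 * Assumptions: a band is "flat" if sfm[k] > sfmFlatTh, and "most"
 * means strictly more than half of the bands. */
bool spectrumMostlyFlat(const float *sfm, size_t numBands, float sfmFlatTh)
{
    assert(sfm != NULL || numBands == 0);
    size_t flat = 0;
    for (size_t k = 0; k < numBands; ++k) {
        if (sfm[k] > sfmFlatTh) ++flat;
    }
    return flat * 2 > numBands && flat < numBands;
}

/* Tone alarm fusion: all three criteria of Fig. 8 must hold. */
bool detectToneAlarm(bool doaInAlarmAngle,       /* block 24 */
                     float normMaxCorr,          /* block 25 */
                     float normMaxCorrThres5,
                     const float *sfm,           /* block 26 */
                     size_t numBands,
                     float sfmFlatTh)
{
    return doaInAlarmAngle
        && normMaxCorr > normMaxCorrThres5
        && spectrumMostlyFlat(sfm, numBands, sfmFlatTh);
}
```

Under this reading, a narrowband alarm leaves one or a few sub-bands non-flat while the remaining sub-bands stay flat, which is the pattern the counting test looks for.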
In practice, the instantaneous audio event detections of the near-field detector 3, the proximity detector 4, and the tone alarm detector 5 as shown in figs. 5, 6, 7, and 8 may represent false audio events. Therefore, it may be desirable to verify the instantaneous audio event detection signal before passing it to the playback control 6. FIG. 9 illustrates an exemplary timing diagram showing delay and hysteresis logic that may be applied to an instantaneous audio event detection signal to generate a validated audio event signal, according to an embodiment of the disclosure. As shown in fig. 9, in response to the instantaneous detection of an audio event (e.g., near-field sound, tonal alarm event) lasting at least a predetermined time, the delay logic may generate a validated audio event signal, while the hysteresis logic may continue to assert the validated audio event signal until the instantaneous detection of the audio event has ceased for a second predetermined time.
The following pseudo-code may demonstrate the application of delay and hysteresis logic to reduce false detection of audio events, according to embodiments of the present disclosure.
/* If the instantaneous detect is true, increment the hold-off counter and reset the hang-over counter */
if (instDet == TRUE)
{
    holdOffCntr = holdOffCntr + 1;
    hangOverCntr = 0;
}
/* If the instantaneous detect is false, increment the hang-over counter and reset the hold-off counter */
else
{
    hangOverCntr = hangOverCntr + 1;
    holdOffCntr = 0;
}
/******************
 * Hold-off Logic *
 ******************/
/* Valid detect transitions to the true state if the instantaneous detect is continuously true for a certain time and the previous valid detect is false */
if (holdOffCntr > holdOffThres && validDet == FALSE)
{
    validDet = TRUE;
    holdOffCntr = 0;
    hangOverCntr = 0;
}
/*******************
 * Hang-over Logic *
 *******************/
/* Valid detect transitions to the false state if the instantaneous detect is continuously false for a certain time and the previous valid detect is true */
if (hangOverCntr > hangOverThres && validDet == TRUE)
{
    validDet = FALSE;
    holdOffCntr = 0;
    hangOverCntr = 0;
}
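For illustration, the hold-off and hang-over logic above can be packaged as a small C state machine with one instance per detector, so that each detector may use its own delay and hysteresis periods. The struct EventValidator and the function validateEvent are hypothetical names introduced for this sketch; the counters and thresholds keep the names used in the pseudo-code, and the counters advance once per call (e.g., once per processing frame, which is an assumption).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State for one detector's validation logic (hypothetical container). */
typedef struct {
    int  holdOffCntr;    /* consecutive frames with instDet true  */
    int  hangOverCntr;   /* consecutive frames with instDet false */
    int  holdOffThres;   /* delay before asserting validDet       */
    int  hangOverThres;  /* hysteresis before releasing validDet  */
    bool validDet;       /* validated audio event signal          */
} EventValidator;

/* Advance the state by one frame and return the validated detection. */
bool validateEvent(EventValidator *v, bool instDet)
{
    assert(v != NULL);
    if (instDet) {
        v->holdOffCntr++;
        v->hangOverCntr = 0;
    } else {
        v->hangOverCntr++;
        v->holdOffCntr = 0;
    }
    /* Hold-off: assert only after instDet has stayed true long enough. */
    if (v->holdOffCntr > v->holdOffThres && !v->validDet) {
        v->validDet = true;
        v->holdOffCntr = 0;
        v->hangOverCntr = 0;
    }
    /* Hang-over: release only after instDet has stayed false long enough. */
    if (v->hangOverCntr > v->hangOverThres && v->validDet) {
        v->validDet = false;
        v->holdOffCntr = 0;
        v->hangOverCntr = 0;
    }
    return v->validDet;
}
```

With holdOffThres set to 2 and hangOverThres set to 3, the validated signal would rise on the third consecutive true frame and fall on the fourth consecutive false frame.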
The validated event may be further verified before generating the playback mode switching control. For example, the following pseudo-code may demonstrate the application of delay and hysteresis logic for gracefully switching between a conversational mode (e.g., where audio information reproduced to the output audio transducer 51 may be modified in response to an audio event) and a normal playback mode (e.g., where audio information reproduced to the output audio transducer 51 is unmodified).
/***********************************
 * Conversational Mode Enter Logic *
 ***********************************/
/* Increment the time-to-enter-conversational-mode counter if the event detect is true and the mode is not the conversational mode. If the counter exceeds the threshold, switch to conversational mode and reset the counters. Note that the event detect need not be true contiguously. */
if (convModeEn == FALSE && validDet == TRUE)
{
    timeToEnterConvModeCntr = timeToEnterConvModeCntr + 1;
    if (timeToEnterConvModeCntr > timeToEnterConvModeThres)
    {
        convModeEn = TRUE;
        timeToEnterConvModeCntr = 0;
        timeToExitConvModeCntr = 0;
    }
}
/**********************************
 * Conversational Mode Exit Logic *
 **********************************/
/* Increment the time-to-exit-conversational-mode counter if the event detect is false and the mode is the conversational mode. If the counter exceeds the threshold, switch to normal mode and reset the counters. Note that the event detect must be false contiguously. */
if (convModeEn == TRUE && validDet == FALSE)
{
    timeToExitConvModeCntr++;
    if (timeToExitConvModeCntr > timeToExitConvModeThres)
    {
        convModeEn = FALSE;
        timeToEnterConvModeCntr = 0;
        timeToExitConvModeCntr = 0;
    }
}
else
{
    timeToExitConvModeCntr = 0;
}
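The asymmetry between entering and exiting conversational mode can be seen more easily in a compact C sketch. The struct ConvModeState and the function updateConvMode are hypothetical names introduced here; the counter and threshold names follow the pseudo-code above, and one call per processing frame is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mode-switching state (hypothetical container for the counters above). */
typedef struct {
    int  timeToEnterConvModeCntr;
    int  timeToExitConvModeCntr;
    int  timeToEnterConvModeThres;
    int  timeToExitConvModeThres;
    bool convModeEn;
} ConvModeState;

/* Advance the state by one frame and return whether conversational mode is on. */
bool updateConvMode(ConvModeState *m, bool validDet)
{
    assert(m != NULL);
    /* Enter: the counter accumulates across gaps, so validDet need not
     * be true on consecutive frames. */
    if (!m->convModeEn && validDet) {
        m->timeToEnterConvModeCntr++;
        if (m->timeToEnterConvModeCntr > m->timeToEnterConvModeThres) {
            m->convModeEn = true;
            m->timeToEnterConvModeCntr = 0;
            m->timeToExitConvModeCntr = 0;
        }
    }
    /* Exit: any frame with validDet true resets the counter, so validDet
     * must stay false on consecutive frames. */
    if (m->convModeEn && !validDet) {
        m->timeToExitConvModeCntr++;
        if (m->timeToExitConvModeCntr > m->timeToExitConvModeThres) {
            m->convModeEn = false;
            m->timeToEnterConvModeCntr = 0;
            m->timeToExitConvModeCntr = 0;
        }
    } else {
        m->timeToExitConvModeCntr = 0;
    }
    return m->convModeEn;
}
```

Because the enter counter is never cleared by a false frame, intermittent validated events still accumulate toward conversational mode, whereas a single validated event during conversational mode restarts the exit countdown.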
FIG. 10 illustrates different audio event detectors with delay and hysteresis logic according to embodiments of the disclosure. The delay period and/or the hysteresis period of the respective detectors may be set differently. In addition, in some embodiments, playback management may be controlled differently based on the type of event detected. In these and other embodiments, as shown in fig. 9, the playback gain (and thus the audio information reproduced at the output audio transducer 51) may be attenuated whenever one or more of the audio events are detected. In these and other embodiments, to provide smooth gain transitions, the playback gain may be smoothed using a first order exponential averaging filter represented by the following pseudocode:
if (convModeEn == TRUE)
{
    playBackGain = (1 - alpha) * convModeGain + alpha * playBackGain;
}
else
{
    playBackGain = (1 - beta) * normalModeGain + beta * playBackGain;
}
The smoothing parameters alpha and beta may be set to different values to adjust the slope of the gain transitions.
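As a minimal sketch, the first-order exponential averaging filter can be written as a pure C function that returns the next gain value. The function name smoothPlaybackGain is an assumption, as is the restriction of both smoothing parameters to the range from 0 to 1, which keeps the recursion stable.

```c
#include <assert.h>
#include <stdbool.h>

/* One update of the first-order exponential averaging filter.
 * alpha applies in conversational mode, beta in normal mode; values
 * closer to 1 give slower gain transitions. */
float smoothPlaybackGain(float playBackGain, bool convModeEn,
                         float convModeGain, float normalModeGain,
                         float alpha, float beta)
{
    /* Assumed valid range for the smoothing parameters. */
    assert(alpha >= 0.0f && alpha <= 1.0f);
    assert(beta >= 0.0f && beta <= 1.0f);

    if (convModeEn) {
        return (1.0f - alpha) * convModeGain + alpha * playBackGain;
    }
    return (1.0f - beta) * normalModeGain + beta * playBackGain;
}
```

After n consecutive updates in one mode, the remaining distance between the playback gain and the target gain of that mode is the initial distance multiplied by the smoothing parameter raised to the power n, so a smaller parameter gives a steeper transition.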
It should be understood, especially by those of ordinary skill in the art having the benefit of this disclosure, that the various operations described herein, particularly in conjunction with the figures, may be implemented by other circuits or other hardware components. The order in which the various operations of a given method are performed can be varied, and various elements of the systems illustrated herein can be added, reordered, combined, omitted, modified, etc. The disclosure is intended to embrace all such modifications and changes, and therefore the above description should be taken as illustrative and not restrictive.
Likewise, although the present disclosure makes reference to specific embodiments, certain modifications and changes may be made to these embodiments without departing from the scope of the present disclosure. Furthermore, any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element.
Further embodiments will likewise be apparent to those of ordinary skill in the art, given the benefit of this disclosure, and such embodiments should be considered to be encompassed herein.