CN102056026A

Info

Publication number: CN102056026A
Application number: CN2009102374145A
Authority: CN
Inventors: 陈欣伟; 方力; 沈亮; 高屹; 常静; 侯优优; 阮征
Original assignee: China Mobile Group Design Institute Co Ltd
Current assignee: China Mobile Group Design Institute Co Ltd
Priority date: 2009-11-06
Filing date: 2009-11-06
Publication date: 2011-05-11
Anticipated expiration: 2029-11-06
Also published as: CN102056026B

Abstract

The invention discloses an audio-video synchronization detection method and system, and a speech detection method and system. The audio-video synchronization detection method comprises: determining, in an audio-video file played at a target end, the start playback time of the audio segment that matches audio reference data and the start playback time of the video frame that matches video reference data; determining, from these two start playback times, the audio-video playback time difference of the file when played at the target end; and obtaining the audio-video playback time difference of the file when played at a source end, and determining, from the playback time differences at the source end and the target end, the audio-video synchronization status of the file when played at the target end. The invention improves the accuracy of audio-video synchronization detection.

Description

Audio-video synchronization detection method and system, and speech detection method and system

Technical field

The present invention relates to audio and video detection techniques in the communications field, and in particular to an audio-video synchronization detection method and system, and a speech detection method and system.

Background technology

In mobile video services, audio and video do not carry timing information during encoding, so obtaining audio-video synchronization information is difficult.

One approach is to add timing information to the audio data packets and the video data packets after encoding. Once the encoded audio-video file has been transmitted over the network to the receiving end, the receiving end parses the received file, extracts the timing information carried in the audio and video data packets, and judges audio-video synchronization from that information.

However, this detection method has the following problems:

(1) Although the audio and the video each carry timing information after packetization, the timing information in the two packet streams has no defined correspondence, and the frame lengths and packet sizes of audio and video also differ. The relative delay between audio and video therefore cannot be determined accurately.

(2) A result obtained from the timing information in the audio and video packet headers reflects only the network transmission delay. During actual playback, the player at the receiving end buffers the decoded audio and video streams and uses the buffer to adjust their synchronization. A result based on packet-header timing information cannot reflect the effect of this adjustment on audio-video synchronization, so the detection result is inaccurate.

Summary of the invention

Embodiments of the invention provide an audio-video synchronization detection method and system to address the low accuracy of existing audio-video synchronization detection.

The technical solution provided by the embodiments comprises:

An audio-video synchronization detection method, comprising the following steps:

determining, in the audio-video file played at the target end, the start playback time of the audio segment that matches the audio reference data and the start playback time of the video frame that matches the video reference data;

determining, from the start playback time of the audio segment that matches the audio reference data and the start playback time of the video frame that matches the video reference data, the audio-video playback time difference of the audio-video file when played at the target end;

obtaining the audio-video playback time difference of the audio-video file when played at the source end and, from the audio-video playback time differences at the source end and the target end, determining the audio-video synchronization status of the audio-video file when played at the target end.

An audio-video synchronization detection system, comprising:

an audio identification module, configured to determine, in the audio-video file played at the target end, the start playback time of the audio segment that matches the audio reference data;

a video identification module, configured to determine, in the audio-video file played at the target end, the start playback time of the video frame that matches the video reference data;

a time difference determination module, configured to determine the audio-video playback time difference of the audio-video file when played at the target end, from the start playback time of the matching audio segment determined by the audio identification module and the start playback time of the matching video frame determined by the video identification module;

a synchronization detection module, configured to obtain the audio-video playback time difference of the audio-video file when played at the source end and to determine, from the obtained time difference and the time difference determined by the time difference determination module, the audio-video synchronization status of the audio-video file when played at the target end.

In the above embodiment, for the audio-video file played at the target end, the start playback time of the audio segment matching the audio reference data and the start playback time of the video frame matching the video reference data are determined, which gives the audio-video playback time difference at the target end. This is then compared with the audio-video playback time difference of the same file when played at the source end to determine the audio-video synchronization status of the file when played at the target end. Compared with the prior art, the detection does not rely on timing information in the audio and video data packets but is based on the audio-video file as actually played at the target end, and it takes into account the synchronization adjustment made during decoding at the target end. The detection result is therefore more accurate. The method is particularly suited to detecting the synchronization of audio and video after network transmission.

Embodiments of the invention also provide a speech detection method and system to address the low accuracy of prior-art speech detection.

The technical solution provided by the embodiments comprises:

A speech detection method, comprising the following steps:

searching the audio under test according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, searching forward (toward earlier time) from that moment; and, when an audio signal whose short-time average magnitude first drops below the first amplitude threshold is found after that moment, searching backward (toward later time) from that moment;

when the forward search and the backward search each reach an audio signal whose short-time average magnitude has dropped to a second amplitude threshold, continuing the search in the same direction according to the short-time average zero-crossing rate, the second amplitude threshold being lower than the first amplitude threshold;

when the forward search reaches an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing-rate threshold, taking that moment as the start point of the speech segment; and, when the backward search reaches an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing-rate threshold, taking that moment as the end point of the speech segment.

A speech detection system, comprising:

a first search module, configured to search the audio under test according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, to search forward (toward earlier time) from that moment; and, when an audio signal whose short-time average magnitude first drops below the first amplitude threshold is found after that moment, to search backward (toward later time) from that moment;

a second search module, configured to continue the search in the same direction according to the short-time average zero-crossing rate when the forward and backward searches of the first search module each reach an audio signal whose short-time average magnitude has dropped to a second amplitude threshold, the second amplitude threshold being lower than the first amplitude threshold;

a speech segment determination module, configured to take the current moment as the start point of the speech segment when the forward search of the second search module reaches an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing-rate threshold, and to take the current moment as the end point of the speech segment when the backward search reaches an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing-rate threshold.

In the above embodiment, the speech detection process takes into account that short-time average energy identifies speech segments more effectively when background noise is low, and that short-time average zero-crossing rate is more effective when background noise is high. It considers both the short-time average magnitude and the short-time average zero-crossing rate of the speech signal: starting from detection based on short-time average magnitude, it then examines the short-time average zero-crossing rate and uses the two features together to detect speech endpoints, so the detected endpoints are more accurate.

Description of drawings

Fig. 1 is a flow chart of audio-video synchronization detection in an embodiment of the invention;

Fig. 2 is a flow chart of audio-video synchronization detection for IP network video telephony in an embodiment of the invention;

Fig. 3 is a schematic diagram of the dynamic path search in the speech recognition process in an embodiment of the invention;

Fig. 4 is a schematic diagram of the audio-video synchronization scoring model in an embodiment of the invention;

Fig. 5 is a structural diagram of the audio-video synchronization detection system in an embodiment of the invention;

Fig. 6 is a structural diagram of the speech detection system in an embodiment of the invention.

Embodiment

To address the above problems in the prior art, embodiments of the invention provide an audio-video synchronization detection method and system that perform detection by pattern recognition. At the sending end and at the receiving end, the played audio-video file is pattern-matched against the reference data for that file; the start playback times of the audio segment matching the audio reference data and of the video frame matching the video reference data are recorded, giving the audio-video playback time difference at each end. The delay difference is then calculated by comparing the playback time differences of the sending end and the receiving end, which yields the audio-video synchronization status of the file as played at the receiving end.

Before audio-video synchronization detection is carried out, audio reference data and video reference data are prepared. They are used during detection to locate the audio reference points and video reference points in the audio-video file, from which the audio-video synchronization parameters are determined. The audio reference data may be audio waveform data and the video reference data may be video image data; both may be stored in a feature library in advance.

Fig. 1 shows the flow of audio-video synchronization detection in an embodiment of the invention. The flow can be used to assess the effect of network transmission on audio-video synchronization, or to assess the effect of different playback ends on audio-video synchronization. When it is used to assess network transmission, the source end is the sending end of the audio-video file and the target end is the receiving end that the file reaches after network transmission. When it is used to assess different playback ends, the source end may be a playback end with good audio-video synchronization quality, and the target end is the playback end whose audio-video synchronization quality is to be assessed. The flow comprises the following steps:

Step 101: using an audio pattern recognition method, find the audio segment in the audio-video file played at the target end that matches the audio reference data, and record the start playback time of that audio segment;

Step 102: using a video pattern recognition method, find the video frame in the audio-video file played at the target end that matches the video reference data, and record the start playback time of that video frame;

Step 103: from the recorded start playback time of the audio segment matching the audio reference data and the recorded start playback time of the video frame matching the video reference data, determine the audio-video playback time difference;

Step 104: determine the audio-video synchronization status of the file as played at the target end from the playback time difference determined in step 103 and the audio-video playback time difference, when the file is played at the source end, between the audio segment matching the audio reference data and the video frame matching the video reference data. The status may be expressed, for example, as the amount or degree by which the audio-video synchronization delay at the target end has changed relative to the source end (such as the change in how far the audio leads or lags the video at the target end compared with the source end), and may further be mapped to a corresponding audio-video synchronization quality grade.

In step 101 and step 102 of the above flow, the recorded time may be the current system time of the target end or the time relative to the playback start point of the audio-video file. Steps 101 and 102 have no strict ordering requirement: they may be performed in either order or in parallel.
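
Steps 103 and 104 reduce to two subtractions, which the following minimal Python sketch illustrates. The function names, the sample times, and the sign convention (audio start minus video start, so a positive value means the audio starts later than the video) are assumptions made for illustration; the description above does not fix them.

```python
def av_time_difference(audio_start: float, video_start: float) -> float:
    """Step 103: audio-video playback time difference, in seconds.

    Assumed sign convention: positive when the audio starts after the video.
    """
    return audio_start - video_start


def sync_change(source_diff: float, target_diff: float) -> float:
    """Step 104: change in the audio-video offset from source end to target end."""
    return target_diff - source_diff


# Hypothetical start playback times, in seconds from the start of playback.
target_diff = av_time_difference(audio_start=2.480, video_start=2.400)
source_diff = 0.0  # assumed: audio and video coincide at the source end
change = sync_change(source_diff, target_diff)
print(f"offset change at the target end: {change * 1000:.0f} ms")
```

Because both times in a pair are read from the same clock, the result is the same whether the target end records system time or time relative to the playback start point.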

Usually the audio reference data and the video reference data correspond one to one, and multiple pairs are generally used to make the detection more accurate. With multiple pairs, the playback time difference determined in step 103 of the flow in Fig. 1 also corresponds one to one with the pairs: for a given item of audio reference data, the start playback time of the matching audio segment is determined; for the video reference data paired with it, the start playback time of the matching video frame is determined; and the difference between the two is the audio-video playback time difference for that pair. The audio-video playback time difference used in step 104, between the matching audio segment and the matching video frame when the file is played at the source end, is obtained in the same way.

The audio-video time difference of the file used for detection may be obtained at the sending end in advance in the manner described above. Each later detection that uses this file can then compare the pre-measured sending-end time difference directly with the time difference at the receiving end to determine the audio-video synchronization status of the file after transmission.

In general, to detect synchronization accurately, the audio reference data and video reference data should have distinctive features that are easy to recognize and to pattern-match, and the audio-video file used for detection must contain the audio segments that match the audio reference data and the video frames that match the video reference data. Preferably, in the audio-video file used for detection, the start playback time of the audio segment matching the audio reference data and the start playback time of the video frame matching the corresponding video reference data coincide at the sampling-point level, that is, the audio-video time difference is 0. In that case the audio-video playback time difference at the source end in step 104 of the flow in Fig. 1 is 0, so the synchronization status of the file as played at the target end can be determined directly from the playback time difference determined in step 103.

Take audio-video synchronization detection for IP network video telephony as an example. The audio-video file used for detection contains, on the audio side, the pronunciations of the digits 1, 2, 3, 4, and 5 and, on the video side, pictures of five different hand gestures shown against a solid-color background; during playback, each time a digit is pronounced, the corresponding gesture is shown on screen. The audio reference data are the audio waveform data of each of the digit pronunciations 1 to 5, stored in an audio feature library. The video reference data are the video image data of each of the five gestures against the solid-color background, stored in a video feature library. When the file is played at the sending end, the synchronization time difference between each digit pronunciation and its gesture picture is known. During network transmission the audio and the video of the file are transmitted separately, forming a WAV audio file and an AVI video file at the receiving end. As shown in Fig. 2, detecting the audio-video synchronization status of the file at the receiving end may comprise the following steps:

Obtain the WAV audio file from the audio-video file received at the receiving end (step 201); determine the endpoints of each speech segment from the audio signal to locate the speech segments (step 202); using the audio pattern recognition method, compare each speech segment with the speech data of each digit pronunciation in the audio feature library to identify the speech segments in which the digits 1, 2, 3, 4, and 5 are pronounced (step 203); and record the start and end playback times of these speech segments, so that the receiving end records the times of at least five audio segments (more if a digit pronunciation is repeated in the WAV audio file) (step 204);

Obtain the AVI video file from the audio-video file received at the receiving end (step 205); extract every frame image in the AVI video file (step 206); using the video pattern recognition method, compare each video frame image with the image data of each gesture in the video feature library to identify the video frames showing each gesture, usually taking only the first frame identified for each gesture (step 207); and record the start playback times of these video frames, so that the receiving end records the times of at least five video frames (more if a gesture picture is repeated in the AVI video file) (step 208);

Take the difference between the recorded start playback time of the pronunciation of digit 1 and the recorded start playback time of the video frame showing the corresponding gesture to obtain the audio-video playback time difference for digit 1 (all recorded times are referenced to the system time of the receiving end); obtain the audio-video playback time differences for the other digits in the same way (step 209);

Compare the audio-video playback time differences obtained in step 209 with the known playback time differences of the file at the sending end to determine the audio-video delay of the file at the receiving end relative to the sending end (step 210);

From the result of step 210, determine the corresponding audio-video synchronization quality grade or MOS score (step 211).
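
Steps 209 to 211 can be sketched in Python as follows. All start times are hypothetical, and the grade thresholds are assumptions chosen only to make the example concrete: the scoring model of Fig. 4 is not reproduced in the text, so `quality_grade` does not represent it.

```python
from statistics import mean

# Hypothetical start playback times (seconds, receiving-end system clock)
# for each digit pronunciation and its corresponding gesture frame.
audio_starts = {"1": 1.00, "2": 3.05, "3": 5.10, "4": 7.08, "5": 9.12}
video_starts = {"1": 0.95, "2": 2.98, "3": 5.00, "4": 7.00, "5": 9.00}
# Known sending-end differences; assumed 0 (audio and video aligned).
source_diffs = {digit: 0.0 for digit in audio_starts}


def relative_delays(audio, video, source):
    """Steps 209-210: per-pair receiving-end difference minus sending-end difference."""
    return {k: (audio[k] - video[k]) - source[k] for k in audio}


def quality_grade(mean_abs_delay):
    """Step 211: map a delay in seconds to a 1-5 grade (assumed thresholds)."""
    for limit, grade in [(0.04, 5), (0.08, 4), (0.16, 3), (0.32, 2)]:
        if mean_abs_delay < limit:
            return grade
    return 1


delays = relative_delays(audio_starts, video_starts, source_diffs)
mean_abs_delay = mean(abs(d) for d in delays.values())
grade = quality_grade(mean_abs_delay)
print(f"mean |delay| = {mean_abs_delay * 1000:.0f} ms, grade = {grade}")
```

Averaging over all reference pairs is one possible way to combine them; a maximum or a per-pair grade would be equally compatible with the description.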

As for the choice of audio reference data, human perception is relatively sensitive to a mismatch between picture content and the start of audio (silence to sound) or the end of audio (sound to silence). Preferably, therefore, the audio references are chosen to be speech segments (such as the pronunciations of the digits 1 to 5). To determine the audio segments that match the audio reference data, the endpoints of each speech segment in the audio waveform of the file are detected first, and audio pattern recognition is then performed between the detected speech segments and the audio reference data.

To detect the speech segments in the audio file, embodiments of the invention may use the traditional waveform detection method based on short-time energy or short-time average magnitude. That method is in essence a single-threshold detection method. To obtain a speech endpoint detection method that is more adaptive than the traditional method and extracts audio timing information more accurately, embodiments of the invention also improve on the traditional method and use the improved method for speech detection. The improved method takes into account that short-time average energy identifies speech segments more effectively when background noise is low, and that short-time average zero-crossing rate is more effective when background noise is high. It considers both the short-time average magnitude and the short-time average zero-crossing rate of the speech signal: starting from detection based on short-time average magnitude, it then examines the short-time average zero-crossing rate and uses the two features together to detect speech endpoints.

The basis for these decisions is that the short-time parameters of speech of different natures have different probability density functions, and that adjacent speech frames should have consistent speech characteristics, that is, the signal does not switch abruptly among voiced, unvoiced, and silent. In general, short-time average magnitude is highest for voiced speech and lowest for silence; short-time average zero-crossing rate is highest for unvoiced speech, intermediate for silence, and lowest for voiced speech.

In the speech detection method used by the embodiments, two amplitude thresholds MH and ML (MH > ML) and one short-time zero-crossing-rate threshold Z0 are first determined empirically. MH should be set relatively high, so that when the short-time average magnitude M of a speech frame exceeds MH, it is fairly certain that the frame is not silence and quite likely that it is voiced. When M falls from a larger value to ML, the decision continues on the basis of the short-time average zero-crossing rate; when the short-time average zero-crossing rate of the speech signal falls below Z0, that point is determined to be an endpoint (start point or end point) of the speech segment.

Short-time average magnitude and short-time average zero-crossing rate can be analyzed statistically over a large number of speech samples, and the amplitude thresholds MH and ML determined with reference to the short-time average magnitude of actual samples. The process of determining the amplitude threshold MH from speech samples is as follows:

Window the data in each speech sample and divide it into frames. Based on human physiological characteristics and the results of extensive data statistics, the window length is generally set to 20 ms and the step to half the window length, so that the total number of frames = total number of samples / step;
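
The framing step can be sketched in Python as below. The function name `frame_signal` and the 8 kHz sample rate are assumptions for illustration. The sketch keeps only complete frames, so it produces slightly fewer frames than the approximation total samples / step given above.

```python
def frame_signal(samples, sample_rate, window_ms=20.0):
    """Split samples into overlapping frames: 20 ms window, step of half a window."""
    window = int(sample_rate * window_ms / 1000)  # samples per frame
    step = window // 2                            # hop between frame starts
    last_start = len(samples) - window
    return [samples[i:i + window] for i in range(0, last_start + 1, step)]


# One second of dummy audio at an assumed 8 kHz sample rate.
samples = list(range(8000))
frames = frame_signal(samples, sample_rate=8000)
print(len(frames), "frames of", len(frames[0]), "samples")
```

No window function is applied here; a tapering window such as the Hamming window could be multiplied into each frame before the short-time parameters are computed.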

Compute the short-time average magnitude of each frame using the following formula:

M_{m} = \sum_{n=m}^{N+m-1} \left| s_{w}(n-m) \right|
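
As a minimal Python illustration of the formula above, the short-time average magnitude of one frame is the sum of the absolute values of its windowed samples. The function name is an assumption; as in the formula, the sum is not divided by the frame length N.

```python
def short_time_average_magnitude(frame):
    """Sum of absolute sample values over one (already windowed) frame."""
    return sum(abs(sample) for sample in frame)


magnitude = short_time_average_magnitude([1, -2, 3, -4])
```

Because every frame has the same length, omitting the division by N only rescales the thresholds MH and ML.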

Compute the short-time average zero-crossing rate of each frame using the following formula:

Z_{m} = \frac{1}{2} \sum_{n=m}^{N+m-1} \left| \operatorname{sgn}[s_{w}(n)] - \operatorname{sgn}[s_{w}(n-1)] \right|
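
A minimal Python sketch of the zero-crossing computation follows. Two details are assumptions not stated in the text: sgn(0) is treated as +1, and the sum starts at the second sample of the frame, so the crossing between the last sample of the previous frame and the first sample of this frame is not counted.

```python
def sgn(value):
    """Sign function; zero is treated as positive (assumed convention)."""
    return 1 if value >= 0 else -1


def short_time_zero_crossing_rate(frame):
    """Half the summed absolute sign differences, i.e. the number of sign changes."""
    return sum(abs(sgn(frame[n]) - sgn(frame[n - 1]))
               for n in range(1, len(frame))) / 2


alternating = short_time_zero_crossing_rate([1, -1, 1, -1])
steady = short_time_zero_crossing_rate([1, 2, 3])
```

Each sign change contributes |1 - (-1)| = 2 to the sum, which is why the formula carries the factor 1/2.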

Traverse all speech frames in each speech sample and analyze them statistically to obtain the distributions of short-time average magnitude and short-time average zero-crossing rate over the speech samples;

From these distributions, and with reference to the short-time average magnitude of the silent period, set the threshold MH. This threshold is set relatively high to ensure that any part of a speech sample whose short-time average magnitude exceeds MH is a speech segment. Three times the short-time average zero-crossing rate of the silent period is then taken as the zero-crossing-rate threshold Z0 for speech segments.

According to the amplitude thresholding MH that determines and ML and short-time average zero-crossing rate thresholding Z0, the speech detection process of the embodiment of the invention is:

Determine former and later two time points A1 and A2 in the audio signal to be detected according to MH, wherein, when the short-time average magnitude M of voice signal surpasses MH, this is designated as A1 constantly, the moment when A1 drops to MH first with voice signal backward is designated as A2; Substantially can be defined as voice segments between A1 and the A2;

Continue search before A1 and in the voice signal after the A2; When searching for forward,, then current time can be designated as B1 if the short-time average magnitude M of voice signal reduces to ML from big to small by A1; In like manner, when searching for backward,, then current time is designated as B2 if the short-time average magnitude M of voice signal reduces to ML from big to small by A2.Still can determine it is voice segments between B1 and the B2;

Continuation is searched for forward and by B2 backward by B1.When searching for forward,, drop to Z0 suddenly when following, current time is designated as C1 and as the starting point of voice segments up to Z if the short-time zero-crossing rate Z of voice signal all the time greater than Z0, thinks that then these voice signals still belong to voice segments by B1; In like manner, when searching for backward,, drop to Z0 suddenly when following, current time is designated as C2 and as the terminal point of this voice segments up to Z if the short-time zero-crossing rate Z of voice signal all the time greater than Z0, thinks that then these voice signals still belong to voice segments by B2;

Proceeding in the same way, all speech segments in the speech signal of the audio file, together with their start and end points, are detected.

The reason for adopting this algorithm is that the parts before B1 and after B2 may be unvoiced consonant segments. Their energy is quite weak, so the short-time average magnitude alone cannot distinguish them from silent segments; their short-time average zero-crossing rate, however, is clearly higher than that of silence. This parameter can therefore be used to locate the boundary between the two, that is, the true start and end points of the speech.

This algorithm is suitable not only for the speech segment detection process in the embodiment of the invention, but also for other application scenarios that need to detect speech segments in an audio signal.
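The A1/A2, B1/B2 and C1/C2 search described above can be sketched on per-frame arrays as follows. This is an illustrative reading of the procedure under stated assumptions: the function name is hypothetical, times are expressed as frame indices, "forward" is taken to mean toward earlier frames, and a new segment is not allowed to reach back into the previous one.

```python
def detect_speech_segments(M, Z, MH, ML, Z0):
    """Return a list of (start_frame, end_frame) speech segments.

    M: per-frame short-time average magnitude
    Z: per-frame short-time average zero-crossing rate
    """
    segments = []
    n = len(M)
    i = 0
    while i < n:
        if M[i] <= MH:
            i += 1
            continue
        # Earliest frame this segment may reach back to (assumption:
        # segments must not overlap).
        lower = segments[-1][1] + 1 if segments else 0
        # A1..A2: frames whose magnitude exceeds the high threshold MH.
        a1 = a2 = i
        while a2 + 1 < n and M[a2 + 1] > MH:
            a2 += 1
        # B1..B2: extend both ways while the magnitude stays above ML.
        b1, b2 = a1, a2
        while b1 - 1 >= lower and M[b1 - 1] > ML:
            b1 -= 1
        while b2 + 1 < n and M[b2 + 1] > ML:
            b2 += 1
        # C1..C2: extend further while the zero-crossing rate stays above Z0.
        c1, c2 = b1, b2
        while c1 - 1 >= lower and Z[c1 - 1] > Z0:
            c1 -= 1
        while c2 + 1 < n and Z[c2 + 1] > Z0:
            c2 += 1
        segments.append((c1, c2))
        i = c2 + 1
    return segments
```

Multiplying the returned frame indices by the frame hop duration gives the start and end playback times of each segment.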

After the time information of the speech segments has been obtained, pattern recognition must also be performed on them to determine the speech segment that matches the audio reference data. The embodiment of the invention uses the linear prediction cepstral coefficient (LPCC) technique for audio pattern recognition.

Obtaining the LPCC feature parameters consists of four main steps: preprocessing, autocorrelation calculation, solving the linear prediction coefficient (LPC) normal equations with the Durbin algorithm, and the LPCC recursion. In preprocessing, pre-emphasis boosts the high frequencies by applying a first-order FIR filter to the speech signal, which compensates for the high-frequency spectral attenuation caused by glottal excitation and mouth-nose radiation; for windowing, the algorithm selects the Hamming window as the window function.
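The four steps above can be sketched for a single frame as follows. This is a minimal sketch rather than the patent's implementation: the pre-emphasis coefficient 0.95, the prediction order 12 and all function names are assumptions, and the predictor convention used is x[n] ≈ Σ a_k · x[n−k].

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation r[0..order]."""
    a = np.zeros(order + 1)          # a[1..order] are the predictor coefficients
    err = float(r[0])
    for i in range(1, order + 1):
        if err <= 0.0:               # degenerate (e.g. all-zero) frame
            break
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        a[i] = k
        a[1:i] = prev[1:i] - k * prev[i - 1:0:-1]
        err *= 1.0 - k * k
    return a[1:]

def lpc_to_cepstrum(a, n_ceps):
    """LPCC recursion: c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

def lpcc_features(frame, order=12, n_ceps=12, alpha=0.95):
    x = np.asarray(frame, dtype=float)
    # Pre-emphasis with a first-order FIR filter: y[n] = x[n] - alpha * x[n-1].
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    y = y * np.hamming(len(y))       # Hamming window
    r = np.array([np.dot(y[:len(y) - k], y[k:]) for k in range(order + 1)])
    return lpc_to_cepstrum(levinson_durbin(r, order), n_ceps)
```

Applying `lpcc_features` to every frame of a speech segment yields the sequence of LPCC feature vectors used for matching.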

Once the LPCC feature parameters have been extracted from each frame, the speech signal has been converted into a set of LPCC feature vectors. Speech recognition then consists of pattern matching this set of features against the speech feature vectors of the reference audio data, in order to find the pattern with the shortest distance.

Speech recognition by pattern matching is usually divided into two stages: a training stage and a recognition stage. Standard templates are formed in the training stage; in the recognition stage, the similarity is calculated between the feature vectors of the speech to be recognized (after transmission attenuation) and the vectors of the standard template. In the embodiment of the invention, the standard template formed in the training stage is the feature vector of the audio reference data.

However, because of factors such as attenuation and packet loss during transmission of the audio file, the length of the original speech sequence and that of the transmitted speech sequence may be unequal. To address this, the embodiment of the invention uses a recognition algorithm based on dynamic time warping (DTW) matching for pattern recognition.

In the DTW method provided by the embodiment of the invention, the distance matrix between the input pattern (the audio feature vectors of each speech segment to be recognized) and the reference pattern (the feature vectors of the audio reference data) is calculated first. An optimal path, the one with the minimum accumulated distance, is then found in the distance matrix; this path describes the non-linear relationship between the time scales of the two patterns. The principle of the algorithm is as follows:

Let the input pattern to be recognized and the reference pattern be denoted T and R respectively. To compare their similarity, the distortion D[T, R] between them can be calculated; the smaller the distortion, the higher the similarity. To calculate this distortion, the distortions between corresponding frames of T and R are accumulated. Let N and M be the total numbers of frames in T and R, let n and m be arbitrary frame numbers in T and R, and let D[T(n), R(m)] denote the distortion between these two feature vectors. Then:

When N = M (the T pattern and the R pattern have the same number of frames), frames T(1) and R(1), T(2) and R(2), ..., T(m) and R(m) are matched directly; the distortions D[T(1), R(1)], D[T(2), R(2)], ..., D[T(m), R(m)] are calculated and summed to obtain the total distortion;

When N ≠ M (the T pattern and the R pattern have different numbers of frames), dynamic programming is used to search for a path, as follows. The frame numbers of T (n = 1 to N) are marked on the horizontal axis of a two-dimensional rectangular coordinate system, and the frame numbers of R (m = 1 to M) are marked on the vertical axis, as shown in Figure 3. Each intersection (n, m) of the grid formed by the two axes represents the meeting point of a frame in T with a frame in R. The path search can then be reduced to finding a path through some of the intersections of this grid; the intersections the path passes through give the frame numbers in T and R between which distortion is calculated.

The path cannot be chosen arbitrarily. The speaking rate may vary, but the order of the parts of the speech cannot change, so the selected path must start at the lower-left corner and end at the upper-right corner. In addition, to avoid a blind search, paths that tilt excessively toward the n axis or the m axis can be discarded, because the compression and expansion of real speech is always limited. The maximum and minimum of the average slope at each point on the path can therefore be bounded; usually the maximum slope is set to 2 and the minimum slope to 1/2.

The path cost function defined in this embodiment is $d[(n_i, m_i)]$, which denotes the accumulated frame distortion from the starting point $(n_0, m_0)$ to the current point $(n_i, m_i)$. It is calculated as follows:

$$d[(n_i, m_i)] = D[T(n_i), R(m_i)] + d[(n_{i-1}, m_{i-1})]$$

$$d[(n_{i-1}, m_{i-1})] = \min\{\, d[(n_i - 1,\ m_i)],\ d[(n_i - 1,\ m_i - 1)],\ d[(n_i - 1,\ m_i - 2)] \,\}$$

From the above formulas, the required D[T(ni), R(mi)] values can be obtained. The path cost function defined above is only an example; other path cost algorithms are not excluded.
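The recurrence above can be sketched as a small dynamic program. This is an illustration under stated assumptions: Euclidean distance is used as the per-frame distortion D (the text does not fix a distance measure), the function name is hypothetical, and only the three predecessors named in the recurrence are allowed, which bounds the local slope by 2 but does not by itself enforce the 1/2 lower bound.

```python
import numpy as np

def dtw_distance(T, R):
    """Accumulated distortion along the best path from (1, 1) to (N, M).

    T: array of shape (N, dim), feature vectors of the input pattern
    R: array of shape (M, dim), feature vectors of the reference pattern
    """
    T = np.asarray(T, dtype=float)
    R = np.asarray(R, dtype=float)
    N, M = len(T), len(R)
    # Frame-to-frame distance matrix D[n, m].
    D = np.linalg.norm(T[:, None, :] - R[None, :, :], axis=2)
    d = np.full((N, M), np.inf)
    d[0, 0] = D[0, 0]                 # the path starts in the lower-left corner
    for n in range(1, N):
        for m in range(M):
            best = d[n - 1, m]
            if m >= 1:
                best = min(best, d[n - 1, m - 1])
            if m >= 2:
                best = min(best, d[n - 1, m - 2])
            d[n, m] = D[n, m] + best
    return float(d[N - 1, M - 1])     # the path ends in the upper-right corner
```

A result of infinity means no admissible path exists, for example when R has more than twice as many frames as T. The speech segment with the smallest returned distance is the candidate match for the reference data.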

The video pattern recognition method used in the embodiment of the invention is an image recognition method: each video frame that is played is captured, and each captured frame image is compared with the video frame images in the feature library, in order to find the video frame that matches a video frame image in the feature library. This image recognition process has two main stages: video frame capture and image recognition.

Video frame capture can be implemented with the AVIFile library of the Windows operating system, as follows:

First, the AVIFile library is initialized; the AVI file to be checked for synchronization is then opened and its file interface address is obtained. If the file opens successfully (that is, the video format meets the requirements), the required AVI file information is obtained through the file interface address. This information may include: the maximum data rate of the file (bytes per second), the number of streams, the height (pixels), the width (pixels), the sample rate (samples per second), the file length (frames), the file type, and so on. The interface address of the AVI stream can be obtained from the file interface address, and the AVI stream information is obtained through the stream interface address. Because the audio and video streams are processed separately, the stream information obtained here is that of the video stream only. This information may include: the stream type, the frame rate (fps), the start frame, the end frame, the image quality value, the description of the stream type, and so on;

Next, the obtained video stream information is processed: the corresponding decoding function is called to obtain the address of the decompressed data and the memory address of each frame of data (used for saving it as a BMP file). At this point the required image data has been obtained;

Finally, the file header for this image data is written and the data is saved as the required BMP file. The BMP file is named after the frame number in the AVI video stream. The frame time is obtained by multiplying the current frame number by the frame interval, and the frame interval can be found in the structure dedicated to storing the AVI file information. For example, if the file playback rate is 15 fps, the frame interval is 1/15 s, that is, 66666 µs, so the playback time of each frame relative to the start frame is easily obtained.
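The frame-time arithmetic in the last step can be written out as a short sketch. The function name is hypothetical, and the interval is truncated to whole microseconds so that it reproduces the 66666 figure for 15 fps (1/15 s is about 66666 microseconds).

```python
def frame_time_us(frame_number, fps):
    """Playback time of a frame relative to the start frame, in microseconds."""
    interval_us = 1_000_000 // fps    # frame interval, truncated to whole microseconds
    return frame_number * interval_us
```

Because the interval is truncated, the computed times drift slightly for large frame numbers; a caller needing exact times could compute `frame_number * 1_000_000 / fps` instead.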

After the BMP pictures have been captured from the AVI file (the saved BMP files are known to be 24-bit RGB bitmaps), the next task is to perform image recognition on them. The image recognition process may be: convert the 24-bit RGB color bitmap into an 8-bit binary image to highlight the features of the target object, use pixel statistics and a contour tracing algorithm to obtain the area and perimeter of the target object in the detected image, and compare these with the images in the feature library. Specifically, this can be divided into the following steps:

Step 1: convert the target image (the captured image) to grayscale and obtain the corresponding gray value distribution;

Step 2: perform an iterative calculation on the gray value distribution to compute a threshold;

Step 3: binarize the image according to the threshold (convert it into a black-and-white picture in which white is the background and black is the target object);

Step 4: perform pixel statistics on the binarized image and calculate the area of the target object (the number of pixels);

Step 5: process the image further to trace the contour of the target object;

Step 6: perform pixel statistics and calculate the perimeter of the target object's contour;

Step 7: compare the obtained area and perimeter with the information of the corresponding image stored in the feature library and judge whether this image is the required target image; if so, record its playback time.
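Steps 1 to 6 can be sketched with NumPy as follows. This is a rough illustration, not the patent's algorithm: the luminance weights, the particular iterative threshold rule (repeatedly averaging the means of the two classes) and the perimeter measure (counting object pixels that touch the background through a 4-neighbour) are all assumptions, as are the function names.

```python
import numpy as np

def to_gray(rgb):
    """Step 1: grayscale conversion with common luminance weights (assumed)."""
    rgb = np.asarray(rgb, dtype=float)
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def iterative_threshold(gray, eps=0.5, max_iter=100):
    """Step 2: iterate the threshold toward the midpoint of the two class means."""
    t = float(gray.mean())
    for _ in range(max_iter):
        low, high = gray[gray <= t], gray[gray > t]
        if low.size == 0 or high.size == 0:
            break
        new_t = 0.5 * (low.mean() + high.mean())
        if abs(new_t - t) < eps:
            return float(new_t)
        t = float(new_t)
    return t

def area_and_perimeter(rgb):
    gray = to_gray(rgb)
    obj = gray <= iterative_threshold(gray)   # Step 3: dark pixels are the object
    area = int(obj.sum())                     # Step 4: object pixel count
    # Steps 5-6: a contour pixel is an object pixel with a background 4-neighbour.
    p = np.pad(obj, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    perimeter = int((obj & ~interior).sum())
    return area, perimeter
```

Step 7 would then compare the returned pair with the area and perimeter stored for each image in the feature library, within some tolerance chosen by the implementer.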

In the embodiment of the invention, when the audio-video synchronization is evaluated, the degree to which the audio leads or lags the video can be mapped to a corresponding audio-video synchronization grade and a corresponding MOS score.

The MOS score for audio-video synchronization in the embodiment of the invention follows the scoring algorithm of the ITU-R BT.1359 standard. Its piecewise calculation method is imitated, and thresholds for four audio-video synchronization quality grades are set according to people's subjective perception of audio-video synchronization. The audio-video synchronization scoring model may be as shown in Figure 4: the horizontal axis is the time by which the audio lags the video, the vertical axis is the score, and the points A, B, C, A', B' and C' represent the three grade thresholds that were defined, which divide the evaluation score into four grades. Each audio-video synchronization quality grade corresponds to a MOS score; the maximum score is 4.0, the minimum score is 1.0, and the floating range is 0.3. The audio-video synchronization grades, their thresholds and the corresponding MOS scores may be as shown in Table 1:

Table 1
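As a rough sketch of the grade lookup, the function below maps a measured audio offset to one of the four grades given three thresholds on each side. The function name and sign convention are assumptions, and the threshold values must be supplied by the caller; none are taken from Table 1.

```python
from bisect import bisect_left

def sync_grade(offset_ms, lag_thresholds, lead_thresholds):
    """Return a grade index from 0 (best) to 3 (worst).

    offset_ms > 0 means the audio lags the video; offset_ms < 0 means it leads.
    lag_thresholds and lead_thresholds are ascending lists of three values
    playing the roles of points A, B, C and A', B', C' in the scoring model.
    """
    if offset_ms >= 0:
        return bisect_left(lag_thresholds, offset_ms)
    return bisect_left(lead_thresholds, -offset_ms)
```

With purely hypothetical thresholds such as `[100, 200, 300]` for lag, an offset of 150 ms would fall in grade 1; the MOS score for each grade would then be read from the table.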

To evaluate the audio-video synchronization quality more accurately and objectively, the embodiment of the invention sets up multiple monitoring points to detect the audio-video synchronization and evaluate its quality. When the audio-video synchronization quality is evaluated, the synchronization MOS scores of these monitoring points are added together to obtain the overall synchronization MOS score. The overall synchronization MOS score can serve as an important indicator and be weighted together with the audio MOS and video MOS scores to obtain the MOS score of the overall quality of the video service.

Based on the same technical concept as the audio-video synchronization detection in the embodiment of the invention, the embodiment of the invention also provides an audio-video synchronization detection system. As shown in Figure 5, this system comprises an audio identification module 501, a video identification module 502, a time difference determination module 503 and a synchronization detection module 504, wherein:

The audio identification module 501 can determine, by means of audio pattern recognition, the start playback time of the audio segment that matches the audio reference data in the audio-video file played at the target end;

The video identification module 502 can determine, by means of video pattern recognition, the start playback time of the video frame that matches the video reference data in the audio-video file played at the target end;

The time difference determination module 503 is used to determine the audio-video playback time difference of the audio-video file when played at the target end, according to the start playback time of the audio segment matching the audio reference data determined by the audio identification module 501 and the start playback time of the video frame matching the video reference data determined by the video identification module 502;

The synchronization detection module 504 is used to obtain the audio-video playback time difference of the audio-video file when played at the source end and, according to the obtained time difference and the time difference determined by the time difference determination module 503, to determine the audio-video synchronization of this audio-video file when played at the target end.
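The calculation performed by modules 503 and 504 can be sketched as follows. The function names and the sign convention (a positive value means the audio starts later than the video) are assumptions made for illustration.

```python
def av_time_difference(audio_start, video_start):
    """Audio-video playback time difference at one end (positive: audio is later)."""
    return audio_start - video_start

def delay_variation(src_audio, src_video, dst_audio, dst_video):
    """Change in the audio-video time difference from the source end to the target end."""
    return (av_time_difference(dst_audio, dst_video)
            - av_time_difference(src_audio, src_video))
```

For example, if the matched audio segment and video frame both start at 1.0 s at the source end but at 3.2 s and 3.0 s at the target end, the audio has fallen about 0.2 s behind the video in transit; this value is what would be mapped to a synchronization grade.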

The specific implementation of each function of the above functional modules is similar to the corresponding process in the audio-video synchronization detection flow described earlier and is not repeated here.

Based on the same technical concept as the speech detection in the embodiment of the invention, the embodiment of the invention also provides a speech detection system. As shown in Figure 6, this system comprises a first search module 601, a second search module 602 and a speech segment determination module 603, wherein:

The first search module 601 receives the input audio signal to be tested and searches the audio signal according to the short-time average magnitude of the speech signal. When it finds an audio signal whose short-time average magnitude exceeds the amplitude threshold MH, it searches the audio signal forward from the current time; and when, after that moment, it first finds an audio signal whose short-time average magnitude drops below the amplitude threshold MH, it searches the audio signal backward from the current time;

The second search module 602 is used, when the first search module 601 finds, searching forward and backward, an audio signal whose short-time average magnitude has dropped to the amplitude threshold ML, to continue searching the audio signal in the same direction according to the short-time average zero-crossing rate;

The speech segment determination module 603 is used, when the second search module 602, searching forward, finds an audio signal whose short-time average zero-crossing rate drops below the zero-crossing rate threshold Z0, to take the current time as the start point of the speech segment; and, when the backward search finds an audio signal whose short-time average zero-crossing rate drops below the zero-crossing rate threshold Z0, to take the current time as the end point of the speech segment.

The system may also comprise a threshold setting module 604, used to determine the amplitude threshold MH, the amplitude threshold ML and the zero-crossing rate threshold Z0 according to the short-time average magnitude distribution and the short-time average zero-crossing rate distribution of the audio signals in the speech sample data, wherein an audio signal whose short-time average zero-crossing rate is above the amplitude threshold MH is a speech signal and, among audio signals whose short-time average magnitude is below the amplitude threshold ML, an audio signal whose short-time average zero-crossing rate is below the zero-crossing rate threshold Z0 is not a speech signal.

The specific implementation of each function of the above functional modules is similar to the corresponding process in the speech detection flow described earlier and is not repeated here.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. An audio-video synchronization detection method, characterized in that it comprises the following steps:

2. The method of claim 1, characterized in that obtaining the audio-video playback time difference of said audio-video file when played at the source end comprises:

Determining, in the audio-video file played at the source end, the start playback time of the audio segment that matches said audio reference data and the start playback time of the video frame that matches said video reference data;

Determining the audio-video playback time difference of said audio-video file when played at the source end, according to said start playback time of the audio segment matching the audio reference data and said start playback time of the video frame matching the video reference data.

3. The method of claim 1 or 2, characterized in that said audio reference data is speech data;

The process of determining the start playback time of the audio segment that matches the audio reference data comprises:

Detecting the speech segments contained in the played audio-video file and their start and end playback times;

Determining the speech segment that matches said audio reference data by performing speech recognition processing on the detected speech segments and said audio reference data.

4. The method of claim 3, characterized in that the process of determining the speech segments contained in the played audio-video file and their start and end playback times comprises:

Searching the audio signal in the played audio-video file according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, searching the audio signal forward from the current time; and when, after that moment, an audio signal whose short-time average magnitude drops below the first amplitude threshold is first found, searching the audio signal backward from the current time;

5. The method of claim 4, characterized in that said first amplitude threshold, second amplitude threshold and zero-crossing rate threshold are determined according to the short-time average magnitude distribution and the short-time average zero-crossing rate distribution of the audio signals in the speech sample data, wherein an audio signal whose short-time average zero-crossing rate is above the first amplitude threshold is a speech signal and, among audio signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is below the zero-crossing rate threshold is not a speech signal.

6. The method of claim 3, characterized in that the process of determining the speech segment that matches said audio reference data comprises:

Determining the similarity between each speech segment and said speech reference data by calculating the spatial distance between them, according to the feature vectors of the audio signal of each speech segment and the feature vectors of said speech reference data;

According to the determined similarities, taking the speech segment most similar to said speech reference data as the speech segment that matches said audio reference data.

7. The method of claim 6, characterized in that, when the number of audio frames of the speech segment and the number of audio frames of the audio reference data are unequal, the process of calculating the distance between the speech segment and said speech reference data is specifically:

Mapping the frame number of each audio frame of said speech segment onto the horizontal axis of a two-dimensional rectangular coordinate system and the frame number of each audio frame of the audio reference data onto the vertical axis of this coordinate system, and determining a path in the direction from the lower-left corner to the upper-right corner of said coordinate system; determining, according to the coordinate points the path passes through, the frame number of the audio reference data that corresponds to each frame number in said speech segment;

According to the determined correspondence between frame numbers, using the feature vectors of the audio signal to calculate the distortion between each pair of corresponding audio frames and, according to the calculated distortions, determining the spatial distance between said speech segment and said audio reference data.

8. The method of claim 7, characterized in that the slope of said path, determined in the direction from the lower-left corner to the upper-right corner of said coordinate system, at each intersection of the frame numbers identified on the vertical and horizontal axes does not exceed a first slope threshold and is not less than a second slope threshold, said first slope threshold being greater than the second slope threshold.

9. The method of claim 1 or 2, characterized in that the process of determining the start playback time of the video frame that matches the video reference data comprises:

Extracting the video frames contained in the played audio-video file;

Determining the video frame that matches said video reference data and its start playback time by performing image recognition processing on the extracted video frames and said video reference data.

10. The method of claim 1, characterized in that determining the audio-video synchronization of said audio-video file comprises:

Determining the change in audio-video delay that arises when said audio-video file is played at the target end relative to when it is played at the source end;

According to the determined change in audio-video delay, determining the corresponding audio-video synchronization quality grade or score.

11. An audio-video synchronization detection system, characterized in that it comprises:

12. The system of claim 11, characterized in that, when said synchronization detection module obtains the audio-video playback time difference of said audio-video file when played at the source end, it determines, in the audio-video file played at the source end, the start playback time of the audio segment that matches said audio reference data and the start playback time of the video frame that matches said video reference data; it then determines the audio-video playback time difference of said audio-video file when played at the source end, according to said start playback time of the audio segment matching the audio reference data and said start playback time of the video frame matching the video reference data in the audio-video file played at the source end.

13. The system of claim 12, characterized in that said audio reference data is speech data;

The process by which said audio identification module or said synchronization detection module determines the start playback time of the audio segment that matches the audio reference data comprises:

14. The system of claim 13, characterized in that the process by which said audio identification module or said synchronization detection module determines the speech segments contained in the played audio-video file and their start and end playback times comprises:

15. The system of claim 13, characterized in that the process by which said audio identification module determines the speech segment that matches said audio reference data comprises:

16. The system of claim 15, characterized in that, when the number of audio frames of the speech segment and the number of audio frames of the audio reference data are unequal, the process by which said audio identification module calculates the distance between the speech segment and said speech reference data is specifically:

17. The system of claim 12, characterized in that the process by which said video identification module determines the start playback time of the video frame that matches the video reference data comprises:

18. The system of claim 11, characterized in that said synchronization detection module determining the audio-video synchronization of said audio-video file comprises:

19. A speech detection method, characterized in that it comprises the following steps:

20. The method of claim 19, characterized in that said first amplitude threshold, second amplitude threshold and zero-crossing rate threshold are determined according to the short-time average magnitude distribution and the short-time average zero-crossing rate distribution of the audio signals in the speech sample data, wherein an audio signal whose short-time average zero-crossing rate is above the first amplitude threshold is a speech signal and, among audio signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is below the zero-crossing rate threshold is not a speech signal.

21. A speech detection system, characterized in that it comprises:

22. The system of claim 21, characterized in that it further comprises:

A threshold setting module, used to determine said first amplitude threshold, second amplitude threshold and zero-crossing rate threshold according to the short-time average magnitude distribution and the short-time average zero-crossing rate distribution of the audio signals in the speech sample data, wherein an audio signal whose short-time average zero-crossing rate is above the first amplitude threshold is a speech signal and, among audio signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is below the zero-crossing rate threshold is not a speech signal.