
Audio/video synchronization detection method and system, and voice detection method and system

Info

Publication number
CN102056026A
Authority
CN (China)
Prior art keywords
audio
video
short
time
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009102374145A
Other languages
Chinese (zh)
Other versions
CN102056026B (en)
Inventor
陈欣伟
方力
沈亮
高屹
常静
侯优优
阮征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Group Design Institute Co Ltd
Original Assignee
China Mobile Group Design Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Group Design Institute Co Ltd
Priority to CN2009102374145A
Publication of CN102056026A
Application granted
Publication of CN102056026B
Status: Active
Anticipated expiration

Abstract

(Translated from Chinese)

The invention discloses an audio/video synchronization detection method and system, and a speech detection method and system. The audio/video synchronization detection method includes: determining, in the audio/video file played at the target end, the start playback time of the audio segment matching the audio reference data, and the start playback time of the video frame matching the video reference data; determining, from these two start playback times, the audio/video playback time difference of the file when played at the target end; and obtaining the audio/video playback time difference of the file when played at the source end, and determining, from the audio/video playback time differences of the file when played at the source end and at the target end, the audio/video synchronization of the file when played at the target end. Adopting the invention improves the accuracy of audio/video synchronization detection.

Description

Audio/video synchronization detection method and system, and speech detection method and system
Technical field
The present invention relates to audio and video detection techniques in the communications field, and in particular to an audio/video synchronization detection method and system, and a speech detection method and system.
Background art
In mobile communication video services, audio and video carry no timing information in the encoding process, so obtaining audio/video synchronization information becomes quite difficult.
One approach is to add timing information to the audio packets and video packets after encoding; after the encoded audio/video file reaches the receiving end over the network, the receiving end parses the received file, extracts the timing information carried in the audio and video packets, and then judges the audio/video synchronization from the parsed timing information.
However, this synchronization detection method has the following problems:
(1) Although audio and video each carry timing information after packetization, the timing information of the two streams has no direct correspondence; moreover, the frame lengths and packet sizes of audio and video differ, so the relative delay between audio and video cannot be determined accurately;
(2) A synchronization result derived from the timing information carried in the audio and video packet headers reflects only the network transmission delay. During actual playback, the player at the receiving end buffers the decoded audio and video streams and adjusts their synchronization, so a result derived from packet-header timing cannot reflect the synchronization after the player's adjustment; that is, detection performed this way is inaccurate.
Summary of the invention
Embodiments of the invention provide an audio/video synchronization detection method and system, to solve the problem of the low accuracy of existing audio/video synchronization detection.
The technical solution provided by the embodiments of the invention comprises:
An audio/video synchronization detection method, comprising the steps of:
determining, in the audio/video file played at the target end, the start playback time of the audio segment matching the audio reference data, and the start playback time of the video frame matching the video reference data;
determining, from the start playback time of the audio segment matching the audio reference data and the start playback time of the video frame matching the video reference data, the audio/video playback time difference of the audio/video file when played at the target end;
obtaining the audio/video playback time difference of the audio/video file when played at the source end, and determining, from the audio/video playback time differences of the file when played at the source end and at the target end, the audio/video synchronization of the file when played at the target end.
An audio/video synchronization detection system, comprising:
an audio identification module, configured to determine, in the audio/video file played at the target end, the start playback time of the audio segment matching the audio reference data;
a video identification module, configured to determine, in the audio/video file played at the target end, the start playback time of the video frame matching the video reference data;
a time-difference determination module, configured to determine the audio/video playback time difference of the audio/video file when played at the target end, from the start playback time of the matching audio segment determined by the audio identification module and the start playback time of the matching video frame determined by the video identification module;
a synchronization detection module, configured to obtain the audio/video playback time difference of the audio/video file when played at the source end, and to determine, from the obtained time difference and the time difference determined by the time-difference determination module, the audio/video synchronization of the audio/video file when played at the target end.
In the above embodiments of the invention, for the audio/video file played at the target end, the start playback time of the audio segment matching the audio reference data and the start playback time of the video frame matching the video reference data are determined, giving the audio/video playback time difference at the target end; this is then compared with the audio/video playback time difference of the same file when played at the source end, determining the audio/video synchronization of the file when played at the target end. Compared with the prior art, the synchronization detection of the embodiments does not rely on the timing information in the audio and video packets but performs detection on the file actually played at the target end, while also taking into account the factors that adjust audio/video synchronization during decoding at the target end; the resulting detection is therefore more accurate. The method is particularly suitable for detecting the audio/video synchronization of audio and video after network transmission.
The embodiments of the invention also provide a speech detection method and system, to solve the problem of the low accuracy of speech detection in the prior art.
The technical solution provided by the embodiments of the invention comprises:
A speech detection method, comprising the steps of:
searching for an audio signal in the audio under test according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, searching forward from the current moment, and when the short-time average magnitude first drops below the first amplitude threshold after that moment, searching backward from the current moment;
when an audio signal whose short-time average magnitude drops to a second amplitude threshold is found in the forward or backward search, continuing to search in the original direction according to the short-time average zero-crossing rate, the second amplitude threshold being smaller than the first amplitude threshold;
when, searching forward, an audio signal whose short-time average zero-crossing rate drops below a zero-crossing-rate threshold is found, taking the current moment as the start of the speech segment, and when, searching backward, an audio signal whose short-time average zero-crossing rate drops below the zero-crossing-rate threshold is found, taking the current moment as the end of the speech segment.
A speech detection system, comprising:
a first search module, configured to search for an audio signal in the audio under test according to the short-time average magnitude of the speech signal, to search forward from the current moment when an audio signal whose short-time average magnitude exceeds the first amplitude threshold is found, and to search backward from the current moment when the short-time average magnitude first drops below the first amplitude threshold after that moment;
a second search module, configured to continue searching in the original direction according to the short-time average zero-crossing rate when the first search module, searching forward or backward, finds an audio signal whose short-time average magnitude has dropped to the second amplitude threshold, the second amplitude threshold being smaller than the first amplitude threshold;
a speech-segment determination module, configured to take the current moment as the start of the speech segment when the second search module, searching forward, finds an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing-rate threshold, and to take the current moment as the end of the speech segment when, searching backward, it finds an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing-rate threshold.
In the above embodiments of the invention, the speech detection exploits the facts that the short-time average magnitude identifies speech segments effectively when background noise is low, while the short-time average zero-crossing rate identifies them effectively when background noise is high: both features of the speech signal are taken into account, and on top of a detection method based on the short-time average magnitude, the short-time average zero-crossing rate of the speech signal is also examined, so that the speech endpoints are detected with the dual features of amplitude and zero-crossing rate, making the detected speech segment endpoints more accurate.
Brief description of the drawings
Fig. 1 is a schematic flowchart of audio/video synchronization detection in an embodiment of the invention;
Fig. 2 is a schematic flowchart of IP-network video telephony audio/video synchronization detection in an embodiment of the invention;
Fig. 3 is a schematic diagram of the dynamic path search in the speech recognition process in an embodiment of the invention;
Fig. 4 is a schematic diagram of the audio/video synchronization scoring model in an embodiment of the invention;
Fig. 5 is a schematic structural diagram of the audio/video synchronization detection system in an embodiment of the invention;
Fig. 6 is a schematic structural diagram of the speech detection system in an embodiment of the invention.
Detailed description of the embodiments
To address the above problems in the prior art, embodiments of the invention provide an audio/video synchronization detection method and system that perform synchronization detection by pattern recognition: at the sending end and at the receiving end, the played audio/video file is pattern-matched against reference data for that file; the start playback times of the audio segment matching the audio reference data and of the video frame matching the video reference data are recorded at each end; the audio/video playback time differences at the sending end and the receiving end are thus obtained; and the delay difference is computed by comparing the two, giving the audio/video synchronization of the file as played at the receiving end.
In the embodiments, before synchronization detection, audio reference data and video reference data are prepared; they are used during detection to locate the audio and video reference points in the audio/video file, from which the synchronization parameters are determined. The audio reference data may be audio waveform data, the video reference data may be video image data, and both may be stored in feature libraries in advance.
Referring to Fig. 1, a schematic flowchart of audio/video synchronization detection in an embodiment of the invention. The flow can be applied to assess the influence of network transmission on audio/video synchronization, or the influence of different playback terminals. In the former case, the source end is the sending end of the audio/video file and the target end is the receiving end the file arrives at after network transmission; in the latter, the source end can be a playback terminal with good audio/video synchronization quality, and the target end is the playback terminal whose synchronization quality is to be evaluated. The flow comprises the steps of:
Step 101: using an audio pattern recognition method, find, in the audio/video file played at the target end, the audio segment matching the audio reference data, and record its start playback time;
Step 102: using a video pattern recognition method, find, in the audio/video file played at the target end, the video frame matching the video reference data, and record its start playback time;
Step 103: from the recorded start playback time of the audio segment matching the audio reference data and the recorded start playback time of the video frame matching the video reference data, determine the audio/video playback time difference;
Step 104: from the time difference determined in step 103 and the audio/video playback time difference, when the file is played at the source end, between the audio segment matching the audio reference data and the video frame matching the video reference data, determine the audio/video synchronization of the file as played at the target end, e.g. the change in synchronization delay at the target end relative to the source end (such as the change in the length of time by which the audio leads or lags the video); the synchronization result can further be mapped to a corresponding audio/video synchronization quality grade.
In steps 101 and 102, the recorded time may be the target end's current system time, or the time relative to the playback start of the file. Steps 101 and 102 have no strict ordering: they may be exchanged, or executed in parallel.
Usually the audio reference data and the video reference data correspond one-to-one, and to make the detection more accurate there are generally several such pairs. With multiple pairs, the playback time differences determined in step 103 also correspond to the reference pairs one-to-one: for an audio reference datum, the start playback time of its matching audio segment is determined; for the video reference datum paired with it, the start playback time of its matching video frame is determined; and the difference of the two is the audio/video playback time difference for that reference pair. The source-end time differences used in step 104 are obtained in the same way.
The audio/video time differences of the file used for detection can be obtained at the sending end in advance, in the manner described above; in each subsequent detection using this file, the pre-measured sending-end time difference is compared directly with the receiving-end time difference to determine the synchronization of the file after transmission.
Generally, for accurate detection, the audio and video reference data should have distinctive features that are easy to recognize and pattern-match, and the audio/video file used for detection should contain an audio segment matching the audio reference data and a video frame matching the video reference data. Preferably, in the file used for detection, the start playback time of the audio segment matching the audio reference data and the start playback time of the video frame matching the corresponding video reference data are identical at the sample level, i.e. the audio/video time difference is 0. In that case, since the source-end playback time difference in step 104 is 0, the time difference determined in step 103 directly gives the synchronization of the file as played at the target end.
Take IP-network video telephony synchronization detection as an example. The audio/video file used for the detection contains, in the audio, the spoken digits 1, 2, 3, 4, 5, and, in the video, pictures showing 5 different hand gestures against a solid background; during playback, each time a digit is spoken, the corresponding gesture is shown in the picture. The audio reference data are the audio waveform data of each spoken digit among 1-5, stored in an audio feature library; the video reference data are the image data of each of the 5 hand gestures against the solid background, stored in a video feature library. When this file is played at the sending end, the synchronization time difference between each spoken digit and the corresponding gesture picture is known. During network transmission, the audio and the video of the file are transmitted separately, forming a WAV audio file and an AVI video file at the receiving end. As shown in Fig. 2, the process of detecting the synchronization of this file at the receiving end can comprise the steps of:
obtain the WAV audio file from the audio/video received by the receiving end (step 201); determine the endpoints of each speech segment from the audio signal, thereby finding the speech segments (step 202); using the audio pattern recognition method, compare each speech segment with the speech data of each spoken digit in the audio feature library, identifying the speech segments of the spoken digits 1, 2, 3, 4, 5 (step 203); and record the start and end playback times of these speech segments, so that the receiving end records the times of at least 5 audio segments as the spoken digits in the WAV file (correspondingly more if digits repeat) (step 204);
obtain the AVI video file from the audio/video received by the receiving end (step 205); extract every frame image of the AVI file (step 206); using the video pattern recognition method, compare each video frame image with the image data of the gestures in the video feature library, identifying the video frames of the gestures, usually taking only the first frame identified for each gesture (step 207); and record the start playback times of these video frames, so that the receiving end records the times of at least 5 video frames as the gesture pictures in the AVI file (correspondingly more if gestures repeat) (step 208);
subtract the recorded start playback time of the gesture frame corresponding to digit 1 from the recorded start playback time of the spoken digit 1 to obtain the audio/video playback time difference for digit 1 (all recorded times are referenced to the system time of the receiving end), and so on, obtaining the audio/video playback time differences for the other digits (step 209);
compare the audio/video playback time differences obtained in step 209 with the known playback time differences of this file at the sending end, determining the audio/video delay of the file at the receiving end relative to the sending end (step 210);
determine, from the result of step 210, the corresponding audio/video synchronization quality grade or MOS score (step 211).
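As a concrete illustration of steps 209 and 210, the following Python sketch (all names and numbers are illustrative, not taken from the patent) pairs each digit's recorded speech start time with the start time of its matching gesture frame and compares the receiving-end differences with the known sending-end ones:

```python
def av_time_diffs(speech_starts, gesture_starts):
    """Map digit -> audio/video playback time difference (seconds, receiving-end system time)."""
    return {d: speech_starts[d] - gesture_starts[d] for d in speech_starts}

def delay_vs_source(target_diffs, source_diffs):
    """Audio/video delay of the received file relative to the sending end, per digit."""
    return {d: target_diffs[d] - source_diffs[d] for d in target_diffs}

# Example values only: digit 1 spoken at 10.32 s, its gesture frame shown at 10.20 s, etc.
target = av_time_diffs({1: 10.32, 2: 12.51}, {1: 10.20, 2: 12.38})
print(delay_vs_source(target, {1: 0.0, 2: 0.0}))  # ≈ {1: 0.12, 2: 0.13}: audio lags by ~120-130 ms
```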
Regarding the choice of the audio reference data in the embodiments: since a person's subjective perception is relatively sensitive to asynchrony between the start (silence to sound) or end (sound to silence) of the audio and the picture content, the audio references are preferably placed at speech segments (e.g. the speech segments of the spoken digits 1-5). Therefore, when determining the audio segment matching the audio reference data, the endpoint positions of each speech segment in the audio waveform of the file are detected first, and the detected speech segments are then matched against the audio reference data by audio pattern recognition.
To detect the speech segments in the audio file, the embodiments could use the traditional waveform-based detection method built on short-time energy or short-time average magnitude. That traditional method is in essence a single-threshold detector; to obtain a method that is more adaptive than it and extracts the audio timing more accurately, the embodiments improve on the traditional speech detection method and use the improved method for speech detection. The improved method exploits the facts that the short-time average magnitude identifies speech segments effectively when background noise is low, while the short-time average zero-crossing rate identifies them effectively when background noise is high: on top of the short-time-average-magnitude detector, it also examines the short-time average zero-crossing rate of the speech signal, detecting the speech endpoints with both the amplitude and the zero-crossing-rate features.
The basis for these judgments is that the short-time parameters of speech of different natures have different probability density functions, and that adjacent frames of speech should have consistent characteristics, i.e. they do not jump abruptly among voiced, unvoiced, and silent. Usually, voiced speech has the largest short-time average magnitude and silence the smallest; unvoiced speech has the largest short-time average zero-crossing rate, silence is in the middle, and voiced speech has the smallest.
In the speech detection method adopted by the embodiments, two amplitude thresholds MH and ML (MH > ML) and a short-time zero-crossing-rate threshold Z0 are first determined from empirical values. MH should be set rather high, so that when the short-time average magnitude M of a frame of the speech signal exceeds MH, it is certain that the frame is not silence and very likely that it is voiced. When M falls to ML, the short-time average zero-crossing rate takes over the judgment: when the short-time average zero-crossing rate of the signal falls below the threshold Z0, that point can be taken as an endpoint (start or end) of the speech segment.
The short-time average magnitude and short-time average zero-crossing rate can be analyzed statistically over a large number of speech samples, and the amplitude thresholds MH and ML determined from the short-time average magnitudes of the actual samples. The procedure for determining the amplitude threshold MH from speech samples is:
window the data in each speech sample into frames; based on human physiological characteristics and the results of extensive statistics, the window length is generally set to 20 ms and the step to half the window length, so that the total number of frames = total number of samples / step;
compute the short-time average magnitude of each frame according to:

$$M_m = \sum_{n=m}^{N+m-1} \left| S_w(n-m) \right|$$

compute the short-time average zero-crossing rate of each frame according to:

$$Z_m = \frac{1}{2} \sum_{n=m}^{N+m-1} \left| \operatorname{sgn}[s_w(n)] - \operatorname{sgn}[s_w(n-1)] \right|$$
traverse all speech frames in each sample for statistical analysis, obtaining the distributions of the short-time average magnitude and short-time average zero-crossing rate of the speech samples;
from these distributions, set the threshold MH according to the short-time average magnitude of the quiet periods, choosing it on the large side so that any part of a sample whose short-time average magnitude exceeds MH is guaranteed to be speech; then take three times the short-time average zero-crossing rate of the silent periods as the zero-crossing-rate threshold Z0 for the speech segments.
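As an illustration of the per-frame quantities defined above, here is a minimal Python sketch (an assumption-laden illustration, not the patent's implementation) that computes the short-time average magnitude M_m and short-time average zero-crossing rate Z_m with a 20 ms Hamming window and a half-window step:

```python
import numpy as np

def short_time_features(x, fs, win_ms=20):
    """x: mono signal as a 1-D float array; fs: sample rate in Hz."""
    n = int(fs * win_ms / 1000)       # window length N in samples
    step = n // 2                     # step = half the window length, per the text
    w = np.hamming(n)                 # window the frame before summing
    mags, zcrs = [], []
    for m in range(0, len(x) - n, step):
        frame = x[m:m + n] * w
        mags.append(np.sum(np.abs(frame)))                        # M_m
        zcrs.append(0.5 * np.sum(np.abs(np.sign(frame[1:])
                                        - np.sign(frame[:-1]))))  # Z_m
    return np.array(mags), np.array(zcrs)
```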
Given the amplitude thresholds MH and ML and the short-time zero-crossing-rate threshold Z0, the speech detection process of the embodiments is:
determine two time points A1 and A2 in the audio signal under test from MH: when the short-time average magnitude M of the signal exceeds MH, record that moment as A1, and record as A2 the first moment after A1 at which M drops back to MH; the span between A1 and A2 can essentially be taken as speech;
continue searching in the signal before A1 and after A2: searching forward (toward earlier time), when M decreases from its value at A1 down to ML, record the current moment as B1; likewise, searching backward (toward later time), when M decreases from its value at A2 down to ML, record the current moment as B2; the span between B1 and B2 can still be taken as speech;
continue searching forward from B1 and backward from B2: searching forward, as long as the short-time zero-crossing rate Z of the signal stays above Z0, the signal is still considered speech; when Z drops below Z0, record the current moment as C1 and take it as the start of the speech segment; likewise, searching backward, when Z drops below Z0, record the current moment as C2 and take it as the end of the segment;
and so on, detecting every speech segment and its start and end points in the audio signal of the file.
The reason for adopting this algorithm is that before B1 and after B2 there may be unvoiced consonant segments whose energy is quite weak; short-time average magnitude alone cannot distinguish them from silence, but their short-time average zero-crossing rate is distinctly higher than that of silence, so that parameter is sufficient to find the boundary between the two, i.e. the true start and end of the speech.
This algorithm is suited not only to the speech segment detection in the embodiments but also to other application scenarios that need to detect the speech segments in an audio signal.
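A hedged sketch of this double-threshold search (A1/A2, then B1/B2, then C1/C2), operating on the per-frame arrays from the previous sketch; the thresholds MH, ML, and Z0 are assumed to be given, e.g. from the sample statistics described earlier:

```python
def find_segment(mags, zcrs, MH, ML, Z0, start=0):
    """Return (start_frame, end_frame) of the first speech segment at or after `start`."""
    n = len(mags)
    a1 = next((i for i in range(start, n) if mags[i] > MH), None)
    if a1 is None:
        return None                           # no frame exceeds MH: no segment
    a2 = next((i for i in range(a1 + 1, n) if mags[i] <= MH), n - 1)
    b1 = a1                                   # search earlier while M stays above ML
    while b1 > 0 and mags[b1 - 1] > ML:
        b1 -= 1
    b2 = a2                                   # search later while M stays above ML
    while b2 < n - 1 and mags[b2 + 1] > ML:
        b2 += 1
    c1 = b1                                   # extend while ZCR stays above Z0
    while c1 > 0 and zcrs[c1 - 1] > Z0:
        c1 -= 1
    c2 = b2
    while c2 < n - 1 and zcrs[c2 + 1] > Z0:
        c2 += 1
    return c1, c2                             # frame indices of the segment endpoints
```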
After the timing of the speech segments is obtained, the segments must still be pattern-recognized to determine which of them match the audio reference data. The embodiments use linear prediction cepstral coefficients (LPCC) for audio pattern recognition.
Extracting the LPCC feature parameters has four main steps: preprocessing, autocorrelation computation, solving the linear prediction coefficient (LPC) normal equations with the Durbin algorithm, and the LPCC recursion. In preprocessing, pre-emphasis boosts the high frequencies of the speech signal with a first-order FIR filter, compensating the attenuation of the high-frequency spectrum caused by glottal excitation and by lip and nostril radiation; a Hamming window is the preferred window function for this algorithm.
After the LPCC feature parameters are extracted frame by frame, the speech signal becomes a sequence of LPCC feature vectors. Speech recognition then pattern-matches this feature sequence against the speech feature vectors of the reference audio data, seeking the pattern with the shortest distance.
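For illustration, a compact Python sketch of LPCC extraction for a single pre-emphasized, Hamming-windowed frame: the Levinson-Durbin recursion solves for the LPC coefficients, followed by the standard LPC-to-cepstrum recursion. The orders p and q are assumptions, not values fixed by the patent:

```python
import numpy as np

def lpcc(frame, p=12, q=12):
    """frame: one pre-emphasized, windowed frame (1-D float array). Returns q LPCC values."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:p + 1]
    a = np.zeros(p + 1); a[0] = 1.0            # polynomial A(z) = 1 + a_1 z^-1 + ...
    e = r[0] + 1e-12                           # prediction error (tiny regularizer)
    for i in range(1, p + 1):                  # Levinson-Durbin recursion
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i + 1] += k * a[i - 1::-1][:i]     # a_j += k * a_{i-j}; sets a_i = k
        e *= (1.0 - k * k)
    lpc = -a[1:]                               # predictor coeffs, H(z) = 1/(1 - sum lpc_i z^-i)
    c = np.zeros(q + 1)
    for m in range(1, q + 1):                  # LPC -> cepstrum recursion
        acc = lpc[m - 1] if m <= p else 0.0
        for j in range(max(1, m - p), m):
            acc += (j / m) * c[j] * lpc[m - j - 1]
        c[m] = acc
    return c[1:]
```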
Pattern-matching speech recognition usually has two phases: a training phase and a recognition phase. Templates are formed in the training phase; in the recognition phase, the feature vectors of the speech to be recognized, known to have undergone transmission attenuation, are compared for similarity with the template vectors. In the embodiments, the templates formed by the training phase are the feature vectors of the audio reference data.
However, considering factors such as attenuation and packet loss of the audio file during transmission, the speech sequence after transmission may not be equal in length to the original; to address this, the embodiments perform pattern recognition with a DTW recognition algorithm based on dynamic time warping.
In the DTW method provided by the embodiments, the distance matrix between the input pattern (the audio feature vectors of each speech segment to be recognized) and the reference pattern (the feature vectors of the audio reference data) is computed first; then an optimal path, the path of minimum accumulated distance, is found in the distance matrix; this path is exactly the nonlinear relation between the time scales of the two patterns. The algorithm principle is as follows:
Suppose the input pattern to be recognized and the reference pattern are denoted T and R. To compare their similarity, the distortion D[T, R] between them can be computed: the smaller the distortion, the higher the similarity. The distortion is accumulated from the distortions between corresponding frames of T and R. Let N and M be the numbers of frames in T and R, let n and m be arbitrary frame indices in T and R, and let D[T(n), R(m)] denote the distortion between the two feature vectors; then:
when N = M (T and R have the same number of frames), frames T(1) and R(1), T(2) and R(2), ..., T(m) and R(m) are matched directly; the distortions D[T(1), R(1)], D[T(2), R(2)], ..., D[T(m), R(m)] are computed and summed to give the total distortion;
when N ≠ M (T and R have different numbers of frames), dynamic programming is used for the path search. Specifically, each frame index of T (n = 1..N) is marked on the horizontal axis of a two-dimensional rectangular coordinate system, and each frame index of R (m = 1..M) on the vertical axis, as shown in Fig. 3. Each intersection (n, m) of the resulting grid represents the pairing of some frame of T with some frame of R; the path search then amounts to finding a path through these intersections, and the intersections the path passes through are the frame indices of T and R between which distortions are computed.
The path is not arbitrary. The speaking rate may vary, but the order of the parts cannot change, so the chosen path must start at the lower-left corner and end at the upper-right corner. Further, to prevent an aimless search, paths that lean excessively toward the n axis or the m axis can be pruned, because the compression and expansion of real speech are always limited; this is done by bounding the maximum and minimum average slope of the path at each point it passes through, usually with the maximum slope set to 2 and the minimum slope to 1/2.
The path cost function defined in this embodiment is d[(ni, mi)], meaning the accumulated frame distortion from the start (n0, m0) to the current point (ni, mi), computed as:

$$d[(n_i, m_i)] = D[T(n_i), R(m_i)] + d[(n_{i-1}, m_{i-1})]$$

$$d[(n_{i-1}, m_{i-1})] = \min\left\{ d[(n_{i-1}, m_i)],\; d[(n_{i-1}, m_{i-1})],\; d[(n_{i-1}, m_{i-2})] \right\}$$

From the above formulas, the required accumulated distortion can be obtained. The path cost function defined above is only an example; other path cost algorithms are not excluded.
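The recursion above translates directly into a small dynamic program. In the following hedged sketch, the frame distortion D is taken as the Euclidean distance between feature vectors, which the patent does not fix; the predecessors are restricted to (n-1, m), (n-1, m-1), and (n-1, m-2), as in the formulas:

```python
import numpy as np

def dtw_distance(T, R):
    """T: (N, d) input feature vectors; R: (M, d) reference feature vectors."""
    N, M = len(T), len(R)
    D = np.linalg.norm(T[:, None, :] - R[None, :, :], axis=2)  # frame distortion matrix
    d = np.full((N, M), np.inf)
    d[0, 0] = D[0, 0]                          # path starts at the lower-left corner
    for n in range(1, N):
        for m in range(M):
            prev = min(d[n - 1, m],
                       d[n - 1, m - 1] if m >= 1 else np.inf,
                       d[n - 1, m - 2] if m >= 2 else np.inf)
            d[n, m] = D[n, m] + prev
    return d[N - 1, M - 1]                     # accumulated distortion; smaller = more similar
```

The speech segment whose accumulated distortion against a reference template is smallest would then be taken as the match for that template.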
The video pattern recognition method adopted by the embodiments is an image recognition method: each played video frame is captured, and each captured frame image is compared with the video frame images in the feature library to find the frames that match the library images. The image recognition process has two main stages: video frame capture and image recognition.
Video frame capture can be implemented with the AVIFile library of the Windows operating system, as follows:
first, initialize the AVIFile library, then open the AVI file to be detected and obtain its file interface address; if the file opens successfully (i.e. the video format meets the requirements), obtain the required AVI file information through the file interface address, which can include: maximum data rate of the file (bytes per second), number of streams, height (pixels), width (pixels), sample rate (samples per second), length (frames), file type, etc.; the interface address of the AVI stream can be obtained from the file interface address, and the AVI stream information from the stream interface address; since the audio and video streams are processed separately, the stream information obtained here is only that of the video stream, and can include: stream type, frame rate (fps), start frame, end frame, image quality value, stream type description, etc.;
then, process the obtained video stream information, calling the corresponding decoding functions to obtain the address of the decompressed data and the memory address of each frame (used for saving to a BMP file); at this point, the required image data information has been obtained;
finally, rewrite the header of the image data and save it as the required BMP file. The BMP file is named after the frame number in the AVI video stream; the frame time is obtained by multiplying the current frame number by the frame interval, where the frame interval can be found in the structure dedicated to storing the AVI file information. For example, if the playback rate of the file is 15 fps, the frame interval is 1/15 s ≈ 66.7 ms, so the playback time of each frame relative to the start frame is easily obtained.
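The frame-timestamp arithmetic described above is trivial but worth making concrete; a sketch (parameter values are examples only):

```python
def frame_times(num_frames, fps=15.0, start_frame=0):
    """Playback time of each frame relative to the start frame, in seconds."""
    interval = 1.0 / fps                      # e.g. 1/15 s ≈ 66.7 ms at 15 fps
    return {f: (f - start_frame) * interval
            for f in range(start_frame, start_frame + num_frames)}

print(frame_times(3))  # {0: 0.0, 1: ~0.0667, 2: ~0.1333}
```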
After the BMP pictures are captured from the AVI file, the saved BMP files are known to be 24-bit RGB bitmaps, and the further work is to perform image recognition on the BMP pictures. The image recognition process can be: convert the 24-bit RGB color bitmap into a binary image to bring out the features of the target object; use pixel statistics and a contour tracing algorithm to obtain the area and perimeter of the target object in the detected image; and compare them with the images in the feature library. Specifically, this divides into the following steps:
Step 1: convert the target image (i.e. the captured frame) to grayscale, obtaining its gray-value distribution;
Step 2: iterate on the gray-value distribution to compute a threshold;
Step 3: binarize the image with the threshold (i.e. convert it into a black-and-white picture, white being the background and black the target object);
Step 4: run pixel statistics on the binarized image to compute the area of the target object (its number of pixels);
Step 5: perform the next processing step, tracing out the contour of the target object;
Step 6: run pixel statistics to compute the perimeter of the target object's contour;
Step 7: compare the obtained area and perimeter with the information of the corresponding images stored in the feature library, and judge whether this image is the required target image; if so, record its playback time.
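A hedged NumPy-only sketch of steps 1-7; the iterative (isodata-style) threshold, the neighbor-based perimeter measure, and the comparison tolerance are illustrative choices that the patent does not specify:

```python
import numpy as np

def gray(img24):                        # img24: (H, W, 3) uint8 RGB bitmap
    return img24.astype(float) @ np.array([0.299, 0.587, 0.114])

def iterative_threshold(g, eps=0.5):    # step 2: iterate to a gray-value threshold
    t = g.mean()
    while True:
        lo, hi = g[g <= t], g[g > t]
        if lo.size == 0 or hi.size == 0:
            return t
        t_new = 0.5 * (lo.mean() + hi.mean())
        if abs(t_new - t) < eps:
            return t_new
        t = t_new

def area_perimeter(img24):
    g = gray(img24)                     # step 1
    obj = g <= iterative_threshold(g)   # step 3: black (dark) pixels = target object
    area = int(obj.sum())               # step 4: area as pixel count
    pad = np.pad(obj, 1)                # steps 5-6: boundary pixels approximate the contour
    interior = (pad[:-2, 1:-1] & pad[2:, 1:-1] & pad[1:-1, :-2] & pad[1:-1, 2:])
    perimeter = int((obj & ~interior).sum())
    return area, perimeter

def matches(img24, ref_area, ref_perim, tol=0.1):   # step 7: compare with library entry
    a, p = area_perimeter(img24)
    return abs(a - ref_area) <= tol * ref_area and abs(p - ref_perim) <= tol * ref_perim
```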
In the embodiments, when the audio/video synchronization is evaluated, the degree by which the audio leads or lags the video can be mapped to a corresponding audio/video synchronization grade and MOS score.
The audio/video synchronization MOS score in the embodiments refers to the scoring algorithm in the ITU-R BT.1359 standard and copies its piecewise calculation method; according to human subjective perception of the synchronization, thresholds for 4 audio/video synchronization quality grades are set. The synchronization scoring model can be as shown in Fig. 4, where the horizontal axis is the time by which the audio lags the video, the vertical axis represents the score, and the points A, B, C, A', B', C' are the three formulated grade thresholds, dividing the score into 4 grades. Each synchronization quality grade corresponds to a MOS score, with a maximum of 4.0, a minimum of 1.0, and a floating range of 0.3. The synchronization grades, their thresholds, and the corresponding MOS scores can be as shown in Table 1:
Table 1
[Table 1 is reproduced in the original only as an image (grades, lead/lag thresholds, and MOS scores); its values are not recoverable here.]
To evaluate the synchronization quality more accurately and objectively, the embodiments set up multiple monitoring points to detect the synchronization and evaluate its quality; in the evaluation, the synchronization MOS scores of the monitoring points are summed to obtain an overall synchronization MOS score. The overall synchronization MOS score can serve as an important indicator and be combined, by weighting, with the audio MOS and video MOS scores to yield a MOS score for the overall quality of the video service.
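Since the threshold values of Table 1 survive only as an image, the following sketch shows the shape of the grade/MOS mapping with placeholder thresholds (loosely inspired by typical lip-sync detectability figures, not taken from the patent); it also simplifies the asymmetric lead/lag thresholds of Fig. 4 to symmetric ones:

```python
# Placeholder thresholds: the real A/B/C (audio leading) and A'/B'/C' (audio lagging)
# values are in the Table 1 image and are NOT recoverable from the text.
GRADES = [
    (0.045, 4.0),          # grade 1: skew imperceptible (assumed limit, seconds)
    (0.090, 3.0),          # grade 2
    (0.185, 2.0),          # grade 3
    (float("inf"), 1.0),   # grade 4: clearly out of sync
]

def sync_mos(av_skew_seconds):
    """av_skew_seconds: audio-minus-video delay change measured at the target end."""
    for limit, mos in GRADES:
        if abs(av_skew_seconds) <= limit:
            return mos

print(sync_mos(0.12))  # -> 2.0 with these placeholder thresholds
```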
Based on the same technical concept as the audio/video synchronization detection in the embodiments, the embodiments also provide an audio/video synchronization detection system. As shown in Fig. 5, the system comprises an audio identification module 501, a video identification module 502, a time-difference determination module 503, and a synchronization detection module 504, wherein:
the audio identification module 501 can determine, by audio pattern recognition, the start playback time of the audio segment matching the audio reference data in the audio/video file played at the target end;
the video identification module 502 can determine, by video pattern recognition, the start playback time of the video frame matching the video reference data in the audio/video file played at the target end;
the time-difference determination module 503 is configured to determine the audio/video playback time difference of the file when played at the target end, from the start playback time of the matching audio segment determined by the audio identification module 501 and the start playback time of the matching video frame determined by the video identification module 502;
the synchronization detection module 504 is configured to obtain the audio/video playback time difference of the file when played at the source end, and to determine, from that difference and the difference determined by the time-difference determination module 503, the audio/video synchronization of the file when played at the target end.
The implementation of each function in these modules is similar to the corresponding part of the synchronization detection flow described above and is not repeated here.
Based on the same technical concept as the speech detection in the embodiments, the embodiments also provide a speech detection system. As shown in Fig. 6, the system comprises a first search module 601, a second search module 602, and a speech-segment determination module 603, wherein:
the first search module 601 receives the input audio signal under test and searches it for an audio signal according to the short-time average magnitude of the speech signal: when an audio signal whose short-time average magnitude exceeds the amplitude threshold MH is found, it searches forward from the current moment, and when the short-time average magnitude first drops below MH after that moment, it searches backward from the current moment;
the second search module 602 is configured to continue searching in the original direction according to the short-time average zero-crossing rate when the first search module 601, searching forward or backward, finds an audio signal whose short-time average magnitude has dropped to the amplitude threshold ML;
the speech-segment determination module 603 is configured to take the current moment as the start of the speech segment when the second search module 602, searching forward, finds an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing-rate threshold Z0, and to take the current moment as the end of the segment when, searching backward, it finds an audio signal whose short-time average zero-crossing rate has dropped below Z0.
The system can also include a threshold setting module 604, configured to determine the amplitude threshold MH, the amplitude threshold ML, and the zero-crossing-rate threshold Z0 from the distributions of the short-time average magnitude and short-time average zero-crossing rate of the audio signals in the speech sample data, wherein audio signals whose short-time average magnitude is above MH are speech signals, and among the signals whose short-time average magnitude is below ML, audio signals whose short-time average zero-crossing rate is below Z0 are not speech signals.
The implementation of each function in these modules is similar to the corresponding part of the speech detection flow described above and is not repeated here.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (22)

1. An audio/video synchronization detection method, characterized by comprising the steps of:
determining, in an audio/video file played at a target end, the start playback time of the audio segment matching audio reference data, and the start playback time of the video frame matching video reference data;
determining, from the start playback time of the audio segment matching the audio reference data and the start playback time of the video frame matching the video reference data, the audio/video playback time difference of the audio/video file when played at the target end;
obtaining the audio/video playback time difference of the audio/video file when played at a source end, and determining, from the audio/video playback time differences of the file when played at the source end and at the target end, the audio/video synchronization of the file when played at the target end.
2. The method of claim 1, characterized in that obtaining the audio/video playback time difference of the audio/video file when played at the source end comprises:
determining, in the audio/video file played at the source end, the start playback time of the audio segment matching the audio reference data, and the start playback time of the video frame matching the video reference data;
determining, from the start playback time of the audio segment matching the audio reference data and the start playback time of the video frame matching the video reference data, the audio/video playback time difference of the file when played at the source end.
3. The method of claim 1 or 2, characterized in that the audio reference data are speech data;
and the process of determining the start playback time of the audio segment matching the audio reference data comprises:
detecting the speech segments contained in the played audio/video file and their start and end playback times;
determining the speech segment matching the audio reference data by performing speech recognition on the detected speech segments against the audio reference data.
4. The method of claim 3, characterized in that the process of determining the speech segments contained in the played audio/video file and their start and end playback times comprises:
searching for an audio signal in the played audio/video file according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, searching forward from the current moment, and when the short-time average magnitude first drops below the first amplitude threshold after that moment, searching backward from the current moment;
when an audio signal whose short-time average magnitude drops to a second amplitude threshold is found in the forward or backward search, continuing to search in the original direction according to the short-time average zero-crossing rate, the second amplitude threshold being smaller than the first amplitude threshold;
when, searching forward, an audio signal whose short-time average zero-crossing rate drops below a zero-crossing-rate threshold is found, taking the current moment as the start of the speech segment, and when, searching backward, an audio signal whose short-time average zero-crossing rate drops below the zero-crossing-rate threshold is found, taking the current moment as the end of the speech segment.
5. The method of claim 4, characterized in that the first amplitude threshold, the second amplitude threshold, and the zero-crossing-rate threshold are determined from the distributions of the short-time average magnitude and the short-time average zero-crossing rate of the audio signals in speech sample data, wherein audio signals whose short-time average magnitude is above the first amplitude threshold are speech signals, and among the signals whose short-time average magnitude is below the second amplitude threshold, audio signals whose short-time average zero-crossing rate is below the zero-crossing-rate threshold are not speech signals.
6. The method of claim 3, characterized in that the process of determining the speech segment matching the audio reference data comprises:
determining, from the feature vector of the audio signal of each speech segment and the feature vector of the speech reference data, their mutual similarity by computing the spatial distance between each speech segment and the speech reference data;
according to the determined similarities, taking the speech segment most similar to the speech reference data as the speech segment matching the audio reference data.
7. The method of claim 6, characterized in that, when the number of audio frames of a speech segment and the number of audio frames of the audio reference data are unequal, the process of computing the distance between the speech segment and the speech reference data is specifically:
mapping each audio frame number of the speech segment onto the horizontal axis of a two-dimensional rectangular coordinate system, mapping each audio frame number of the audio reference data onto the vertical axis of the coordinate system, and determining a path from the lower-left corner of the coordinate system toward the upper-right corner; determining, from the coordinate points the path passes through, the frame number of the audio reference data corresponding to each frame number of the speech segment;
according to the determined frame-number correspondence, computing, using the feature vectors of the audio signals, the distortion between each pair of corresponding audio frames, and determining, from the computed distortions, the spatial distance between the speech segment and the audio reference data.
8. The method of claim 7, wherein, along the path running from the lower-left corner of the coordinate system toward its upper-right corner, the slope at each node identified by a vertical and a horizontal coordinate is no greater than a first slope threshold and no less than a second slope threshold, the first slope threshold being greater than the second slope threshold.
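Claims 7 and 8 describe a dynamic-time-warping-style alignment with slope constraints on the warping path. A minimal sketch follows; the step set (1,1), (1,2), (2,1), which bounds the local slope between 1/2 and 2, and the Euclidean frame distortion are assumptions, since the patent fixes neither the slope thresholds nor the distortion measure here.

```python
import numpy as np

def dtw_distance(seg, ref):
    """Slope-constrained alignment in the spirit of claims 7 and 8. Rows of
    `seg` and `ref` are per-frame feature vectors; `seg` maps to the
    horizontal axis, `ref` to the vertical axis."""
    n, m = len(seg), len(ref)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seg[i - 1] - ref[j - 1])  # frame distortion
            best = D[i - 1, j - 1]                          # diagonal step (1,1)
            if j >= 2:
                best = min(best, D[i - 1, j - 2])           # step (1,2), slope 2
            if i >= 2:
                best = min(best, D[i - 2, j - 1])           # step (2,1), slope 1/2
            D[i, j] = cost + best
    return D[n, m] / (n + m)   # length-normalized spatial distance
```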
9. The method of claim 1 or claim 2, wherein determining the initial playback time of the video frame that matches the video reference data comprises:
extracting the video frames contained in the played audio/video file;
performing image recognition on the extracted video frames against the video reference data to determine the video frame that matches the video reference data and its initial playback time.
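A toy version of the frame-matching step might look as follows; plain mean-squared error stands in for whatever image-recognition method the patent intends, and all names and the threshold are illustrative.

```python
import numpy as np

def match_video_frame(frames, timestamps, ref_frame, mse_threshold=50.0):
    """Return the initial playback time of the first extracted frame that
    matches the reference image (claim 9)."""
    for ts, frame in zip(timestamps, frames):
        mse = np.mean((frame.astype(np.float64) - ref_frame.astype(np.float64)) ** 2)
        if mse < mse_threshold:
            return ts
    return None   # no frame matched the video reference data
```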
10. the method for claim 1 is characterized in that, determines the audio-visual synchronization situation of described audio-video document, comprising:
Determine described audio-video document when destination end is play with respect to the audio frequency and video time delay variable quantity that when the source end is play, is produced;
According to the audio frequency and video time delay variable quantity of determining, determine corresponding audio-visual synchronization credit rating or mark.
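Putting claims 1 and 10 together, the quantity being graded is the change in the video-minus-audio start-time difference between source and target playback. The sketch below assumes exactly this; the grade band edges are illustrative values loosely based on common lip-sync tolerances, not figures from the patent.

```python
def sync_grade(audio_start_tgt, video_start_tgt, audio_start_src, video_start_src):
    """Audio/video delay variation between target and source playback,
    mapped to a coarse quality grade."""
    diff_tgt = video_start_tgt - audio_start_tgt     # time difference at target
    diff_src = video_start_src - audio_start_src     # time difference at source
    delta = diff_tgt - diff_src                      # delay variation, seconds
    if abs(delta) <= 0.08:
        return delta, "good"
    if abs(delta) <= 0.20:
        return delta, "fair"
    return delta, "poor"
```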
11. An audio/video synchronization detection system, comprising:
an audio recognition module, configured to determine, in an audio/video file played at a target end, the initial playback time of the audio segment that matches audio reference data;
a video recognition module, configured to determine, in the audio/video file played at the target end, the initial playback time of the video frame that matches video reference data;
a time-difference determination module, configured to determine the audio/video playback time difference of the audio/video file when played at the target end from the initial playback time of the matching audio segment determined by the audio recognition module and the initial playback time of the matching video frame determined by the video recognition module;
a synchronization detection module, configured to obtain the audio/video playback time difference of the audio/video file when played at a source end, and to determine the audio/video synchronization status of the audio/video file when played at the target end from the obtained time difference and the time difference determined by the time-difference determination module.
12. The system of claim 11, wherein, to obtain the audio/video playback time difference of the audio/video file when played at the source end, the synchronization detection module determines, in the audio/video file played at the source end, the initial playback time of the audio segment that matches the audio reference data and the initial playback time of the video frame that matches the video reference data, and then determines the audio/video playback time difference of the audio/video file when played at the source end from those initial playback times.
13. The system of claim 12, wherein the audio reference data is speech data;
and wherein the audio recognition module or the synchronization detection module determines the initial playback time of the audio segment that matches the audio reference data by:
detecting the voice segments contained in the played audio/video file and their start and end playback times;
performing speech recognition on the detected voice segments against the audio reference data to determine the voice segment that matches the audio reference data.
14. The system of claim 13, wherein the audio recognition module or the synchronization detection module detects the voice segments contained in the played audio/video file and their start and end playback times by:
searching the audio signal of the played audio/video file according to the short-time average magnitude of speech signals: upon finding an audio signal whose short-time average magnitude exceeds a first amplitude threshold, searching the audio signal from that moment toward the beginning of the file, and upon first finding, after that moment, an audio signal whose short-time average magnitude drops below the first amplitude threshold, searching the audio signal from that moment toward the end of the file;
when either search finds an audio signal whose short-time average magnitude has dropped to a second amplitude threshold, continuing the search in the same direction according to the short-time average zero-crossing rate, the second amplitude threshold being less than the first amplitude threshold;
when the search toward the beginning finds an audio signal whose short-time average zero-crossing rate drops below a zero-crossing-rate threshold, taking that moment as the start of the voice segment, and when the search toward the end finds an audio signal whose short-time average zero-crossing rate drops below the zero-crossing-rate threshold, taking that moment as the end of the voice segment.
15. The system of claim 13, wherein the audio recognition module determines the voice segment that matches the audio reference data by:
computing the spatial distance between each voice segment and the speech reference data from the feature vectors of the audio signal of each voice segment and the feature vector of the speech reference data, and determining their similarity from that distance;
selecting, according to the determined similarities, the voice segment most similar to the speech reference data as the voice segment that matches the audio reference data.
16. The system of claim 15, wherein, when the number of audio frames in a voice segment differs from the number of audio frames in the audio reference data, the audio recognition module computes the distance between the voice segment and the speech reference data by:
mapping the frame numbers of the audio frames of the voice segment onto the horizontal axis of a two-dimensional rectangular coordinate system, mapping the frame numbers of the audio frames of the audio reference data onto the vertical axis of the coordinate system, and determining a path running from the lower-left corner of the coordinate system toward its upper-right corner; then determining, from the coordinate points the path passes through, the frame number of the audio reference data that corresponds to each frame number of the voice segment;
computing, according to the determined frame-number correspondence and using the feature vectors of the audio signals, the distortion between each pair of corresponding audio frames, and determining the spatial distance between the voice segment and the audio reference data from the computed distortions.
17. The system of claim 12, wherein the video recognition module determines the initial playback time of the video frame that matches the video reference data by:
extracting the video frames contained in the played audio/video file;
performing image recognition on the extracted video frames against the video reference data to determine the video frame that matches the video reference data and its initial playback time.
18. The system of claim 11, wherein the synchronization detection module determines the audio/video synchronization status of the audio/video file by:
determining the audio/video delay variation produced when the audio/video file is played at the target end relative to when it is played at the source end;
determining, from the determined audio/video delay variation, a corresponding audio/video synchronization quality grade or score.
19. A speech detection method, comprising the steps of:
searching the audio signal in audio under test according to the short-time average magnitude of speech signals: upon finding an audio signal whose short-time average magnitude exceeds a first amplitude threshold, searching the audio signal from that moment toward the beginning of the audio, and upon first finding, after that moment, an audio signal whose short-time average magnitude drops below the first amplitude threshold, searching the audio signal from that moment toward the end of the audio;
when either search finds an audio signal whose short-time average magnitude has dropped to a second amplitude threshold, continuing the search in the same direction according to the short-time average zero-crossing rate, the second amplitude threshold being less than the first amplitude threshold;
when the search toward the beginning finds an audio signal whose short-time average zero-crossing rate drops below a zero-crossing-rate threshold, taking that moment as the start of the voice segment, and when the search toward the end finds an audio signal whose short-time average zero-crossing rate drops below the zero-crossing-rate threshold, taking that moment as the end of the voice segment.
20. The method of claim 19, wherein the first amplitude threshold, the second amplitude threshold, and the zero-crossing-rate threshold are determined from the distributions of short-time average magnitude and short-time average zero-crossing rate of the audio signals in speech sample data, such that an audio signal whose short-time average magnitude is above the first amplitude threshold is a speech signal, and, among speech signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is below the zero-crossing-rate threshold is not a speech signal.
21. A speech detection system, comprising:
a first search module, configured to search the audio signal in audio under test according to the short-time average magnitude of speech signals: upon finding an audio signal whose short-time average magnitude exceeds a first amplitude threshold, to search the audio signal from that moment toward the beginning of the audio, and upon first finding, after that moment, an audio signal whose short-time average magnitude drops below the first amplitude threshold, to search the audio signal from that moment toward the end of the audio;
a second search module, configured to continue the search in the same direction according to the short-time average zero-crossing rate when the first search module finds, in either direction, an audio signal whose short-time average magnitude has dropped to a second amplitude threshold, the second amplitude threshold being less than the first amplitude threshold;
a voice-segment determination module, configured to take as the start of the voice segment the moment at which the second search module, searching toward the beginning, finds an audio signal whose short-time average zero-crossing rate drops below a zero-crossing-rate threshold, and to take as the end of the voice segment the moment at which the second search module, searching toward the end, finds an audio signal whose short-time average zero-crossing rate drops below the zero-crossing-rate threshold.
22. The system of claim 21, further comprising:
a threshold setting module, configured to determine the first amplitude threshold, the second amplitude threshold, and the zero-crossing-rate threshold from the distributions of short-time average magnitude and short-time average zero-crossing rate of the audio signals in speech sample data, such that an audio signal whose short-time average magnitude is above the first amplitude threshold is a speech signal, and, among speech signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is below the zero-crossing-rate threshold is not a speech signal.
Application CN2009102374145A, filed 2009-11-06 (priority date 2009-11-06): Audio/video synchronization detection method and system, and voice detection method and system. Status: Active. Granted as CN102056026B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN2009102374145A | 2009-11-06 | 2009-11-06 | Audio/video synchronization detection method and system, and voice detection method and system

Publications (2)

Publication Number | Publication Date
CN102056026A (application) | 2011-05-11
CN102056026B (en) (granted) | 2013-04-03

Family

ID=43959877

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN2009102374145A (Active; granted as CN102056026B (en)) | Audio/video synchronization detection method and system, and voice detection method and system | 2009-11-06 | 2009-11-06

Country Status (1)

Country | Link
CN (1) | CN102056026B (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
US6744922B1 * | 1999-01-29 | 2004-06-01 | Sony Corporation | Signal processing method and video/voice processing device
US6928233B1 * | 1999-01-29 | 2005-08-09 | Sony Corporation | Signal processing method and video signal processor for detecting and analyzing a pattern reflecting the semantics of the content of a signal
CN101159834B * | 2007-10-25 | 2012-01-11 | 中国科学院计算技术研究所 | Method and system for detecting repeatable video and audio program fragment
CN101494049B * | 2009-03-11 | 2011-07-27 | 北京邮电大学 | Method for extracting audio characteristic parameter of audio monitoring system


Also Published As

Publication Number | Publication Date
CN102056026B (en) | 2013-04-03


Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant
