Summary of the Invention
In view of the above defects of the prior art, the object of the invention is to provide a noise-tolerant audio sentence-segmentation method and system, thereby solving the problems that existing subtitle alignment workflows cannot segment sentences automatically and suffer from high noise.
The present invention is directed at classroom recording and live network broadcasting, and proposes an intelligent speech sentence-segmentation method. Using speech analysis techniques, the method quickly and automatically analyzes recorded or captured voice data and detects speech segments that meet subtitle specifications, saving time in producing subtitles for audio and video.
To achieve the above object, the present invention provides the following technical solution:
A noise-tolerant audio sentence-segmentation method, comprising:
Step s101: obtaining a plurality of frame segments from the audio;
Step s102: obtaining an energy threshold $e_k$ from the energy value of each frame segment;
Step s103: according to the energy threshold $e_k$, obtaining from the frame segments those whose energy value exceeds the set energy threshold $e_t$; taking each such frame segment as a mid-sentence frame, scanning the frames preceding or following it; if the energy of a preceding or following frame is below the set energy threshold $e_t$, merging that frame with the mid-sentence frame, in order of frame start time, into an independent sentence;
Step s104: searching forward and backward from the first and last frames of each sentence, respectively; if the next frame found belongs to another sentence, merging the two sentences; if the energy of the next frame is below $e_t$ and the frame does not belong to another sentence, applying a Fourier transform to the frame, taking the amplitudes in the 0-4000 Hz range and dividing them into $z$ spectral bands of fixed width, where the intensity of each band is $v_i$, $i = 1, 2, \ldots, z$, the total intensity is $v_{sum}$, and $p_i$ is the probability of each band, computed as:
$$p_i = \frac{v_i}{v_{sum}}$$
The spectral entropy of the frame is then:
$$H = -\sum_{i=1}^{z} p_i \log p_i$$
The ratio of each frame's energy to its spectral entropy is the energy-entropy ratio, denoted $r$. An energy-entropy ratio threshold $r_t$ is set; if the energy-entropy ratio of the frame is not less than $r_t$, the frame is added to the sentence; the scan stops when it reaches the beginning or end of the voice stream;
Step s105: determining whether the frame length of the independent sentence falls within a set short-sentence frame-length range; if so, comparing the current independent sentence with stored historical short independent-sentence samples, and if the matching degree is below a set value, marking the independent sentence as a noise sentence;
Step s106: taking the independent sentences, obtained from the frame segments of the audio, that are not marked as noise sentences as the sentence segmentation of the audio.
In a preferred embodiment, step s101 includes:
Step s1011: receiving an audio file;
Step s1012: splitting the audio file according to a set splitting time to obtain a plurality of frame segments.
In a preferred embodiment, step s102 includes: obtaining the energy threshold $e_k$ from the mean of the energy values of the frame segments.
In a preferred embodiment, the part of step s103 reading "if the energy of a preceding or following frame is below the set energy threshold $e_t$, merging that frame with the mid-sentence frame, in order of frame start time, into an independent sentence" includes:
If the energy of the preceding or following frame is below the set energy $e_t$, determining whether the interval between the current frame and the next frame is shorter than a set interval time, and if so, merging the mid-sentence frames, in order of frame start time, into an independent sentence.
In a preferred embodiment, the method further includes, after step s103:
Step s1031: if the frame length of the independent sentence exceeds a set independent-sentence frame length, computing the spectral-entropy ratio of each frame of the independent sentence and, taking the frame with the lowest spectral-entropy ratio as the split point, splitting the independent sentence into two independent sentences.
The present invention also provides an automatic segmentation system for audio sentence segmentation, comprising: a framing unit, an energy threshold acquisition unit, an independent sentence acquisition unit, a spectral entropy analysis unit, a noise sentence judging unit, and a sentence-segmentation acquisition unit;
The framing unit is configured to obtain a plurality of frame segments from the audio;
The energy threshold acquisition unit is configured to obtain an energy threshold $e_k$ from the energy value of each frame segment;
The independent sentence acquisition unit is configured to obtain from the frame segments, according to the energy threshold $e_k$, those whose energy value exceeds the set energy threshold $e_t$; to take each such frame segment as a mid-sentence frame and scan the frames preceding or following it; and, if the energy of a preceding or following frame is below the set energy threshold $e_t$, to merge that frame with the mid-sentence frame, in order of frame start time, into an independent sentence;
The spectral entropy analysis unit is configured to search forward and backward from the first and last frames of each sentence, respectively; if the next frame found belongs to another sentence, to merge the two sentences; if the energy of the next frame is below $e_t$ and the frame does not belong to another sentence, to apply a Fourier transform to the frame, take the amplitudes in the 0-4000 Hz range and divide them into $z$ spectral bands of fixed width, where the intensity of each band is $v_i$, $i = 1, 2, \ldots, z$, the total intensity is $v_{sum}$, and $p_i$ is the probability of each band, computed as:
$$p_i = \frac{v_i}{v_{sum}}$$
The spectral entropy of the frame is then:
$$H = -\sum_{i=1}^{z} p_i \log p_i$$
The ratio of each frame's energy to its spectral entropy is the energy-entropy ratio, denoted $r$. An energy-entropy ratio threshold $r_t$ is set; if the energy-entropy ratio of the frame is not less than $r_t$, the frame is added to the sentence; the scan stops when it reaches the beginning or end of the voice stream;
The noise sentence judging unit is configured to determine whether the frame length of the independent sentence falls within a set short-sentence frame-length range; if so, to compare the current independent sentence with stored historical short independent-sentence samples, and if the matching degree is below a set value, to mark the independent sentence as a noise sentence;
The sentence-segmentation acquisition unit is configured to take the independent sentences, obtained from the frame segments of the audio, that are not marked as noise sentences as the sentence segmentation of the audio.
In a preferred embodiment, the framing unit is further configured to: receive an audio file; and split the audio file according to a set splitting time to obtain a plurality of frame segments.
In a preferred embodiment, the energy threshold acquisition unit is further configured to obtain the energy threshold $e_k$ from the mean of the energy values of the frame segments.
In a preferred embodiment, the independent sentence acquisition unit is further configured to: if the energy of the preceding or following frame is below the set energy $e_t$, determine whether the interval between the current frame and the next frame is shorter than a set interval time, and if so, merge the mid-sentence frames, in order of frame start time, into an independent sentence.
In a preferred embodiment, the system further includes a long-sentence judging unit;
The long-sentence judging unit is configured to: if the frame length of the independent sentence exceeds a set independent-sentence frame length, compute the spectral-entropy ratio of each frame of the independent sentence and, taking the frame with the lowest spectral-entropy ratio as the split point, split the independent sentence into two independent sentences.
The beneficial effects of the invention are as follows. The main computation of the method is carried out in the time domain, so it is fast. For limited local regions that may be either consonants or noise, the time domain and the frequency domain are analyzed together, which increases segmentation accuracy. Time-consuming spectral analysis is needed for only a few frames, so segmentation is both fast and accurate, and is also strongly noise-resistant. The method automatically generates the time points at which speech is cut, which saves work in audio and video subtitle editing. A splitting method has been designed that directly reuses existing computation results and requires no second round of feature computation; it splits long sentences quickly and guarantees that no overlong sentence remains, meeting the requirements of subtitle production. Short sentences are checked with a machine learning method to judge whether they are human voice or noise; noise is discarded, further improving accuracy. The method can process both recorded audio and video and audio and video that is being broadcast live. For a live network stream, the broadcast speech can be cut automatically so that subsequent stages such as dictation can be processed in parallel, shortening processing time.
Detailed Description of the Embodiments
The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The noise-tolerant audio sentence-segmentation method of the present invention, as shown in Figure 1, comprises:
Step s101: obtaining a plurality of frame segments from the audio.
The present invention may be installed on a server, or on a personal computer or mobile computing device. The computing terminal referred to below may be a server, a personal computer, or a mobile computing device. First, an audio-video file is uploaded to the server, or opened on the personal computer or mobile computing device. The computing device then extracts the audio stream from the audio-video file and converts it to signed single-channel data at a fixed sampling frequency. The data is then framed using preset framing parameters.
Step s1011: receiving an audio file; Step s1012: splitting the audio file according to a set splitting time to obtain a plurality of frame segments.
The audio is framed. Each frame is 10 ms to 500 ms long. In speech recognition, adjacent frames must overlap so that speech can be recognized accurately. The purpose of the present invention is not speech recognition, so frames may overlap, may not overlap, or may even be separated by an interval; the interval is 0 ms to 500 ms. The number of frames obtained for speech segmentation is therefore smaller than the number needed for speech recognition, which reduces the amount of computation and improves speed. Let $f_1, f_2, \ldots, f_m$ denote the frames obtained; each frame has $n$ samples $s_{k1}, s_{k2}, \ldots, s_{kn}$, whose amplitude values are $f_{k1}, f_{k2}, \ldots, f_{kn}$. The start time and end time of each frame are recorded.
Speech data is the sequence of real numbers obtained by sampling sound at a fixed sample rate. A sample rate of 16 kHz means that 16000 samples are taken per second. Framing groups this sequence into sets covering a fixed time span, each set being one unit of analysis. For example, at a 16 kHz sample rate and a frame length of 100 ms, one frame contains 1600 samples. Framing determines the granularity of control. In this patent, frames are generally 100 ms long, so a video of n seconds is divided into 10n frames. Frames need not be adjacent: with a 100 ms interval between frames, a video of n seconds yields 5n frames. Increasing the interval between frames reduces the total number of frames and speeds up analysis, at the cost of lower timing accuracy.
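By way of non-limiting illustration, the framing described above might be sketched in Python as follows. The function name `split_into_frames`, its parameter names, and the decision to drop a trailing partial frame are assumptions made for this sketch only; they are not specified by the method.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=100, gap_ms=0):
    """Split a sample sequence into fixed-length frames.

    Returns a list of (start_seconds, end_seconds, frame_samples) tuples.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = sample_rate * frame_ms // 1000       # samples per frame
    hop = frame_len + sample_rate * gap_ms // 1000   # distance between frame starts
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        end = start + frame_len
        frames.append((start / sample_rate, end / sample_rate, samples[start:end]))
    return frames


one_second = [0.0] * 16000                               # 1 s of audio at 16 kHz
print(len(split_into_frames(one_second)))                # 10 frames, no gap
print(len(split_into_frames(one_second, gap_ms=100)))    # 5 frames, 100 ms gap
```

With 100 ms frames this reproduces the 10n and 5n frame counts given above for adjacent frames and for frames separated by 100 ms.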
Step s102: obtaining an energy threshold $e_k$ from the energy value of each frame segment.
In this step:
The energy $e_k$ of each frame is computed. Definitions of energy include, but are not limited to, the sum of squared amplitudes and the sum of absolute amplitudes.
The energy defined by the sum of squared amplitudes is:
$$e_k = \sum_{i=1}^{n} f_{ki}^2$$
The energy defined by the sum of absolute values is:
$$e_k = \sum_{i=1}^{n} |f_{ki}|$$
An energy threshold $e_t$ is set, and adjacent speech frames whose energies all exceed $e_t$ are found, giving speech sentences $s_1, s_2, \ldots, s_j$. That is:
$$s_i = \{ f_k \mid k = a, a+1, a+2, \ldots, a+b;\ e_k \ge e_t;\ e_{a-1} < e_t;\ e_{a+b+1} < e_t \}$$
In another embodiment, step s102 includes: obtaining the energy threshold $e_k$ from the mean of the energy values of the frame segments. That is, the energy value obtained in the previous step is divided by the number of samples to give the average energy. The energy threshold is a threshold on the average energy per frame; it is usually set empirically, commonly to a value between 0.001 and 0.01, and can be adjusted manually by the user.
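The energy computation and the initial grouping of frames into sentences might be sketched as follows. This is a minimal sketch under stated assumptions: the function names `frame_energy` and `group_sentences` are hypothetical, the energy is the average per sample as described above, and a frame is treated as speech when its energy is greater than or equal to $e_t$. The ten energies are those of the worked example in this description.

```python
def frame_energy(frame, mode="square"):
    """Average energy per sample of one frame (sum of squares or of absolute values)."""
    if mode == "square":
        total = sum(x * x for x in frame)
    else:
        total = sum(abs(x) for x in frame)
    return total / len(frame)


def group_sentences(energies, e_t):
    """Return (first, last) frame-index pairs for runs of frames with energy >= e_t."""
    sentences = []
    start = None
    for k, e in enumerate(energies):
        if e >= e_t and start is None:
            start = k                          # a new sentence begins
        elif e < e_t and start is not None:
            sentences.append((start, k - 1))   # the sentence ended on the previous frame
            start = None
    if start is not None:                      # sentence runs to the end of the stream
        sentences.append((start, len(energies) - 1))
    return sentences


energies = [0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12]
print(group_sentences(energies, 0.003))   # [(0, 1), (3, 7), (9, 9)]
```

The three index ranges correspond to the three sentences of the worked example: frames 1-2, frames 4-8, and frame 10 when counted from one.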
Step s103: merging into independent sentences.
According to the energy threshold $e_k$, the frame segments whose energy value exceeds the set energy threshold $e_t$ are obtained; each such frame segment is taken as a mid-sentence frame and the frames preceding or following it are scanned; if the energy of a preceding or following frame is below the set energy threshold $e_t$, that frame is merged with the mid-sentence frame, in order of frame start time, into an independent sentence.
The part of step s103 reading "if the energy of a preceding or following frame is below the set energy threshold $e_t$, merging that frame with the mid-sentence frame, in order of frame start time, into an independent sentence" includes: if the energy of the preceding or following frame is below the set energy $e_t$, determining whether the interval between the current frame and the next frame is shorter than a set interval time, and if so, merging the mid-sentence frames, in order of frame start time, into an independent sentence.
A search is made forward and backward from the first and last frames of each sentence, respectively. If the next frame found belongs to another sentence, the two sentences are merged. If the energy of the next frame is below $e_t$ and the frame does not belong to another sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz range are taken and divided into $z$ spectral bands of fixed width, where the intensity of each band is $v_i$, $i = 1, 2, \ldots, z$, the total intensity is $v_{sum}$, and $p_i$ is the probability of each band. $p_i$ is computed as:
$$p_i = \frac{v_i}{v_{sum}}$$
The spectral entropy of the frame is then:
$$H = -\sum_{i=1}^{z} p_i \log p_i$$
The ratio of each frame's energy to its spectral entropy is the energy-entropy ratio, denoted $r$. An energy-entropy ratio threshold $r_t$ is set; if the energy-entropy ratio of the frame is not less than $r_t$, the frame is added to the sentence. The scan stops when it reaches the beginning or end of the voice stream.
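The spectral entropy and energy-entropy ratio of a single frame might be computed as in the sketch below. It assumes the usual definitions $p_i = v_i / v_{sum}$ and $H = -\sum p_i \ln p_i$ (the natural logarithm is an assumption, since the base is not stated). The function names, the default of 8 bands, and the use of a naive discrete Fourier transform in place of an FFT library are choices made only to keep the example self-contained.

```python
import cmath
import math


def band_intensities(frame, sample_rate=16000, z=8, max_hz=4000):
    """Spectrum amplitudes below max_hz, summed into z bands of equal width."""
    n = len(frame)
    n_bins = int(max_hz * n / sample_rate)   # number of DFT bins below max_hz
    amps = []
    for b in range(n_bins):
        # Naive DFT of a single bin; an FFT would normally be used instead.
        coeff = sum(x * cmath.exp(-2j * math.pi * b * i / n) for i, x in enumerate(frame))
        amps.append(abs(coeff))
    width = max(1, len(amps) // z)           # bins per band; leftover bins are ignored
    return [sum(amps[j * width:(j + 1) * width]) for j in range(z)]


def spectral_entropy(intensities):
    """H = -sum(p_i * ln(p_i)) with p_i = v_i / v_sum."""
    v_sum = sum(intensities)
    if v_sum == 0:
        return 0.0
    probs = [v / v_sum for v in intensities]
    return -sum(p * math.log(p) for p in probs if p > 0)


def energy_entropy_ratio(energy, entropy):
    """r = energy / spectral entropy; treated as infinite when the entropy is zero."""
    return energy / entropy if entropy > 0 else float("inf")


impulse = [1.0] + [0.0] * 159                # flat spectrum: maximum entropy
print(round(spectral_entropy(band_intensities(impulse)), 3))   # 2.079, i.e. ln(8)
```

A frame whose spectrum is concentrated in a few bands has low entropy and therefore a high ratio for a given energy, which is what allows a quiet but structured frame to be kept in the sentence.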
For example, suppose there are 10 speech frames whose energies are, respectively:
0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12
With 0.003 as the threshold, the third step yields three sentences:
Sentence 1 contains: 0.05, 0.12
Sentence 2 contains: 0.004, 0.1, 0.2, 0.4, 0.5
Sentence 3 contains: 0.12
Taking sentence 2 as an example, scan forward: the frame before it is 0.002, which belongs to no sentence and whose energy is below the threshold 0.003. A Fourier transform is then applied to this frame and its energy-entropy ratio is computed. If the energy-entropy ratio is below the energy-entropy ratio threshold, the frame is judged not to belong to sentence 2 and the forward scan ends. If the energy-entropy ratio is not below that threshold, the frame is judged to belong to sentence 2 and the scan continues to the next frame. That frame is 0.12, which belongs to sentence 1, so sentence 1 and sentence 2 are merged. After merging, the first frame is 0.05, which is already the first frame of the stream, so no further forward scan is possible and the forward scan ends. Backward scanning follows the same logic: on meeting a frame whose energy is below the energy threshold, its energy-entropy ratio is computed; if the ratio is below the energy-entropy ratio threshold the scan ends, otherwise it continues. On meeting another sentence, the two are merged and the scan continues.
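The forward and backward scan just described might be sketched as follows. The function name `extend_sentences`, the callable `ratio_of`, and the ratio values in the usage example are hypothetical; in particular, the ratios assigned to the two low-energy frames are invented for illustration and are not taken from this description. The sketch relies on the fact that any frame outside a sentence already has energy below $e_t$.

```python
def extend_sentences(sentences, n_frames, ratio_of, r_t):
    """Grow each sentence over neighbouring low-energy frames whose energy-entropy
    ratio is at least r_t; sentences that become contiguous are merged.

    sentences: (first, last) frame-index pairs from the energy threshold step.
    ratio_of:  callable giving the energy-entropy ratio of a frame index.
    """
    member = [False] * n_frames
    for first, last in sentences:
        for k in range(first, last + 1):
            member[k] = True
    for first, last in sentences:
        k = first - 1                        # scan towards the start of the stream
        while k >= 0 and not member[k] and ratio_of(k) >= r_t:
            member[k] = True
            k -= 1
        k = last + 1                         # scan towards the end of the stream
        while k < n_frames and not member[k] and ratio_of(k) >= r_t:
            member[k] = True
            k += 1
    merged, start = [], None                 # regroup contiguous member frames
    for k, is_member in enumerate(member):
        if is_member and start is None:
            start = k
        elif not is_member and start is not None:
            merged.append((start, k - 1))
            start = None
    if start is not None:
        merged.append((start, n_frames - 1))
    return merged


# Hypothetical ratios: frame 2 passes the threshold, frame 8 does not.
ratios = {2: 0.5, 8: 0.01}
print(extend_sentences([(0, 1), (3, 7), (9, 9)], 10, ratios.get, r_t=0.1))
# [(0, 7), (9, 9)]
```

Under these assumed ratios the first two sentences of the example are joined through frame 2, while the third stays separate because frame 8 fails the threshold.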
Next, nearby sentences are merged. For adjacent sentences, the interval between them is computed; if it is shorter than a specified time threshold, the two sentences are merged.
This step merges further. For example, assume each frame is 100 ms long, sentence 1 contains frames 22, 23, 24, 25, 26 (5 frames in total), sentence 2 contains frames 29, 30, 31, 32, 33, 34, 35 (7 frames in total), and there is no other sentence between them. The two sentences are separated by 2 frames, that is, 200 ms. Assume the specified time threshold is 300 ms. Because 200 ms is less than 300 ms, sentence 1 and sentence 2 are merged into one sentence. Frames 27 and 28 between them are included in the merge, so the new sentence contains frames 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 (14 frames in total).
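The merge of nearby sentences might be sketched as below, assuming contiguous frames of equal length so that the gap in time is simply the number of intervening frames times the frame length. The function name `merge_close_sentences` and its parameter names are hypothetical.

```python
def merge_close_sentences(sentences, frame_ms=100, max_gap_ms=300):
    """Merge neighbouring (first, last) frame ranges separated by less than max_gap_ms."""
    merged = []
    for first, last in sorted(sentences):
        if merged:
            gap_frames = first - merged[-1][1] - 1     # frames lying between the two
            if gap_frames * frame_ms < max_gap_ms:
                merged[-1] = (merged[-1][0], last)     # absorb the gap and the sentence
                continue
        merged.append((first, last))
    return merged


print(merge_close_sentences([(22, 26), (29, 35)]))   # [(22, 35)]
```

The result spans frames 22 to 35, the 14 frames of the example above, because the 200 ms gap is below the 300 ms threshold.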
Step s104: performing spectral entropy analysis on each sentence.
In this step, a search is made forward and backward from the first and last frames of each sentence, respectively; if the next frame found belongs to another sentence, the two sentences are merged; if the energy of the next frame is below $e_t$ and the frame does not belong to another sentence, a Fourier transform is applied to the frame, the amplitudes in the 0-4000 Hz range are taken and divided into $z$ spectral bands of fixed width, where the intensity of each band is $v_i$, $i = 1, 2, \ldots, z$, the total intensity is $v_{sum}$, and $p_i$ is the probability of each band, computed as:
$$p_i = \frac{v_i}{v_{sum}}$$
The spectral entropy of the frame is then:
$$H = -\sum_{i=1}^{z} p_i \log p_i$$
The ratio of each frame's energy to its spectral entropy is the energy-entropy ratio, denoted $r$. An energy-entropy ratio threshold $r_t$ is set; if the energy-entropy ratio of the frame is not less than $r_t$, the frame is added to the sentence; the scan stops when it reaches the beginning or end of the voice stream;
Step s105: identifying noise sentences. It is determined whether the frame length of the independent sentence falls within a set short-sentence frame-length range; if so, the current independent sentence is compared with stored historical short independent-sentence samples, and if the matching degree is below a set value, the independent sentence is marked as a noise sentence. A machine learning method is used to check short sentences and judge whether they are human voice or noise; noise is discarded, further improving accuracy.
Step s106: obtaining the sentence segmentation. The independent sentences, obtained from the frame segments of the audio, that are not marked as noise sentences are taken as the sentence segmentation of the audio.
In a preferred embodiment, the method further includes, after step s103:
Step s1031: if the frame length of the independent sentence exceeds a set independent-sentence frame length, computing the spectral-entropy ratio of each frame of the independent sentence and, taking the frame with the lowest spectral-entropy ratio as the split point, splitting the independent sentence into two independent sentences.
Splitting long sentences. If the length of a sentence exceeds a specified time threshold, the sentence is split. The split proceeds as follows: a certain proportion of speech frames at the head and tail of the sentence is ignored, and the remaining speech frames are traversed. If the spectral-entropy ratio of a frame has already been computed, the spectral entropy is used as the weight w. If it has not been computed, the energy of the frame is used as the weight w. For each frame, if within the sentence there are nleft frames to its left and nright frames to its right, the split coefficient value ws is defined as follows:
By traversal, the frame that minimizes the split value ws of the sentence is found, and the sentence is divided at that frame into a left sentence and a right sentence. If either of the two still contains an overlong sentence, the same method is applied again, until no overlong sentence remains.
Filtering out short, meaningless sentences. A time threshold is specified; a sentence shorter than this time span may not be a person speaking. For such a sentence, the frame with the highest energy is taken and its Mel cepstral coefficients are computed. A pre-trained support vector machine (SVM) classifier then classifies it to judge whether it is a human voice. If it is not a human voice, the sentence is discarded.
The SVM classifier is trained as follows: a number of human voice samples are collected from lecture videos and live network broadcast videos as positive samples, and a number of typical non-human sound samples as negative samples. Mel cepstral coefficients are used as features for training, yielding the model parameters. (For the principle of support vector machines, refer to). Other machine learning methods may also be used here, for example classification by a deep neural network.
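The recursive splitting of long sentences might be sketched as follows. This sketch implements only the simpler rule of step s1031, cutting at the frame with the lowest ratio; the weighted split coefficient ws is not implemented because its formula is not reproduced here. The function name, the `skip` proportion of 0.2, the use of the energy-entropy ratio as the per-frame ratio, and the choice to assign the cut frame to the right-hand piece are all assumptions.

```python
def split_long_sentence(first, last, ratio_of, max_frames, skip=0.2):
    """Recursively cut (first, last) at the lowest-ratio frame until no piece
    is longer than max_frames. Returns a list of (first, last) pairs."""
    length = last - first + 1
    if length <= max_frames:
        return [(first, last)]
    margin = max(1, int(length * skip))      # frames ignored at head and tail
    # Fall back to every frame after the first when the margins leave none.
    candidates = range(first + margin, last - margin + 1) or range(first + 1, last + 1)
    cut = min(candidates, key=ratio_of)      # frame with the lowest ratio
    return (split_long_sentence(first, cut - 1, ratio_of, max_frames, skip)
            + split_long_sentence(cut, last, ratio_of, max_frames, skip))


# Hypothetical ratios for a 10-frame sentence with a clear dip at frame 4.
ratios = [5, 5, 5, 5, 0.1, 5, 5, 5, 5, 5]
print(split_long_sentence(0, 9, ratios.__getitem__, max_frames=6))
# [(0, 3), (4, 9)]
```

Because the cut always falls strictly inside the sentence, both pieces are shorter than the original and the recursion terminates.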
The present invention also provides an automatic segmentation system for audio sentence segmentation, as shown in Figure 2, comprising: a framing unit 101, an energy threshold acquisition unit 201, an independent sentence acquisition unit 301, a spectral entropy analysis unit 401, a noise sentence judging unit 501, and a sentence-segmentation acquisition unit 601.
The framing unit 101 is configured to obtain a plurality of frame segments from the audio;
The energy threshold acquisition unit 201 is configured to obtain an energy threshold $e_k$ from the energy value of each frame segment;
The independent sentence acquisition unit 301 is configured to obtain from the frame segments, according to the energy threshold $e_k$, those whose energy value exceeds the set energy threshold $e_t$; to take each such frame segment as a mid-sentence frame and scan the frames preceding or following it; and, if the energy of a preceding or following frame is below the set energy threshold $e_t$, to merge that frame with the mid-sentence frame, in order of frame start time, into an independent sentence.
The spectral entropy analysis unit 401 is configured to search forward and backward from the first and last frames of each sentence, respectively; if the next frame found belongs to another sentence, to merge the two sentences; if the energy of the next frame is below $e_t$ and the frame does not belong to another sentence, to apply a Fourier transform to the frame, take the amplitudes in the 0-4000 Hz range and divide them into $z$ spectral bands of fixed width, where the intensity of each band is $v_i$, $i = 1, 2, \ldots, z$, the total intensity is $v_{sum}$, and $p_i$ is the probability of each band. $p_i$ is computed as:
$$p_i = \frac{v_i}{v_{sum}}$$
The spectral entropy of the frame is then:
$$H = -\sum_{i=1}^{z} p_i \log p_i$$
The ratio of each frame's energy to its spectral entropy is the energy-entropy ratio, denoted $r$. An energy-entropy ratio threshold $r_t$ is set; if the energy-entropy ratio of the frame is not less than $r_t$, the frame is added to the sentence. The scan stops when it reaches the beginning or end of the voice stream.
The noise sentence judging unit 501 is configured to determine whether the frame length of the independent sentence falls within a set short-sentence frame-length range; if so, to compare the current independent sentence with stored historical short independent-sentence samples, and if the matching degree is below a set value, to mark the independent sentence as a noise sentence;
The sentence-segmentation acquisition unit 601 is configured to take the independent sentences, obtained from the frame segments of the audio, that are not marked as noise sentences as the sentence segmentation of the audio.
In a preferred embodiment, the framing unit 101 is further configured to: receive an audio file; and split the audio file according to a set splitting time to obtain a plurality of frame segments.
In a preferred embodiment, the energy threshold acquisition unit 201 is further configured to obtain the energy threshold $e_k$ from the mean of the energy values of the frame segments.
In a preferred embodiment, the independent sentence acquisition unit 301 is further configured to: if the energy of the preceding or following frame is below the set energy $e_t$, determine whether the interval between the current frame and the next frame is shorter than a set interval time, and if so, merge the mid-sentence frames, in order of frame start time, into an independent sentence.
In a preferred embodiment, the system further includes a long-sentence judging unit 3011;
The long-sentence judging unit is configured to: if the frame length of the independent sentence exceeds a set independent-sentence frame length, compute the spectral-entropy ratio of each frame of the independent sentence and, taking the frame with the lowest spectral-entropy ratio as the split point, split the independent sentence into two independent sentences.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall be covered by the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.