CN106373592A


Info

Publication number: CN106373592A
Application number: CN201610799384.7A
Authority: CN
Inventors: 胡飞
Original assignee: HUAKEFEIYANG Co Ltd
Current assignee: HUAKEFEIYANG Co Ltd
Priority date: 2016-08-31
Filing date: 2016-08-31
Publication date: 2017-02-01
Anticipated expiration: 2036-08-31
Also published as: CN106373592B

Abstract

The invention relates to a noise-tolerant audio sentence segmentation method and system. The method comprises: obtaining multiple frame segments from an audio; obtaining an energy threshold from the energy value of each frame segment; selecting, according to the energy threshold, a frame segment whose energy value exceeds the energy threshold Et; taking that frame segment as a mid-sentence frame and scanning its preceding or following frames; if the energy of a preceding or following frame is smaller than the set energy threshold Et, merging that frame with the mid-sentence frame in start order into an independent sentence; and then performing spectral entropy analysis on each independent sentence to obtain the final sentences. The method solves the prior-art problem that sentences cannot be segmented automatically when captions are aligned. It can process recorded audio and video as well as audio and video that are currently being played. For live network streams, the speech can be cut automatically so that subsequent steps such as dictation can be processed in parallel, which shortens the processing time.

Description

Noise-tolerant audio sentence segmentation processing method and system

Technical field

The present invention relates to the technical field of speech and caption processing, and more particularly to a noise-tolerant audio sentence segmentation processing method and system.

Background technology

At present, in caption production, speech is segmented into sentences mainly by hand. Manual segmentation requires listening to the entire speech once and, while dictating, marking the start point and end point of each sentence by tapping a shortcut key. Because tapping lags behind the speech, the resulting start and end points are offset and must be adjusted manually. The whole process consumes a great deal of time: for example, 30 minutes of audio needs 40 minutes to 1 hour of segmentation time, so productivity is extremely low. In live network broadcasting, if the speech is not segmented, manual dictation is difficult to parallelize, and because people dictate more slowly than the broadcast proceeds, real-time text-and-image live broadcasting is impossible. Relying on manual segmentation does not help either, because manual segmentation is also slower than the broadcast.

Summary of the invention

In view of the above defects of the prior art, the object of the invention is to provide a noise-tolerant audio sentence segmentation processing method and system, thereby solving the problems that, in the existing caption alignment process, sentences cannot be segmented automatically and noise is high.

The invention targets classroom recording and live network broadcasting and proposes an intelligent speech sentence segmentation method. Using speech analysis techniques, the method can quickly and automatically analyze recorded or captured voice data and detect speech segments that meet caption specifications, saving time in producing captions for video and audio.

To achieve the above object, the invention provides the following technical solution:

A noise-tolerant audio sentence segmentation processing method, comprising:

Step s101: obtain multiple frame segments from the audio;

Step s102: obtain an energy threshold e_k from the energy value of each frame segment;

Step s103: according to the energy threshold e_k, obtain from the frame segments a frame segment whose energy value exceeds the energy threshold e_t; taking this frame segment as a mid-sentence frame, scan its preceding or following frames; if the energy of a preceding or following frame is less than the set energy threshold e_t, merge that frame with the mid-sentence frame in frame start order to form an independent sentence;

Step s104: starting from the first and last frames of each sentence, search forward and backward respectively; if the next frame found belongs to another sentence, merge the two sentences; if the energy of the next frame is less than e_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in the range 0-4000 Hz, and divide them into z spectral bands of fixed width. The intensity of each band is v_i, i = 1, 2, …, z, the total intensity is v_sum, and p_i is the probability of each band, computed as:

p_{i} = \frac{v_{i}}{v_{sum}}

Then, the spectral entropy of the frame is:

h = -\sum_{i=1}^{z} p_{i} \log p_{i}

The ratio of the energy of each frame to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy ratio threshold r_t is set; if the energy-entropy ratio of the frame is not less than r_t, the frame is assigned to the sentence; if the scan reaches the beginning or the end of the voice stream, the scan stops;

Step s105: determine whether the frame length of the independent sentence falls within the set short-sentence frame-length range; if so, compare the current independent sentence with the stored historical short independent sentence samples; if the matching degree is less than a set value, mark the independent sentence as a noise sentence;

Step s106: take the independent sentences obtained from the frame segments of the audio that are not marked as noise sentences as the sentence segmentation of the audio.

In a preferred embodiment, step s101 includes:

Step s1011: receive an audio file;

Step s1012: split the audio file according to the set split time to obtain multiple frame segments.

In a preferred embodiment, step s102 includes: obtaining the energy threshold e_k from the average of the energy values of the frame segments.

In a preferred embodiment, the part of step s103 that reads "if the energy of a preceding or following frame is less than the set energy threshold e_t, merge that frame with the mid-sentence frame in frame start order to form an independent sentence" includes:

If the energy of a preceding or following frame is less than the set energy e_t, determine whether the interval between the current frame and the next frame is less than the set interval time; if so, merge the mid-sentence frame in frame start order into an independent sentence.

In a preferred embodiment, the following is also included after step s103:

Step s1031: if the frame length of the independent sentence exceeds the set independent frame length, compute the spectral entropy ratio of each frame of the independent sentence and, using the frame with the lowest spectral entropy ratio as the split point, split the independent sentence into two independent sentences.

The invention also provides an automatic segmentation system for audio sentence segmentation, comprising: a framing unit, an energy threshold acquiring unit, an independent sentence acquiring unit, and a spectral entropy analysis unit;

The framing unit is configured to obtain multiple frame segments from the audio;

The energy threshold acquiring unit is configured to obtain an energy threshold e_k from the energy value of each frame segment;

The independent sentence acquiring unit is configured to: according to the energy threshold e_k, obtain from the frame segments a frame segment whose energy value exceeds the energy threshold e_t; taking this frame segment as a mid-sentence frame, scan its preceding or following frames; and, if the energy of a preceding or following frame is less than the set energy threshold e_t, merge that frame with the mid-sentence frame in frame start order to form an independent sentence;

The spectral entropy analysis unit is configured to: starting from the first and last frames of each sentence, search forward and backward respectively; if the next frame found belongs to another sentence, merge the two sentences; if the energy of the next frame is less than e_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in the range 0-4000 Hz, and divide them into z spectral bands of fixed width. The intensity of each band is v_i, i = 1, 2, …, z, the total intensity is v_sum, and p_i is the probability of each band, computed as:

p_{i} = \frac{v_{i}}{v_{sum}}

Then, the spectral entropy of the frame is:

h = -\sum_{i=1}^{z} p_{i} \log p_{i}

The noise sentence judging unit is configured to determine whether the frame length of the independent sentence falls within the set short-sentence frame-length range and, if so, to compare the current independent sentence with the stored historical short independent sentence samples; if the matching degree is less than a set value, the independent sentence is marked as a noise sentence;

The sentence segmentation acquiring unit is configured to take the independent sentences obtained from the frame segments of the audio that are not marked as noise sentences as the sentence segmentation of the audio.

In a preferred embodiment, the framing unit is further configured to: receive an audio file; and split the audio file according to the set split time to obtain multiple frame segments.

In a preferred embodiment, the energy threshold acquiring unit is further configured to obtain the energy threshold e_k from the average of the energy values of the frame segments.

In a preferred embodiment, the independent sentence acquiring unit is further configured to: if the energy of a preceding or following frame is less than the set energy e_t, determine whether the interval between the current frame and the next frame is less than the set interval time and, if so, merge the mid-sentence frame in frame start order into an independent sentence.

In a preferred embodiment, the system further includes a long-sentence judging unit;

The long-sentence judging unit is configured to: if the frame length of the independent sentence exceeds the set independent frame length, compute the spectral entropy ratio of each frame of the independent sentence and, using the frame with the lowest spectral entropy ratio as the split point, split the independent sentence into two independent sentences.

The beneficial effects of the invention are as follows. The main computation of the method is carried out in the time domain, so it is fast. Limited local regions that may be either consonants or noise are analyzed in both the time domain and the frequency domain, which increases segmentation accuracy. Time-consuming spectral analysis is needed for only a few frames, so segmentation is both fast and accurate and is also strongly resistant to noise. The method automatically generates the time points for cutting speech, which saves work in editing audio and video captions. A segmentation method is designed that directly reuses existing computed results and computes no features a second time, so long sentences can be split quickly and no overlong sentences remain, which meets the needs of caption production. A machine learning method is used to examine short sentences and decide whether each is a human voice or noise; noise is discarded, which further improves accuracy. The method can process recorded audio and video as well as audio and video being broadcast live. For live network streams, the live speech can be cut automatically so that subsequent steps such as dictation can be processed in parallel, which speeds up processing.

Brief description of the drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of the noise-tolerant audio sentence segmentation processing method in one embodiment of the invention;

Fig. 2 is a logical connection diagram of the noise-tolerant audio sentence segmentation processing system in one embodiment of the invention.

Detailed description of the embodiments

The technical solutions of the invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

The noise-tolerant audio sentence segmentation processing method of the invention, as shown in Fig. 1, comprises:

Step s101: obtain multiple frame segments from the audio.

The invention may be installed on a server, a personal computer, or a mobile computing device. The computing terminal referred to below may be any of these. First, an audio-video file is uploaded to the server, or opened on the personal computer or mobile computing device. The computing device then extracts the audio stream from the audio-video file and converts it to signed single-channel data at a fixed sampling frequency. The data is then divided into frames using preset framing parameters.

Step s1011: receive an audio file; step s1012: split the audio file according to the set split time to obtain multiple frame segments.

The audio is divided into frames. Each frame is 10 ms to 500 ms long. In speech recognition, adjacent frames must overlap so that speech can be recognized accurately. The purpose of the invention is not speech recognition, so frames may overlap, may not overlap, or may even be separated by a gap of 0 ms to 500 ms. Voice segmentation therefore yields fewer frames than speech recognition requires, which reduces the amount of computation and improves speed. Let f₁, f₂, …, f_m denote the frames obtained. Each frame k has n samples s_k1, s_k2, …, s_kn, whose amplitude values are f_k1, f_k2, …, f_kn. The start time and end time of each frame are recorded.

Speech data is the sequence of real numbers obtained by sampling sound at a fixed sample rate. A sample rate of 16 kHz means 16000 samples are taken per second. Framing groups this sequence into fixed-length time segments, each of which is one unit of analysis. For example, at a 16 kHz sample rate with 100 ms frames, one frame contains 1600 samples. Framing determines the granularity of control. In this patent frames are generally 100 ms long, so a video of n seconds is divided into 10n frames. Frames need not be adjacent: for example, with a 100 ms gap between consecutive frames, a video of n seconds yields 5n frames. Increasing the gap between frames reduces the total number of frames and speeds up analysis, at the cost of lower time precision.
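The framing arithmetic above can be checked with a short sketch. It is only an illustration under stated assumptions: the function name `split_into_frames`, its parameters, and the choice to drop an incomplete trailing frame are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[float, float, List[float]]  # (start time in s, end time in s, samples)


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: int = 100, gap_ms: int = 0) -> List[Frame]:
    """Cut a mono sample sequence into fixed-length frames.

    gap_ms is the silence skipped between consecutive frames (0 = adjacent).
    Assumption: an incomplete frame at the end of the audio is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000          # samples per frame
    step = sample_rate * (frame_ms + gap_ms) // 1000    # distance between frame starts
    frames: List[Frame] = []
    start = 0
    while start + frame_len <= len(samples):
        end = start + frame_len
        # Record start and end time with each frame, as the text requires.
        frames.append((start / sample_rate, end / sample_rate, list(samples[start:end])))
        start += step
    return frames


if __name__ == "__main__":
    audio = [0.0] * (16000 * 3)  # 3 seconds of silence at 16 kHz
    print(len(split_into_frames(audio, 16000)))              # 10n frames for n = 3
    print(len(split_into_frames(audio, 16000, gap_ms=100)))  # 5n frames for n = 3
```

With 100 ms frames at 16 kHz each frame holds 1600 samples, matching the figures in the paragraph above.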

Step s102: obtain an energy threshold e_k from the energy value of each frame segment.

In this step:

The energy e_k of each frame is computed. Energy may be defined in ways including, but not limited to, the sum of squared amplitudes and the sum of absolute amplitudes.

The energy formula under the sum-of-squares definition is:

e_{k} = \sum_{i=1}^{n} f_{ki}^{2}

The energy formula under the absolute-value definition is:

e_{k} = \sum_{i=1}^{n} | f_{ki} |
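As a minimal sketch, the two energy definitions can be written directly from the formulas. The function names are hypothetical; `mean_energy` illustrates the per-sample average that one embodiment of step s102 uses and is an assumption about how that average is formed.

```python
from typing import Callable, Sequence


def energy_sum_squares(frame: Sequence[float]) -> float:
    """e_k as the sum of squared sample amplitudes."""
    return sum(x * x for x in frame)


def energy_abs_sum(frame: Sequence[float]) -> float:
    """e_k as the sum of absolute sample amplitudes."""
    return sum(abs(x) for x in frame)


def mean_energy(frame: Sequence[float],
                energy: Callable[[Sequence[float]], float] = energy_sum_squares) -> float:
    """Frame energy divided by the number of samples in the frame."""
    return energy(frame) / len(frame)


if __name__ == "__main__":
    frame = [0.5, -0.5, 0.25, 0.0]
    print(energy_sum_squares(frame), energy_abs_sum(frame), mean_energy(frame))
```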

An energy threshold e_t is set, and runs of adjacent speech frames whose energies are all at least e_t are searched for, giving the speech sentences s₁, s₂, …, s_j. That is:

s_{i} = \{ f_{k} \mid k = a, a+1, a+2, \ldots, a+b;\ e_{k} \ge e_{t};\ e_{a-1} < e_{t};\ e_{a+b+1} < e_{t} \}
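The set definition above describes maximal runs of consecutive frames with energy at or above e_t. A sketch of that grouping follows; the function name `find_sentences` and the representation of a sentence as an inclusive (first, last) pair of frame indices are illustrative choices, not part of the patent.

```python
from typing import List, Sequence, Tuple


def find_sentences(energies: Sequence[float], e_t: float) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) frame indices of each run with e_k >= e_t."""
    sentences: List[Tuple[int, int]] = []
    start = None  # index where the current run began, or None outside a run
    for k, e in enumerate(energies):
        if e >= e_t:
            if start is None:
                start = k
        elif start is not None:
            sentences.append((start, k - 1))
            start = None
    if start is not None:  # a run that lasts until the final frame
        sentences.append((start, len(energies) - 1))
    return sentences


if __name__ == "__main__":
    print(find_sentences([0.0, 0.2, 0.3, 0.0, 0.0, 0.4], e_t=0.1))
```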

In another embodiment, step s102 includes: obtaining the energy threshold e_k from the average of the energy values of the frame segments. That is, the energy value obtained in the previous step is divided by the number of samples to give the average energy. The energy threshold is a threshold on the average energy per frame; it is usually set empirically, commonly to a value between 0.001 and 0.01, and the user can adjust it manually.

Step s103: merge into independent sentences.

According to the energy threshold e_k, a frame segment whose energy value exceeds the energy threshold e_t is obtained from the frame segments; taking this frame segment as a mid-sentence frame, its preceding or following frames are scanned; if the energy of a preceding or following frame is less than the set energy threshold e_t, that frame is merged with the mid-sentence frame in frame start order to form an independent sentence.

The part of step s103 that reads "if the energy of a preceding or following frame is less than the set energy threshold e_t, merge that frame with the mid-sentence frame in frame start order to form an independent sentence" includes: if the energy of a preceding or following frame is less than the set energy e_t, determine whether the interval between the current frame and the next frame is less than the set interval time; if so, merge the mid-sentence frame in frame start order into an independent sentence.

Starting from the first and last frames of each sentence, the search proceeds forward and backward respectively. If the next frame found belongs to another sentence, the two sentences are merged. If the energy of the next frame is less than e_t and the frame does not belong to another sentence, a Fourier transform is applied to the frame, the amplitudes in the range 0-4000 Hz are taken, and they are divided into z spectral bands of fixed width. The intensity of each band is v_i, i = 1, 2, …, z, the total intensity is v_sum, and p_i is the probability of each band, computed as:

p_{i} = \frac{v_{i}}{v_{sum}}

Then, the spectral entropy of the frame is:

h = -\sum_{i=1}^{z} p_{i} \log p_{i}

The ratio of the energy of each frame to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy ratio threshold r_t is set; if the energy-entropy ratio of the frame is not less than r_t, the frame is assigned to the sentence. If the scan reaches the beginning or the end of the voice stream, the scan stops.
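The band intensities, spectral entropy, and energy-entropy ratio can be sketched with the standard library alone. This is an illustration under assumptions, not the patent's implementation: the logarithm base is not stated in the text, so the natural logarithm is assumed; a plain discrete Fourier transform is used for brevity (slow for long frames); returning infinity when the entropy is zero is a hypothetical choice; and all function names are illustrative.

```python
import cmath
import math
from typing import List, Sequence


def band_intensities(frame: Sequence[float], sample_rate: int,
                     z: int = 8, max_hz: float = 4000.0) -> List[float]:
    """Sum DFT amplitudes from 0 to max_hz into z bands of equal width (the v_i)."""
    n = len(frame)
    width = max_hz / z
    bands = [0.0] * z
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if freq > max_hz:
            break
        # Naive DFT coefficient for bin k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        idx = min(int(freq // width), z - 1)  # a bin exactly at max_hz joins the last band
        bands[idx] += abs(coeff)
    return bands


def spectral_entropy(bands: Sequence[float]) -> float:
    """h = -sum(p_i * log(p_i)) with p_i = v_i / v_sum (natural log assumed)."""
    v_sum = sum(bands)
    if v_sum == 0:
        return 0.0
    h = 0.0
    for v in bands:
        p = v / v_sum
        if p > 0:  # the term for p = 0 is taken as 0
            h -= p * math.log(p)
    return h


def energy_entropy_ratio(energy: float, entropy: float) -> float:
    """r = energy / spectral entropy; infinity for zero entropy is an assumption."""
    return energy / entropy if entropy > 0 else math.inf


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
    v = band_intensities(tone, 8000, z=4)
    print(spectral_entropy(v), spectral_entropy([1.0, 1.0, 1.0, 1.0]))
```

A pure tone concentrates its intensity in one band and gives an entropy near zero, while a flat spectrum over z bands gives the maximum value log z.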

For example, suppose there are 10 speech frames with the following energies:

0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12

With 0.003 as the threshold, step 3 yields three sentences:

Sentence 1 contains: 0.05, 0.12

Sentence 2 contains: 0.004, 0.1, 0.2, 0.4, 0.5

Sentence 3 contains: 0.12

Take sentence 2 as an example and scan forward. The frame before it is 0.002; it belongs to no sentence and its energy is below the threshold 0.003, so a Fourier transform is applied to it and its energy-entropy ratio is computed. If the energy-entropy ratio is less than the energy-entropy ratio threshold, the frame is considered not to belong to sentence 2 and the forward scan ends. If the energy-entropy ratio is not less than that threshold, the frame is considered to belong to sentence 2 and the scan continues forward to the next frame. The next frame is 0.12, which belongs to sentence 1, so sentence 1 and sentence 2 are merged. After the merge the foremost frame is 0.05, which is the first frame, so no further forward scanning is possible and the forward scan ends. Backward scanning follows the same logic as forward scanning: when a frame whose energy is below the energy threshold is met, its energy-entropy ratio is computed; if the ratio is less than the energy-entropy ratio threshold the scan ends, otherwise it continues. When another sentence is met, the two are merged and the scan continues after the merge.
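The scan-and-merge logic of this example can be sketched as two passes over per-frame flags. The names `extend_sentences` and `runs`, and the numeric energy-entropy ratios and threshold used in the demonstration, are hypothetical; the patent gives no ratio values. The callback `ratio_of` is only invoked for low-energy frames that touch a sentence, which mirrors the point that spectral analysis is needed for only a few frames.

```python
from typing import Callable, List, Sequence, Tuple


def runs(flags: Sequence[bool]) -> List[Tuple[int, int]]:
    """Inclusive (first, last) index pairs of each run of True values."""
    out: List[Tuple[int, int]] = []
    start = None
    for k, flag in enumerate(flags):
        if flag and start is None:
            start = k
        elif not flag and start is not None:
            out.append((start, k - 1))
            start = None
    if start is not None:
        out.append((start, len(flags) - 1))
    return out


def extend_sentences(energies: Sequence[float], ratio_of: Callable[[int], float],
                     e_t: float, r_t: float) -> List[Tuple[int, int]]:
    """Grow sentences over adjacent low-energy frames whose ratio is at least r_t."""
    n = len(energies)
    in_sentence = [e >= e_t for e in energies]
    for step in (1, -1):  # first grow towards later frames, then towards earlier ones
        order = range(n) if step == 1 else range(n - 1, -1, -1)
        for k in order:
            neighbour = k - step
            if (not in_sentence[k] and 0 <= neighbour < n
                    and in_sentence[neighbour] and ratio_of(k) >= r_t):
                in_sentence[k] = True
    # Sentences that now touch each other come out as a single run (the merge).
    return runs(in_sentence)


if __name__ == "__main__":
    energies = [0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12]
    assumed_ratios = {2: 5.0, 8: 0.1}  # hypothetical values for the two quiet frames
    print(extend_sentences(energies, lambda k: assumed_ratios[k], e_t=0.003, r_t=1.0))
```

Under these assumed ratios the quiet frame 0.002 joins the speech, so sentences 1 and 2 merge into one sentence covering frames 0 to 7, while the frame 0.001 is rejected and sentence 3 stays separate.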

Afterwards, sentences that are close together are merged. For neighbouring sentences, the interval between them is computed; if the interval is less than the specified time threshold, the two sentences are merged.

This step merges further. For example, assume each frame is 100 milliseconds long, sentence 1 contains frames 22, 23, 24, 25, 26 (5 frames in total), sentence 2 contains frames 29, 30, 31, 32, 33, 34, 35 (7 frames in total), and there are no other sentences between the two. The two sentences are 2 frames apart, that is, 200 milliseconds. Assume the specified time threshold is 300 milliseconds. Because 200 milliseconds is less than 300 milliseconds, sentence 1 and sentence 2 are merged into one sentence. Frames 27 and 28 between sentence 1 and sentence 2 are merged in as well, so the new sentence contains frames 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 (14 frames in total).
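The gap-based merge in this example can be sketched as follows. The function name and parameters are illustrative, and the sketch assumes frames are contiguous, so that the gap between two sentences is the number of frames between them multiplied by the frame length.

```python
from typing import List, Sequence, Tuple


def merge_close_sentences(sentences: Sequence[Tuple[int, int]],
                          frame_ms: int, max_gap_ms: int) -> List[Tuple[int, int]]:
    """Merge neighbouring sentences separated by less than max_gap_ms.

    Sentences are inclusive (first, last) frame index pairs; frames lying
    between two merged sentences become part of the merged sentence.
    """
    merged: List[Tuple[int, int]] = []
    for first, last in sorted(sentences):
        if merged:
            prev_first, prev_last = merged[-1]
            gap_ms = (first - prev_last - 1) * frame_ms
            if gap_ms < max_gap_ms:
                merged[-1] = (prev_first, max(prev_last, last))
                continue
        merged.append((first, last))
    return merged


if __name__ == "__main__":
    result = merge_close_sentences([(22, 26), (29, 35)], frame_ms=100, max_gap_ms=300)
    first, last = result[0]
    print(result, last - first + 1)  # one sentence of 14 frames
```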

Step s104: perform spectral entropy analysis on each sentence.

In this step, starting from the first and last frames of each sentence, the search proceeds forward and backward respectively; if the next frame found belongs to another sentence, the two sentences are merged; if the energy of the next frame is less than e_t and the frame does not belong to another sentence, a Fourier transform is applied to the frame, the amplitudes in the range 0-4000 Hz are taken, and they are divided into z spectral bands of fixed width. The intensity of each band is v_i, i = 1, 2, …, z, the total intensity is v_sum, and p_i is the probability of each band, computed as:

p_{i} = \frac{v_{i}}{v_{sum}}

Then, the spectral entropy of the frame is:

h = -\sum_{i=1}^{z} p_{i} \log p_{i}

Step s105: identify noise sentences. Determine whether the frame length of the independent sentence falls within the set short-sentence frame-length range; if so, compare the current independent sentence with the stored historical short independent sentence samples; if the matching degree is less than a set value, mark the independent sentence as a noise sentence. A machine learning method is used to examine short sentences and decide whether each is a human voice or noise; noise is discarded, which further improves accuracy.

Step s106: obtain the sentence segmentation. The independent sentences obtained from the frame segments of the audio that are not marked as noise sentences are taken as the sentence segmentation of the audio.

In a preferred embodiment, the following is also included after step s103:

Split long sentences. If the length of a sentence exceeds the specified time threshold, the sentence is split, as follows. A certain proportion of speech frames at the head and at the tail of the sentence is ignored, and the remaining speech frames are traversed. If the spectral entropy ratio has already been computed for a frame, it is used as the weight w; otherwise the energy of the frame is used as the weight w. For each frame, if the sentence has nleft frames to the left of the frame and nright frames to the right, the split coefficient ws is defined as follows: by traversal, the frame that minimizes the split coefficient ws of the sentence is found, and the sentence is divided there into a left sentence and a right sentence. If a long sentence still exists in the two resulting sentences, the same method is applied again until no long sentences remain.

Filter out sentences that are too short to be meaningful. A time threshold is specified; a sentence shorter than this length may not be a person speaking. For such a sentence, the frame with the highest energy is taken and its Mel cepstral coefficients are computed. A support vector machine (SVM) classifier trained in advance classifies it to decide whether it is a human voice. If it is not a human voice, the sentence is discarded.

The SVM classifier is trained as follows: human voice samples collected from lecture videos and live network videos serve as positive samples, and typical non-human sound samples serve as negative samples. Training uses Mel cepstral coefficients as features and yields the model parameters. (principle of support vector machine refers to). Other machine learning methods, such as deep neural networks, may also be used for the classification.
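The recursive long-sentence split can be sketched as below. One caveat is important: the formula for the split coefficient ws is not reproduced in this text, so the sketch substitutes the frame weight w itself as the score to minimize. That substitution is an assumption made only to keep the example runnable and is not the patent's coefficient. The function name, the `trim` proportion, and the convention that the cut frame ends the left sentence are also hypothetical.

```python
from typing import List, Sequence, Tuple


def split_long_sentence(sentence: Tuple[int, int], weights: Sequence[float],
                        max_frames: int, trim: float = 0.2) -> List[Tuple[int, int]]:
    """Recursively split a sentence until no part is longer than max_frames.

    weights[k] is the weight w of frame k (spectral entropy ratio if available,
    else energy). ASSUMPTION: w stands in for the split coefficient ws.
    trim (0 <= trim < 0.5) is the share of frames ignored at each end.
    """
    first, last = sentence
    length = last - first + 1
    if length <= max_frames or length < 2:
        return [sentence]
    skip = int(length * trim)
    # Candidate cut frames, never the final frame, so both halves are non-empty.
    candidates = range(first + skip, min(last - skip, last - 1) + 1)
    cut = min(candidates, key=lambda k: weights[k])
    left, right = (first, cut), (cut + 1, last)
    return (split_long_sentence(left, weights, max_frames, trim)
            + split_long_sentence(right, weights, max_frames, trim))


if __name__ == "__main__":
    w = [5, 4, 3, 0.1, 3, 4, 5, 4, 0.2, 5]
    print(split_long_sentence((0, 9), w, max_frames=6))
```

Because each split produces two strictly shorter parts, the recursion ends once every part fits within `max_frames`, which corresponds to the requirement that no long sentence remains.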

The invention also provides an automatic segmentation system for audio sentence segmentation, as shown in Fig. 2, comprising: a framing unit 101, an energy threshold acquiring unit 201, an independent sentence acquiring unit 301, a spectral entropy analysis unit 401, a noise sentence judging unit 501, and a sentence segmentation acquiring unit 601.

The framing unit 101 is configured to obtain multiple frame segments from the audio;

The energy threshold acquiring unit 201 is configured to obtain an energy threshold e_k from the energy value of each frame segment;

The independent sentence acquiring unit 301 is configured to: according to the energy threshold e_k, obtain from the frame segments a frame segment whose energy value exceeds the energy threshold e_t; taking this frame segment as a mid-sentence frame, scan its preceding or following frames; and, if the energy of a preceding or following frame is less than the set energy threshold e_t, merge that frame with the mid-sentence frame in frame start order to form an independent sentence.

The spectral entropy analysis unit 401 is configured to: starting from the first and last frames of each sentence, search forward and backward respectively; if the next frame found belongs to another sentence, merge the two sentences; if the energy of the next frame is less than e_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in the range 0-4000 Hz, and divide them into z spectral bands of fixed width. The intensity of each band is v_i, i = 1, 2, …, z, the total intensity is v_sum, and p_i is the probability of each band, computed as:

p_{i} = \frac{v_{i}}{v_{sum}}

Then, the spectral entropy of the frame is:

h = -\sum_{i=1}^{z} p_{i} \log p_{i}

The noise sentence judging unit 501 is configured to determine whether the frame length of the independent sentence falls within the set short-sentence frame-length range and, if so, to compare the current independent sentence with the stored historical short independent sentence samples; if the matching degree is less than a set value, the independent sentence is marked as a noise sentence;

The sentence segmentation acquiring unit 601 is configured to take the independent sentences obtained from the frame segments of the audio that are not marked as noise sentences as the sentence segmentation of the audio.

In a preferred embodiment, the framing unit 101 is further configured to: receive an audio file; and split the audio file according to the set split time to obtain multiple frame segments.

In a preferred embodiment, the energy threshold acquiring unit 201 is further configured to obtain the energy threshold e_k from the average of the energy values of the frame segments.

In a preferred embodiment, the independent sentence acquiring unit 301 is further configured to: if the energy of a preceding or following frame is less than the set energy e_t, determine whether the interval between the current frame and the next frame is less than the set interval time and, if so, merge the mid-sentence frame in frame start order into an independent sentence.

In a preferred embodiment, the system includes: a long-sentence judging unit 3011;

The above are only specific embodiments of the invention, but the scope of protection of the invention is not limited to them. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be defined by the scope of the claims.

Claims

1. A noise-tolerant audio sentence segmentation processing method, comprising:

Step s101: obtain multiple frame segments from the audio;

Step s103: according to the energy threshold e_k, obtain from the frame segments a frame segment whose energy value exceeds the energy threshold e_t; taking this frame segment as a mid-sentence frame, scan its preceding or following frames; if the energy of a preceding or following frame is less than the set energy threshold e_t, merge that frame with the mid-sentence frame in frame start order to form an independent sentence;

Step s104: starting from the first and last frames of each sentence, search forward and backward respectively; if the next frame found belongs to another sentence, merge the two sentences; if the energy of the next frame is less than e_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in the range 0-4000 Hz, and divide them into z spectral bands of fixed width. The intensity of each band is v_i, i = 1, 2, …, z, the total intensity is v_sum, and p_i is the probability of each band, computed as:

p_{i} = \frac{v_{i}}{v_{sum}}

Then, the spectral entropy of the frame is:

h = -\sum_{i=1}^{z} p_{i} \log p_{i}

The ratio of the energy of each frame to its spectral entropy is the energy-entropy ratio, denoted r. An energy-entropy ratio threshold r_t is set; if the energy-entropy ratio of the frame is not less than r_t, the frame is assigned to the sentence; if the scan reaches the beginning or the end of the voice stream, the scan stops;

Step s105, judges whether the frame length of described independent sentence is the short sentence frame length scope setting, if so, then by historical storageShort independent sentence specimen and currently independent sentence are contrasted, if matching degree is less than setting value, independent sentence are designated noise sentence；

Step S106: taking the independent sentences obtained from the framing segments of the audio that are not marked as noise sentences as the punctuation of the audio.
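The spectral entropy and energy-entropy ratio of step S104 can be illustrated with a minimal Python sketch. It is not taken from the claims: the function names, the default z = 16, the use of the sum of squared samples as the frame energy, the natural logarithm, and the handling of silent frames are all assumptions made for illustration.

```python
import numpy as np

def spectral_entropy(frame, sample_rate, z=16, f_max=4000.0):
    """Spectral entropy h of one frame over 0..f_max Hz, using z bands."""
    frame = np.asarray(frame, dtype=float)
    amplitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    amplitude = amplitude[freqs <= f_max]        # keep 0-4000 Hz only
    # Split the retained bins into z contiguous bands of (nearly) equal width.
    bands = np.array_split(amplitude, z)
    v = np.array([band.sum() for band in bands])  # band intensities v_i
    v_sum = v.sum()
    if v_sum == 0.0:
        return 0.0            # assumption: a silent frame has entropy 0
    p = v / v_sum             # p_i = v_i / v_sum
    p = p[p > 0]              # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

def energy_entropy_ratio(frame, sample_rate, z=16):
    """Ratio r of frame energy to spectral entropy."""
    frame = np.asarray(frame, dtype=float)
    energy = float(np.sum(frame ** 2))  # assumption: energy = sum of squares
    h = spectral_entropy(frame, sample_rate, z)
    if h == 0.0:
        # assumption: avoid division by zero for degenerate frames
        return 0.0 if energy == 0.0 else float("inf")
    return energy / h
```

Under these assumptions, a frame would be assigned to the neighbouring sentence when `energy_entropy_ratio(frame, sample_rate)` is not less than the chosen threshold r_t.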

2. The noise-tolerant audio punctuation processing method according to claim 1, characterized in that step S101 comprises:

Step S1011: receiving an audio file;

3. The noise-tolerant audio punctuation processing method according to claim 1 or 2, characterized in that step S102 comprises: obtaining the energy threshold e_k from the mean of the energy values of the framing segments.
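As a minimal sketch of the mean-energy threshold, the following Python fragment splits a sample array into fixed-length framing segments and averages their energies. The function names, the definition of energy as the sum of squared samples, and the dropping of a trailing partial frame are assumptions for illustration only.

```python
import numpy as np

def split_into_frames(samples, frame_len):
    """Cut samples into consecutive frames of frame_len samples each.

    Assumption: a trailing partial frame is dropped.
    """
    samples = np.asarray(samples, dtype=float)
    n_frames = len(samples) // frame_len
    return samples[: n_frames * frame_len].reshape(n_frames, frame_len)

def frame_energies(frames):
    """Energy of each frame (assumption: sum of squared samples)."""
    return np.sum(frames ** 2, axis=1)

def mean_energy_threshold(energies):
    """Energy threshold e_k as the mean of the per-frame energy values."""
    return float(np.mean(energies))
```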

4. The noise-tolerant audio punctuation processing method according to claim 1, characterized in that the step, in step S103, of "if the energy value of a preceding or following frame is less than the set energy threshold e_t, merging that frame and the mid-sentence frame, in frame start order, into an independent sentence" comprises:

If the energy value of the preceding or following frame is less than the set energy e_t, judging whether the interval time between the current frame and the next frame is less than a set interval time; if so, merging the mid-sentence frame, in frame start order, into the independent sentence.
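One possible reading of this interval check can be sketched in Python as follows. Frames whose energy exceeds e_t are treated as mid-sentence frames, and a run of quieter frames is absorbed into the sentence only when it is shorter than the set interval. The function name, the measurement of the interval in frames, and the inclusive (start, end) index pairs are assumptions, not details given in the claims.

```python
def group_sentences(energies, e_t, max_gap_frames):
    """Group frame indices into sentences as inclusive (start, end) pairs.

    A quiet run shorter than max_gap_frames between two loud frames is
    merged into the same sentence; a longer run closes the sentence.
    """
    sentences = []
    start = None      # first frame of the sentence being built
    last_loud = None  # most recent frame whose energy exceeded e_t
    for i, energy in enumerate(energies):
        if energy > e_t:
            if start is None:
                start = i
            elif i - last_loud - 1 >= max_gap_frames:
                # The pause was too long: close the previous sentence.
                sentences.append((start, last_loud))
                start = i
            last_loud = i
    if start is not None:
        sentences.append((start, last_loud))
    return sentences
```

For example, with energies `[0, 5, 6, 0, 7, 0, 0, 0, 8, 9, 0]`, `e_t = 1`, and `max_gap_frames = 2`, the single quiet frame at index 3 is absorbed, while the three quiet frames at indices 5 to 7 separate two sentences.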

5. The noise-tolerant audio punctuation processing method according to claim 1 or 4, characterized by further comprising, after step S103:

Step S1031: if the frame length of the independent sentence exceeds a set independent frame length, computing the spectral entropy ratio of each frame of the independent sentence, taking the frame with the lowest spectral entropy ratio as the split point, and splitting the independent sentence into two independent sentences.
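The long-sentence split can be sketched in Python as below. The sketch assumes that the per-frame ratios have already been computed and are indexed by frame number, that sentences are inclusive (start, end) index pairs, and that the split-point frame becomes the first frame of the second sentence; none of these details are fixed by the claims.

```python
def split_long_sentence(sentence, ratios, max_frames):
    """Split an over-long sentence at its lowest-ratio frame.

    sentence   -- inclusive (start, end) frame indices
    ratios     -- per-frame ratio values, indexed by frame number
    max_frames -- the set independent frame length
    """
    start, end = sentence
    length = end - start + 1
    if length <= max_frames or length < 2:
        return [sentence]
    # Exclude the first frame so that neither resulting sentence is empty.
    candidates = range(start + 1, end + 1)
    cut = min(candidates, key=lambda i: ratios[i])
    return [(start, cut - 1), (cut, end)]
```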

6. An automatic segmentation system for audio punctuation, comprising: a framing unit, an energy threshold acquisition unit, an independent sentence acquisition unit, a noise sentence judging unit, a punctuation acquisition unit, and a spectral entropy analysis unit;

The independent sentence acquisition unit is configured to, according to the energy threshold e_k, obtain from the framing segments a framing segment whose energy value exceeds the energy threshold e_t; then, taking this framing segment as a mid-sentence frame, scan the frames preceding or following it; and, if the energy value of a preceding or following frame is less than the set energy threshold e_t, merge that frame and the mid-sentence frame, in frame start order, into an independent sentence;

The spectral entropy analysis unit is configured to search forward from the first frame and backward from the last frame of each sentence; if the next frame found belongs to another sentence, merge the two sentences; if the energy of the next frame is less than e_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes over 0-4000 Hz, and divide them into z spectral bands of fixed width, the intensity of each band being v_i, i = 1, 2, …, z. The total intensity is v_sum, and p_i is the probability of each band, computed as:

p_{i} = \frac{v_{i}}{v_{sum}}

Then, the spectral entropy of this frame is:

h = -\sum_{i=1}^{z} p_{i} \log p_{i}

The noise sentence judging unit is configured to judge whether the frame length of the independent sentence falls within a set short-sentence frame length range; if so, compare the historically stored short independent sentence samples with the current independent sentence; and, if the matching degree is less than a set value, mark the independent sentence as a noise sentence;

The punctuation acquisition unit is configured to take the independent sentences obtained from the framing segments of the audio that are not marked as noise sentences as the punctuation of the audio.

7. The automatic segmentation system for audio punctuation according to claim 6, characterized in that the framing unit is further configured to: receive an audio file; and split the audio file according to a set split time to obtain multiple framing segments.

8. The automatic segmentation system for audio punctuation according to claim 6 or 7, characterized in that the energy threshold acquisition unit is further configured to obtain the energy threshold e_k from the mean of the energy values of the framing segments.

9. The automatic segmentation system for audio punctuation according to claim 6, characterized in that the independent sentence acquisition unit is further configured to: if the energy value of the preceding or following frame is less than the set energy e_t, judge whether the interval time between the current frame and the next frame is less than a set interval time; and, if so, merge the mid-sentence frame, in frame start order, into the independent sentence.

10. The automatic segmentation system for audio punctuation according to claim 6 or 9, characterized by further comprising: a long sentence judging unit;

The long sentence judging unit is configured to, if the frame length of the independent sentence exceeds a set independent frame length, compute the spectral entropy ratio of each frame of the independent sentence, take the frame with the lowest spectral entropy ratio as the split point, and split the independent sentence into two independent sentences.