CN106373592B

Info

Publication number: CN106373592B
Application number: CN201610799384.7A
Authority: CN
Inventors: 胡飞
Original assignee: HUAKEFEIYANG Co Ltd
Current assignee: HUAKEFEIYANG Co Ltd
Priority date: 2016-08-31
Filing date: 2016-08-31
Publication date: 2019-04-23
Anticipated expiration: 2036-08-31
Also published as: CN106373592A

Abstract

A noise-tolerant audio sentence segmentation method and system, comprising: obtaining multiple frames from the audio; obtaining an energy measure from the energy value of each frame; based on that measure, selecting the frames whose energy value exceeds a set energy threshold E_t; taking each such frame as a mid-sentence frame and scanning its preceding or following frame; if the energy measure of the preceding or following frame is below the set energy threshold E_t, merging the frames with the mid-sentence frame, in order of frame start time, into an independent sentence; and then applying spectral entropy analysis to each independent sentence to obtain the final segmented sentences. This solves the problem that existing subtitle alignment workflows cannot segment sentences automatically. The invention can process both recorded audio/video and live audio/video. For a live network stream, the live speech can be cut automatically, so that downstream stages such as dictation can run in parallel, which shortens processing time.

Description

Noise-tolerant audio sentence segmentation method and system

Technical field

The present invention relates to the field of speech and subtitle processing, and in particular to a noise-tolerant audio sentence segmentation method and system.

Background art

In subtitle production today, sentence segmentation of speech is done mainly by hand. Manual segmentation requires listening to the entire recording once and, while dictating, tapping a shortcut key to mark the start point and end point of each sentence. Because the tapping lags behind the speech, the start and end points obtained are offset and must be adjusted by hand. The whole process consumes a great deal of time. For example, 30 minutes of audio takes 40 minutes to 1 hour to segment, so productivity is extremely low. In live network streaming, if dictation is done manually without segmentation, the work is hard to parallelize, and human dictation is slower than the live stream; without parallelization, real-time live broadcasting with accompanying text is not possible. Manual segmentation does not help either: it is also slower than playback, so real-time live broadcasting remains hard to achieve.

Summary of the invention

In view of the above defects in the prior art, the object of the present invention is to provide a noise-tolerant audio sentence segmentation method and system, in order to solve the problems that existing subtitle alignment workflows cannot segment sentences automatically and suffer from high noise.

The invention targets classroom recordings and live network streaming and proposes an intelligent speech segmentation method. Using speech analysis techniques, the method quickly and automatically analyzes recorded or captured audio data and detects speech segments that meet subtitle specifications, which saves time in audio/video subtitle production.

In order to achieve the above object, the invention provides the following technical scheme:

A noise-tolerant audio sentence segmentation method, comprising:

Step S101: obtain multiple frames from the audio;

Step S102: obtain the energy measure E_k from the energy value of each frame;

Step S103: based on the energy measure E_k, select the frames whose energy value exceeds the set energy threshold E_t; taking each such frame as a mid-sentence frame, scan its preceding or following frame; if the energy measure of the preceding or following frame is below the set energy threshold E_t, merge the frames with the mid-sentence frame, in order of frame start time, into an independent sentence;

Step S104: starting from the first and last frames of each sentence, search toward earlier and later frames respectively; if the next frame found belongs to another sentence, merge the two sentences; if the energy of the next frame is below the set energy threshold E_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in the range 0-4000 Hz, and divide them into z spectral bands of fixed width, where the intensity of band i is V_i (i = 1, 2, …, z), the total intensity is V_sum, and P_i is the probability of band i, calculated as:

$$P_i = \frac{V_i}{V_{sum}}$$

The spectral entropy of the frame is then:

$$H = -\sum_{i=1}^{z} P_i \log P_i$$

The ratio of a frame's energy to its spectral entropy is its energy-entropy ratio, denoted R. An energy-entropy ratio threshold R_t is set; if the energy-entropy ratio of the frame is not less than R_t, the frame is added to the sentence; scanning stops when the beginning or end of the voice stream is reached;

Step S105: determine whether the frame length of the independent sentence falls within the set short-sentence frame length range; if so, compare the current independent sentence with previously stored short independent-sentence samples; if the matching degree is below a set value, identify the independent sentence as a noise sentence;

Step S106: take the independent sentences obtained from the frames of the audio that were not identified as noise sentences as the sentence segmentation of the audio.

In a preferred embodiment, step S101 includes:

Step S1011: receive an audio file;

Step S1012: split the audio file according to a set segment duration to obtain multiple frames.

In a preferred embodiment, step S102 includes: obtaining the energy measure E_k from the average of the energy values of each frame.

In a preferred embodiment, the step in S103 of "if the energy measure of the preceding or following frame is below the set energy threshold E_t, merge the frames with the mid-sentence frame, in order of frame start time, into an independent sentence unit" includes:

If the energy measure of the preceding or following frame is below the set energy threshold E_t, determine whether the interval between the current frame and the next frame is less than a set interval time; if so, merge the mid-sentence frames, in order of frame start time, into an independent sentence.

In a preferred embodiment, the method further includes, after step S103:

Step S1031: if the frame length of the independent sentence exceeds the set independent-sentence frame length, calculate the spectral entropy ratio of each frame of the independent sentence and, using the frame with the lowest spectral entropy ratio as the split point, divide the independent sentence into two independent sentences.

The present invention also provides an automatic splitting system for audio sentence segmentation, comprising: a framing unit, an energy threshold acquiring unit, an independent sentence acquiring unit, and a spectral entropy analysis unit;

The framing unit is configured to obtain multiple frames from the audio;

The energy threshold acquiring unit is configured to obtain the energy measure E_k from the energy value of each frame;

The independent sentence acquiring unit is configured to select, based on the energy measure E_k, the frames whose energy value exceeds the set energy threshold E_t; to take each such frame as a mid-sentence frame and scan its preceding or following frame; and, if the energy measure of the preceding or following frame is below the set energy threshold E_t, to merge the frames with the mid-sentence frame, in order of frame start time, into an independent sentence;

The spectral entropy analysis unit is configured to search toward earlier and later frames, starting from the first and last frames of each sentence respectively; if the next frame found belongs to another sentence, to merge the two sentences; if the energy of the next frame is below the set energy threshold E_t and the frame does not belong to another sentence, to apply a Fourier transform to the frame, take the amplitudes in the range 0-4000 Hz, and divide them into z spectral bands of fixed width, where the intensity of band i is V_i (i = 1, 2, …, z), the total intensity is V_sum, and P_i is the probability of band i, calculated as:

$$P_i = \frac{V_i}{V_{sum}}$$

The spectral entropy of the frame is then:

$$H = -\sum_{i=1}^{z} P_i \log P_i$$

The noise sentence judging unit is configured to determine whether the frame length of the independent sentence falls within the set short-sentence frame length range; if so, to compare the current independent sentence with previously stored short independent-sentence samples; and, if the matching degree is below a set value, to identify the independent sentence as a noise sentence;

The segmentation acquiring unit is configured to take the independent sentences obtained from the frames of the audio that were not identified as noise sentences as the sentence segmentation of the audio.

In a preferred embodiment, the framing unit is further configured to: receive an audio file; and split the audio file according to a set segment duration to obtain multiple frames.

In a preferred embodiment, the energy threshold acquiring unit is further configured to obtain the energy measure E_k from the average of the energy values of each frame.

In a preferred embodiment, the independent sentence acquiring unit is further configured to: if the energy measure of the preceding or following frame is below the set energy threshold E_t, determine whether the interval between the current frame and the next frame is less than a set interval time; and if so, merge the mid-sentence frames, in order of frame start time, into an independent sentence.

In a preferred embodiment, the system further includes a long sentence judging unit;

The long sentence judging unit is configured to: if the frame length of the independent sentence exceeds the set independent-sentence frame length, calculate the spectral entropy ratio of each frame of the independent sentence and, using the frame with the lowest spectral entropy ratio as the split point, divide the independent sentence into two independent sentences.

The beneficial effects of the invention are as follows. The main computation is carried out in the time domain, so it is fast. For limited local regions that may be either consonants or noise, the time domain and frequency domain are analyzed together, which increases segmentation accuracy. Only a few frames require the time-consuming spectrum analysis (the boxed portion of the figure below), so segmentation is both fast and accurate while also being strongly noise-resistant. The time points for cutting the speech are generated automatically, which saves audio/video subtitle editing work. A splitting method is designed that reuses results already computed and performs no second round of feature computation, so long sentences can be split quickly; this guarantees that no overly long sentence occurs and meets the needs of subtitle production. A machine learning method is applied to short sentences to decide whether each is human voice or noise; noise is discarded, which further improves accuracy. The method can process both recorded audio/video and live audio/video. For a live network stream, the live speech can be cut automatically, so that downstream stages such as dictation can run in parallel, which shortens processing time.

Brief description of the drawings

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of the noise-tolerant audio sentence segmentation method in one embodiment of the invention;

Fig. 2 is a schematic diagram of the logical connections of the noise-tolerant audio sentence segmentation system in one embodiment of the invention.

Specific embodiment

The technical solutions of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

The noise-tolerant audio sentence segmentation method of the invention, as shown in Fig. 1, comprises:

Step S101: obtain multiple frames from the audio.

The invention may be installed on a server, or on a personal computer or mobile computing device. The computing terminal referred to below may be a server, a personal computer, or a mobile computing device. First, an audio/video file is uploaded to the server, or opened on the personal computer or mobile computing device. The computing device then extracts the audio stream from the audio/video file and converts it to signed single-channel data at a fixed sampling frequency. Finally, the data is divided into frames using preset framing parameters.

Step S1011: receive an audio file; Step S1012: split the audio file according to a set segment duration to obtain multiple frames.

The audio is divided into frames. Each frame is between 10 ms and 500 ms long. In speech recognition, adjacent frames must overlap so that the speech can be recognized accurately. The purpose of the invention is not speech recognition, so frames may overlap or not, and a gap of 0 ms to 500 ms may even be left between adjacent frames. The speech is therefore divided into fewer frames than speech recognition would need, which reduces the amount of computation and increases speed. Let F₁, F₂, …, F_m denote the frames. Each frame F_k has n samples s_k1, s_k2, …, s_kn, whose amplitude values are f_k1, f_k2, …, f_kn. Each frame records its start time and end time.

Voice data is the sequence of real numbers obtained by sampling the sound at a fixed sample rate. A sample rate of 16K means that 16000 values are sampled per second. Framing means grouping this data stream into sets of fixed duration, each of which serves as one unit of analysis. For example, at a 16K sample rate with a frame length of 100 milliseconds, one frame contains 1600 voice samples. Framing determines the granularity of control. In this patent, frames are usually 100 milliseconds long, so an N-second video is divided into 10N frames. Frames need not be adjacent: if consecutive frames are 100 milliseconds apart, an N-second video yields 5N frames. Increasing the gap between frames reduces the total number of frames and speeds up analysis, at the cost of lower time accuracy.
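The framing arithmetic above can be checked with a minimal sketch. The `Frame` class, the `split_into_frames` function and its parameter names are illustrative assumptions and do not appear in the patent; dropping a trailing partial frame is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Frame:
    start_ms: int          # start time of the frame
    end_ms: int            # end time of the frame
    samples: List[float]   # amplitude values f_k1 ... f_kn


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: int = 100, gap_ms: int = 0) -> List[Frame]:
    """Cut a mono signal into fixed-length frames, optionally leaving a gap
    of gap_ms between consecutive frames. A trailing partial frame is dropped
    (an assumption; the patent does not say how it is handled)."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * (frame_ms + gap_ms) // 1000
    frames = []
    pos = 0
    while pos + frame_len <= len(samples):
        start_ms = pos * 1000 // sample_rate
        frames.append(Frame(start_ms, start_ms + frame_ms,
                            list(samples[pos:pos + frame_len])))
        pos += hop_len
    return frames


# N = 3 seconds of audio at a 16K sample rate
signal = [0.0] * (3 * 16000)
contiguous = split_into_frames(signal, 16000, frame_ms=100, gap_ms=0)
spaced = split_into_frames(signal, 16000, frame_ms=100, gap_ms=100)
print(len(contiguous), len(contiguous[0].samples))  # 30 frames (10N), 1600 samples each
print(len(spaced))                                  # 15 frames (5N)
```

With a 100 ms gap the frame count halves, which is the speed versus time-accuracy trade-off the paragraph describes.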

Step S102: obtain the energy measure E_k from the energy value of each frame.

In this step:

The energy E_k is calculated for each frame. Energy may be defined in ways that include, but are not limited to, the sum of squared amplitudes and the sum of absolute amplitudes.

Defined as the sum of squared amplitudes, the energy is:

$$E_k = \sum_{i=1}^{n} f_{ki}^{2}$$

Defined as the sum of absolute values, the energy is:

$$E_k = \sum_{i=1}^{n} \lvert f_{ki} \rvert$$

An energy threshold E_t is set, and runs of adjacent speech frames whose energy reaches or exceeds E_t are found, giving the speech sentences S₁, S₂, …, S_j. That is:

S_i = { F_k | k = a, a+1, a+2, …, a+b; E_k ≥ E_t; E_(a-1) < E_t; E_(a+b+1) < E_t }.
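The two energy definitions and the set definition of S_i can be sketched as follows. The function names are hypothetical, and treating the first and last frames of the stream as sentence boundaries is an assumption.

```python
from typing import List, Sequence, Tuple


def frame_energy(samples: Sequence[float], use_abs: bool = False) -> float:
    """Energy of one frame: sum of squared amplitudes, or sum of absolute values."""
    if use_abs:
        return sum(abs(x) for x in samples)
    return sum(x * x for x in samples)


def group_sentences(energies: Sequence[float], e_t: float) -> List[Tuple[int, int]]:
    """Return (first, last) frame indices of each maximal run with E_k >= E_t."""
    sentences = []
    start = None
    for k, e in enumerate(energies):
        if e >= e_t:
            if start is None:
                start = k            # a new sentence begins at frame k
        elif start is not None:
            sentences.append((start, k - 1))
            start = None
    if start is not None:            # sentence runs to the end of the stream
        sentences.append((start, len(energies) - 1))
    return sentences


print(frame_energy([0.5, -0.5]))                          # 0.5
print(frame_energy([0.5, -0.5], use_abs=True))            # 1.0
print(group_sentences([0.0, 0.2, 0.3, 0.0, 0.4], 0.1))    # [(1, 2), (4, 4)]
```

Each returned pair corresponds to the indices a and a+b in the definition of S_i.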

In another embodiment, step S102 includes: obtaining the energy measure E_k from the average of the energy values of each frame. That is, the energy value obtained in the previous step is divided by the number of samples to give the average energy. The energy threshold is then a threshold on the average energy per frame. It is usually set empirically, commonly to a value between 0.001 and 0.01, and the user can adjust it manually.
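A small sketch of the averaged energy follows. It assumes samples normalized to the range [-1, 1], which the patent does not state but which makes the quoted 0.001-0.01 threshold range plausible; `mean_frame_energy` is a hypothetical name.

```python
import math
from typing import Sequence


def mean_frame_energy(samples: Sequence[float]) -> float:
    """Sum of squared amplitudes divided by the number of samples."""
    return sum(x * x for x in samples) / len(samples)


sample_rate = 16000
# One 100 ms frame of a 200 Hz tone with amplitude 0.1, and one silent frame.
tone = [0.1 * math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(1600)]
silence = [0.0] * 1600

e_t = 0.003  # an example threshold inside the 0.001-0.01 range
print(mean_frame_energy(tone) >= e_t)     # True: the tone frame counts as speech
print(mean_frame_energy(silence) >= e_t)  # False
```

Averaging makes the threshold independent of the frame length, so the same E_t can be reused when the framing parameters change.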

Step S103: merge frames into independent sentences.

Based on the energy measure E_k, select the frames whose energy value exceeds the set energy threshold E_t; taking each such frame as a mid-sentence frame, scan its preceding or following frame; if the energy measure of the preceding or following frame is below the set energy threshold E_t, merge the frames with the mid-sentence frame, in order of frame start time, into an independent sentence.

The step in S103 of "if the energy measure of the preceding or following frame is below the set energy threshold E_t, merge the frames with the mid-sentence frame, in order of frame start time, into an independent sentence unit" includes: if the energy measure of the preceding or following frame is below the set energy threshold E_t, determine whether the interval between the current frame and the next frame is less than a set interval time; if so, merge the mid-sentence frames, in order of frame start time, into an independent sentence.

Starting from the first and last frames of each sentence, search toward earlier and later frames respectively. If the next frame found belongs to another sentence, merge the two sentences. If the energy of the next frame is below the set energy threshold E_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in the range 0-4000 Hz, and divide them into z spectral bands of fixed width, where the intensity of band i is V_i (i = 1, 2, …, z). The total intensity is V_sum, and P_i is the probability of band i, calculated as:

$$P_i = \frac{V_i}{V_{sum}}$$

The spectral entropy of the frame is then:

$$H = -\sum_{i=1}^{z} P_i \log P_i$$

The ratio of a frame's energy to its spectral entropy is its energy-entropy ratio, denoted R. An energy-entropy ratio threshold R_t is set; if the energy-entropy ratio of the frame is not less than R_t, the frame is added to the sentence. Scanning stops when the beginning or end of the voice stream is reached.
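The band probabilities, spectral entropy and energy-entropy ratio can be sketched with the standard library alone. Several choices here are assumptions rather than details from the patent: the Shannon form of the entropy with a natural logarithm, z = 8 bands, a naive DFT in place of a fast transform, and the handling of zero-entropy frames. All function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def band_intensities(samples: Sequence[float], sample_rate: int,
                     z: int = 8, max_hz: float = 4000.0) -> List[float]:
    """Amplitude spectrum over 0..max_hz, summed into z equal-width bands (V_i)."""
    n = len(samples)
    top_bin = min(n // 2, int(max_hz * n / sample_rate))
    bands = [0.0] * z
    for b in range(top_bin + 1):
        # Naive DFT coefficient for bin b (fine for short frames in a sketch).
        coeff = sum(x * cmath.exp(-2j * math.pi * b * t / n)
                    for t, x in enumerate(samples))
        bands[b * z // (top_bin + 1)] += abs(coeff)
    return bands


def spectral_entropy(bands: Sequence[float]) -> float:
    """-sum(P_i * log P_i) with P_i = V_i / V_sum; empty bands contribute 0."""
    v_sum = sum(bands)
    if v_sum == 0:
        return 0.0
    return -sum((v / v_sum) * math.log(v / v_sum) for v in bands if v > 0)


def energy_entropy_ratio(samples: Sequence[float], sample_rate: int,
                         z: int = 8) -> float:
    """R = frame energy / spectral entropy (zero-entropy handling is assumed)."""
    energy = sum(x * x for x in samples)
    entropy = spectral_entropy(band_intensities(samples, sample_rate, z))
    if entropy == 0:
        return 0.0 if energy == 0 else float("inf")
    return energy / entropy


sample_rate = 16000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(160)]
click = [1.0] + [0.0] * 159   # flat spectrum, similar to broadband noise
print(spectral_entropy(band_intensities(tone, sample_rate)))
print(spectral_entropy(band_intensities(click, sample_rate)))
```

The tone concentrates its intensity in one band and so has entropy near zero, while the click spreads it evenly and has entropy near log 8. A frame with concentrated spectrum therefore gets a high ratio R even when its energy is modest, which is why the ratio helps separate weak speech from noise.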

For example, suppose there are 10 speech frames whose energies are:

0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12

With a threshold of 0.003, step S103 yields three sentences:

Sentence 1 includes: 0.05, 0.12

Sentence 2 includes: 0.004, 0.1, 0.2, 0.4, 0.5

Sentence 3 includes: 0.12

Take sentence 2 as an example and scan toward earlier frames. The frame before it has energy 0.002; it belongs to no sentence and its energy is below the threshold 0.003, so a Fourier transform is applied to this frame and its energy-entropy ratio is calculated. If the energy-entropy ratio is below the energy-entropy ratio threshold, the frame is considered not to belong to sentence 2 and the scan toward earlier frames ends. If the energy-entropy ratio is not below the threshold, the frame is considered to belong to sentence 2 and the scan continues to the next earlier frame. That frame has energy 0.12 and belongs to sentence 1, so sentence 1 and sentence 2 are merged. After the merge, the frontmost frame (0.05) is already the first frame of the stream, so no further scanning is possible and the scan toward earlier frames ends. The scan toward later frames follows the same logic: when a frame with energy below the energy threshold is encountered, its energy-entropy ratio is calculated; if the ratio is below the energy-entropy ratio threshold the scan ends, otherwise it continues. When another sentence is encountered, the two are merged and the scan continues.
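The worked example can be replayed with a short sketch. The spectral test is abstracted into a callback `ratio_ok(k)` that reports whether frame k has an energy-entropy ratio of at least R_t; this callback and the function name `scan_and_merge` are assumptions made for illustration. The sketch processes sentences from left to right, so a low-energy frame may be absorbed by the sentence before it rather than the one after it, which gives the same final result.

```python
from typing import Callable, List, Sequence, Tuple


def scan_and_merge(energies: Sequence[float], e_t: float,
                   ratio_ok: Callable[[int], bool]) -> List[Tuple[int, int]]:
    # Initial sentences: maximal runs of frames with energy >= e_t.
    runs: List[List[int]] = []
    start = None
    for k, e in enumerate(energies):
        if e >= e_t and start is None:
            start = k
        elif e < e_t and start is not None:
            runs.append([start, k - 1])
            start = None
    if start is not None:
        runs.append([start, len(energies) - 1])

    result: List[List[int]] = []
    for first, last in runs:
        # Scan toward earlier frames.
        while first > 0:
            if result and result[-1][1] == first - 1:
                first = result.pop()[0]   # reached another sentence: merge
            elif ratio_ok(first - 1):
                first -= 1                # low-energy frame joins the sentence
            else:
                break
        # Scan toward later frames, stopping at the next high-energy frame.
        while (last + 1 < len(energies) and energies[last + 1] < e_t
               and ratio_ok(last + 1)):
            last += 1
        result.append([first, last])
    return [(a, b) for a, b in result]


energies = [0.05, 0.12, 0.002, 0.004, 0.1, 0.2, 0.4, 0.5, 0.001, 0.12]
# No low-energy frame passes the ratio test: the three sentences stay separate.
print(scan_and_merge(energies, 0.003, lambda k: False))   # [(0, 1), (3, 7), (9, 9)]
# Only the 0.002 frame (index 2) passes: sentences 1 and 2 merge.
print(scan_and_merge(energies, 0.003, lambda k: k == 2))  # [(0, 7), (9, 9)]
```

In practice `ratio_ok` would compute the energy-entropy ratio of the frame and compare it with R_t; only the few frames at sentence boundaries are ever passed to it, which matches the claim that spectrum analysis is needed for only a few frames.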

This step performs a further merge. For example, suppose each frame is 100 milliseconds long, sentence 1 contains frames 22, 23, 24, 25, 26 (5 frames in total), sentence 2 contains frames 29, 30, 31, 32, 33, 34, 35 (7 frames in total), and no other sentence lies between them. The two sentences are separated by 2 frames, that is, 200 milliseconds. Suppose the specified time threshold is 300 milliseconds. Because 200 milliseconds is less than 300 milliseconds, sentence 1 and sentence 2 are merged into one sentence. Frames 27 and 28, which lie between them, are merged in as well, so the new sentence contains frames 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 (14 frames in total).
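The gap-based merge in this example can be written as a short sketch. The function name and the representation of a sentence as a (first frame, last frame) pair are assumptions, and frames are assumed contiguous so that a gap of g frames lasts g times the frame length.

```python
from typing import List, Sequence, Tuple


def merge_close_sentences(sentences: Sequence[Tuple[int, int]],
                          frame_ms: int, max_gap_ms: int) -> List[Tuple[int, int]]:
    """Merge consecutive sentences whose gap is shorter than max_gap_ms.
    Sentences are (first_frame, last_frame) pairs sorted by start."""
    if not sentences:
        return []
    merged = [list(sentences[0])]
    for first, last in sentences[1:]:
        gap_ms = (first - merged[-1][1] - 1) * frame_ms
        if gap_ms < max_gap_ms:
            merged[-1][1] = last      # the frames in the gap are absorbed too
        else:
            merged.append([first, last])
    return [(a, b) for a, b in merged]


result = merge_close_sentences([(22, 26), (29, 35)], frame_ms=100, max_gap_ms=300)
print(result)                                # [(22, 35)]
print(result[0][1] - result[0][0] + 1)       # 14 frames
```

With a threshold of 200 milliseconds or less the same two sentences would stay separate, since the 200 ms gap would no longer be below the threshold.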

Step S104: apply spectral entropy analysis to each sentence.

In this step, starting from the first and last frames of each sentence, search toward earlier and later frames respectively; if the next frame found belongs to another sentence, merge the two sentences; if the energy of the next frame is below the set energy threshold E_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in the range 0-4000 Hz, and divide them into z spectral bands of fixed width, where the intensity of band i is V_i (i = 1, 2, …, z). The total intensity is V_sum, and P_i is the probability of band i, calculated as:

$$P_i = \frac{V_i}{V_{sum}}$$

The spectral entropy of the frame is then:

$$H = -\sum_{i=1}^{z} P_i \log P_i$$

The ratio of a frame's energy to its spectral entropy is its energy-entropy ratio, denoted R. An energy-entropy ratio threshold R_t is set; if the energy-entropy ratio of the frame is not less than R_t, the frame is added to the sentence; scanning stops when the beginning or end of the voice stream is reached;

Step S105: identify noise sentences. Determine whether the frame length of the independent sentence falls within the set short-sentence frame length range; if so, compare the current independent sentence with previously stored short independent-sentence samples; if the matching degree is below a set value, identify the independent sentence as a noise sentence. A machine learning method is applied to short sentences to decide whether each is human voice or noise; noise is discarded, which further improves accuracy.

Step S106: obtain the segmentation. The independent sentences obtained from the frames of the audio that were not identified as noise sentences are taken as the sentence segmentation of the audio.

In a preferred embodiment, the method further includes, after step S103:

Step S1031: if the frame length of the independent sentence exceeds the set independent-sentence frame length, calculate the spectral entropy ratio of each frame of the independent sentence and, using the frame with the lowest spectral entropy ratio as the split point, divide the independent sentence into two independent sentences.

Splitting overly long sentences. If the length of a sentence exceeds a specified time threshold, the sentence is split. The splitting works as follows: a certain proportion of the speech frames at the head and at the tail of the sentence are ignored, and the remaining speech frames are traversed. If the spectral entropy ratio of a frame has already been computed, the spectral entropy ratio is used as its weight W. If the spectral entropy ratio has not been computed, the frame energy is used as the weight W. For each frame, if there are Nleft frames to its left and Nright frames to its right within the sentence, the split coefficient WS is defined as follows: By traversal, the frame that minimizes the split coefficient WS of the sentence is found, and the sentence is divided at that frame into a left sentence and a right sentence. If either of the two sentences is still too long, it is split again by the same method until no overly long sentence remains.

Filtering out short, meaningless sentences. A time threshold is specified; a sentence shorter than this duration may not be a person speaking. For such a sentence, the frame with the highest energy is taken and its mel cepstral coefficients are calculated. A previously trained support vector machine (SVM) classifier then classifies it to judge whether it is a human voice. If it is not a human voice, the sentence is discarded. The SVM classifier is trained as follows: several human voice samples are collected from lecture videos and live network videos as positive samples, and several typical non-human sound samples are used as negative samples. Training uses the mel cepstral coefficients as features and yields the model parameters. (For the principle of support vector machines, refer to). Other machine learning methods, such as deep neural networks, may also be used here for the classification.
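The recursive splitting can be sketched as below. The definition of the split coefficient WS is not reproduced in this text, so the sketch does not implement it: as a stated simplification it picks the frame with the lowest weight W, as in step S1031. The function name, the 20% trim proportion, and the choice to let the split frame begin the right-hand sentence are all assumptions.

```python
from typing import List, Sequence, Tuple


def split_long(first: int, last: int, weights: Sequence[float],
               max_frames: int, trim: float = 0.2) -> List[Tuple[int, int]]:
    """Recursively split the sentence [first, last] until no piece is longer
    than max_frames. weights[k] is the weight W of frame k (spectral entropy
    ratio if available, otherwise frame energy)."""
    length = last - first + 1
    if length <= max_frames:
        return [(first, last)]
    # Ignore a proportion of frames at the head and tail of the sentence.
    skip = max(1, int(length * trim))
    candidates = range(first + skip, last - skip + 1)
    if not candidates:
        return [(first, last)]
    # Simplification: minimize W directly instead of the patent's WS.
    cut = min(candidates, key=lambda k: weights[k])
    return (split_long(first, cut - 1, weights, max_frames, trim)
            + split_long(cut, last, weights, max_frames, trim))


weights = [5, 4, 6, 3, 0.5, 7, 8, 2, 6, 5]
print(split_long(0, 9, weights, max_frames=6))   # [(0, 3), (4, 9)]
```

Because the weights were already computed in earlier steps, the split needs no further feature computation, which is the property the patent emphasizes. Since WS is described in terms of Nleft and Nright, the original method presumably also accounts for how balanced the two halves are, which this sketch does not.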

The present invention also provides an automatic splitting system for audio sentence segmentation which, as shown in Fig. 2, comprises: a framing unit 101, an energy threshold acquiring unit 201, an independent sentence acquiring unit 301, a spectral entropy analysis unit 401, a noise sentence judging unit 501, and a segmentation acquiring unit 601.

The framing unit 101 is configured to obtain multiple frames from the audio;

The energy threshold acquiring unit 201 is configured to obtain the energy measure E_k from the energy value of each frame;

The independent sentence acquiring unit 301 is configured to select, based on the energy measure E_k, the frames whose energy value exceeds the set energy threshold E_t; to take each such frame as a mid-sentence frame and scan its preceding or following frame; and, if the energy measure of the preceding or following frame is below the set energy threshold E_t, to merge the frames with the mid-sentence frame, in order of frame start time, into an independent sentence.

The spectral entropy analysis unit 401 is configured to search toward earlier and later frames, starting from the first and last frames of each sentence respectively; if the next frame found belongs to another sentence, to merge the two sentences; if the energy of the next frame is below the set energy threshold E_t and the frame does not belong to another sentence, to apply a Fourier transform to the frame, take the amplitudes in the range 0-4000 Hz, and divide them into z spectral bands of fixed width, where the intensity of band i is V_i (i = 1, 2, …, z). The total intensity is V_sum, and P_i is the probability of band i, calculated as:

$$P_i = \frac{V_i}{V_{sum}}$$

The spectral entropy of the frame is then:

$$H = -\sum_{i=1}^{z} P_i \log P_i$$

The noise sentence judging unit 501 is configured to determine whether the frame length of the independent sentence falls within the set short-sentence frame length range; if so, to compare the current independent sentence with previously stored short independent-sentence samples; and, if the matching degree is below a set value, to identify the independent sentence as a noise sentence;

The segmentation acquiring unit 601 is configured to take the independent sentences obtained from the frames of the audio that were not identified as noise sentences as the sentence segmentation of the audio.

In a preferred embodiment, the framing unit 101 is further configured to: receive an audio file; and split the audio file according to a set segment duration to obtain multiple frames.

In a preferred embodiment, the energy threshold acquiring unit 201 is further configured to obtain the energy measure E_k from the average of the energy values of each frame.

In a preferred embodiment, the independent sentence acquiring unit 301 is further configured to: if the energy measure of the preceding or following frame is below the set energy threshold E_t, determine whether the interval between the current frame and the next frame is less than a set interval time; and if so, merge the mid-sentence frames, in order of frame start time, into an independent sentence.

In a preferred embodiment, the system includes a long sentence judging unit 3011;

The above is only a specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be subject to the scope of protection of the claims.

Claims

The processing method of making pauses in reading unpunctuated ancient writings 1. audio appearance is made an uproar, comprising:
Step S101 obtains multiple framing sections according to audio；
Step S102 obtains energy threshold E according to the energy value of each framing section_k；
Step S103, according to the energy threshold E_k, it is more than setting energy threshold E that its energy value is obtained from each framing section_tFraming section, then be scanned by preamble frame or postorder frame of the sentence intermediate frame to the frame of the framing section, if preamble frame or postorderThe energy threshold of frame is less than setting energy threshold E_t, then merging the frame by frame start sequence with the sentence intermediate frame becomes independentSentence；
Step S104, from the front and back of each sentence, two frames is searched for forward and backward, if the next frame searched belongs to other sentencesSon then merges two sentences；If the energy of next frame is less than setting energy threshold E_t, and other sentences are not belonging to,Fourier transform then is carried out to the frame, takes the amplitude of 0-4000HZ, is divided into z bands of a spectrum according to fixed width, every bands of a spectrum it is strongDegree is V_i, i=1,2 ... z, overall strength V_sum, P_iFor the probability of every bands of a spectrum: P_iCalculation formula are as follows:
Then, the spectrum entropy of the frame are as follows:
The energy of each frame and the ratio of spectrum entropy are energy entropy ratio, are denoted as R, set an energy entropy than threshold value R_tIf the energy entropy of the frameThan being not less than R_t, then the frame is grouped into sentence, if the beginning or end of voice flow, scan abort are arrived in scanning；
Step S105: judging whether the frame length of the independent sentence falls within a set short-sentence frame length range; if so, comparing the stored historical short independent sentence samples with the current independent sentence, and if the matching degree is lower than a set value, identifying the independent sentence as a noise sentence;
Step S106: taking the independent sentences obtained from each frame segment of the audio that are not identified as noise sentences as the sentence breaks of the audio.
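The spectral entropy and energy-entropy ratio used in step S104 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the naive DFT, the use of summed magnitudes as the band intensity V_i, the natural logarithm, and the handling of silent frames are all assumptions made for illustration.

```python
import cmath
import math


def band_probabilities(frame, sample_rate, z, max_hz=4000.0):
    """Return P_i = V_i / V_sum for z fixed-width bands covering 0..max_hz."""
    n = len(frame)
    band_width = max_hz / z
    intensities = [0.0] * z
    # Naive DFT over the non-negative frequency bins below max_hz.
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if freq >= max_hz:
            break
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        band = min(z - 1, int(freq // band_width))
        intensities[band] += abs(coeff)  # assumption: V_i is summed magnitude
    total = sum(intensities)
    if total == 0.0:
        return [0.0] * z
    return [v / total for v in intensities]


def spectral_entropy(probabilities):
    """Shannon entropy of the band distribution (natural log is an assumption)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def energy_entropy_ratio(frame, sample_rate, z):
    """R = frame energy / spectral entropy, with explicit zero handling."""
    energy = sum(x * x for x in frame)
    entropy = spectral_entropy(band_probabilities(frame, sample_rate, z))
    if entropy == 0.0:
        # Assumption: a silent frame scores 0; a perfectly pure tone scores infinity.
        return 0.0 if energy == 0.0 else math.inf
    return energy / entropy
```

A frame would then be assigned to the neighbouring sentence when `energy_entropy_ratio(frame, sample_rate, z)` is not less than the chosen threshold R_t. A tonal, speech-like frame concentrates its spectrum in few bands, giving low entropy and a high ratio, whereas broadband noise spreads across bands and yields a low ratio.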
2. The noise-tolerant audio sentence segmentation method according to claim 1, characterized in that step S101 comprises:
Step S1011: receiving an audio file;
Step S1012: splitting the audio file according to a set slice time to obtain multiple frame segments.
3. The noise-tolerant audio sentence segmentation method according to claim 1 or 2, characterized in that step S102 comprises: obtaining the energy threshold E_k from the average of the energy values of the frame segments.
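The splitting of claim 2 and the averaged threshold of claim 3 can be sketched together as follows. This is a hedged illustration only: the function names are hypothetical, the audio is assumed to be an in-memory list of samples, and energy is assumed to be the sum of squared samples, which the claims do not specify.

```python
def split_into_segments(samples, sample_rate, slice_seconds):
    """Cut a sample sequence into consecutive segments of slice_seconds each."""
    size = int(round(sample_rate * slice_seconds))
    if size <= 0:
        raise ValueError("slice_seconds is too short for this sample rate")
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def segment_energy(segment):
    """Assumed energy definition: sum of squared samples."""
    return sum(x * x for x in segment)


def energy_threshold(segments):
    """E_k as the mean of the per-segment energies (claim 3)."""
    energies = [segment_energy(segment) for segment in segments]
    return sum(energies) / len(energies)
```

The last segment may be shorter than the others when the audio length is not a multiple of the slice time; the sketch keeps it rather than padding or dropping it.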
4. The noise-tolerant audio sentence segmentation method according to claim 1, characterized in that the step in step S103 of "if the energy of a preceding or following frame is less than the set energy threshold E_t, merging that frame with the mid-sentence frames in order of frame start to form an independent sentence" comprises:
if the energy of a preceding or following frame is less than the set energy threshold E_t, judging whether the interval between the current frame and the next frame is less than a set interval time, and if so, merging the mid-sentence frames in order of frame start to form an independent sentence.
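The interval check of claim 4 can be illustrated with a simplified sketch. It tracks only the high-energy (mid-sentence) frames and bridges short low-energy gaps between them; it does not attach the low-energy boundary frames themselves. The function name, the equal frame duration, and the representation of a sentence as an inclusive index pair are assumptions.

```python
def group_sentences(energies, e_t, frame_seconds, max_gap_seconds):
    """Return (first_frame, last_frame) index pairs, one per independent sentence."""
    sentences = []
    for index, energy in enumerate(energies):
        if energy <= e_t:
            continue  # low-energy frame: not a mid-sentence frame
        if sentences:
            start, end = sentences[-1]
            # Duration of the low-energy gap since the previous mid-sentence frame.
            gap = (index - end - 1) * frame_seconds
            if gap < max_gap_seconds:
                sentences[-1] = (start, index)  # close enough: extend the sentence
                continue
        sentences.append((index, index))  # gap too long: start a new sentence
    return sentences
```

For example, with 20 ms frames and a 30 ms interval limit, a single low-energy frame between two high-energy frames is bridged, while three consecutive low-energy frames start a new sentence.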
5. The noise-tolerant audio sentence segmentation method according to claim 1 or 4, characterized in that, after step S103, the method further comprises:
Step S1031: if the frame length of the independent sentence exceeds a set independent frame length, calculating the spectral entropy ratio of each frame of the independent sentence and, taking the frame with the lowest spectral entropy ratio as the split point, dividing the independent sentence into two independent sentences.
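The long-sentence split of step S1031 can be sketched as below, assuming the per-frame ratio values have already been computed and are indexed by frame number. The function name and the choice to place the split frame at the start of the second sentence are assumptions; the claim only states that the lowest-ratio frame is the split point.

```python
def split_long_sentence(sentence, ratios, max_frames):
    """Split an over-long sentence at its lowest-ratio frame.

    sentence: (first, last) inclusive frame indices.
    ratios: per-frame ratio values, indexed by frame number.
    """
    first, last = sentence
    length = last - first + 1
    if length <= max_frames or length < 2:
        return [sentence]
    # Consider only frames after the first so that both halves are non-empty.
    cut = min(range(first + 1, last + 1), key=lambda i: ratios[i])
    return [(first, cut - 1), (cut, last)]
```

If either resulting sentence is still longer than the limit, the same function could be applied again to it.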
6. An automatic splitting system for audio sentence segmentation, comprising: a framing unit, an energy threshold acquiring unit, an independent sentence acquiring unit, a spectral entropy analysis unit, a noise sentence judging unit, and a sentence break acquiring unit, wherein:
the framing unit is configured to obtain multiple frame segments from the audio;
the energy threshold acquiring unit is configured to obtain an energy threshold E_k from the energy value of each frame segment;
the independent sentence acquiring unit is configured to obtain, according to the energy threshold E_k, from the frame segments those whose energy value exceeds a set energy threshold E_t, then take the frames of such a frame segment as mid-sentence frames and scan their preceding or following frames, and, if the energy of a preceding or following frame is less than the set energy threshold E_t, merge that frame with the mid-sentence frames in order of frame start to form an independent sentence;
the spectral entropy analysis unit is configured to search forward and backward, respectively, from the first and last frames of each sentence; if the next frame found belongs to another sentence, merge the two sentences; if the energy of the next frame is less than the set energy threshold E_t and the frame does not belong to another sentence, apply a Fourier transform to the frame, take the amplitudes in the 0-4000 Hz range and divide them into z spectral bands of fixed width, where the intensity of each band is V_i, i = 1, 2, ..., z, the total intensity is V_sum, and P_i is the probability of each band, calculated as:
$$P_i = \frac{V_i}{V_{sum}}$$
Then the spectral entropy of the frame is:
$$H = -\sum_{i=1}^{z} P_i \log P_i$$
The ratio of the energy of each frame to its spectral entropy is the energy-entropy ratio, denoted R; an energy-entropy ratio threshold R_t is set; if the energy-entropy ratio of the frame is not less than R_t, the frame is assigned to the sentence; if the scan reaches the beginning or end of the voice stream, the scan stops;
the noise sentence judging unit is configured to judge whether the frame length of the independent sentence falls within a set short-sentence frame length range and, if so, compare the stored historical short independent sentence samples with the current independent sentence, and, if the matching degree is lower than a set value, identify the independent sentence as a noise sentence;
the sentence break acquiring unit is configured to take the independent sentences obtained from each frame segment of the audio that are not identified as noise sentences as the sentence breaks of the audio.
7. The automatic splitting system for audio sentence segmentation according to claim 6, characterized in that the framing unit is further configured to: receive an audio file; and split the audio file according to a set slice time to obtain multiple frame segments.
8. The automatic splitting system for audio sentence segmentation according to claim 6 or 7, characterized in that the energy threshold acquiring unit is further configured to obtain the energy threshold E_k from the average of the energy values of the frame segments.
9. The automatic splitting system for audio sentence segmentation according to claim 6, characterized in that the independent sentence acquiring unit is further configured to: if the energy of a preceding or following frame is less than the set energy threshold E_t, judge whether the interval between the current frame and the next frame is less than a set interval time, and if so, merge the mid-sentence frames in order of frame start to form an independent sentence.
10. The automatic splitting system for audio sentence segmentation according to claim 6 or 9, characterized in that it further comprises a long sentence judging unit;
the long sentence judging unit is configured to, if the frame length of the independent sentence exceeds a set independent frame length, calculate the spectral entropy ratio of each frame of the independent sentence and, taking the frame with the lowest spectral entropy ratio as the split point, divide the independent sentence into two independent sentences.