Summary of the invention
In view of the above problems of the prior art, the objective of the invention is to provide a method for encoding speech phonemes and synthesizing speech from them. Offline and in advance, the method divides the phonemes of a language into voiced phonemes and unvoiced phonemes and processes each class separately, which simplifies the speech synthesis process at synthesis time.
Voiced phonemes are encoded by computing and encoding the amplitude, pitch period and spectrum parameters, where the spectrum parameter is encoded with LPC. Unvoiced (aspirated) phoneme files keep their original sound and are not compressed. For silent parts, only the length of the silence is recorded. During decompression, the voiced part only requires smoothing the amplitude, pitch period and spectrum parameters by interpolation and then restoring the voiced speech with a speech synthesizer; the unvoiced part only requires fetching the original speech by its address; and the silent part only requires fetching the duration of the silence.
According to the disclosed technique, the invention provides a method for encoding speech phonemes and synthesizing speech that comprises two stages: a speech database creation stage and a speech synthesis stage.
The speech database creation stage comprises the following steps: dividing the speech phonemes into voiced, unvoiced and silence phonemes; compression-encoding the voiced phonemes, address-encoding the unvoiced phonemes, and time-length-encoding the silence phonemes; and storing the compression-encoded voiced phonemes together with the unvoiced and silence phonemes in the speech database.
Once the user keys in text data, the phonemes of the text data are analyzed and the corresponding phoneme data is read from the speech database; the method then enters the next stage.
The speech synthesis stage synthesizes the speech of the text data from the phoneme data of the speech database and comprises the following steps: reading the voiced phoneme code, the unvoiced phoneme code and the silence phoneme code of the phoneme data; and synthesizing voiced speech through a speech synthesizer according to the voiced phoneme code, producing unvoiced speech according to the unvoiced phoneme code, and producing silence according to the silence phoneme code.
In the speech database creation stage, voiced phonemes are compression-encoded according to a pitch period parameter, an amplitude parameter and a spectrum parameter; unvoiced phonemes are encoded according to a pitch period parameter and an address parameter; and silence phonemes are encoded according to a pitch period parameter and a time parameter.
In the speech synthesis stage, the voiced, unvoiced and silence codes are simply fetched from the speech database according to the encoding rules and are decoded and synthesized separately to obtain the synthesized speech. The voiced speech is produced by a speech synthesizer designed around three parameters: the pitch period parameter, the spectrum parameter and the amplitude parameter.
Specifically, the method of the invention for encoding speech phonemes and synthesizing speech samples a language offline and then encodes the sampled speech phonemes of the language and synthesizes speech from them. It comprises the following steps:
Creating a speech database, comprising the following steps:
Dividing the speech phonemes into voiced, unvoiced and silence phonemes;
Compression-encoding the voiced phonemes, address-encoding the unvoiced phonemes, and time-length-encoding the silence phonemes; and
Storing the compression-encoded voiced phonemes together with the unvoiced and silence phonemes in the speech database;
When the user keys in text data, analyzing the phonemes of the text data and reading the corresponding phoneme data from the speech database; and
Synthesizing the speech of the text data according to the phoneme data of the speech database, comprising the following steps:
Reading the voiced phoneme code, the unvoiced phoneme code and the silence phoneme code of the phoneme data; and
Synthesizing voiced speech through a speech synthesizer according to the voiced phoneme code of the phoneme data, producing unvoiced speech according to the unvoiced phoneme code of the phoneme data, and producing silence according to the silence phoneme code.
In the described method, the language is sampled at a rate of 8,000 samples per second.
The compression encoding of the voiced phonemes uses a pitch period parameter, an amplitude parameter and a spectrum parameter; the address encoding of the unvoiced phonemes uses the pitch period parameter and an address parameter; and the time-length encoding of the silence phonemes uses the pitch period parameter and a time parameter.
The pitch period parameter and the amplitude parameter of the voiced phonemes are computed frame by frame, with one frame as the unit.
The spectrum parameter is encoded by linear predictive coding (LPC).
The address parameter records the storage address of the sampled unvoiced phoneme.
The time parameter records the duration of the sampled silence phoneme.
The pitch period parameter value of an unvoiced phoneme is defined as 1, and the pitch period parameter value of a silence phoneme is defined as 0.
The voiced speech is synthesized according to the pitch period parameter, the amplitude parameter and the spectrum parameter, wherein the speech synthesizer comprises:
A pulse train generator, which outputs an excitation signal according to the pitch period parameter;
A vocal tract filter, which takes the spectrum parameter as its filter parameters, receives the excitation signal and outputs a speech signal; and
A multiplier, which multiplies the speech signal by the amplitude parameter to output the restored speech.
The unvoiced speech is produced by reading an unvoiced speech phoneme from the speech database according to the address parameter and generating the unvoiced speech from that phoneme.
The silence is produced by outputting, according to the time parameter, zero-amplitude silence whose duration matches the time parameter.
The speech phoneme encoding and decoding method of the invention can be carried out offline. It can compress the memory occupied by the original phoneme files to below 2 Mbytes (2.4 kbps), saving a large amount of memory. To improve sound quality, each sample is taken at 16 bits, and the smoothing applied during decompression improves the speech at phoneme boundaries that would otherwise join poorly. Moreover, because the encoding method handles voiced and unvoiced speech separately, the voiced part does not suffer from the voiced/unvoiced misclassification that occurs in general speech coding and causes defects such as a hoarse sound; the unvoiced part keeps the original aspirated sound, so the aspiration is reproduced as faithfully as possible.
The features of the invention are described in detail below through preferred embodiments in conjunction with the drawings.
Embodiment
In terms of pronunciation, most languages are multisyllabic. Taking English as an example, if English is subdivided into the distinct single syllables formed from its different phonetic symbols, it can be reduced to several thousand basic pronunciation units. These pronunciation units are phonemes, and each phoneme has its own pitch period (pitch). Conversely, a language whose pronunciation is built on phonemes can use the phoneme as the unit for encoding and decoding its speech, and the invention is an application of this concept.
Second, because speech processing in the electronic dictionary market is comparatively regular and requires a large degree of data compression, the invention uses linear predictive coding (Linear Predictive Coding, hereinafter LPC) for its encoding and decoding. LPC is based on a model of speech production: it achieves compression by estimating the vocal tract filter parameters and the pitch period of the signal, and it can reach a very low bit rate, so it is well suited as the coding method of the invention.
Next, refer to Fig. 1, the flow chart of the speech phoneme encoding and speech synthesis method of the invention, which comprises the following steps: distinguishing voiced, unvoiced and silence phonemes (step 10); encoding the phonemes (step 20); storing the encoded voiced phoneme codes, the unvoiced phonemes and the silence phonemes (step 30); decoding the phonemes and smoothing (step 40); and synthesizing speech (step 50). This flow in fact comprises two stages: an encoding stage (steps 10 to 30) and a decoding stage (steps 40 to 50). The encoding stage is concerned with building the speech database, so it can also be called the speech database creation stage. The decoding stage takes place when the user of the electronic dictionary presses a key to have a word pronounced: the electronic dictionary decomposes the text into speech phonemes according to the rules by which the speech database was built, fetches the encoded speech phonemes, decodes them according to the encoding rules of the invention, and then restores and synthesizes the speech, so this stage can also be called the synthesis stage. The individual steps are described one by one below.
First, in step 10, speech can be divided into speech phonemes from the pronunciation of the text, and speech phonemes can in turn be classified, so the invention uses the distinction between voiced, unvoiced and silence as its basic classification. Voiced phonemes are the periodic part of speech and can therefore be compressed further; unvoiced phonemes are the non-periodic part of speech and are therefore not compressed; for silence, it suffices to record its length directly.
Take the English entries in an electronic dictionary as an example. The pairing of letters and phonetic symbols (phonetic alphabet) follows definite rules; that is, taking each syllable as a unit, the voiced and unvoiced parts of different syllables can be told apart. The voiced and unvoiced parts of the speech can therefore be distinguished in advance from the phonetic symbol data in the English database. For example, the unvoiced symbols include f, p, s, t and so on, so the phonetic transcription of free, [fri], becomes [f-ri] after processing. The same reasoning applies to speech processing for Mandarin and other languages.
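The splitting of [fri] into [f-ri] can be illustrated with a minimal sketch. The symbol set below contains only the four symbols the text names (f, p, s, t) and is therefore incomplete; the function name and the restriction to leading unvoiced symbols are assumptions made for illustration, not the rule set of the invention.

```python
# Hypothetical, incomplete set: the text names only f, p, s, t "and so on".
UNVOICED_SYMBOLS = {"f", "p", "s", "t"}


def split_leading_unvoiced(phonetic: str) -> str:
    """Insert '-' between the leading unvoiced symbols and the remainder.

    Only the start of the transcription is examined; unvoiced symbols
    inside or at the end of a syllable are not handled in this sketch.
    """
    i = 0
    while i < len(phonetic) and phonetic[i] in UNVOICED_SYMBOLS:
        i += 1
    if i == 0 or i == len(phonetic):
        return phonetic  # nothing to split
    return phonetic[:i] + "-" + phonetic[i:]


print(split_leading_unvoiced("fri"))  # f-ri
```

A real system would work on a full phonetic alphabet and on every syllable of a word rather than on the first symbols only.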
Using the information of the language itself, the voiced and unvoiced parts of speech can be processed in advance and offline; that is, before speech encoding, all speech phonemes are divided into two classes, voiced and unvoiced. Processing a voiced phoneme means cutting off the aspirated initial consonant of the syllable phoneme and keeping only the voiced final (the vowel part). Processing an unvoiced phoneme means keeping the unvoiced consonants and the aspirated initial consonant of the syllable phoneme. The silent parts of the speech (which may contain slight noise) are all set to zero, and only the length of the silence is recorded.
After the phonemes of the speech have been classified, step 20, phoneme encoding, can begin. Because step 10 has divided the speech phonemes into the three kinds voiced, unvoiced and silence, the invention encodes the three pre-classified kinds of speech phoneme separately. The encoding covers the three main parameters of speech coding: the root-mean-square value of the amplitude (RMS; root mean square), the pitch period (Pitch, that is, the tone) and the spectrum parameter (RC's; reflection coefficients).
The amplitude parameter and the pitch period parameter are computed frame by frame (one frame = 180 samples at a sampling rate of 8 kHz). The spectrum parameter (RC's) is computed by LPC, that is, from the following transfer function: $H(z) = \dfrac{A_0}{1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_{10} z^{-10}}$
where $A_0$ is the amplitude parameter, $z$ is the z-transform variable, and $a_1$ to $a_{10}$ are the LPC parameters.
With the above three parameters, a voiced speech frame (180 samples) can be encoded in 54 bits, which corresponds to a compressed bit rate of 2.4 kbps. The bit allocation of the parameters is as follows: Pitch (6 bits), RMS (6 bits), RC's (RC0 to RC9).
As for an unvoiced speech frame, because the invention records it directly, its pitch period (Pitch) parameter value is defined as 1 and it is encoded as follows: Pitch (6 bits), Index_of_unvoiced_speech (Idx)
where Idx is a pointer to the actual (aspirated) speech, that is, the address at which it is stored.
For a silence frame, the pitch period parameter value is set to 0 and the frame is encoded as follows: Pitch (6 bits), Length_of_silence (Ls)
where Ls is the length of the silence.
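The 2.4 kbps figure can be checked from the frame size and the sampling rate stated above. The short calculation below is only an arithmetic check; the constant names are chosen here for illustration, and the total of 42 bits for RC0 to RC9 is inferred from the 54-bit frame size and is not a per-coefficient allocation taken from the text.

```python
SAMPLE_RATE_HZ = 8000      # sampling rate stated in the text
FRAME_SAMPLES = 180        # samples per frame
VOICED_FRAME_BITS = 54     # bits per encoded voiced frame
PITCH_BITS = 6
RMS_BITS = 6

# Bits per second = bits per frame * frames per second.
bit_rate_bps = VOICED_FRAME_BITS * SAMPLE_RATE_HZ / FRAME_SAMPLES

# Inferred: whatever is left after Pitch and RMS is shared by RC0..RC9.
rc_bits_total = VOICED_FRAME_BITS - PITCH_BITS - RMS_BITS

print(bit_rate_bps)    # 2400.0
print(rc_bits_total)   # 42
```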
Next, the speech data encoded above is recorded in the speech database; this is step 30. Steps 10 to 30 above illustrate the encoding rules of the invention: the voiced, unvoiced and silent parts of the speech phonemes are each encoded in a different way, which saves a considerable amount of memory.
Once built, the speech database serves as the data basis for speech synthesis. When speech data is read, the pitch period parameter is the starting point of each read. If Pitch > 1, a total of 54 bits is read and decoded to restore voiced speech. If Pitch = 1, a further 8 bits (Idx) are read and the actual aspirated speech data is loaded according to Idx; taking English as an example, all of the unvoiced aspirated data occupies 120 kbytes of memory. If Pitch = 0, a further 8 bits (Ls) are read and decoded to restore silence of length Ls*8.
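The reading rule just described (6 bits of Pitch first, then 48, 8 or 8 more bits depending on its value) can be sketched as a small bit-stream parser. The MSB-first bit order, the `BitReader` class and the tuple layout of the records are assumptions for illustration; the text does not specify how bits are packed into bytes, and the 48 remaining voiced bits are kept as one raw integer because their per-field layout is not given.

```python
class BitReader:
    """Reads unsigned integers from a byte string, MSB first (assumed order)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position in bits

    def bits_left(self) -> int:
        return len(self.data) * 8 - self.pos

    def read(self, n: int) -> int:
        value = 0
        for _ in range(n):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def parse_records(data: bytes) -> list:
    """Split an encoded stream into voiced / unvoiced / silence records."""
    reader = BitReader(data)
    records = []
    # The shortest record is 6 + 8 = 14 bits; fewer bits are treated as padding.
    while reader.bits_left() >= 14:
        pitch = reader.read(6)
        if pitch > 1:                      # voiced: 54 bits in total
            if reader.bits_left() < 48:
                break
            records.append(("voiced", pitch, reader.read(48)))
        elif pitch == 1:                   # unvoiced: 8-bit address Idx
            records.append(("unvoiced", reader.read(8)))
        else:                              # silence: 8-bit Ls, length Ls*8
            records.append(("silence", reader.read(8) * 8))
    return records


print(parse_records(bytes([0x04, 0x14, 0x00, 0x30])))
# [('unvoiced', 5), ('silence', 24)]
```

Keeping the record type in the same field as the pitch period means no separate type tag is needed, which is why the first 6 bits alone decide how many further bits to read.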
In other words, because the technical approach of the invention is to process the voiced, unvoiced and silent parts of speech separately, the three have different encoded data formats, as shown by the bit allocations given above. At synthesis time it is therefore enough to apply the encoding rules of the invention in reverse. The operation of the synthesis stage, that is, steps 40 to 50, is introduced below.
First, phoneme decoding and smoothing, that is, step 40, is introduced. In step 40 the three kinds of phoneme are again handled separately.
For voiced phonemes, first refer to Fig. 2, the block diagram of the speech synthesizer 100 of the invention. At synthesis time, the appropriate speech phonemes are first fetched according to the text data keyed in by the user and the phoneme decomposition rules. A pulse train (excitation signal) generator 101 then produces a pulse train whose period is the pitch period of the voiced phoneme; the pulse train passes through a vocal tract filter 102, whose frequency response is determined by the RC's values; finally, the multiplier 103 adjusts the output speech energy according to the RMS value.
The pulse train generator 101 simulates the vibration of the human vocal cords. Referring to Fig. 3, it forms a periodic sequence e(n), whose period is the pitch period (Pitch) parameter, from the sequence p[25] = {8, -16, 26, -48, 86, -162, 294, -502, 718, -728, 184, 672, -610, -672, 184, 728, 718, 502, 294, 162, 86, 48, 26, 16, 8}. If Pitch > 25, then e(n) = {p[1], p[2], ..., p[25], 0, ..., 0}; if Pitch <= 25, then e(n) = {p[1], p[2], ..., p[Pitch]}. e(n) then passes through a low-pass filter $(1 + 0.75z^{-1} + 0.125z^{-2})$ to give the input excitation signal of the vocal tract filter.
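The construction of the excitation signal can be sketched as follows. The sequence values and the filter coefficients are taken from the text; the function names, the zero initial filter state and the choice to filter the concatenated periods as one signal are assumptions made for illustration.

```python
# Prototype pulse p[1..25] as listed in the text.
P = [8, -16, 26, -48, 86, -162, 294, -502, 718, -728, 184, 672, -610,
     -672, 184, 728, 718, 502, 294, 162, 86, 48, 26, 16, 8]


def one_period(pitch: int) -> list:
    """One period of e(n): zero-padded if pitch > 25, truncated otherwise."""
    if pitch > len(P):
        return P + [0] * (pitch - len(P))
    return P[:pitch]


def lowpass(e: list) -> list:
    """FIR filter 1 + 0.75 z^-1 + 0.125 z^-2, assuming zero initial state."""
    out = []
    for n in range(len(e)):
        x1 = e[n - 1] if n >= 1 else 0
        x2 = e[n - 2] if n >= 2 else 0
        out.append(e[n] + 0.75 * x1 + 0.125 * x2)
    return out


def excitation(pitch: int, periods: int) -> list:
    """Low-pass filtered pulse train covering the given number of periods."""
    return lowpass(one_period(pitch) * periods)


print(lowpass([8, -16, 26]))  # [8.0, -10.0, 15.0]
```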
The vocal tract filter 102 simulates the frequency response of the oral cavity. Its filter parameters are the spectrum parameters computed by LPC, the RC's; its input signal is e(n) and its output is the speech s(n). Because the LPC process applies pre-emphasis $(1 - 0.9875z^{-1})$ at encoding time, which boosts the high-frequency signal so that it is computed correctly, a de-emphasis filter $1/(1 - 0.9875z^{-1})$ must be added at decoding time.
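The relationship between the two filters can be illustrated with a short sketch showing that de-emphasis undoes pre-emphasis. The coefficient 0.9875 is the value given in the text; the function names and the zero initial state are assumptions for illustration, and the vocal tract filter itself is not modelled here.

```python
PRE_EMPHASIS = 0.9875  # coefficient as given in the text


def pre_emphasis(x: list, a: float = PRE_EMPHASIS) -> list:
    """y[n] = x[n] - a * x[n-1], the FIR filter (1 - a z^-1)."""
    out, prev = [], 0.0
    for sample in x:
        out.append(sample - a * prev)
        prev = sample
    return out


def de_emphasis(x: list, a: float = PRE_EMPHASIS) -> list:
    """y[n] = x[n] + a * y[n-1], the IIR filter 1 / (1 - a z^-1)."""
    out, prev = [], 0.0
    for sample in x:
        prev = sample + a * prev
        out.append(prev)
    return out


signal = [1.0, 2.0, 3.0, -1.0]
restored = de_emphasis(pre_emphasis(signal))
print(all(abs(r - s) < 1e-9 for r, s in zip(restored, signal)))  # True
```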
In the multiplier of Fig. 2, the gain is applied: the RMS of the decoded speech signal output by the vocal tract filter 102 is scaled using the decoded RMS value, that is, the amplitude parameter described above, so that it equals the value before encoding, wherein:
In addition, the synthesis of voiced phonemes must be synchronized to the pitch period. Synchronization works as follows: synthesis proceeds in units of one pitch period, and after several consecutive periods have been synthesized, the synthesized speech length must be ≤ the total sample count of the current frame (that is, the frame length (180) + the samples left over from the previously synthesized frame). Any remaining samples that fall short of the total sample count are handled in the next frame. As shown in Fig. 3, taking a sampling rate of 8,000 samples per second as an example, a frame is 180 samples long; after five pitch periods have been taken, the samples remaining in the frame are not enough for another pitch period, so they are carried into the next frame, and so on.
Finally comes the second part of step 40, smoothing, in which the pitch period, amplitude and RC's parameters are smoothed. The parameters are smoothed by interpolation:
synthesis parameter = previous frame parameter × (1 − Prop) + current frame parameter × Prop,
where 0 ≤ Prop (proportion) ≤ 1 and
Prop = number of samples already synthesized in the current frame / total number of samples in the current frame.
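The interpolation rule above amounts to a linear blend between the previous and the current frame parameter. A minimal sketch, in which the function and argument names are chosen for illustration:

```python
def smooth(prev_value: float, cur_value: float,
           synthesized: int, total: int) -> float:
    """Linear interpolation between the previous and current frame parameter.

    synthesized: samples already synthesized in the current frame
    total:       total number of samples in the current frame
    """
    prop = synthesized / total
    if not 0.0 <= prop <= 1.0:
        raise ValueError("Prop must lie between 0 and 1")
    return prev_value * (1 - prop) + cur_value * prop


print(smooth(100.0, 200.0, 0, 180))    # 100.0 (start of frame: previous value)
print(smooth(100.0, 200.0, 90, 180))   # 150.0 (halfway)
print(smooth(100.0, 200.0, 180, 180))  # 200.0 (end of frame: current value)
```

The same function applies unchanged to the pitch period, the RMS value and each of the reflection coefficients.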
The encoding of voiced phonemes is comparatively complicated, which is why its synthesis was described above in more detail. Next, the synthesis of the three different kinds of phoneme is introduced as a whole, that is, the complete flow of synthesizing speech. Refer to Fig. 4, the speech phoneme decoding flow chart of the invention, which shows the concrete operation of steps 40 and 50 more clearly.
In the bit stream from which the speech data is read, the encoding of the invention places the pitch period (Pitch) parameter at the very front of each data record. The pitch period parameter of voiced data is obtained by computation, the pitch period parameter of unvoiced data is 1, and the pitch period parameter of silence is 0, so the pitch period parameter alone determines whether a record is voiced, unvoiced or silence, and each kind is then handled separately. Because the pitch period parameter occupies 6 bits, 6 bits are read first (step 401) to determine whether the data is voiced, unvoiced or silence. If the pitch period > 1 (step 402), the record must be a voiced phoneme, so the remaining 48 bits, that is, the amplitude parameter (RMS) and the spectrum parameter (RC's), are read (step 408) and processed by the speech synthesizer (step 409) to restore the encoded voiced speech. If the pitch period = 0 (step 403), the record must be silence, so 8 bits are read (step 404) to obtain the silence length, and Ls*8 samples of silence are produced (step 407). If the pitch period is neither greater than 1 nor equal to 0, it must be 1, so 8 bits are read (step 405) to look up the storage address of the aspirated sound, and the aspirated samples are read from the database (step 406). Finally, the speech is output (step 410), with the voiced, unvoiced and silent parts of the original speech each restored in its own way.
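The branch structure of Fig. 4 can be sketched at the sample level as follows. The record layout `(pitch, payload)`, the `unvoiced_table` mapping from address to stored samples, and the `synthesize_voiced` callback are all hypothetical stand-ins; in particular the callback takes the place of the speech synthesizer of Fig. 2 and is not implemented here.

```python
def decode_records(records, unvoiced_table, synthesize_voiced) -> list:
    """Turn parsed (pitch, payload) records into output samples.

    records:           iterable of (pitch, payload) pairs
    unvoiced_table:    hypothetical mapping Idx -> stored aspirated samples
    synthesize_voiced: hypothetical callback(pitch, payload) -> samples
    """
    samples = []
    for pitch, payload in records:
        if pitch > 1:                        # steps 402, 408, 409: voiced
            samples.extend(synthesize_voiced(pitch, payload))
        elif pitch == 0:                     # steps 403, 404, 407: silence
            samples.extend([0] * (payload * 8))
        else:                                # steps 405, 406: unvoiced
            samples.extend(unvoiced_table[payload])
    return samples


# Usage with a placeholder synthesizer that just emits a constant.
table = {5: [3, -2, 1]}
fake_synth = lambda pitch, payload: [7] * pitch
out = decode_records([(1, 5), (0, 2), (2, 0)], table, fake_synth)
print(len(out))  # 21
```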
Next, refer to Fig. 5, the signal processing flow chart of the speech synthesizer of the invention, which illustrates the synthesis of voiced phonemes more clearly.
Each voiced data record occupies 54 bits. The synthesis flow is as follows. First, in step 411, the parameters of the first frame are read. Then, in step 412, let
$N = 0$, $L = 180$,
$\mathrm{Pitch}_0 = \mathrm{Pitch}$,
$\mathrm{RMS}_0 = 0$,
$\mathrm{RC}_0(i) = \mathrm{RC}(i)$, $i = 0, 1, \ldots, 9$.
Once the RC parameters have been read, the parameters are smoothed to improve sound quality. This is step 413:
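$\mathrm{prop} = N / L$;
$\mathrm{Pitch}_j = \mathrm{Pitch}_0 \cdot (1 - \mathrm{prop}) + \mathrm{Pitch} \cdot \mathrm{prop}$;
$\mathrm{RMS}_j = \mathrm{RMS}_0 \cdot (1 - \mathrm{prop}) + \mathrm{RMS} \cdot \mathrm{prop}$;
$\mathrm{RC}_j(i) = \mathrm{RC}_0(i) \cdot (1 - \mathrm{prop}) + \mathrm{RC}(i) \cdot \mathrm{prop}$,
$i = 0, 1, \ldots, 9$,
where prop is the proportion and L is the size of the frame; at the start, L = 180.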
Then, if $N + \mathrm{Pitch}_j > L$ (step 414), that is, if taking another pitch period would exceed the length of the frame, the next frame is read; in other words, step 415 is entered. Let
$L = L - N + 180$,
$N = 0$,
$\mathrm{Pitch}_0 = \mathrm{Pitch}$,
$\mathrm{RMS}_0 = \mathrm{RMS}$,
$\mathrm{RC}_0(i) = \mathrm{RC}(i)$, $i = 0, 1, \ldots, 9$.
The flow then continues with step 416, reading the parameters of the next frame.
If $N + \mathrm{Pitch}_j$ is not greater than L, the pitch period, RMS and RC's parameters are taken and step 417 is carried out: they are processed by the speech synthesizer and the speech is output (step 418). Processing then continues with the next pitch period, that is, step 419. Let
$N = N + \mathrm{Pitch}_j$,
$j = j + 1$.
Through the above speech synthesis flow, compressed voiced phonemes can be decoded and synthesized into voiced speech.
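The loop of Fig. 5 (steps 411 to 419) can be sketched as a planner that decides, for a run of voiced frames, how many pitch periods are synthesized and with which interpolated parameters. This sketch tracks only Pitch and RMS, rounds the interpolated pitch period to a whole number of samples, and stops when the frames run out; those choices, and all names, are assumptions for illustration. The reflection coefficients would be interpolated in the same way.

```python
FRAME_LEN = 180  # samples per frame


def plan_pitch_periods(frames: list) -> list:
    """Return one (pitch_j, rms_j) pair per pitch period to synthesize.

    frames: list of (pitch, rms) pairs for consecutive voiced frames.
    """
    if not frames:
        return []
    it = iter(frames)
    pitch, rms = next(it)                     # step 411
    n, length = 0, FRAME_LEN                  # step 412
    prev_pitch, prev_rms = pitch, 0.0
    periods = []
    while True:
        prop = n / length                     # step 413
        pitch_j = round(prev_pitch * (1 - prop) + pitch * prop)
        rms_j = prev_rms * (1 - prop) + rms * prop
        if n + pitch_j > length:              # step 414
            length = length - n + FRAME_LEN   # step 415: carry leftover over
            n = 0
            prev_pitch, prev_rms = pitch, rms
            nxt = next(it, None)              # step 416
            if nxt is None:
                break                         # no more frames (assumed stop)
            pitch, rms = nxt
        else:
            periods.append((pitch_j, rms_j))  # steps 417 and 418
            n += pitch_j                      # step 419
    return periods


plan = plan_pitch_periods([(40, 100.0), (40, 100.0)])
print(len(plan), sum(p for p, _ in plan))  # 9 360
```

With a pitch period of 40, the first frame holds four periods (160 samples) and the 20 left-over samples are added to the second frame, which then holds five periods (200 samples). The amplitude ramps up from zero across the first frame because step 412 initializes the previous RMS to 0.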
Fig. 6 A is the former sound speech waveform of individual character " abbreviation ", Fig. 6 B for utilize the present invention that " abbreviation " encoded and decipher after speech waveform, Fig. 6 C is it via the coding of general prior art and the speech waveform after deciphering; Fig. 7 A--7C then is respectively its frequency spectrum, by Fig. 6 A and Fig. 6 B, and Fig. 7 A and Fig. 7 B can find out, utilize coding of the present invention and phoneme synthesizing method, not only can solve very approximate primitive period and frequency spectrum, and the existing method of its noise ratio is little a lot, moreover, through after the smoothing processing, make the pronunciation more smooth-going nature of pronunciation of the present invention than prior art Fig. 7 C.
Though the present invention with aforesaid preferred embodiment openly as above; right its is not in order to qualification the present invention, any those of ordinary skill in the art, without departing from the spirit and scope of the present invention; when can doing a little change and retouching, so protection scope of the present invention is as the criterion with claim.