CN1450528A


Info

Publication number: CN1450528A
Application number: CN02105935A
Authority: CN
Inventors: 杨凰琳
Original assignee: Inventec Besta Co Ltd
Current assignee: Inventec Besta Co Ltd
Priority date: 2002-04-09
Filing date: 2002-04-09
Publication date: 2003-10-22
Anticipated expiration: 2022-04-09
Also published as: CN1210688C

Abstract

The invention relates to a method for encoding and synthesizing speech phonemes. Speech is sampled off-line, and the sampled phonemes are classified into three types: voiced, unvoiced, and silent. Voiced phonemes are encoded by their pitch-period, amplitude, and spectral parameters; unvoiced phonemes are recorded directly; for silent phonemes only the silence duration is recorded. The encoded phoneme data are stored in a speech database. Speech is then restored by decoding and synthesizing the encoded phoneme data from the database: voiced phonemes are synthesized by a speech synthesizer designed around the pitch-period, amplitude, and spectral parameters, unvoiced phonemes are retrieved directly, and silence is played for the recorded duration, so that speech close to the original can be synthesized.

Description

Method for encoding speech phonemes and synthesizing speech

Technical field

The present invention relates to a speech coding and speech synthesis method, and in particular to a phoneme-based method that uses linear predictive coding (LPC) to encode and decode speech.

Background technology

In the low-end and mid-range electronic dictionary market, real-human pronunciation has become a headline feature and a principal demand. To make such dictionaries more competitive, every manufacturer concentrates on improving the speech function while also trying to reduce production cost. Some manufacturers emphasize recorded human speech, but its data volume is large, the range of output the system can produce is severely restricted, and it is costly. Most manufacturers therefore use speech analysis-synthesis to approximate human pronunciation, which lets the dictionary save speech data memory and improve sound quality.

Speech analysis-synthesis analyzes the speech signal by some processing method, extracts the necessary characteristic parameters, and synthesizes speech from those parameters according to a model of speech production. Because the process represents the original signal with a minimum of digital data, it is also commonly called speech compression, and it involves sampling, encoding, and decoding of speech. One example among waveform coders is adaptive differential pulse-code modulation (ADPCM), which aims to make the reconstructed waveform match the original as closely as possible; mathematically, it adopts the minimum mean square error criterion. However, the sound quality of ADPCM deteriorates once the bit rate drops below 24 kbps (kilobits per second), and its computational load is large.

The speech analysis-synthesis described above greatly compresses the speech data volume and offers the additional advantage of secure communication (by applying encryption). Its shortcoming is that the stress, partials, and pitch period (tone) of the synthesized speech often differ from natural speech, so the result sounds unnatural and can even be hard to recognize.

Even speech that has already been compressed by analysis-synthesis can still yield further memory savings. In addition, existing analysis-synthesis techniques mostly operate on-line, so they must include a voiced/unvoiced decision step. Misjudging voiced and unvoiced portions during that decision commonly produces hoarse-sounding synthesized speech.

Three questions have therefore become important research topics: how to make the synthesized speech approach natural speech, that is, improve sound quality; how to reach maximum compression, that is, occupy the least memory; and how to keep the analysis-synthesis process comparatively simple.

Summary of the invention

In view of the above problems in the prior art, the object of the present invention is to provide a method for encoding speech phonemes and synthesizing speech in which the phonemes are divided off-line, in advance, into voiced and unvoiced phonemes and processed separately, which simplifies the synthesis process.

Voiced phonemes are encoded by computing and encoding their amplitude, pitch-period, and spectral parameters, the spectral parameters being encoded as LPC parameters. Unvoiced (aspirated) phonemes keep their original sound without compression. For silent portions only the silence length is recorded. On decompression, only the voiced portions need interpolation to smooth the amplitude, pitch-period, and spectral parameters before the speech synthesizer restores the voiced speech; unvoiced portions are restored simply by fetching the original samples by address; and silent portions only require reading out the silence duration.

According to the disclosed technique, the invention provides a method for encoding speech phonemes and synthesizing speech that comprises two stages: a speech database building stage and a speech synthesis stage.

The speech database building stage comprises the following steps: dividing the speech phonemes into voiced, unvoiced, and silent phonemes; compression-encoding the voiced phonemes, address-encoding the unvoiced phonemes, and duration-encoding the silent phonemes; and storing the compression-encoded voiced phonemes together with the unvoiced and silent phonemes in the speech database.

Once the user keys in text data, the phonemes of the text are analyzed and the corresponding phoneme data are read from the speech database; the method then enters the next stage.

The speech synthesis stage synthesizes the speech of the text data from the phoneme data of the speech database and comprises the following steps: reading the voiced, unvoiced, and silent phoneme codes of the phoneme data; and synthesizing voiced speech from the voiced phoneme codes through a speech synthesizer, producing unvoiced speech from the unvoiced phoneme codes, and producing silence from the silent phoneme codes.

In the speech database building stage, voiced phonemes are compression-encoded by a pitch-period parameter, an amplitude parameter, and a spectral parameter; unvoiced phonemes are encoded by a pitch-period parameter and an address parameter; and silent phonemes are encoded by a pitch-period parameter and a time parameter.

In the speech synthesis stage, the voiced, unvoiced, and silent speech codes are fetched from the speech database according to the coding rules, then decoded and synthesized separately to obtain the synthesized speech. Voiced speech passes through a speech synthesizer designed around the pitch-period, spectral, and amplitude parameters.

Specifically, the method of the invention for encoding speech phonemes and synthesizing speech samples a language off-line and encodes and synthesizes the speech phonemes of the sampled language, comprising the following steps:

building a speech database, comprising the following steps:

dividing the speech phonemes into voiced, unvoiced, and silent phonemes;

compression-encoding the voiced phonemes, address-encoding the unvoiced phonemes, and duration-encoding the silent phonemes; and

storing the compression-encoded voiced phonemes and storing the unvoiced and silent phonemes in the speech database;

when a user keys in text data, analyzing the phonemes of the text data and reading phoneme data from the speech database; and

synthesizing the speech of the text data according to the phoneme data of the speech database, comprising the following steps:

reading the voiced phoneme codes, the unvoiced phoneme codes, and the silent phoneme codes of the phoneme data; and

synthesizing voiced speech through a speech synthesizer according to the voiced phoneme codes of the phoneme data, producing unvoiced speech according to the unvoiced phoneme codes of the speech data, and producing silence according to the silent phoneme codes.

In the method, the language is sampled at a rate of 8,000 samples per second.

The voiced phonemes are compression-encoded by a pitch-period parameter, an amplitude parameter, and a spectral parameter; the unvoiced phonemes are address-encoded by the pitch-period parameter and an address parameter; and the silent phonemes are duration-encoded by the pitch-period parameter and a time parameter.

The pitch-period parameter and the amplitude parameter of the voiced phonemes are computed frame by frame.

The spectral parameter is encoded by linear predictive coding (LPC).

The address parameter records the storage address of the unvoiced phonemes of the sampled speech.

The time parameter records the silence duration of the silent phonemes of the sampled speech.

The pitch-period parameter value of an unvoiced phoneme is defined as 1, and that of a silent phoneme is defined as 0.

The voiced speech is synthesized from the pitch-period, amplitude, and spectral parameters, and the speech synthesizer comprises:

a pulse train generator, which outputs an excitation signal from the pitch-period parameter;

a vocal tract filter, which takes the spectral parameter as its filter parameters, receives the excitation signal, and outputs a speech signal; and

a multiplier, which multiplies the speech signal by the amplitude parameter to output the restored speech.

The unvoiced speech is produced by reading an unvoiced speech phoneme from the speech database according to the address parameter and generating the unvoiced speech from it.

The silence is produced by outputting zero-amplitude silence for the duration given by the time parameter.

The encoding and decoding method of the invention can be carried out off-line. The original phoneme files can be compressed to under 2 Mbytes (2.4 kbps), saving a large amount of memory; each sample is 16 bits to improve sound quality; and smoothing during decompression improves some of the poorly joined phonemes. Moreover, because voiced and unvoiced speech are processed separately, the voiced portion is free of the hoarseness that voiced/unvoiced misjudgment causes in ordinary speech coding, while the unvoiced portion keeps the original aspirated sound for the best aspirate quality.

The features of the invention are described in detail below through a preferred embodiment with reference to the drawings.

Description of drawings

Fig. 1 is a flow chart of the speech phoneme encoding and speech synthesis method of the invention;

Fig. 2 is a block diagram of the speech synthesizer of the invention;

Fig. 3 is a diagram of the simulated human vocal-cord vibration of the invention;

Fig. 4 is a flow chart of speech phoneme decoding in the invention;

Fig. 5 is a signal-processing flow chart of the speech synthesizer of the invention;

Fig. 6A is the original speech waveform of the word "abbreviation";

Fig. 6B is the speech waveform of the word "abbreviation" after encoding and synthesis by the method of the invention;

Fig. 6C is the speech waveform of the word "abbreviation" after encoding and synthesis by an ordinary method;

Fig. 7A is the spectrogram of Fig. 6A;

Fig. 7B is the spectrogram of Fig. 6B; and

Fig. 7C is the spectrogram of Fig. 6C.

Embodiment

From the standpoint of pronunciation, most languages are multisyllabic. Taking English as an example, if it is broken down into the distinct single syllables formed from the various phonetic symbols, several thousand basic pronunciation units result. These units are the phonemes, and each phoneme has its own pitch period. A language whose pronunciation is built on phonemes can therefore use the phoneme as the unit for encoding and decoding its speech, and the present invention is an application of this concept.

Second, speech processing in the electronic dictionary market is fairly regular and demands a high degree of compression, so the invention uses linear predictive coding (LPC) for encoding and decoding. LPC is based on a model of speech production: it estimates the vocal tract filter parameters and the pitch period of the signal to achieve compression, and it can reach a very low bit rate, which makes it well suited to the invention.

Referring to Fig. 1, the flow chart of the method, the steps are: distinguishing voiced, unvoiced, and silent phonemes (step 10); encoding the phonemes (step 20); storing the encoded voiced phoneme codes, the unvoiced phonemes, and the silent phonemes (step 30); decoding the phonemes and smoothing (step 40); and synthesizing speech (step 50). This flow comprises two stages: an encoding stage (steps 10 to 30) and a decoding stage (steps 40 to 50). The encoding stage centers on building the speech database, so it may also be called the speech database building stage. In the decoding stage, when the user of the electronic dictionary selects text to be pronounced, the dictionary breaks the text down into speech phonemes according to the rules used to build the database, fetches the encoded phonemes, decodes them according to the coding rules of the invention, and then restores and synthesizes the speech; this stage may therefore also be called the synthesis stage. The individual steps are described below.

First, in step 10, speech can be divided into phonemes from the pronunciation of the text, and the phonemes can in turn be classified. The invention uses the distinction among voiced, unvoiced, and silent phonemes as its basic classification. Voiced phonemes are the periodic part of speech and can therefore be compressed further; unvoiced phonemes are the aperiodic part and are not compressed; for silence, only its length is recorded.

Take the English entries of an electronic dictionary as an example. The pairing of letters and phonetic symbols (phonetic alphabet) follows definite rules, so the voiced and unvoiced parts of each syllable can be distinguished syllable by syllable, and the voiced and unvoiced parts of the speech can be identified in advance from the phonetic symbol data in the English database. For example, the unvoiced symbols include f, p, s, and t, so the phonetic transcription of free, [fri], becomes [f-ri] after processing. The same reasoning applies to Mandarin and other languages.
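The off-line split of a transcription into an unvoiced onset and a voiced remainder can be sketched as follows. This is only an illustration: the set `UNVOICED_ONSETS` holds just the four symbols the text names (the text says "etc.", so the real set is larger), and the function names are hypothetical.

```python
# Hypothetical symbol set: the description names only f, p, s, t ("etc.").
UNVOICED_ONSETS = {"f", "p", "s", "t"}


def split_syllable(phonetic):
    """Return (unvoiced onset, voiced remainder) of one transcribed syllable."""
    i = 0
    while i < len(phonetic) and phonetic[i] in UNVOICED_ONSETS:
        i += 1
    return phonetic[:i], phonetic[i:]


def mark_boundary(phonetic):
    """Insert '-' between the unvoiced onset and the voiced part, if both exist."""
    head, tail = split_syllable(phonetic)
    return f"{head}-{tail}" if head and tail else phonetic


print(mark_boundary("fri"))  # f-ri
```

A syllable with no unvoiced onset, such as "ri", is returned unchanged.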

Using the information in the language itself, the voiced and unvoiced parts of speech are separated off-line by preprocessing; that is, before encoding, all speech phonemes are divided into voiced and unvoiced classes. Processing a voiced phoneme removes the aspirated initial consonant of the syllable and keeps only the voiced final (vowel part). Processing an unvoiced phoneme keeps the unvoiced consonant and the aspirated initial consonant of the syllable. The silent parts of the speech, which may contain slight noise, are all set to zero, and only the silence length is recorded.

After classification comes step 20, phoneme encoding. Because step 10 has already divided the phonemes into voiced, unvoiced, and silent, the invention encodes each of the three pre-classified kinds. The coding uses three main parameters: the root-mean-square value of the amplitude (RMS), the pitch period (Pitch, that is, the tone), and the spectral parameters (RCs, reflection coefficients).

The amplitude and pitch-period parameters are computed frame by frame (one frame = 180 samples at a sampling rate of 8 kHz). The spectral parameters (RCs) are computed by the LPC method, that is, from the following all-pole model:

H(z) = \frac{A_0}{1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_{10} z^{-10}}

where A_0 is the amplitude parameter, z is the z-transform variable, and a_1 through a_10 are the LPC parameters.

With these three parameters, one voiced frame (180 samples) is encoded in 54 bits, equivalent to a compressed bit rate of 2.4 kbps. The bit allocation of the parameters is: Pitch (6 bits), RMS (6 bits), RCs (RC₀ to RC₉)
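The 2.4 kbps figure follows directly from the frame size and the sampling rate, as the short check below shows. The constant names are illustrative; the 16-bit sample width used for the comparison is the value mentioned elsewhere in the description, and the bit total left for the ten reflection coefficients is only derived by subtraction, since the per-coefficient allocation is not given here.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate stated in the description
FRAME_SAMPLES = 180     # samples per frame
FRAME_BITS = 54         # bits per encoded voiced frame
PITCH_BITS = 6
RMS_BITS = 6
PCM_BITS_PER_SAMPLE = 16  # sample width mentioned in the description

# Bits per second = bits per frame * frames per second.
bit_rate = SAMPLE_RATE_HZ * FRAME_BITS / FRAME_SAMPLES

# Bits left for RC0..RC9 after Pitch and RMS (derived, not stated).
rc_bits = FRAME_BITS - PITCH_BITS - RMS_BITS

# Uncompressed rate for comparison.
pcm_rate = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE

print(bit_rate)             # 2400.0
print(rc_bits)              # 42
print(pcm_rate / bit_rate)  # compression ratio relative to 16-bit PCM
```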

An unvoiced frame is recorded directly by the invention, so its pitch-period (Pitch) parameter value is defined as 1. It is coded as follows:

    Field   Pitch   Index_of_unvoiced_speech
    Bits    6       8 (Idx)

where Idx is a pointer to the actual (aspirated) speech, that is, the address at which it is stored.

For a silent frame, the pitch-period parameter value is set to 0. It is coded as follows:

    Field   Pitch   Length_of_silence
    Bits    6       8 (Ls)

where Ls is the length of the silence.
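The two 14-bit record layouts above can be illustrated with a small packing sketch. The function names are hypothetical, and writing each field most-significant bit first is an assumption; the description fixes only the field widths and the reserved Pitch values 1 and 0.

```python
PITCH_BITS = 6
PAYLOAD_BITS = 8
PITCH_UNVOICED = 1  # reserved Pitch value for unvoiced records
PITCH_SILENCE = 0   # reserved Pitch value for silent records


def _pack(pitch, payload):
    """Return the record as a bit string: 6-bit Pitch followed by 8-bit payload."""
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload does not fit in 8 bits")
    return format(pitch, f"0{PITCH_BITS}b") + format(payload, f"0{PAYLOAD_BITS}b")


def pack_unvoiced(idx):
    """Record pointing at stored unvoiced samples (Idx)."""
    return _pack(PITCH_UNVOICED, idx)


def pack_silence(ls):
    """Record holding the silence length (Ls)."""
    return _pack(PITCH_SILENCE, ls)


print(pack_unvoiced(5))   # 00000100000101
print(pack_silence(100))  # 00000001100100
```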

Next, in step 30, the encoded speech data are recorded in the speech database. Steps 10 to 30 constitute the coding rule of the invention: the voiced, unvoiced, and silent parts of the speech phonemes are each encoded in a different way, which saves a considerable amount of memory.

The completed speech database serves as the data basis for speech synthesis. When speech data are read, the pitch-period parameter is read first. If Pitch > 1, a total of 54 bits is read and decoded into voiced speech. If Pitch = 1, a further 8 bits (Idx) are read and the actual aspirated speech data are loaded according to Idx; for English, all the unvoiced aspirated data occupy 120 kbytes of memory. If Pitch = 0, a further 8 bits (Ls) are read and decoded into silence of length Ls*8 samples.

In other words, because the invention processes the voiced, unvoiced, and silent parts of speech separately, the three have different encoded formats, as the bit allocations above show. Synthesizing speech therefore only requires applying the coding rules of the invention in reverse. The synthesis stage, steps 40 and 50, is described below.

First, phoneme decoding and smoothing, step 40. Here too the three kinds of phoneme are handled separately.

For voiced phonemes, refer first to Fig. 2, the block diagram of the speech synthesizer 100 of the invention. During synthesis, the appropriate speech phonemes are first fetched according to the text keyed in by the user and the phoneme decomposition rules. A pulse train generator 101 first produces a pulse train (excitation signal) whose period is the pitch period of the voiced phoneme. The pulse train then passes through a vocal tract filter 102, whose frequency response is determined by the RC values. Finally, a multiplier 103 adjusts the output speech energy according to the RMS value.

The pulse train generator 101 simulates the vibration of the human vocal cords (see Fig. 3). It forms a periodic sequence e(n) from the sequence p[25] = {8, -16, 26, -48, 86, -162, 294, -502, 718, -728, 184, 672, -610, -672, 184, 728, 718, 502, 294, 162, 86, 48, 26, 16, 8}, with the period given by the pitch-period (Pitch) parameter. If Pitch > 25, then e(n) = {p[1], p[2], ..., p[25], 0, ..., 0}; if Pitch <= 25, then e(n) = {p[1], p[2], ..., p[Pitch]}. e(n) then passes through a low-pass filter (1 + 0.75z^-1 + 0.125z^-2) to give the input excitation signal of the vocal tract filter.
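One pitch period of the excitation and the low-pass stage can be sketched as below. The pulse shape is copied from the description; the function names are illustrative, and starting the filter from a zero state on every call is a simplifying assumption (a continuous synthesizer would carry the filter state from one period to the next).

```python
# Pulse shape p[1..25] as given in the description.
P = [8, -16, 26, -48, 86, -162, 294, -502, 718, -728, 184, 672,
     -610, -672, 184, 728, 718, 502, 294, 162, 86, 48, 26, 16, 8]


def pitch_period_excitation(pitch):
    """Return e(n) for one period: p zero-padded or truncated to `pitch` samples."""
    if pitch > len(P):
        return P + [0] * (pitch - len(P))
    return P[:pitch]


def lowpass(e):
    """Apply the FIR filter 1 + 0.75 z^-1 + 0.125 z^-2 (zero initial state)."""
    out = []
    x1 = x2 = 0.0  # previous two inputs
    for x in e:
        out.append(x + 0.75 * x1 + 0.125 * x2)
        x2, x1 = x1, x
    return out


excitation = lowpass(pitch_period_excitation(40))
print(len(excitation))  # 40
```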

The vocal tract filter 102 simulates the frequency response of the oral cavity. Its filter parameters are the spectral parameters (RCs) computed by the LPC method; its input is e(n) and its output is the speech s(n). Because the LPC analysis applies pre-emphasis (1 - 0.9875z^-1) at encoding to strengthen the high-frequency components for accurate computation, decoding must add a de-emphasis filter 1/(1 - 0.9875z^-1).
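A vocal tract filter driven by reflection coefficients is commonly realized as an all-pole lattice. The sketch below shows one such realization followed by the de-emphasis stage. It is not taken from the description: the lattice structure, the sign convention of the coefficients, and the function names are assumptions, and decoding the quantized RC values is not covered.

```python
def lattice_synthesis(excitation, rc):
    """All-pole lattice filter; rc[i-1] is reflection coefficient k_i.

    Assumed convention: f_{i-1}[n] = f_i[n] - k_i * b_{i-1}[n-1]
                        b_i[n]     = b_{i-1}[n-1] + k_i * f_{i-1}[n]
    """
    p = len(rc)
    b = [0.0] * (p + 1)  # b[i] holds the delayed backward error b_i[n-1]
    out = []
    for x in excitation:
        f = x
        for i in range(p, 0, -1):
            f = f - rc[i - 1] * b[i - 1]
            b[i] = b[i - 1] + rc[i - 1] * f
        b[0] = f
        out.append(f)
    return out


def de_emphasis(s, coeff=0.9875):
    """Apply 1 / (1 - coeff * z^-1), the inverse of the pre-emphasis filter."""
    out = []
    prev = 0.0
    for x in s:
        prev = x + coeff * prev
        out.append(prev)
    return out


# With a single coefficient k the lattice reduces to 1 / (1 + k z^-1).
print(lattice_synthesis([1, 0, 0], [0.5]))  # [1.0, -0.5, 0.25]
```

With all coefficients set to zero the lattice passes its input through unchanged, which is a quick sanity check for any chosen convention.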

The multiplier of Fig. 2 applies a gain value (Gain): the speech signal output by the vocal tract filter 102 is scaled so that its RMS matches the decoded RMS value, that is, the amplitude parameter above, and thus equals the value before encoding, where:

Gain = \frac{RMS}{\sqrt{\frac{1}{N} \sum_{n=0}^{N} s^{2}(n)}}
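The gain computation can be sketched as follows. The function name is illustrative, the mean is taken over the samples actually supplied, and returning zero gain for an all-zero signal is an added safeguard that the description does not mention.

```python
import math


def gain(rms_target, s):
    """Return the factor that scales signal s to the decoded RMS value."""
    mean_square = sum(x * x for x in s) / len(s)
    if mean_square == 0:
        return 0.0  # assumption: avoid division by zero on a silent signal
    return rms_target / math.sqrt(mean_square)


signal = [1.0, -1.0, 1.0, -1.0]   # RMS of this signal is 1
g = gain(2.0, signal)
scaled = [g * x for x in signal]  # RMS of the scaled signal is the target
print(g)  # 2.0
```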

In addition, voiced synthesis must be pitch-synchronous. Synthesis proceeds one pitch period at a time. After several consecutive periods have been synthesized, the synthesized length must be ≤ the total sample count of the current frame (that is, the frame length (180) plus the samples left over from the previous frame); any remaining samples short of that total are handled in the next frame. As shown in Fig. 3, taking a sampling rate of 8,000 samples per second as an example, a frame is about 180 samples long. After five pitch periods have been taken, fewer than 180 samples have been used, but the remainder is too short for another full pitch period, so it is carried over into the next frame, and so on.

Finally comes the second part of step 40, smoothing, in which the pitch-period, amplitude, and RC parameters are smoothed by interpolation:

synthesis parameter = previous frame parameter * (1 - Prop) + current frame parameter * Prop

where 0 ≤ Prop (proportion) ≤ 1 and

Prop = samples synthesized so far in the current frame / total samples in the current frame.
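The interpolation rule can be illustrated with a few values. The function name and arguments are illustrative; only the formula itself comes from the description.

```python
def interpolate(previous, current, synthesized, total):
    """Linear interpolation between the previous and current frame parameter."""
    prop = synthesized / total
    if not 0.0 <= prop <= 1.0:
        raise ValueError("Prop must lie in [0, 1]")
    return previous * (1 - prop) + current * prop


# Pitch period moving from 40 to 60 samples across a 180-sample frame.
print(interpolate(40, 60, 0, 180))    # 40.0 at the start of the frame
print(interpolate(40, 60, 90, 180))   # 50.0 halfway through
print(interpolate(40, 60, 180, 180))  # 60.0 at the end
```

The same function applies unchanged to the RMS value and to each of the ten reflection coefficients.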

The coding of voiced phonemes is the most involved, so its synthesis has been described in detail above. Next, the synthesis of all three kinds of phoneme is described as a whole, that is, the overall flow for combining them into speech. Refer to Fig. 4, the phoneme decoding flow chart of the invention, which shows the concrete operation of steps 40 and 50 more clearly.

In reading the bit stream of the speech data, the coding of the invention places the pitch-period (Pitch) parameter at the very front of each record. The pitch-period parameter of a voiced record is obtained by computation, that of an unvoiced record is 1, and that of a silent record is 0, so the value of this parameter identifies the record as voiced, unvoiced, or silent, and each kind is handled accordingly. Because the pitch-period parameter occupies 6 bits, 6 bits are read first (step 401) to identify the record. If the pitch period > 1 (step 402), the record is a voiced phoneme: the remaining 48 bits, the amplitude parameter (RMS) and the spectral parameters (RCs), are read (step 408) and processed by the speech synthesizer (step 409) to restore the encoded voiced speech. If the pitch period = 0 (step 403), the record is silence: 8 bits are read (step 404) to obtain the silence length, and Ls*8 samples of silence are produced (step 407). If the pitch period is neither greater than 1 nor equal to 0, it must be 1: 8 bits are read (step 405) to look up the storage address of the aspirated sound, and the aspirated samples are read from the database (step 406). Finally, the speech is output (step 410), with the voiced, unvoiced, and silent parts of the original speech each restored.
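The decoding flow of Fig. 4 amounts to a dispatch on the leading 6-bit Pitch field. The sketch below parses a bit string into records along those lines. The `BitReader` class, the record tuples, and the most-significant-bit-first field order are assumptions for illustration; the 48-bit voiced payload is returned as one integer because its internal layout is not reproduced here.

```python
class BitReader:
    """Minimal reader over a string of '0'/'1' characters."""

    def __init__(self, bits):
        self.bits = bits
        self.pos = 0

    def read(self, n):
        chunk = self.bits[self.pos:self.pos + n]
        if len(chunk) < n:
            raise EOFError("truncated record")
        self.pos += n
        return int(chunk, 2)

    def at_end(self):
        return self.pos >= len(self.bits)


def decode_records(bits):
    """Split a bit stream into voiced, unvoiced, and silence records."""
    reader = BitReader(bits)
    records = []
    while not reader.at_end():
        pitch = reader.read(6)                       # step 401
        if pitch > 1:                                # step 402: voiced
            payload = reader.read(48)                # step 408: RMS + RCs
            records.append(("voiced", pitch, payload))
        elif pitch == 0:                             # step 403: silence
            ls = reader.read(8)                      # step 404
            records.append(("silence", ls * 8))      # step 407: Ls*8 samples
        else:                                        # pitch == 1: unvoiced
            idx = reader.read(8)                     # step 405
            records.append(("unvoiced", idx))        # step 406: fetch by Idx
    return records


stream = "000000" + "00000010" + "000001" + "00000101" + "101000" + "0" * 48
print(decode_records(stream))
# [('silence', 16), ('unvoiced', 5), ('voiced', 40, 0)]
```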

Refer next to Fig. 5, the signal-processing flow chart of the speech synthesizer of the invention, which illustrates the synthesis of voiced phonemes more clearly.

Voiced data occupy 54 bits per frame. The synthesis flow is as follows. First, in step 411, the parameters of the first frame are read. Then, in step 412, set

N = 0, L = 180,

Pitch0 = Pitch,

RMS0 = 0,

RC0_i = RC_i, i = 0, 1, …, 9

to load the RC parameters. Next, the parameters are smoothed to improve sound quality. This is step 413:

prop = N / L;

Pitch_j = Pitch0 * (1 - prop) + Pitch * prop;

RMS_j = RMS0 * (1 - prop) + RMS * prop;

RC_j(i) = RC0(i) * (1 - prop) + RC(i) * prop,

i = 0, 1, …, 9

where prop is the proportion and L is the frame size; initially L = 180.

Then, if N + Pitch_j > L (step 414), that is, the length taken would exceed one frame, the next frame is read. This is step 415:

L = L - N + 180

N = 0

Pitch0 = Pitch

RMS0 = RMS

RC0_i = RC_i, i = 0, 1, …, 9

followed by step 416, which reads in the parameters of the next frame.

If N + Pitch_j is not greater than L, the pitch-period, RMS, and RC parameters are passed to the speech synthesizer (step 417), the speech is output (step 418), and processing then continues with step 419:

N = N + Pitch_j

j = j + 1
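The control flow of steps 411 to 419 can be sketched as a scheduler that decides how long each synthesized pitch period is and which interpolated RMS it uses. This is an illustration only: the RC interpolation and the synthesizer itself are omitted, rounding Pitch_j to a whole number of samples is an assumption, and stopping when no further frame is available is an added termination rule.

```python
def schedule_pitch_periods(frames, frame_len=180):
    """frames: list of (pitch, rms) pairs for consecutive voiced frames.

    Returns one (pitch_j, rms_j) pair per synthesized pitch period.
    """
    it = iter(frames)
    pitch, rms = next(it)                    # step 411: first frame
    n, length = 0, frame_len                 # step 412
    pitch0, rms0 = pitch, 0.0
    periods = []
    while True:
        prop = n / length                    # step 413: smoothing
        pitch_j = round(pitch0 * (1 - prop) + pitch * prop)  # rounding assumed
        rms_j = rms0 * (1 - prop) + rms * prop
        if n + pitch_j > length:             # step 414
            length = length - n + frame_len  # step 415: carry the remainder
            n = 0
            pitch0, rms0 = pitch, rms
            nxt = next(it, None)             # step 416: next frame
            if nxt is None:
                break                        # assumption: stop at end of data
            pitch, rms = nxt
        else:
            periods.append((pitch_j, rms_j))  # steps 417-418: synthesize, output
            n += pitch_j                      # step 419
    return periods


periods = schedule_pitch_periods([(40, 100.0)])
print(len(periods))                 # 4
print(sum(p for p, _ in periods))   # 160
```

For a single frame with a 40-sample pitch period, four periods (160 samples) fit, and the remaining 20 samples would be carried into the next frame's length.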

By the synthesis flow above, compressed voiced phonemes are decoded and synthesized into voiced speech.

Fig. 6A is the original speech waveform of the word "abbreviation"; Fig. 6B is the waveform after "abbreviation" is encoded and decoded with the invention; Fig. 6C is the waveform after encoding and decoding with ordinary prior art; and Figs. 7A to 7C are the respective spectra. Comparing Fig. 6A with Fig. 6B and Fig. 7A with Fig. 7B shows that the encoding and speech synthesis method of the invention recovers a very close pitch period and spectrum with far less noise than the existing method. Moreover, after smoothing, the speech of the invention sounds smoother and more natural than that of the prior art in Fig. 7C.

Although the invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Any person of ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the claims.

Claims

Translated from Chinese

1. A method for encoding speech phonemes and synthesizing speech, characterized in that a language is sampled off-line and the speech phonemes of the sampled language are encoded and synthesized into speech, the method comprising the following steps:

establishing a speech database, comprising the following steps:

classifying the speech phonemes into voiced, unvoiced, and silent phonemes;

compression-encoding the voiced phonemes, address-encoding the unvoiced phonemes, and time-length-encoding the silent phonemes; and

storing the compression-encoded voiced phonemes, and storing the unvoiced and silent phonemes, in the speech database;

when a user enters text data, analyzing the phonemes of the text data and reading phoneme data from the speech database; and

synthesizing the speech of the text data according to the phoneme data from the speech database, comprising the following steps:

reading the voiced phoneme code, the unvoiced phoneme code, and the silent phoneme code of the phoneme data; and

synthesizing a voiced speech through a speech synthesizer according to the voiced phoneme code of the phoneme data, generating an unvoiced speech according to the unvoiced phoneme code of the speech data, and generating a silent speech according to the silent phoneme code.

2. The method for encoding speech phonemes and synthesizing speech as claimed in claim 1, characterized in that the language is sampled at a rate of 8,000 samples per second.

3. The method for encoding speech phonemes and synthesizing speech as claimed in claim 1, characterized in that the compression encoding of the voiced phoneme encodes it according to a pitch-period parameter, an amplitude parameter, and a spectral parameter; the address encoding of the unvoiced phoneme encodes it with the pitch-period parameter and an address parameter; and the time-length encoding of the silent phoneme encodes it with the pitch-period parameter and a time parameter.

4. The method for encoding speech phonemes and synthesizing speech as claimed in claim 3, characterized in that the values of the pitch-period parameter and the amplitude parameter of the voiced phoneme are computed step by step, one frame at a time.

5. The method for encoding speech phonemes and synthesizing speech as claimed in claim 3, characterized in that the spectral parameter is encoded by linear predictive coding (LPC).
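
As an illustration of the linear predictive coding named in claim 5, the following minimal sketch estimates predictor coefficients for one frame using the autocorrelation method and the Levinson-Durbin recursion. The claims do not specify the estimation method, the predictor order, or the sign convention of the coefficients; the functions `autocorrelation`, `levinson_durbin`, and `lpc`, and the default order of 10, are assumptions made for illustration only.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the autocorrelation of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations with the Levinson-Durbin recursion.

    Returns (a, err), where a[0] == 1.0 and the prediction-error filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order (assumed convention).
    """
    a = [1.0] + [0.0] * order
    err = float(r[0])
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # all-zero or perfectly predictable frame: stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this stage
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k  # remaining prediction error energy
    return a, err


def lpc(frame, order=10):
    """Estimate LPC coefficients for one frame (order is an assumption)."""
    return levinson_durbin(autocorrelation(frame, order), order)
```

For a frame that decays geometrically, such as `[0.9 ** n for n in range(200)]`, a first-order fit gives `a[1]` close to -0.9, which is the expected single-pole predictor.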

6. The method for encoding speech phonemes and synthesizing speech as claimed in claim 1 or 3, characterized in that the address parameter records the storage address of the unvoiced phoneme of the sampled speech.

7. The method for encoding speech phonemes and synthesizing speech as claimed in claim 1 or 3, characterized in that the time parameter records the duration of silence of the silent phoneme of the sampled speech.

8. The method for encoding speech phonemes and synthesizing speech as claimed in claim 3, characterized in that the pitch-period parameter value of the unvoiced phoneme is defined as 1, and the pitch-period parameter value of the silent phoneme is defined as 0.
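
The record layout implied by claims 3 and 6 to 8 can be sketched as follows: every record starts with a pitch-period field, and the reserved values 1 and 0 mark unvoiced and silent phonemes respectively. The dictionary representation, the field names (`pitch`, `amplitude`, `spectrum`, `address`, `duration_ms`), the millisecond unit, and the one-record-per-frame granularity are assumptions for illustration; the claims fix only which parameters each phoneme type carries and the two reserved values.

```python
UNVOICED_TAG = 1  # pitch-period value reserved for unvoiced phonemes (claim 8)
SILENT_TAG = 0    # pitch-period value reserved for silent phonemes (claim 8)


def encode_voiced(pitch_period, amplitude, spectrum):
    """Build a voiced record: pitch period, amplitude, spectral parameters."""
    if pitch_period in (UNVOICED_TAG, SILENT_TAG):
        raise ValueError("pitch period collides with a reserved tag value")
    return {"pitch": pitch_period,
            "amplitude": amplitude,
            "spectrum": list(spectrum)}


def encode_unvoiced(address):
    """Build an unvoiced record: the tag plus the storage address."""
    return {"pitch": UNVOICED_TAG, "address": address}


def encode_silent(duration_ms):
    """Build a silent record: the tag plus the length of the silence."""
    return {"pitch": SILENT_TAG, "duration_ms": duration_ms}


def classify(record):
    """Recover the phoneme type from the pitch-period field alone."""
    if record["pitch"] == SILENT_TAG:
        return "silent"
    if record["pitch"] == UNVOICED_TAG:
        return "unvoiced"
    return "voiced"
```

Reserving two values of the pitch-period field lets a decoder tell the three record types apart without a separate type field, since a real voiced pitch period is far longer than one sample.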

9. The method for encoding speech phonemes and synthesizing speech as claimed in claim 1 or 3, characterized in that the voiced speech is synthesized according to the pitch-period parameter, the amplitude parameter, and the spectral parameter, wherein the speech synthesizer comprises:

a pulse train generator for outputting the pitch-period parameter as an excitation signal;

a vocal tract filter that uses the spectral parameter as its filter parameters, for receiving the excitation signal and outputting it as a speech signal; and

a multiplier for multiplying the speech signal by the amplitude parameter to output a restored speech.
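
The three components of claim 9 can be sketched as a pulse train feeding an all-pole filter whose output is scaled by the amplitude. This is a minimal sketch, assuming the pitch period is an integer number of samples (at least 1), the spectral parameter is a list of LPC coefficients `a` with `a[0] == 1`, and the vocal tract filter is the all-pole filter 1/A(z); the function names are hypothetical and the claims do not fix the filter form.

```python
def pulse_train(pitch_period, length):
    """Excitation signal: a unit impulse every `pitch_period` samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]


def vocal_tract_filter(excitation, a):
    """All-pole filter 1 / A(z), A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p."""
    out = []
    for n, x in enumerate(excitation):
        # Subtract the weighted past outputs that already exist.
        feedback = sum(a[k] * out[n - k]
                       for k in range(1, len(a)) if n - k >= 0)
        out.append(x - feedback)
    return out


def synthesize_frame(pitch_period, amplitude, a, length):
    """Pulse train -> vocal tract filter -> multiplier, for one frame."""
    excitation = pulse_train(pitch_period, length)
    speech = vocal_tract_filter(excitation, a)
    return [amplitude * s for s in speech]
```

With `a = [1.0, -0.5]`, a pitch period of 4 samples, and an amplitude of 2.0, each pulse starts a decaying response (2.0, 1.0, 0.5, 0.25) and the next pulse adds onto the tail of the previous one.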

10. The method for encoding speech phonemes and synthesizing speech as claimed in claim 1 or 3, characterized in that the unvoiced speech is generated by reading an unvoiced speech phoneme from the speech database according to the address parameter and producing the unvoiced speech from that unvoiced speech phoneme.

11. The method for encoding speech phonemes and synthesizing speech as claimed in claim 1 or 3, characterized in that the silent speech is generated by outputting, according to the time parameter, silence of amplitude 0 whose duration matches the time parameter.
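
The decoding of unvoiced and silent phonemes described in claims 10 and 11 can be sketched as below. The sketch assumes the speech database is a mapping from address to stored raw samples, the time parameter is in milliseconds, and records are dictionaries with the hypothetical fields `pitch`, `address`, and `duration_ms`; only the 8,000 samples-per-second rate (claim 2) and the tag values 1 and 0 (claim 8) come from the claims.

```python
SAMPLE_RATE = 8000  # samples per second (claim 2)


def make_silence(duration_ms):
    """Return zero-amplitude samples lasting `duration_ms` milliseconds."""
    return [0] * (SAMPLE_RATE * duration_ms // 1000)


def fetch_unvoiced(database, address):
    """Return a copy of the raw samples stored for an unvoiced phoneme."""
    return list(database[address])


def decode_non_voiced(record, database):
    """Decode a silent (tag 0) or unvoiced (tag 1) record into samples."""
    if record["pitch"] == 0:
        return make_silence(record["duration_ms"])
    if record["pitch"] == 1:
        return fetch_unvoiced(database, record["address"])
    raise ValueError("voiced records need the speech synthesizer")
```

Under these assumptions a 250 ms silent record expands to 2,000 zero samples, while an unvoiced record is played back exactly as it was recorded.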