Summary of the invention
Accordingly, the present invention provides a speech conversion synthesizer that uses speaker-independent speech recognition technology to recognize the speech of a non-specific speaker, and then synthesizes from the recognition result and the corresponding speech data in a specific speaker's speech database, thereby obtaining the specific speaker's speech.
The present invention further provides a speech conversion synthesis method that recognizes the obtained speech of a non-specific speaker and then synthesizes with the corresponding speech data, thereby obtaining a specific speaker's speech.
To achieve the above and other objects, the present invention proposes a speech conversion synthesizer comprising a speech analysis module, a speech recognition module, and a speech synthesis module.
The speech analysis module receives the non-specific speaker's speech obtained by the speech conversion synthesizer, divides that speech into frames, and separates it into unvoiced segments and voiced segments. The unvoiced segments are output directly to the output terminal, while the voiced segments are analyzed to yield spectral features and prosodic information, which are then output.
The speech recognition module is coupled to the speech analysis module and receives the spectral features that the speech analysis module transmits. It identifies the speech unit sequence contained in the speech segment corresponding to the spectral features, determines the time length (hereinafter, duration) of each speech unit, and outputs the result. The speech recognition module comprises a speaker-independent speech database and a speech recognition unit. The speaker-independent speech database stores all speech unit model parameters used for speaker-independent speech recognition. The speech recognition unit is coupled to the speaker-independent speech database and, upon receiving the spectral features, consults that database to identify the speech unit sequence contained in the speech segment corresponding to the spectral features.
The speech synthesis module is coupled to the speech recognition module and the speech analysis module. It receives the duration, the speech unit sequence, and the prosodic information, synthesizes with the speech unit data corresponding to the speech unit sequence to produce the specific speaker's speech, and finally outputs that speech through the output terminal. The speech synthesis module comprises a specific-speaker speech database and a speech synthesis unit. The specific-speaker speech database stores the specific speaker's speech unit data corresponding to the speech unit model parameters. The speech synthesis unit is coupled to the specific-speaker speech database and, upon receiving the speech unit sequence, retrieves from that database the specific speaker's speech unit data corresponding to the speech unit model parameters.
According to a preferred embodiment of the present invention, the speaker-independent speech database is built using hidden Markov models (HMMs), and the hidden Markov model corresponding to each speech unit can be obtained by training on a large amount of continuous speech from non-specific speakers.
According to a preferred embodiment of the present invention, there may be one or more specific-speaker speech databases, each corresponding to its own specific speaker.
According to a preferred embodiment of the present invention, the prosodic information comprises the pitch period and the short-time energy.
According to a preferred embodiment of the present invention, dividing the non-specific speaker's speech into frames consists of cutting a continuous stream of that speech into pieces of a preset time length.
According to a preferred embodiment of the present invention, the speech recognition module performs recognition only at the phonetic level and does not recognize semantic units (such as words).
To achieve the above and other objects, the present invention proposes a speech conversion synthesis method suitable for converting the obtained speech of a non-specific speaker into synthesized speech of a specific speaker. In this method, the speech analysis module obtains the non-specific speaker's speech, divides it into frames, and separates it into unvoiced segments and voiced segments. The speech analysis module then analyzes the voiced segments to obtain spectral features and prosodic information. Next, based on the spectral features, the speech recognition module identifies the speech unit sequence contained in the speech segment corresponding to the spectral features and determines the duration of the speech unit sequence. Finally, according to the speech unit sequence, the duration, and the prosodic information, the speech synthesis module synthesizes the specific speaker's speech from the speech unit data corresponding to the speech unit sequence and from the unvoiced segments, and outputs it through the output terminal.
To make the above and other objects, features, and advantages of the present invention more readily apparent, a preferred embodiment is described in detail below with reference to the accompanying drawings:
Embodiment
Please refer to Fig. 1, which shows a functional block diagram of a speech conversion synthesizer according to a preferred embodiment of the present invention. This speech conversion synthesizer 100 can be applied to areas such as text conversion system design, voice disguise, or toy design, and comprises a speech analysis module 110, a speech recognition module 120, and a speech synthesis module 130.
The speech analysis module 110 receives the non-specific speaker's speech obtained by the speech conversion synthesizer 100, divides that speech into frames, and separates it into unvoiced segments and voiced segments. The unvoiced segments are output directly to the output terminal, while the voiced segments are analyzed into spectral features and prosodic information and then output. The prosodic information comprises the pitch period (the period of the pitch of speech) and the short-time energy.
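The two prosodic quantities named above can be illustrated as follows. This is only a minimal sketch under stated assumptions, not the implementation of the embodiment: the function names `short_time_energy` and `pitch_period`, the use of autocorrelation peak picking, and the 50 Hz to 400 Hz search range are all hypothetical choices; the text itself states only that the prosodic information comprises the pitch period and the short-time energy.

```python
import math


def short_time_energy(frame):
    """Short-time energy: the sum of squared sample values in one frame."""
    return sum(x * x for x in frame)


def pitch_period(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the pitch period (in samples) by autocorrelation peak picking.

    Assumption: the search range f_min..f_max (Hz) is illustrative only.
    Returns None when the frame is too short to search.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    if max_lag <= min_lag:
        return None
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Usage: a 100 Hz sine sampled at 8 kHz repeats every 80 samples.
rate = 8000
frame = [math.sin(2 * math.pi * 100 * n / rate) for n in range(320)]
period = pitch_period(frame, rate)
energy = short_time_energy(frame)
```

A practical analyzer would normally normalize the autocorrelation and smooth the pitch track across frames; the sketch omits both for brevity.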
In addition, dividing the non-specific speaker's speech into frames means cutting a continuous stream of that speech into pieces of a preset time length; that is, every 20 milliseconds of the speech is cut out and defined as one frame. The preset time can be set when the speech conversion synthesizer 100 leaves the factory.
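The framing step, followed by a voiced/unvoiced decision, might be sketched as below. The 20 ms frame length comes from the text; everything else is an assumption made for illustration. In particular, the text does not say how frames are classified, so the energy and zero-crossing-rate test, its thresholds, and the names `split_frames`, `zero_crossing_rate`, and `is_voiced` are hypothetical.

```python
import math


def split_frames(samples, sample_rate, frame_ms=20):
    """Cut a sample sequence into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    # A trailing partial frame is dropped.
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def is_voiced(frame, energy_threshold=1.0, zcr_threshold=0.25):
    """Assumed rule: voiced frames have high energy and few zero crossings."""
    energy = sum(x * x for x in frame)
    return energy > energy_threshold and zero_crossing_rate(frame) < zcr_threshold


# Usage: one second of a 100 Hz tone at 8 kHz gives fifty 160-sample frames.
rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / rate) for n in range(rate)]
frames = split_frames(tone, rate)
voiced_flags = [is_voiced(f) for f in frames]
```

The thresholds would have to be tuned to the input level of a real device; they are fixed here only so the example runs on its own.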
The speech recognition module 120 is coupled to the speech analysis module 110 and receives the spectral features that the speech analysis module 110 transmits. It identifies the speech unit sequence contained in the speech segment corresponding to the spectral features, determines the duration of the speech unit sequence, and then outputs both.
The speech recognition module 120 comprises a speaker-independent speech database 124 and a speech recognition unit 122. The speaker-independent speech database 124 stores all the speech unit sequences used for speaker-independent speech recognition. The speech recognition unit 122 is coupled to the speaker-independent speech database 124 and, upon receiving the spectral features, consults the database 124 to identify the speech unit sequence contained in the speech segment corresponding to the spectral features.
The speech synthesis module 130 is coupled to the speech recognition module 120 and the speech analysis module 110. It receives the duration and the speech unit sequence transmitted by the speech recognition module 120 and the prosodic information transmitted by the speech analysis module 110, synthesizes with the speech unit data corresponding to the speech unit sequence to produce the specific speaker's speech, and finally outputs that speech through the output terminal.
The speech synthesis module 130 comprises a plurality of specific-speaker speech databases D1~DN and a speech synthesis unit 132. The databases D1~DN store the specific speakers' speech unit data corresponding to the speech unit model parameters. The speech synthesis unit 132 is coupled to these specific-speaker speech databases D1~DN and, upon receiving the speech unit sequence, retrieves from the databases D1~DN the speech unit data corresponding to the speech unit sequence.
In a preferred embodiment of the present invention, there may be one or more specific-speaker speech databases D1~DN, each corresponding to its own specific speaker.
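The lookup performed by the speech synthesis unit can be pictured with the sketch below. It is a deliberately simplified illustration under stated assumptions: each database is modeled as a dictionary from a speech unit label to stored samples, the duration is matched by crudely repeating or truncating those samples, and pitch and energy adjustment from the prosodic information is omitted. The names `fit_to_length`, `synthesize`, and `databases`, and the speaker and unit labels, are hypothetical.

```python
def fit_to_length(samples, target_len):
    """Repeat or truncate stored samples to exactly target_len samples."""
    if not samples or target_len <= 0:
        return []
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]


def synthesize(unit_sequence, durations_ms, database, sample_rate=8000):
    """Concatenate one speaker's stored unit data, fitted to each duration."""
    output = []
    for unit, duration_ms in zip(unit_sequence, durations_ms):
        stored = database[unit]  # look up this speaker's data for the unit
        target_len = int(sample_rate * duration_ms / 1000)
        output.extend(fit_to_length(stored, target_len))
    return output


# Hypothetical databases: one dictionary per specific speaker.
databases = {
    "speaker_1": {"a": [0.1, 0.2], "b": [0.3]},
    "speaker_2": {"a": [0.5, 0.6], "b": [0.7]},
}

# Usage: the same unit sequence rendered with two different speakers' data.
voice_1 = synthesize(["a", "b"], [3, 2], databases["speaker_1"], sample_rate=1000)
voice_2 = synthesize(["a", "b"], [3, 2], databases["speaker_2"], sample_rate=1000)
```

Because the speaker is selected only by which dictionary is passed in, adding a further speaker in this sketch means adding one more entry to `databases` and nothing else.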
In a preferred embodiment of the present invention, the speaker-independent speech database is built using hidden Markov models (HMMs), and the hidden Markov model corresponding to each speech unit can be obtained by training on a large amount of continuous speech from non-specific speakers.
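How an HMM-based recognizer can yield both a speech unit sequence and durations may be sketched as follows. This is a toy illustration, not the recognizer of the embodiment: it assumes one HMM state per speech unit, discrete observation symbols (as if the spectral features had been vector-quantized), hand-picked probabilities, and standard Viterbi decoding. The names `viterbi`, `units_with_durations`, and `to_log` are hypothetical.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path for a sequence of discrete symbols."""
    scores = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + log_trans[p][s])
            current[s] = prev[best_prev] + log_trans[best_prev][s] + log_emit[s][obs]
            pointers[s] = best_prev
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def units_with_durations(path, frame_ms=20):
    """Collapse a per-frame path into (unit, duration in ms) pairs."""
    result = []
    for unit in path:
        if result and result[-1][0] == unit:
            result[-1][1] += frame_ms
        else:
            result.append([unit, frame_ms])
    return [tuple(item) for item in result]


def to_log(table):
    """Convert a probability table to log probabilities."""
    return {key: math.log(value) for key, value in table.items()}


# Toy model (assumed values): two units "a" and "b", two symbols "x" and "y".
states = ["a", "b"]
log_start = to_log({"a": 0.5, "b": 0.5})
log_trans = {"a": to_log({"a": 0.7, "b": 0.3}), "b": to_log({"a": 0.3, "b": 0.7})}
log_emit = {"a": to_log({"x": 0.9, "y": 0.1}), "b": to_log({"x": 0.1, "y": 0.9})}

path = viterbi(["x", "x", "x", "y", "y"], states, log_start, log_trans, log_emit)
units = units_with_durations(path)
```

Counting how many consecutive 20 ms frames stay in the same unit is one simple way to obtain a duration; a real system would typically use multi-state models per unit and continuous emission densities.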
In a preferred embodiment of the present invention, the speech recognition module 120 performs recognition only at the phonetic level and does not recognize semantic units (such as words).
The speech conversion synthesizer 100 operates as follows. The speech analysis module 110 receives the non-specific speaker's speech obtained by the speech conversion synthesizer 100, divides that speech into frames, and separates it into unvoiced segments and voiced segments. The unvoiced segments are output directly to the output terminal, while the voiced segments are analyzed to obtain spectral features and prosodic information, which are then output. Next, the speech recognition module 120 receives the spectral features transmitted by the speech analysis module 110, identifies the speech unit sequence contained in the speech segment corresponding to the spectral features, determines the duration of the speech unit sequence, and outputs both. Finally, the speech synthesis module 130 receives the duration and the speech unit sequence transmitted by the speech recognition module 120 and the prosodic information transmitted by the speech analysis module 110, synthesizes with the speech unit data corresponding to the speech unit sequence to produce the specific speaker's speech, and outputs that speech through the output terminal.
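The data flow just described can be summarized in a short sketch. It shows only the routing: unvoiced frames pass straight to the output, and each run of voiced frames goes through analysis, recognition, and synthesis. The function `convert` and its callable parameters are hypothetical stand-ins for the three modules, and grouping consecutive voiced frames into one run is an assumption made for the illustration.

```python
def convert(frames, is_voiced, analyze, recognize, synthesize):
    """Route frames through the three modules and collect the output samples."""
    output = []
    voiced_run = []

    def flush():
        # Process the pending run of voiced frames, if any.
        if voiced_run:
            spectrum, prosody = analyze(voiced_run)       # speech analysis module
            units, durations = recognize(spectrum)        # speech recognition module
            output.extend(synthesize(units, durations, prosody))  # speech synthesis module
            voiced_run.clear()

    for frame in frames:
        if is_voiced(frame):
            voiced_run.append(frame)
        else:
            flush()
            output.extend(frame)  # unvoiced frames bypass recognition and synthesis
    flush()
    return output


# Usage with trivial stand-in modules (assumed behavior, for illustration only).
frames = [[1, 1], [5, 5], [5, 5], [2, 2]]
result = convert(
    frames,
    is_voiced=lambda f: f[0] >= 5,
    analyze=lambda run: (["spec"] * len(run), {"frames": len(run)}),
    recognize=lambda spectrum: (["u"], [len(spectrum)]),
    synthesize=lambda units, durations, prosody: [9] * durations[0],
)
```

Passing the modules in as callables keeps the routing independent of how each module is implemented, which mirrors the block structure of Fig. 1.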
Please refer next to Fig. 2, which shows a circuit block diagram of a preferred embodiment of the present invention implemented with a digital signal processor. In Fig. 2, the speech conversion synthesizer 100 comprises an analog/digital converter 200, a digital signal processor 210, a digital/analog converter 220, a speaker-independent speech database 230, and a plurality of specific-speaker speech databases D1~DN.
The analog/digital converter 200 is the speech input port; it converts the received analog signal of the non-specific speaker's speech into a digital signal and outputs it. The digital signal processor 210 performs the computation involved in the speech conversion, including the analysis and recognition of the non-specific speaker's speech and the synthesis of the specific speaker's speech. The digital/analog converter 220 is the speech output port; it converts the digital signal of the specific speaker's speech into an analog signal and outputs it. The speaker-independent speech database 230 stores the speech conversion program and the hidden Markov model (HMM) parameters, and is a read-only memory (ROM). The specific-speaker speech databases D1~DN store the speech data of a plurality of specific speakers, and are memories.
In a preferred embodiment of the present invention, the digital signal processor 210 comprises an input buffer 212, a digital signal processing center 214, and an output buffer 216. The input buffer 212 stores the spectral parameters and prosodic parameters of the input speech segment; the digital signal processing center 214 performs the computation of the speech conversion; and the output buffer 216 stores the output speech.
Please refer next to Fig. 3, which shows a flowchart of a speech conversion synthesis method according to a preferred embodiment of the present invention. For ease of understanding, please refer to Fig. 1 and Fig. 3 together. In this method, the speech analysis module 110 obtains the non-specific speaker's speech (s302), divides it into frames, and separates it into unvoiced segments and voiced segments (s304). The speech analysis module 110 then analyzes the voiced segments to obtain spectral features and prosodic information (s306). Next, based on the spectral features, the speech recognition module 120 consults the speaker-independent speech database 124 to identify the speech unit sequence contained in the speech segment corresponding to the spectral features, and determines the duration of the speech unit sequence. Finally, the speech synthesis module 130 receives the speech unit sequence, the duration, and the prosodic information, retrieves from the specific-speaker speech databases D1~DN the speech unit data corresponding to the speech unit sequence, synthesizes the specific speaker's speech from the unvoiced segments and that speech unit data according to the speech unit sequence, the duration, and the prosodic information, and outputs it through the output terminal.
In summary, the speech conversion synthesizer of the present invention and its method have the following advantages:
(1) The speech conversion synthesizer of the present invention and its method can convert any obtained speech into the speech of a specific speaker, require no adjustment during use, and are highly adaptable.
(2) Without changing the structure or parameters of the speech conversion synthesizer, the speech conversion synthesizer of the present invention and its method can acquire the ability to convert to a new specific speaker's speech merely by adding a new specific-speaker speech database.
Although the present invention has been disclosed above by way of a preferred embodiment, that embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.