CN1534595A - Speech sound change over synthesis device and its method - Google Patents

Speech sound change over synthesis device and its method

Info

Publication number
CN1534595A
CN1534595A (application CN03116050A / CNA031160506A)
Authority
CN
China
Prior art keywords
voice
speech
unspecified person
unit sequence
specific people
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA031160506A
Other languages
Chinese (zh)
Inventor
张江安
张钦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHONGYING ELECTRONICS (SHANGHAI) CO Ltd
Original Assignee
ZHONGYING ELECTRONICS (SHANGHAI) CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZHONGYING ELECTRONICS (SHANGHAI) CO Ltd
Priority to CNA031160506A
Publication of CN1534595A
Legal status: Pending


Abstract

A speech conversion and synthesis device is composed of a speech analysis module, a speech recognition module, and a speech synthesis module that outputs a specific person's speech. The method converts non-specific-speaker speech into the speech of a specific person designated by the user, based on the analysis and recognition results.

Description

Speech conversion synthesizer and method thereof
Technical field
The present invention relates to a speech conversion and synthesis device and method, and more particularly to a speech conversion and synthesis device and method for converting non-specific-speaker speech into the speech of a specific person.
Background technology
Voice conversion technology has wide applications in text-to-speech (TTS) system design, voice disguise, toy design, and similar areas. Research on voice conversion has essentially focused on how to establish a conversion relation between a source speaker and a target speaker from their respective speech data.
Conversion methods used in known voice conversion devices include vector quantization with codebook mapping, linear transformation, neural networks, and Gaussian mixture models. All of these methods can establish a conversion relation between speakers' characteristic parameters, such as frequency-domain features. However, they establish only a one-to-one conversion relation, between one specific source speaker's voice and one specific target speaker's voice, so a conversion system built on them serves only that specific user and must be rebuilt for every new user. Known voice conversion methods are therefore unsuitable for applications such as voice disguise or toys, which need to convert the speech of an arbitrary (non-specific) speaker into a specific person's voice.
Summary of the invention
Accordingly, the present invention provides a speech conversion and synthesis device that uses speaker-independent speech recognition to recognize the speech of a non-specific speaker, then synthesizes from the corresponding speech data in a specific person's speech database, thereby obtaining the specific person's voice.
The present invention also proposes a speech conversion and synthesis method that recognizes the acquired non-specific-speaker speech and then synthesizes from the corresponding speech data to obtain a specific person's voice.
To achieve the above and other objects, the present invention proposes a speech conversion and synthesis device comprising a speech analysis module, a speech recognition module, and a speech synthesis module.
The speech analysis module receives the non-specific-speaker speech acquired by the device, splits it into frames, and divides it into unvoiced and voiced segments. Unvoiced segments are passed directly to the output terminal, while voiced segments are analyzed to yield spectral features and prosodic information.
The speech recognition module is coupled to the speech analysis module. It receives the spectral features from the speech analysis module, identifies the speech-unit sequence contained in the corresponding voice segments, determines the time length (duration) of each speech unit, and outputs the result. The speech recognition module comprises a speaker-independent speech database and a speech recognition unit. The database stores the model parameters of all speech units used for speaker-independent recognition; the recognition unit, coupled to the database, matches received spectral features against the database to identify the speech-unit sequence contained in the corresponding voice segments.
The speech synthesis module is coupled to the speech recognition module and the speech analysis module. It receives the durations, the speech-unit sequence, and the prosodic information, synthesizes using the speech-unit data corresponding to the speech-unit sequence to produce the specific person's voice, and finally outputs that voice through the output terminal. The speech synthesis module comprises a specific person's speech database and a speech synthesis unit. The database stores the specific person's speech-unit data corresponding to the speech-unit model parameters; the synthesis unit, coupled to the database, looks up the corresponding speech-unit data when it receives a speech-unit sequence.
According to a preferred embodiment of the present invention, the speaker-independent speech database is built with hidden Markov models (HMMs), and the HMM corresponding to each speech unit is obtained by training on a large amount of continuous speech from many speakers.
According to a preferred embodiment, there may be one or more specific-person speech databases, each corresponding to its own specific person.
According to a preferred embodiment, the prosodic information comprises the pitch period and the short-time energy.
According to a preferred embodiment, frame splitting cuts the continuous non-specific-speaker speech into frames of a preset duration.
According to a preferred embodiment, the speech recognition module performs recognition only at the phonetic level and does not recognize semantic units (such as words).
To achieve the above and other objects, the present invention also proposes a speech conversion and synthesis method for converting acquired non-specific-speaker speech into a specific person's voice. In this method, the speech analysis module acquires the non-specific-speaker speech, splits it into frames, and divides it into unvoiced and voiced segments; it then analyzes the voiced segments to obtain spectral features and prosodic information. Next, the speech recognition module identifies, from the spectral features, the speech-unit sequence contained in the corresponding voice segments and determines the durations of the speech units. Finally, the speech synthesis module synthesizes the specific person's voice from the speech-unit data corresponding to the speech-unit sequence together with the unvoiced segments, according to the speech-unit sequence, durations, and prosodic information, and outputs it through the output terminal.
To make the above and other objects, features, and advantages of the present invention more apparent, a preferred embodiment is described in detail below with reference to the accompanying drawings:
Description of drawings
Fig. 1 is a functional block diagram of a speech conversion and synthesis device according to a preferred embodiment of the present invention;
Fig. 2 is a circuit block diagram of an implementation using a digital signal processor according to a preferred embodiment of the present invention; and
Fig. 3 is a flowchart of a speech conversion and synthesis method according to a preferred embodiment of the present invention.
Embodiment
Please refer to Fig. 1, which illustrates the functional block diagram of a speech conversion and synthesis device according to a preferred embodiment of the present invention. The speech conversion synthesizer 100 can be applied to TTS system design, voice disguise, toy design, and similar areas, and comprises a speech analysis module 110, a speech recognition module 120, and a speech synthesis module 130.
Speech analysis module 110 receives the non-specific-speaker speech acquired by speech conversion synthesizer 100, splits it into frames, and divides it into unvoiced and voiced segments. Unvoiced segments are output directly to the output terminal; voiced segments are analyzed into spectral features and prosodic information, where the prosodic information comprises the pitch period and the short-time energy.
Frame splitting cuts the continuous non-specific-speaker speech at a preset interval; for example, every 20 milliseconds of speech is defined as one frame. The preset interval may be set at the factory when speech conversion synthesizer 100 ships.
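The frame splitting and voiced/unvoiced division described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the patent does not say how voicing is decided, so the sketch uses a common heuristic (short-time energy plus zero-crossing rate), and every function name and threshold here is hypothetical.

```python
import numpy as np

def split_frames(signal, sample_rate, frame_ms=20):
    """Cut the input speech into fixed-length frames; 20 ms matches the
    preset frame duration mentioned in the embodiment."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return np.asarray(signal[:n_frames * frame_len]).reshape(n_frames, frame_len)

def is_voiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Heuristic voicing decision (an assumption, not from the patent):
    voiced speech tends to have high energy and a low zero-crossing rate."""
    energy = np.mean(np.asarray(frame, dtype=np.float64) ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return energy > energy_thresh and zcr < zcr_thresh

def analyze(signal, sample_rate):
    """Split into frames and route them: unvoiced frames go straight to the
    output path, voiced frames go on to spectral/prosodic analysis."""
    voiced, unvoiced = [], []
    for frame in split_frames(signal, sample_rate):
        (voiced if is_voiced(frame) else unvoiced).append(frame)
    return voiced, unvoiced
```

Under this heuristic a low-frequency tone is routed to the voiced path and silence to the unvoiced path; a real device would base the decision on the spectral analysis the patent describes.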
Speech recognition module 120 is coupled to speech analysis module 110. It receives the spectral features transmitted by speech analysis module 110, identifies the speech-unit sequence contained in the corresponding voice segments, determines the durations of the speech units, and outputs the result.
Speech recognition module 120 comprises speaker-independent speech database 124 and speech recognition unit 122. Database 124 stores the model parameters of all speech units used for speaker-independent recognition; speech recognition unit 122, coupled to database 124, matches received spectral features against database 124 to identify the speech-unit sequence contained in the corresponding voice segments.
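A greatly simplified sketch of this recognition step. The patent specifies one hidden Markov model per speech unit; here each unit model is reduced to a single diagonal Gaussian over feature vectors (a stand-in for an HMM, chosen only to keep the sketch short), and each pre-segmented voiced segment is scored against every unit model. All class and function names are hypothetical.

```python
import numpy as np

class UnitModel:
    """Stand-in for one speech unit's model. The patent stores HMM
    parameters per unit; a single diagonal Gaussian over feature vectors
    is used here only for brevity."""
    def __init__(self, name, mean, var):
        self.name = name
        self.mean = np.asarray(mean, dtype=np.float64)
        self.var = np.asarray(var, dtype=np.float64)

    def log_likelihood(self, frames):
        # Diagonal-Gaussian log-likelihood summed over all frames.
        frames = np.asarray(frames, dtype=np.float64)
        return -0.5 * np.sum(np.log(2 * np.pi * self.var)
                             + (frames - self.mean) ** 2 / self.var)

def recognize(segments, unit_models):
    """Score each voiced segment (a sequence of feature frames) against
    every unit model; report the best unit and its duration in frames."""
    results = []
    for seg in segments:
        best = max(unit_models, key=lambda m: m.log_likelihood(seg))
        results.append((best.name, len(seg)))
    return results
```

Note that the output pairs each recognized unit with a duration, which is exactly what the synthesis module consumes; a real implementation would run Viterbi decoding over concatenated unit HMMs rather than scoring pre-cut segments.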
Speech synthesis module 130 is coupled to speech recognition module 120 and speech analysis module 110. It receives the durations and speech-unit sequence transmitted by speech recognition module 120 and the prosodic information transmitted by speech analysis module 110, synthesizes using the speech-unit data corresponding to the speech-unit sequence to produce the specific person's voice, and finally outputs that voice through the output terminal.
Speech synthesis module 130 comprises a plurality of specific-person speech databases D1~DN, which store the specific persons' speech-unit data corresponding to the speech-unit model parameters, and speech synthesis unit 132, which is coupled to databases D1~DN and, upon receiving a speech-unit sequence, looks up the corresponding speech-unit data in databases D1~DN.
In the preferred embodiment of the present invention, there may be one or more specific-person speech databases D1~DN, each corresponding to its own specific person.
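The lookup-and-synthesize step can be sketched as follows: fetch each recognized unit from the chosen specific person's database, stretch it to the recognized duration, and impose the source's short-time energy. This is a simplified illustration (linear resampling plus a gain factor); pitch-period modification is omitted, and the function and parameter names are hypothetical.

```python
import numpy as np

def synthesize(unit_seq, speaker_db, target_rms_per_unit):
    """unit_seq: [(unit_name, duration_in_samples), ...] from recognition.
    speaker_db: {unit_name: stored waveform} for one specific person.
    target_rms_per_unit: the source's short-time energy per unit (an
    assumed representation of the prosodic information)."""
    pieces = []
    for (name, dur), target_rms in zip(unit_seq, target_rms_per_unit):
        wave = np.asarray(speaker_db[name], dtype=np.float64)
        # Stretch or compress the stored unit to the recognized duration
        # by linear-interpolation resampling (a simplification).
        wave = np.interp(np.linspace(0.0, 1.0, dur),
                         np.linspace(0.0, 1.0, len(wave)), wave)
        # Impose the source's short-time energy with a gain factor.
        rms = np.sqrt(np.mean(wave ** 2))
        if rms > 0:
            wave *= target_rms / rms
        pieces.append(wave)
    return np.concatenate(pieces)
```

Selecting a different dictionary for `speaker_db` corresponds to switching among the databases D1~DN, which is how the device targets different specific persons without retraining.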
In the preferred embodiment of the present invention, the speaker-independent speech database is built with hidden Markov models (HMMs), and the HMM corresponding to each speech unit is obtained by training on a large amount of continuous speech from many speakers.
In the preferred embodiment of the present invention, speech recognition module 120 performs recognition only at the phonetic level and does not recognize semantic units (such as words).
The device operates as follows. Speech analysis module 110 receives the non-specific-speaker speech acquired by speech conversion synthesizer 100, splits it into frames, and divides it into unvoiced and voiced segments; the unvoiced segments are output directly to the output terminal, while the voiced segments are analyzed into spectral features and prosodic information. Next, speech recognition module 120 receives the spectral features from speech analysis module 110, identifies the speech-unit sequence contained in the corresponding voice segments, determines the durations, and outputs them. Finally, speech synthesis module 130 receives the durations and speech-unit sequence from speech recognition module 120 and the prosodic information from speech analysis module 110, synthesizes using the corresponding speech-unit data to produce the specific person's voice, and outputs it through the output terminal.
Please refer next to Fig. 2, which illustrates a circuit block diagram of an implementation using a digital signal processor according to the preferred embodiment. In Fig. 2, voice conversion device 100 comprises analog/digital converter 200, digital signal processor 210, digital/analog converter 220, speaker-independent speech database 230, and a plurality of specific-person speech databases D1~DN.
Analog/digital converter 200 serves as the speech input port: it converts the received analog non-specific-speaker speech signal into a digital signal and outputs it. Digital signal processor 210 performs the computations of the speech conversion, including the analysis and recognition of the non-specific-speaker speech and the synthesis of the specific person's voice. Digital/analog converter 220 serves as the speech output port: it converts the specific person's digital speech signal into an analog signal and outputs it. Speaker-independent speech database 230 stores the speech-conversion program and the hidden Markov model (HMM) parameters; database 230 is a read-only memory (ROM). The specific-person speech databases D1~DN store the speech data of a plurality of specific persons; databases D1~DN are memories.
In the preferred embodiment of the present invention, digital signal processor 210 comprises input buffer 212, digital signal processing core 214, and output buffer 216. Input buffer 212 stores the spectral and prosodic parameters of the input voice segments; digital signal processing core 214 performs the speech-conversion computations; and output buffer 216 stores the output speech.
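The buffered data path of processor 210 can be modeled as a small structure. This is only an illustration of the flow between buffers 212/216 and core 214, with hypothetical names; it is not firmware from the patent.

```python
from collections import deque

class DSPCore:
    """Toy model of the data path in Fig. 2: an input buffer (212) holds
    the per-segment spectral/prosodic parameters, a processing core (214)
    runs the conversion computation, and an output buffer (216) collects
    the converted speech before digital-to-analog conversion."""
    def __init__(self, process):
        self.input_buffer = deque()
        self.output_buffer = deque()
        self.process = process

    def push(self, segment_params):
        self.input_buffer.append(segment_params)

    def run(self):
        # Drain the input buffer through the core into the output buffer.
        while self.input_buffer:
            self.output_buffer.append(self.process(self.input_buffer.popleft()))
        return list(self.output_buffer)
```

The FIFO discipline matters here: segments must leave the output buffer in the same order their parameters entered the input buffer, or the synthesized speech would be scrambled.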
Please now refer to Fig. 3, which illustrates the flowchart of a speech conversion and synthesis method according to the preferred embodiment; for ease of understanding, refer to Fig. 1 together with Fig. 3. In this method, speech analysis module 110 acquires the non-specific-speaker speech (s302), splits it into frames, and divides it into unvoiced and voiced segments (s304); speech analysis module 110 then analyzes the voiced segments to obtain spectral features and prosodic information (s306). Speech recognition module 120 matches the spectral features against speaker-independent speech database 124 to identify the speech-unit sequence contained in the corresponding voice segments and determines the durations of the speech units. Finally, speech synthesis module 130 receives the speech-unit sequence, durations, and prosodic information, looks up the corresponding speech-unit data in specific-person speech databases D1~DN, synthesizes the specific person's voice from that data and the unvoiced segments according to the speech-unit sequence, durations, and prosodic information, and outputs it through the output terminal.
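The whole method flow, from acquisition (s302) to the final output, can be strung together in one self-contained sketch. Every component is a deliberately trivial stand-in: a nearest-template "recognizer" instead of HMMs and tiny waveform dictionaries for the databases, so only the data flow mirrors the text; all names are hypothetical.

```python
import numpy as np

def convert(frames, unit_templates, speaker_db):
    """frames: list of (is_voiced, samples) pairs from the analysis step.
    Voiced frames are matched to the nearest unit template and replaced
    by the specific person's stored waveform for that unit; unvoiced
    frames pass through unchanged, as in the text."""
    out = []
    for voiced, samples in frames:
        samples = np.asarray(samples, dtype=np.float64)
        if not voiced:
            out.append(samples)               # unvoiced: straight to output
            continue
        feat = np.mean(samples ** 2)          # toy stand-in for a spectral feature
        unit = min(unit_templates, key=lambda u: abs(unit_templates[u] - feat))
        target = np.asarray(speaker_db[unit], dtype=np.float64)
        reps = int(np.ceil(len(samples) / len(target)))
        out.append(np.tile(target, reps)[:len(samples)])  # keep source duration
    return np.concatenate(out)
```

The key property the sketch preserves is the one the claims rest on: nothing speaker-specific is learned from the input, so any speaker's speech can be mapped onto any stored specific person's voice.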
In summary, the speech conversion and synthesis device and method of the present invention have the following advantages:
(1) They can convert any acquired speech into a chosen specific person's voice without per-user adjustment, and thus adapt readily to new users.
(2) Without changing the structure or parameters of the device, merely adding a new specific person's speech database gives the device the ability to convert into that new specific person's voice.
Although the present invention has been disclosed above by way of a preferred embodiment, this is not intended to limit the invention. Those skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.

Claims (11)

CNA031160506A | priority 2003-03-28 | filed 2003-03-28 | Speech sound change over synthesis device and its method | Pending | CN1534595A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CNA031160506A | 2003-03-28 | 2003-03-28 | Speech sound change over synthesis device and its method (CN1534595A)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CNA031160506A | 2003-03-28 | 2003-03-28 | Speech sound change over synthesis device and its method (CN1534595A)

Publications (1)

Publication Number | Publication Date
CN1534595A (en) | 2004-10-06

Family

ID=34284550

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CNA031160506A | Speech sound change over synthesis device and its method (CN1534595A, pending) | 2003-03-28 | 2003-03-28

Country Status (1)

Country | Link
CN | CN1534595A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN100349206C * | 2005-09-12 | 2007-11-14 | 周运南 | Text-to-speech interchanging device
CN102737628A * | 2012-07-04 | 2012-10-17 | 哈尔滨工业大学深圳研究生院 | Method for converting voice based on linear predictive coding and radial basis function neural network
CN103794206A * | 2014-02-24 | 2014-05-14 | 联想(北京)有限公司 | Method for converting text data into voice data and terminal equipment
CN104123932A * | 2014-07-29 | 2014-10-29 | 科大讯飞股份有限公司 | Voice conversion system and method
CN105206257A * | 2015-10-14 | 2015-12-30 | 科大讯飞股份有限公司 | Voice conversion method and device
CN105227966A * | 2015-09-29 | 2016-01-06 | 深圳Tcl新技术有限公司 | Television broadcast control method, server, and broadcast control system
CN105654941A * | 2016-01-20 | 2016-06-08 | 华南理工大学 | Voice change method and device based on specific target person voice change ratio parameter
WO2017067206A1 * | 2015-10-20 | 2017-04-27 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and device
CN109935225A * | 2017-12-15 | 2019-06-25 | 富泰华工业(深圳)有限公司 | Word information processing device and method, computer storage medium and mobile terminal
WO2021120145A1 * | 2019-12-20 | 2021-06-24 | 深圳市优必选科技股份有限公司 | Voice conversion method and apparatus, computer device and computer-readable storage medium



Legal Events

Code | Title
C06 / PB01 | Publication
C10 / SE01 | Entry into force of request for substantive examination
C02 / WD01 | Invention patent application deemed withdrawn after publication (patent law 2001)
