CN104391673A - Voice interaction method and voice interaction device - Google Patents

Voice interaction method and voice interaction device

Info

Publication number
CN104391673A
Authority
CN
China
Prior art keywords: voice, speech, recognition, answer, output
Prior art date: 2014-11-20
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410670573.5A
Other languages
Chinese (zh)
Inventor
李秀林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2014-11-20
Publication date: 2015-03-04
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201410670573.5A
Publication of CN104391673A
Legal status: Pending

Abstract

The invention provides a voice interaction method and a voice interaction device. The voice interaction method comprises: receiving input voice and performing feature extraction on it to obtain feature information of the input voice; performing voice characteristic recognition and voice recognition based on the feature information to obtain a voice characteristic and a voice recognition result, wherein the voice characteristic is a dialect, an accent, or Mandarin; obtaining an answer corresponding to the input voice according to the voice recognition result and the voice characteristic; and generating output voice according to the voice characteristic and the answer, wherein the output voice corresponds to the answer and has the voice characteristic. The method can improve the voice interaction effect and the user's experience.

Description

Voice interaction method and device
Technical Field
The present invention relates to the field of information technologies, and in particular, to a voice interaction method and apparatus.
Background
In the history of human development, language has been crucial to the development of civilization. As an important vehicle of human communication, speech has been changing and evolving for thousands of years, and languages and pronunciations vary significantly from region to region depending on environment and history. A language therefore includes not only a standard form such as Mandarin, but also dialects and accents.
With the continuous development of computer technology, human-computer interaction has become more and more important, and voice interaction is one form of it. Dialects and accents have long been a difficult point in speech recognition and synthesis, and many researchers collect more data to build new speech models, or optimize existing ones, to improve recognition and synthesis. When using a man-machine conversation system, however, a user can obtain the expected result only through a default dialect/accent setting or by manually modifying that setting, so the conversation effect is not ideal and the user experience is poor.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a voice interaction method, which can improve a voice interaction effect and improve user experience.
Another object of the present invention is to provide a voice interaction apparatus.
In order to achieve the above object, an embodiment of the first aspect of the present invention provides a voice interaction method, including: receiving input voice, and performing feature extraction on the input voice to obtain feature information of the input voice; performing voice characteristic recognition and voice recognition according to the feature information to obtain a voice characteristic and a voice recognition result, wherein the voice characteristic comprises a dialect, an accent, or Mandarin; obtaining an answer corresponding to the input voice according to the voice recognition result and the voice characteristic; and generating output voice according to the voice characteristic and the answer, wherein the output voice is voice which corresponds to the answer and has the voice characteristic.
According to the voice interaction method provided by the embodiment of the first aspect of the invention, the voice characteristics of the input voice are recognized, and the voice recognition model matched with the voice characteristics can be selected to perform voice recognition on the input voice, so that the voice interaction effect can be improved, and the user experience can be improved.
In order to achieve the above object, a voice interaction apparatus according to an embodiment of the second aspect of the present invention includes: an input module for receiving input voice and performing feature extraction on it to obtain feature information of the input voice; a recognition module for performing voice characteristic recognition and voice recognition according to the feature information to obtain a voice characteristic and a voice recognition result, wherein the voice characteristic comprises a dialect, an accent, or Mandarin; an acquisition module for acquiring an answer corresponding to the input voice according to the voice recognition result and the voice characteristic; and an output module for generating output voice according to the voice characteristic and the answer, wherein the output voice is voice which corresponds to the answer and has the voice characteristic.
According to the voice interaction device provided by the embodiment of the second aspect of the invention, the voice characteristics of the input voice are recognized, and the voice recognition model matched with the voice characteristics can be selected to perform voice recognition on the input voice, so that the voice interaction effect can be improved, and the user experience can be improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a voice interaction method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of obtaining speech features in an embodiment of the present invention;
FIG. 3 is a schematic flow chart of obtaining speech characteristics and speech recognition results according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of the generation of output speech in an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a voice interaction method according to another embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a voice interaction apparatus according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a voice interaction apparatus according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Fig. 1 is a schematic flow chart of a voice interaction method according to an embodiment of the present invention, where the method includes:
s11: receiving input voice, and performing feature extraction on the input voice to obtain feature information of the input voice.
The input speech is speech input by the user into the voice interaction system, and may specifically be a question, for example, "What is the weather today?"
The voice interaction system may receive an input voice through a microphone or other devices, may perform preprocessing such as noise reduction on the input voice after receiving the input voice, and may perform feature extraction on the preprocessed input voice, for example, extracting a spectral feature, a fundamental frequency feature, an energy feature, or a zero-crossing rate.
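The embodiment leaves the extraction step abstract. As a minimal sketch (numpy only; the frame length, hop size, and feature layout are illustrative assumptions, and fundamental-frequency extraction is omitted for brevity), the named frame-level features could be computed like this:

```python
import numpy as np

def extract_features(signal, sr=16000, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames and compute, per frame,
    the energy, zero-crossing rate, and magnitude spectrum named above."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.sum(frame ** 2))                         # energy feature
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)  # zero-crossing rate
        spectrum = np.abs(np.fft.rfft(frame * window))             # spectral feature
        feats.append({"energy": energy, "zcr": zcr, "spectrum": spectrum})
    return feats
```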
S12: and performing voice characteristic recognition and voice recognition according to the characteristic information to obtain voice characteristics and a voice recognition result, wherein the voice characteristics comprise: dialect, accent, or mandarin.
The voice characteristic may first be determined according to the feature information; a pre-established speech model corresponding to that characteristic is then selected and used for speech recognition to obtain the recognition result. Alternatively,
speech recognition may be performed according to the feature information with each of a plurality of pre-established speech models to obtain a confidence value for each model; the optimal model is then determined from the confidence values, and the voice characteristic and recognition result corresponding to the optimal model are taken as the voice characteristic and recognition result to be obtained.
Referring to fig. 2, the process of obtaining the speech characteristics of the input speech may include:
s21: the input speech is pre-processed.
The preprocessing is, for example, noise reduction processing.
S22: and performing feature extraction on the preprocessed input voice to obtain feature information.
The feature extraction is, for example, spectral feature extraction, fundamental frequency feature extraction, energy feature extraction or zero-crossing rate extraction, etc.
S23: and recognizing the voice characteristics according to the characteristic information obtained after the characteristic extraction and a pre-established discrimination model.
The discrimination model may be built using Support Vector Machine (SVM) or Hidden Markov Model (HMM) modeling techniques, and may include Mandarin, dialect, and accent models.
By comparing the feature information against the discrimination model, the speech characteristic (Mandarin, a dialect, or an accent) can be identified.
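As an illustration of the SVM option (the class labels, feature dimensionality, and frame-averaging are assumptions; the patent names SVM and HMM only as candidate modeling techniques), a scikit-learn sketch might look like this:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training set: one 40-dimensional vector per utterance,
# labelled with the speech characteristic it exhibits.
X_train = np.random.rand(300, 40)
y_train = np.random.choice(
    ["mandarin", "sichuan_dialect", "northeast_accent"], size=300)

discriminator = SVC(probability=True)  # the SVM discrimination model
discriminator.fit(X_train, y_train)

def classify_characteristic(frame_features):
    """Pool frame-level vectors (frames x dims) into one utterance vector
    and predict its speech characteristic."""
    pooled = np.mean(frame_features, axis=0, keepdims=True)
    return discriminator.predict(pooled)[0]
```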
Since the present embodiment is mainly aimed at dialect and accent discrimination, the speech characteristic discrimination is denoted dialect/accent discrimination in fig. 2.
Optionally, after obtaining the speech characteristics according to the discriminant model, the speech characteristics may be corrected according to the related information. Referring to fig. 2, the method may further include:
s24: and acquiring recent data, and performing accumulation judgment on the dialect/accent judgment result according to the recent data to obtain a judgment result.
The recent data refers to data in a time period which is less than a preset value from the current time.
In addition, the data may also be combined with the location information of the user, for example, according to the related information in the prior model, for example, the statistical probability of each dialect or accent in the region to which the location belongs, and the dialect/accent discrimination result, to obtain the final recognition result, thereby obtaining a more accurate estimation.
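One simple way to realize such a correction (the probabilities below are invented purely for illustration) is to treat the regional statistics as a prior and rescale the discriminator's output:

```python
def refine_with_prior(posterior, regional_prior):
    """Multiply classifier probabilities by a regional dialect/accent prior
    (e.g. usage statistics for the user's region) and renormalize. Both
    arguments map characteristic name -> probability; both are hypothetical."""
    combined = {c: posterior.get(c, 0.0) * regional_prior.get(c, 1e-6)
                for c in set(posterior) | set(regional_prior)}
    total = sum(combined.values())
    return {c: p / total for c, p in combined.items()}

# A location prior can flip a borderline decision toward the local dialect:
posterior = {"mandarin": 0.42, "sichuan_dialect": 0.40, "cantonese": 0.18}
prior = {"mandarin": 0.30, "sichuan_dialect": 0.65, "cantonese": 0.05}
print(max(refine_with_prior(posterior, prior).items(), key=lambda kv: kv[1]))
# -> ('sichuan_dialect', ...)
```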
After the speech characteristic is obtained, a corresponding speech recognition model can be determined from a plurality of pre-established models and used for speech recognition; for example, if the obtained characteristic is the Sichuan dialect, the input speech can be recognized with the speech recognition model corresponding to the Sichuan dialect.
The foregoing describes determining speech characteristics prior to determining a speech recognition model, and optionally, the speech characteristics and the speech recognition model can be determined simultaneously.
Referring to fig. 3, the process of obtaining the speech characteristics and the speech recognition result according to the input speech may include:
s31: the input speech is pre-processed.
The preprocessing is, for example, noise reduction processing.
S32: and performing feature extraction on the preprocessed input voice to obtain feature information.
The feature extraction is, for example, spectral feature extraction, fundamental frequency feature extraction, energy feature extraction or zero-crossing rate extraction, etc.
S33: performing speech recognition according to the feature information with each of a plurality of pre-established speech recognition models to obtain a confidence value for each model.
The plurality of speech recognition models may be all models established in advance, or a plurality of models selected from all models established in advance.
In fig. 3, the plurality of speech recognition models are denoted recognition model 1, recognition model 2, …, recognition model N.
For example, the plurality of speech recognition models may be models corresponding to the Sichuan dialect, the Northeastern dialect, and Cantonese, respectively.
When each speech recognition model performs speech recognition on the input speech, a confidence value corresponding to each model can be obtained.
S34: obtaining the optimal speech recognition model according to the confidence values, and obtaining the speech characteristic and speech recognition result corresponding to that model.
For example, the confidence value of the Sichuan-dialect model may be greater than that of the Northeastern-dialect model, which in turn is greater than that of the Cantonese model; the Sichuan-dialect model is then optimal.
In that case the speech characteristic is the Sichuan dialect, and the speech recognition result is obtained by recognizing the input speech with the Sichuan-dialect model.
In addition, it can be understood that, no matter whether the speech characteristics are determined and then the speech recognition model is determined, or the speech characteristics and the speech recognition model are determined synchronously, if the speech characteristics and the speech recognition model consistent with the feature information cannot be found, the most similar speech recognition model can be found according to the similarity, and the most similar speech recognition model is adopted for speech recognition.
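The parallel, confidence-based variant of fig. 3 reduces to a few lines. The decode interface and the RecognitionOutcome container below are assumptions, since the patent does not specify how a recognizer reports its confidence value:

```python
from dataclasses import dataclass

@dataclass
class RecognitionOutcome:
    characteristic: str   # e.g. "sichuan_dialect"
    text: str             # recognized transcript
    confidence: float     # the model's confidence value

def recognize_parallel(feature_info, models):
    """Run every pre-established recognition model on the same feature
    information and keep the outcome with the highest confidence (S33, S34)."""
    outcomes = [m.decode(feature_info) for m in models]
    best = max(outcomes, key=lambda o: o.confidence)
    return best.characteristic, best.text
```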
S13: and acquiring an answer corresponding to the input voice according to the voice recognition result and the voice characteristics.
After the speech recognition result is obtained, the user's need is determined through semantic understanding, and a relevant result is retrieved from a database, a search engine, or another knowledge base to serve as the answer.
Preferably, a text answer that already has the voice characteristic is retrieved first from a database with that characteristic.
For example, if the user's speech is in a dialect or accent, the answer is first looked up in data having that dialect or accent characteristic.
If no corresponding information exists, the recognition result may be converted into text that better matches written-language habits and then searched.
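A sketch of this retrieval order, with hypothetical in-memory databases and a hypothetical normalize hook standing in for the text conversion:

```python
def find_answer(query_text, characteristic, dialect_db, mandarin_db, normalize):
    """Prefer an answer stored with the user's dialect/accent; otherwise
    normalize the query toward written-language habits and fall back to the
    general database. Returns (answer, has_characteristic)."""
    answer = dialect_db.get(characteristic, {}).get(query_text)
    if answer is not None:
        return answer, True          # text answer already has the characteristic
    written = normalize(query_text)  # e.g. map dialect wording to written form
    return mandarin_db.get(written), False
```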
S14: and generating output voice according to the voice characteristics and the answer, wherein the output voice is the voice which corresponds to the answer and has the voice characteristics.
Optionally, the generating an output voice according to the voice feature and the answer includes:
if the answer comprises a text answer with the voice characteristics, setting voice synthesis parameters, and converting the text answer with the voice characteristics into the output voice; or,
if the answer comprises a text answer without the voice characteristics, setting voice synthesis parameters according to the voice characteristics, and generating the output voice according to the voice synthesis parameters and the text answer without the voice characteristics; or,
if the answer comprises a text answer without the voice characteristics, converting the text answer into a text answer with the voice characteristics, setting voice synthesis parameters according to the voice characteristics, and generating the output voice according to the voice synthesis parameters and the converted text answer.
For example, when the input voice is in the Sichuan dialect, a text answer with Sichuan-dialect characteristics found in the database can be converted directly into speech. Alternatively, after a Mandarin text answer is found in the database, the synthesis parameters can be set according to the Sichuan-dialect characteristic so that the Mandarin answer is converted into speech with Sichuan-dialect characteristics. Or the Mandarin text answer can first be converted into a text answer with Sichuan-dialect characteristics, which is then converted into speech with those characteristics.
After the output speech is obtained, the output speech may be output and/or saved.
Optionally, the setting a speech synthesis parameter according to the speech characteristic includes:
setting voice synthesis parameters matched with the voice characteristics; or,
and setting the speech synthesis parameters with the highest similarity to the speech characteristics.
Referring to fig. 4, the process of generating the output voice according to the answer may include:
s41: and judging whether dialects corresponding to the recognized voice characteristics exist, if so, executing S45, and otherwise, executing S42.
S42: and judging whether an accent corresponding to the recognized voice characteristics exists, if so, executing S45, otherwise, executing S43.
S43: and judging whether approximate accent can be realized through conversion, if so, executing S45, and otherwise, executing S44.
S44: the parameters are reset.
S45: and setting synthesis parameters.
S46: and (5) voice synthesis.
For example, if the retrieved information is already in the user's dialect or accent, the speech synthesis module is checked for a matching synthesis setting; if none exists, the closest setting is used. If the retrieved information is conventional written text, and the synthesis module supports the corresponding dialect, supports an approximate accent, or can realize an approximate accent through simple conversion rules such as tone changes, the answer text is first converted to match the corresponding language habits, and the converted text is used as the input to the synthesis module.
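The S41-S46 cascade amounts to a short fallback chain; the synthesizer object and its attributes below are assumptions used only to make the control flow concrete:

```python
def choose_synthesis_setting(characteristic, synthesizer):
    """Pick synthesis parameters: exact dialect, then exact accent, then an
    approximate accent reachable via conversion rules, then a default reset."""
    if characteristic in synthesizer.supported_dialects:       # S41 -> S45
        return {"voice": characteristic, "convert_text": False}
    if characteristic in synthesizer.supported_accents:        # S42 -> S45
        return {"voice": characteristic, "convert_text": False}
    approx = synthesizer.closest_convertible(characteristic)   # S43
    if approx is not None:                                     # S43 -> S45
        return {"voice": approx, "convert_text": True}
    return {"voice": "mandarin", "convert_text": False}        # S44: reset parameters
```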
According to the embodiment, the voice characteristics of the input voice are recognized, and the voice recognition model matched with the voice characteristics can be selected to perform voice recognition on the input voice, so that the voice interaction effect can be improved, and the user experience is improved.
Fig. 5 is a schematic flow chart of a voice interaction method according to another embodiment of the present invention, where the method includes:
s51: and performing feature extraction on the input voice.
For example, the input speech is preprocessed, and then the feature extraction is performed on the preprocessed input speech.
The preprocessing is, for example, noise reduction processing.
The feature extraction is, for example, spectral feature extraction, fundamental frequency feature extraction, energy feature extraction or zero-crossing rate extraction, etc.
S52: and judging dialect/accent according to the feature information obtained by feature extraction.
Dialect/accent discrimination can be performed based on a discrimination model established in advance and the feature information.
The specific determination method can be seen in fig. 2, and is not described herein again.
S53: and (5) voice recognition.
After obtaining the speech characteristics, speech recognition may be performed using a speech recognition model that matches the speech characteristics, for example, when the input speech has characteristics of the Sichuan language, speech recognition may be performed using a speech recognition model of the Sichuan language characteristics.
It will be appreciated that, when no speech recognition model consistent with the recognized speech characteristic exists, the model most similar to that characteristic may be used for speech recognition.
S54: semantic understanding.
For example, after the text content is obtained by speech recognition, the semantic understanding is performed on the text content to obtain the intention of the user to input speech.
S55: and generating an answer.
After semantic understanding, the corresponding answer may be obtained by searching in a database of the corresponding dialect or accent, and/or a mandarin database.
S56: a synthesized dialect/accent setting.
For example, when the input voice has the characteristics of the Sichuan language, the parameters having the characteristics of the Sichuan language may be set so that the voice corresponding to the answer has the characteristics of the Sichuan language.
S57: speech generation resulting in an output speech which can then be output.
After the synthesis parameters are set, the answer may be converted into speech according to the parameters.
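Tying S51 through S57 together (reusing the extract_features and choose_synthesis_setting sketches above; every injected callable is a stand-in, not an interface defined by the patent):

```python
def voice_interaction(waveform, sr, classify, recognize, understand,
                      answer_for, synthesizer):
    """One pass through the fig. 5 pipeline, with each stage injected."""
    feats = extract_features(waveform, sr)         # S51: feature extraction
    characteristic = classify(feats)               # S52: dialect/accent judgment
    text = recognize(feats, characteristic)        # S53: characteristic-matched recognition
    intent = understand(text)                      # S54: semantic understanding
    answer = answer_for(intent, characteristic)    # S55: answer generation
    setting = choose_synthesis_setting(characteristic, synthesizer)  # S56
    return synthesizer.render(answer, setting)     # S57: speech generation
```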
Possible application scenarios of the present embodiment are as follows:
the user inputs speech in mandarin chinese, corresponding to "how do the weather today? After dialect/accent judgment, a recognition system is set to adopt a mandarin recognition model to obtain correct recognition. Then, the weather forecast information of the current day is obtained through the data of a search engine or a weather service provider. And finally, setting voice synthesis to mandarin, and playing the weather forecast information to the user so as to complete one conversation.
The user inputs 'today's weather is peculiar and knows nothing? After dialect/accent judgment, a recognition model with northeast accents is adopted by a set recognition system to obtain a correct recognition result. Then, through a semantic understanding module, the weather forecast information of the current day is obtained by using the data of a search engine or a weather service provider. Finally, the obtained information is properly converted, after the characteristics of the language used by the user are added to the text, the voice is set to be synthesized into the mandarin with the northeast accent, and the weather forecast information is played to the user by the northeast accent, so that one conversation is completed.
This embodiment improves a core link in the traditional human-computer interface: by introducing dialect/accent judgment, the system becomes more intelligent and more personable, improving user experience and satisfaction. Through dialect/accent judgment, a recognition model that better matches the user's input voice can be adopted, improving recognition and the understanding of user needs; through semantic understanding, response content acceptable to the user can be generated on the basis of understanding dialect/accented speech; and through speech synthesis, the speech most suitable for the user can be output. The embodiment makes full use of dialect/accent information in human-computer interaction, improving the machine's ability to understand and speak these varieties and turning dialects and accents from an unfavorable factor into a favorable one. It also lowers the barrier to human-computer voice interaction and promotes wider application of speech technology.
Fig. 6 is a schematic structural diagram of a voice interaction apparatus according to another embodiment of the present invention, where the apparatus 60 includes an input module 61, a recognition module 62, an acquisition module 63, and an output module 64.
The input module 61 is configured to receive an input voice, and perform feature extraction on the input voice to obtain feature information of the input voice;
the input speech is speech input by the user into the speech interaction system, and the input speech may specifically be a question, for example, the input speech is "what is the weather today".
The voice interaction system may receive an input voice through a microphone or other devices, may perform preprocessing such as noise reduction on the input voice after receiving the input voice, and may perform feature extraction on the preprocessed input voice, for example, extracting a spectral feature, a fundamental frequency feature, an energy feature, or a zero-crossing rate.
The recognition module 62 is configured to perform speech feature recognition and speech recognition according to the feature information to obtain speech features and a speech recognition result, where the speech features include: dialect, accent, or mandarin;
optionally, the identification module 62 is specifically configured to:
performing voice characteristic recognition according to the characteristic information to obtain voice characteristics;
and determining a voice recognition model corresponding to the voice characteristics, and recognizing the input voice by adopting the voice recognition model corresponding to the voice characteristics to obtain a voice recognition result.
Optionally, the identification module 62 is further specifically configured to:
performing voice characteristic recognition according to the characteristic information and a pre-established discrimination model to obtain voice characteristics; or,
and performing voice characteristic recognition according to the characteristic information and a pre-established discrimination model to obtain a preliminary voice characteristic, and obtaining a final voice characteristic according to the preliminary voice characteristic and pre-acquired data, wherein the pre-acquired data is data collected in a time period which is less than a preset value from the current time.
The discriminant Model may be established by using a Support Vector Machine (SVM) or Hidden Markov Model (HMM) modeling technique, and may include a mandarin, dialect, or accent Model.
According to the comparison of the feature information and the discrimination model, the speech characteristics of Mandarin, dialect or accent can be identified.
Optionally, after obtaining the speech characteristics according to the discriminant model, the speech characteristics may be corrected according to the related information.
After the speech characteristic is obtained, a corresponding speech recognition model can be determined from a plurality of pre-established models and used for speech recognition; for example, if the obtained characteristic is the Sichuan dialect, the input speech can be recognized with the speech recognition model corresponding to the Sichuan dialect.
Optionally, the identification module 62 is specifically configured to:
recognizing the input voice by adopting at least two preset voice recognition models to obtain a voice recognition result and a confidence coefficient value corresponding to each voice recognition model, wherein different voice recognition models have different voice characteristics;
and determining the voice characteristic and speech recognition result corresponding to the speech recognition model with the maximum confidence value as the voice characteristic and speech recognition result to be obtained.
The plurality of speech recognition models may be all models established in advance, or a plurality of models selected from all models established in advance.
For example, the plurality of speech recognition models may be models corresponding to the Sichuan dialect, the Northeastern dialect, and Cantonese, respectively.
When each speech recognition model performs speech recognition on the input speech, a confidence value corresponding to each model can be obtained.
For example, the confidence value of the Sichuan-dialect model may be greater than that of the Northeastern-dialect model, which in turn is greater than that of the Cantonese model; the Sichuan-dialect model is then optimal.
In that case the speech characteristic is the Sichuan dialect, and the speech recognition result is obtained by recognizing the input speech with the Sichuan-dialect model.
In addition, it can be understood that, no matter whether the speech characteristics are determined and then the speech recognition model is determined, or the speech characteristics and the speech recognition model are determined synchronously, if the speech characteristics and the speech recognition model consistent with the feature information cannot be found, the most similar speech recognition model can be found according to the similarity, and the most similar speech recognition model is adopted for speech recognition.
The obtaining module 63 is configured to obtain an answer corresponding to the input voice according to the voice recognition result and the voice characteristics;
after the speech recognition result is obtained, the user's need is determined through semantic understanding, and a relevant result is retrieved from a database, a search engine, or another knowledge base to serve as the answer.
Optionally, the obtaining module 63 is specifically configured to:
and preferentially acquiring the text answers with the voice characteristics in a database with the voice characteristics.
For example, if the user's speech is in a dialect or accent, the answer is first looked up in data corresponding to that dialect or accent.
If no corresponding information exists, the recognition result may be converted into text that better matches written-language habits and then searched.
The output module 64 is configured to generate an output voice according to the voice characteristics and the answer, where the output voice is a voice corresponding to the answer and having the voice characteristics.
Optionally, the output module 64 is specifically configured to:
if the answer comprises a text answer with the voice characteristics, setting voice synthesis parameters, and converting the text answer with the voice characteristics into the output voice; or,
if the answer comprises a text answer without the voice characteristics, setting voice synthesis parameters according to the voice characteristics, and generating the output voice according to the voice synthesis parameters and the text answer without the voice characteristics; or,
if the answer comprises a text answer without the voice characteristics, converting the text answer into a text answer with the voice characteristics, setting voice synthesis parameters according to the voice characteristics, and generating the output voice according to the voice synthesis parameters and the converted text answer.
For example, when the input voice is in the Sichuan dialect, a text answer with Sichuan-dialect characteristics found in the database can be converted directly into speech. Alternatively, after a Mandarin text answer is found in the database, the synthesis parameters can be set according to the Sichuan-dialect characteristic so that the Mandarin answer is converted into speech with Sichuan-dialect characteristics. Or the Mandarin text answer can first be converted into a text answer with Sichuan-dialect characteristics, which is then converted into speech with those characteristics.
Optionally, the output module 64 is further specifically configured to:
setting voice synthesis parameters matched with the voice characteristics; or,
and setting the speech synthesis parameters with the highest similarity to the speech characteristics.
For example, if the retrieved information is already in the user's dialect or accent, the speech synthesis module is checked for a matching synthesis setting; if none exists, the closest setting is used. If the retrieved information is conventional written text, and the synthesis module supports the corresponding dialect, supports an approximate accent, or can realize an approximate accent through simple conversion rules such as tone changes, the answer text is first converted to match the corresponding language habits, and the converted text is used as the input to the synthesis module.
In another embodiment, referring to fig. 7, the apparatus 60 further comprises:
a processing module 65, configured to store the output voice, or alternatively to output it.
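The four modules (plus the optional processing module 65) map naturally onto a small container class. This skeleton only illustrates the wiring; it is not code from the patent:

```python
class VoiceInteractionApparatus:
    """Mirrors modules 61-65; each module is supplied as a callable."""

    def __init__(self, input_module, recognition_module,
                 acquisition_module, output_module, processing_module=None):
        self.input = input_module            # module 61: receive speech, extract features
        self.recognize = recognition_module  # module 62: characteristic + speech recognition
        self.acquire = acquisition_module    # module 63: answer lookup
        self.output = output_module          # module 64: generate output speech
        self.process = processing_module     # module 65: store or output the speech

    def handle(self, waveform):
        feats = self.input(waveform)
        characteristic, text = self.recognize(feats)
        answer = self.acquire(text, characteristic)
        speech = self.output(characteristic, answer)
        if self.process is not None:
            self.process(speech)
        return speech
```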
According to the embodiment, the voice characteristics of the input voice are recognized, and the voice recognition model matched with the voice characteristics can be selected to perform voice recognition on the input voice, so that the voice interaction effect can be improved, and the user experience is improved.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (15)

CN201410670573.5A (filed 2014-11-20): Voice interaction method and voice interaction device; status: Pending; published as CN104391673A (en)

Priority Applications (1)

Application Number: CN201410670573.5A (CN104391673A); Priority Date: 2014-11-20; Filing Date: 2014-11-20; Title: Voice interaction method and voice interaction device

Applications Claiming Priority (1)

Application Number: CN201410670573.5A (CN104391673A); Priority Date: 2014-11-20; Filing Date: 2014-11-20; Title: Voice interaction method and voice interaction device

Publications (1)

Publication Number: CN104391673A (en); Publication Date: 2015-03-04

Family

ID=52609583

Family Applications (1)

Application Number: CN201410670573.5A; Status: Pending; Priority Date: 2014-11-20; Filing Date: 2014-11-20; Title: Voice interaction method and voice interaction device

Country Status (1)

Country: CN; Publication: CN104391673A (en)


Legal Events

C06: Publication
PB01: Publication
C10: Entry into substantive examination
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication

Application publication date: 2015-03-04
