Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative and intended only to explain the present invention; they are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications, and equivalents coming within the spirit and scope of the appended claims.
Fig. 1 is a schematic flow chart of a voice interaction method according to an embodiment of the present invention, where the method includes:
S11: receiving an input voice, and performing feature extraction on the input voice to obtain feature information of the input voice.
The input voice is speech entered by the user into the voice interaction system, and it may specifically be a question, for example, "what is the weather today".
The voice interaction system may receive the input voice through a microphone or another device, may perform preprocessing such as noise reduction on the input voice after receiving it, and may then perform feature extraction on the preprocessed input voice, for example, extracting a spectral feature, a fundamental frequency feature, an energy feature, or a zero-crossing rate.
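As a purely illustrative sketch (the embodiment does not prescribe any particular toolkit), the feature extraction described above might be implemented as follows in Python, assuming 16 kHz mono audio and using the librosa library; the features mirror the spectral, fundamental-frequency, energy, and zero-crossing-rate features named above.

```python
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Extract spectral, F0, energy, and zero-crossing-rate features."""
    y, sr = librosa.load(wav_path, sr=sr)
    # Spectral feature: 13 MFCCs, averaged over frames.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    # Fundamental-frequency feature: median F0 estimate.
    f0 = np.median(librosa.yin(y, fmin=60, fmax=400, sr=sr))
    # Energy feature: mean root-mean-square energy.
    energy = librosa.feature.rms(y=y).mean()
    # Zero-crossing rate, averaged over frames.
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    return np.concatenate([mfcc, [f0, energy, zcr]])
```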
S12: performing voice characteristic recognition and voice recognition according to the feature information to obtain voice characteristics and a voice recognition result, where the voice characteristics include: a dialect, an accent, or Mandarin.
The voice characteristics may be determined according to the feature information; a pre-established voice model corresponding to the voice characteristics is then determined, and that voice model is used for voice recognition to obtain the voice recognition result; or,
voice recognition may be performed according to the feature information and a plurality of pre-established voice models to obtain a confidence value for each voice model, an optimal voice model may be determined from the plurality of voice models according to the confidence values, and the voice characteristics and voice recognition result corresponding to the optimal voice model may be taken as the voice characteristics and voice recognition result to be obtained.
Referring to fig. 2, the process of obtaining the speech characteristics of the input speech may include:
S21: the input speech is preprocessed.
The preprocessing is, for example, noise reduction processing.
S22: and performing feature extraction on the preprocessed input voice to obtain feature information.
The feature extraction is, for example, spectral feature extraction, fundamental frequency feature extraction, energy feature extraction or zero-crossing rate extraction, etc.
S23: and recognizing the voice characteristics according to the characteristic information obtained after the characteristic extraction and a pre-established discrimination model.
The discrimination model may be established by using a Support Vector Machine (SVM) or Hidden Markov Model (HMM) modeling technique, and may include a Mandarin model, a dialect model, or an accent model.
By comparing the feature information against the discrimination model, the speech characteristic of Mandarin, a dialect, or an accent can be identified.
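A minimal sketch of the SVM variant of such a discrimination model, using scikit-learn; the label set and the training data are assumptions, and an HMM-based model would follow the same train-then-classify pattern.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative label set; the embodiment only requires Mandarin, dialect,
# and accent classes.
LABELS = ["mandarin", "sichuan_dialect", "northeastern_accent"]

def train_discrimination_model(X_train, y_train):
    """Fit a multi-class SVM on labeled feature vectors."""
    # probability=True exposes class posteriors, useful when combining
    # the decision with priors or recent data (see below).
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    model.fit(X_train, y_train)
    return model

def discriminate(model, features):
    """Compare feature information against the model; return the class."""
    return model.predict([features])[0]
```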
Since the present embodiment is mainly directed at dialect and accent discrimination, the speech characteristic discrimination is denoted as dialect/accent discrimination in fig. 2.
Optionally, after the speech characteristics are obtained according to the discrimination model, the speech characteristics may be corrected according to related information. Referring to fig. 2, the method may further include:
S24: acquiring recent data, and performing a cumulative judgment on the dialect/accent discrimination result according to the recent data to obtain the judgment result.
The recent data refers to data collected within a time period that is less than a preset value from the current time.
In addition, the discrimination may be combined with the user's location information: for example, according to related information in a prior model, such as the statistical probability of each dialect or accent in the region to which the location belongs, together with the dialect/accent discrimination result, a final recognition result can be obtained, thereby yielding a more accurate estimate.
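The embodiment does not fix how the recent data and the location prior are combined with the discrimination result; one plausible sketch is a weighted combination, in which the region prior table and the weights are assumptions.

```python
from collections import Counter

# Illustrative per-region dialect/accent statistics (the "prior model").
REGION_PRIOR = {
    "chengdu": {"sichuan_dialect": 0.7, "mandarin": 0.3},
    "harbin": {"northeastern_accent": 0.6, "mandarin": 0.4},
}

def corrected_decision(model_scores: dict, recent_labels: list,
                       region: str, w_model: float = 0.6,
                       w_recent: float = 0.25, w_prior: float = 0.15) -> str:
    """Weighted combination of model scores, recent results, and prior."""
    recent = Counter(recent_labels)          # accumulated recent judgments
    total = sum(recent.values()) or 1
    prior = REGION_PRIOR.get(region, {})
    def score(label: str) -> float:
        return (w_model * model_scores.get(label, 0.0)
                + w_recent * recent[label] / total
                + w_prior * prior.get(label, 0.0))
    return max(model_scores, key=score)
```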
After the speech characteristics are obtained, a corresponding speech recognition model can be determined from a plurality of pre-established models, and speech recognition is then performed using that model; for example, if the obtained speech characteristic is the Sichuan dialect, speech recognition can be performed on the input speech by using the speech recognition model corresponding to the Sichuan dialect.
The foregoing describes determining speech characteristics prior to determining a speech recognition model, and optionally, the speech characteristics and the speech recognition model can be determined simultaneously.
Referring to fig. 3, the process of obtaining the speech characteristics and the speech recognition result according to the input speech may include:
S31: the input speech is preprocessed.
The preprocessing is, for example, noise reduction processing.
S32: and performing feature extraction on the preprocessed input voice to obtain feature information.
The feature extraction is, for example, spectral feature extraction, fundamental frequency feature extraction, energy feature extraction or zero-crossing rate extraction, etc.
S33: performing voice recognition according to the feature information and a plurality of pre-established voice recognition models to obtain a confidence value corresponding to each model.
The plurality of speech recognition models may be all models established in advance, or a plurality of models selected from all models established in advance.
In fig. 3, the plurality of speech recognition models are represented as recognition model 1, recognition model 2, ..., recognition model N.
For example, the plurality of speech recognition models may be a speech recognition model corresponding to the Sichuan dialect, a speech recognition model corresponding to the Northeastern dialect, and a speech recognition model corresponding to Cantonese, respectively.
When each speech recognition model performs speech recognition on the input speech, a confidence value corresponding to each model can be obtained.
S34: determining an optimal voice recognition model according to the confidence values, and obtaining the voice characteristics and the voice recognition result corresponding to the optimal voice recognition model.
For example, the confidence value obtained by the speech recognition model corresponding to the Sichuan dialect may be greater than that obtained by the speech recognition model corresponding to the Northeastern dialect, which in turn may be greater than that obtained by the speech recognition model corresponding to Cantonese.
In that case, the optimal speech recognition model is the speech recognition model corresponding to the Sichuan dialect, the speech characteristic is the Sichuan dialect, and the speech recognition result is obtained by performing speech recognition on the input speech using that model.
In addition, it can be understood that whether the speech characteristics are determined before the speech recognition model is selected, or the speech characteristics and the speech recognition model are determined simultaneously, if no speech characteristics and speech recognition model consistent with the feature information can be found, the most similar speech recognition model can be selected according to similarity, and that model can be used for the speech recognition.
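A sketch of the confidence-based selection of S33-S34 together with the most-similar-model fallback just described; the recognition-model interface (a recognize() call returning text and a confidence value) and the similarity function are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RecognitionResult:
    text: str
    confidence: float
    characteristic: str  # e.g. "sichuan_dialect"

def recognize_best(models: dict, features) -> RecognitionResult:
    """Run every pre-established model and keep the highest confidence.

    models maps a speech characteristic to a recognition model whose
    recognize(features) returns a (text, confidence) pair; this
    interface is assumed for illustration.
    """
    results = []
    for characteristic, model in models.items():
        text, confidence = model.recognize(features)
        results.append(RecognitionResult(text, confidence, characteristic))
    return max(results, key=lambda r: r.confidence)

def pick_model(models: dict, characteristic: str, similarity):
    """Exact match first; otherwise fall back to the most similar model.

    similarity(a, b) is an assumed scoring function between two
    characteristics (higher means more similar).
    """
    if characteristic in models:
        return models[characteristic]
    best = max(models, key=lambda c: similarity(c, characteristic))
    return models[best]
```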
S13: acquiring an answer corresponding to the input voice according to the voice recognition result and the voice characteristics.
After the voice recognition result is obtained, the user's need is determined by semantic understanding technology, and a relevant result is retrieved from a database, a search engine, or other knowledge bases and information sources to serve as the answer.
Preferably, a text answer carrying the voice characteristics is preferentially obtained from a database.
For example, if the user's speech is in a dialect or carries an accent, the answer is preferably looked up in data corresponding to that dialect or accent.
In addition, if no corresponding information exists, a certain text conversion can be performed on the voice recognition result so that it better conforms to written-language habits, and the search can then be performed on the converted text.
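A minimal sketch of that text conversion, assuming a simple replacement table; real conversion rules would be considerably richer.

```python
# Purely illustrative colloquial-to-written replacement table.
DIALECT_TO_WRITTEN = {
    "咋样": "怎么样",  # Northeastern colloquialism -> standard written form
    "啥": "什么",
}

def normalize_for_search(recognized_text: str) -> str:
    """Rewrite a recognition result into written-language form."""
    for colloquial, written in DIALECT_TO_WRITTEN.items():
        recognized_text = recognized_text.replace(colloquial, written)
    return recognized_text
```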
S14: generating an output voice according to the voice characteristics and the answer, where the output voice is the voice that corresponds to the answer and has the voice characteristics.
Optionally, generating the output voice according to the voice characteristics and the answer includes:
if the answer comprises a text answer with the voice characteristics, setting voice synthesis parameters, and converting the text answer with the voice characteristics into the output voice; or,
if the answer comprises a text answer without the voice characteristics, setting voice synthesis parameters according to the voice characteristics, and generating the output voice according to the voice synthesis parameters and the text answer without the voice characteristics; or,
if the answer comprises a text answer without the voice characteristics, converting the text answer into a text answer with the voice characteristics, setting voice synthesis parameters according to the voice characteristics, and generating the output voice according to the voice synthesis parameters and the converted text answer.
For example, when the input voice is in the Sichuan dialect, after an answer having the characteristics of the Sichuan dialect is found in the database, the text answer having those characteristics may be converted directly into voice. Alternatively, after a Mandarin text answer is found in the database, the Mandarin answer may be converted into voice having the characteristics of the Sichuan dialect according to the Sichuan speech characteristics. Or, after the Mandarin text answer is found in the database, the text answer may first be converted into a text answer having the characteristics of the Sichuan dialect and then converted into voice having the characteristics of the Sichuan dialect.
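The three branches can be sketched as follows; synthesize() and convert_text() are placeholders for the synthesis engine and the text conversion rules, which the embodiment leaves open.

```python
def generate_output(answer_text: str, has_characteristic: bool,
                    characteristic: str, synthesize, convert_text,
                    convert_text_first: bool = False):
    """Sketch of the three output-generation branches above."""
    if has_characteristic:
        # Branch 1: the text answer already carries the voice
        # characteristics; just set the synthesis parameters and convert.
        return synthesize(answer_text, params={"voice": characteristic})
    if convert_text_first:
        # Branch 3: first convert the plain text answer into a text
        # answer with the voice characteristics, then synthesize.
        converted = convert_text(answer_text, characteristic)
        return synthesize(converted, params={"voice": characteristic})
    # Branch 2: keep the plain text answer but set synthesis parameters
    # according to the voice characteristics.
    return synthesize(answer_text, params={"voice": characteristic})
```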
After the output speech is obtained, the output speech may be output and/or saved.
Optionally, setting the voice synthesis parameters according to the voice characteristics includes:
setting voice synthesis parameters matched with the voice characteristics; or,
setting the voice synthesis parameters with the highest similarity to the voice characteristics.
Referring to fig. 4, the process of generating the output voice according to the answer may include:
S41: judging whether a dialect corresponding to the recognized voice characteristics exists; if so, executing S45, otherwise executing S42.
S42: judging whether an accent corresponding to the recognized voice characteristics exists; if so, executing S45, otherwise executing S43.
S43: judging whether an approximate accent can be achieved through conversion; if so, executing S45, otherwise executing S44.
S44: resetting the parameters.
S45: setting the synthesis parameters.
S46: performing voice synthesis.
For example, if the found information is in the dialect or accent corresponding to the user, the speech synthesis module is consulted to see whether the same synthesis setting exists; if not, the closest synthesis setting is used. If the found information is text in conventional written-language form, and the synthesis module can support the corresponding dialect, support an approximate accent, or achieve an approximate accent through simple conversion rules such as tone adjustment, the answer text is first converted to conform to the corresponding language habits, and the converted answer text is then used as the input to the synthesis module for speech synthesis.
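A hedged sketch of the fig. 4 cascade (S41-S46); the capability sets of the synthesis module and the tone-rule conversion table are assumptions standing in for its real interface.

```python
# Illustrative capability sets; a real synthesis module would expose these.
SUPPORTED_DIALECTS = {"sichuan_dialect"}
SUPPORTED_ACCENTS = {"northeastern_accent"}
# Accents approximated via simple conversion rules such as tone adjustment.
CONVERTIBLE_ACCENTS = {"shandong_accent": "northeastern_accent"}

def choose_synthesis_setting(characteristic: str) -> dict:
    """Walk the S41-S46 cascade and return synthesis settings."""
    if characteristic in SUPPORTED_DIALECTS:       # S41 -> S45
        return {"setting": characteristic}
    if characteristic in SUPPORTED_ACCENTS:        # S42 -> S45
        return {"setting": characteristic}
    if characteristic in CONVERTIBLE_ACCENTS:      # S43 -> S45 (approximate)
        return {"setting": CONVERTIBLE_ACCENTS[characteristic],
                "convert_text": True}
    return {"setting": "mandarin"}                 # S44: reset to default
```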
According to the embodiment, the voice characteristics of the input voice are recognized, and the voice recognition model matched with the voice characteristics can be selected to perform voice recognition on the input voice, so that the voice interaction effect can be improved, and the user experience is improved.
Fig. 5 is a schematic flow chart of a voice interaction method according to another embodiment of the present invention, where the method includes:
S51: performing feature extraction on the input voice.
For example, the input speech is preprocessed, and then the feature extraction is performed on the preprocessed input speech.
The preprocessing is, for example, noise reduction processing.
The feature extraction is, for example, spectral feature extraction, fundamental frequency feature extraction, energy feature extraction or zero-crossing rate extraction, etc.
S52: performing dialect/accent discrimination according to the feature information obtained by the feature extraction.
Dialect/accent discrimination can be performed based on a discrimination model established in advance and the feature information.
The specific determination method can be seen in fig. 2, and is not described herein again.
S53: performing voice recognition.
After the speech characteristics are obtained, speech recognition may be performed using a speech recognition model that matches the speech characteristics; for example, when the input speech has the characteristics of the Sichuan dialect, speech recognition may be performed using the speech recognition model for the Sichuan dialect.
It will be appreciated that when there is no speech recognition model consistent with the recognized speech characteristics, the speech recognition model most similar to the speech characteristics may be used for the speech recognition.
S54: semantic understanding.
For example, after the text content is obtained by speech recognition, the semantic understanding is performed on the text content to obtain the intention of the user to input speech.
S55: generating an answer.
After semantic understanding, the corresponding answer may be obtained by searching in a database of the corresponding dialect or accent, and/or a Mandarin database.
S56: setting the dialect/accent for synthesis.
For example, when the input voice has the characteristics of the Sichuan dialect, parameters having the characteristics of the Sichuan dialect may be set so that the voice corresponding to the answer also has those characteristics.
S57: performing speech generation to obtain the output speech, which can then be output.
After the synthesis parameters are set, the answer may be converted into speech according to the parameters.
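Wiring the earlier sketches together gives an end-to-end picture of S51-S57; every component here is an assumed interface (extract_features, discriminate, pick_model, and choose_synthesis_setting reuse the sketches shown earlier), since the embodiment specifies the flow rather than the implementation.

```python
def voice_interaction(wav_path: str, discrimination_model, models: dict,
                      understand, search_answer, synthesize, similarity):
    """One round of dialog, following the flow of fig. 5."""
    feats = extract_features(wav_path)                          # S51
    characteristic = discriminate(discrimination_model, feats)  # S52
    model = pick_model(models, characteristic, similarity)
    text, _confidence = model.recognize(feats)                  # S53
    intent = understand(text)                                   # S54
    answer = search_answer(intent, characteristic)              # S55
    setting = choose_synthesis_setting(characteristic)          # S56
    return synthesize(answer, setting)                          # S57
```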
Possible application scenarios of the present embodiment are as follows:
A user inputs speech in Mandarin Chinese, for example "how is the weather today?". After dialect/accent discrimination, the recognition system is set to use the Mandarin recognition model, and correct recognition is obtained. Then, the weather forecast information for the day is obtained through a search engine or the data of a weather service provider. Finally, voice synthesis is set to Mandarin, and the weather forecast information is played to the user, completing one round of conversation.
A user asks about the weather in colloquial phrasing with a Northeastern accent. After dialect/accent discrimination, the recognition system is set to use a recognition model for the Northeastern accent, and a correct recognition result is obtained. Then, through the semantic understanding module, the weather forecast information for the day is obtained using a search engine or the data of a weather service provider. Finally, the obtained information is suitably converted: after the characteristics of the user's manner of speaking are added to the text, voice synthesis is set to Mandarin with a Northeastern accent, and the weather forecast information is played to the user in that accent, completing one round of conversation.
This embodiment improves a core link of the traditional human-computer interaction interface: by introducing dialect/accent discrimination, the system becomes more intelligent and more personable, improving user experience and satisfaction. Through dialect/accent discrimination, a recognition model better matched to the user's input voice can be used, improving recognition performance and allowing user needs to be understood more accurately. Through semantic understanding, response content well suited to the user can be generated on the basis of understanding what a user speaking with a dialect/accent has said. Through speech synthesis, the speech best suited to the user can be output. The embodiment thus makes full use of dialect/accent information in human-computer interaction: by discriminating dialects/accents, it improves the machine's ability to both understand and produce such speech, turning the dialect/accent from an unfavorable factor into a favorable one. At the same time, it further lowers the barrier to human-computer voice interaction and promotes the wider application of voice technology.
Fig. 6 is a schematic structural diagram of a voice interaction apparatus according to another embodiment of the present invention, where the apparatus 60 includes an input module 61, a recognition module 62, an acquisition module 63, and an output module 64.
The input module 61 is configured to receive an input voice, and perform feature extraction on the input voice to obtain feature information of the input voice;
the input speech is speech input by the user into the speech interaction system, and the input speech may specifically be a question, for example, the input speech is "what is the weather today".
The voice interaction system may receive the input voice through a microphone or another device, may perform preprocessing such as noise reduction on the input voice after receiving it, and may then perform feature extraction on the preprocessed input voice, for example, extracting a spectral feature, a fundamental frequency feature, an energy feature, or a zero-crossing rate.
The recognition module 62 is configured to perform speech characteristic recognition and speech recognition according to the feature information to obtain speech characteristics and a speech recognition result, where the speech characteristics include: a dialect, an accent, or Mandarin;
optionally, the identification module 62 is specifically configured to:
performing voice characteristic recognition according to the feature information to obtain the voice characteristics;
determining a voice recognition model corresponding to the voice characteristics, and recognizing the input voice by using that voice recognition model to obtain the voice recognition result.
Optionally, the identification module 62 is further specifically configured to:
performing voice characteristic recognition according to the feature information and a pre-established discrimination model to obtain the voice characteristics; or,
performing voice characteristic recognition according to the feature information and a pre-established discrimination model to obtain preliminary voice characteristics, and obtaining final voice characteristics according to the preliminary voice characteristics and pre-acquired data, where the pre-acquired data is data collected within a time period that is less than a preset value from the current time.
The discrimination model may be established by using a Support Vector Machine (SVM) or Hidden Markov Model (HMM) modeling technique, and may include a Mandarin model, a dialect model, or an accent model.
By comparing the feature information against the discrimination model, the speech characteristic of Mandarin, a dialect, or an accent can be identified.
Optionally, after the speech characteristics are obtained according to the discrimination model, the speech characteristics may be corrected according to related information.
After the speech characteristics are obtained, a corresponding speech recognition model can be determined from a plurality of pre-established models, and speech recognition is then performed using that model; for example, if the obtained speech characteristic is the Sichuan dialect, speech recognition can be performed on the input speech by using the speech recognition model corresponding to the Sichuan dialect.
Optionally, the identification module 62 is specifically configured to:
recognizing the input voice by using at least two preset voice recognition models to obtain a voice recognition result and a confidence value corresponding to each voice recognition model, where different voice recognition models correspond to different voice characteristics;
determining the voice characteristics and voice recognition result corresponding to the voice recognition model with the largest confidence value as the voice characteristics and voice recognition result to be obtained.
The plurality of speech recognition models may be all models established in advance, or a plurality of models selected from all models established in advance.
For example, the plurality of speech recognition models may be a speech recognition model corresponding to the Sichuan dialect, a speech recognition model corresponding to the Northeastern dialect, and a speech recognition model corresponding to Cantonese, respectively.
When each speech recognition model performs speech recognition on the input speech, a confidence value corresponding to each model can be obtained.
For example, the confidence value obtained by the speech recognition model corresponding to the Sichuan dialect may be greater than that obtained by the speech recognition model corresponding to the Northeastern dialect, which in turn may be greater than that obtained by the speech recognition model corresponding to Cantonese.
In that case, the optimal speech recognition model is the speech recognition model corresponding to the Sichuan dialect, the speech characteristic is the Sichuan dialect, and the speech recognition result is obtained by performing speech recognition on the input speech using that model.
In addition, it can be understood that whether the speech characteristics are determined before the speech recognition model is selected, or the speech characteristics and the speech recognition model are determined simultaneously, if no speech characteristics and speech recognition model consistent with the feature information can be found, the most similar speech recognition model can be selected according to similarity, and that model can be used for the speech recognition.
The obtaining module 63 is configured to obtain an answer corresponding to the input voice according to the voice recognition result and the voice characteristics;
After the voice recognition result is obtained, the user's need is determined by semantic understanding technology, and a relevant result is retrieved from a database, a search engine, or other knowledge bases and information sources to serve as the answer.
Optionally, the obtaining module 63 is specifically configured to:
preferentially acquiring a text answer carrying the voice characteristics from a database having the voice characteristics.
For example, if the user's speech is in a dialect or carries an accent, the answer is preferably looked up in data corresponding to that dialect or accent.
In addition, if no corresponding information exists, a certain text conversion can be performed on the voice recognition result so that it better conforms to written-language habits, and the search can then be performed on the converted text.
The output module 64 is configured to generate an output voice according to the voice characteristics and the answer, where the output voice is a voice corresponding to the answer and having the voice characteristics.
Optionally, the output module 64 is specifically configured to:
if the answer comprises a text answer with the voice characteristics, setting voice synthesis parameters, and converting the text answer with the voice characteristics into the output voice; or,
if the answer comprises a text answer without the voice characteristics, setting voice synthesis parameters according to the voice characteristics, and generating the output voice according to the voice synthesis parameters and the text answer without the voice characteristics; or,
if the answer comprises a text answer without the voice characteristics, converting the text answer into a text answer with the voice characteristics, setting voice synthesis parameters according to the voice characteristics, and generating the output voice according to the voice synthesis parameters and the converted text answer.
For example, when the input voice is in the Sichuan dialect, after an answer having the characteristics of the Sichuan dialect is found in the database, the text answer having those characteristics may be converted directly into voice. Alternatively, after a Mandarin text answer is found in the database, the Mandarin answer may be converted into voice having the characteristics of the Sichuan dialect according to the Sichuan speech characteristics. Or, after the Mandarin text answer is found in the database, the text answer may first be converted into a text answer having the characteristics of the Sichuan dialect and then converted into voice having the characteristics of the Sichuan dialect.
Optionally, the output module 64 is further specifically configured to:
setting voice synthesis parameters matched with the voice characteristics; or,
setting the voice synthesis parameters with the highest similarity to the voice characteristics.
For example, if the found information is in the dialect or accent corresponding to the user, the speech synthesis module is consulted to see whether the same synthesis setting exists; if not, the closest synthesis setting is used. If the found information is text in conventional written-language form, and the synthesis module can support the corresponding dialect, support an approximate accent, or achieve an approximate accent through simple conversion rules such as tone adjustment, the answer text is first converted to conform to the corresponding language habits, and the converted answer text is then used as the input to the synthesis module for speech synthesis.
In another embodiment, referring to fig. 7, the apparatus 60 further comprises:
a processing module 65, configured to store the output voice, or to output the output voice.
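For illustration only, apparatus 60 can be sketched as plain classes mirroring modules 61 through 65; the method names are assumptions, and each module body would reuse the sketches shown earlier.

```python
class VoiceInteractionApparatus:
    """Sketch of apparatus 60 wiring modules 61-65 together."""

    def __init__(self, input_module, recognition_module,
                 acquisition_module, output_module, processing_module=None):
        self.input = input_module              # input module 61
        self.recognition = recognition_module  # recognition module 62
        self.acquisition = acquisition_module  # obtaining module 63
        self.output = output_module            # output module 64
        self.processing = processing_module    # optional processing module 65

    def interact(self, wav_path: str):
        features = self.input.extract(wav_path)
        characteristic, text = self.recognition.recognize(features)
        answer = self.acquisition.get_answer(text, characteristic)
        speech = self.output.generate(answer, characteristic)
        if self.processing is not None:
            self.processing.store_or_output(speech)
        return speech
```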
According to the embodiment, the voice characteristics of the input voice are recognized, and the voice recognition model matched with the voice characteristics can be selected to perform voice recognition on the input voice, so that the voice interaction effect can be improved, and the user experience is improved.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Alternate implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.