
Speech dialogue method and system based on artificial intelligence

Info

Publication number
CN118173093B
CN118173093B (Application CN202410567703.6A)
Authority
CN
China
Prior art keywords
voice
recognition result
target
acoustic
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410567703.6A
Other languages
Chinese (zh)
Other versions
CN118173093A (en)
Inventor
程绍波
石玉山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Yuyun Technology Co ltd
Original Assignee
Liaoning Yuyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Yuyun Technology Co ltd
Priority to CN202410567703.6A
Publication of CN118173093A
Application granted
Publication of CN118173093B
Legal status: Active
Anticipated expiration

Abstract

The embodiments of the present application relate to the technical field of artificial intelligence and disclose a voice dialogue method and system based on artificial intelligence. The method first performs acoustic model recognition and machine learning recognition on several voices input by a user to determine the voice object type, and then selects the corresponding target voice acoustic category according to that type to generate reply voices for a voice dialogue with the user. The user type is thus accurately identified through multiple recognition mechanisms, so that the matching voice acoustic category can be supplied for the dialogue, meeting the voice dialogue needs of different users and scenarios and improving the voice dialogue experience.

Description

Speech dialogue method and system based on artificial intelligence
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a voice dialogue method and system based on artificial intelligence.
Background
With advances in science and technology, supermarket service robots have become increasingly intelligent and multifunctional: beyond traditional tasks such as shopping assistance and cleaning, they can also provide merchandise recommendation, intelligent shopping guidance, and similar functions through artificial intelligence technology. However, the functions of such robots remain confined to these application scenarios and cannot satisfy the voice dialogue needs of a wider range of scenarios; for example, they cannot engage children in games or topical conversation matched to a child's level of intelligence, which leaves them insufficiently attractive to consumers.
Disclosure of Invention
The main purpose of the present invention is to provide a voice dialogue method and system based on artificial intelligence, aiming to solve the technical problem in the prior art that intelligent robots cannot meet the voice dialogue needs of a wider range of scenarios and are therefore insufficiently attractive to consumers.
To achieve the above object, in a first aspect, an embodiment of the present application provides a voice dialogue method based on artificial intelligence, the method including:
acquiring a first target voice input by a target user, and triggering a first feedback voice according to the first target voice;
obtaining a second target voice input by the target user in reply to the first feedback voice, and inputting the first target voice and the second target voice into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result, where the first voice recognition result includes a voice object type and a corresponding recognition probability;
performing voice deep learning recognition on the first target voice and the second target voice based on a machine deep learning technique to obtain a second voice recognition result, where the second voice recognition result includes a voice object type and a corresponding recognition probability;
when the voice object types in the first voice recognition result and the second voice recognition result are consistent, determining the voice object type in either result as the target voice object type;
determining, according to the target voice object type, the target voice acoustic category to be used when replying to that object type, where the target voice acoustic category is one of adult accent and child accent;
and generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user.
Further, after the first voice recognition result and the second voice recognition result are obtained, the method further includes:
when the voice object types in the first voice recognition result and the second voice recognition result are inconsistent, determining the target voice acoustic category according to the two results, where the target voice acoustic category corresponds to whichever result has the larger prediction probability;
generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user;
performing voice deep learning recognition based on the machine deep learning technique during the voice dialogue with the target user to obtain a third voice recognition result;
and maintaining or switching the target voice acoustic category according to the third voice recognition result.
Further, maintaining or switching the target voice acoustic category according to the third voice recognition result includes:
if the target voice acoustic category is the same as the third voice recognition result, keeping the target voice acoustic category unchanged;
and if the target voice acoustic category differs from the third voice recognition result, switching the target voice acoustic category to the voice acoustic category corresponding to the third voice recognition result.
Further, inputting the first target voice and the second target voice into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result includes:
splicing the first target voice and the second target voice to form a positive-order voice group and a reverse-order voice group;
and inputting the positive-order voice group and the reverse-order voice group respectively into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result.
Further, inputting the positive-order voice group and the reverse-order voice group respectively into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result includes:
inputting the positive-order voice group into the pre-trained acoustic model for voice object recognition to obtain a positive-order recognition result;
inputting the reverse-order voice group into the pre-trained acoustic model for voice object recognition to obtain a reverse-order recognition result;
and performing Gaussian mixture processing on the positive-order recognition result and the reverse-order recognition result to obtain the first voice recognition result, where the Gaussian mixture processing satisfies the following vector expressions:

["A", P(X)], P(X) = ω1·P1 + ω2·P2   (1)
["B", P(X)], P(X) = ω1·P1 + ω2·P2   (2)
["C", 100%]   (3)

where vector expression (1) is the first voice recognition result obtained when the positive-order recognition result is consistent with the reverse-order recognition result; vector expression (2) is the first voice recognition result obtained when the two results are not completely consistent; vector expression (3) is the first voice recognition result obtained when the two results are inconsistent; "A", "B", and "C" are the voice object types obtained by the Gaussian mixture processing; P(X) is the recognition probability obtained by the Gaussian mixture processing; P1 and ω1 are the recognition probability and corresponding calculation weight under the positive-order recognition result; and P2 and ω2 are the recognition probability and corresponding calculation weight under the reverse-order recognition result.
Further, performing voice deep learning recognition on the first target voice and the second target voice based on the machine deep learning technique to obtain a second voice recognition result includes:
recognizing and extracting the voice content in the first target voice and the second target voice respectively to obtain first voice content and second voice content;
performing context association learning on the first voice content and the second voice content based on the machine deep learning technique to obtain a context association degree;
and querying an association mapping table according to the context association degree to obtain the second voice recognition result.
Further, after determining the target voice acoustic category to be used when replying to the target voice object type, the method further includes:
determining a voice dialogue topic according to the voice content of the first target voice and/or the second target voice;
and generating reply voices according to the voice dialogue topic to conduct a voice dialogue with the target user.
Further, generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user includes:
acquiring a preset speech rate for the voice reply according to the target voice acoustic category;
and generating reply voices at the preset speech rate to conduct a voice dialogue with the target user.
Further, after generating reply voices at the preset speech rate to conduct a voice dialogue with the target user, the method further includes:
during the voice dialogue with the target user, adjusting the preset speech rate according to the target user's speech rate to obtain a target speech rate;
and generating reply voices at the target speech rate to conduct a voice dialogue with the target user.
In a second aspect, an embodiment of the present application further provides a voice dialogue system, including a memory for storing program code and a processor for invoking the program code to perform the method according to the first aspect.
Compared with the prior art, the artificial-intelligence-based voice dialogue method provided by the embodiments of the present application first acquires a first target voice input by a target user and triggers a first feedback voice according to it; then obtains a second target voice input by the target user in reply to the first feedback voice, and inputs the first and second target voices into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result; performs voice deep learning recognition on the two target voices based on a machine deep learning technique to obtain a second voice recognition result; when the voice object types in the two results are consistent, determines that type as the target voice object type; then determines, according to the target voice object type, the target voice acoustic category to be used when replying to it; and finally generates reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user. In short, acoustic model recognition and machine learning recognition are applied jointly to several voices input by the user to obtain the voice object type, and the corresponding target voice acoustic category is then selected according to that type to generate reply voices for the dialogue. The user type is thus accurately identified through multiple recognition mechanisms, so that the matching voice acoustic category can be supplied for the dialogue, meeting the voice dialogue needs of different users and scenarios and improving the voice dialogue experience.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a voice dialogue method according to some embodiments of the application;
FIG. 2 is a flowchart of a voice dialogue method according to other embodiments of the application;
FIG. 3 is a schematic diagram of the hardware structure of a voice dialogue system according to some embodiments of the application.
The achievement of the objects, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed Description
The following clearly and completely describes the embodiments of the present invention with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the protection scope of the invention.
It should be noted that all directional indicators (such as up, down, left, right, front, and rear) in the embodiments of the present invention are merely used to explain the relative positional relationships, movements, and the like between components in a particular posture (as shown in the drawings); if the particular posture changes, the directional indicator changes accordingly.
Furthermore, the descriptions "first", "second", and the like in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features; thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, "and/or" throughout this document covers three schemes: taking A and/or B as an example, it includes the technical scheme of A alone, the technical scheme of B alone, and the technical scheme in which both A and B are satisfied. The technical solutions of the embodiments may be combined with one another, but only on the basis that they can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, that combination should be considered absent and outside the protection scope claimed in the present invention.
With advances in science and technology, supermarket service robots have become increasingly intelligent and multifunctional: beyond traditional tasks such as shopping assistance and cleaning, they can also provide merchandise recommendation, intelligent shopping guidance, and similar functions through artificial intelligence technology. However, the functions of such robots remain confined to these application scenarios and cannot satisfy the voice dialogue needs of a wider range of scenarios; for example, they cannot engage children in games or topical conversation matched to a child's level of intelligence, which leaves them insufficiently attractive to consumers.
In view of these problems, the present application provides a voice dialogue method based on artificial intelligence. Its general idea is as follows: when a user holds a voice dialogue with a machine, acoustic model recognition and machine learning recognition are performed on several voices input by the user to determine the type of user engaged in the dialogue (child or adult), and the corresponding voice acoustic category is selected according to that user type. That is, when the user is judged to be a child, the dialogue is conducted with a child's accent and with topics matched to a child's level of intelligence (such as nursery rhymes and games); when the user is judged to be an adult, the dialogue is conducted with an adult's accent and with topics matched to an adult's level of intelligence (such as shopping recommendations and shopping guidance). This meets the voice dialogue needs of different users and scenarios, improves the voice dialogue experience, and increases the attractiveness of the supermarket robot to different groups of people.
Referring to FIGS. 1-2, the specific steps of the artificial-intelligence-based voice dialogue method are described below. It should be understood that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one illustrated here. Referring to FIG. 1, the method includes the following steps:
S100, acquiring a first target voice input by a target user, and triggering a first feedback voice according to the first target voice;
Before a user holds a dialogue with the intelligent robot, the user needs to wake the machine up, and the wake-up voice can serve as the first target voice. For example, when the user wakes the intelligent robot with "Little Leopard, good afternoon", that wake-up voice is taken as the first target voice. The intelligent robot system then triggers a feedback voice based on the wake-up voice; for example, the system replies to the user with "Hello, what can I help you with?".
S200, obtaining a second target voice input by the target user in reply to the first feedback voice, and inputting the first target voice and the second target voice into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result, where the first voice recognition result includes a voice object type and a corresponding recognition probability;
After the user receives the first feedback voice, the user replies to it, for example with "I want to find the XXX shop", and the system takes "I want to find the XXX shop" as the second target voice, completing the acquisition of the first and second target voices. The user type can then be judged based on the acoustic model: recognition may use only the first target voice, only the second target voice, or both target voices together.
To improve the accuracy of object type (user category) recognition, the present application inputs both the first target voice and the second target voice into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result.
It can be understood that, since the initial pronunciation of each voice segment has different characteristics, the order in which multiple voice segments are input into the acoustic model has a certain influence on the recognition result. In one embodiment, the step of inputting the first target voice and the second target voice into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result includes:
splicing the first target voice and the second target voice to form a positive-order voice group and a reverse-order voice group;
and inputting the positive-order voice group and the reverse-order voice group respectively into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result.
Specifically, the positive-order and reverse-order voice groups are voice combinations formed by splicing the first and second target voices in order: if the first target voice is A and the second target voice is B, then A-B is the positive-order voice group and B-A is the reverse-order voice group. After the two groups are obtained, either one can be selected at random and input into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result, or both groups can be input into the model simultaneously to obtain the first voice recognition result.
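As an illustration, a minimal sketch of the splicing step in Python (the NumPy waveform representation and 1-D concatenation are assumptions; the patent does not name an audio library or model interface):

```python
import numpy as np

def build_voice_groups(first_voice: np.ndarray, second_voice: np.ndarray):
    """Splice the two target voices into a positive-order group (A-B)
    and a reverse-order group (B-A), as described above."""
    positive_group = np.concatenate([first_voice, second_voice])  # A-B
    reverse_group = np.concatenate([second_voice, first_voice])   # B-A
    return positive_group, reverse_group
```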
In one embodiment, the step of inputting the positive-order voice group and the reverse-order voice group respectively into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result includes:
inputting the positive-order voice group into the pre-trained acoustic model for voice object recognition to obtain a positive-order recognition result;
inputting the reverse-order voice group into the pre-trained acoustic model for voice object recognition to obtain a reverse-order recognition result;
and performing Gaussian mixture processing on the positive-order recognition result and the reverse-order recognition result to obtain the first voice recognition result, where the Gaussian mixture processing satisfies the following vector expressions:

["A", P(X)], P(X) = ω1·P1 + ω2·P2   (1)
["B", P(X)], P(X) = ω1·P1 + ω2·P2   (2)
["C", 100%]   (3)

where vector expression (1) is the first voice recognition result obtained when the positive-order recognition result is consistent with the reverse-order recognition result; vector expression (2) is the first voice recognition result obtained when the two results are not completely consistent; vector expression (3) is the first voice recognition result obtained when the two results are inconsistent; "A", "B", and "C" are the voice object types obtained by the Gaussian mixture processing; P(X) is the recognition probability obtained by the Gaussian mixture processing; P1 and ω1 are the recognition probability and corresponding calculation weight under the positive-order recognition result; and P2 and ω2 are the recognition probability and corresponding calculation weight under the reverse-order recognition result. The calculation weights ω1 and ω2 may be preset in the system, for example each set to 0.5.
Specifically, the results produced by voice object recognition on the positive-order and reverse-order voice groups fall roughly into three cases. First case: the positive-order and reverse-order recognition results are consistent; for example, the positive-order result identifies an adult user with 80% probability and the reverse-order result identifies an adult user with 85% probability, i.e., the recognized voice object types are the same and the corresponding probabilities are close. Second case: the results are not completely consistent; for example, the positive-order result identifies an adult user with 80% probability and the reverse-order result identifies an adult user with 20% probability, i.e., the recognized voice object types are the same but the corresponding probabilities differ widely. Third case: the results are inconsistent; for example, the positive-order result identifies an adult user with 80% probability and the reverse-order result identifies a child user with 88% probability, i.e., the recognized voice object types are opposite and the corresponding probabilities are close. In the first case, the first voice recognition result obtained by the Gaussian mixture processing takes the voice object type A corresponding to the positive-order (or reverse-order) result, with a probability equal to the weighted sum of the two probabilities. In the second case, the voice object type B is the type corresponding to the larger of the two probabilities, again with a probability equal to their weighted sum. In the third case, the voice object type C denotes an undetermined object, with an uncertainty probability of 100%; object recognition must then be performed again until a definite first voice recognition result is obtained.
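A minimal sketch of this three-case fusion, assuming (type, probability) pairs as model outputs and an illustrative threshold for deciding when two probabilities count as "close" (the patent does not specify one):

```python
def fuse_results(pos, rev, w1=0.5, w2=0.5, close_gap=0.3):
    """Combine positive-order and reverse-order recognition results into
    the first voice recognition result following expressions (1)-(3).
    `pos` and `rev` are (object_type, probability) pairs; `w1`, `w2`
    are the preset calculation weights (e.g., each 0.5)."""
    (t_pos, p1), (t_rev, p2) = pos, rev
    weighted = w1 * p1 + w2 * p2
    if t_pos == t_rev:
        if abs(p1 - p2) <= close_gap:
            # Case (1): same type, close probabilities -> ["A", P(X)].
            return (t_pos, weighted)
        # Case (2): same type, widely differing probabilities ->
        # ["B", P(X)], B being the type under the larger probability.
        return (t_pos if p1 >= p2 else t_rev, weighted)
    # Case (3): types disagree -> ["C", 100%], object undetermined;
    # recognition must be repeated until a definite result is obtained.
    return ("undetermined", 1.0)

# Example: positive order says adult at 80%, reverse order adult at 85%.
print(fuse_results(("adult", 0.80), ("adult", 0.85)))  # ('adult', 0.825)
```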
S300, performing voice deep learning recognition on the first target voice and the second target voice based on a machine deep learning technique to obtain a second voice recognition result, where the second voice recognition result includes a voice object type and a corresponding recognition probability;
To further improve the accuracy of voice object recognition, the present application performs voice deep learning recognition on the first and second target voices based on a machine deep learning technique to obtain a second voice recognition result, so that the voice object type can be judged comprehensively from the first and second voice recognition results.
In one embodiment, step S300, performing deep learning recognition on the first target voice and the second target voice based on a machine deep learning technique to obtain a second voice recognition result, includes:
S310, recognizing and extracting the voice content in the first target voice and the second target voice respectively to obtain first voice content and second voice content;
S320, performing context association learning on the first voice content and the second voice content based on the machine deep learning technique to obtain a context association degree;
S330, querying an association mapping table according to the context association degree to obtain the second voice recognition result.
Specifically, the voice content is first extracted from the first and second target voices to obtain the first and second voice content; context association learning is then performed on the two based on the machine deep learning technique to obtain a context association degree; finally, a pre-trained association-degree-to-recognition-probability mapping table is queried with this degree to obtain the second voice recognition result. It can be understood that a high context association degree between the two voice contents indicates with high probability that the user's dialogue follows contextual logic and thus reaches an adult's level of intelligence or scope of thought, so the user is identified as an adult; a low context association degree indicates with high probability that the dialogue lacks contextual logic and sits within a child's scope of thought, so the user is identified as a child.
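A minimal sketch of the table lookup in step S330, assuming the association degree is a score in [0, 1] and using illustrative thresholds and probabilities (the patent leaves the concrete mapping to pre-training):

```python
# Illustrative association-degree -> (voice object type, probability) table,
# ordered from strong contextual logic (adult) to weak (child).
ASSOCIATION_MAP = [
    (0.8, ("adult", 0.95)),
    (0.5, ("adult", 0.70)),
    (0.2, ("child", 0.75)),
    (0.0, ("child", 0.90)),
]

def second_recognition_result(association_degree: float):
    """Query the association mapping table with the context association
    degree produced by the deep-learning model."""
    for threshold, result in ASSOCIATION_MAP:
        if association_degree >= threshold:
            return result
    return ("undetermined", 1.0)

print(second_recognition_result(0.9))  # ('adult', 0.95)
```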
S400, when the voice object types in the first voice recognition result and the second voice recognition result are consistent, determining the voice object type in either result as the target voice object type;
Specifically, when the voice object types in the first and second voice recognition results are consistent, the two recognition mechanisms agree, and the voice object type in either result is determined to be the target voice object type.
For example, when the first voice recognition result identifies an adult user with 80% probability and the second voice recognition result identifies an adult user with 85% probability, the target voice object type is adult user.
S500, determining, according to the target voice object type, the target voice acoustic category to be used when replying to that object type, where the target voice acoustic category is one of adult accent and child accent;
After the target voice object type is determined (i.e., the type of user talking with the machine has been judged), the target voice acoustic category to be used in replies is determined from it. For example, when the dialogue user is recognized as an adult, an adult accent is used for the dialogue; when the dialogue user is recognized as a child, a child accent is used. This meets the voice needs of different users and improves the interest and experience of the user's dialogue with the robot.
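A minimal sketch of this selection, using only the two categories the patent names (the dictionary keys and values are assumptions about how types and accents might be encoded):

```python
# Voice object type -> acoustic category used for replies.
ACOUSTIC_CATEGORY = {
    "adult": "adult_accent",
    "child": "child_accent",
}

def target_acoustic_category(voice_object_type: str) -> str:
    """Step S500: pick the reply accent from the target object type."""
    return ACOUSTIC_CATEGORY[voice_object_type]
```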
S600, generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user.
It can be understood that, to increase the user's enthusiasm and interest in conversing with the robot, and thus the robot's stickiness with users, after the voice dialogue object is determined the robot can not only converse in the matching accent but also select topics the user is interested in and converse in adapted voices.
In one embodiment, after the voice dialogue object type is determined, a voice dialogue topic can be determined from the voice content of the first target voice and/or the second target voice, and reply voices are then generated according to that topic for the voice dialogue with the target user.
Specifically, if the dialogue user is an adult and recognition of the voice content of the first and/or second target voice finds the user interested in, say, the topic of buying clothing in the mall, the voice dialogue is conducted under the clothing topic as much as possible; if the dialogue user is a child and recognition of the voice content finds the user interested in a particular children's topic, the voice dialogue is conducted under that topic as much as possible, thereby increasing the user's stickiness with the robot.
In one embodiment, generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user includes:
acquiring a preset speech rate for the voice reply according to the target voice acoustic category;
and generating reply voices at the preset speech rate to conduct a voice dialogue with the target user.
Specifically, the preset speech rate for voice replies is obtained according to the user category, and the dialogue is conducted at that rate. If the user's speech rate changes, the preset rate can be adjusted according to the target user's speech rate during the dialogue to obtain a target speech rate, so that the dialogue proceeds at a suitable target speech rate and the user's voice dialogue experience is further improved.
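A minimal sketch of the speech-rate handling, with illustrative preset rates and a simple blend toward the user's observed rate (the patent does not prescribe concrete numbers or an adjustment rule):

```python
# Illustrative preset reply speech rates per acoustic category
# (e.g., characters per minute).
PRESET_RATE = {"adult_accent": 220, "child_accent": 150}

def target_speech_rate(category: str, user_rate: float, blend: float = 0.5) -> float:
    """Start from the preset rate for the acoustic category, then move it
    toward the target user's observed speech rate during the dialogue."""
    base = PRESET_RATE[category]
    return (1.0 - blend) * base + blend * user_rate

print(target_speech_rate("child_accent", 120.0))  # 135.0
```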
In other embodiments, after the first voice recognition result and the second voice recognition result are obtained, the method further includes:
S700, when the voice object types in the first voice recognition result and the second voice recognition result are inconsistent, determining the target voice acoustic category according to the two results, where the target voice acoustic category corresponds to whichever result has the larger prediction probability;
S800, generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user;
S900, performing deep learning recognition based on the machine deep learning technique during the voice dialogue with the target user to obtain a third voice recognition result;
and S1000, maintaining or switching the target voice acoustic category according to the third voice recognition result.
Specifically, suppose the voice object types in the first and second voice recognition results are inconsistent; for example, the first voice recognition result identifies an adult user with 80% probability and the second identifies a child user with 90% probability. The result with the larger prediction probability is then selected: here the child recognition probability (90%) is larger, so the child is taken as the target object and the child accent as the target voice acoustic category. The voice object type can then be re-determined through machine learning recognition during the dialogue: if the re-determined voice object type (the third voice recognition result) is the same as the target voice acoustic category, the category is kept unchanged; if it differs, the target voice acoustic category is switched to the one corresponding to the third voice recognition result. In this way the accent of the dialogue is adjusted during the conversation to suit the user's real needs and improve the dialogue effect.
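A minimal sketch of steps S700-S1000, assuming (type, probability) pairs for the first two results and a plain type label for the third:

```python
def pick_initial_category(first, second) -> str:
    """Step S700: when the two results disagree, use the voice object
    type from whichever result has the larger prediction probability."""
    return first[0] if first[1] >= second[1] else second[0]

def maintain_or_switch(current_type: str, third_result: str) -> str:
    """Steps S900-S1000: keep the current acoustic category if the third
    recognition result agrees with it, otherwise switch to the category
    corresponding to the third result."""
    return current_type if third_result == current_type else third_result

# Example: adult at 80% vs child at 90% -> start with the child accent;
# later in-dialogue recognition says adult -> switch to the adult accent.
category = pick_initial_category(("adult", 0.80), ("child", 0.90))  # 'child'
category = maintain_or_switch(category, "adult")                    # 'adult'
```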
Based on the above, the artificial-intelligence-based voice dialogue method can accurately identify the user type through multiple recognition mechanisms and supply the corresponding voice acoustic category for the dialogue, thereby meeting the voice dialogue needs of different users and scenarios, improving the voice dialogue experience, and increasing the attractiveness of the voice dialogue robot to different users.
An embodiment of the present application further provides a voice dialogue system. Referring to FIG. 3, a schematic diagram of the hardware structure of the voice dialogue system provided by some embodiments of the application, the voice dialogue system includes a memory 110 and a processor 120; the memory 110 is configured to store program code, and the processor 120 is configured to invoke the program code to perform the method described above.
The processor 120 is configured to provide computing and control capabilities to control the voice dialogue system to perform corresponding tasks, for example to perform the artificial-intelligence-based voice dialogue method of any of the method embodiments described above, the method including: acquiring a first target voice input by a target user and triggering a first feedback voice according to it; obtaining a second target voice input by the target user in reply to the first feedback voice, and inputting the first and second target voices into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result, where the first voice recognition result includes a voice object type and a corresponding recognition probability; performing voice deep learning recognition on the two target voices based on a machine deep learning technique to obtain a second voice recognition result, where the second voice recognition result includes a voice object type and a corresponding recognition probability; when the voice object types in the two results are consistent, determining that type as the target voice object type; determining, according to the target voice object type, the target voice acoustic category to be used in replies, where the target voice acoustic category is one of adult accent and child accent; and generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user.
The processor 120 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 110 serves as a non-transitory computer-readable storage medium for storing non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the artificial-intelligence-based voice dialogue method in the embodiments of the present application. The processor 120 implements the artificial-intelligence-based voice dialogue method of any of the method embodiments described above by running the non-transitory software programs, instructions, and modules stored in the memory 110.
Specifically, the memory 110 may include volatile memory (VM), such as random access memory (RAM); it may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or another non-transitory solid-state storage device; the memory 110 may also include a combination of the above types of memory.
In summary, the voice dialogue system of the present application adopts the technical solution of any of the above embodiments of the artificial-intelligence-based voice dialogue method, so it has at least the beneficial effects brought by those technical solutions, which are not repeated here.
Embodiments of the present application also provide a computer-readable storage medium, such as a memory, including program code executable by a processor to perform the artificial-intelligence-based voice dialogue method of the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
Embodiments of the present application also provide a computer program product including one or more pieces of program code stored in a computer-readable storage medium. A processor of the electronic device reads the program code from the computer-readable storage medium and executes it to perform the steps of the artificial-intelligence-based voice dialogue method provided in the above embodiments.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
It should be noted that the above-described apparatus embodiments are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of a given embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general-purpose hardware platform, or by hardware alone. All or part of the processes implementing the methods of the above embodiments may be completed by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing description covers only the preferred embodiments of the present invention and is not intended to limit the scope of the invention; all equivalent structural changes made using the contents of the description and drawings of the present invention, or their direct or indirect application in other related technical fields, are likewise included in the patent protection scope of the invention.

Claims (7)

where vector expression (1) is the first voice recognition result obtained when the positive-order recognition result is consistent with the reverse-order recognition result; vector expression (2) is the first voice recognition result obtained when the two results are not completely consistent; vector expression (3) is the first voice recognition result obtained when the two results are inconsistent; "A", "B", and "C" are the voice object types obtained by the Gaussian mixture processing; P(X) is the recognition probability obtained by the Gaussian mixture processing; P1 and ω1 are the recognition probability and corresponding calculation weight under the positive-order recognition result; and P2 and ω2 are the recognition probability and corresponding calculation weight under the reverse-order recognition result.
CN202410567703.6A | 2024-05-09 | 2024-05-09 | Speech dialogue method and system based on artificial intelligence | Active | CN118173093B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410567703.6A (CN118173093B) | 2024-05-09 | 2024-05-09 | Speech dialogue method and system based on artificial intelligence

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410567703.6A (CN118173093B) | 2024-05-09 | 2024-05-09 | Speech dialogue method and system based on artificial intelligence

Publications (2)

Publication Number | Publication Date
CN118173093A (en) | 2024-06-11
CN118173093B (en) | 2024-07-02

Family

Family ID: 91348981

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410567703.6A (CN118173093B, Active) | Speech dialogue method and system based on artificial intelligence | 2024-05-09 | 2024-05-09

Country Status (1)

Country | Link
CN | CN118173093B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108564943A (en)* | 2018-04-27 | 2018-09-21 | BOE Technology Group Co., Ltd. | Voice interaction method and system
CN109147800A (en)* | 2018-08-30 | 2019-01-04 | Baidu Online Network Technology (Beijing) Co., Ltd. | Answer method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP6077957B2 (en)* | 2013-07-08 | 2017-02-08 | Honda Motor Co., Ltd. | Audio processing apparatus, audio processing method, and audio processing program
CN110718234A (en)* | 2019-09-02 | 2020-01-21 | Jiangsu Normal University | Acoustic scene classification method based on semantic segmentation encoder-decoder network
CN112420022B (en)* | 2020-10-21 | 2024-05-10 | Zhejiang Tonghuashun Intelligent Technology Co., Ltd. | Noise extraction method, device, equipment and storage medium


Also Published As

Publication Number | Publication Date
CN118173093A (en) | 2024-06-11

Similar Documents

Publication | Title
US10747959B2 (en) | Dialog generation method, apparatus, and electronic device
KR102437944B1 (en) | Voice wake-up method and device
CN110444199B (en) | Voice keyword recognition method and device, terminal and server
CN106599196A (en) | Artificial intelligence conversation method and system
CN108920510A (en) | Automatic chatting method, device and electronic equipment
CN111081280A (en) | Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN108763495A (en) | Interactive method, system, electronic equipment and storage medium
US12002451B1 (en) | Automatic speech recognition
KR101945983B1 (en) | Method for determining a best dialogue pattern for achieving a goal, method for determining an estimated probability of achieving a goal at a point of a dialogue session associated with a conversational AI service system, and computer readable recording medium
CN112632242A (en) | Intelligent conversation method and device and electronic equipment
TW201919042A (en) | Voice interactive device and voice interaction method using the same
CN113569032A (en) | Conversational recommendation method, device and equipment
CN115640398A (en) | Comment generation model training method, comment generation method, device and storage medium
CN117556026A (en) | Data generation method, electronic device and storage medium
CN109961152B (en) | Personalized interaction method and system of virtual idol, terminal equipment and storage medium
WO2022141142A1 (en) | Method and system for determining target audio and video
CN118173093B (en) | Speech dialogue method and system based on artificial intelligence
CN114049891A (en) | Information generation method and device, electronic equipment and storage medium
Irfan et al. | Coffee with a hint of data: Towards using data-driven approaches in personalised long-term interactions
CN115245682A (en) | Game interaction method, device, system and computer-readable storage medium
CN109165982A (en) | Method and apparatus for determining user purchase information
CN107846493A (en) | Call contact control method and device, storage medium and mobile terminal
CN115083412B (en) | Voice interaction method and related device, electronic equipment and storage medium
CN117078342A (en) | Product recommendation system and method based on deep learning
JP2013117842A (en) | Knowledge amount estimation information generating device, knowledge amount estimating device, method, and program

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
