
Speech dialogue method and system based on artificial intelligence

Info

Publication number
CN118173093B
CN118173093B (Application CN202410567703.6A)
Authority
CN
China
Prior art keywords
voice
recognition result
target
acoustic
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410567703.6A
Other languages
Chinese (zh)
Other versions
CN118173093A (en)
Inventor
程绍波
石玉山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Yuyun Technology Co ltd
Original Assignee
Liaoning Yuyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Yuyun Technology Co ltd
Priority to CN202410567703.6A
Publication of CN118173093A
Application granted
Publication of CN118173093B
Legal status: Active
Anticipated expiration

Abstract

The embodiments of the present application relate to the technical field of artificial intelligence and disclose a voice dialogue method and system based on artificial intelligence. The method first performs acoustic model recognition and machine learning recognition on several voices input by a user to determine the voice object type, and then selects the corresponding target voice acoustic category according to that type to generate reply voices for a voice dialogue with the user. The user type is thus accurately identified through multiple recognition mechanisms, so that the matching voice acoustic category can be supplied for the dialogue, meeting the voice dialogue needs of different users and scenarios and improving the voice dialogue experience.

Description

Speech dialogue method and system based on artificial intelligence
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a voice dialogue method and system based on artificial intelligence.
Background
With advances in science and technology, supermarket service robots have become increasingly intelligent and multifunctional: beyond traditional tasks such as shopping assistance and cleaning, they can also provide merchandise recommendation, intelligent shopping guidance, and similar functions through artificial intelligence technology. However, the functions of such robots remain confined to these application scenarios and cannot satisfy the voice dialogue needs of a wider range of scenarios; for example, they cannot engage children in games or topical conversation matched to a child's level of intelligence, which leaves them insufficiently attractive to consumers.
Disclosure of Invention
The main purpose of the present invention is to provide a voice dialogue method and system based on artificial intelligence, aiming to solve the technical problem in the prior art that intelligent robots cannot meet the voice dialogue needs of a wider range of scenarios and are therefore insufficiently attractive to consumers.
To achieve the above object, in a first aspect, an embodiment of the present application provides a voice dialogue method based on artificial intelligence, the method including:
acquiring a first target voice input by a target user, and triggering a first feedback voice according to the first target voice;
obtaining a second target voice input by the target user in reply to the first feedback voice, and inputting the first target voice and the second target voice into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result, where the first voice recognition result includes a voice object type and a corresponding recognition probability;
performing voice deep learning recognition on the first target voice and the second target voice based on a machine deep learning technique to obtain a second voice recognition result, where the second voice recognition result includes a voice object type and a corresponding recognition probability;
when the voice object types in the first voice recognition result and the second voice recognition result are consistent, determining the voice object type in either result as the target voice object type;
determining, according to the target voice object type, the target voice acoustic category to be used when replying to that object type, where the target voice acoustic category is one of adult accent and child accent;
and generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user.
Further, after the first voice recognition result and the second voice recognition result are obtained, the method further includes:
when the voice object types in the first voice recognition result and the second voice recognition result are inconsistent, determining the target voice acoustic category according to the two results, where the target voice acoustic category corresponds to whichever result has the larger prediction probability;
generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user;
performing voice deep learning recognition based on the machine deep learning technique during the voice dialogue with the target user to obtain a third voice recognition result;
and maintaining or switching the target voice acoustic category according to the third voice recognition result.
Further, maintaining or switching the target voice acoustic category according to the third voice recognition result includes:
if the target voice acoustic category is the same as the third voice recognition result, keeping the target voice acoustic category unchanged;
and if the target voice acoustic category differs from the third voice recognition result, switching the target voice acoustic category to the voice acoustic category corresponding to the third voice recognition result.
Further, inputting the first target voice and the second target voice into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result includes:
splicing the first target voice and the second target voice to form a positive-order voice group and a reverse-order voice group;
and inputting the positive-order voice group and the reverse-order voice group respectively into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result.
Further, inputting the positive-order voice group and the reverse-order voice group respectively into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result includes:
inputting the positive-order voice group into the pre-trained acoustic model for voice object recognition to obtain a positive-order recognition result;
inputting the reverse-order voice group into the pre-trained acoustic model for voice object recognition to obtain a reverse-order recognition result;
and performing Gaussian mixture processing on the positive-order recognition result and the reverse-order recognition result to obtain the first voice recognition result, where the Gaussian mixture processing satisfies the following vector expressions:

["A", P(X)], P(X) = ω1·P1 + ω2·P2   (1)
["B", P(X)], P(X) = ω1·P1 + ω2·P2   (2)
["C", 100%]   (3)

where vector expression (1) is the first voice recognition result obtained when the positive-order recognition result is consistent with the reverse-order recognition result; vector expression (2) is the first voice recognition result obtained when the two results are not completely consistent; vector expression (3) is the first voice recognition result obtained when the two results are inconsistent; "A", "B", and "C" are the voice object types obtained by the Gaussian mixture processing; P(X) is the recognition probability obtained by the Gaussian mixture processing; P1 and ω1 are the recognition probability and corresponding calculation weight under the positive-order recognition result; and P2 and ω2 are the recognition probability and corresponding calculation weight under the reverse-order recognition result.
Further, performing voice deep learning recognition on the first target voice and the second target voice based on the machine deep learning technique to obtain a second voice recognition result includes:
recognizing and extracting the voice content in the first target voice and the second target voice respectively to obtain first voice content and second voice content;
performing context association learning on the first voice content and the second voice content based on the machine deep learning technique to obtain a context association degree;
and querying an association mapping table according to the context association degree to obtain the second voice recognition result.
Further, after determining the target voice acoustic category to be used when replying to the target voice object type, the method further includes:
determining a voice dialogue topic according to the voice content of the first target voice and/or the second target voice;
and generating reply voices according to the voice dialogue topic to conduct a voice dialogue with the target user.
Further, generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user includes:
acquiring a preset speech rate for the voice reply according to the target voice acoustic category;
and generating reply voices at the preset speech rate to conduct a voice dialogue with the target user.
Further, after generating reply voices at the preset speech rate to conduct a voice dialogue with the target user, the method further includes:
during the voice dialogue with the target user, adjusting the preset speech rate according to the target user's speech rate to obtain a target speech rate;
and generating reply voices at the target speech rate to conduct a voice dialogue with the target user.
In a second aspect, an embodiment of the present application further provides a voice dialogue system, including a memory for storing program code and a processor for invoking the program code to perform the method according to the first aspect.
Compared with the prior art, the artificial-intelligence-based voice dialogue method provided by the embodiments of the present application first acquires a first target voice input by a target user and triggers a first feedback voice according to it; then obtains a second target voice input by the target user in reply to the first feedback voice, and inputs the first and second target voices into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result; performs voice deep learning recognition on the two target voices based on a machine deep learning technique to obtain a second voice recognition result; when the voice object types in the two results are consistent, determines that type as the target voice object type; then determines, according to the target voice object type, the target voice acoustic category to be used when replying to it; and finally generates reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user. In short, acoustic model recognition and machine learning recognition are applied jointly to several voices input by the user to obtain the voice object type, and the corresponding target voice acoustic category is then selected according to that type to generate reply voices for the dialogue. The user type is thus accurately identified through multiple recognition mechanisms, so that the matching voice acoustic category can be supplied for the dialogue, meeting the voice dialogue needs of different users and scenarios and improving the voice dialogue experience.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a voice dialogue method according to some embodiments of the application;
FIG. 2 is a flowchart of a voice dialogue method according to other embodiments of the application;
FIG. 3 is a schematic diagram of the hardware structure of a voice dialogue system according to some embodiments of the application.
The achievement of the objects, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed Description
The following clearly and completely describes the embodiments of the present invention with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the protection scope of the invention.
It should be noted that all directional indicators (such as up, down, left, right, front, and rear) in the embodiments of the present invention are merely used to explain the relative positional relationships, movements, and the like between components in a particular posture (as shown in the drawings); if the particular posture changes, the directional indicator changes accordingly.
Furthermore, the descriptions "first", "second", and the like in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features; thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, "and/or" throughout this document covers three schemes: taking A and/or B as an example, it includes the technical scheme of A alone, the technical scheme of B alone, and the technical scheme in which both A and B are satisfied. The technical solutions of the embodiments may be combined with one another, but only on the basis that they can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, that combination should be considered absent and outside the protection scope claimed in the present invention.
With advances in science and technology, supermarket service robots have become increasingly intelligent and multifunctional: beyond traditional tasks such as shopping assistance and cleaning, they can also provide merchandise recommendation, intelligent shopping guidance, and similar functions through artificial intelligence technology. However, the functions of such robots remain confined to these application scenarios and cannot satisfy the voice dialogue needs of a wider range of scenarios; for example, they cannot engage children in games or topical conversation matched to a child's level of intelligence, which leaves them insufficiently attractive to consumers.
In view of these problems, the present application provides a voice dialogue method based on artificial intelligence. Its general idea is as follows: when a user holds a voice dialogue with a machine, acoustic model recognition and machine learning recognition are performed on several voices input by the user to determine the type of user engaged in the dialogue (child or adult), and the corresponding voice acoustic category is selected according to that user type. That is, when the user is judged to be a child, the dialogue is conducted with a child's accent and with topics matched to a child's level of intelligence (such as nursery rhymes and games); when the user is judged to be an adult, the dialogue is conducted with an adult's accent and with topics matched to an adult's level of intelligence (such as shopping recommendations and shopping guidance). This meets the voice dialogue needs of different users and scenarios, improves the voice dialogue experience, and increases the attractiveness of the supermarket robot to different groups of people.
Referring to FIGS. 1-2, the specific steps of the artificial-intelligence-based voice dialogue method are described below. It should be understood that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one illustrated here. Referring to FIG. 1, the method includes the following steps:
S100, acquiring a first target voice input by a target user, and triggering a first feedback voice according to the first target voice;
Before a user holds a dialogue with the intelligent robot, the user needs to wake the machine up, and the wake-up voice can serve as the first target voice. For example, when the user wakes the intelligent robot with "Little Leopard, good afternoon", that wake-up voice is taken as the first target voice. The intelligent robot system then triggers a feedback voice based on the wake-up voice; for example, the system replies to the user with "Hello, what can I help you with?".
S200, obtaining a second target voice input by the target user in reply to the first feedback voice, and inputting the first target voice and the second target voice into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result, where the first voice recognition result includes a voice object type and a corresponding recognition probability;
After the user receives the first feedback voice, the user replies to it, for example with "I want to find the XXX shop", and the system takes "I want to find the XXX shop" as the second target voice, completing the acquisition of the first and second target voices. The user type can then be judged based on the acoustic model: recognition may use only the first target voice, only the second target voice, or both target voices together.
To improve the accuracy of object type (user category) recognition, the present application inputs both the first target voice and the second target voice into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result.
It can be understood that, since the initial pronunciation of each voice segment has different characteristics, the order in which multiple voice segments are input into the acoustic model has a certain influence on the recognition result. In one embodiment, the step of inputting the first target voice and the second target voice into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result includes:
splicing the first target voice and the second target voice to form a positive-order voice group and a reverse-order voice group;
and inputting the positive-order voice group and the reverse-order voice group respectively into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result.
Specifically, the positive-order and reverse-order voice groups are voice combinations formed by splicing the first and second target voices in order: if the first target voice is A and the second target voice is B, then A-B is the positive-order voice group and B-A is the reverse-order voice group. After the two groups are obtained, either one can be selected at random and input into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result, or both groups can be input into the model simultaneously to obtain the first voice recognition result.
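As an illustration, a minimal sketch of the splicing step in Python (the NumPy waveform representation and 1-D concatenation are assumptions; the patent does not name an audio library or model interface):

```python
import numpy as np

def build_voice_groups(first_voice: np.ndarray, second_voice: np.ndarray):
    """Splice the two target voices into a positive-order group (A-B)
    and a reverse-order group (B-A), as described above."""
    positive_group = np.concatenate([first_voice, second_voice])  # A-B
    reverse_group = np.concatenate([second_voice, first_voice])   # B-A
    return positive_group, reverse_group
```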
In one embodiment, the step of inputting the positive-order voice group and the reverse-order voice group respectively into the pre-trained acoustic model for voice object recognition to obtain the first voice recognition result includes:
inputting the positive-order voice group into the pre-trained acoustic model for voice object recognition to obtain a positive-order recognition result;
inputting the reverse-order voice group into the pre-trained acoustic model for voice object recognition to obtain a reverse-order recognition result;
and performing Gaussian mixture processing on the positive-order recognition result and the reverse-order recognition result to obtain the first voice recognition result, where the Gaussian mixture processing satisfies the following vector expressions:

["A", P(X)], P(X) = ω1·P1 + ω2·P2   (1)
["B", P(X)], P(X) = ω1·P1 + ω2·P2   (2)
["C", 100%]   (3)

where vector expression (1) is the first voice recognition result obtained when the positive-order recognition result is consistent with the reverse-order recognition result; vector expression (2) is the first voice recognition result obtained when the two results are not completely consistent; vector expression (3) is the first voice recognition result obtained when the two results are inconsistent; "A", "B", and "C" are the voice object types obtained by the Gaussian mixture processing; P(X) is the recognition probability obtained by the Gaussian mixture processing; P1 and ω1 are the recognition probability and corresponding calculation weight under the positive-order recognition result; and P2 and ω2 are the recognition probability and corresponding calculation weight under the reverse-order recognition result. The calculation weights ω1 and ω2 may be preset in the system, for example each set to 0.5.
Specifically, the results produced by voice object recognition on the positive-order and reverse-order voice groups fall roughly into three cases. First case: the positive-order and reverse-order recognition results are consistent; for example, the positive-order result identifies an adult user with 80% probability and the reverse-order result identifies an adult user with 85% probability, i.e., the recognized voice object types are the same and the corresponding probabilities are close. Second case: the results are not completely consistent; for example, the positive-order result identifies an adult user with 80% probability and the reverse-order result identifies an adult user with 20% probability, i.e., the recognized voice object types are the same but the corresponding probabilities differ widely. Third case: the results are inconsistent; for example, the positive-order result identifies an adult user with 80% probability and the reverse-order result identifies a child user with 88% probability, i.e., the recognized voice object types are opposite and the corresponding probabilities are close. In the first case, the first voice recognition result obtained by the Gaussian mixture processing takes the voice object type A corresponding to the positive-order (or reverse-order) result, with a probability equal to the weighted sum of the two probabilities. In the second case, the voice object type B is the type corresponding to the larger of the two probabilities, again with a probability equal to their weighted sum. In the third case, the voice object type C denotes an undetermined object, with an uncertainty probability of 100%; object recognition must then be performed again until a definite first voice recognition result is obtained.
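A minimal sketch of this three-case fusion, assuming (type, probability) pairs as model outputs and an illustrative threshold for deciding when two probabilities count as "close" (the patent does not specify one):

```python
def fuse_results(pos, rev, w1=0.5, w2=0.5, close_gap=0.3):
    """Combine positive-order and reverse-order recognition results into
    the first voice recognition result following expressions (1)-(3).
    `pos` and `rev` are (object_type, probability) pairs; `w1`, `w2`
    are the preset calculation weights (e.g., each 0.5)."""
    (t_pos, p1), (t_rev, p2) = pos, rev
    weighted = w1 * p1 + w2 * p2
    if t_pos == t_rev:
        if abs(p1 - p2) <= close_gap:
            # Case (1): same type, close probabilities -> ["A", P(X)].
            return (t_pos, weighted)
        # Case (2): same type, widely differing probabilities ->
        # ["B", P(X)], B being the type under the larger probability.
        return (t_pos if p1 >= p2 else t_rev, weighted)
    # Case (3): types disagree -> ["C", 100%], object undetermined;
    # recognition must be repeated until a definite result is obtained.
    return ("undetermined", 1.0)

# Example: positive order says adult at 80%, reverse order adult at 85%.
print(fuse_results(("adult", 0.80), ("adult", 0.85)))  # ('adult', 0.825)
```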
S300, performing voice deep learning recognition on the first target voice and the second target voice based on a machine deep learning technique to obtain a second voice recognition result, where the second voice recognition result includes a voice object type and a corresponding recognition probability;
To further improve the accuracy of voice object recognition, the present application performs voice deep learning recognition on the first and second target voices based on a machine deep learning technique to obtain a second voice recognition result, so that the voice object type can be judged comprehensively from the first and second voice recognition results.
In one embodiment, step S300, performing deep learning recognition on the first target voice and the second target voice based on a machine deep learning technique to obtain a second voice recognition result, includes:
S310, recognizing and extracting the voice content in the first target voice and the second target voice respectively to obtain first voice content and second voice content;
S320, performing context association learning on the first voice content and the second voice content based on the machine deep learning technique to obtain a context association degree;
S330, querying an association mapping table according to the context association degree to obtain the second voice recognition result.
Specifically, the voice content is first extracted from the first and second target voices to obtain the first and second voice content; context association learning is then performed on the two based on the machine deep learning technique to obtain a context association degree; finally, a pre-trained association-degree-to-recognition-probability mapping table is queried with this degree to obtain the second voice recognition result. It can be understood that a high context association degree between the two voice contents indicates with high probability that the user's dialogue follows contextual logic and thus reaches an adult's level of intelligence or scope of thought, so the user is identified as an adult; a low context association degree indicates with high probability that the dialogue lacks contextual logic and sits within a child's scope of thought, so the user is identified as a child.
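A minimal sketch of the table lookup in step S330, assuming the association degree is a score in [0, 1] and using illustrative thresholds and probabilities (the patent leaves the concrete mapping to pre-training):

```python
# Illustrative association-degree -> (voice object type, probability) table,
# ordered from strong contextual logic (adult) to weak (child).
ASSOCIATION_MAP = [
    (0.8, ("adult", 0.95)),
    (0.5, ("adult", 0.70)),
    (0.2, ("child", 0.75)),
    (0.0, ("child", 0.90)),
]

def second_recognition_result(association_degree: float):
    """Query the association mapping table with the context association
    degree produced by the deep-learning model."""
    for threshold, result in ASSOCIATION_MAP:
        if association_degree >= threshold:
            return result
    return ("undetermined", 1.0)

print(second_recognition_result(0.9))  # ('adult', 0.95)
```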
S400, when the voice object types in the first voice recognition result and the second voice recognition result are consistent, determining the voice object type in either result as the target voice object type;
Specifically, when the voice object types in the first and second voice recognition results are consistent, the two recognition mechanisms agree, and the voice object type in either result is determined to be the target voice object type.
For example, when the first voice recognition result identifies an adult user with 80% probability and the second voice recognition result identifies an adult user with 85% probability, the target voice object type is adult user.
S500, determining, according to the target voice object type, the target voice acoustic category to be used when replying to that object type, where the target voice acoustic category is one of adult accent and child accent;
After the target voice object type is determined (i.e., the type of user talking with the machine has been judged), the target voice acoustic category to be used in replies is determined from it. For example, when the dialogue user is recognized as an adult, an adult accent is used for the dialogue; when the dialogue user is recognized as a child, a child accent is used. This meets the voice needs of different users and improves the interest and experience of the user's dialogue with the robot.
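A minimal sketch of this selection, using only the two categories the patent names (the dictionary keys and values are assumptions about how types and accents might be encoded):

```python
# Voice object type -> acoustic category used for replies.
ACOUSTIC_CATEGORY = {
    "adult": "adult_accent",
    "child": "child_accent",
}

def target_acoustic_category(voice_object_type: str) -> str:
    """Step S500: pick the reply accent from the target object type."""
    return ACOUSTIC_CATEGORY[voice_object_type]
```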
S600, generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user.
It can be understood that, to increase the user's enthusiasm and interest in conversing with the robot, and thus the robot's stickiness with users, after the voice dialogue object is determined the robot can not only converse in the matching accent but also select topics the user is interested in and converse in adapted voices.
In one embodiment, after the voice dialogue object type is determined, a voice dialogue topic can be determined from the voice content of the first target voice and/or the second target voice, and reply voices are then generated according to that topic for the voice dialogue with the target user.
Specifically, if the dialogue user is an adult and recognition of the voice content of the first and/or second target voice finds the user interested in, say, the topic of buying clothing in the mall, the voice dialogue is conducted under the clothing topic as much as possible; if the dialogue user is a child and recognition of the voice content finds the user interested in a particular children's topic, the voice dialogue is conducted under that topic as much as possible, thereby increasing the user's stickiness with the robot.
In one embodiment, generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user includes:
acquiring a preset speech rate for the voice reply according to the target voice acoustic category;
and generating reply voices at the preset speech rate to conduct a voice dialogue with the target user.
Specifically, the preset speech rate for voice replies is obtained according to the user category, and the dialogue is conducted at that rate. If the user's speech rate changes, the preset rate can be adjusted according to the target user's speech rate during the dialogue to obtain a target speech rate, so that the dialogue proceeds at a suitable target speech rate and the user's voice dialogue experience is further improved.
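A minimal sketch of the speech-rate handling, with illustrative preset rates and a simple blend toward the user's observed rate (the patent does not prescribe concrete numbers or an adjustment rule):

```python
# Illustrative preset reply speech rates per acoustic category
# (e.g., characters per minute).
PRESET_RATE = {"adult_accent": 220, "child_accent": 150}

def target_speech_rate(category: str, user_rate: float, blend: float = 0.5) -> float:
    """Start from the preset rate for the acoustic category, then move it
    toward the target user's observed speech rate during the dialogue."""
    base = PRESET_RATE[category]
    return (1.0 - blend) * base + blend * user_rate

print(target_speech_rate("child_accent", 120.0))  # 135.0
```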
In other embodiments, after the first voice recognition result and the second voice recognition result are obtained, the method further includes:
S700, when the voice object types in the first voice recognition result and the second voice recognition result are inconsistent, determining the target voice acoustic category according to the two results, where the target voice acoustic category corresponds to whichever result has the larger prediction probability;
S800, generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user;
S900, performing deep learning recognition based on the machine deep learning technique during the voice dialogue with the target user to obtain a third voice recognition result;
and S1000, maintaining or switching the target voice acoustic category according to the third voice recognition result.
Specifically, suppose the voice object types in the first and second voice recognition results are inconsistent; for example, the first voice recognition result identifies an adult user with 80% probability and the second identifies a child user with 90% probability. The result with the larger prediction probability is then selected: here the child recognition probability (90%) is larger, so the child is taken as the target object and the child accent as the target voice acoustic category. The voice object type can then be re-determined through machine learning recognition during the dialogue: if the re-determined voice object type (the third voice recognition result) is the same as the target voice acoustic category, the category is kept unchanged; if it differs, the target voice acoustic category is switched to the one corresponding to the third voice recognition result. In this way the accent of the dialogue is adjusted during the conversation to suit the user's real needs and improve the dialogue effect.
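A minimal sketch of steps S700-S1000, assuming (type, probability) pairs for the first two results and a plain type label for the third:

```python
def pick_initial_category(first, second) -> str:
    """Step S700: when the two results disagree, use the voice object
    type from whichever result has the larger prediction probability."""
    return first[0] if first[1] >= second[1] else second[0]

def maintain_or_switch(current_type: str, third_result: str) -> str:
    """Steps S900-S1000: keep the current acoustic category if the third
    recognition result agrees with it, otherwise switch to the category
    corresponding to the third result."""
    return current_type if third_result == current_type else third_result

# Example: adult at 80% vs child at 90% -> start with the child accent;
# later in-dialogue recognition says adult -> switch to the adult accent.
category = pick_initial_category(("adult", 0.80), ("child", 0.90))  # 'child'
category = maintain_or_switch(category, "adult")                    # 'adult'
```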
Based on the above, the artificial-intelligence-based voice dialogue method can accurately identify the user type through multiple recognition mechanisms and supply the corresponding voice acoustic category for the dialogue, thereby meeting the voice dialogue needs of different users and scenarios, improving the voice dialogue experience, and increasing the attractiveness of the voice dialogue robot to different users.
An embodiment of the present application further provides a voice dialogue system. Referring to FIG. 3, a schematic diagram of the hardware structure of the voice dialogue system provided by some embodiments of the application, the voice dialogue system includes a memory 110 and a processor 120; the memory 110 is configured to store program code, and the processor 120 is configured to invoke the program code to perform the method described above.
The processor 120 is configured to provide computing and control capabilities to control the voice dialogue system to perform corresponding tasks, for example to perform the artificial-intelligence-based voice dialogue method of any of the method embodiments described above, the method including: acquiring a first target voice input by a target user and triggering a first feedback voice according to it; obtaining a second target voice input by the target user in reply to the first feedback voice, and inputting the first and second target voices into a pre-trained acoustic model for voice object recognition to obtain a first voice recognition result, where the first voice recognition result includes a voice object type and a corresponding recognition probability; performing voice deep learning recognition on the two target voices based on a machine deep learning technique to obtain a second voice recognition result, where the second voice recognition result includes a voice object type and a corresponding recognition probability; when the voice object types in the two results are consistent, determining that type as the target voice object type; determining, according to the target voice object type, the target voice acoustic category to be used in replies, where the target voice acoustic category is one of adult accent and child accent; and generating reply voices according to the target voice acoustic category to conduct a voice dialogue with the target user.
The processor 120 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 110 serves as a non-transitory computer-readable storage medium for storing non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the artificial-intelligence-based voice dialogue method in the embodiments of the present application. The processor 120 implements the artificial-intelligence-based voice dialogue method of any of the method embodiments described above by running the non-transitory software programs, instructions, and modules stored in the memory 110.
Specifically, the memory 110 may include volatile memory (VM), such as random access memory (RAM); it may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or another non-transitory solid-state storage device; the memory 110 may also include a combination of the above types of memory.
In summary, the voice dialogue system of the present application adopts the technical solution of any of the above embodiments of the artificial-intelligence-based voice dialogue method, so it has at least the beneficial effects brought by those technical solutions, which are not repeated here.
Embodiments of the present application also provide a computer-readable storage medium, such as a memory, including program code executable by a processor to perform the artificial-intelligence-based voice dialogue method of the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
Embodiments of the present application also provide a computer program product including one or more pieces of program code stored in a computer-readable storage medium. A processor of the electronic device reads the program code from the computer-readable storage medium and executes it to perform the steps of the artificial-intelligence-based voice dialogue method provided in the above embodiments.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
It should be noted that the above-described apparatus embodiments are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of a given embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general-purpose hardware platform, or by hardware alone. All or part of the processes implementing the methods of the above embodiments may be completed by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing description covers only the preferred embodiments of the present invention and is not intended to limit the scope of the invention; all equivalent structural changes made using the contents of the description and drawings of the present invention, or their direct or indirect application in other related technical fields, are likewise included in the patent protection scope of the invention.

Claims (7)

where vector expression (1) is the first voice recognition result obtained when the positive-order recognition result is consistent with the reverse-order recognition result; vector expression (2) is the first voice recognition result obtained when the two results are not completely consistent; vector expression (3) is the first voice recognition result obtained when the two results are inconsistent; "A", "B", and "C" are the voice object types obtained by the Gaussian mixture processing; P(X) is the recognition probability obtained by the Gaussian mixture processing; P1 and ω1 are the recognition probability and corresponding calculation weight under the positive-order recognition result; and P2 and ω2 are the recognition probability and corresponding calculation weight under the reverse-order recognition result.
CN202410567703.6A | 2024-05-09 | 2024-05-09 | Speech dialogue method and system based on artificial intelligence | Active | CN118173093B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410567703.6A (CN118173093B) | 2024-05-09 | 2024-05-09 | Speech dialogue method and system based on artificial intelligence

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410567703.6A (CN118173093B) | 2024-05-09 | 2024-05-09 | Speech dialogue method and system based on artificial intelligence

Publications (2)

Publication Number | Publication Date
CN118173093A (en) | 2024-06-11
CN118173093B (en) | 2024-07-02

Family

Family ID: 91348981

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410567703.6A (CN118173093B, Active) | Speech dialogue method and system based on artificial intelligence | 2024-05-09 | 2024-05-09

Country Status (1)

Country | Link
CN | CN118173093B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN108564943A (en)* | 2018-04-27 | 2018-09-21 | BOE Technology Group Co., Ltd. | Voice interaction method and system
CN109147800A (en)* | 2018-08-30 | 2019-01-04 | Baidu Online Network Technology (Beijing) Co., Ltd. | Answer method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP6077957B2 (en)* | 2013-07-08 | 2017-02-08 | Honda Motor Co., Ltd. | Audio processing apparatus, audio processing method, and audio processing program
CN110718234A (en)* | 2019-09-02 | 2020-01-21 | Jiangsu Normal University | Acoustic scene classification method based on semantic segmentation encoder-decoder network
CN112420022B (en)* | 2020-10-21 | 2024-05-10 | Zhejiang Tonghuashun Intelligent Technology Co., Ltd. | Noise extraction method, device, equipment and storage medium


Also Published As

Publication Number | Publication Date
CN118173093A (en) | 2024-06-11

Similar Documents

Publication | Title
US10747959B2 (en) | Dialog generation method, apparatus, and electronic device
KR102437944B1 (en) | Voice wake-up method and device
CN110444199B (en) | Voice keyword recognition method and device, terminal and server
CN106599196A (en) | Artificial intelligence conversation method and system
CN108920510A (en) | Automatic chatting method, device and electronic equipment
CN111081280A (en) | Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN108763495A (en) | Interactive method, system, electronic equipment and storage medium
US12002451B1 (en) | Automatic speech recognition
KR101945983B1 (en) | Method for determining a best dialogue pattern for achieving a goal, method for determining an estimated probability of achieving a goal at a point of a dialogue session associated with a conversational AI service system, and computer readable recording medium
CN112632242A (en) | Intelligent conversation method and device and electronic equipment
TW201919042A (en) | Voice interactive device and voice interaction method using the same
CN113569032A (en) | Conversational recommendation method, device and equipment
CN115640398A (en) | Comment generation model training method, comment generation method, device and storage medium
CN117556026A (en) | Data generation method, electronic device and storage medium
CN109961152B (en) | Personalized interaction method and system of virtual idol, terminal equipment and storage medium
WO2022141142A1 (en) | Method and system for determining target audio and video
CN118173093B (en) | Speech dialogue method and system based on artificial intelligence
CN114049891A (en) | Information generation method and device, electronic equipment and storage medium
Irfan et al. | Coffee with a hint of data: Towards using data-driven approaches in personalised long-term interactions
CN115245682A (en) | Game interaction method, device, system and computer-readable storage medium
CN109165982A (en) | Method and apparatus for determining user purchase information
CN107846493A (en) | Call contact control method and device, storage medium and mobile terminal
CN115083412B (en) | Voice interaction method and related device, electronic equipment and storage medium
CN117078342A (en) | Product recommendation system and method based on deep learning
JP2013117842A (en) | Knowledge amount estimation information generating device, knowledge amount estimating device, method, and program

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
