Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Voice interaction is communication between a person and a machine; the voice interaction system, or the UI (user interface) of the voice interaction system, serves as the communication channel between the person and the machine. The user can converse with the machine through the UI of the voice interaction system, and the machine can synthesize voice through the voice interaction system and reply to the user through the UI. The two-way communication described above may be regarded as duplex voice interaction. According to the order in which data is transmitted, duplex voice interaction can be further divided into two modes: an HDX (half duplex) mode and an FDX (full duplex) mode.
Duplex voice interaction in the related art is the HDX mode. In HDX voice interaction, the person and the machine can transmit data in only one direction at a time. HDX voice interaction is characterized by single rounds of dialogue: the machine returns to its default state after completing a round of dialogue. In addition, the machine cannot collect the voice signal during its own voice broadcast.
As an example, a typical HDX-based communication system is a two-way radio such as an intercom. The intercom uses a push-to-talk button to control the signal transmission channel, switching between the transmitter and the receiver to transmit or receive a signal. For example, the user can turn on the transmitter and turn off the receiver through the button so that his own voice is transmitted to the opposite terminal; during this process the intercom does not receive information transmitted by the opposite terminal, and the user cannot hear the voice of the remote user.
Compared to HDX voice interactions, FDX voice interactions allow a person and a machine to communicate at the same time, i.e., the person and the machine can speak at the same time. In the FDX voice interaction system, the voice information stream is transmitted through two physical channels, an uplink channel transmits the voice information stream from the user to the functional unit, and a downlink channel transmits the voice information stream from the functional unit to the user. The two physical channels should work simultaneously and not interfere with each other, so that the functional unit has hearing ability during conversation. The FDX voice interaction is characterized by being dialog-oriented, and the voice interaction system will maintain the continuity of the dialog, thereby ensuring that the user and the machine remain in the same state after two or more rounds of dialog. Furthermore, both the user and the machine may speak during the same time interval.
By way of example, a typical FDX-based communication system is a telephone, where local and remote users can speak and hear the voice of the opposite end simultaneously.
From a functional point of view, the voice interaction system needs to continuously receive user input data throughout the human-machine conversation process and provide feedback to the user through the UI of the FDX voice interaction system.
In summary, the FDX-based voice interaction mode can overcome the problems that the HDX-based voice interaction mode cannot meet the requirements of man-machine interaction and does not match actual dialogue habits. The FDX-based voice interaction system can be applied to various intelligent devices, such as smart phones, intelligent household appliances, intelligent assistant apps, customer service robots and the like. The interaction tasks of the FDX-based voice interaction system vary with the scene type and user requirements, and the application scenes of the interaction tasks may include, but are not limited to, telephony, navigation, home services, chat, etc. The embodiment of the invention provides a voice interaction system based on FDX; to facilitate understanding of the embodiment of the invention, the following first describes basic concepts involved:
duplex (duplex): a communication method capable of transmitting data bi-directionally.
Full Duplex (FDX): a communication method capable of simultaneously transmitting data in both directions.
Functional unit (functional unit): an entity of hardware or software or both capable of achieving a particular purpose; the functional units may be integrated into one system.
Half Duplex (HDX): a communication method capable of transmitting data bi-directionally, but transmitting data in only one direction at any time.
Microphone array (microphone array): a system consisting of a plurality of microphones with a determined spatial topology, used to sample and filter the spatial characteristics of a signal.
Voice interaction (speech interaction): the voice communication between the person and the machine is used for information transmission and communication.
Speech recognition (speech recognition): a process of converting human voice signals into text or instructions. Namely, a method of converting a voice signal into voice content by a functional unit; the content to be identified herein may be represented as a sequence of suitable words or phonemes.
Semantic understanding (semantic understanding): the functional unit is made to understand the intention of the person to speak.
Speech synthesis (speech synthesis): generating speech from the data mechanically or electronically; here, the voice may be generated from text, image, video, and audio. The text-to-speech conversion process is the primary way of speech interaction; the result of speech synthesis is also called artificial speech to distinguish from natural speech emitted through the vocal organs of a person.
Voice activity detection (voice activity detection, VAD): and analyzing and identifying effective voice starting points in the continuous voice stream.
Voice trigger (voice trigger)/voice wake-up: a process in which, after certain features or events are detected in the audio-stream monitoring state, the system switches to processing states such as command word recognition and continuous speech recognition.
Fig. 1 is a schematic structural diagram of a voice interaction system provided by the present invention, where, as shown in fig. 1, the voice interaction system includes:
a voice interaction resource 110 comprising voice interaction knowledge and resource data related to voice interaction tasks;
an acoustic collection component 120, configured to collect a user voice stream, where the acoustic collection component operates after a voice interaction wake-up;
a voice recognition component 130, configured to perform voice endpoint detection and voice recognition on the user voice stream based on the voice interaction resource, to obtain a recognition text of the user voice stream, where the voice endpoint detection is performed based on semantics of the user voice stream;
a dialogue processing component 140, configured to perform natural language understanding, dialogue management, and natural language generation on the recognized text based on the voice interaction resource, so as to obtain an interaction text of the recognized text;
a speech synthesis component 150, configured to perform a speech synthesis operation on the interactive text, so as to obtain an interactive synthesized speech of the interactive text;
a voice broadcasting component 160, configured to broadcast the interactive synthesized voice;
the collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
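For intuition, the following minimal Python sketch shows one way the components of fig. 1 could be wired together; all class and method names are illustrative assumptions rather than a defined interface of the system.

```python
# Minimal sketch of how the components of fig. 1 might be composed.
# All class and method names here are hypothetical illustrations.

class VoiceInteractionSystem:
    def __init__(self, resource, recognizer, dialogue, synthesizer, broadcaster):
        self.resource = resource        # voice interaction resource 110
        self.recognizer = recognizer    # voice recognition component 130
        self.dialogue = dialogue        # dialogue processing component 140
        self.synthesizer = synthesizer  # speech synthesis component 150
        self.broadcaster = broadcaster  # voice broadcasting component 160

    def on_uplink_audio(self, audio_chunk):
        # Uplink: semantic endpoint detection + streaming recognition.
        text = self.recognizer.recognize(audio_chunk, self.resource)
        if text is None:                # no complete utterance yet
            return
        # NLU -> dialogue management -> NLG, all driven by the resource.
        reply_text = self.dialogue.respond(text, self.resource)
        # Downlink: synthesize and broadcast without blocking the uplink.
        self.broadcaster.play(self.synthesizer.synthesize(reply_text))
```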
Specifically, the acoustic collection component 120 may be a single microphone or may be a microphone array including a plurality of microphones, which is not particularly limited in the embodiments of the present invention. The acoustic collection component 120 is configured to collect a user voice stream, that is, a voice data stream obtained during the voice interaction process, where the user voice stream is recorded in real time, and may specifically be obtained by recording voice or by recording video.
It should be noted that, the user voice stream herein may be a voice data stream recorded by a user for voice interaction, for example, a wake-up voice data stream for waking up voice interaction, for example, a voice data stream for querying specific information after waking up, or a voice data stream recorded when the user interrupts a voice played by the voice interaction system during voice interaction, which is not particularly limited in the embodiment of the present invention.
Here, the voice interaction wake-up may be implemented by the user triggering a voice control operation on the UI interface, or by the user inputting a voice signal containing a preset wake-up word such as "fly little", which is not limited in the embodiment of the present invention. It can be understood that, after a single voice interaction wake-up, the voice interaction system can complete the entire conversation process and realize multiple interactions; that is, the voice interaction system only needs to be woken up once at the beginning of a conversation, after which the user can speak continuously. The acoustic collection component 120 keeps operating for a period of time after the voice interaction wake-up to collect the user voice stream, which drives the subsequent components to operate during that period and carry out multiple rounds of dialogue continuously, thereby achieving the effect of one wake-up, multiple interactions. The specific length of this period may depend on the actual interaction between the user and the voice interaction system and may be understood as the duration of the entire conversation process. For example, the voice interaction system may collect a user voice stream containing valid speech during or after broadcasting the interactive synthesized voice, i.e., the user interrupts the broadcast of the interactive synthesized voice, or the user continues to converse with the voice interaction system after hearing the broadcast; the time for collecting the user voice stream after the voice interaction wake-up can therefore extend until the entire conversation process ends.
Whether the entire conversation process is finished can be judged by whether the acoustic collection component 120 collects a user voice stream containing valid speech within a preset time. For example, after the voice interaction system broadcasts the interactive synthesized voice, if no user voice stream containing valid speech is collected within the preset time, the conversation process can be considered finished and the acoustic collection component 120 can stop collecting. As another example, whether the entire conversation process is finished may be determined through two preset times: after the voice interaction system broadcasts the interactive synthesized voice, if no user voice stream containing valid speech is collected within a first preset time, the voice interaction system may generate and broadcast an interactive synthesized voice for active interaction to guide the user to continue the conversation; after that broadcast ends, if no user voice stream containing valid speech is collected within a second preset time, the conversation process can be considered finished and the acoustic collection component 120 can stop collecting. Moreover, voice interaction wake-up only needs to be triggered at the beginning of a dialogue and does not need to be triggered again during the dialogue.
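As a concrete illustration of the single- and double-timeout checks described above, the following sketch shows one possible end-of-dialogue decision; the timeout values and the helper methods are assumptions made for the example only.

```python
# Illustrative sketch of the two-preset-time end-of-dialogue check.
# FIRST_WAIT, SECOND_WAIT and the collector/prompt helpers are assumptions.

FIRST_WAIT = 8.0    # seconds to wait for valid speech after a broadcast ends
SECOND_WAIT = 8.0   # seconds to wait after the proactive guidance prompt

def dialogue_finished(collector, broadcast_prompt):
    """Return True when the whole conversation can be considered over."""
    if collector.wait_for_valid_speech(timeout=FIRST_WAIT):
        return False                    # user spoke again: keep the dialogue going
    broadcast_prompt("Is there anything else I can help you with?")  # active interaction
    if collector.wait_for_valid_speech(timeout=SECOND_WAIT):
        return False
    return True                         # no valid speech twice: stop collecting
```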
For real-time interaction on the user voice stream, a complete dialogue process executes functional units such as voice recognition, dialogue processing and voice synthesis in sequence, and the whole processing chain also depends on scenes, contexts, knowledge and data, as well as the computation methods implemented by each component. The FDX voice interaction system processes the input user voice stream or other input information and finally outputs synthesized voice or other information and instruction actions. The FDX voice interaction system can continuously receive various input signals, including but not limited to voice signals, information, requests and the like, transcribe the useful signals into text, extract semantic information from the transcribed text, predict and decide interaction tasks according to the semantic information, and provide output signals to the user according to the prediction and decision; output signals include but are not limited to synthesized voice, answers, information, behaviors and the like. The above conversation process can be divided into three phases, namely the speech recognition operation, the dialogue processing operation and the speech synthesis operation, and the execution of these three operation phases is implemented on the basis of the voice interaction resource 110.
The voice interaction resource 110 here is the data resource supporting real-time voice interaction execution; it refers to the relevant knowledge and data needed for understanding scenes and contexts, which in turn refer to different scene or language context information. The voice interaction resource 110 supports the functional components including the voice recognition component 130, the dialogue processing component 140, the speech synthesis component 150 and the like, and, combined with the information and dialogue data of these functional components, can make interaction decisions covering the usage scenario. The voice interaction resource 110 can be specifically classified into two types, namely voice interaction knowledge and resource data:
The voice interaction knowledge, that is, knowledge information for realizing voice interaction, may be used to provide solutions to questions posed by a user in voice interaction, and may also provide descriptions for program tasks in voice interaction, which is not specifically described in the embodiments of the present invention.
The resource data is the resource related to the voice interaction task, and different voice interaction tasks can be associated with different resource data. Each execution of voice interaction has an explicit voice interaction task; the voice interaction task reflects the specific goal and requirement that the voice interaction needs to address, and it can change with the scene type and user requirements; for example, the voice interaction task may be telephony, navigation, home service, chat and the like. The resource data related to the voice interaction task reflects the resource information required for solving the specific problem addressed by the voice interaction task; for example, for a voice interaction task in an automobile driving scene, the related resource information may include the location areas of interest to the user, map resources, statement information of route queries, and the like.
By combining the voice interaction resource 110, real-time interaction on the user voice stream can be realized. Specifically, the voice recognition component 130 combines the voice interaction resource and obtains the recognition text of the user voice stream through the voice recognition operation; the dialogue processing component 140 then combines the voice interaction resource and obtains, through the dialogue processing operation, the interaction text corresponding to the recognition text, the interaction text being the text-form information fed back to the user; finally, the speech synthesis component 150 combines the voice interaction resource and synthesizes, through the speech synthesis operation, the voice corresponding to the interaction text, thereby obtaining the interactive synthesized voice.
In this process, the speech recognition component 130, the dialogue processing component 140 and the speech synthesis component 150 can each be implemented by calling a single functional component, each functional component can be used and tested independently, and the functional components can be integrated together to implement real-time interaction.
Fig. 2 is a schematic diagram of the functional components of voice interaction provided by the present invention. As shown in fig. 2, the voice interaction between a user and the voice interaction system can be understood as an execution flow of "listening", "cognition", "understanding" and "expression". For the voice interaction system, this flow may be implemented by four functional components, "acoustic processing", "voice recognition", "dialogue processing" and "speech synthesis", corresponding to the acoustic collection component 120, the voice recognition component 130, the dialogue processing component 140 and the speech synthesis component 150. The input of the voice interaction system may be a user voice stream responded to or input by the user. On the voice interaction system side, the user voice stream is first acoustically processed by a microphone or microphone array in the acoustic collection component 120, corresponding to the "listening" stage, which specifically involves voice pickup and voice preprocessing. After acoustic processing is completed, voice recognition is performed on the user voice stream, corresponding to the "cognition" stage, to obtain the recognition text of the user voice stream. Dialogue processing is then performed on the recognition text, corresponding to the "understanding" stage, to obtain the interaction text for the interactive dialogue. Finally, through speech synthesis, corresponding to the "expression" stage, the interactive synthesized voice corresponding to the interaction text is generated and used as the output for broadcasting.
Further, the implementation of the speech recognition component 130 depends on the voice interaction resource and the speech recognition algorithm. The speech recognition algorithm here may include a streaming speech recognition algorithm, a semantic-based voice endpoint detection algorithm, or both, and may further include an irrelevant content rejection algorithm. Continuous speech recognition refers to the process of converting a user's continuous sound signal into text or instructions, and the streaming speech recognition algorithm, also referred to as a continuous speech recognition algorithm (Continuous ASR), is used for recognizing a continuous speech stream. The semantic-based voice endpoint detection algorithm (Semantic VAD) understands the semantics of the speech to obtain the discrimination result of voice activity frames; semantic-based VAD aims to identify and remove silence periods from the user voice stream. Compared with traditional acoustic VAD, semantic-based VAD combines semantic features when distinguishing speech from non-speech, which further improves the reliability of voice endpoint detection; during interaction, by applying semantic-based VAD the FDX voice interaction system can wait intelligently through pauses in the user's input, thereby realizing continuous dialogue. Irrelevant content rejection refers to rejecting invalid speech input, such as scene noise and echoes, by analyzing and deciding on the speech signal; the purpose of the irrelevant content rejection algorithm (Irrelevant content rejection) is to distinguish and reject content in the user's voice stream that cannot or should not be processed. Such content is generally unrelated to the interaction task and the dialogue topic or context, and may also include invalid speech (e.g., noise, background sounds, chit-chat, etc.). A disambiguation effect can also be achieved through the irrelevant content rejection algorithm.
The implementation of the dialogue processing component 140 depends on the voice interaction resource and the dialogue processing algorithm. The dialogue processing algorithm here may include a natural language understanding algorithm, a dialogue management algorithm and a natural language generation algorithm, and may also include a semantic rejection algorithm and a semantic post-processing algorithm. Natural language understanding (natural language understanding, NLU) and semantic rejection are used to understand the semantics of the recognition text and to filter out irrelevant semantics; semantic post-processing, dialogue management and natural language generation are used to organize the dialogue transitions from the semantics of the recognition text. Specifically, natural language understanding converts text or speech into an internal description, i.e., a structured semantic expression of the input; semantic rejection means that, through natural language understanding technology, the system can distinguish input information that does not need to be processed in the current system state, including content unrelated to the interaction task and the dialogue topic or context; semantic post-processing means that after the system performs natural language understanding on an input signal, the understood result is further processed, for example, after natural language understanding of the input voice "the weather tomorrow", the specific date value corresponding to "tomorrow" still needs to be calculated; dialogue management means that the system, following the current dialogue state and the context input, updates the dialogue state and generates the dialogue action to be implemented according to the dialogue processing logic; natural language generation means that the system generates suitable natural language text according to the dialogue action obtained by dialogue management.
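The following toy sketch illustrates the order of these dialogue processing steps on a single weather-query turn; the rule-based NLU, the intent names and the reply template are invented purely for illustration and are not the algorithms actually used.

```python
import datetime

# Toy sketch of NLU -> semantic rejection -> semantic post-processing ->
# dialogue management -> NLG; all rules and names are illustrative only.

def nlu(text):
    # Natural language understanding: map text to a structured semantic frame.
    if "weather" in text:
        return {"intent": "query_weather",
                "date": "tomorrow" if "tomorrow" in text else "today"}
    return {"intent": "unknown"}

def post_process(frame):
    # Semantic post-processing: resolve relative dates to concrete values.
    offset = 1 if frame["date"] == "tomorrow" else 0
    frame["date"] = (datetime.date.today() + datetime.timedelta(days=offset)).isoformat()
    return frame

def process_turn(text, state):
    frame = nlu(text)
    if frame["intent"] == "unknown":                       # semantic rejection
        return state, None
    frame = post_process(frame)
    state = {**state, "last_intent": frame["intent"]}      # dialogue management (toy)
    reply = f"Checking the weather for {frame['date']}."   # natural language generation
    return state, reply

print(process_turn("what is the weather tomorrow", {}))
```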
The implementation of the speech synthesis component 150 relies on the speech synthesis algorithm, by which the conversion from interaction text to interactive synthesized speech is accomplished, thereby enabling voice interaction with the user.
After speech synthesis is completed, the interactive synthesized voice may be broadcast by the voice broadcasting component 160. It should be noted that, during the broadcast of the interactive synthesized voice, the recording and collection of the user voice stream is not interrupted. Synchronous collection of the user voice data stream ensures that user input can still be captured during the broadcast, and real-time interactive processing and feedback can be performed on a user voice stream in which the user interrupts the broadcast of the interactive synthesized voice. The user can therefore keep speaking as continuous input, the interaction between user and machine comes closer to real dialogue habits, the user can converse freely, and the user can interrupt at any time during the voice interaction.
To ensure that the broadcast of the interactive synthesized voice and the collection of the user voice stream can be executed simultaneously, the two are realized through two physical channels: an uplink channel is used for collecting the user voice stream, i.e., transmitting the collected user voice stream from the user to the voice interaction system, and a downlink channel is used for broadcasting the interactive synthesized voice, i.e., transmitting the interactive synthesized voice from the voice interaction system to the user. The two channels should work simultaneously without interfering with each other, giving the voice interaction system the ability to speak and hear at the same time, i.e., the person and the machine can communicate with each other simultaneously. Through the uplink channel and the downlink channel, the FDX voice interaction system can receive and send voice signals within the same time interval; thus, the user can freely interrupt the machine's speech at any time, the machine can manage the rhythm or give prompts while the user speaks or keeps silent, and the FDX voice interaction system can process input and output signals simultaneously at any moment, thereby realizing duplex communication interaction.
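One simple way to realize the parallel uplink and downlink described above is to run collection and broadcasting in separate threads connected by queues, as in the sketch below; the queue-based design and the mic/speaker callables are assumptions for illustration, not a mandated implementation.

```python
import queue
import threading

# Sketch of parallel uplink/downlink processing: collection never pauses,
# even while synthesized speech is being broadcast. Names are illustrative.

uplink = queue.Queue()     # microphone -> recognition pipeline
downlink = queue.Queue()   # synthesis -> loudspeaker

def uplink_worker(capture_chunk):
    while True:
        uplink.put(capture_chunk())     # keep collecting the user voice stream

def downlink_worker(play_chunk):
    while True:
        play_chunk(downlink.get())      # broadcast interactive synthesized voice

# threading.Thread(target=uplink_worker, args=(mic.read,), daemon=True).start()
# threading.Thread(target=downlink_worker, args=(speaker.write,), daemon=True).start()
```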
The system provided by the embodiment of the invention performs the collection of the user voice stream and the broadcast of the interactive synthesized voice through different physical channels, so that user input can still be collected during the broadcast of the interactive synthesized voice, and real-time interactive processing and feedback can still be performed on a user voice stream in which the user interrupts the broadcast. This ensures that the user can keep speaking as continuous input, that the interaction between user and machine comes closer to actual dialogue habits, that the user can converse freely and interrupt at any time during the voice interaction, and that the naturalness of the voice interaction is guaranteed.
Moreover, by combining the voice interaction resource with the voice recognition operation, dialogue processing operation and speech synthesis operation, real-time voice interaction processing is performed, so that real-time voice interaction under a voice interaction task can meet increasingly complex and diversified man-machine interaction requirements. In particular, with the use of voice data streams, user input is continuous and voice interaction can be performed with reference to the dialogue context, thereby ensuring the continuity of the voice interaction.
Based on the above embodiment, fig. 3 is a functional view of voice interaction provided by the present invention, and fig. 3 depicts three parts of input, processing and output of voice interaction.
The input part is the user voice stream, which can be divided into two types: one is the voice data stream input by the user for voice interaction, i.e., the input represented by the solid-line box in fig. 3; the other is the voice data stream input when the user interrupts the voice played by the voice interaction system during voice interaction, i.e., the input represented by the dashed-line box in fig. 3. Both types of input are speech signals, specifically representing information or requests provided by the user through speech. The voice interaction system can receive various input signals within a period of time, including but not limited to voice signals, information, requests and the like, transcribe the useful signals into text, extract semantic information from the transcribed text, predict and decide interaction tasks according to the semantic information, and provide output signals to the user according to the prediction and decision; output signals include but are not limited to synthesized voice, answers, information, behaviors and the like. The voice interaction system can thus process input and output signals simultaneously at any moment, realizing duplex communication interaction and improving the naturalness of voice interaction. For example, the voice interaction system receives the voice signal "the weather in Hefei", performs voice recognition and semantic understanding on it, and generates the interactive reply "the weather in Hefei today is 10°C, with rain" while continuing to listen for other input signals, such as "what about Shanghai". The FDX voice interaction system can perform voice recognition and semantic understanding on this round's input "what about Shanghai" while generating the synthesized audio, without affecting the transmission of the previous round's interactive reply to the speech synthesis module, thereby realizing simultaneous processing on the uplink and downlink channels.
The processing part, i.e., the voice interaction process in fig. 3, receives the input speech signal, transcribes the useful signal into recognition text through the speech recognition operation (Automatic Speech Recognition, ASR), extracts semantic information from the recognition text, predicts and decides the voice interaction task according to the semantic information, and generates the interactive synthesized speech as output according to the decision and/or the feedback to be provided to the user. Both the extraction of semantic information and the prediction and decision of the voice interaction task may be performed through the dialogue processing operation, and the generation of the interactive synthesized speech according to the decision and/or feedback may be performed through the speech synthesis operation (TTS).
The output part is interactive synthesized voice, and the interactive synthesized voice can be an answer or information fed back to the user, or can be feedback for executing actions according to the user request.
In addition, fig. 3 further shows a receiver and a transmitter. The receiver can receive various input signals within a period of time based on the uplink channel, and the transmitter can send automatically synthesized voice or other information and command actions to the user terminal in real time based on the downlink channel, so that the voice interaction system can process input and output signals at any time. This solves the former problem of dialogue blocking, breaks the impasse of low interaction-content integrity and failed semantic understanding, and pushes the naturalness of voice interaction to a new level.
In addition, by setting the voice interaction task in fig. 3, voice interaction resources such as scene, context, knowledge, data and computing channels can be applied to the voice interaction process. Context and scene may be used to define the semantic scope of a dialogue and may specifically refer to different scene or language context information; the dialogue in a voice interaction may cross contexts and scenes. The dialogue processing operation needs to be implemented based on knowledge and data, i.e., the relevant knowledge and data needed for scene and context understanding. The computing channel refers to the algorithms and computing modes adopted to realize the module capabilities, and may cover computing methods such as cloud computing and AI (artificial intelligence) computing, so as to support the execution of the voice interaction process.
The voice interaction thus achieved may have the following characteristics:
Continuous: the execution of the voice interaction system enables the user to speak continuously as continuous input, and the voice interaction system can continuously receive and process input data; during this process, the functional components can keep listening to and understanding what the user says. Furthermore, through FDX voice interaction, pauses in user input can be tolerated and intelligent waiting realized, so that continuous conversation is achieved; for example, if the user says "I want to listen to (pause 1 s) a song by Zhang Xueyou", the voice interaction can ignore the pause in the middle and receive and understand continuously. VAD is performed on the input continuous audio stream, continuous speech recognition can be realized, and user intention prediction and voice interaction are performed according to semantic understanding of the dialogue context; this may involve cloud services and/or terminal and/or edge computing;
Natural: the execution of the voice interaction system enables voice interaction to support natural conversations between people and machines. The user only needs to wake up the voice interaction system once at the beginning of the dialogue and can converse freely thereafter, and the system can be interrupted at any time during the interaction; in addition, the output interactive synthesized voice can simulate the tone and rhythm of a human voice;
Adaptive: the voice interaction system can adapt to continuous change and different deployment environments. It can be used in different vertical industries and adapts to cross-domain applications and tasks through the feeding of dynamic data and the updating of states based on new data; it can perform dialogue management by combining the knowledge base, scene data, historical data and user data, can initiate an active dialogue when the user is not speaking, and can choose to remain silent while the user speaks;
Proactive: the voice interaction system can dynamically predict the dialogue intention based on the voice interaction resource, control the dialogue rhythm, and actively give feedback to guide the user's next action.
Based on context: the voice interaction system builds core functions such as semantic understanding, historical information inheritance, data analysis, dialogue generation and the like based on the context.
Based on knowledge: the voice interaction system may use knowledge from a variety of sources including context information, historical information, retrieval information, user information, etc., all of which may be contained within the voice interaction resource, and in particular may be stored in a general knowledge base and database.
Based on the model: in addition to grammar analysis, the FDX voice interaction system preferably utilizes acoustic models, language models, natural language understanding models, natural language generation models, voice synthesis models based on machine learning techniques, which may be implemented based on convolutional neural networks (convolutional neural network, CNN), recurrent neural networks (recurrent neural network, RNN), long short-term memory (LSTM) networks, and the like.
The user is allowed to use the full-duplex voice interaction system offline, i.e., the terminal can also support independent use in a network-free environment.
The full-duplex voice interaction response time should not exceed 1.5 s, where the response time is measured from the end of the user's voice input to the synthesis of the system's voice response.
Based on the above embodiment, the voice broadcast component is further configured to:
if the user voice stream is collected during the broadcast of the interactive synthesized voice, stop broadcasting the interactive synthesized voice until an updated interactive synthesized voice is obtained, or continue broadcasting the interactive synthesized voice until the updated interactive synthesized voice is obtained.
Specifically, during the broadcast of the interactive synthesized voice, collection of the user voice stream proceeds synchronously. During this process, the user may simply be listening to the voice played by the voice interaction system and remain silent; although the voice interaction system is performing the collection action, no valid user voice stream may be obtained. Alternatively, the user may interrupt the broadcast of the interactive synthesized voice to express a new viewpoint or opinion, in which case the voice interaction system can collect a valid user voice stream.
To better fit the conversational situation in which the user keeps speaking or interrupts the other party, for the case where a user voice stream is collected during voice broadcasting, i.e., the user interrupts the broadcast voice to speak, the voice interaction system stops broadcasting the interactive synthesized voice, continues to collect the user voice stream, and performs the real-time interaction operation on the newly collected user voice stream, thereby obtaining the interactive synthesized voice for the newly collected user voice stream, i.e., the updated interactive synthesized voice, which is then broadcast. In this way, the voice interaction system can be interrupted by the user at any moment during broadcasting or speaking, i.e., the interaction process of the voice interaction system can be interrupted whenever needed, which avoids the broadcast voice interfering with the user's train of thought and harming the user experience while the user is speaking, and makes the voice interaction closer to a real conversation.
In addition, for the case where a user voice stream is collected during voice broadcasting, considering that the collected user voice stream is not necessarily a voice stream valid for voice interaction, the irrelevant content rejection function configured in the voice recognition component may filter out invalid speech during the voice interaction, and in the case of invalid speech the voice interaction system will not generate a new interactive synthesized voice. To avoid the interactive synthesized voice broadcast being interrupted without cause by invalid speech interference, the interactive synthesized voice can continue to be broadcast while the user voice stream is collected, until an updated interactive synthesized voice is obtained, and the updated interactive synthesized voice is then broadcast.
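The two broadcast policies just described can be summarized in the following sketch, where whether playback is stopped immediately depends on whether the interrupting speech survives irrelevant-content rejection; the player and pipeline objects and their methods are hypothetical.

```python
# Sketch of barge-in handling during broadcast; all object interfaces are assumed.

def on_speech_during_playback(speech, player, pipeline, stop_on_barge_in=True):
    if not pipeline.is_relevant(speech):    # invalid speech: noise, echo, chit-chat
        return                              # keep broadcasting, nothing changes
    if stop_on_barge_in:
        player.stop()                       # interrupt the current broadcast at once
    reply = pipeline.interact(speech)       # recognition -> dialogue -> synthesis
    player.play(reply)                      # broadcast the updated synthesized voice
```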
Based on any of the above embodiments, in an interaction scene of a limited domain or another specific scene, the full-duplex voice interaction system should support semantic-based broadcast interruption and should satisfy the following conditions:
a) In an interaction scene of a limited domain or another specific scene, the system should support semantic-based broadcast interruption. Here, interruption of the interactive synthesized voice broadcast is realized based on the semantics of the user voice stream: a broadcast interruption triggered by valid speech uttered by the user is a valid-semantics interruption, while a broadcast interruption triggered by external noise or other factors is an invalid-semantics false interruption:
i) When part of the user input is detected as valid information but other information is still needed, feedback words are returned.
ii) During interaction in the target scene, the accuracy of the system's understanding of target-scene user statements should be greater than or equal to a certain threshold, such as 80%, 85%, 90%, 95%, 98%, etc., and the recall should be greater than or equal to a certain threshold, such as 80%, 85%, 90%, 95%, 98%, etc.; the accuracy of the system's extraction of key information from target-scene user statements should be greater than or equal to a certain threshold, such as 80%, 85%, 90%, 95%, 97%, etc., and the recall should be greater than or equal to a certain threshold, such as 80%, 85%, 90%, 95%, 97%, etc.;
Further, the accuracy of the system's understanding of the intention of target-scene user statements should be greater than or equal to 90% and the recall greater than or equal to 90%; the accuracy of the system's extraction of key information from target-scene user statements should be greater than or equal to 90% and the recall greater than or equal to 90%, where key information refers to all necessary slot-value information in the input statement required for the system to respond correctly to the user request. A sketch of how such accuracy and recall figures may be computed is given after the broadcast-interruption conditions below.
b) In order to ensure user end-to-end interaction experience, the system should satisfy the following in consideration of the influence of noise environment on understanding results:
i) In a low-noise environment (signal-to-noise ratio above 10 dB), the response rate to non-human-machine interaction should be less than or equal to 6%; the non-human-machine-interaction response rate refers to the proportion of dialogues to which the system responds in non-human-machine-interaction scenarios among all non-human-machine-interaction dialogues successfully received by the machine.
ii) In a high-noise environment (signal-to-noise ratio of 10 dB or below), the response rate to non-human-machine interaction should be less than or equal to 10%.
In addition, for broadcasting interruption, the system can also satisfy the following conditions:
In a low-noise environment (e.g., sound intensity below a certain threshold, where the threshold may be 45 dB, 50 dB, 55 dB, etc.), the valid-semantics interruption success rate should be greater than or equal to a certain threshold, for example 80%, 85%, 90%, 95%, 98%, 99%, etc., and the invalid-semantics false-interruption rate should be less than a certain threshold, for example 1%, 5%, 7%, 10%, 15%, 20%, etc. The valid-semantics interruption success rate refers to the recognition accuracy of barge-in utterances with a clear intention; the invalid-semantics false-interruption rate refers to the misrecognition rate at which barge-in utterances without a clear intention, or that are merely chit-chat, are identified as intending to interrupt.
Valid-semantics interruption success rate and invalid-semantics rejection rate in a noisy environment: in a high-noise environment (e.g., sound intensity at or above a certain threshold, where the threshold may be 50 dB, 55 dB, 60 dB, 65 dB, etc., or sound intensity of 60 dB to 65 dB), the valid-semantics interruption success rate should be greater than or equal to a certain threshold, e.g., 75%, 80%, 85%, 90%, 95%, etc., and the invalid-semantics rejection rate should be less than a certain threshold, e.g., 1%, 5%, 7%, 10%, 15%, 20%, etc.
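The accuracy and recall thresholds in the conditions above can be checked on a labelled test set; the sketch below treats accuracy as precision over the system's accepted outputs and uses invented example counts, so it is a generic illustration rather than the prescribed test procedure.

```python
# Generic precision/recall sketch for checking the thresholds above.

def precision_recall(true_positive, false_positive, false_negative):
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    return precision, recall

# Example counts (invented): 180 correctly understood intents, 15 wrongly
# accepted, 12 missed -> precision 92.3%, recall 93.8%, both above 90%.
p, r = precision_recall(true_positive=180, false_positive=15, false_negative=12)
assert p >= 0.90 and r >= 0.90
print(f"precision={p:.1%}, recall={r:.1%}")
```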
Based on any of the above embodiments, the acoustic acquisition assembly comprises:
and the microphone array is used for collecting the user voice stream and performing voice preprocessing on the user voice stream, wherein the voice preprocessing comprises at least one of voice enhancement, dereverberation, sound source extraction and separation, echo cancellation, sound source localization and denoising.
In particular, the collection of the user's voice stream, i.e. the acoustic collection, may be achieved by a microphone array comprising a plurality of microphones. The microphone array can be applied to not only the collection of voice signals, but also the voice preprocessing stage of the collected voice signals.
The front-end correlation algorithm, i.e. the algorithm for performing voice preprocessing, such as voice enhancement, dereverberation, sound source extraction and separation, echo cancellation, sound source localization, denoising, etc., may be integrated as part of a microphone array, and after the microphone array completes voice signal acquisition, the acquired voice signal may be subjected to voice preprocessing, so as to obtain a user voice stream after completing the preprocessing.
In the above-listed voice preprocessing operation, voice enhancement refers to a process of extracting a pure voice signal from a noisy background, and particularly in a complex acoustic environment, when the voice signal is interfered by various noises (including background noise and other irrelevant voices) and even is submerged, a voice enhancement algorithm based on a beam forming method (Beamforming based approach) or the like can be used for suppressing the noise and enhancing the voice.
Reverberation generally refers to the acoustic phenomenon in which sound, as it propagates in an enclosed space (e.g., an indoor space), is reflected by obstacles such as walls, ceilings and floors and superimposed on the original sound. Because of reverberation, asynchronous speech signals overlap each other, producing a phoneme-overlap effect. To address this problem, the microphone array may implement dereverberation through a blind speech enhancement based approach (Blind signal enhancement approach), a beamforming based approach, an inverse filtering based approach (An inverse filtering approach), and so on. The blind speech enhancement based approach treats reverberation as an ordinary additive noise signal and applies a speech enhancement algorithm; the beamforming based approach forms a pickup beam in the direction of the target signal by weighting and summing the signals collected by multiple microphones while attenuating reflected sound from other directions; the inverse filtering based approach estimates the room impulse response (Room Impulse Response, RIR) with the microphone array and designs a reconstruction filter to compensate for and cancel the reverberation.
In sound source extraction and separation, sound source extraction refers to extracting a target speech signal from a plurality of sound signals, while the purpose of sound source separation is to extract all of the individual signals from a plurality of mixed sounds. Both beamforming-based methods and blind source separation methods (Blind Source Separation) may be used for sound source extraction and separation, where blind source separation methods may include methods based on principal component analysis (Principal Component Analysis, PCA) and independent component analysis (Independent Component Analysis, ICA).
Echo cancellation (Acoustic echo cancellation, AEC) is used to ensure that the voice interaction system can collect voice signals while broadcasting audio (e.g., music, interactive synthesized speech, etc.). When the user at far end A speaks, his voice is picked up by the microphone array, transmitted to the communication device at near end B, and broadcast by its speaker. This speech signal is then picked up by the microphone of near end B, thereby forming an echo. The echo signal is transmitted back to far end A and broadcast through the speaker of far end A, and the user at far end A then hears his own voice. The echo signal has a great influence on the voice collection effect, so it needs to be removed in the voice collection process. Echo cancellation may be achieved, for example, by an adaptive filter with a finite impulse response (finite impulse response, FIR) structure.
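As a concrete illustration of adaptive-filter echo cancellation, the following sketch uses a normalized LMS (NLMS) update of an FIR filter to estimate and subtract the echo of the broadcast signal from the microphone signal; the tap count and step size are arbitrary example values, not parameters specified by the system.

```python
import numpy as np

# Minimal NLMS adaptive-filter sketch of acoustic echo cancellation (AEC):
# the far-end (broadcast) signal is filtered to estimate the echo picked up
# by the microphone, and the estimate is subtracted from the microphone signal.

def nlms_aec(far_end, mic, taps=128, mu=0.5, eps=1e-8):
    far_end = np.asarray(far_end, dtype=float)
    mic = np.asarray(mic, dtype=float)
    w = np.zeros(taps)                       # adaptive FIR filter weights
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]        # most recent far-end samples
        e = mic[n] - w @ x                   # near-end speech + residual echo
        w += mu * e * x / (x @ x + eps)      # normalized LMS weight update
        out[n] = e
    return out                               # echo-cancelled microphone signal
```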
Based on any of the above embodiments, the microphone array is specifically configured to:
executing speaker positioning and determining speaker positioning results;
and carrying out directional voice pickup based on the speaker positioning result to obtain a user voice stream, and carrying out voice preprocessing on the user voice stream.
In particular, the microphone array may also be provided with sound source localization and directional pickup functions, where sound source localization here means speaker localization. In reality the position of the sound source may change continuously; sound source localization technology can calculate the angle and distance of the target speaker using the microphone array, so as to track the target speaker and steer subsequent voice pickup. With the support of sound source localization and directional voice pickup technology, voice interaction applications no longer restrict the movement of the speaker: the microphone array can automatically adjust its receiving direction to capture the speaker's voice, thereby collecting a user voice stream with a high signal-to-noise ratio (SNR).
Here, speaker localization may locate the target speaker by calculating information such as the plane angle, azimuth angle, pitch angle and distance between the microphone and the target speaker, thereby improving the signal-to-noise ratio of the user voice stream; the identification of the target speaker may be implemented by voiceprint recognition technology.
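Speaker localization with a microphone array is commonly built on time-difference-of-arrival estimates between microphone pairs; the sketch below computes such a delay with GCC-PHAT for one pair and is an illustrative building block rather than the complete localization method described here.

```python
import numpy as np

# GCC-PHAT time-difference-of-arrival (TDOA) sketch for one microphone pair.
# A real array combines several pairs to obtain azimuth/elevation and distance.

def gcc_phat_delay(sig, ref, fs):
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n)
    REF = np.fft.rfft(ref, n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12                      # PHAT weighting
    corr = np.fft.irfft(cross, n)
    corr = np.concatenate((corr[-(n // 2):], corr[:n // 2 + 1]))
    delay_samples = int(np.argmax(np.abs(corr))) - n // 2
    return delay_samples / fs                           # delay in seconds
```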
Based on any of the above embodiments, the microphone array is specifically configured to collect a user voice stream in a near field and/or a far field;
the distance of the near field acquisition is within 1 meter, and the distance of the far field acquisition is more than 3 meters and less than 5 meters.
In particular, a microphone array for capturing a user's voice stream should be provided with near-field audio capturing and/or far-field audio capturing capabilities. The near field referred to herein means a case where a distance between the microphone and the sound source is within 1 meter, and the far field means a case where a distance between the microphone and the sound source is greater than 3 meters and less than 5 meters.
Based on any of the above embodiments, fig. 4 is a schematic diagram of a microphone array structure provided by the present invention, and as shown in fig. 4, the array structure of the microphone array is at least one of a linear form, a planar form and a spatial stereo form.
In particular, microphone arrays are generally composed of two or more microphones in linear, planar and spatial stereo form. An example of a microphone array structure is shown in fig. 4. In fig. 4, (a) shows a microphone array structure in a Linear form, (b) shows a microphone array structure in a Planar form, and (c) shows a microphone array structure in a spatial three-dimensional form (Spatial steric form).
In the system provided by the embodiment of the invention, the front-end acoustic collection of voice interaction is realized by using the microphone array, and the microphone array has the functions of continuous audio collection, voice enhancement, sound source positioning, dereverberation, denoising, echo cancellation, sound source extraction and the like. In addition, the compression level of the user voice stream collected by the microphone array should be configurable, and the user voice stream collected by the microphone array should be adaptable to compression and decompression under various coding formats and algorithms without changing the voice content. The coding formats here may include EVRC (Enhanced Variable Rate Codec ) and g.711, g.723.1 series developed by ITU-T, and the coding formats of audio may include AAC (advanced audio coding ), AC3 (audio coding 3), MP3 (MPEG audio layer 3), WMA (Windows media audio), WAV (waveform audio file format), and so on.
Based on any of the above embodiments, the acoustic acquisition assembly is specifically configured to:
and under the condition that the voice interaction state is on, collecting the voice stream of the user.
In particular, in order to protect privacy and safety of users, a switch may be set for voice interaction, that is, a voice interaction switch, where the voice interaction switch may be a virtual switch, that is, a switch icon displayed on a screen of the voice interaction system or a screen connected to the voice interaction system, and the voice interaction switch may also be an entity switch, that is, a switch in the form of a key or a dial plate set on the voice interaction system.
The on-off state of the voice interaction switch directly determines the voice interaction state, and the voice interaction state can be on or off. Only under the condition that the voice interaction state is on, the voice interaction method in the above embodiments is executed, if the voice interaction state is off, namely, the user turns off the voice interaction switch, the voice interaction system will not collect the user voice stream, and the subsequent voice interaction process will not be executed, thereby avoiding the user privacy from being eavesdropped and revealed, and ensuring the privacy security of the user.
In addition, when the voice interaction state is on, the user can be reminded that the voice interaction is running through icons displayed on a screen or other visual or audible prompting modes.
In addition, in order to ensure the privacy security of the user, a notification of the collection of the privacy information needs to be provided, the collection of the privacy information should be easy for the user to find, and the content of the notification should be popular and easy to understand. In addition, the voice interaction system can also set an authorized account number for the user to manage personal safety and privacy information of the user, wherein the personal safety and privacy information of the user can be name, gender, voiceprint, other functions defined in a user configuration file and the like, the personal safety and privacy information of the user needs to be stored in a trusted safety space in an encrypted mode, and the information is only used for temporary interaction and cannot be applied to other services. In addition, the system should also support closing the audio cloud collection and closing the continuous speech recognition after receiving the active exit intention or the invalid interaction for a specified time (e.g., 60 seconds without interaction).
Based on any of the above embodiments, thespeech recognition component 130 includes:
the end point detection sub-component is used for detecting the voice end point of the user voice stream based on the voice interaction resource and a voice end point detection algorithm based on the semantics to obtain an active voice stream;
and the voice recognition sub-component is used for carrying out voice recognition on the active voice stream based on the voice interaction resource and the streaming voice recognition algorithm to obtain a recognition text of the active voice stream.
Specifically, the endpoint detection sub-component is used to realize voice endpoint detection on the user voice stream, obtain the active voice stream, and judge the integrity of the voice content. Voice endpoint detection here needs to be implemented in reliance on a voice endpoint detection algorithm.
Conventional voice endpoint detection algorithms are implemented based on acoustic features, such as detection based on energy, periodicity features, zero-crossing rate, or multi-feature fusion. In practice, however, continuous speech streams often contain various kinds of background noise and are affected by speaking rate and speaking style, for which acoustic voice endpoint detection based on energy or zero-crossing rate is very likely to fail. Therefore, in the embodiment of the invention, a semantic-based voice endpoint detection algorithm better suited to high noise (i.e., low signal-to-noise ratio) and far-field voice pickup is selected; for example, an ML (machine learning) algorithm can be applied to calculate the semantic truncation probability and detect the start and end points of multiple speech segments in the continuous user voice stream, thereby obtaining an active voice stream with silence periods removed. Here, an LSTM, a DNN (deep neural network) or the like may be used to calculate the semantic truncation probability.
It should be noted that, in the semantic-based voice endpoint detection algorithm, the silence waiting time between two voice segments can be set flexibly, and the sensitivity of the voice endpoint detection algorithm can be adjusted by adjusting this silence waiting time.
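The following sketch shows how a semantic completeness score and an adjustable silence waiting time could be combined into an endpoint decision; the completeness model and all thresholds are assumptions made for illustration, not the actual detection algorithm.

```python
# Sketch of semantic voice endpoint detection: close the utterance only when the
# pause is long enough AND the partial transcript looks semantically complete.

def is_endpoint(silence_ms, partial_text, completeness_model,
                min_silence_ms=200, max_silence_ms=1500, threshold=0.8):
    if silence_ms < min_silence_ms:
        return False                               # still speaking
    if silence_ms >= max_silence_ms:
        return True                                # hard timeout, end regardless
    p_complete = completeness_model(partial_text)  # semantic truncation probability
    return p_complete > threshold                  # "I want to listen to ..." stays open
```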
The voice recognition sub-component is used to perform voice recognition on the active voice stream, where the voice recognition depends on a streaming voice recognition algorithm. The streaming voice recognition algorithm is used for voice recognition of continuous voice streams; it can recognize at least one language and adapt to multiple languages, so that streaming voice recognition in various languages can be supported.
Various streaming voice recognition algorithms can be applied to implement voice recognition, for example the streaming voice recognition framework shown in fig. 5, which is composed of an encoder, an acoustic model, a language model, a dictionary, and a decoder; the input voice stream is the active voice stream, and the output text is the recognition text.
The encoder is configured to extract features from the input voice stream. In this process, the encoder converts each frame of voice data in the voice stream into a multi-dimensional vector representing acoustic information. The features encoded in streaming voice recognition may be at least one of linear prediction coefficients (LPC), perceptual linear prediction (PLP) features, tandem features, bottleneck features, filter bank (FBank) features, linear predictive cepstral coefficients (LPCC), and Mel-scale frequency cepstral coefficients (MFCC). It should be noted that the tandem features and bottleneck features referred to here may be extracted by a neural network; specifically, the posterior probability vector of the corresponding class nodes in the output layer of the neural network may be reduced in dimension and spliced with the MFCC or PLP features to obtain the tandem features.
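As a small illustration of the encoder front-end, the sketch below computes MFCC features with the librosa library for a synthetic signal; the frame parameters (25 ms window, 10 ms hop at 16 kHz) and the use of librosa are assumptions for illustration, not values prescribed by the invention.

```python
# Illustrative feature extraction for the encoder stage: MFCCs computed with
# librosa (one of several possible front-ends; FBank, PLP, etc. are analogous).
import numpy as np
import librosa

sr = 16000
# One second of synthetic audio stands in for a frame-aligned speech stream.
t = np.linspace(0, 1, sr, endpoint=False)
speech = 0.1 * np.sin(2 * np.pi * 220 * t).astype(np.float32)

# Each ~25 ms frame becomes a 13-dimensional vector of acoustic information.
mfcc = librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```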
The voice recognition implemented in this way should support speech recognition and error correction in a variety of usage scenarios, as well as sentence segmentation and audio interaction segmentation. In addition, it should be able to detect the start and end points of multiple voice segments in a continuous voice stream, and should be able to reject the recognition of inappropriate content according to the semantics of the sentence and the scene. The test indexes of voice recognition include sentence recognition accuracy and word recognition accuracy; the test set construction, test method, and index calculation method for word recognition accuracy refer to GB/T 21023-2007, and those for sentence recognition accuracy refer to GB/T 36464.1-2020:
a) In a low noise environment (the signal-to-noise ratio is at or above a certain threshold, e.g., 5 dB, 10 dB, 15 dB, 20 dB, 25 dB, or 30 dB), the sentence recognition accuracy should be greater than or equal to a certain threshold, e.g., 80%, 83%, 84%, 85%, 86%, 90%, or 95%, and the word recognition accuracy should be greater than or equal to a certain threshold, e.g., 80%, 85%, 90%, 95%, 97%, 98%, or 99%. The sentence recognition accuracy reflects the sentence recognition capability of the tested system and is calculated as the number of sentences correctly recognized by the tested system divided by the total number of labeled sentences; the word recognition accuracy reflects the word recognition capability of the tested system and is calculated as the total number of words in the sentences correctly recognized by the tested system divided by the total number of labeled words (a minimal calculation sketch follows this list).
Further, in the case where a low noise environment is defined as an environment with a signal-to-noise ratio of 10 dB or more, the sentence recognition accuracy should be 84% or more, and the word recognition accuracy should be 95% or more;
b) In a high noise environment (the signal-to-noise ratio is at or below a certain threshold, e.g., 5 dB, 10 dB, 15 dB, 20 dB, 25 dB, or 30 dB), the sentence recognition accuracy should be greater than or equal to a certain threshold, e.g., 70%, 73%, 74%, 75%, 76%, 80%, 85%, or 90%, and the word recognition accuracy should be greater than or equal to a certain threshold, e.g., 70%, 75%, 80%, 85%, 87%, 88%, 89%, 90%, or 95%;
further, in the case where a high noise environment is defined as an environment with a signal-to-noise ratio of 10 dB or less, the sentence recognition accuracy should be 75% or more, and the word recognition accuracy should be 88% or more.
c) The real-time voice recognition performance index should meet the real-time requirement.
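The accuracy formulas described in item a) can be sketched as follows; the GB/T standards cited above define the formal test procedure, so this is only a minimal illustration of the two ratios.

```python
# Minimal sketch of the two test indexes: sentence accuracy is correctly
# recognized sentences over labeled sentences; word accuracy here is the total
# word count of correctly recognized sentences over the labeled word total.
def sentence_accuracy(references: list[str], hypotheses: list[str]) -> float:
    correct = sum(1 for r, h in zip(references, hypotheses) if r == h)
    return correct / len(references)

def word_accuracy(references: list[str], hypotheses: list[str]) -> float:
    correct_words = sum(len(r.split())
                        for r, h in zip(references, hypotheses) if r == h)
    total_words = sum(len(r.split()) for r in references)
    return correct_words / total_words

refs = ["turn on the light", "what is the weather", "play some music"]
hyps = ["turn on the light", "what is the whether", "play some music"]
print(sentence_accuracy(refs, hyps))  # 2 of 3 sentences correct
print(word_accuracy(refs, hyps))      # 7 of 11 labeled words
```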
In addition, the acoustic model is typically trained on feature vectors and outputs phoneme information. The dictionary represents the correspondence between words and phonemes, such as between phonetic symbols and characters or words. The language model obtains word-related probabilities by training on a large amount of text. The decoder outputs text from the acoustic features produced by the encoder, using the acoustic model, the dictionary, and the language model. Further, the training samples of the acoustic model and the language model applied here, as well as the dictionary itself, may be obtained from the voice interaction resource.
Furthermore, the streaming voice recognition algorithm can not only perform the voice recognition task on the voice stream, but also provide post-processing functions for the recognized text, such as character normalization, punctuation prediction, and text replacement.
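By way of illustration, a toy post-processing pass of the kind mentioned above might look like the following; the replacement rules and number inventory are assumptions, not part of the invention.

```python
# A toy post-processing pass: character/number normalization, simple
# punctuation prediction, and text replacement rules.
import re

REPLACEMENTS = {"gonna": "going to", "wanna": "want to"}  # illustrative rules

def postprocess(raw: str) -> str:
    text = raw.strip().lower()
    for src, dst in REPLACEMENTS.items():
        text = re.sub(rf"\b{src}\b", dst, text)
    # Normalize spelled-out small numbers (assumed inventory, not exhaustive).
    numbers = {"one": "1", "two": "2", "three": "3"}
    text = " ".join(numbers.get(tok, tok) for tok in text.split())
    # Naive punctuation prediction: questions start with an interrogative word.
    if text.split()[0] in {"what", "where", "when", "who", "how", "why"}:
        return text.capitalize() + "?"
    return text.capitalize() + "."

print(postprocess("what time is it"))              # What time is it?
print(postprocess("set a timer for two minutes"))  # Set a timer for 2 minutes.
```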
Based on any of the above embodiments, and referring to fig. 8, the voice recognition component 130 further includes:
and the irrelevant content refusing sub-component is used for refusing the content of the identification text based on the voice interaction resource.
Specifically, the irrelevant content rejection sub-component is used to perform content rejection on the recognition text, and the content rejection depends on an irrelevant content rejection algorithm.
The irrelevant content rejection algorithm can identify and reject content in the user voice stream that cannot or should not be processed, such as noise and background speech. In particular, according to the semantics of the recognition text obtained by voice recognition and the semantics of the interaction scene, content in the recognition text that is irrelevant to the voice interaction can be distinguished and screened out, so that the recognition text obtained after irrelevant content rejection contains only content relevant to the voice interaction.
Here, the semantics of the interaction scene may be determined according to the context of the voice interaction process, or according to the voice interaction task, for example by querying the voice interaction resource for the semantics of the voice interaction task or of an interaction scene subordinate to that task. By judging whether the semantics of each clause in the recognition text are related to the semantics of the interaction scene, the clauses irrelevant to the interaction scene are located and deleted from the recognition text, which prevents the voice interaction system from mistaking interfering noise for user input and producing false feedback.
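A rough sketch of such clause-level rejection is shown below; a real system would use learned semantic representations, whereas this illustration uses simple token overlap with assumed scene keywords.

```python
# Sketch of clause-level rejection: clauses whose token overlap with the
# current interaction scene falls below a threshold are treated as irrelevant.
def reject_irrelevant(recognized_clauses: list[str],
                      scene_keywords: set[str],
                      min_overlap: int = 1) -> list[str]:
    kept = []
    for clause in recognized_clauses:
        tokens = set(clause.lower().split())
        if len(tokens & scene_keywords) >= min_overlap:
            kept.append(clause)   # related to the interaction scene
        # otherwise the clause (background speech, noise transcript) is dropped
    return kept

scene = {"navigate", "route", "traffic", "destination", "map"}
clauses = ["navigate to the airport",
           "did you feed the cat",            # background conversation
           "avoid heavy traffic on the route"]
print(reject_irrelevant(clauses, scene))
```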
Based on any of the above embodiments, the voice recognition algorithm may further include a speaker classification algorithm. Specifically, the speaker classification algorithm may be applied to segment and cluster the voices of multiple speakers contained in the user voice stream, so as to determine the target speaker currently performing voice interaction, and to perform operations such as voice recognition, dialogue processing, and voice synthesis only on the voice of the target speaker.
Segmentation of speaker voices refers to finding the time boundaries at which the speaker changes in the user voice stream and cutting the voice stream into multiple voice segments according to these boundaries. Clustering of speaker voices refers to aggregating the one or more voice segments that belong to the same speaker.
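As an illustration, speaker segmentation and clustering can be sketched on per-segment speaker embeddings (assumed here, e.g. produced by an x-vector model not shown) using agglomerative clustering.

```python
# Sketch of speaker clustering on assumed per-segment speaker embeddings.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Each row is a hypothetical embedding for one detected speech segment.
segment_embeddings = np.array([
    [0.90, 0.10], [0.88, 0.12],   # speaker A
    [0.10, 0.95], [0.12, 0.90],   # speaker B
    [0.91, 0.08],                 # speaker A again
])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(segment_embeddings)
print(labels)  # e.g. [0 0 1 1 0]: segments grouped per speaker

# The target speaker can then be chosen (e.g. the cluster that uttered the
# trigger word) and only that cluster's segments passed to recognition.
target_cluster = labels[0]
target_segments = [i for i, lab in enumerate(labels) if lab == target_cluster]
print(target_segments)
```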
Based on any of the above embodiments, the speech recognition component is further configured to:
and if the recognition text contains a preset trigger word, continuously executing voice recognition on the user voice stream based on the voice interaction resource.
Specifically, the triggering of voice interaction can be realized through a preset trigger word, such as "Hello, Xiaofei". The preset trigger word can be a default of the voice interaction system or user-defined, and one or more preset trigger words can be adopted.
After the voice recognition operation is performed on the user voice stream, the recognition text of the user voice stream can be obtained, and if the recognition text is detected to contain the preset trigger word, the user can be considered to trigger the voice interaction function. After the voice interaction function is triggered, continuous voice recognition operation can be performed on the continuously picked-up user voice stream, and dialogue processing operation and voice synthesis operation are continuously performed on the recognized text obtained through recognition, so that natural voice interaction can be achieved after one trigger.
Further, in order to ensure the naturalness of voice interaction, the user can combine the preset trigger word with continuous speech when triggering, for example, "Hello, help me play a song".
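A minimal sketch of this trigger-word handling, with assumed trigger phrases, is shown below.

```python
# Minimal sketch of trigger-word handling: detect a preset trigger word in the
# recognition text and, if a command follows in the same utterance, keep it so
# that "trigger + request" can be handled in one turn.
TRIGGER_WORDS = ("hello xiaofei", "hi assistant")  # assumed, user-configurable

def check_trigger(recognized_text: str) -> tuple[bool, str]:
    text = recognized_text.lower().strip().strip(",. ")
    for trigger in TRIGGER_WORDS:
        if text.startswith(trigger):
            remainder = text[len(trigger):].lstrip(",. ")
            return True, remainder      # remainder may already be a request
    return False, ""

print(check_trigger("Hello Xiaofei, help me play a song"))
# (True, 'help me play a song')
print(check_trigger("turn off the lights"))
# (False, '')
```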
Based on any of the above embodiments, the dialog processing component includes:
the natural language understanding sub-component is used for acquiring entity information and intention information in the identification text based on the voice interaction resource and a natural language understanding algorithm, and acquiring text semantics based on the entity information and the intention information;
a dialog management sub-component for determining dialog actions for feeding back the text semantics based on the voice interaction resources and dialog management algorithms;
and a natural language generation sub-component, used for generating interactive text corresponding to the dialogue action based on a natural language generation algorithm.
In particular, the natural language understanding sub-component is used for semantic understanding of the recognition text, where the semantic understanding depends on a natural language understanding (NLU) algorithm. The NLU algorithm is used to extract information from the recognition text and generate one or more semantic paths for the content contained in the recognition text as the text semantics.
The two basic NLU algorithms supporting voice interaction are named entity recognition (NER) and intent understanding. NER is used to identify and label named entities such as persons, places, and organizations, thereby obtaining the entity information in the recognition text; the entity information can cover the entities contained in the recognition text and their specific types. On the basis of the entity information, intent understanding can perform domain classification, intent recognition, and semantic annotation on the recognition text, thereby obtaining the intent information of the recognition text, and the text semantics are then determined from the intent information.
For example, FIG. 6 is a schematic diagram of intent understanding provided by the present invention; it depicts the relationships between domain classification, intent recognition, and semantic annotation in intent understanding, and the role each plays. The top layer is domain classification, which classifies the meaning of the recognition text into a domain category. The middle layer is intent recognition, which identifies further details of the sentence and maps them onto a defined expression library in augmented Backus-Naur form (ABNF). The bottom layer is semantic annotation, also called attribute extraction, which refers to the process of generating and attaching labels representing a particular meaning to keywords or sentences that have semantic slots. Semantic annotation can also be viewed as a sequence labeling task that selects the semantics useful for capturing the speaker's intent, and it can be implemented using rule-based or machine-learning-based methods.
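The three layers can be illustrated with a toy, rule-based sketch; the domains, intents, and slot patterns below are assumptions for illustration only.

```python
# Toy illustration of the three intent-understanding layers: domain
# classification, intent recognition, and semantic annotation (slot filling).
import re

def understand(text: str) -> dict:
    text = text.lower()
    # Domain classification and intent recognition via simple keyword rules.
    if any(w in text for w in ("play", "song", "music")):
        domain, intent = "music", "play_music"
    elif any(w in text for w in ("navigate", "route", "drive")):
        domain, intent = "navigation", "plan_route"
    else:
        domain, intent = "chitchat", "open_talk"

    # Semantic annotation: crude slot extraction with regular expressions.
    slots = {}
    m = re.search(r"\bby ([a-z ]+)$", text)
    if m:
        slots["artist"] = m.group(1).strip()
    m = re.search(r"\bto ([a-z ]+)$", text)
    if m and domain == "navigation":
        slots["destination"] = m.group(1).strip()
    return {"domain": domain, "intent": intent, "slots": slots}

print(understand("play a song by queen"))
print(understand("navigate to the airport"))
```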
After the text semantics are obtained, the dialogue management sub-component can be applied: the dialogue management algorithm takes the current dialogue state and context as input, updates the dialogue state, and generates the dialogue action to be performed according to the dialogue processing logic and the text semantics.
The natural language generation sub-component is used for generating natural language text of feedback text semantics, namely, interactive text of a feedback user. Here, the natural language generation algorithm has the capability of natural language generation (natural language generation, NLG) so that interactive information in the form of data can be converted into text in the form of natural language, i.e. interactive text. Here, the content of the interactive text may be a simple reply text, a reply text based on a predefined template, a reply text reasonably guided or suggested by understanding and responding to the user's intention, etc., to which the embodiment of the present invention is not limited in detail.
Based on any of the above embodiments, the dialog processing component further includes:
a semantic rejection sub-component, used for screening out irrelevant semantics from the text semantics based on the voice interaction resource and a semantic rejection algorithm, to obtain target semantics;
a semantic post-processing sub-component, used for determining interaction information of the target semantics based on the voice interaction resource and a semantic post-processing algorithm;
correspondingly, the dialogue management sub-component is specifically configured to determine the dialogue action corresponding to the interaction information based on the voice interaction resource and a dialogue management algorithm.
In general, the text semantics obtained through natural language understanding take the form of multiple unordered semantic paths. The semantic rejection sub-component therefore uses a semantic rejection algorithm to rank the semantic paths by their degree of match with the current interaction task, the dialogue subject, or the context of the current interaction, screens out the semantic paths irrelevant to the interaction task, dialogue subject, or context, and selects the optimal semantic path that best reflects the semantics of the recognition text as the target semantics of the recognition text.
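A minimal sketch of this ranking-and-rejection step is shown below; the scoring function is an intentionally simple stand-in for a learned matching model.

```python
# Sketch of semantic rejection: candidate semantic paths from NLU are scored
# against the current task/context and the best match is kept as the target
# semantics.
def select_target_semantics(candidates: list[dict],
                            current_task: str,
                            context_slots: set[str]):
    def score(path: dict) -> float:
        task_match = 1.0 if path["domain"] == current_task else 0.0
        slot_match = len(set(path["slots"]) & context_slots)
        return task_match * 2.0 + slot_match   # task relevance weighted higher

    ranked = sorted(candidates, key=score, reverse=True)
    best = ranked[0]
    return best if score(best) > 0 else None   # everything irrelevant: reject

paths = [
    {"domain": "music", "intent": "play_music", "slots": {"artist": "queen"}},
    {"domain": "navigation", "intent": "plan_route", "slots": {}},
]
print(select_target_semantics(paths, current_task="navigation",
                              context_slots={"destination"}))
```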
After the target semantics of the recognition text are obtained, the semantic post-processing sub-component can be applied: the target semantics are converted into a service request by the semantic post-processing algorithm, and the data meeting the user's needs, i.e. the interaction information required for the interaction, is then obtained by querying the semantic interaction resource containing massive amounts of service data.
The semantic post-processing algorithm here may include operations such as semantic inheritance, semantic post-processing, information source searching, semantic correction, and business data ordering. The semantic post-processing algorithm can also be deployed as a cloud service, so that the user's service needs can be met quickly and accurately with the support of strong computing power.
After the interaction information is obtained, the dialogue management sub-component can be applied: the dialogue management algorithm takes the current dialogue state and context as input, updates the dialogue state, and generates the dialogue action to be performed according to the dialogue processing logic and the interaction information. The natural language generation sub-component can then be applied to generate appropriate natural language text corresponding to the dialogue action, i.e. the interactive text fed back to the user. In addition, when the dialogue management sub-component performs dialogue management, it can combine the voice interaction knowledge contained in the voice interaction resource with resource data such as scene data, history data, and user data, so that an active dialogue can be initiated when the user is not speaking, and silence can be chosen while the user is speaking.
According to the system provided by the embodiment of the invention, the dialogue processing is realized through natural language understanding, semantic rejection, semantic post-processing, dialogue management and natural language generation, so that the dialogue processing operation can understand the intention of a user and predict future dialogue contents to a certain extent according to voice interaction resources. In this process, the dialog processing component may execute according to a dialog context, which may also be supported by the voice interaction resources. In addition, the dialog processing component can provide reasoning functions to assist the voice interaction system in understanding, predicting, and deciding on recognition text, where reasoning can include spatial reasoning, temporal reasoning, common sense reasoning, computational policy application, or any form of reasoning that can be encoded; furthermore, the dialog processing component can track dialog states, manage dialog policies, change or conduct dialog topics based on user intent.
Based on any of the above embodiments, the dialog management sub-component is specifically configured to:
and determining the dialogue action corresponding to the interaction information based on the dialogue guiding algorithm and/or the beat control algorithm in the voice interaction resource and the dialogue management algorithm.
Specifically, the session management algorithm may include a session guidance algorithm or a beat control algorithm, and may also include both the session guidance algorithm and the beat control algorithm.
Under the dialogue management sub-component and the natural language generation sub-component, the tasks performed by the dialogue management and natural language generation algorithms can be further divided into six tasks: content determination, text structuring, sentence aggregation, lexicalization, referring expression generation, and linguistic realization. Content determination refers to deciding which information should be contained in the text to be generated, and text structuring refers to deciding the order in which that information is presented. Sentence aggregation refers to deciding which information to present in a single sentence, lexicalization refers to finding appropriate words or phrases to express the information, referring expression generation refers to selecting words and phrases that identify domain objects, and linguistic realization refers to combining all the words and phrases into a well-structured sentence.
The dialogue guiding algorithm is an algorithm that updates the scene and state of the current user using the scene-state semantics and the retrieved interaction information, and generates prompts from these data so as to guide the dialogue or open new topics.
The beat control algorithm is an algorithm that coordinates and controls the conversation rhythm according to the scene data, the speaker state (such as speaker type and mood), and the dialogue state (such as speech rate and intonation), making the human-machine conversation more natural. The beat control algorithm mainly covers the following capabilities: active interruption and silencing, emotion recognition and expression, dynamic interruption, emotional response, topic change, speech delay/disfluency, and asymmetric conversation. Speech delay/disfluency is used to simulate the hesitations common in real spoken conversation, thereby improving the realism of the voice interaction experience.
The system provided by the embodiment of the invention integrates dialogue guiding and rhythm control in the dialogue management sub-component, thereby achieving an active dialogue response when the user stops speaking and choosing silence, as a good listener would, when the user continues speaking.
Based on any of the above embodiments, the speech synthesis component is specifically configured to:
determining target voice attributes of a target speaker from the voice interaction resources;
Based on a speech synthesis algorithm, interactive synthesized speech corresponding to the interactive text and conforming to the target speech attribute is generated.
Specifically, speech synthesis refers to a technique of converting data into speech representing the content of that data. In the embodiment of the invention, the speech synthesis operation (TTS), which converts text to speech, can be understood as the reverse process of the speech recognition operation (ASR). The speech synthesis operation based on the interactive text can be divided into three tasks: text analysis, prosody analysis, and acoustic analysis. Text analysis extracts text features based on a phoneme dictionary and converts graphemes into phonemes; prosody analysis predicts prosodic features such as fundamental frequency, duration, pitch, intonation, and speech rate; acoustic analysis maps the text parameters to voice parameters. On this basis, speech synthesis can be realized through a vocoder, so as to obtain the interactive synthesized speech. In a specific implementation, the interactive synthesized speech can be synthesized by waveform concatenation or parametric synthesis. Waveform concatenation extracts suitable speech units from a corpus and concatenates them into sentences. Parametric synthesis requires parametric modeling of the phoneme dictionary and prediction of prosodic and acoustic parameters using machine learning methods.
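The three synthesis tasks and the vocoder stage can be sketched as a pipeline skeleton; every function body below is a placeholder rather than a real model, and the frame sizes and sample rate are assumptions.

```python
# Skeleton of the three speech-synthesis tasks (text analysis, prosody
# analysis, acoustic analysis) followed by a vocoder stage.
import numpy as np

def text_analysis(text: str) -> list[str]:
    # Grapheme-to-phoneme conversion would normally use a phoneme dictionary.
    return list(text.lower().replace(" ", "|"))

def prosody_analysis(phonemes: list[str], speech_rate: float = 1.0) -> dict:
    # Predict duration/pitch contours; fixed values scaled by speech rate here.
    return {"durations_ms": [80 / speech_rate] * len(phonemes),
            "pitch_hz": [200.0] * len(phonemes)}

def acoustic_analysis(phonemes: list[str], prosody: dict) -> np.ndarray:
    # Map text/prosody parameters to acoustic frames (zeros as a stand-in).
    n_frames = int(sum(prosody["durations_ms"]) // 10)
    return np.zeros((n_frames, 80), dtype=np.float32)   # e.g. mel-spectrogram

def vocoder(acoustic_frames: np.ndarray, sr: int = 16000) -> np.ndarray:
    # A real vocoder converts frames to a waveform; silence is returned here.
    return np.zeros(acoustic_frames.shape[0] * sr // 100, dtype=np.float32)

phonemes = text_analysis("play some music")
waveform = vocoder(acoustic_analysis(phonemes, prosody_analysis(phonemes, 1.2)))
print(len(phonemes), waveform.shape)
```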
In order to ensure the diversity of the interactive synthesized speech and enhance the flexibility and interest of voice interaction, parameters of the interactive synthesized speech such as the simulated speaker, pitch, and speech rate can be adjusted. The voice interaction resource can store the voice characteristics of a large number of preset speakers, so as to provide the user with virtual dialogue partners of different identities and roles. When performing interactive speech synthesis, the target voice attributes of a target speaker can be selected from the voice interaction resource, where the target speaker can be set by the user or be a device default, and speech synthesis can then be performed according to the target voice attributes of that speaker.
Further, the target voice attributes include at least one of the speech rate, pitch, intonation, and timbre of the target speaker.
The system provided by the embodiment of the invention can support speech synthesis in one or more languages and support the synthesis of continuous voice streams. The interactive synthesized speech can imitate the voice characteristics of the target speaker and carry that speaker's auditory characteristics. Moreover, various voice attributes of the speech synthesis, such as acoustic prosody, speech rate, pitch, and intonation, as well as various timbres such as male, female, and elderly voices, can be adjusted.
In addition, the speech synthesis component provided in the embodiment of the present invention should support the ability of speech synthesis in a plurality of different use scenarios, including:
a) At least one timbre;
b) Mixed reading of Chinese and English or other multi-languages;
c) When it is desired to add pauses to the generated speech, modal particles can be added to the interactive synthesized speech.
Based on any of the above embodiments, the interactive synthesized speech is natural speech with acceptable naturalness and intelligibility. The quality of the interactive synthesized speech can be measured using the mean opinion score (MOS). In the embodiment of the invention, the MOS of the interactive synthesized speech should be evaluated as above 4; further, the mean opinion score of the speech synthesis of the full duplex voice interaction system should be greater than or equal to 4.2. The MOS scoring rules are shown in Table 1:
TABLE 1 MOS quantization values of interactive synthesized speech and corresponding listening effects
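Independently of the quantization levels in Table 1, the MOS check itself can be illustrated as follows, with hypothetical listener ratings.

```python
# Simple illustration of the MOS check: average listener ratings (1-5 scale)
# over the synthesized test utterances and compare against the 4.2 target.
def mean_opinion_score(ratings: list[float]) -> float:
    return sum(ratings) / len(ratings)

listener_ratings = [4.5, 4.0, 4.4, 4.3, 4.2]   # hypothetical audition results
mos = mean_opinion_score(listener_ratings)
print(f"MOS = {mos:.2f}, meets 4.2 target: {mos >= 4.2}")
```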
Based on any of the above embodiments, the system further comprises at least one of a terminal computing resource, an edge computing resource, and a cloud computing resource.
Specifically, the terminal computing resources are the computing resources of the user terminal itself. The voice interaction system can call at least one of the terminal computing resources, the edge computing resources, and the cloud computing resources to realize voice interaction, thereby improving the full duplex voice interaction capability. Each component in the voice interaction system can be deployed on the computing resource that suits its functional requirements; that is, the components can be deployed on the same computing resource or on different computing resources respectively, which is not specifically limited in the embodiment of the invention. For example, components that implement functions such as voice recognition, dialogue management, and text synthesis can be processed using cloud computing resources.
Based on any of the above embodiments, the system further comprises an edge computing resource, a cloud computing resource, and a transmission component;
the edge computing resource is used for providing computing resources for the acoustic acquisition component and the voice synthesis component, and the cloud computing resource is used for providing computing resources for the voice recognition component and the dialogue processing component;
the transmission component is used for transmitting the user voice stream at the edge computing resource to the cloud computing resource for the voice recognition component to operate, and is also used for transmitting the interactive text at the cloud computing resource to the edge computing resource for the voice synthesis component to operate.
Specifically, each functional component used for voice interaction can be allocated the computing resources best suited to its functional requirements, so as to realize the voice interaction function. In particular, where edge computing resources and cloud computing resources coexist, the acoustic acquisition component and the voice synthesis component can be deployed on the edge computing resources, and the voice recognition component and the dialogue processing component can be deployed on the cloud computing resources, following the voice interaction functional component deployment illustrated in fig. 7, thereby providing more powerful computing support for voice recognition and dialogue processing.
Under the condition that the voice interaction system itself bears the edge computing function, after the voice interaction system acquires the user voice stream, the user voice stream can be sent to a cloud providing cloud computing resources through a transmission component, the cloud executes voice recognition operation and dialogue processing operation aiming at the user voice stream based on the voice interaction resources, a voice recognition algorithm and a dialogue processing algorithm, so that an interaction text of the user voice stream is obtained, and the interaction text is fed back to the voice interaction system through the transmission component.
After that, the local providing the edge computing resource can receive the interactive text returned by the cloud through the transmission component, and after the interactive text is obtained, the interactive synthesized voice corresponding to the interactive text is generated by utilizing the local computing resource, namely the edge computing resource.
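A conceptual sketch of this edge/cloud split is given below; the HTTP endpoint, payload format, and response field are assumptions, not a defined interface of the invention.

```python
# Conceptual sketch of the edge/cloud split: the edge side captures audio and
# synthesizes speech locally, while recognition and dialogue processing run on
# a cloud endpoint reached through the transmission component.
import json
from urllib import request

CLOUD_ENDPOINT = "https://cloud.example.com/fdx/dialog"   # hypothetical URL

def send_voice_stream_to_cloud(audio_chunk: bytes) -> str:
    """Transmit captured audio upstream; the cloud returns interaction text."""
    req = request.Request(CLOUD_ENDPOINT, data=audio_chunk,
                          headers={"Content-Type": "application/octet-stream"})
    with request.urlopen(req, timeout=5) as resp:           # network required
        return json.loads(resp.read())["interaction_text"]

def synthesize_on_edge(interaction_text: str) -> bytes:
    """Placeholder for the local (edge) speech synthesis component."""
    return interaction_text.encode("utf-8")   # stand-in for waveform bytes

# Typical round trip (requires a reachable endpoint, so shown but not run):
# text = send_voice_stream_to_cloud(captured_audio)
# audio_out = synthesize_on_edge(text)
```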
Based on any of the above embodiments, the voice interaction knowledge includes declarative knowledge and procedural knowledge;
the resource data includes scene data, user data, and history data.
In particular, a voice interaction system, and in particular a dialog processing component in a voice interaction system, needs to be performed in dependence on voice interaction knowledge. The voice interaction knowledge, i.e. knowledge information for realizing voice interaction, may be pre-stored, or may be continuously updated during the voice interaction.
Further, the voice interaction knowledge can be divided into declarative knowledge and procedural knowledge. Declarative knowledge is information about "what"; declarative information is easily expressed verbally and translated into statements, and can therefore be regarded as explicit knowledge. Procedural knowledge is information about "how to do", which is often difficult to verbalize and describe, and can therefore be regarded as implicit knowledge. For example, "the author of Water Margin is Shi Nai'an" is declarative knowledge, whereas "how children speak" is procedural knowledge.
It should be noted that in the field of artificial intelligence (AI), procedural knowledge can be considered a kind of intelligent program that describes knowledge about the implementation of AI. The intelligent program includes many different programs that the AI system can execute, and the voice interaction system can be authorized to invoke the AI system capable of executing the procedural knowledge to implement a particular AI function. In this process, the voice interaction system can use the AI function without paying attention to its specific execution flow, and can apply procedural knowledge effectively simply by relying on the AI system.
In addition, the resource data are the resources related to the voice interaction task. The scene data reflect data related to the scene to which the voice interaction task belongs; for example, in a car driving scene the scene data can cover characteristic speech for navigation and the specific mode setting parameters of different navigation modes, such as a clear-road mode, a general navigation mode, and a highway navigation mode. The user data can reflect information about the user participating in the voice interaction task, such as user identity information, user preferences, and the user's historical settings. The history data can reflect data already generated in the current voice interaction task, or data generated in voice interaction tasks preceding the current one.
In addition, the resource data may further include service data, where the service data may reflect data of a service related to a voice interaction task, for example, when the voice interaction task is a navigation task, the service data may cover map resources, congestion data of a specific line, and the like, and when the voice interaction task is a home service, for example, the service data may cover executable functions and specific control parameters of each intelligent device capable of being controlled in a coordinated manner.
Based on any of the above embodiments, the voice interaction system further comprises:
and the display component is used for displaying at least one of the identification text, the interactive text and the voice interaction state.
The display component can be a display screen of a smart phone, an intelligent household appliance and the like, or can be a display screen which is additionally connected with a hardware processor of the voice interaction system, such as a display, a television and the like.
In the voice interaction process, considering the influence of interfering factors such as user accents and environmental noise, the recognition text cannot always reflect the content of the user voice stream with absolute accuracy. The display component can display the recognition text output by the voice recognition component so that the user can adjust it; by displaying the recognition text, the user can monitor and control the voice interaction to a certain extent, and the user experience is protected from interaction content that deviates from what the user intended.
In addition, in order to avoid the user not hearing or not understanding the interactive synthesized speech played by the voice broadcast component, the display component can also display the interactive text corresponding to the interactive synthesized speech, so that the user obtains clearer interaction information.
Moreover, the display component can also display the voice interaction state, and can remind the user that the voice interaction is running in a prompting mode of icons displayed on a screen, so that the user can conveniently confirm whether the voice interaction is started or not in real time, the privacy of the user is prevented from being eavesdropped and revealed, and the privacy safety of the user is ensured.
Based on any of the above embodiments, the voice interaction system further comprises:
an operation input component, used for receiving an interactive operation input by the user, where the interactive operation is used for at least one of: switching the voice interaction state, correcting the recognition text, setting the voice interaction language, setting the preset trigger word, and setting the target voice attributes of the target speaker.
Specifically, the operation input component may include a touch screen with both display and input functions, and may also include input devices in the form of physical hardware, such as physical keys, physical switches, and physical keyboards. The various interactive operations mentioned above may be implemented by the same device in the operation input component or by different devices therein, which is not specifically limited in the embodiment of the invention.
For example, the user can switch the voice interaction state through the operation input component, for example from on to off or from off to on, thereby achieving flexible control of the voice interaction state and avoiding the risk of privacy leakage caused by recording ambient sound when the user does not need voice interaction.
For another example, after the user sees the recognition text or hears the synthesized speech of the recognition text, the user can determine whether the current recognition text differs from what the user intended to express, i.e. whether the recognition text needs to be adjusted. If so, the user can modify the recognition text through the operation input component, for example by tapping the part of the recognition text on the screen that needs to be adjusted, so that the voice interaction system receives the user's modification operation; in response to the modification operation, the voice interaction system modifies the recognition text and applies the modified recognition text to the subsequent dialogue processing operation and speech synthesis operation.
For another example, the user can set the voice interaction language through the operation input component so that voice interaction is carried out in the language the user requires, or set the preset trigger word so that voice interaction is triggered by a trigger word of the user's choice, or set the target voice attributes of the target speaker so that the interactive synthesized speech broadcast by the voice interaction system matches those attributes and meets the user's interaction needs.
Based on any of the above embodiments, the voice interaction system further comprises:
and the switch component is used for switching the voice interaction state.
Specifically, the voice interaction system may be provided with a dedicated switch component to realize the switching of the voice interaction state, where the switch component may be a virtual switch, i.e. a switch icon displayed on a screen of the voice interaction system or a screen connected to the voice interaction system, or may be a physical switch, i.e. a switch in the form of a button or a dial plate, etc. disposed on the voice interaction system.
The on-off state of the voice interaction switch directly determines the voice interaction state, which can be on or off. Only when the voice interaction state is on is the voice interaction method of the above embodiments executed. If the voice interaction state is off, i.e. the user has turned off the voice interaction switch, the voice interaction system will not collect the user voice stream and the subsequent voice interaction process will not be executed, thereby preventing the user's privacy from being eavesdropped on or leaked and ensuring the privacy security of the user.
Based on any of the above embodiments, in a half duplex interaction system, the system is silent while the user inputs a voice signal or other input information, and a series of processes such as voice recognition, semantic understanding, dialogue generation, and speech synthesis are performed only after the user's input is complete. While the system is processing the voice signal or other input information, the system's receiving device is in a silent state: the system does not accept any user input at that time, and the user has to wait until the system has finished processing the previous voice interaction before it will again accept a voice signal or other input. It follows that a half duplex voice interaction system is characterized by the person and the system only being able to communicate in one direction at a time, similar to a turn-taking conversation.
In contrast to a half duplex interaction system, a full duplex interaction system allows the person and the system to communicate at the same time. FIG. 8 is a functional view of the voice interaction system according to the present invention. As shown in FIG. 8, the functional architecture of the voice interaction system is composed of several layers and components, the layers including an interaction layer, a knowledge and data resource layer, a base layer, and an AI and machine learning layer, where a layer refers to an aggregate of units performing a broad class of functional capabilities. The interaction layer comprises the acoustic acquisition component 120, the voice recognition component 130, the dialogue processing component 140, and the voice synthesis component 150. The knowledge and data resource layer comprises two parts, voice interaction knowledge and resource data, where the voice interaction knowledge forms a knowledge base and the resource data corresponds to scene data, history data, and user data.
The layers described above can be described in terms of their inputs, outputs, and intent or function, and each layer and its components can be used and tested individually. All layers can be integrated so that a user can converse with the voice interaction system and be helped to meet their needs. The full duplex voice interaction system can approach the level of conversation between people and can maintain the continuity of the dialogue, so that after the user has had two or more rounds of dialogue with the system, the system still maintains good engagement with the user and its replies still satisfy the user's needs.
The main function of the interaction layer is to recognize the input signal as plain text through the acoustic acquisition component 120 and the voice recognition component 130, understand the real intent of the input signal through the dialogue processing component 140 and generate an interactive reply text, and finally output the synthesized speech audio of the interactive reply text through the voice synthesis component 150. The knowledge and data resource layer mainly provides the necessary data resources and knowledge base for the interaction layer. The AI and machine learning layer mainly uses machine-learning-based AI methods for data processing, model training, and continuous optimization, providing the interaction layer with capabilities such as model inference, online data mining, and data analysis. The base layer comprises cloud services, terminals, and edge computing; it provides hardware computing resources, serves as the operating carrier for the AI and machine learning algorithms, and mainly guarantees the network calls and system stability of each module during full duplex voice interaction. Further, the base layer provides FDX voice interaction capabilities using cloud services and/or terminal and/or edge computing, where components such as voice recognition, dialogue management, and text synthesis can be handled by cloud services. In some embodiments, the dialogue processing component 140 may include a natural language understanding sub-component, a semantic rejection sub-component, a semantic post-processing sub-component, a dialogue management sub-component, and a natural language generation sub-component.
The natural language understanding sub-component can acquire entity information, intention information and the like in the recognition text obtained by the voice recognition component based on the voice interaction resource and the natural language understanding algorithm; the semantic rejection sub-component can reject the semantics based on the entity information and the intention information obtained by the natural language understanding sub-component, eliminate the semantic intention (such as boring information and the like) irrelevant to the current dialogue, and obtain effective entity information, intention information and the like, namely effective text semantics; the semantic post-processing sub-component can convert the valid text semantics into service requests based on the valid text semantics obtained by the semantic rejecting sub-component; the dialogue management sub-component can generate interaction information and interaction intention corresponding to the interaction information based on the service request, the voice interaction resource and a dialogue management algorithm; the natural language generation sub-component may derive interaction text based on the interaction information and the interaction intent.
The voice interaction system is executed under the voice interaction task, the voice interaction task can cover one or more tasks, and each task can be logically designed by using a traditional software engineering method, wherein the tasks comprise scene definition, environment characteristics, input and output, functional units, databases, data streams and the like. Also, each task here should be well defined and have a well-defined domain, and each task should reflect the needs of the user.
The acoustic acquisition component, the voice recognition component, the dialogue processing component, and the voice synthesis component covered by the interaction layer all play a core role in voice interaction, and these components can meet additional user requirements through added or modified functions. Although the above components have a temporal relationship during voice interaction, some components may interact with others at the same time. For example, the semantic-based voice endpoint detection and irrelevant content rejection in the voice recognition component also invoke the natural language understanding and semantic processing in the dialogue processing component, while the voice triggering function based on the voice recognition component is mainly applied to the acoustic acquisition component.
The knowledge and data contained in the voice interaction resource layer are extremely important for training, testing and operating all components below the functional component layer and the voice interaction whole. Also, the voice interaction system may "learn continuously", i.e. at run-time, its input data and resulting actions are also used to update the database held by the voice interaction system.
In the AI and machine learning layers, artificial intelligence and machine learning systems can be used for knowledge acquisition and model training, validation and verification to support the functional implementation of the functional components below the functional component layer. And under the cloud service, the terminal and the edge computing layer, network resources with high bandwidth and low delay are required to perform data transmission, so that the realization of real-time voice interaction is supported.
In addition, the utilization of each piece of infrastructure in the computing infrastructure layer needs to remain stable during voice interaction. To ensure good voice interaction performance, LSTM-RNNs and DNNs can be used for acoustic modeling, encoder-decoder structures for language modeling, CNNs for beamforming, and Bidirectional Encoder Representations from Transformers (BERT) for natural language understanding.
The voice interaction system can be operated by a user through natural voice language, and can also allow the user to operate through other modes such as gestures and actions. Moreover, the voice interaction function is only required to be triggered when interaction starts, and repeated triggering is not required in the specific interactive call process. In addition, functions, languages, information, etc. involved in voice interaction can be set by the user.
Based on any of the above embodiments, the above full duplex voice interaction system may be implemented based on an engineering flow, where the engineering flow is for stakeholders such as designers, enforcers, and verifiers, where the designers refer to entities that receive data and problem descriptions and create AI models, the enforcers refer to entities that receive AI models and specify calculations to be performed, and the verifiers refer to entities that verify whether the calculations are being performed and whether the AI models are being performed by design.
The voice interaction may vary by task, scene, add-on device, and embedding method, which may affect the process phase. For example, most models and functions in a full duplex voice interactive system can be trained using the ML algorithm, but require multiple iterative improvements to achieve acceptable levels of accuracy and reliability. Thus, testing and verification of voice interaction systems embedded in ML methods (especially deep learning) can be challenging compared to traditional rule-based methods (programmed to an understandable way according to requirements and specifications).
The engineering implementation flow begins when stakeholders decide to turn an idea into a tangible system. In the initial stage, the stakeholders should determine why a full duplex voice interaction system needs to be developed, what kinds of problems it can solve, what scenarios it should adapt to, and which customer needs or business opportunities it addresses. These questions can be answered through market research and analysis, and stakeholders with different expertise can help determine requirements and costs.
Then the design and development phase is entered: the work at this stage is to create the full duplex voice interaction system and end up with software or hardware ready for deployment, including an APP, an SDK (Software Development Kit), or SaaS (Software-as-a-Service). At this stage, and particularly before its end, the stakeholders should ensure that the full duplex voice interaction system achieves the goals, requirements, and other objectives determined in the initial stage.
The verification and authentication phase is then entered: the work at this stage is to check whether the full duplex voice interactive system starting from the design and development stage works as required and meets the objective.
Then enter the deployment, operation and monitoring phases: the work at this stage includes installing and configuring the full duplex voice interactive system in the target environment. The system installed and configured herein should be usable. This stage requires monitoring of the operational status and faults of the system and reporting to stakeholders to take action.
Then enter the re-evaluation phase: this stage should evaluate the results of the operational monitoring based on the targets and requirements determined for the full duplex voice interactive system. Once a problem is found, the target and requirements should be refined.
Finally, the retirement stage may be entered: in some cases, full duplex voice interactive systems may be outdated or unusable due to significant changes in the operating environment or technology, and stakeholders should consider retirement and discard or redevelopment.
Based on any of the above embodiments, fig. 9 is a schematic flow chart of a voice interaction method provided by the present invention, and as shown in fig. 9, the method can be applied to a voice interaction system, such as a smart phone, a smart home appliance, a smart assistant APP, a customer service robot, and the like. The method comprises the following steps:
step 910, collecting user voice stream after voice interaction awakening;
specifically, the user voice stream, that is, the voice data stream obtained during the voice interaction process, is obtained by recording in real time, and may specifically be obtained by recording voice, or may be obtained by recording video, and may be a single microphone or may be a microphone array including multiple microphones for recording the user voice stream, which is not particularly limited in the embodiment of the present invention.
It should be noted that the user voice stream here may be a voice data stream recorded by the user for voice interaction, for example a wake-up voice data stream for waking up the voice interaction, a voice data stream querying specific information after wake-up, or a voice data stream recorded when the user interrupts the speech being played by the voice interaction system, which is not specifically limited in the embodiment of the invention.
Here, the voice interaction wake-up may be triggered by the user operating a voice control on the UI interface, or by the user speaking a preset wake-up word (e.g., "Xiaofei"), which is not specifically limited in the embodiment of the invention. It can be understood that the voice interaction system can complete an entire conversation and realize multiple interactions after a single voice interaction wake-up; that is, the voice interaction system only needs to be woken up once at the beginning of a conversation, after which the user can keep speaking. The voice interaction system collects the user voice stream for a period of time after the wake-up, so that each subsequent step operates over that period and multiple continuous dialogues are carried out, achieving the effect of one wake-up, multiple interactions. The specific length of this period depends on the actual interaction between the user and the voice interaction system and can be understood as the duration of the entire conversation; for example, the voice interaction system may collect a user voice stream containing valid speech while or after broadcasting the interactive synthesized speech, i.e. the user interrupts the broadcast or continues the conversation after hearing it, so the collection of the user voice stream after wake-up can be extended until the entire conversation ends.
Whether the entire conversation has ended can be judged by whether a user voice stream containing valid speech is collected within a preset time. For example, after the voice interaction system broadcasts the interactive synthesized speech, if no user voice stream containing valid speech is collected within the preset time, the conversation can be considered finished and collection can stop. Alternatively, the end of the conversation can be judged using two preset times: after the voice interaction system broadcasts the interactive synthesized speech, if no user voice stream containing valid speech is collected within a first preset time, the voice interaction system can generate and broadcast an interactive synthesized speech for active interaction to guide the user to continue the conversation; after that active broadcast ends, if no user voice stream containing valid speech is collected within a second preset time, the conversation can be considered finished and collection can stop.
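The two-timeout logic can be sketched as follows; the timeout values and prompt text are assumptions.

```python
# Sketch of the two-timeout session logic: after a reply is broadcast, silence
# for the first timeout triggers an active prompt, and silence for a second
# timeout ends the dialogue and stops collection.
FIRST_TIMEOUT_S = 8.0     # assumed values; real systems tune these
SECOND_TIMEOUT_S = 8.0

def run_session(wait_for_valid_speech):
    """wait_for_valid_speech(timeout) returns recognized text or None."""
    while True:
        user_text = wait_for_valid_speech(FIRST_TIMEOUT_S)
        if user_text is None:
            print("system: anything else I can help with?")   # active prompt
            user_text = wait_for_valid_speech(SECOND_TIMEOUT_S)
            if user_text is None:
                print("system: ending session, stopping collection")
                return
        print(f"system: handling '{user_text}'")

# Toy driver: one utterance, then silence until both timeouts expire.
utterances = iter(["what's the weather tomorrow"])
run_session(lambda timeout: next(utterances, None))
```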
Moreover, voice interaction wake-up is only required to be triggered at the beginning of a dialogue, and no trigger is required in the dialogue process.
Step 920, performing voice endpoint detection and voice recognition on the user voice stream based on voice interaction resources, to obtain a recognition text of the user voice stream, wherein the voice endpoint detection is performed based on semantics of the user voice stream, and the voice interaction resources comprise voice interaction knowledge and resource data related to voice interaction tasks;
Step 930, performing natural language understanding, dialogue management and natural language generation on the recognition text based on the voice interaction resource to obtain an interaction text of the recognition text;
step 940, performing a speech synthesis operation on the interactive text to obtain an interactive synthesized speech of the interactive text;
specifically, for real-time interaction of user voice streams, the method can be divided into three stages of voice recognition operation, dialogue processing operation and voice synthesis operation, and execution of the three operation stages is realized on the basis of voice interaction resources.
The voice interaction resources herein, namely, data resources for supporting real-time voice interaction execution, can be specifically divided into two types of voice interaction knowledge and resource data:
the voice interaction knowledge, that is, knowledge information for realizing voice interaction, may be used to provide solutions to questions posed by a user in voice interaction, and may also provide descriptions for program tasks in voice interaction, which is not specifically described in the embodiments of the present invention.
The resource data are the resources related to the voice interaction task, and different voice interaction tasks can be associated with different resource data. Each time voice interaction is executed, an explicit voice interaction task is set; the voice interaction task reflects the specific goals and requirements that the voice interaction needs to address, and it can change with the scene type and the user requirement. For example, the voice interaction task may be telephony, navigation, home service, chat, and the like. The resource data related to the voice interaction task can reflect the resource information required to solve the specific problem addressed by the voice interaction task; for example, for a voice interaction task in an automobile driving scene, the related resource information may include the location areas of interest to the user, map resources, statement information for route queries, and the like.
Real-time interaction with the user voice stream can be realized by combining the voice interaction resources. Specifically, the recognition text of the user voice stream can be obtained through the voice recognition operation in combination with the voice interaction resources; the interactive text corresponding to the recognition text can then be obtained through the dialogue processing operation in combination with the voice interaction resources, the interactive text being the text-form information fed back to the user; finally, in combination with the voice interaction resources, the voice corresponding to the interactive text can be synthesized through the voice synthesis operation, so as to obtain the interactive synthesized voice.
In this process, the voice recognition operation, the dialogue processing operation and the voice synthesis operation may each be realized by calling a separate functional component; for example, a voice recognition component, a dialogue processing component and a voice synthesis component may be preset to realize the voice recognition operation, the dialogue processing operation and the voice synthesis operation respectively. Each functional component can be used and tested independently, and the functional components can also be integrated together to realize real-time interaction, as sketched below.
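The sketch below is purely illustrative: it assumes three placeholder callables standing in for the voice recognition, dialogue processing and voice synthesis components, and the class and parameter names are not prescribed by the embodiment.

```python
from typing import Callable

class VoiceInteractionPipeline:
    """Illustrative wiring of independently testable components into one pipeline."""

    def __init__(self,
                 recognize: Callable[[bytes], str],
                 process_dialogue: Callable[[str], str],
                 synthesize: Callable[[str], bytes]):
        self.recognize = recognize                # voice recognition component
        self.process_dialogue = process_dialogue  # dialogue processing component
        self.synthesize = synthesize              # voice synthesis component

    def run_turn(self, user_voice_stream: bytes) -> bytes:
        recognition_text = self.recognize(user_voice_stream)        # stage 1: voice recognition
        interaction_text = self.process_dialogue(recognition_text)  # stage 2: dialogue processing
        return self.synthesize(interaction_text)                    # stage 3: voice synthesis

# Each component can also be exercised on its own, e.g. with trivial stand-ins:
pipeline = VoiceInteractionPipeline(
    recognize=lambda audio: "what is the weather in city A",
    process_dialogue=lambda text: "City A is sunny today.",
    synthesize=lambda text: text.encode("utf-8"))
```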
Further, in step 920, voice endpoint detection and voice recognition may be implemented by a streaming voice recognition algorithm for recognizing a continuous voice stream and a semantics-based voice endpoint detection algorithm. The purpose of the semantics-based voice endpoint detection algorithm is to identify and remove silence periods from the user voice stream; compared with traditional acoustic VAD, semantics-based VAD combines semantic features when distinguishing speech from non-speech, which can further improve the reliability of voice endpoint detection. In addition, in step 920, an operation of rejecting irrelevant content may be added. Irrelevant content rejection can be implemented by an irrelevant content rejection algorithm, whose purpose is to distinguish and reject content in the user voice stream that cannot or should not be processed. Such content is generally independent of the interaction task and the dialogue topic or context, and may also include invalid speech. A disambiguation effect can also be achieved by the irrelevant content rejection algorithm.
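The following is a simplified sketch of how a semantics-based endpoint decision might differ from purely acoustic VAD: an endpoint is declared only when acoustic silence coincides with a semantically complete partial result. The predicate is_semantically_complete and the callable partial_text_fn are hypothetical stand-ins for the streaming recognizer and the semantic model, and the thresholds are arbitrary.

```python
import array

def is_silent(frame: bytes, energy_threshold: float = 500.0) -> bool:
    """Crude acoustic check: mean absolute amplitude of 16-bit PCM samples."""
    samples = array.array("h", frame)
    if not samples:
        return True
    return sum(abs(s) for s in samples) / len(samples) < energy_threshold

def detect_semantic_endpoint(frames, partial_text_fn, is_semantically_complete,
                             max_silent_frames: int = 20) -> bool:
    """Declare an endpoint only when a run of silent frames coincides with a
    semantically complete partial recognition result; a pause inside an
    unfinished sentence (e.g. 'I want to listen ...') is therefore ignored."""
    silent_run = 0
    for frame in frames:
        silent_run = silent_run + 1 if is_silent(frame) else 0
        if silent_run >= max_silent_frames:
            partial_text = partial_text_fn()  # current streaming recognition result
            if partial_text and is_semantically_complete(partial_text):
                return True
            silent_run = 0  # semantically incomplete: keep listening
    return False
```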
The natural language understanding, dialogue management and natural language generation in step 930 may be implemented by a natural language understanding algorithm, a dialogue management algorithm and a natural language generation algorithm, respectively. The natural language understanding is used to understand the semantics of the recognition text; the dialogue management takes the current dialogue state and context as input, updates the state of the dialogue, and generates the dialogue action to be carried out according to the dialogue processing logic; and the natural language generation is used to generate the natural language text corresponding to the dialogue action, that is, the interactive text. In addition, in step 930, operations of semantic rejection and semantic post-processing may also be added, whereby natural language understanding and semantic rejection are used to understand the semantics of the recognition text and screen out irrelevant semantics, while semantic post-processing, dialogue management and natural language generation are used to organize the dialogue transitions according to the semantics of the recognition text, thereby generating suitable natural language text, i.e., the interactive text.
The speech synthesis in step 940 relies on a speech synthesis algorithm, by which the interactive text is converted into interactive synthesized speech, thereby enabling voice interaction with the user.
Step 950, broadcasting the interactive synthesized voice;
the collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
Specifically, after the interactive synthesized voice is obtained, it can be broadcast. It should be noted that, during the broadcasting of the interactive synthesized voice, the recording and collection of the user voice stream are not interrupted. Continuously collecting the user voice stream ensures that user input can still be captured while the interactive synthesized voice is being broadcast, and that real-time interactive processing and feedback can be performed on a user voice stream that interrupts the broadcast. In this way, the user can keep speaking as continuous input, the interaction between the user and the machine comes closer to actual conversational habits, the user can talk freely, and the user can interrupt at any time during the voice interaction.
To ensure that the broadcasting of the interactive synthesized voice and the collection of the user voice stream can indeed be executed simultaneously, they can be realized through two physical channels: an uplink channel is used for collecting the user voice stream, i.e. transmitting the collected user voice stream from the user to the voice interaction system, and a downlink channel is used for broadcasting the interactive synthesized voice, i.e. transmitting the interactive synthesized voice from the voice interaction system to the user. The two channels should be able to operate simultaneously without interfering with each other, giving the voice interaction system the ability to "listen" while it "talks".
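A minimal concurrency sketch of the two channels follows, assuming hypothetical capture_frame and play_frame callables for the microphone and loudspeaker sides; it only illustrates that collection and broadcasting run in parallel without blocking each other, and does not represent the actual channel implementation.

```python
import queue
import threading

uplink = queue.Queue()    # user voice stream: user -> voice interaction system
downlink = queue.Queue()  # interactive synthesized voice: system -> user

def uplink_worker(capture_frame, stop: threading.Event):
    """Keep collecting the user voice stream, even while broadcasting is in progress."""
    while not stop.is_set():
        uplink.put(capture_frame())  # hypothetical microphone capture of one audio frame

def downlink_worker(play_frame, stop: threading.Event):
    """Keep broadcasting whatever interactive synthesized voice has been queued."""
    while not stop.is_set():
        try:
            play_frame(downlink.get(timeout=0.1))  # hypothetical loudspeaker playback
        except queue.Empty:
            continue

def start_duplex(capture_frame, play_frame):
    stop = threading.Event()
    workers = [threading.Thread(target=uplink_worker, args=(capture_frame, stop), daemon=True),
               threading.Thread(target=downlink_worker, args=(play_frame, stop), daemon=True)]
    for worker in workers:
        worker.start()
    return stop  # setting this event ends the duplex session
```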
In the method provided by the embodiment of the present invention, the collection of the user voice stream and the broadcasting of the interactive synthesized voice are executed through different physical channels, so that user input can still be collected while the interactive synthesized voice is being broadcast, and a user voice stream that interrupts the broadcast can still receive real-time interactive processing and feedback. This ensures that the user can keep speaking as continuous input, that the interaction between the user and the machine comes closer to actual conversational habits, that the user can talk freely and interrupt at any time during the voice interaction, and thus that the naturalness of the voice interaction is guaranteed.
Furthermore, by combining the voice interaction resources with the voice recognition operation, the dialogue processing operation and the voice synthesis operation to perform real-time voice interaction processing, the real-time voice interaction under a voice interaction task can meet increasingly complex and diversified human-machine interaction requirements. In particular, because the voice data stream is used, the user input is continuous and the voice interaction can refer to the context of the dialogue, thereby ensuring the continuity of the voice interaction.
Based on any of the above embodiments, step 950 further includes:
And if the user voice stream is collected during the broadcasting of the interactive synthesized voice, stopping broadcasting the interactive synthesized voice until updated interactive synthesized voice is obtained, or continuing to broadcast the interactive synthesized voice until updated interactive synthesized voice is obtained.
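A small sketch of the two broadcasting policies above (stop on barge-in, or finish the current broadcast and switch once the updated speech is ready), assuming a hypothetical play_chunk callable and chunked synthesized audio; the class and method names are illustrative only.

```python
import threading

class InterruptibleBroadcaster:
    """Illustrative handling of user barge-in during broadcasting."""

    def __init__(self, play_chunk, stop_on_barge_in: bool = True):
        self.play_chunk = play_chunk              # hypothetical playback of one audio chunk
        self.stop_on_barge_in = stop_on_barge_in  # policy: stop now vs. finish current speech
        self._barge_in = threading.Event()

    def notify_user_speech(self):
        """Called from the uplink side when a user voice stream is collected mid-broadcast."""
        self._barge_in.set()

    def broadcast(self, synthesized_chunks):
        for chunk in synthesized_chunks:
            if self._barge_in.is_set() and self.stop_on_barge_in:
                return  # stop; updated interactive synthesized voice will be broadcast later
            self.play_chunk(chunk)

    def broadcast_updated(self, updated_chunks):
        self._barge_in.clear()
        self.broadcast(updated_chunks)
```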
Based on any of the above embodiments, the method further comprises:
under the condition that the user voice stream is not collected, performing dialogue management and natural language generation based on the voice interaction resource to obtain an active interaction text;
and executing voice synthesis operation on the active interaction text to obtain the interaction synthesized voice of the active interaction text.
Specifically, the voice interaction system should be able to predict the user's intention to a certain extent according to the user's state and scene, control the rhythm of the dialogue, and actively give feedback and information to guide the user's next action. When no user voice stream is collected, in particular when the user does not speak for a period of time, the system can perform dialogue management and natural language generation in combination with the knowledge base, scene data, historical data and user data in the voice interaction resources, so that a timely and reasonable active dialogue can be carried out in combination with the previous response context and common knowledge.
It will be appreciated that the interactive synthesized voice of the active interaction text may also be broadcast after synthesis.
Based on any of the above embodiments, steps 920 and 930 include:
sending the user voice stream to a cloud end providing cloud computing resources to request the cloud end to execute voice endpoint detection and voice recognition on the user voice stream based on the voice interaction resources to obtain a recognition text of the user voice stream, and executing natural language understanding, dialogue management and natural language generation on the recognition text based on the voice interaction resources to obtain an interaction text of the recognition text;
and receiving the interactive text returned by the cloud.
Specifically, in order to provide stronger computing-power support for the speech recognition and dialogue processing operations, in the case where the voice interaction system itself bears the edge computing function, after the voice interaction system collects the user voice stream, it can send the user voice stream through a transmission component to a cloud providing cloud computing resources. The cloud executes the voice recognition operation and the dialogue processing operation on the user voice stream based on the voice interaction resources, a voice recognition algorithm and a dialogue processing algorithm, so as to obtain the interactive text of the user voice stream, and feeds the interactive text back to the voice interaction system through the transmission component.
After that, the local side providing the edge computing resources can receive the interactive text returned by the cloud through the transmission component, and after the interactive text is obtained, the interactive synthesized voice corresponding to the interactive text is generated using the local computing resources, namely the edge computing resources.
Based on any of the above embodiments, step 910 includes:
and collecting a user voice stream based on the microphone array, and carrying out voice preprocessing on the user voice stream, wherein the voice preprocessing comprises at least one of voice enhancement, dereverberation, sound source extraction and separation, echo cancellation, sound source positioning and denoising.
Based on any of the above embodiments, step 910 specifically includes:
performing speaker localization based on the microphone array, determining a speaker localization result;
and carrying out directional voice pickup based on the speaker positioning result to obtain a user voice stream, and carrying out voice preprocessing on the user voice stream.
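By way of illustration only, speaker localization can be pictured with a deliberately simplified two-microphone time-difference-of-arrival estimate (the embodiment itself does not prescribe any particular array geometry or algorithm); the sketch assumes far-field conditions, NumPy, and synchronized single-channel signals from the two microphones.

```python
import numpy as np

SPEED_OF_SOUND_M_S = 343.0

def estimate_speaker_angle(mic_a: np.ndarray, mic_b: np.ndarray,
                           sample_rate: int, mic_distance_m: float) -> float:
    """Estimate the speaker direction (radians from broadside) for a microphone
    pair: cross-correlate the two signals, take the best-matching lag as the
    time difference of arrival, and convert it to an angle."""
    correlation = np.correlate(mic_a, mic_b, mode="full")
    lag_samples = int(np.argmax(correlation)) - (len(mic_b) - 1)
    tdoa_s = lag_samples / sample_rate
    # Clamp to the physically possible range before taking the arcsine.
    sin_theta = np.clip(tdoa_s * SPEED_OF_SOUND_M_S / mic_distance_m, -1.0, 1.0)
    return float(np.arcsin(sin_theta))
```

The estimated angle could then be used to steer the directional pickup toward the target speaker before the voice preprocessing described above.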
Based on any of the above embodiments, the microphone array is specifically configured to collect a user voice stream in a near field and/or a far field;
the distance of the near field acquisition is within 1 meter, and the distance of the far field acquisition is more than 3 meters and less than 5 meters.
Based on any of the above embodiments, the array structure of the microphone array is at least one of a linear form, a planar form, and a spatial stereo form.
Based on any of the above embodiments, step 910 includes:
and under the condition that the voice interaction state is on, collecting the voice stream of the user.
Based on any of the above embodiments, step 920 includes:
performing voice endpoint detection on the user voice stream based on the voice interaction resource and a voice endpoint detection algorithm based on semantics to obtain an active voice stream;
performing voice recognition on the active voice stream based on the voice interaction resource and a streaming voice recognition algorithm to obtain a preliminary recognition text of the active voice stream;
and performing content rejection on the preliminary recognition text based on the voice interaction resource to obtain the recognition text of the user voice stream.
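Purely as an illustration, the content rejection step might be sketched as a filter over the preliminary recognition text; the predicate is_meaningful and the task_keywords set are hypothetical stand-ins for the semantic classifier and for the task/context information drawn from the voice interaction resources.

```python
from typing import Callable, Optional, Set

def reject_irrelevant_content(preliminary_text: str,
                              task_keywords: Set[str],
                              is_meaningful: Callable[[str], bool]) -> Optional[str]:
    """Keep the preliminary recognition text only if it is meaningful and
    related to the current interaction task; otherwise reject it."""
    text = preliminary_text.strip()
    if not text or not is_meaningful(text):
        return None  # invalid speech is rejected
    if task_keywords and not any(keyword in text for keyword in task_keywords):
        return None  # unrelated to the interaction task and dialogue topic
    return text      # this becomes the recognition text of the user voice stream
```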
Based on any of the above embodiments, step 920 further includes:
and if the recognition text contains a preset trigger word, continuously executing voice recognition on the user voice stream based on the voice interaction resource.
Based on any of the above embodiments, step 930 includes:
acquiring entity information and intention information in the recognition text based on the voice interaction resource and a natural language understanding algorithm, and obtaining the text semantics based on the entity information and the intention information;
screening out irrelevant semantics from the text semantics based on the voice interaction resource and a semantic rejection algorithm to obtain target semantics;
determining interaction information of the target semantics based on the voice interaction resource and a semantic post-processing algorithm;
based on the voice interaction resource and a dialogue management algorithm, determining dialogue actions corresponding to the interaction information;
and generating interactive text corresponding to the dialogue action based on a natural language generation algorithm.
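The chaining of these sub-operations could be pictured as follows; every callable is a hypothetical placeholder for the corresponding algorithm, and the data types are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Semantics:
    intent: str
    entities: dict = field(default_factory=dict)

def dialogue_turn(recognition_text: str,
                  understand: Callable[[str], Semantics],
                  reject: Callable[[Semantics], Optional[Semantics]],
                  post_process: Callable[[Semantics], dict],
                  manage_dialogue: Callable[[dict], str],
                  generate_text: Callable[[str], str]) -> Optional[str]:
    """Illustrative step-930 flow: NLU -> semantic rejection -> semantic
    post-processing -> dialogue management -> natural language generation."""
    semantics = understand(recognition_text)       # entity and intention information
    target = reject(semantics)                     # screen out irrelevant semantics
    if target is None:
        return None                                # nothing to respond to
    interaction_info = post_process(target)        # interaction information of the target semantics
    dialogue_action = manage_dialogue(interaction_info)
    return generate_text(dialogue_action)          # the interactive text
```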
Based on any one of the foregoing embodiments, in step 930, the determining, based on the voice interaction resource and the dialogue management algorithm, a dialogue action corresponding to the interaction information includes:
and determining the dialogue action corresponding to the interaction information based on the voice interaction resource and a dialogue guiding algorithm and/or a rhythm control algorithm within the dialogue management algorithm.
Based on any of the above embodiments, step 940 includes:
determining target voice attributes of a target speaker from the voice interaction resources;
based on a speech synthesis algorithm, interactive synthesized speech corresponding to the interactive text and conforming to the target speech attribute is generated.
Based on any of the above embodiments, the target voice attribute includes at least one of a pace, a pitch, an intonation, and a timbre of the target speaker.
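As a sketch of how the target voice attributes might be carried into synthesis, the following assumes a hypothetical tts_engine callable; the attribute fields and their units are illustrative, not prescribed by the embodiment.

```python
from dataclasses import dataclass

@dataclass
class TargetVoiceAttributes:
    """Target speaker attributes used to shape the interactive synthesized voice."""
    pace: float = 1.0          # relative speaking rate
    pitch: float = 0.0         # relative pitch shift
    intonation: str = "neutral"
    timbre: str = "default"    # identifier of the target speaker's voice

def synthesize_with_attributes(interactive_text: str,
                               attributes: TargetVoiceAttributes,
                               tts_engine) -> bytes:
    """Pass the interactive text and the target voice attributes to a
    hypothetical speech synthesis engine and return the audio it produces."""
    return tts_engine(text=interactive_text,
                      pace=attributes.pace,
                      pitch=attributes.pitch,
                      intonation=attributes.intonation,
                      timbre=attributes.timbre)
```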
Based on any of the above embodiments, the voice interaction knowledge includes declarative knowledge and procedural knowledge;
the resource data includes scene data, user data, and history data.
Based on any of the above embodiments, further comprising:
and displaying at least one of the recognition text, the interactive text and the voice interaction state.
Based on any of the above embodiments, further comprising:
and receiving an interactive operation input by the user, wherein the interactive operation is used for at least one of: switching the voice interaction state, correcting the recognition text, setting the voice interaction language, setting the preset trigger word, and setting the target voice attribute of the target speaker.
Based on any one of the above embodiments, fig. 10 is a second flowchart of the voice interaction method provided by the present invention, and shows the communication and dialogue process of voice interaction. In the voice interaction between the user and the voice interaction system, the user may first speak information carrying the preset trigger word to trigger the start of the voice interaction, and may thereafter continue speaking without interruption; correspondingly, the voice interaction system receives a continuous user voice stream containing the preset trigger word, for which it can perform real-time interaction operations to generate and broadcast interactive synthesized voice. After listening to the interactive synthesized voice, the user can continue to speak new content, and the voice interaction system continues to collect the user voice stream; accordingly, dialogue guidance and rhythm control can be carried out while voice collection and voice recognition continue, so that it can be judged whether the user interrupts the broadcast of the interactive synthesized voice during this process. If an interrupting user voice is detected, updated interactive synthesized voice can be generated based on the newly collected voice, thereby achieving a voice interaction that is more natural and better suited to the various unexpected situations that arise in actual interaction.
Fig. 11 is a third flowchart of the voice interaction method provided by the present invention and shows an interaction process based on a time sequence. The interaction process can be divided into a plurality of time intervals (Tn) according to the temporal logic of the dialogue. The uplink channel (input natural speech) and the downlink channel (output synthesized speech) should be able to receive and transmit speech signals within the same time interval (Ti). Thus, the user may interrupt the voice interaction system while it is speaking, and the voice interaction system may manage the cadence or give prompts while the user speaks or remains silent.
Fig. 12 is a schematic flowchart of the voice interaction method provided by the present invention. As shown in fig. 12, in the full duplex voice interaction process, the person and the machine can communicate with each other at the same time, and the parties at both ends of the conversation can talk simultaneously and be heard by the other party. In this process, the person can interrupt the interactive synthesized voice being broadcast, and the machine can break a silence by actively speaking a response, or can manage the rhythm or give a prompt while the user speaks or keeps silent.
Fig. 13 is a schematic diagram of a voice interaction scenario provided by the present invention. In fig. 13, "small flying hello" is the preset trigger word; the voice interaction system wakes up after detecting "small flying hello", and the microphone array in the acoustic collection component performs sound source localization in preparation for the subsequent directional pickup of the user voice stream of the target speaker. During the voice interaction, semantic understanding is carried out on the recognized request to play a random song, and the corresponding operation can be executed, namely a song is played. While the song is playing, the microphone array can perform echo cancellation on the collected audio, so that the collected song does not affect the execution of the voice interaction. During this process the microphone array continues recording and the voice recognition component continues performing voice recognition, so that the system picks up and responds to the user saying "I want to hear the D song of the C singer". After detecting "change to E singer", the voice interaction system can refer to the short-term memory of the preceding context, determine that the user wants to change to the D song of the E singer, and respond. For the "sounds nice" uttered by the user, the voice interaction system can judge through semantic rejection that the sentence carries no actionable meaning and needs no response, and continues to play the song. For pauses occurring in the user voice stream, semantic VAD can be performed by the voice recognition component to obtain an accurate expression of the user for an accurate response.
Fig. 14 is a second schematic diagram of a voice interaction scenario provided by the present invention. In fig. 14, after detecting the interrupting voice "I said I want to go to city B", the city name obtained in the previous recognition can be corrected according to the semantics, and the task dialogue after error correction is output. After detecting "help me book the earliest morning trip", the earliest flight, i.e. the flight departing at 9 o'clock in the morning, can be screened out from the flights already retrieved by logical reasoning and responded to accordingly. When "OK, what is the weather?" is detected, the city name "city B" can be shared between the ticket service and the weather service, so that the weather condition of city B is reported in response. When an utterance referring to the hotel used last time is detected, the corresponding record can be read from the user data and responded to.
Fig. 15 is a third schematic diagram of a voice interaction scenario provided by the present invention, in which the solid lines with arrows reflect the voice streams transmitted through the uplink and downlink channels. In fig. 15, the person says "I want to listen (pause of 1 s) to the song of singer C"; by understanding the semantics of the fragment "I want to listen", the machine finds that the sentence lacks object information and is not yet fully expressed, and on this basis determines that voice activity frames are still to come, so the machine can ignore the 1 s pause in the middle of the user voice stream, continue to collect the user voice stream, and perform semantic understanding according to the dialogue context. In addition, in fig. 15, when the person asks about the weather in city A today, the machine can perform voice recognition, semantic understanding and dialogue management on this utterance; while generating the interactive text "City A today ...", it continues to collect the user voice stream and picks up the interrupting input "city B". At this point, without affecting the previous interactive text from being formed into interactive synthesized voice and broadcast, the machine can perform processing such as voice recognition, semantic understanding, dialogue management and voice synthesis on the newly input "city B", thereby realizing parallel processing of the uplink and downlink channels.
Fig. 16 illustrates a physical structure diagram of an electronic device. As shown in fig. 16, the electronic device may include: a processor 1610, a communication interface (Communications Interface) 1620, a memory 1630, and a communication bus 1640, wherein the processor 1610, the communication interface 1620, and the memory 1630 communicate with each other via the communication bus 1640. The processor 1610 can invoke logic instructions in the memory 1630 to perform a voice interaction method comprising:
collecting user voice stream after voice interaction awakening;
performing voice endpoint detection and voice recognition on the user voice stream based on voice interaction resources to obtain a recognition text of the user voice stream, wherein the voice endpoint detection is performed based on semantics of the user voice stream, and the voice interaction resources comprise voice interaction knowledge and resource data related to voice interaction tasks;
based on the voice interaction resource, natural language understanding, dialogue management and natural language generation are carried out on the identification text, and an interaction text of the identification text is obtained;
performing voice synthesis operation on the interactive text to obtain interactive synthesized voice of the interactive text;
Broadcasting the interactive synthetic voice, and stopping broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained under the condition that the user voice stream is acquired in the broadcasting process of the interactive synthetic voice, or continuously broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained;
the collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
Further, the logic instructions in the memory 1630 described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method of voice interaction provided by the methods described above, the method comprising:
collecting user voice stream after voice interaction awakening;
performing voice endpoint detection and voice recognition on the user voice stream based on voice interaction resources to obtain a recognition text of the user voice stream, wherein the voice endpoint detection is performed based on semantics of the user voice stream, and the voice interaction resources comprise voice interaction knowledge and resource data related to voice interaction tasks;
based on the voice interaction resource, natural language understanding, dialogue management and natural language generation are carried out on the identification text, and an interaction text of the identification text is obtained;
performing voice synthesis operation on the interactive text to obtain interactive synthesized voice of the interactive text;
broadcasting the interactive synthetic voice, and stopping broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained under the condition that the user voice stream is acquired in the broadcasting process of the interactive synthetic voice, or continuously broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained;
The collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the voice interaction method provided above, the method comprising:
collecting user voice stream after voice interaction awakening;
performing voice endpoint detection and voice recognition on the user voice stream based on voice interaction resources to obtain a recognition text of the user voice stream, wherein the voice endpoint detection is performed based on semantics of the user voice stream, and the voice interaction resources comprise voice interaction knowledge and resource data related to voice interaction tasks;
based on the voice interaction resource, natural language understanding, dialogue management and natural language generation are carried out on the identification text, and an interaction text of the identification text is obtained;
performing voice synthesis operation on the interactive text to obtain interactive synthesized voice of the interactive text;
Broadcasting the interactive synthetic voice, and stopping broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained under the condition that the user voice stream is acquired in the broadcasting process of the interactive synthetic voice, or continuously broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained;
the collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.