Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Voice interaction is communication between a person and a machine; the voice interaction system, or the UI (user interface) of the voice interaction system, serves as the communication channel between the person and the machine. The user can converse with the machine through the UI of the voice interaction system, and the machine can synthesize voice through the voice interaction system and reply to the user through the UI. The two-way communication described above may be regarded as duplex voice interaction. According to the order in which data is transmitted, duplex voice interaction can be further divided into two modes: an HDX (half duplex) mode and an FDX (full duplex) mode.
Duplex voice interaction in the related art is the HDX mode. In HDX voice interaction, the person and the machine can transmit data in only one direction at a time. HDX voice interaction is characterized by single rounds of dialogue: the machine returns to its default state after completing a round of dialogue. In addition, the machine cannot collect the voice signal during its own voice broadcast.
As an example, a typical HDX-based communication system is a two-way radio such as an intercom. The intercom uses a push-to-talk button to control the signal transmission channel, switching between the transmitter and the receiver to transmit or receive a signal. For example, the user can turn on the transmitter and turn off the receiver through the button so that his own voice is transmitted to the opposite terminal; during this process the intercom does not receive information transmitted by the opposite terminal, and the user cannot hear the voice of the remote user.
Compared to HDX voice interactions, FDX voice interactions allow a person and a machine to communicate at the same time, i.e., the person and the machine can speak at the same time. In the FDX voice interaction system, the voice information stream is transmitted through two physical channels, an uplink channel transmits the voice information stream from the user to the functional unit, and a downlink channel transmits the voice information stream from the functional unit to the user. The two physical channels should work simultaneously and not interfere with each other, so that the functional unit has hearing ability during conversation. The FDX voice interaction is characterized by being dialog-oriented, and the voice interaction system will maintain the continuity of the dialog, thereby ensuring that the user and the machine remain in the same state after two or more rounds of dialog. Furthermore, both the user and the machine may speak during the same time interval.
By way of example, a typical FDX-based communication system is a telephone, where local and remote users can speak and hear the voice of the opposite end simultaneously.
From a functional point of view, the voice interaction system needs to continuously receive user input data throughout the human-machine conversation process and provide feedback to the user through the UI of the FDX voice interaction system.
In summary, the FDX-based voice interaction mode can overcome the problems that the HDX-based voice interaction mode cannot meet the requirements of man-machine interaction and does not match actual dialogue habits. The FDX-based voice interaction system can be applied to various intelligent devices, such as smart phones, intelligent household appliances, intelligent assistant apps, customer service robots and the like. The interaction tasks of the FDX-based voice interaction system vary with the scene type and user requirements, and the application scenes of the interaction tasks may include, but are not limited to, telephony, navigation, home services, chat, etc. The embodiment of the invention provides a voice interaction system based on FDX; to facilitate understanding of the embodiment of the invention, the following first describes basic concepts involved:
duplex (duplex): a communication method capable of transmitting data bi-directionally.
Full Duplex (FDX): a communication method capable of simultaneously transmitting data in both directions.
Functional unit (functional unit): an entity of hardware or software or both capable of achieving a particular purpose; the functional units may be integrated into one system.
Half Duplex (HDX): a communication method capable of transmitting data bi-directionally, but transmitting data in only one direction at any time.
Microphone array (microphone array): a system consisting of a plurality of microphones with a determined spatial topology, used to sample and filter the spatial characteristics of a signal.
Voice interaction (speech interaction): the voice communication between the person and the machine is used for information transmission and communication.
Speech recognition (speech recognition): a process of converting human voice signals into text or instructions. Namely, a method of converting a voice signal into voice content by a functional unit; the content to be identified herein may be represented as a sequence of suitable words or phonemes.
Semantic understanding (semantic understanding): the functional unit is made to understand the intention of the person to speak.
Speech synthesis (speech synthesis): generating speech from the data mechanically or electronically; here, the voice may be generated from text, image, video, and audio. The text-to-speech conversion process is the primary way of speech interaction; the result of speech synthesis is also called artificial speech to distinguish from natural speech emitted through the vocal organs of a person.
Voice activity detection (voice activity detection, VAD): and analyzing and identifying effective voice starting points in the continuous voice stream.
Voice trigger (voice trigger)/voice wake-up: a process in which, after certain features or events are detected in the audio-stream monitoring state, the system switches to processing states such as command word recognition and continuous speech recognition.
Fig. 1 is a schematic structural diagram of a voice interaction system provided by the present invention, where, as shown in fig. 1, the voice interaction system includes:
a voice interaction resource 110 comprising voice interaction knowledge and resource data related to voice interaction tasks;
an acoustic collection component 120, configured to collect a user voice stream, where the acoustic collection component operates after a voice interaction wake-up;
a voice recognition component 130, configured to perform voice endpoint detection and voice recognition on the user voice stream based on the voice interaction resource, to obtain a recognition text of the user voice stream, where the voice endpoint detection is performed based on semantics of the user voice stream;
a dialogue processing component 140, configured to perform natural language understanding, dialogue management, and natural language generation on the recognized text based on the voice interaction resource, so as to obtain an interaction text of the recognized text;
a speech synthesis component 150, configured to perform a speech synthesis operation on the interactive text, so as to obtain an interactive synthesized speech of the interactive text;
a voice broadcasting component 160, configured to broadcast the interactive synthesized voice;
the collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
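For intuition, the following minimal Python sketch shows one way the components of fig. 1 could be wired together; all class and method names are illustrative assumptions rather than a defined interface of the system.

```python
# Minimal sketch of how the components of fig. 1 might be composed.
# All class and method names here are hypothetical illustrations.

class VoiceInteractionSystem:
    def __init__(self, resource, recognizer, dialogue, synthesizer, broadcaster):
        self.resource = resource        # voice interaction resource 110
        self.recognizer = recognizer    # voice recognition component 130
        self.dialogue = dialogue        # dialogue processing component 140
        self.synthesizer = synthesizer  # speech synthesis component 150
        self.broadcaster = broadcaster  # voice broadcasting component 160

    def on_uplink_audio(self, audio_chunk):
        # Uplink: semantic endpoint detection + streaming recognition.
        text = self.recognizer.recognize(audio_chunk, self.resource)
        if text is None:                # no complete utterance yet
            return
        # NLU -> dialogue management -> NLG, all driven by the resource.
        reply_text = self.dialogue.respond(text, self.resource)
        # Downlink: synthesize and broadcast without blocking the uplink.
        self.broadcaster.play(self.synthesizer.synthesize(reply_text))
```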
Specifically, the acoustic collection component 120 may be a single microphone or may be a microphone array including a plurality of microphones, which is not particularly limited in the embodiments of the present invention. The acoustic collection component 120 is configured to collect a user voice stream, that is, a voice data stream obtained during the voice interaction process, where the user voice stream is recorded in real time, and may specifically be obtained by recording voice or by recording video.
It should be noted that, the user voice stream herein may be a voice data stream recorded by a user for voice interaction, for example, a wake-up voice data stream for waking up voice interaction, for example, a voice data stream for querying specific information after waking up, or a voice data stream recorded when the user interrupts a voice played by the voice interaction system during voice interaction, which is not particularly limited in the embodiment of the present invention.
Here, the voice interaction wake-up may be implemented by the user triggering a voice control operation on the UI interface, or by the user inputting a voice signal containing a preset wake-up word such as "fly little", which is not limited in the embodiment of the present invention. It can be understood that, after a single voice interaction wake-up, the voice interaction system can complete the entire conversation process and realize multiple interactions; that is, the voice interaction system only needs to be woken up once at the beginning of a conversation, after which the user can speak continuously. The acoustic collection component 120 keeps operating for a period of time after the voice interaction wake-up to collect the user voice stream, which drives the subsequent components to operate during that period and carry out multiple rounds of dialogue continuously, thereby achieving the effect of one wake-up, multiple interactions. The specific length of this period may depend on the actual interaction between the user and the voice interaction system and may be understood as the duration of the entire conversation process. For example, the voice interaction system may collect a user voice stream containing valid speech during or after broadcasting the interactive synthesized voice, i.e., the user interrupts the broadcast of the interactive synthesized voice, or the user continues to converse with the voice interaction system after hearing the broadcast; the time for collecting the user voice stream after the voice interaction wake-up can therefore extend until the entire conversation process ends.
Whether the entire conversation process is finished can be judged by whether the acoustic collection component 120 collects a user voice stream containing valid speech within a preset time. For example, after the voice interaction system broadcasts the interactive synthesized voice, if no user voice stream containing valid speech is collected within the preset time, the conversation process can be considered finished and the acoustic collection component 120 can stop collecting. As another example, whether the entire conversation process is finished may be determined through two preset times: after the voice interaction system broadcasts the interactive synthesized voice, if no user voice stream containing valid speech is collected within a first preset time, the voice interaction system may generate and broadcast an interactive synthesized voice for active interaction to guide the user to continue the conversation; after that broadcast ends, if no user voice stream containing valid speech is collected within a second preset time, the conversation process can be considered finished and the acoustic collection component 120 can stop collecting. Moreover, voice interaction wake-up only needs to be triggered at the beginning of a dialogue and does not need to be triggered again during the dialogue.
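As a concrete illustration of the single- and double-timeout checks described above, the following sketch shows one possible end-of-dialogue decision; the timeout values and the helper methods are assumptions made for the example only.

```python
# Illustrative sketch of the two-preset-time end-of-dialogue check.
# FIRST_WAIT, SECOND_WAIT and the collector/prompt helpers are assumptions.

FIRST_WAIT = 8.0    # seconds to wait for valid speech after a broadcast ends
SECOND_WAIT = 8.0   # seconds to wait after the proactive guidance prompt

def dialogue_finished(collector, broadcast_prompt):
    """Return True when the whole conversation can be considered over."""
    if collector.wait_for_valid_speech(timeout=FIRST_WAIT):
        return False                    # user spoke again: keep the dialogue going
    broadcast_prompt("Is there anything else I can help you with?")  # active interaction
    if collector.wait_for_valid_speech(timeout=SECOND_WAIT):
        return False
    return True                         # no valid speech twice: stop collecting
```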
For real-time interaction on the user voice stream, a complete dialogue process executes functional units such as voice recognition, dialogue processing and voice synthesis in sequence, and the whole processing chain also depends on scenes, contexts, knowledge and data, as well as the computation methods implemented by each component. The FDX voice interaction system processes the input user voice stream or other input information and finally outputs synthesized voice or other information and instruction actions. The FDX voice interaction system can continuously receive various input signals, including but not limited to voice signals, information, requests and the like, transcribe the useful signals into text, extract semantic information from the transcribed text, predict and decide interaction tasks according to the semantic information, and provide output signals to the user according to the prediction and decision; output signals include but are not limited to synthesized voice, answers, information, behaviors and the like. The above conversation process can be divided into three phases, namely the speech recognition operation, the dialogue processing operation and the speech synthesis operation, and the execution of these three operation phases is implemented on the basis of the voice interaction resource 110.
The voice interaction resource 110 here is the data resource supporting real-time voice interaction execution; it refers to the relevant knowledge and data needed for understanding scenes and contexts, which in turn refer to different scene or language context information. The voice interaction resource 110 supports the functional components including the voice recognition component 130, the dialogue processing component 140, the speech synthesis component 150 and the like, and, combined with the information and dialogue data of these functional components, can make interaction decisions covering the usage scenario. The voice interaction resource 110 can be specifically classified into two types, namely voice interaction knowledge and resource data:
The voice interaction knowledge, that is, knowledge information for realizing voice interaction, may be used to provide solutions to questions posed by a user in voice interaction, and may also provide descriptions for program tasks in voice interaction, which is not specifically described in the embodiments of the present invention.
The resource data is the resource related to the voice interaction task, and different voice interaction tasks can be associated with different resource data. Each execution of voice interaction has an explicit voice interaction task; the voice interaction task reflects the specific goal and requirement that the voice interaction needs to address, and it can change with the scene type and user requirements; for example, the voice interaction task may be telephony, navigation, home service, chat and the like. The resource data related to the voice interaction task reflects the resource information required for solving the specific problem addressed by the voice interaction task; for example, for a voice interaction task in an automobile driving scene, the related resource information may include the location areas of interest to the user, map resources, statement information of route queries, and the like.
By combining the voice interaction resource 110, real-time interaction on the user voice stream can be realized. Specifically, the voice recognition component 130 combines the voice interaction resource and obtains the recognition text of the user voice stream through the voice recognition operation; the dialogue processing component 140 then combines the voice interaction resource and obtains, through the dialogue processing operation, the interaction text corresponding to the recognition text, the interaction text being the text-form information fed back to the user; finally, the speech synthesis component 150 combines the voice interaction resource and synthesizes, through the speech synthesis operation, the voice corresponding to the interaction text, thereby obtaining the interactive synthesized voice.
In this process, the speech recognition component 130, the dialogue processing component 140 and the speech synthesis component 150 can each be implemented by calling a single functional component, each functional component can be used and tested independently, and the functional components can be integrated together to implement real-time interaction.
Fig. 2 is a schematic diagram of the functional components of voice interaction provided by the present invention. As shown in fig. 2, the voice interaction between a user and the voice interaction system can be understood as an execution flow of "listening", "cognition", "understanding" and "expression". For the voice interaction system, this flow may be implemented by four functional components, "acoustic processing", "voice recognition", "dialogue processing" and "speech synthesis", corresponding to the acoustic collection component 120, the voice recognition component 130, the dialogue processing component 140 and the speech synthesis component 150. The input of the voice interaction system may be a user voice stream responded to or input by the user. On the voice interaction system side, the user voice stream is first acoustically processed by a microphone or microphone array in the acoustic collection component 120, corresponding to the "listening" stage, which specifically involves voice pickup and voice preprocessing. After acoustic processing is completed, voice recognition is performed on the user voice stream, corresponding to the "cognition" stage, to obtain the recognition text of the user voice stream. Dialogue processing is then performed on the recognition text, corresponding to the "understanding" stage, to obtain the interaction text for the interactive dialogue. Finally, through speech synthesis, corresponding to the "expression" stage, the interactive synthesized voice corresponding to the interaction text is generated and used as the output for broadcasting.
Further, the implementation of the speech recognition component 130 depends on the voice interaction resource and the speech recognition algorithm. The speech recognition algorithm here may include a streaming speech recognition algorithm, a semantic-based voice endpoint detection algorithm, or both, and may further include an irrelevant content rejection algorithm. Continuous speech recognition refers to the process of converting a user's continuous sound signal into text or instructions, and the streaming speech recognition algorithm, also referred to as a continuous speech recognition algorithm (Continuous ASR), is used for recognizing a continuous speech stream. The semantic-based voice endpoint detection algorithm (Semantic VAD) understands the semantics of the speech to obtain the discrimination result of voice activity frames; semantic-based VAD aims to identify and remove silence periods from the user voice stream. Compared with traditional acoustic VAD, semantic-based VAD combines semantic features when distinguishing speech from non-speech, which further improves the reliability of voice endpoint detection; during interaction, by applying semantic-based VAD the FDX voice interaction system can wait intelligently through pauses in the user's input, thereby realizing continuous dialogue. Irrelevant content rejection refers to rejecting invalid speech input, such as scene noise and echoes, by analyzing and deciding on the speech signal; the purpose of the irrelevant content rejection algorithm (Irrelevant content rejection) is to distinguish and reject content in the user's voice stream that cannot or should not be processed. Such content is generally unrelated to the interaction task and the dialogue topic or context, and may also include invalid speech (e.g., noise, background sounds, chit-chat, etc.). A disambiguation effect can also be achieved through the irrelevant content rejection algorithm.
The implementation of the dialogue processing component 140 depends on the voice interaction resource and the dialogue processing algorithm. The dialogue processing algorithm here may include a natural language understanding algorithm, a dialogue management algorithm and a natural language generation algorithm, and may also include a semantic rejection algorithm and a semantic post-processing algorithm. Natural language understanding (natural language understanding, NLU) and semantic rejection are used to understand the semantics of the recognition text and to filter out irrelevant semantics; semantic post-processing, dialogue management and natural language generation are used to organize the dialogue transitions from the semantics of the recognition text. Specifically, natural language understanding converts text or speech into an internal description, i.e., a structured semantic expression of the input; semantic rejection means that, through natural language understanding technology, the system can distinguish input information that does not need to be processed in the current system state, including content unrelated to the interaction task and the dialogue topic or context; semantic post-processing means that after the system performs natural language understanding on an input signal, the understood result is further processed, for example, after natural language understanding of the input voice "the weather tomorrow", the specific date value corresponding to "tomorrow" still needs to be calculated; dialogue management means that the system, following the current dialogue state and the context input, updates the dialogue state and generates the dialogue action to be implemented according to the dialogue processing logic; natural language generation means that the system generates suitable natural language text according to the dialogue action obtained by dialogue management.
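The following toy sketch illustrates the order of these dialogue processing steps on a single weather-query turn; the rule-based NLU, the intent names and the reply template are invented purely for illustration and are not the algorithms actually used.

```python
import datetime

# Toy sketch of NLU -> semantic rejection -> semantic post-processing ->
# dialogue management -> NLG; all rules and names are illustrative only.

def nlu(text):
    # Natural language understanding: map text to a structured semantic frame.
    if "weather" in text:
        return {"intent": "query_weather",
                "date": "tomorrow" if "tomorrow" in text else "today"}
    return {"intent": "unknown"}

def post_process(frame):
    # Semantic post-processing: resolve relative dates to concrete values.
    offset = 1 if frame["date"] == "tomorrow" else 0
    frame["date"] = (datetime.date.today() + datetime.timedelta(days=offset)).isoformat()
    return frame

def process_turn(text, state):
    frame = nlu(text)
    if frame["intent"] == "unknown":                       # semantic rejection
        return state, None
    frame = post_process(frame)
    state = {**state, "last_intent": frame["intent"]}      # dialogue management (toy)
    reply = f"Checking the weather for {frame['date']}."   # natural language generation
    return state, reply

print(process_turn("what is the weather tomorrow", {}))
```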
The implementation of the speech synthesis component 150 relies on the speech synthesis algorithm, by which the conversion from interaction text to interactive synthesized speech is accomplished, thereby enabling voice interaction with the user.
After speech synthesis is completed, the interactive synthesized voice may be broadcast by the voice broadcasting component 160. It should be noted that, during the broadcast of the interactive synthesized voice, the recording and collection of the user voice stream is not interrupted. Synchronous collection of the user voice data stream ensures that user input can still be captured during the broadcast, and real-time interactive processing and feedback can be performed on a user voice stream in which the user interrupts the broadcast of the interactive synthesized voice. The user can therefore keep speaking as continuous input, the interaction between user and machine comes closer to real dialogue habits, the user can converse freely, and the user can interrupt at any time during the voice interaction.
To ensure that the broadcast of the interactive synthesized voice and the collection of the user voice stream can be executed simultaneously, the two are realized through two physical channels: an uplink channel is used for collecting the user voice stream, i.e., transmitting the collected user voice stream from the user to the voice interaction system, and a downlink channel is used for broadcasting the interactive synthesized voice, i.e., transmitting the interactive synthesized voice from the voice interaction system to the user. The two channels should work simultaneously without interfering with each other, giving the voice interaction system the ability to speak and hear at the same time, i.e., the person and the machine can communicate with each other simultaneously. Through the uplink channel and the downlink channel, the FDX voice interaction system can receive and send voice signals within the same time interval; thus, the user can freely interrupt the machine's speech at any time, the machine can manage the rhythm or give prompts while the user speaks or keeps silent, and the FDX voice interaction system can process input and output signals simultaneously at any moment, thereby realizing duplex communication interaction.
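One simple way to realize the parallel uplink and downlink described above is to run collection and broadcasting in separate threads connected by queues, as in the sketch below; the queue-based design and the mic/speaker callables are assumptions for illustration, not a mandated implementation.

```python
import queue
import threading

# Sketch of parallel uplink/downlink processing: collection never pauses,
# even while synthesized speech is being broadcast. Names are illustrative.

uplink = queue.Queue()     # microphone -> recognition pipeline
downlink = queue.Queue()   # synthesis -> loudspeaker

def uplink_worker(capture_chunk):
    while True:
        uplink.put(capture_chunk())     # keep collecting the user voice stream

def downlink_worker(play_chunk):
    while True:
        play_chunk(downlink.get())      # broadcast interactive synthesized voice

# threading.Thread(target=uplink_worker, args=(mic.read,), daemon=True).start()
# threading.Thread(target=downlink_worker, args=(speaker.write,), daemon=True).start()
```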
The system provided by the embodiment of the invention performs the collection of the user voice stream and the broadcast of the interactive synthesized voice through different physical channels, so that user input can still be collected during the broadcast of the interactive synthesized voice, and real-time interactive processing and feedback can still be performed on a user voice stream in which the user interrupts the broadcast. This ensures that the user can keep speaking as continuous input, that the interaction between user and machine comes closer to actual dialogue habits, that the user can converse freely and interrupt at any time during the voice interaction, and that the naturalness of the voice interaction is guaranteed.
Moreover, by combining the voice interaction resource with the voice recognition operation, dialogue processing operation and speech synthesis operation, real-time voice interaction processing is performed, so that real-time voice interaction under a voice interaction task can meet increasingly complex and diversified man-machine interaction requirements. In particular, with the use of voice data streams, user input is continuous and voice interaction can be performed with reference to the dialogue context, thereby ensuring the continuity of the voice interaction.
Based on the above embodiment, fig. 3 is a functional view of voice interaction provided by the present invention, and fig. 3 depicts three parts of input, processing and output of voice interaction.
The input part is the user voice stream, which can be divided into two types: one is the voice data stream input by the user for voice interaction, i.e., the input represented by the solid-line box in fig. 3; the other is the voice data stream input when the user interrupts the voice played by the voice interaction system during voice interaction, i.e., the input represented by the dashed-line box in fig. 3. Both types of input are speech signals, specifically representing information or requests provided by the user through speech. The voice interaction system can receive various input signals within a period of time, including but not limited to voice signals, information, requests and the like, transcribe the useful signals into text, extract semantic information from the transcribed text, predict and decide interaction tasks according to the semantic information, and provide output signals to the user according to the prediction and decision; output signals include but are not limited to synthesized voice, answers, information, behaviors and the like. The voice interaction system can thus process input and output signals simultaneously at any moment, realizing duplex communication interaction and improving the naturalness of voice interaction. For example, the voice interaction system receives the voice signal "the weather in Hefei", performs voice recognition and semantic understanding on it, and generates the interactive reply "the weather in Hefei today is 10°C, with rain" while continuing to listen for other input signals, such as "what about Shanghai". The FDX voice interaction system can perform voice recognition and semantic understanding on this round's input "what about Shanghai" while generating the synthesized audio, without affecting the transmission of the previous round's interactive reply to the speech synthesis module, thereby realizing simultaneous processing on the uplink and downlink channels.
The processing part, i.e., the voice interaction process in fig. 3, receives the input speech signal, transcribes the useful signal into recognition text through the speech recognition operation (Automatic Speech Recognition, ASR), extracts semantic information from the recognition text, predicts and decides the voice interaction task according to the semantic information, and generates the interactive synthesized speech as output according to the decision and/or the feedback to be provided to the user. Both the extraction of semantic information and the prediction and decision of the voice interaction task may be performed through the dialogue processing operation, and the generation of the interactive synthesized speech according to the decision and/or feedback may be performed through the speech synthesis operation (TTS).
The output part is interactive synthesized voice, and the interactive synthesized voice can be an answer or information fed back to the user, or can be feedback for executing actions according to the user request.
In addition, fig. 3 further shows a receiver and a transmitter. The receiver can receive various input signals within a period of time based on the uplink channel, and the transmitter can send automatically synthesized voice or other information and command actions to the user terminal in real time based on the downlink channel, so that the voice interaction system can process input and output signals at any time. This solves the former problem of dialogue blocking, breaks the impasse of low interaction-content integrity and failed semantic understanding, and pushes the naturalness of voice interaction to a new level.
In addition, by setting the voice interaction task in fig. 3, voice interaction resources such as scene, context, knowledge, data and computing channels can be applied to the voice interaction process. Context and scene may be used to define the semantic scope of a dialogue and may specifically refer to different scene or language context information; the dialogue in a voice interaction may cross contexts and scenes. The dialogue processing operation needs to be implemented based on knowledge and data, i.e., the relevant knowledge and data needed for scene and context understanding. The computing channel refers to the algorithms and computing modes adopted to realize the module capabilities, and may cover computing methods such as cloud computing and AI (artificial intelligence) computing, so as to support the execution of the voice interaction process.
The voice interaction thus achieved may have the following characteristics:
Continuous: the execution of the voice interaction system enables the user to speak continuously as continuous input, and the voice interaction system can continuously receive and process input data; during this process, the functional components can keep listening to and understanding what the user says. Furthermore, through FDX voice interaction, pauses in user input can be tolerated and intelligent waiting realized, so that continuous conversation is achieved; for example, if the user says "I want to listen to (pause 1 s) a song by Zhang Xueyou", the voice interaction can ignore the pause in the middle and receive and understand continuously. VAD is performed on the input continuous audio stream, continuous speech recognition can be realized, and user intention prediction and voice interaction are performed according to semantic understanding of the dialogue context; this may involve cloud services and/or terminal and/or edge computing;
Natural: the execution of the voice interaction system enables voice interaction to support natural conversations between people and machines. The user only needs to wake up the voice interaction system once at the beginning of the dialogue and can converse freely thereafter, and the system can be interrupted at any time during the interaction; in addition, the output interactive synthesized voice can simulate the tone and rhythm of a human voice;
Adaptive: the voice interaction system can adapt to continuous change and different deployment environments. It can be used in different vertical industries and adapts to cross-domain applications and tasks through the feeding of dynamic data and the updating of states based on new data; it can perform dialogue management by combining the knowledge base, scene data, historical data and user data, can initiate an active dialogue when the user is not speaking, and can choose to remain silent while the user speaks;
Proactive: the voice interaction system can dynamically predict the dialogue intention based on the voice interaction resource, control the dialogue rhythm, and actively give feedback to guide the user's next action.
Based on context: the voice interaction system builds core functions such as semantic understanding, historical information inheritance, data analysis, dialogue generation and the like based on the context.
Based on knowledge: the voice interaction system may use knowledge from a variety of sources including context information, historical information, retrieval information, user information, etc., all of which may be contained within the voice interaction resource, and in particular may be stored in a general knowledge base and database.
Based on the model: in addition to grammar analysis, the FDX voice interaction system preferably utilizes acoustic models, language models, natural language understanding models, natural language generation models, voice synthesis models based on machine learning techniques, which may be implemented based on convolutional neural networks (convolutional neural network, CNN), recurrent neural networks (recurrent neural network, RNN), long short-term memory (LSTM) networks, and the like.
The user is allowed to use the full-duplex voice interaction system offline, i.e., the terminal can also support independent use in a network-free environment.
The full-duplex voice interaction response time should not exceed 1.5 s, where the response time is measured from the end of the user's voice input to the synthesis of the system's voice response.
Based on the above embodiment, the voice broadcast component is further configured to:
if the user voice stream is collected during the broadcast of the interactive synthesized voice, stop broadcasting the interactive synthesized voice until an updated interactive synthesized voice is obtained, or continue broadcasting the interactive synthesized voice until the updated interactive synthesized voice is obtained.
Specifically, during the broadcast of the interactive synthesized voice, collection of the user voice stream proceeds synchronously. During this process, the user may simply be listening to the voice played by the voice interaction system and remain silent; although the voice interaction system is performing the collection action, no valid user voice stream may be obtained. Alternatively, the user may interrupt the broadcast of the interactive synthesized voice to express a new viewpoint or opinion, in which case the voice interaction system can collect a valid user voice stream.
To better fit the conversational situation in which the user keeps speaking or interrupts the other party, for the case where a user voice stream is collected during voice broadcasting, i.e., the user interrupts the broadcast voice to speak, the voice interaction system stops broadcasting the interactive synthesized voice, continues to collect the user voice stream, and performs the real-time interaction operation on the newly collected user voice stream, thereby obtaining the interactive synthesized voice for the newly collected user voice stream, i.e., the updated interactive synthesized voice, which is then broadcast. In this way, the voice interaction system can be interrupted by the user at any moment during broadcasting or speaking, i.e., the interaction process of the voice interaction system can be interrupted whenever needed, which avoids the broadcast voice interfering with the user's train of thought and harming the user experience while the user is speaking, and makes the voice interaction closer to a real conversation.
In addition, for the case where a user voice stream is collected during voice broadcasting, considering that the collected user voice stream is not necessarily a voice stream valid for voice interaction, the irrelevant content rejection function configured in the voice recognition component may filter out invalid speech during the voice interaction, and in the case of invalid speech the voice interaction system will not generate a new interactive synthesized voice. To avoid the interactive synthesized voice broadcast being interrupted without cause by invalid speech interference, the interactive synthesized voice can continue to be broadcast while the user voice stream is collected, until an updated interactive synthesized voice is obtained, and the updated interactive synthesized voice is then broadcast.
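The two broadcast policies just described can be summarized in the following sketch, where whether playback is stopped immediately depends on whether the interrupting speech survives irrelevant-content rejection; the player and pipeline objects and their methods are hypothetical.

```python
# Sketch of barge-in handling during broadcast; all object interfaces are assumed.

def on_speech_during_playback(speech, player, pipeline, stop_on_barge_in=True):
    if not pipeline.is_relevant(speech):    # invalid speech: noise, echo, chit-chat
        return                              # keep broadcasting, nothing changes
    if stop_on_barge_in:
        player.stop()                       # interrupt the current broadcast at once
    reply = pipeline.interact(speech)       # recognition -> dialogue -> synthesis
    player.play(reply)                      # broadcast the updated synthesized voice
```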
Based on any of the above embodiments, in an interaction scene of a limited domain or another specific scene, the full-duplex voice interaction system should support semantic-based broadcast interruption and should satisfy the following conditions:
a) In an interaction scene of a limited domain or another specific scene, the system should support semantic-based broadcast interruption. Here, interruption of the interactive synthesized voice broadcast is realized based on the semantics of the user voice stream: a broadcast interruption triggered by valid speech uttered by the user is a valid-semantics interruption, while a broadcast interruption triggered by external noise or other factors is an invalid-semantics false interruption:
i) When part of the user input is detected as valid information but other information is still needed, feedback words are returned.
ii) During interaction in the target scene, the accuracy of the system's understanding of target-scene user statements should be greater than or equal to a certain threshold, such as 80%, 85%, 90%, 95%, 98%, etc., and the recall should be greater than or equal to a certain threshold, such as 80%, 85%, 90%, 95%, 98%, etc.; the accuracy of the system's extraction of key information from target-scene user statements should be greater than or equal to a certain threshold, such as 80%, 85%, 90%, 95%, 97%, etc., and the recall should be greater than or equal to a certain threshold, such as 80%, 85%, 90%, 95%, 97%, etc.;
Further, the accuracy of the system's understanding of the intention of target-scene user statements should be greater than or equal to 90% and the recall greater than or equal to 90%; the accuracy of the system's extraction of key information from target-scene user statements should be greater than or equal to 90% and the recall greater than or equal to 90%, where key information refers to all necessary slot-value information in the input statement required for the system to respond correctly to the user request. A sketch of how such accuracy and recall figures may be computed is given after the broadcast-interruption conditions below.
b) In order to ensure user end-to-end interaction experience, the system should satisfy the following in consideration of the influence of noise environment on understanding results:
i) In a low-noise environment (signal-to-noise ratio above 10 dB), the response rate to non-human-machine interaction should be less than or equal to 6%; the non-human-machine-interaction response rate refers to the proportion of dialogues to which the system responds in non-human-machine-interaction scenarios among all non-human-machine-interaction dialogues successfully received by the machine.
ii) In a high-noise environment (signal-to-noise ratio of 10 dB or below), the response rate to non-human-machine interaction should be less than or equal to 10%.
In addition, for broadcasting interruption, the system can also satisfy the following conditions:
In a low-noise environment (e.g., sound intensity below a certain threshold, where the threshold may be 45 dB, 50 dB, 55 dB, etc.), the valid-semantics interruption success rate should be greater than or equal to a certain threshold, for example 80%, 85%, 90%, 95%, 98%, 99%, etc., and the invalid-semantics false-interruption rate should be less than a certain threshold, for example 1%, 5%, 7%, 10%, 15%, 20%, etc. The valid-semantics interruption success rate refers to the recognition accuracy of barge-in utterances with a clear intention; the invalid-semantics false-interruption rate refers to the misrecognition rate at which barge-in utterances without a clear intention, or that are merely chit-chat, are identified as intending to interrupt.
Valid-semantics interruption success rate and invalid-semantics rejection rate in a noisy environment: in a high-noise environment (e.g., sound intensity at or above a certain threshold, where the threshold may be 50 dB, 55 dB, 60 dB, 65 dB, etc., or sound intensity of 60 dB to 65 dB), the valid-semantics interruption success rate should be greater than or equal to a certain threshold, e.g., 75%, 80%, 85%, 90%, 95%, etc., and the invalid-semantics rejection rate should be less than a certain threshold, e.g., 1%, 5%, 7%, 10%, 15%, 20%, etc.
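The accuracy and recall thresholds in the conditions above can be checked on a labelled test set; the sketch below treats accuracy as precision over the system's accepted outputs and uses invented example counts, so it is a generic illustration rather than the prescribed test procedure.

```python
# Generic precision/recall sketch for checking the thresholds above.

def precision_recall(true_positive, false_positive, false_negative):
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    return precision, recall

# Example counts (invented): 180 correctly understood intents, 15 wrongly
# accepted, 12 missed -> precision 92.3%, recall 93.8%, both above 90%.
p, r = precision_recall(true_positive=180, false_positive=15, false_negative=12)
assert p >= 0.90 and r >= 0.90
print(f"precision={p:.1%}, recall={r:.1%}")
```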
Based on any of the above embodiments, the acoustic acquisition assembly comprises:
and the microphone array is used for collecting the user voice stream and performing voice preprocessing on the user voice stream, wherein the voice preprocessing comprises at least one of voice enhancement, dereverberation, sound source extraction and separation, echo cancellation, sound source localization and denoising.
In particular, the collection of the user's voice stream, i.e. the acoustic collection, may be achieved by a microphone array comprising a plurality of microphones. The microphone array can be applied to not only the collection of voice signals, but also the voice preprocessing stage of the collected voice signals.
The front-end correlation algorithm, i.e. the algorithm for performing voice preprocessing, such as voice enhancement, dereverberation, sound source extraction and separation, echo cancellation, sound source localization, denoising, etc., may be integrated as part of a microphone array, and after the microphone array completes voice signal acquisition, the acquired voice signal may be subjected to voice preprocessing, so as to obtain a user voice stream after completing the preprocessing.
In the above-listed voice preprocessing operation, voice enhancement refers to a process of extracting a pure voice signal from a noisy background, and particularly in a complex acoustic environment, when the voice signal is interfered by various noises (including background noise and other irrelevant voices) and even is submerged, a voice enhancement algorithm based on a beam forming method (Beamforming based approach) or the like can be used for suppressing the noise and enhancing the voice.
Reverberation generally refers to the acoustic phenomenon in which sound, as it propagates in an enclosed space (e.g., an indoor space), is reflected by obstacles such as walls, ceilings and floors and superimposed on the original sound. Because of reverberation, asynchronous speech signals overlap each other, producing a phoneme-overlap effect. To address this problem, the microphone array may implement dereverberation through a blind speech enhancement based approach (Blind signal enhancement approach), a beamforming based approach, an inverse filtering based approach (An inverse filtering approach), and so on. The blind speech enhancement based approach treats reverberation as an ordinary additive noise signal and applies a speech enhancement algorithm; the beamforming based approach forms a pickup beam in the direction of the target signal by weighting and summing the signals collected by multiple microphones while attenuating reflected sound from other directions; the inverse filtering based approach estimates the room impulse response (Room Impulse Response, RIR) with the microphone array and designs a reconstruction filter to compensate for and cancel the reverberation.
In sound source extraction and separation, sound source extraction refers to extracting a target speech signal from a plurality of sound signals, while the purpose of sound source separation is to extract all of the individual signals from a plurality of mixed sounds. Both beamforming-based methods and blind source separation methods (Blind Source Separation) may be used for sound source extraction and separation, where blind source separation methods may include methods based on principal component analysis (Principal Component Analysis, PCA) and independent component analysis (Independent Component Analysis, ICA).
Echo cancellation (Acoustic echo cancellation, AEC) is used to ensure that the voice interaction system can collect voice signals while broadcasting audio (e.g., music, interactive synthesized speech, etc.). When the user at far end A speaks, his voice is picked up by the microphone array, transmitted to the communication device at near end B, and broadcast by its speaker. This speech signal is then picked up by the microphone of near end B, thereby forming an echo. The echo signal is transmitted back to far end A and broadcast through the speaker of far end A, and the user at far end A then hears his own voice. The echo signal has a great influence on the voice collection effect, so it needs to be removed in the voice collection process. Echo cancellation may be achieved, for example, by an adaptive filter with a finite impulse response (finite impulse response, FIR) structure.
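As a concrete illustration of adaptive-filter echo cancellation, the following sketch uses a normalized LMS (NLMS) update of an FIR filter to estimate and subtract the echo of the broadcast signal from the microphone signal; the tap count and step size are arbitrary example values, not parameters specified by the system.

```python
import numpy as np

# Minimal NLMS adaptive-filter sketch of acoustic echo cancellation (AEC):
# the far-end (broadcast) signal is filtered to estimate the echo picked up
# by the microphone, and the estimate is subtracted from the microphone signal.

def nlms_aec(far_end, mic, taps=128, mu=0.5, eps=1e-8):
    far_end = np.asarray(far_end, dtype=float)
    mic = np.asarray(mic, dtype=float)
    w = np.zeros(taps)                       # adaptive FIR filter weights
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]        # most recent far-end samples
        e = mic[n] - w @ x                   # near-end speech + residual echo
        w += mu * e * x / (x @ x + eps)      # normalized LMS weight update
        out[n] = e
    return out                               # echo-cancelled microphone signal
```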
Based on any of the above embodiments, the microphone array is specifically configured to:
executing speaker positioning and determining speaker positioning results;
and carrying out directional voice pickup based on the speaker positioning result to obtain a user voice stream, and carrying out voice preprocessing on the user voice stream.
In particular, the microphone array may also be provided with sound source localization and directional pickup functions, where sound source localization here means speaker localization. In reality the position of the sound source may change continuously; sound source localization technology can calculate the angle and distance of the target speaker using the microphone array, so as to track the target speaker and steer subsequent voice pickup. With the support of sound source localization and directional voice pickup technology, voice interaction applications no longer restrict the movement of the speaker: the microphone array can automatically adjust its receiving direction to capture the speaker's voice, thereby collecting a user voice stream with a high signal-to-noise ratio (SNR).
Here, speaker localization may locate the target speaker by calculating information such as the plane angle, azimuth angle, pitch angle and distance between the microphone and the target speaker, thereby improving the signal-to-noise ratio of the user voice stream; the identification of the target speaker may be implemented by voiceprint recognition technology.
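Speaker localization with a microphone array is commonly built on time-difference-of-arrival estimates between microphone pairs; the sketch below computes such a delay with GCC-PHAT for one pair and is an illustrative building block rather than the complete localization method described here.

```python
import numpy as np

# GCC-PHAT time-difference-of-arrival (TDOA) sketch for one microphone pair.
# A real array combines several pairs to obtain azimuth/elevation and distance.

def gcc_phat_delay(sig, ref, fs):
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n)
    REF = np.fft.rfft(ref, n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12                      # PHAT weighting
    corr = np.fft.irfft(cross, n)
    corr = np.concatenate((corr[-(n // 2):], corr[:n // 2 + 1]))
    delay_samples = int(np.argmax(np.abs(corr))) - n // 2
    return delay_samples / fs                           # delay in seconds
```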
Based on any of the above embodiments, the microphone array is specifically configured to collect a user voice stream in a near field and/or a far field;
the distance of the near field acquisition is within 1 meter, and the distance of the far field acquisition is more than 3 meters and less than 5 meters.
In particular, a microphone array for capturing a user's voice stream should be provided with near-field audio capturing and/or far-field audio capturing capabilities. The near field referred to herein means a case where a distance between the microphone and the sound source is within 1 meter, and the far field means a case where a distance between the microphone and the sound source is greater than 3 meters and less than 5 meters.
Based on any of the above embodiments, fig. 4 is a schematic diagram of a microphone array structure provided by the present invention, and as shown in fig. 4, the array structure of the microphone array is at least one of a linear form, a planar form and a spatial stereo form.
In particular, microphone arrays are generally composed of two or more microphones in linear, planar and spatial stereo form. An example of a microphone array structure is shown in fig. 4. In fig. 4, (a) shows a microphone array structure in a Linear form, (b) shows a microphone array structure in a Planar form, and (c) shows a microphone array structure in a spatial three-dimensional form (Spatial steric form).
In the system provided by the embodiment of the invention, the front-end acoustic collection of voice interaction is realized by using the microphone array, and the microphone array has the functions of continuous audio collection, voice enhancement, sound source positioning, dereverberation, denoising, echo cancellation, sound source extraction and the like. In addition, the compression level of the user voice stream collected by the microphone array should be configurable, and the user voice stream collected by the microphone array should be adaptable to compression and decompression under various coding formats and algorithms without changing the voice content. The coding formats here may include EVRC (Enhanced Variable Rate Codec ) and g.711, g.723.1 series developed by ITU-T, and the coding formats of audio may include AAC (advanced audio coding ), AC3 (audio coding 3), MP3 (MPEG audio layer 3), WMA (Windows media audio), WAV (waveform audio file format), and so on.
Based on any of the above embodiments, the acoustic acquisition assembly is specifically configured to:
and under the condition that the voice interaction state is on, collecting the voice stream of the user.
In particular, in order to protect privacy and safety of users, a switch may be set for voice interaction, that is, a voice interaction switch, where the voice interaction switch may be a virtual switch, that is, a switch icon displayed on a screen of the voice interaction system or a screen connected to the voice interaction system, and the voice interaction switch may also be an entity switch, that is, a switch in the form of a key or a dial plate set on the voice interaction system.
The on-off state of the voice interaction switch directly determines the voice interaction state, and the voice interaction state can be on or off. Only under the condition that the voice interaction state is on, the voice interaction method in the above embodiments is executed, if the voice interaction state is off, namely, the user turns off the voice interaction switch, the voice interaction system will not collect the user voice stream, and the subsequent voice interaction process will not be executed, thereby avoiding the user privacy from being eavesdropped and revealed, and ensuring the privacy security of the user.
In addition, when the voice interaction state is on, the user can be reminded that the voice interaction is running through icons displayed on a screen or other visual or audible prompting modes.
In addition, in order to ensure the privacy security of the user, a notification of the collection of the privacy information needs to be provided, the collection of the privacy information should be easy for the user to find, and the content of the notification should be popular and easy to understand. In addition, the voice interaction system can also set an authorized account number for the user to manage personal safety and privacy information of the user, wherein the personal safety and privacy information of the user can be name, gender, voiceprint, other functions defined in a user configuration file and the like, the personal safety and privacy information of the user needs to be stored in a trusted safety space in an encrypted mode, and the information is only used for temporary interaction and cannot be applied to other services. In addition, the system should also support closing the audio cloud collection and closing the continuous speech recognition after receiving the active exit intention or the invalid interaction for a specified time (e.g., 60 seconds without interaction).
Based on any of the above embodiments, thespeech recognition component 130 includes:
the end point detection sub-component is used for detecting the voice end point of the user voice stream based on the voice interaction resource and a voice end point detection algorithm based on the semantics to obtain an active voice stream;
and the voice recognition sub-component is used for carrying out voice recognition on the active voice stream based on the voice interaction resource and the streaming voice recognition algorithm to obtain a recognition text of the active voice stream.
Specifically, the endpoint detection sub-component is used to realize voice endpoint detection on the user voice stream, obtain the active voice stream, and judge the integrity of the voice content. Voice endpoint detection here needs to be implemented in reliance on a voice endpoint detection algorithm.
Conventional voice endpoint detection algorithms are implemented based on acoustic features, such as detection based on energy, periodicity features, zero-crossing rate, or multi-feature fusion. In practice, however, continuous speech streams often contain various kinds of background noise and are affected by speaking rate and speaking style, for which acoustic voice endpoint detection based on energy or zero-crossing rate is very likely to fail. Therefore, in the embodiment of the invention, a semantic-based voice endpoint detection algorithm better suited to high noise (i.e., low signal-to-noise ratio) and far-field voice pickup is selected; for example, an ML (machine learning) algorithm can be applied to calculate the semantic truncation probability and detect the start and end points of multiple speech segments in the continuous user voice stream, thereby obtaining an active voice stream with silence periods removed. Here, an LSTM, a DNN (deep neural network) or the like may be used to calculate the semantic truncation probability.
It should be noted that, in the semantic-based voice endpoint detection algorithm, the silence waiting time between two voice segments can be set flexibly, and the sensitivity of the voice endpoint detection algorithm can be adjusted by adjusting this silence waiting time.
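The following sketch shows how a semantic completeness score and an adjustable silence waiting time could be combined into an endpoint decision; the completeness model and all thresholds are assumptions made for illustration, not the actual detection algorithm.

```python
# Sketch of semantic voice endpoint detection: close the utterance only when the
# pause is long enough AND the partial transcript looks semantically complete.

def is_endpoint(silence_ms, partial_text, completeness_model,
                min_silence_ms=200, max_silence_ms=1500, threshold=0.8):
    if silence_ms < min_silence_ms:
        return False                               # still speaking
    if silence_ms >= max_silence_ms:
        return True                                # hard timeout, end regardless
    p_complete = completeness_model(partial_text)  # semantic truncation probability
    return p_complete > threshold                  # "I want to listen to ..." stays open
```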
The voice recognition sub-component is used to perform voice recognition on the active voice stream, where the voice recognition depends on a streaming voice recognition algorithm. The streaming voice recognition algorithm is used for voice recognition of continuous voice streams; it can recognize at least one language and adapt to multiple languages, so that streaming voice recognition in various languages can be supported.
Various streaming voice recognition algorithms can be applied to implement voice recognition, for example the streaming voice recognition framework shown in fig. 5, which is composed of an encoder, an acoustic model, a language model, a dictionary, and a decoder; the input voice stream is the active voice stream, and the output text is the recognition text.
The encoder is configured to extract features from the input voice stream. In this process, the encoder converts each frame of voice data in the voice stream into a multi-dimensional vector representing acoustic information. The features encoded in streaming voice recognition may be at least one of linear prediction coefficients (LPC), perceptual linear prediction (PLP) features, tandem features, bottleneck features, filter bank (FBank) features, linear predictive cepstral coefficients (LPCC), and Mel-scale frequency cepstral coefficients (MFCC). It should be noted that the tandem features and bottleneck features referred to here may be extracted by a neural network; specifically, the posterior probability vector of the corresponding class nodes in the output layer of the neural network may be reduced in dimension and spliced with the MFCC or PLP features to obtain the tandem features.
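As a small illustration of the encoder front-end, the sketch below computes MFCC features with the librosa library for a synthetic signal; the frame parameters (25 ms window, 10 ms hop at 16 kHz) and the use of librosa are assumptions for illustration, not values prescribed by the invention.

```python
# Illustrative feature extraction for the encoder stage: MFCCs computed with
# librosa (one of several possible front-ends; FBank, PLP, etc. are analogous).
import numpy as np
import librosa

sr = 16000
# One second of synthetic audio stands in for a frame-aligned speech stream.
t = np.linspace(0, 1, sr, endpoint=False)
speech = 0.1 * np.sin(2 * np.pi * 220 * t).astype(np.float32)

# Each ~25 ms frame becomes a 13-dimensional vector of acoustic information.
mfcc = librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```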
The voice recognition implemented in this way should support speech recognition and error correction in a variety of usage scenarios, as well as sentence segmentation and audio interaction segmentation. In addition, it should be able to detect the start and end points of multiple voice segments in a continuous voice stream, and should be able to reject the recognition of inappropriate content according to the semantics of the sentence and the scene. The test indexes of voice recognition include sentence recognition accuracy and word recognition accuracy; the test set construction, test method, and index calculation method for word recognition accuracy refer to GB/T 21023-2007, and those for sentence recognition accuracy refer to GB/T 36464.1-2020:
a) In a low noise environment (the signal-to-noise ratio is at or above a certain threshold, e.g., 5 dB, 10 dB, 15 dB, 20 dB, 25 dB, or 30 dB), the sentence recognition accuracy should be greater than or equal to a certain threshold, e.g., 80%, 83%, 84%, 85%, 86%, 90%, or 95%, and the word recognition accuracy should be greater than or equal to a certain threshold, e.g., 80%, 85%, 90%, 95%, 97%, 98%, or 99%. The sentence recognition accuracy reflects the sentence recognition capability of the tested system and is calculated as the number of sentences correctly recognized by the tested system divided by the total number of labeled sentences; the word recognition accuracy reflects the word recognition capability of the tested system and is calculated as the total number of words in the sentences correctly recognized by the tested system divided by the total number of labeled words (a minimal calculation sketch follows this list).
Further, in the case where a low noise environment is defined as an environment with a signal-to-noise ratio of 10 dB or more, the sentence recognition accuracy should be 84% or more, and the word recognition accuracy should be 95% or more;
b) In a high noise environment (the signal-to-noise ratio is at or below a certain threshold, e.g., 5 dB, 10 dB, 15 dB, 20 dB, 25 dB, or 30 dB), the sentence recognition accuracy should be greater than or equal to a certain threshold, e.g., 70%, 73%, 74%, 75%, 76%, 80%, 85%, or 90%, and the word recognition accuracy should be greater than or equal to a certain threshold, e.g., 70%, 75%, 80%, 85%, 87%, 88%, 89%, 90%, or 95%;
further, in the case where a high noise environment is defined as an environment with a signal-to-noise ratio of 10 dB or less, the sentence recognition accuracy should be 75% or more, and the word recognition accuracy should be 88% or more.
c) The real-time voice recognition performance index should meet the real-time requirement.
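The accuracy formulas described in item a) can be sketched as follows; the GB/T standards cited above define the formal test procedure, so this is only a minimal illustration of the two ratios.

```python
# Minimal sketch of the two test indexes: sentence accuracy is correctly
# recognized sentences over labeled sentences; word accuracy here is the total
# word count of correctly recognized sentences over the labeled word total.
def sentence_accuracy(references: list[str], hypotheses: list[str]) -> float:
    correct = sum(1 for r, h in zip(references, hypotheses) if r == h)
    return correct / len(references)

def word_accuracy(references: list[str], hypotheses: list[str]) -> float:
    correct_words = sum(len(r.split())
                        for r, h in zip(references, hypotheses) if r == h)
    total_words = sum(len(r.split()) for r in references)
    return correct_words / total_words

refs = ["turn on the light", "what is the weather", "play some music"]
hyps = ["turn on the light", "what is the whether", "play some music"]
print(sentence_accuracy(refs, hyps))  # 2 of 3 sentences correct
print(word_accuracy(refs, hyps))      # 7 of 11 labeled words
```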
In addition, the acoustic model is typically trained on feature vectors and outputs phoneme information. The dictionary represents the correspondence between words and phonemes, such as between phonetic symbols and characters or words. The language model obtains word-related probabilities by training on a large amount of text. The decoder outputs text from the acoustic features produced by the encoder, using the acoustic model, the dictionary, and the language model. Further, the training samples of the acoustic model and the language model applied here, as well as the dictionary itself, may be obtained from the voice interaction resource.
Furthermore, the streaming voice recognition algorithm can not only perform the voice recognition task on the voice stream, but also provide post-processing functions for the recognized text, such as character normalization, punctuation prediction, and text replacement.
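By way of illustration, a toy post-processing pass of the kind mentioned above might look like the following; the replacement rules and number inventory are assumptions, not part of the invention.

```python
# A toy post-processing pass: character/number normalization, simple
# punctuation prediction, and text replacement rules.
import re

REPLACEMENTS = {"gonna": "going to", "wanna": "want to"}  # illustrative rules

def postprocess(raw: str) -> str:
    text = raw.strip().lower()
    for src, dst in REPLACEMENTS.items():
        text = re.sub(rf"\b{src}\b", dst, text)
    # Normalize spelled-out small numbers (assumed inventory, not exhaustive).
    numbers = {"one": "1", "two": "2", "three": "3"}
    text = " ".join(numbers.get(tok, tok) for tok in text.split())
    # Naive punctuation prediction: questions start with an interrogative word.
    if text.split()[0] in {"what", "where", "when", "who", "how", "why"}:
        return text.capitalize() + "?"
    return text.capitalize() + "."

print(postprocess("what time is it"))              # What time is it?
print(postprocess("set a timer for two minutes"))  # Set a timer for 2 minutes.
```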
Based on any of the above embodiments, and referring to fig. 8, the voice recognition component 130 further includes:
and the irrelevant content refusing sub-component is used for refusing the content of the identification text based on the voice interaction resource.
Specifically, the irrelevant content rejection sub-component is used to perform content rejection on the recognition text, and the content rejection depends on an irrelevant content rejection algorithm.
The irrelevant content rejection algorithm can identify and reject content in the user voice stream that cannot or should not be processed, such as noise and background speech. In particular, according to the semantics of the recognition text obtained by voice recognition and the semantics of the interaction scene, content in the recognition text that is irrelevant to the voice interaction can be distinguished and screened out, so that the recognition text obtained after irrelevant content rejection contains only content relevant to the voice interaction.
Here, the semantics of the interaction scene may be determined according to the context of the voice interaction process, or according to the voice interaction task, for example by querying the voice interaction resource for the semantics of the voice interaction task or of an interaction scene subordinate to that task. By judging whether the semantics of each clause in the recognition text are related to the semantics of the interaction scene, the clauses irrelevant to the interaction scene are located and deleted from the recognition text, which prevents the voice interaction system from mistaking interfering noise for user input and producing false feedback.
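A rough sketch of such clause-level rejection is shown below; a real system would use learned semantic representations, whereas this illustration uses simple token overlap with assumed scene keywords.

```python
# Sketch of clause-level rejection: clauses whose token overlap with the
# current interaction scene falls below a threshold are treated as irrelevant.
def reject_irrelevant(recognized_clauses: list[str],
                      scene_keywords: set[str],
                      min_overlap: int = 1) -> list[str]:
    kept = []
    for clause in recognized_clauses:
        tokens = set(clause.lower().split())
        if len(tokens & scene_keywords) >= min_overlap:
            kept.append(clause)   # related to the interaction scene
        # otherwise the clause (background speech, noise transcript) is dropped
    return kept

scene = {"navigate", "route", "traffic", "destination", "map"}
clauses = ["navigate to the airport",
           "did you feed the cat",            # background conversation
           "avoid heavy traffic on the route"]
print(reject_irrelevant(clauses, scene))
```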
Based on any of the above embodiments, the voice recognition algorithm may further include a speaker classification algorithm. Specifically, the speaker classification algorithm may be applied to segment and cluster the voices of multiple speakers contained in the user voice stream, so as to determine the target speaker currently performing voice interaction, and to perform operations such as voice recognition, dialogue processing, and voice synthesis only on the voice of the target speaker.
Segmentation of speaker voices refers to finding the time boundaries at which the speaker changes in the user voice stream and cutting the voice stream into multiple voice segments according to these boundaries. Clustering of speaker voices refers to aggregating the one or more voice segments that belong to the same speaker.
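As an illustration, speaker segmentation and clustering can be sketched on per-segment speaker embeddings (assumed here, e.g. produced by an x-vector model not shown) using agglomerative clustering.

```python
# Sketch of speaker clustering on assumed per-segment speaker embeddings.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Each row is a hypothetical embedding for one detected speech segment.
segment_embeddings = np.array([
    [0.90, 0.10], [0.88, 0.12],   # speaker A
    [0.10, 0.95], [0.12, 0.90],   # speaker B
    [0.91, 0.08],                 # speaker A again
])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(segment_embeddings)
print(labels)  # e.g. [0 0 1 1 0]: segments grouped per speaker

# The target speaker can then be chosen (e.g. the cluster that uttered the
# trigger word) and only that cluster's segments passed to recognition.
target_cluster = labels[0]
target_segments = [i for i, lab in enumerate(labels) if lab == target_cluster]
print(target_segments)
```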
Based on any of the above embodiments, the speech recognition component is further configured to:
and if the recognition text contains a preset trigger word, continuously executing voice recognition on the user voice stream based on the voice interaction resource.
Specifically, the triggering of voice interaction can be realized through a preset trigger word, such as "Hello, Xiaofei". The preset trigger word can be a default of the voice interaction system or user-defined, and one or more preset trigger words can be adopted.
After the voice recognition operation is performed on the user voice stream, the recognition text of the user voice stream can be obtained, and if the recognition text is detected to contain the preset trigger word, the user can be considered to trigger the voice interaction function. After the voice interaction function is triggered, continuous voice recognition operation can be performed on the continuously picked-up user voice stream, and dialogue processing operation and voice synthesis operation are continuously performed on the recognized text obtained through recognition, so that natural voice interaction can be achieved after one trigger.
Further, in order to ensure the naturalness of voice interaction, the user can combine the preset trigger word with continuous speech when triggering, for example, "Hello, help me play a song".
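A minimal sketch of this trigger-word handling, with assumed trigger phrases, is shown below.

```python
# Minimal sketch of trigger-word handling: detect a preset trigger word in the
# recognition text and, if a command follows in the same utterance, keep it so
# that "trigger + request" can be handled in one turn.
TRIGGER_WORDS = ("hello xiaofei", "hi assistant")  # assumed, user-configurable

def check_trigger(recognized_text: str) -> tuple[bool, str]:
    text = recognized_text.lower().strip().strip(",. ")
    for trigger in TRIGGER_WORDS:
        if text.startswith(trigger):
            remainder = text[len(trigger):].lstrip(",. ")
            return True, remainder      # remainder may already be a request
    return False, ""

print(check_trigger("Hello Xiaofei, help me play a song"))
# (True, 'help me play a song')
print(check_trigger("turn off the lights"))
# (False, '')
```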
Based on any of the above embodiments, the dialog processing component includes:
the natural language understanding sub-component is used for acquiring entity information and intention information in the identification text based on the voice interaction resource and a natural language understanding algorithm, and acquiring text semantics based on the entity information and the intention information;
a dialog management sub-component for determining dialog actions for feeding back the text semantics based on the voice interaction resources and dialog management algorithms;
and a natural language generation sub-component, used for generating interactive text corresponding to the dialogue action based on a natural language generation algorithm.
In particular, the natural language understanding sub-component is used for semantic understanding of the recognition text, where the semantic understanding depends on a natural language understanding (NLU) algorithm. The NLU algorithm is used to extract information from the recognition text and generate one or more semantic paths for the content contained in the recognition text as the text semantics.
The two basic NLU algorithms supporting voice interaction are named entity recognition (NER) and intent understanding. NER is used to identify and label named entities such as persons, places, and organizations, thereby obtaining the entity information in the recognition text; the entity information can cover the entities contained in the recognition text and their specific types. On the basis of the entity information, intent understanding can perform domain classification, intent recognition, and semantic annotation on the recognition text, thereby obtaining the intent information of the recognition text, and the text semantics are then determined from the intent information.
For example, FIG. 6 is a schematic diagram of intent understanding provided by the present invention; it depicts the relationships between domain classification, intent recognition, and semantic annotation in intent understanding, and the role each plays. The top layer is domain classification, which classifies the meaning of the recognition text into a domain category. The middle layer is intent recognition, which identifies further details of the sentence and maps them onto a defined expression library in augmented Backus-Naur form (ABNF). The bottom layer is semantic annotation, also called attribute extraction, which refers to the process of generating and attaching labels representing a particular meaning to keywords or sentences that have semantic slots. Semantic annotation can also be viewed as a sequence labeling task that selects the semantics useful for capturing the speaker's intent, and it can be implemented using rule-based or machine-learning-based methods.
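The three layers can be illustrated with a toy, rule-based sketch; the domains, intents, and slot patterns below are assumptions for illustration only.

```python
# Toy illustration of the three intent-understanding layers: domain
# classification, intent recognition, and semantic annotation (slot filling).
import re

def understand(text: str) -> dict:
    text = text.lower()
    # Domain classification and intent recognition via simple keyword rules.
    if any(w in text for w in ("play", "song", "music")):
        domain, intent = "music", "play_music"
    elif any(w in text for w in ("navigate", "route", "drive")):
        domain, intent = "navigation", "plan_route"
    else:
        domain, intent = "chitchat", "open_talk"

    # Semantic annotation: crude slot extraction with regular expressions.
    slots = {}
    m = re.search(r"\bby ([a-z ]+)$", text)
    if m:
        slots["artist"] = m.group(1).strip()
    m = re.search(r"\bto ([a-z ]+)$", text)
    if m and domain == "navigation":
        slots["destination"] = m.group(1).strip()
    return {"domain": domain, "intent": intent, "slots": slots}

print(understand("play a song by queen"))
print(understand("navigate to the airport"))
```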
After the text semantics are obtained, the dialogue management sub-component can be applied: the dialogue management algorithm takes the current dialogue state and context as input, updates the dialogue state, and generates the dialogue action to be performed according to the dialogue processing logic and the text semantics.
The natural language generation sub-component is used for generating natural language text of feedback text semantics, namely, interactive text of a feedback user. Here, the natural language generation algorithm has the capability of natural language generation (natural language generation, NLG) so that interactive information in the form of data can be converted into text in the form of natural language, i.e. interactive text. Here, the content of the interactive text may be a simple reply text, a reply text based on a predefined template, a reply text reasonably guided or suggested by understanding and responding to the user's intention, etc., to which the embodiment of the present invention is not limited in detail.
Based on any of the above embodiments, the dialog processing component further includes:
a semantic rejection sub-component, used for screening out irrelevant semantics from the text semantics based on the voice interaction resource and a semantic rejection algorithm, to obtain target semantics;
a semantic post-processing sub-component, used for determining interaction information of the target semantics based on the voice interaction resource and a semantic post-processing algorithm;
correspondingly, the dialogue management sub-component is specifically configured to determine the dialogue action corresponding to the interaction information based on the voice interaction resource and a dialogue management algorithm.
In general, the text semantics obtained through natural language understanding take the form of multiple unordered semantic paths. The semantic rejection sub-component therefore uses a semantic rejection algorithm to rank the semantic paths by their degree of match with the current interaction task, the dialogue subject, or the context of the current interaction, screens out the semantic paths irrelevant to the interaction task, dialogue subject, or context, and selects the optimal semantic path that best reflects the semantics of the recognition text as the target semantics of the recognition text.
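A minimal sketch of this ranking-and-rejection step is shown below; the scoring function is an intentionally simple stand-in for a learned matching model.

```python
# Sketch of semantic rejection: candidate semantic paths from NLU are scored
# against the current task/context and the best match is kept as the target
# semantics.
def select_target_semantics(candidates: list[dict],
                            current_task: str,
                            context_slots: set[str]):
    def score(path: dict) -> float:
        task_match = 1.0 if path["domain"] == current_task else 0.0
        slot_match = len(set(path["slots"]) & context_slots)
        return task_match * 2.0 + slot_match   # task relevance weighted higher

    ranked = sorted(candidates, key=score, reverse=True)
    best = ranked[0]
    return best if score(best) > 0 else None   # everything irrelevant: reject

paths = [
    {"domain": "music", "intent": "play_music", "slots": {"artist": "queen"}},
    {"domain": "navigation", "intent": "plan_route", "slots": {}},
]
print(select_target_semantics(paths, current_task="navigation",
                              context_slots={"destination"}))
```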
After the target semantics of the recognition text are obtained, the semantic post-processing sub-component can be applied: the target semantics are converted into a service request by the semantic post-processing algorithm, and the data meeting the user's needs, i.e. the interaction information required for the interaction, is then obtained by querying the semantic interaction resource containing massive amounts of service data.
The semantic post-processing algorithm here may include operations such as semantic inheritance, semantic post-processing, information source searching, semantic correction, and business data ordering. The semantic post-processing algorithm can also be deployed as a cloud service, so that the user's service needs can be met quickly and accurately with the support of strong computing power.
After the interaction information is obtained, the dialogue management sub-component can be applied: the dialogue management algorithm takes the current dialogue state and context as input, updates the dialogue state, and generates the dialogue action to be performed according to the dialogue processing logic and the interaction information. The natural language generation sub-component can then be applied to generate appropriate natural language text corresponding to the dialogue action, i.e. the interactive text fed back to the user. In addition, when the dialogue management sub-component performs dialogue management, it can combine the voice interaction knowledge contained in the voice interaction resource with resource data such as scene data, history data, and user data, so that an active dialogue can be initiated when the user is not speaking, and silence can be chosen while the user is speaking.
According to the system provided by the embodiment of the invention, the dialogue processing is realized through natural language understanding, semantic rejection, semantic post-processing, dialogue management and natural language generation, so that the dialogue processing operation can understand the intention of a user and predict future dialogue contents to a certain extent according to voice interaction resources. In this process, the dialog processing component may execute according to a dialog context, which may also be supported by the voice interaction resources. In addition, the dialog processing component can provide reasoning functions to assist the voice interaction system in understanding, predicting, and deciding on recognition text, where reasoning can include spatial reasoning, temporal reasoning, common sense reasoning, computational policy application, or any form of reasoning that can be encoded; furthermore, the dialog processing component can track dialog states, manage dialog policies, change or conduct dialog topics based on user intent.
Based on any of the above embodiments, the dialog management sub-component is specifically configured to:
and determining the dialogue action corresponding to the interaction information based on the dialogue guiding algorithm and/or the beat control algorithm in the voice interaction resource and the dialogue management algorithm.
Specifically, the session management algorithm may include a session guidance algorithm or a beat control algorithm, and may also include both the session guidance algorithm and the beat control algorithm.
Under the dialogue management sub-component and the natural language generation sub-component, the tasks performed by the dialogue management and natural language generation algorithms can be further divided into six tasks: content determination, text structuring, sentence aggregation, lexicalization, referring expression generation, and linguistic realization. Content determination refers to deciding which information should be contained in the text to be generated, and text structuring refers to deciding the order in which that information is presented. Sentence aggregation refers to deciding which information to present in a single sentence, lexicalization refers to finding appropriate words or phrases to express the information, referring expression generation refers to selecting words and phrases that identify domain objects, and linguistic realization refers to combining all the words and phrases into a well-structured sentence.
The dialogue guiding algorithm is an algorithm that updates the scene and state of the current user using the scene-state semantics and the retrieved interaction information, and generates prompts from these data so as to guide the dialogue or open new topics.
The beat control algorithm is an algorithm that coordinates and controls the conversation rhythm according to the scene data, the speaker state (such as speaker type and mood), and the dialogue state (such as speech rate and intonation), making the human-machine conversation more natural. The beat control algorithm mainly covers the following capabilities: active interruption and silencing, emotion recognition and expression, dynamic interruption, emotional response, topic change, speech delay/disfluency, and asymmetric conversation. Speech delay/disfluency is used to simulate the hesitations common in real spoken conversation, thereby improving the realism of the voice interaction experience.
The system provided by the embodiment of the invention integrates dialogue guiding and rhythm control in the dialogue management sub-component, thereby achieving an active dialogue response when the user stops speaking and choosing silence, as a good listener would, when the user continues speaking.
Based on any of the above embodiments, the speech synthesis component is specifically configured to:
determining target voice attributes of a target speaker from the voice interaction resources;
Based on a speech synthesis algorithm, interactive synthesized speech corresponding to the interactive text and conforming to the target speech attribute is generated.
Specifically, speech synthesis refers to a technique of converting data into speech representing the content of that data. In the embodiment of the invention, the speech synthesis operation (TTS), which converts text to speech, can be understood as the reverse process of the speech recognition operation (ASR). The speech synthesis operation based on the interactive text can be divided into three tasks: text analysis, prosody analysis, and acoustic analysis. Text analysis extracts text features based on a phoneme dictionary and converts graphemes into phonemes; prosody analysis predicts prosodic features such as fundamental frequency, duration, pitch, intonation, and speech rate; acoustic analysis maps the text parameters to voice parameters. On this basis, speech synthesis can be realized through a vocoder, so as to obtain the interactive synthesized speech. In a specific implementation, the interactive synthesized speech can be synthesized by waveform concatenation or parametric synthesis. Waveform concatenation extracts suitable speech units from a corpus and concatenates them into sentences. Parametric synthesis requires parametric modeling of the phoneme dictionary and prediction of prosodic and acoustic parameters using machine learning methods.
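The three synthesis tasks and the vocoder stage can be sketched as a pipeline skeleton; every function body below is a placeholder rather than a real model, and the frame sizes and sample rate are assumptions.

```python
# Skeleton of the three speech-synthesis tasks (text analysis, prosody
# analysis, acoustic analysis) followed by a vocoder stage.
import numpy as np

def text_analysis(text: str) -> list[str]:
    # Grapheme-to-phoneme conversion would normally use a phoneme dictionary.
    return list(text.lower().replace(" ", "|"))

def prosody_analysis(phonemes: list[str], speech_rate: float = 1.0) -> dict:
    # Predict duration/pitch contours; fixed values scaled by speech rate here.
    return {"durations_ms": [80 / speech_rate] * len(phonemes),
            "pitch_hz": [200.0] * len(phonemes)}

def acoustic_analysis(phonemes: list[str], prosody: dict) -> np.ndarray:
    # Map text/prosody parameters to acoustic frames (zeros as a stand-in).
    n_frames = int(sum(prosody["durations_ms"]) // 10)
    return np.zeros((n_frames, 80), dtype=np.float32)   # e.g. mel-spectrogram

def vocoder(acoustic_frames: np.ndarray, sr: int = 16000) -> np.ndarray:
    # A real vocoder converts frames to a waveform; silence is returned here.
    return np.zeros(acoustic_frames.shape[0] * sr // 100, dtype=np.float32)

phonemes = text_analysis("play some music")
waveform = vocoder(acoustic_analysis(phonemes, prosody_analysis(phonemes, 1.2)))
print(len(phonemes), waveform.shape)
```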
In order to ensure the diversity of the interactive synthesized speech and enhance the flexibility and interest of voice interaction, parameters of the interactive synthesized speech such as the simulated speaker, pitch, and speech rate can be adjusted. The voice interaction resource can store the voice characteristics of a large number of preset speakers, so as to provide the user with virtual dialogue partners of different identities and roles. When performing interactive speech synthesis, the target voice attributes of a target speaker can be selected from the voice interaction resource, where the target speaker can be set by the user or be a device default, and speech synthesis can then be performed according to the target voice attributes of that speaker.
Further, the target voice attributes include at least one of the speech rate, pitch, intonation, and timbre of the target speaker.
The system provided by the embodiment of the invention can support speech synthesis in one or more languages and support the synthesis of continuous voice streams. The interactive synthesized speech can imitate the voice characteristics of the target speaker and carry that speaker's auditory characteristics. Moreover, various voice attributes of the speech synthesis, such as acoustic prosody, speech rate, pitch, and intonation, as well as various timbres such as male, female, and elderly voices, can be adjusted.
In addition, the speech synthesis component provided in the embodiment of the present invention should support the ability of speech synthesis in a plurality of different use scenarios, including:
a) At least one timbre;
b) Mixed reading of Chinese and English or other multi-languages;
c) When it is desired to add pauses to the generated speech, modal particles can be added to the interactive synthesized speech.
Based on any of the above embodiments, the interactive synthesized speech is natural speech with acceptable naturalness and intelligibility. The quality of the interactive synthesized speech can be measured using the mean opinion score (MOS). In the embodiment of the invention, the MOS of the interactive synthesized speech should be evaluated as above 4; further, the mean opinion score of the speech synthesis of the full duplex voice interaction system should be greater than or equal to 4.2. The MOS scoring rules are shown in Table 1:
TABLE 1 MOS quantization values of interactive synthesized speech and corresponding listening effects
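Independently of the quantization levels in Table 1, the MOS check itself can be illustrated as follows, with hypothetical listener ratings.

```python
# Simple illustration of the MOS check: average listener ratings (1-5 scale)
# over the synthesized test utterances and compare against the 4.2 target.
def mean_opinion_score(ratings: list[float]) -> float:
    return sum(ratings) / len(ratings)

listener_ratings = [4.5, 4.0, 4.4, 4.3, 4.2]   # hypothetical audition results
mos = mean_opinion_score(listener_ratings)
print(f"MOS = {mos:.2f}, meets 4.2 target: {mos >= 4.2}")
```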
Based on any of the above embodiments, the system further comprises at least one of a terminal computing resource, an edge computing resource, and a cloud computing resource.
Specifically, the terminal computing resources are the computing resources of the user terminal itself. The voice interaction system can call at least one of the terminal computing resources, the edge computing resources, and the cloud computing resources to realize voice interaction, thereby improving the full duplex voice interaction capability. Each component in the voice interaction system can be deployed on the computing resource that suits its functional requirements; that is, the components can be deployed on the same computing resource or on different computing resources respectively, which is not specifically limited in the embodiment of the invention. For example, components that implement functions such as voice recognition, dialogue management, and text synthesis can be processed using cloud computing resources.
Based on any of the above embodiments, the system further comprises an edge computing resource, a cloud computing resource, and a transmission component;
the edge computing resource is used for providing computing resources for the acoustic acquisition component and the voice synthesis component, and the cloud computing resource is used for providing computing resources for the voice recognition component and the dialogue processing component;
the transmission component is used for transmitting the user voice stream at the edge computing resource to the cloud computing resource for the voice recognition component to operate, and is also used for transmitting the interactive text at the cloud computing resource to the edge computing resource for the voice synthesis component to operate.
Specifically, each functional component used for voice interaction can be allocated the computing resources best suited to its functional requirements, so as to realize the voice interaction function. In particular, where edge computing resources and cloud computing resources coexist, the acoustic acquisition component and the voice synthesis component can be deployed on the edge computing resources, and the voice recognition component and the dialogue processing component can be deployed on the cloud computing resources, following the voice interaction functional component deployment illustrated in fig. 7, thereby providing more powerful computing support for voice recognition and dialogue processing.
Under the condition that the voice interaction system itself bears the edge computing function, after the voice interaction system acquires the user voice stream, the user voice stream can be sent to a cloud providing cloud computing resources through a transmission component, the cloud executes voice recognition operation and dialogue processing operation aiming at the user voice stream based on the voice interaction resources, a voice recognition algorithm and a dialogue processing algorithm, so that an interaction text of the user voice stream is obtained, and the interaction text is fed back to the voice interaction system through the transmission component.
After that, the local providing the edge computing resource can receive the interactive text returned by the cloud through the transmission component, and after the interactive text is obtained, the interactive synthesized voice corresponding to the interactive text is generated by utilizing the local computing resource, namely the edge computing resource.
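A conceptual sketch of this edge/cloud split is given below; the HTTP endpoint, payload format, and response field are assumptions, not a defined interface of the invention.

```python
# Conceptual sketch of the edge/cloud split: the edge side captures audio and
# synthesizes speech locally, while recognition and dialogue processing run on
# a cloud endpoint reached through the transmission component.
import json
from urllib import request

CLOUD_ENDPOINT = "https://cloud.example.com/fdx/dialog"   # hypothetical URL

def send_voice_stream_to_cloud(audio_chunk: bytes) -> str:
    """Transmit captured audio upstream; the cloud returns interaction text."""
    req = request.Request(CLOUD_ENDPOINT, data=audio_chunk,
                          headers={"Content-Type": "application/octet-stream"})
    with request.urlopen(req, timeout=5) as resp:           # network required
        return json.loads(resp.read())["interaction_text"]

def synthesize_on_edge(interaction_text: str) -> bytes:
    """Placeholder for the local (edge) speech synthesis component."""
    return interaction_text.encode("utf-8")   # stand-in for waveform bytes

# Typical round trip (requires a reachable endpoint, so shown but not run):
# text = send_voice_stream_to_cloud(captured_audio)
# audio_out = synthesize_on_edge(text)
```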
Based on any of the above embodiments, the voice interaction knowledge includes declarative knowledge and procedural knowledge;
the resource data includes scene data, user data, and history data.
In particular, a voice interaction system, and in particular a dialog processing component in a voice interaction system, needs to be performed in dependence on voice interaction knowledge. The voice interaction knowledge, i.e. knowledge information for realizing voice interaction, may be pre-stored, or may be continuously updated during the voice interaction.
Further, the voice interaction knowledge can be divided into declarative knowledge and procedural knowledge. Declarative knowledge is information about "what"; declarative information is easily expressed verbally and translated into statements, and can therefore be regarded as explicit knowledge. Procedural knowledge is information about "how to do", which is often difficult to verbalize and describe, and can therefore be regarded as implicit knowledge. For example, "the author of Water Margin is Shi Nai'an" is declarative knowledge, whereas "how children speak" is procedural knowledge.
It should be noted that in the field of artificial intelligence (AI), procedural knowledge can be considered a kind of intelligent program that describes knowledge about the implementation of AI. The intelligent program includes many different programs that the AI system can execute, and the voice interaction system can be authorized to invoke the AI system capable of executing the procedural knowledge to implement a particular AI function. In this process, the voice interaction system can use the AI function without paying attention to its specific execution flow, and can apply procedural knowledge effectively simply by relying on the AI system.
In addition, the resource data are the resources related to the voice interaction task. The scene data reflect data related to the scene to which the voice interaction task belongs; for example, in a car driving scene the scene data can cover characteristic speech for navigation and the specific mode setting parameters of different navigation modes, such as a clear-road mode, a general navigation mode, and a highway navigation mode. The user data can reflect information about the user participating in the voice interaction task, such as user identity information, user preferences, and the user's historical settings. The history data can reflect data already generated in the current voice interaction task, or data generated in voice interaction tasks preceding the current one.
In addition, the resource data may further include service data, where the service data may reflect data of a service related to a voice interaction task, for example, when the voice interaction task is a navigation task, the service data may cover map resources, congestion data of a specific line, and the like, and when the voice interaction task is a home service, for example, the service data may cover executable functions and specific control parameters of each intelligent device capable of being controlled in a coordinated manner.
Based on any of the above embodiments, the voice interaction system further comprises:
and the display component is used for displaying at least one of the identification text, the interactive text and the voice interaction state.
The display component can be a display screen of a smart phone, an intelligent household appliance and the like, or can be a display screen which is additionally connected with a hardware processor of the voice interaction system, such as a display, a television and the like.
In the voice interaction process, considering the influence of interfering factors such as user accents and environmental noise, the recognition text cannot always reflect the content of the user voice stream with absolute accuracy. The display component can display the recognition text output by the voice recognition component so that the user can adjust it; by displaying the recognition text, the user can monitor and control the voice interaction to a certain extent, and the user experience is protected from interaction content that deviates from what the user intended.
In addition, in order to avoid the user not hearing or not understanding the interactive synthesized speech played by the voice broadcast component, the display component can also display the interactive text corresponding to the interactive synthesized speech, so that the user obtains clearer interaction information.
Moreover, the display component can also display the voice interaction state, and can remind the user that the voice interaction is running in a prompting mode of icons displayed on a screen, so that the user can conveniently confirm whether the voice interaction is started or not in real time, the privacy of the user is prevented from being eavesdropped and revealed, and the privacy safety of the user is ensured.
Based on any of the above embodiments, the voice interaction system further comprises:
an operation input component, used for receiving an interactive operation input by the user, where the interactive operation is used for at least one of: switching the voice interaction state, correcting the recognition text, setting the voice interaction language, setting the preset trigger word, and setting the target voice attributes of the target speaker.
Specifically, the operation input component may include a touch screen with both display and input functions, and may also include input devices in the form of physical hardware, such as physical keys, physical switches, and physical keyboards. The various interactive operations mentioned above may be implemented by the same device in the operation input component or by different devices therein, which is not specifically limited in the embodiment of the invention.
For example, the user can switch the voice interaction state through the operation input component, for example from on to off or from off to on, thereby achieving flexible control of the voice interaction state and avoiding the risk of privacy leakage caused by recording ambient sound when the user does not need voice interaction.
For another example, after the user sees the recognition text or hears the synthesized speech of the recognition text, the user can determine whether the current recognition text differs from what the user intended to express, i.e. whether the recognition text needs to be adjusted. If so, the user can modify the recognition text through the operation input component, for example by tapping the part of the recognition text on the screen that needs to be adjusted, so that the voice interaction system receives the user's modification operation; in response to the modification operation, the voice interaction system modifies the recognition text and applies the modified recognition text to the subsequent dialogue processing operation and speech synthesis operation.
For another example, the user can set the voice interaction language through the operation input component so that voice interaction is carried out in the language the user requires, or set the preset trigger word so that voice interaction is triggered by a trigger word of the user's choice, or set the target voice attributes of the target speaker so that the interactive synthesized speech broadcast by the voice interaction system matches those attributes and meets the user's interaction needs.
Based on any of the above embodiments, the voice interaction system further comprises:
and the switch component is used for switching the voice interaction state.
Specifically, the voice interaction system may be provided with a dedicated switch component to realize the switching of the voice interaction state, where the switch component may be a virtual switch, i.e. a switch icon displayed on a screen of the voice interaction system or a screen connected to the voice interaction system, or may be a physical switch, i.e. a switch in the form of a button or a dial plate, etc. disposed on the voice interaction system.
The on-off state of the voice interaction switch directly determines the voice interaction state, which can be on or off. Only when the voice interaction state is on is the voice interaction method of the above embodiments executed. If the voice interaction state is off, i.e. the user has turned off the voice interaction switch, the voice interaction system will not collect the user voice stream and the subsequent voice interaction process will not be executed, thereby preventing the user's privacy from being eavesdropped on or leaked and ensuring the privacy security of the user.
Based on any of the above embodiments, in a half duplex interaction system, the system is silent while the user inputs a voice signal or other input information, and a series of processes such as voice recognition, semantic understanding, dialogue generation, and speech synthesis are performed only after the user's input is complete. While the system is processing the voice signal or other input information, the system's receiving device is in a silent state: the system does not accept any user input at that time, and the user has to wait until the system has finished processing the previous voice interaction before it will again accept a voice signal or other input. It follows that a half duplex voice interaction system is characterized by the person and the system only being able to communicate in one direction at a time, similar to a turn-taking conversation.
In contrast to a half duplex interaction system, a full duplex interaction system allows the person and the system to communicate at the same time. FIG. 8 is a functional view of the voice interaction system according to the present invention. As shown in FIG. 8, the functional architecture of the voice interaction system is composed of several layers and components, the layers including an interaction layer, a knowledge and data resource layer, a base layer, and an AI and machine learning layer, where a layer refers to an aggregate of units performing a broad class of functional capabilities. The interaction layer comprises the acoustic acquisition component 120, the voice recognition component 130, the dialogue processing component 140, and the voice synthesis component 150. The knowledge and data resource layer comprises two parts, voice interaction knowledge and resource data, where the voice interaction knowledge forms a knowledge base and the resource data corresponds to scene data, history data, and user data.
The layers described above can be described in terms of their inputs, outputs, and intent or function, and each layer and its components can be used and tested individually. All layers can be integrated so that a user can converse with the voice interaction system and be helped to meet their needs. The full duplex voice interaction system can approach the level of conversation between people and can maintain the continuity of the dialogue, so that after the user has had two or more rounds of dialogue with the system, the system still maintains good engagement with the user and its replies still satisfy the user's needs.
The main function of the interaction layer is to recognize the input signal as plain text through the acoustic acquisition component 120 and the voice recognition component 130, understand the real intent of the input signal through the dialogue processing component 140 and generate an interactive reply text, and finally output the synthesized speech audio of the interactive reply text through the voice synthesis component 150. The knowledge and data resource layer mainly provides the necessary data resources and knowledge base for the interaction layer. The AI and machine learning layer mainly uses machine-learning-based AI methods for data processing, model training, and continuous optimization, providing the interaction layer with capabilities such as model inference, online data mining, and data analysis. The base layer comprises cloud services, terminals, and edge computing; it provides hardware computing resources, serves as the operating carrier for the AI and machine learning algorithms, and mainly guarantees the network calls and system stability of each module during full duplex voice interaction. Further, the base layer provides FDX voice interaction capabilities using cloud services and/or terminal and/or edge computing, where components such as voice recognition, dialogue management, and text synthesis can be handled by cloud services. In some embodiments, the dialogue processing component 140 may include a natural language understanding sub-component, a semantic rejection sub-component, a semantic post-processing sub-component, a dialogue management sub-component, and a natural language generation sub-component.
The natural language understanding sub-component can acquire entity information, intention information and the like in the recognition text obtained by the voice recognition component based on the voice interaction resource and the natural language understanding algorithm; the semantic rejection sub-component can reject the semantics based on the entity information and the intention information obtained by the natural language understanding sub-component, eliminate the semantic intention (such as boring information and the like) irrelevant to the current dialogue, and obtain effective entity information, intention information and the like, namely effective text semantics; the semantic post-processing sub-component can convert the valid text semantics into service requests based on the valid text semantics obtained by the semantic rejecting sub-component; the dialogue management sub-component can generate interaction information and interaction intention corresponding to the interaction information based on the service request, the voice interaction resource and a dialogue management algorithm; the natural language generation sub-component may derive interaction text based on the interaction information and the interaction intent.
The voice interaction system is executed under the voice interaction task, the voice interaction task can cover one or more tasks, and each task can be logically designed by using a traditional software engineering method, wherein the tasks comprise scene definition, environment characteristics, input and output, functional units, databases, data streams and the like. Also, each task here should be well defined and have a well-defined domain, and each task should reflect the needs of the user.
The acoustic acquisition component, the voice recognition component, the dialogue processing component, and the voice synthesis component covered by the interaction layer all play a core role in voice interaction, and these components can meet additional user requirements through added or modified functions. Although the above components have a temporal relationship during voice interaction, some components may interact with others at the same time. For example, the semantic-based voice endpoint detection and irrelevant content rejection in the voice recognition component also invoke the natural language understanding and semantic processing in the dialogue processing component, while the voice triggering function based on the voice recognition component is mainly applied to the acoustic acquisition component.
The knowledge and data contained in the voice interaction resource layer are extremely important for training, testing and operating all components below the functional component layer and the voice interaction whole. Also, the voice interaction system may "learn continuously", i.e. at run-time, its input data and resulting actions are also used to update the database held by the voice interaction system.
In the AI and machine learning layers, artificial intelligence and machine learning systems can be used for knowledge acquisition and model training, validation and verification to support the functional implementation of the functional components below the functional component layer. And under the cloud service, the terminal and the edge computing layer, network resources with high bandwidth and low delay are required to perform data transmission, so that the realization of real-time voice interaction is supported.
In addition, the utilization of each piece of infrastructure in the computing infrastructure layer needs to remain stable during voice interaction. To ensure good voice interaction performance, LSTM-RNNs and DNNs can be used for acoustic modeling, encoder-decoder structures for language modeling, CNNs for beamforming, and Bidirectional Encoder Representations from Transformers (BERT) for natural language understanding.
The voice interaction system can be operated by a user through natural voice language, and can also allow the user to operate through other modes such as gestures and actions. Moreover, the voice interaction function is only required to be triggered when interaction starts, and repeated triggering is not required in the specific interactive call process. In addition, functions, languages, information, etc. involved in voice interaction can be set by the user.
Based on any of the above embodiments, the above full duplex voice interaction system may be implemented based on an engineering flow, where the engineering flow is for stakeholders such as designers, enforcers, and verifiers, where the designers refer to entities that receive data and problem descriptions and create AI models, the enforcers refer to entities that receive AI models and specify calculations to be performed, and the verifiers refer to entities that verify whether the calculations are being performed and whether the AI models are being performed by design.
The voice interaction may vary by task, scene, add-on device, and embedding method, which may affect the process phase. For example, most models and functions in a full duplex voice interactive system can be trained using the ML algorithm, but require multiple iterative improvements to achieve acceptable levels of accuracy and reliability. Thus, testing and verification of voice interaction systems embedded in ML methods (especially deep learning) can be challenging compared to traditional rule-based methods (programmed to an understandable way according to requirements and specifications).
The engineering implementation flow begins when stakeholders decide to turn an idea into a tangible system. In the initial stage, the stakeholders should determine why a full duplex voice interaction system needs to be developed, what kinds of problems it can solve, what scenarios it should adapt to, and which customer needs or business opportunities it addresses. These questions can be answered through market research and analysis, and stakeholders with different expertise can help determine requirements and costs.
Then the design and development phase is entered: the work at this stage is to create the full duplex voice interaction system and end up with software or hardware ready for deployment, including an APP, an SDK (Software Development Kit), or SaaS (Software-as-a-Service). At this stage, and particularly before its end, the stakeholders should ensure that the full duplex voice interaction system achieves the goals, requirements, and other objectives determined in the initial stage.
The verification and authentication phase is then entered: the work at this stage is to check whether the full duplex voice interactive system starting from the design and development stage works as required and meets the objective.
Then enter the deployment, operation and monitoring phases: the work at this stage includes installing and configuring the full duplex voice interactive system in the target environment. The system installed and configured herein should be usable. This stage requires monitoring of the operational status and faults of the system and reporting to stakeholders to take action.
Then enter the re-evaluation phase: this stage should evaluate the results of the operational monitoring based on the targets and requirements determined for the full duplex voice interactive system. Once a problem is found, the target and requirements should be refined.
Finally, the retirement stage may be entered: in some cases, full duplex voice interactive systems may be outdated or unusable due to significant changes in the operating environment or technology, and stakeholders should consider retirement and discard or redevelopment.
Based on any of the above embodiments, fig. 9 is a schematic flow chart of a voice interaction method provided by the present invention, and as shown in fig. 9, the method can be applied to a voice interaction system, such as a smart phone, a smart home appliance, a smart assistant APP, a customer service robot, and the like. The method comprises the following steps:
step 910, collecting user voice stream after voice interaction awakening;
specifically, the user voice stream, that is, the voice data stream obtained during the voice interaction process, is obtained by recording in real time, and may specifically be obtained by recording voice, or may be obtained by recording video, and may be a single microphone or may be a microphone array including multiple microphones for recording the user voice stream, which is not particularly limited in the embodiment of the present invention.
It should be noted that the user voice stream here may be a voice data stream recorded by the user for voice interaction, for example a wake-up voice data stream for waking up the voice interaction, a voice data stream querying specific information after wake-up, or a voice data stream recorded when the user interrupts the speech being played by the voice interaction system, which is not specifically limited in the embodiment of the invention.
Here, the voice interaction wake-up may be triggered by the user operating a voice control on the UI interface, or by the user speaking a preset wake-up word (e.g., "Xiaofei"), which is not specifically limited in the embodiment of the invention. It can be understood that the voice interaction system can complete an entire conversation and realize multiple interactions after a single voice interaction wake-up; that is, the voice interaction system only needs to be woken up once at the beginning of a conversation, after which the user can keep speaking. The voice interaction system collects the user voice stream for a period of time after the wake-up, so that each subsequent step operates over that period and multiple continuous dialogues are carried out, achieving the effect of one wake-up, multiple interactions. The specific length of this period depends on the actual interaction between the user and the voice interaction system and can be understood as the duration of the entire conversation; for example, the voice interaction system may collect a user voice stream containing valid speech while or after broadcasting the interactive synthesized speech, i.e. the user interrupts the broadcast or continues the conversation after hearing it, so the collection of the user voice stream after wake-up can be extended until the entire conversation ends.
Whether the entire conversation has ended can be judged by whether a user voice stream containing valid speech is collected within a preset time. For example, after the voice interaction system broadcasts the interactive synthesized speech, if no user voice stream containing valid speech is collected within the preset time, the conversation can be considered finished and collection can stop. Alternatively, the end of the conversation can be judged using two preset times: after the voice interaction system broadcasts the interactive synthesized speech, if no user voice stream containing valid speech is collected within a first preset time, the voice interaction system can generate and broadcast an interactive synthesized speech for active interaction to guide the user to continue the conversation; after that active broadcast ends, if no user voice stream containing valid speech is collected within a second preset time, the conversation can be considered finished and collection can stop.
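The two-timeout logic can be sketched as follows; the timeout values and prompt text are assumptions.

```python
# Sketch of the two-timeout session logic: after a reply is broadcast, silence
# for the first timeout triggers an active prompt, and silence for a second
# timeout ends the dialogue and stops collection.
FIRST_TIMEOUT_S = 8.0     # assumed values; real systems tune these
SECOND_TIMEOUT_S = 8.0

def run_session(wait_for_valid_speech):
    """wait_for_valid_speech(timeout) returns recognized text or None."""
    while True:
        user_text = wait_for_valid_speech(FIRST_TIMEOUT_S)
        if user_text is None:
            print("system: anything else I can help with?")   # active prompt
            user_text = wait_for_valid_speech(SECOND_TIMEOUT_S)
            if user_text is None:
                print("system: ending session, stopping collection")
                return
        print(f"system: handling '{user_text}'")

# Toy driver: one utterance, then silence until both timeouts expire.
utterances = iter(["what's the weather tomorrow"])
run_session(lambda timeout: next(utterances, None))
```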
Moreover, voice interaction wake-up is only required to be triggered at the beginning of a dialogue, and no trigger is required in the dialogue process.
Step 920, performing voice endpoint detection and voice recognition on the user voice stream based on voice interaction resources, to obtain a recognition text of the user voice stream, wherein the voice endpoint detection is performed based on semantics of the user voice stream, and the voice interaction resources comprise voice interaction knowledge and resource data related to voice interaction tasks;
Step 930, performing natural language understanding, dialogue management and natural language generation on the recognition text based on the voice interaction resource to obtain an interaction text of the recognition text;
step 940, performing a speech synthesis operation on the interactive text to obtain an interactive synthesized speech of the interactive text;
specifically, for real-time interaction of user voice streams, the method can be divided into three stages of voice recognition operation, dialogue processing operation and voice synthesis operation, and execution of the three operation stages is realized on the basis of voice interaction resources.
The voice interaction resources herein, namely, data resources for supporting real-time voice interaction execution, can be specifically divided into two types of voice interaction knowledge and resource data:
the voice interaction knowledge, that is, knowledge information for realizing voice interaction, may be used to provide solutions to questions posed by a user in voice interaction, and may also provide descriptions for program tasks in voice interaction, which is not specifically described in the embodiments of the present invention.
The resource data are the resources related to the voice interaction task, and different voice interaction tasks can be associated with different resource data. Each time voice interaction is executed, an explicit voice interaction task is set; the voice interaction task reflects the specific goals and requirements that the voice interaction needs to address, and it can change with the scene type and the user requirement. For example, the voice interaction task may be telephony, navigation, home service, chat, and the like. The resource data related to the voice interaction task can reflect the resource information required to solve the specific problem addressed by the voice interaction task; for example, for a voice interaction task in an automobile driving scene, the related resource information may include the location areas of interest to the user, map resources, statement information for route queries, and the like.
Real-time interaction with the user voice stream can be realized by combining the voice interaction resources. Specifically, the recognition text of the user voice stream can be obtained through the voice recognition operation in combination with the voice interaction resources; the interactive text corresponding to the recognition text can then be obtained through the dialogue processing operation in combination with the voice interaction resources, the interactive text being the text-form information fed back to the user; finally, in combination with the voice interaction resources, the voice corresponding to the interactive text can be synthesized through the voice synthesis operation, so as to obtain the interactive synthesized voice.
In this process, the voice recognition operation, the dialogue processing operation and the voice synthesis operation may each be realized by calling a separate functional component; for example, a voice recognition component, a dialogue processing component and a voice synthesis component may be preset to realize the voice recognition operation, the dialogue processing operation and the voice synthesis operation respectively. Each functional component can be used and tested independently, and the functional components can also be integrated together to realize real-time interaction, as sketched below.
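The sketch below is purely illustrative: it assumes three placeholder callables standing in for the voice recognition, dialogue processing and voice synthesis components, and the class and parameter names are not prescribed by the embodiment.

```python
from typing import Callable

class VoiceInteractionPipeline:
    """Illustrative wiring of independently testable components into one pipeline."""

    def __init__(self,
                 recognize: Callable[[bytes], str],
                 process_dialogue: Callable[[str], str],
                 synthesize: Callable[[str], bytes]):
        self.recognize = recognize                # voice recognition component
        self.process_dialogue = process_dialogue  # dialogue processing component
        self.synthesize = synthesize              # voice synthesis component

    def run_turn(self, user_voice_stream: bytes) -> bytes:
        recognition_text = self.recognize(user_voice_stream)        # stage 1: voice recognition
        interaction_text = self.process_dialogue(recognition_text)  # stage 2: dialogue processing
        return self.synthesize(interaction_text)                    # stage 3: voice synthesis

# Each component can also be exercised on its own, e.g. with trivial stand-ins:
pipeline = VoiceInteractionPipeline(
    recognize=lambda audio: "what is the weather in city A",
    process_dialogue=lambda text: "City A is sunny today.",
    synthesize=lambda text: text.encode("utf-8"))
```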
Further, in step 920, voice endpoint detection and voice recognition may be implemented by a streaming voice recognition algorithm for recognizing a continuous voice stream and a semantics-based voice endpoint detection algorithm. The purpose of the semantics-based voice endpoint detection algorithm is to identify and remove silence periods from the user voice stream; compared with traditional acoustic VAD, semantics-based VAD combines semantic features when distinguishing speech from non-speech, which can further improve the reliability of voice endpoint detection. In addition, in step 920, an operation of rejecting irrelevant content may be added. Irrelevant content rejection can be implemented by an irrelevant content rejection algorithm, whose purpose is to distinguish and reject content in the user voice stream that cannot or should not be processed. Such content is generally independent of the interaction task and the dialogue topic or context, and may also include invalid speech. A disambiguation effect can also be achieved by the irrelevant content rejection algorithm.
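The following is a simplified sketch of how a semantics-based endpoint decision might differ from purely acoustic VAD: an endpoint is declared only when acoustic silence coincides with a semantically complete partial result. The predicate is_semantically_complete and the callable partial_text_fn are hypothetical stand-ins for the streaming recognizer and the semantic model, and the thresholds are arbitrary.

```python
import array

def is_silent(frame: bytes, energy_threshold: float = 500.0) -> bool:
    """Crude acoustic check: mean absolute amplitude of 16-bit PCM samples."""
    samples = array.array("h", frame)
    if not samples:
        return True
    return sum(abs(s) for s in samples) / len(samples) < energy_threshold

def detect_semantic_endpoint(frames, partial_text_fn, is_semantically_complete,
                             max_silent_frames: int = 20) -> bool:
    """Declare an endpoint only when a run of silent frames coincides with a
    semantically complete partial recognition result; a pause inside an
    unfinished sentence (e.g. 'I want to listen ...') is therefore ignored."""
    silent_run = 0
    for frame in frames:
        silent_run = silent_run + 1 if is_silent(frame) else 0
        if silent_run >= max_silent_frames:
            partial_text = partial_text_fn()  # current streaming recognition result
            if partial_text and is_semantically_complete(partial_text):
                return True
            silent_run = 0  # semantically incomplete: keep listening
    return False
```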
The natural language understanding, dialogue management and natural language generation in step 930 may be implemented by a natural language understanding algorithm, a dialogue management algorithm and a natural language generation algorithm, respectively. The natural language understanding is used to understand the semantics of the recognition text; the dialogue management takes the current dialogue state and context as input, updates the state of the dialogue, and generates the dialogue action to be carried out according to the dialogue processing logic; and the natural language generation is used to generate the natural language text corresponding to the dialogue action, that is, the interactive text. In addition, in step 930, operations of semantic rejection and semantic post-processing may also be added, whereby natural language understanding and semantic rejection are used to understand the semantics of the recognition text and screen out irrelevant semantics, while semantic post-processing, dialogue management and natural language generation are used to organize the dialogue transitions according to the semantics of the recognition text, thereby generating suitable natural language text, i.e., the interactive text.
The speech synthesis in step 940 relies on a speech synthesis algorithm, by which the interactive text is converted into interactive synthesized speech, thereby enabling voice interaction with the user.
Step 950, broadcasting the interactive synthesized voice;
the collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
Specifically, after the interactive synthesized voice is obtained, it can be broadcast. It should be noted that, during the broadcasting of the interactive synthesized voice, the recording and collection of the user voice stream are not interrupted. Continuously collecting the user voice stream ensures that user input can still be captured while the interactive synthesized voice is being broadcast, and that real-time interactive processing and feedback can be performed on a user voice stream that interrupts the broadcast. In this way, the user can keep speaking as continuous input, the interaction between the user and the machine comes closer to actual conversational habits, the user can talk freely, and the user can interrupt at any time during the voice interaction.
To ensure that the broadcasting of the interactive synthesized voice and the collection of the user voice stream can indeed be executed simultaneously, they can be realized through two physical channels: an uplink channel is used for collecting the user voice stream, i.e. transmitting the collected user voice stream from the user to the voice interaction system, and a downlink channel is used for broadcasting the interactive synthesized voice, i.e. transmitting the interactive synthesized voice from the voice interaction system to the user. The two channels should be able to operate simultaneously without interfering with each other, giving the voice interaction system the ability to "listen" while it "talks".
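A minimal concurrency sketch of the two channels follows, assuming hypothetical capture_frame and play_frame callables for the microphone and loudspeaker sides; it only illustrates that collection and broadcasting run in parallel without blocking each other, and does not represent the actual channel implementation.

```python
import queue
import threading

uplink = queue.Queue()    # user voice stream: user -> voice interaction system
downlink = queue.Queue()  # interactive synthesized voice: system -> user

def uplink_worker(capture_frame, stop: threading.Event):
    """Keep collecting the user voice stream, even while broadcasting is in progress."""
    while not stop.is_set():
        uplink.put(capture_frame())  # hypothetical microphone capture of one audio frame

def downlink_worker(play_frame, stop: threading.Event):
    """Keep broadcasting whatever interactive synthesized voice has been queued."""
    while not stop.is_set():
        try:
            play_frame(downlink.get(timeout=0.1))  # hypothetical loudspeaker playback
        except queue.Empty:
            continue

def start_duplex(capture_frame, play_frame):
    stop = threading.Event()
    workers = [threading.Thread(target=uplink_worker, args=(capture_frame, stop), daemon=True),
               threading.Thread(target=downlink_worker, args=(play_frame, stop), daemon=True)]
    for worker in workers:
        worker.start()
    return stop  # setting this event ends the duplex session
```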
In the method provided by the embodiment of the present invention, the collection of the user voice stream and the broadcasting of the interactive synthesized voice are executed through different physical channels, so that user input can still be collected while the interactive synthesized voice is being broadcast, and a user voice stream that interrupts the broadcast can still receive real-time interactive processing and feedback. This ensures that the user can keep speaking as continuous input, that the interaction between the user and the machine comes closer to actual conversational habits, that the user can talk freely and interrupt at any time during the voice interaction, and thus that the naturalness of the voice interaction is guaranteed.
Furthermore, by combining the voice interaction resources with the voice recognition operation, the dialogue processing operation and the voice synthesis operation to perform real-time voice interaction processing, the real-time voice interaction under a voice interaction task can meet increasingly complex and diversified human-machine interaction requirements. In particular, because the voice data stream is used, the user input is continuous and the voice interaction can refer to the context of the dialogue, thereby ensuring the continuity of the voice interaction.
Based on any of the above embodiments, step 950 further includes:
And if the user voice stream is collected during the broadcasting of the interactive synthesized voice, stopping broadcasting the interactive synthesized voice until updated interactive synthesized voice is obtained, or continuing to broadcast the interactive synthesized voice until updated interactive synthesized voice is obtained.
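A small sketch of the two broadcasting policies above (stop on barge-in, or finish the current broadcast and switch once the updated speech is ready), assuming a hypothetical play_chunk callable and chunked synthesized audio; the class and method names are illustrative only.

```python
import threading

class InterruptibleBroadcaster:
    """Illustrative handling of user barge-in during broadcasting."""

    def __init__(self, play_chunk, stop_on_barge_in: bool = True):
        self.play_chunk = play_chunk              # hypothetical playback of one audio chunk
        self.stop_on_barge_in = stop_on_barge_in  # policy: stop now vs. finish current speech
        self._barge_in = threading.Event()

    def notify_user_speech(self):
        """Called from the uplink side when a user voice stream is collected mid-broadcast."""
        self._barge_in.set()

    def broadcast(self, synthesized_chunks):
        for chunk in synthesized_chunks:
            if self._barge_in.is_set() and self.stop_on_barge_in:
                return  # stop; updated interactive synthesized voice will be broadcast later
            self.play_chunk(chunk)

    def broadcast_updated(self, updated_chunks):
        self._barge_in.clear()
        self.broadcast(updated_chunks)
```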
Based on any of the above embodiments, the method further comprises:
under the condition that the user voice stream is not collected, performing dialogue management and natural language generation based on the voice interaction resource to obtain an active interaction text;
and executing voice synthesis operation on the active interaction text to obtain the interaction synthesized voice of the active interaction text.
Specifically, the voice interaction system should be able to predict the user's intention to a certain extent according to the user's state and scene, control the rhythm of the dialogue, and actively give feedback and information to guide the user's next action. When no user voice stream is collected, in particular when the user does not speak for a period of time, the system can perform dialogue management and natural language generation in combination with the knowledge base, scene data, historical data and user data in the voice interaction resources, so that a timely and reasonable active dialogue can be carried out in combination with the previous response context and common knowledge.
It will be appreciated that the interactive synthesized voice of the active interaction text may also be broadcast after synthesis.
Based on any of the above embodiments, steps 920 and 930 include:
sending the user voice stream to a cloud end providing cloud computing resources to request the cloud end to execute voice endpoint detection and voice recognition on the user voice stream based on the voice interaction resources to obtain a recognition text of the user voice stream, and executing natural language understanding, dialogue management and natural language generation on the recognition text based on the voice interaction resources to obtain an interaction text of the recognition text;
and receiving the interactive text returned by the cloud.
Specifically, in order to provide stronger computing-power support for the speech recognition and dialogue processing operations, in the case where the voice interaction system itself bears the edge computing function, after the voice interaction system collects the user voice stream, it can send the user voice stream through a transmission component to a cloud providing cloud computing resources. The cloud executes the voice recognition operation and the dialogue processing operation on the user voice stream based on the voice interaction resources, a voice recognition algorithm and a dialogue processing algorithm, so as to obtain the interactive text of the user voice stream, and feeds the interactive text back to the voice interaction system through the transmission component.
After that, the local side providing the edge computing resources can receive the interactive text returned by the cloud through the transmission component, and after the interactive text is obtained, the interactive synthesized voice corresponding to the interactive text is generated using the local computing resources, namely the edge computing resources.
Based on any of the above embodiments, step 910 includes:
and collecting a user voice stream based on the microphone array, and carrying out voice preprocessing on the user voice stream, wherein the voice preprocessing comprises at least one of voice enhancement, dereverberation, sound source extraction and separation, echo cancellation, sound source positioning and denoising.
Based on any of the above embodiments, step 910 specifically includes:
performing speaker localization based on the microphone array, determining a speaker localization result;
and carrying out directional voice pickup based on the speaker positioning result to obtain a user voice stream, and carrying out voice preprocessing on the user voice stream.
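By way of illustration only, speaker localization can be pictured with a deliberately simplified two-microphone time-difference-of-arrival estimate (the embodiment itself does not prescribe any particular array geometry or algorithm); the sketch assumes far-field conditions, NumPy, and synchronized single-channel signals from the two microphones.

```python
import numpy as np

SPEED_OF_SOUND_M_S = 343.0

def estimate_speaker_angle(mic_a: np.ndarray, mic_b: np.ndarray,
                           sample_rate: int, mic_distance_m: float) -> float:
    """Estimate the speaker direction (radians from broadside) for a microphone
    pair: cross-correlate the two signals, take the best-matching lag as the
    time difference of arrival, and convert it to an angle."""
    correlation = np.correlate(mic_a, mic_b, mode="full")
    lag_samples = int(np.argmax(correlation)) - (len(mic_b) - 1)
    tdoa_s = lag_samples / sample_rate
    # Clamp to the physically possible range before taking the arcsine.
    sin_theta = np.clip(tdoa_s * SPEED_OF_SOUND_M_S / mic_distance_m, -1.0, 1.0)
    return float(np.arcsin(sin_theta))
```

The estimated angle could then be used to steer the directional pickup toward the target speaker before the voice preprocessing described above.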
Based on any of the above embodiments, the microphone array is specifically configured to collect a user voice stream in a near field and/or a far field;
the distance of the near field acquisition is within 1 meter, and the distance of the far field acquisition is more than 3 meters and less than 5 meters.
Based on any of the above embodiments, the array structure of the microphone array is at least one of a linear form, a planar form, and a spatial stereo form.
Based on any of the above embodiments, step 910 includes:
and under the condition that the voice interaction state is on, collecting the voice stream of the user.
Based on any of the above embodiments, step 920 includes:
performing voice endpoint detection on the user voice stream based on the voice interaction resource and a voice endpoint detection algorithm based on semantics to obtain an active voice stream;
performing voice recognition on the active voice stream based on the voice interaction resource and a streaming voice recognition algorithm to obtain a preliminary recognition text of the active voice stream;
and performing content rejection on the preliminary recognition text based on the voice interaction resource to obtain the recognition text of the user voice stream.
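Purely as an illustration, the content rejection step might be sketched as a filter over the preliminary recognition text; the predicate is_meaningful and the task_keywords set are hypothetical stand-ins for the semantic classifier and for the task/context information drawn from the voice interaction resources.

```python
from typing import Callable, Optional, Set

def reject_irrelevant_content(preliminary_text: str,
                              task_keywords: Set[str],
                              is_meaningful: Callable[[str], bool]) -> Optional[str]:
    """Keep the preliminary recognition text only if it is meaningful and
    related to the current interaction task; otherwise reject it."""
    text = preliminary_text.strip()
    if not text or not is_meaningful(text):
        return None  # invalid speech is rejected
    if task_keywords and not any(keyword in text for keyword in task_keywords):
        return None  # unrelated to the interaction task and dialogue topic
    return text      # this becomes the recognition text of the user voice stream
```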
Based on any of the above embodiments, step 920 further includes:
and if the recognition text contains a preset trigger word, continuously executing voice recognition on the user voice stream based on the voice interaction resource.
Based on any of the above embodiments, step 930 includes:
acquiring entity information and intention information in the recognition text based on the voice interaction resource and a natural language understanding algorithm, and obtaining the text semantics based on the entity information and the intention information;
screening out irrelevant semantics from the text semantics based on the voice interaction resource and a semantic rejection algorithm to obtain target semantics;
determining interaction information of the target semantics based on the voice interaction resource and a semantic post-processing algorithm;
based on the voice interaction resource and a dialogue management algorithm, determining dialogue actions corresponding to the interaction information;
and generating interactive text corresponding to the dialogue action based on a natural language generation algorithm.
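The chaining of these sub-operations could be pictured as follows; every callable is a hypothetical placeholder for the corresponding algorithm, and the data types are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Semantics:
    intent: str
    entities: dict = field(default_factory=dict)

def dialogue_turn(recognition_text: str,
                  understand: Callable[[str], Semantics],
                  reject: Callable[[Semantics], Optional[Semantics]],
                  post_process: Callable[[Semantics], dict],
                  manage_dialogue: Callable[[dict], str],
                  generate_text: Callable[[str], str]) -> Optional[str]:
    """Illustrative step-930 flow: NLU -> semantic rejection -> semantic
    post-processing -> dialogue management -> natural language generation."""
    semantics = understand(recognition_text)       # entity and intention information
    target = reject(semantics)                     # screen out irrelevant semantics
    if target is None:
        return None                                # nothing to respond to
    interaction_info = post_process(target)        # interaction information of the target semantics
    dialogue_action = manage_dialogue(interaction_info)
    return generate_text(dialogue_action)          # the interactive text
```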
Based on any one of the foregoing embodiments, in step 930, the determining, based on the voice interaction resource and the dialogue management algorithm, a dialogue action corresponding to the interaction information includes:
and determining the dialogue action corresponding to the interaction information based on the voice interaction resource and a dialogue guiding algorithm and/or a rhythm control algorithm within the dialogue management algorithm.
Based on any of the above embodiments, step 940 includes:
determining target voice attributes of a target speaker from the voice interaction resources;
based on a speech synthesis algorithm, interactive synthesized speech corresponding to the interactive text and conforming to the target speech attribute is generated.
Based on any of the above embodiments, the target voice attribute includes at least one of a pace, a pitch, an intonation, and a timbre of the target speaker.
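As a sketch of how the target voice attributes might be carried into synthesis, the following assumes a hypothetical tts_engine callable; the attribute fields and their units are illustrative, not prescribed by the embodiment.

```python
from dataclasses import dataclass

@dataclass
class TargetVoiceAttributes:
    """Target speaker attributes used to shape the interactive synthesized voice."""
    pace: float = 1.0          # relative speaking rate
    pitch: float = 0.0         # relative pitch shift
    intonation: str = "neutral"
    timbre: str = "default"    # identifier of the target speaker's voice

def synthesize_with_attributes(interactive_text: str,
                               attributes: TargetVoiceAttributes,
                               tts_engine) -> bytes:
    """Pass the interactive text and the target voice attributes to a
    hypothetical speech synthesis engine and return the audio it produces."""
    return tts_engine(text=interactive_text,
                      pace=attributes.pace,
                      pitch=attributes.pitch,
                      intonation=attributes.intonation,
                      timbre=attributes.timbre)
```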
Based on any of the above embodiments, the voice interaction knowledge includes declarative knowledge and procedural knowledge;
the resource data includes scene data, user data, and history data.
Based on any of the above embodiments, further comprising:
and displaying at least one of the recognition text, the interactive text and the voice interaction state.
Based on any of the above embodiments, further comprising:
and receiving an interactive operation input by the user, wherein the interactive operation is used for at least one of: switching the voice interaction state, correcting the recognition text, setting the voice interaction language, setting the preset trigger word, and setting the target voice attribute of the target speaker.
Based on any one of the above embodiments, fig. 10 is a second flowchart of the voice interaction method provided by the present invention, and shows the communication and dialogue process of voice interaction. In the voice interaction between the user and the voice interaction system, the user may first speak information carrying the preset trigger word to trigger the start of the voice interaction, and may thereafter continue speaking without interruption; correspondingly, the voice interaction system receives a continuous user voice stream containing the preset trigger word, for which it can perform real-time interaction operations to generate and broadcast interactive synthesized voice. After listening to the interactive synthesized voice, the user can continue to speak new content, and the voice interaction system continues to collect the user voice stream; accordingly, dialogue guidance and rhythm control can be carried out while voice collection and voice recognition continue, so that it can be judged whether the user interrupts the broadcast of the interactive synthesized voice during this process. If an interrupting user voice is detected, updated interactive synthesized voice can be generated based on the newly collected voice, thereby achieving a voice interaction that is more natural and better suited to the various unexpected situations that arise in actual interaction.
Fig. 11 is a third flowchart of the voice interaction method provided by the present invention and shows an interaction process based on a time sequence. The interaction process can be divided into a plurality of time intervals (Tn) according to the temporal logic of the dialogue. The uplink channel (input natural speech) and the downlink channel (output synthesized speech) should be able to receive and transmit speech signals within the same time interval (Ti). Thus, the user may interrupt the voice interaction system while it is speaking, and the voice interaction system may manage the cadence or give prompts while the user speaks or remains silent.
Fig. 12 is a schematic flowchart of the voice interaction method provided by the present invention. As shown in fig. 12, in the full duplex voice interaction process, the person and the machine can communicate with each other at the same time, and the parties at both ends of the conversation can talk simultaneously and be heard by the other party. In this process, the person can interrupt the interactive synthesized voice being broadcast, and the machine can break a silence by actively speaking a response, or can manage the rhythm or give a prompt while the user speaks or keeps silent.
Fig. 13 is a schematic diagram of a voice interaction scenario provided by the present invention. In fig. 13, "small flying hello" is the preset trigger word; the voice interaction system wakes up after detecting "small flying hello", and the microphone array in the acoustic collection component performs sound source localization in preparation for the subsequent directional pickup of the user voice stream of the target speaker. During the voice interaction, semantic understanding is carried out on the recognized request to play a random song, and the corresponding operation can be executed, namely a song is played. While the song is playing, the microphone array can perform echo cancellation on the collected audio, so that the collected song does not affect the execution of the voice interaction. During this process the microphone array continues recording and the voice recognition component continues performing voice recognition, so that the system picks up and responds to the user saying "I want to hear the D song of the C singer". After detecting "change to E singer", the voice interaction system can refer to the short-term memory of the preceding context, determine that the user wants to change to the D song of the E singer, and respond. For the "sounds nice" uttered by the user, the voice interaction system can judge through semantic rejection that the sentence carries no actionable meaning and needs no response, and continues to play the song. For pauses occurring in the user voice stream, semantic VAD can be performed by the voice recognition component to obtain an accurate expression of the user for an accurate response.
Fig. 14 is a second schematic diagram of a voice interaction scenario provided by the present invention. In fig. 14, after detecting the interrupting voice "I said I want to go to city B", the city name obtained in the previous recognition can be corrected according to the semantics, and the task dialogue after error correction is output. After detecting "help me book the earliest morning trip", the earliest flight, i.e. the flight departing at 9 o'clock in the morning, can be screened out from the flights already retrieved by logical reasoning and responded to accordingly. When "OK, what is the weather?" is detected, the city name "city B" can be shared between the ticket service and the weather service, so that the weather condition of city B is reported in response. When an utterance referring to the hotel used last time is detected, the corresponding record can be read from the user data and responded to.
Fig. 15 is a third schematic diagram of a voice interaction scenario provided by the present invention, in which the solid lines with arrows reflect the voice streams transmitted through the uplink and downlink channels. In fig. 15, the person says "I want to listen (pause of 1 s) to the song of singer C"; by understanding the semantics of the fragment "I want to listen", the machine finds that the sentence lacks object information and is not yet fully expressed, and on this basis determines that voice activity frames are still to come, so the machine can ignore the 1 s pause in the middle of the user voice stream, continue to collect the user voice stream, and perform semantic understanding according to the dialogue context. In addition, in fig. 15, when the person asks about the weather in city A today, the machine can perform voice recognition, semantic understanding and dialogue management on this utterance; while generating the interactive text "City A today ...", it continues to collect the user voice stream and picks up the interrupting input "city B". At this point, without affecting the previous interactive text from being formed into interactive synthesized voice and broadcast, the machine can perform processing such as voice recognition, semantic understanding, dialogue management and voice synthesis on the newly input "city B", thereby realizing parallel processing of the uplink and downlink channels.
Fig. 16 illustrates a physical structure diagram of an electronic device. As shown in fig. 16, the electronic device may include: a processor 1610, a communication interface (Communications Interface) 1620, a memory 1630, and a communication bus 1640, wherein the processor 1610, the communication interface 1620, and the memory 1630 communicate with each other via the communication bus 1640. The processor 1610 can invoke logic instructions in the memory 1630 to perform a voice interaction method comprising:
collecting user voice stream after voice interaction awakening;
performing voice endpoint detection and voice recognition on the user voice stream based on voice interaction resources to obtain a recognition text of the user voice stream, wherein the voice endpoint detection is performed based on semantics of the user voice stream, and the voice interaction resources comprise voice interaction knowledge and resource data related to voice interaction tasks;
based on the voice interaction resource, natural language understanding, dialogue management and natural language generation are carried out on the identification text, and an interaction text of the identification text is obtained;
performing voice synthesis operation on the interactive text to obtain interactive synthesized voice of the interactive text;
Broadcasting the interactive synthetic voice, and stopping broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained under the condition that the user voice stream is acquired in the broadcasting process of the interactive synthetic voice, or continuously broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained;
the collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
Further, the logic instructions in the memory 1630 described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method of voice interaction provided by the methods described above, the method comprising:
collecting user voice stream after voice interaction awakening;
performing voice endpoint detection and voice recognition on the user voice stream based on voice interaction resources to obtain a recognition text of the user voice stream, wherein the voice endpoint detection is performed based on semantics of the user voice stream, and the voice interaction resources comprise voice interaction knowledge and resource data related to voice interaction tasks;
based on the voice interaction resource, natural language understanding, dialogue management and natural language generation are carried out on the identification text, and an interaction text of the identification text is obtained;
performing voice synthesis operation on the interactive text to obtain interactive synthesized voice of the interactive text;
broadcasting the interactive synthetic voice, and stopping broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained under the condition that the user voice stream is acquired in the broadcasting process of the interactive synthetic voice, or continuously broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained;
The collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the voice interaction method provided above, the method comprising:
collecting user voice stream after voice interaction awakening;
performing voice endpoint detection and voice recognition on the user voice stream based on voice interaction resources to obtain a recognition text of the user voice stream, wherein the voice endpoint detection is performed based on semantics of the user voice stream, and the voice interaction resources comprise voice interaction knowledge and resource data related to voice interaction tasks;
based on the voice interaction resource, natural language understanding, dialogue management and natural language generation are carried out on the identification text, and an interaction text of the identification text is obtained;
performing voice synthesis operation on the interactive text to obtain interactive synthesized voice of the interactive text;
Broadcasting the interactive synthetic voice, and stopping broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained under the condition that the user voice stream is acquired in the broadcasting process of the interactive synthetic voice, or continuously broadcasting the interactive synthetic voice until updated interactive synthetic voice is obtained;
the collection of the user voice stream is carried out through an uplink channel, the broadcasting of the interactive synthetic voice is carried out through a downlink channel, the uplink channel and the downlink channel are different physical channels, and the uplink channel and the downlink channel are processed in parallel.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.