CN112466304B


Info

Publication number: CN112466304B
Application number: CN202011411215.4A
Authority: CN
Inventors: 孙洪菠
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2023-09-08
Anticipated expiration: 2040-12-03
Also published as: CN112466304A

Abstract

The application discloses an offline voice interaction method, device, system, equipment and storage medium, relates to the technical field of computers, and particularly relates to the technical field of artificial intelligence such as voice, deep learning and the like. The offline voice interaction method comprises the following steps: after the local terminal wakes up, continuously transmitting a voice signal to be recognized sent by a user to a decoder in the local terminal, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result; continuously receiving the voice recognition result sent by the decoder, and continuously responding to the voice recognition result until receiving an ending instruction sent by the user; and after receiving the ending instruction, ending the continuous interaction. The application can support continuous recognition after one-time awakening in the offline voice interaction scene.

Description

Translated from Chinese

Offline voice interaction method, device, system, equipment and storage medium

Technical field

This application relates to the field of computer technology, specifically to artificial intelligence technologies such as speech and deep learning, and in particular to an offline voice interaction method, device, system, equipment and storage medium.

Background

With the spread of computer technology, daily life has gradually entered the intelligent era. Beyond computers, mobile phones and tablets (PADs), smart technology is now applied in every aspect of daily life, for example smart TVs, smart navigation and smart homes, where it provides convenient and fast services. Voice interaction belongs to the category of human-computer interaction. It is a new generation of interaction mode based on voice input, in which a person uses natural language to give instructions to a machine in order to achieve the person's own goals.

The voice interaction process generally includes steps such as wake-up, speech recognition and speech synthesis. The existing technology supports only a single recognition after wake-up. That is, after the smart device is woken up, it executes only a single instruction for that wake-up. If the smart device needs to be controlled again afterwards, it must be woken up again before a new instruction can be issued.

Summary of the invention

This application provides an offline voice interaction method, device, system, equipment and storage medium.

According to one aspect of the present application, an offline voice interaction method is provided, including: after a local terminal wakes up, continuously transmitting a voice signal to be recognized, issued by a user, to a decoder in the local terminal, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result; continuously receiving the voice recognition result sent by the decoder, and continuously responding to the voice recognition result, until an end instruction issued by the user is received; and after receiving the end instruction, ending the current continuous interaction.

According to another aspect of the present application, an offline voice interaction device is provided, including: a transmission unit, configured to continuously transmit, after a local terminal wakes up, a voice signal to be recognized, issued by a user, to a decoder in the local terminal, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result; a response unit, configured to continuously receive the voice recognition result sent by the decoder and continuously respond to the voice recognition result until an end instruction issued by the user is received; and an end unit, configured to end the current continuous interaction after the end instruction is received.

According to another aspect of the present application, an offline voice interaction system is provided, including the device according to any one of the above aspects.

According to another aspect of the present application, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method according to any one of the above aspects.

According to another aspect of the present application, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to execute the method according to any one of the above aspects.

According to the technical solution of this application, voice signals are continuously transmitted and processed after the local terminal wakes up, and the current voice interaction ends only when the user actively ends it. This supports continuous recognition after a single wake-up in an offline voice interaction scenario, which improves the user experience, avoids wasting resources and improves voice interaction efficiency.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the application, nor is it intended to limit the scope of the application. Other features of the present application will become readily understood from the following description.

Description of the drawings

The accompanying drawings are provided for a better understanding of the present solution and do not limit the present application. In the drawings:

Figure 1 is a schematic diagram according to a first embodiment of the present application;

Figure 2 is a schematic diagram of an offline voice interaction system according to an embodiment of the present application;

Figure 3 is a schematic diagram according to a second embodiment of the present application;

Figure 4 is a schematic diagram of a traceback voice signal according to an embodiment of the present application;

Figure 5 is a schematic diagram according to a third embodiment of the present application;

Figure 6 is a schematic diagram according to a fourth embodiment of the present application;

Figure 7 is a schematic diagram according to a fifth embodiment of the present application;

Figure 8 is a schematic diagram of an electronic device used to implement any of the offline voice interaction methods according to the embodiments of the present application.

Detailed description

Exemplary embodiments of the present application are described below with reference to the accompanying drawings. They include various details of the embodiments to aid understanding, and these details should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the application. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.

In the related art, voice interaction supports only a single recognition after wake-up. For example, suppose the wake-up word is "Xiaodu Xiaodu" (小度小度). In a music scenario, to have a smart device (such as a smart speaker) play music, the user says "Xiaodu Xiaodu", the smart speaker replies with a response phrase (such as "I'm here"), the user then says the voice command "play music", and the smart speaker recognizes the command and plays music. If the user then finds that the music is not what they want to hear and wishes to change it, the related art requires the user to wake the smart speaker again: the user says "Xiaodu Xiaodu" again, the smart speaker answers "I'm here" again, the user says the new voice command "change the song", and the smart speaker responds by playing a different song. If the user wants to turn up the volume, the user must again say "Xiaodu Xiaodu", the smart speaker again answers "I'm here", and only then can the user say the new voice command "turn up the volume", to which the smart speaker responds by raising the playback volume. As this flow shows, when the user needs to issue several voice commands, the voice interaction process of the related art requires the user to wake the smart device several times. For the user, the operation is cumbersome, because the smart device must be woken before every new voice command, which harms the user experience. For the smart device, the wake-up word must be recognized and answered many times, which wastes resources and reduces voice interaction efficiency.

To improve the user experience, avoid wasting resources and improve voice interaction efficiency, this application provides the following embodiments.

Figure 1 is a schematic diagram according to a first embodiment of the present application. This embodiment provides an offline voice interaction method, including:

101. After the local terminal wakes up, continuously transmit the voice signal to be recognized, issued by the user, to the decoder in the local terminal, so that the decoder continuously decodes the voice signal to be recognized to obtain a voice recognition result.

102. Continuously receive the voice recognition result sent by the decoder, and continuously respond to the voice recognition result, until an end instruction issued by the user is received.

103. After receiving the end instruction, end the current continuous interaction.

The offline voice interaction method provided in this embodiment is applied in an offline voice interaction scenario, so the method is executed by the local terminal used by the user. The specific form of the local terminal is not limited and covers all smart devices configured with an offline voice interaction function. For example, it may be a vehicle-mounted terminal, a smart home terminal or any of various mobile devices. Mobile devices include, for example, mobile phones, tablet computers, handheld computing devices, PDAs (personal digital assistants), portable media players, devices using headsets and earphones (for example, Bluetooth-compatible devices), phablet devices (that is, combined smartphone/tablet devices) and wearable computers.

The local terminal can interact with the user by voice through an offline voice interaction system. The offline voice interaction system may further include a voice interaction interface through which the user inputs voice instructions. The voice interaction interface may be provided by an APP (application), a web page, a program or the like, which is not limited in this application. The APP may be explicitly installed on the interface of the local terminal, or it may be called up by the user through a specific hardware and/or software button, which is likewise not limited in this application.

In this embodiment, "continuous" is used in contrast to "single". It means that the offline voice interaction stays in progress, rather than ending, until the interaction is complete. For example, suppose that after the local terminal wakes up there are N voice signals before the end instruction issued by the user is received. With "single" processing, only the first voice signal after wake-up is processed, and the remaining N-1 are treated as invalid voice signals and ignored. In this embodiment, all N voice signals are processed. This allows multiple interactions after a single wake-up, instead of one recognition per wake-up.
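The difference between "single" and "continuous" handling can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `run_session`, the `respond` callback and the end phrase `"stop playing"` (borrowed from the example given elsewhere in this description) are assumptions made for illustration only.

```python
from typing import Callable, Iterable, List

# Hypothetical end phrase; the source only says the end instruction is issued by the user.
END_INSTRUCTION = "stop playing"


def run_session(
    results: Iterable[str],
    respond: Callable[[str], None],
    continuous: bool = True,
) -> List[str]:
    """Respond to recognition results that arrive after one wake-up.

    continuous=True  : every result is handled until the end instruction arrives.
    continuous=False : only the first result is handled (the related-art behaviour).
    """
    handled: List[str] = []
    for text in results:
        if text == END_INSTRUCTION:
            break  # the user ended the current continuous interaction
        respond(text)
        handled.append(text)
        if not continuous:
            break  # single recognition: the remaining N-1 signals are ignored
    return handled
```

With the inputs "play music", "change the song", "turn up the volume", the continuous mode handles all three, while the single mode handles only "play music".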

As shown in Figure 2, the offline voice interaction system 200 may include a data collection module 201, a wake-up module 202, a recognition processing module 203, a voice endpoint detection module 204 and a decoder 205.

With reference to the offline voice interaction system shown in Figure 2, the method shown in Figure 1 may specifically be executed by the recognition processing module 203 of that system.

During offline voice interaction, after the smart device collects a voice signal, it first determines whether the voice signal contains the wake-up word. Once the wake-up word is found, the voice signals received afterwards are treated as voice signals to be recognized and undergo subsequent processing such as recognition and response. For example, if the voice signal to be recognized is "play music", the device recognizes it and responds by playing music.

In the embodiments of this application, two terms are used to distinguish the signals. A voice signal received before the local terminal is successfully woken up is called a "wake-up voice signal"; it may or may not contain the wake-up word. For example, if the wake-up word is "Xiaodu Xiaodu", a voice signal containing "Xiaodu Xiaodu" is a wake-up voice signal. A voice signal received after the local terminal is successfully woken up is called a "voice signal to be recognized". For example, after the local terminal is woken with the wake-up word "Xiaodu Xiaodu", subsequent voice signals such as "play music" are voice signals to be recognized.

The data collection module 201 is used to collect voice signals. For example, after the user speaks, a microphone array captures the voice signal. The microphone array may leave the signal unprocessed or apply processing such as enhancement, and then sends the unprocessed voice signal (which may be called the original voice signal) or the processed voice signal to the data collection module 201.

After the data collection module 201 collects a voice signal, if the voice signal is a wake-up voice signal, it sends the wake-up voice signal to the wake-up module 202. Specifically, if the data collection module 201 has not yet received a wake-up identifier fed back by the wake-up module 202, it treats the currently collected voice signal as a wake-up voice signal and sends it to the wake-up module 202. If it has already received a wake-up identifier from the wake-up module 202, it treats the currently collected voice signal as a voice signal to be recognized and no longer sends it to the wake-up module. In addition, regardless of whether a collected voice signal is a wake-up voice signal or a voice signal to be recognized, the data collection module 201 sends it to the recognition processing module 203.
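The routing rule described above can be sketched as follows. This is only an illustration under stated assumptions: the class `DataCollectionRouter`, its method names and the list-based "outboxes" are hypothetical stand-ins for modules 201, 202 and 203, and the wake-up identifier is modeled as a plain integer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataCollectionRouter:
    """Toy model of how the data collection module forwards audio chunks."""

    wake_flag: Optional[int] = None  # set once the wake-up module reports success
    to_wake_module: List[bytes] = field(default_factory=list)
    to_recognition: List[bytes] = field(default_factory=list)

    def on_wake_flag(self, flag: int) -> None:
        # Feedback from the wake-up module: the terminal is now awake.
        self.wake_flag = flag

    def on_signal(self, chunk: bytes) -> str:
        # Every chunk goes to the recognition processing module.
        self.to_recognition.append(chunk)
        if self.wake_flag is None:
            # Not awake yet: treat the chunk as a wake-up voice signal.
            self.to_wake_module.append(chunk)
            return "wake-up"
        # Already awake: the chunk is a voice signal to be recognized.
        return "to-be-recognized"
```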

The wake-up module 202 is used to detect whether a voice signal contains the wake-up word. If it does, the module determines that the local terminal has been successfully woken up; otherwise it continues to examine incoming voice signals. The wake-up module 202 may detect the wake-up word using various related techniques. For example, it may first divide the voice signal into frames, extract the speech features of each frame, and then use the speech features and a wake-up acoustic model to decide whether the frame contains the wake-up word.

After detecting that a voice signal contains the wake-up word, the wake-up module 202 feeds a wake-up identifier back to the data collection module 201. On receiving the wake-up identifier, the data collection module 201 determines that the local terminal has been successfully woken up and then performs post-wake-up processing. For example, it may trigger the local terminal to give a response: after the user wakes the local terminal with the wake-up word "Xiaodu Xiaodu", the local terminal replies to the user with the response phrase "I'm here".

After receiving the wake-up identifier fed back by the wake-up module 202, the data collection module 201 may send the wake-up identifier to the recognition processing module 203, so that the recognition processing module 203 can determine the wake-up time point from this wake-up success information and perform subsequent processing based on that time point. In addition, after receiving the wake-up identifier, the data collection module 201 continuously transmits the voice signals received afterwards, as voice signals to be recognized, to the recognition processing module 203 and the voice endpoint detection module 204.

The voice endpoint detection module 204 is used to detect the speech start point and speech end point of a voice signal to be recognized and to send the detected start point and end point to the recognition processing module 203. The voice endpoint detection module 204 is, for example, a Voice Activity Detection (VAD) module. It may detect voice endpoints (the speech start point and speech end point) using various related techniques, for example by extracting the speech features of the voice signal and then detecting the endpoints from the speech features and a voice endpoint detection model.
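To make the notion of start and end points concrete, here is a small Python sketch of endpoint detection. It deliberately swaps in a simple energy threshold with a "hangover" count in place of the model-based detection the text mentions, so it is not the patent's method. The function name `detect_endpoints`, the threshold and the hangover length are all assumptions.

```python
from typing import List, Optional, Tuple


def detect_endpoints(
    frame_energies: List[float],
    threshold: float = 0.5,
    hangover: int = 2,
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, both inclusive.

    A segment starts at the first frame whose energy reaches the threshold and
    ends at the last such frame once `hangover` consecutive quiet frames follow.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    silent = 0  # consecutive quiet frames seen inside the current segment
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i  # speech start point
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= hangover:
                segments.append((start, i - silent))  # speech end point
                start, silent = None, 0
    if start is not None:
        # The stream ended while a segment was still open.
        segments.append((start, len(frame_energies) - 1 - silent))
    return segments
```

Because the loop keeps running after a segment closes, several utterances in one stream each yield their own start and end points, which matches the continuous detection the embodiment relies on.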

The recognition processing module 203 is used to determine the wake-up time point from the received wake-up identifier and, taking the wake-up time point as a base point, to determine a traceback start point. It sends the voice signal between the traceback start point and the end point of the first voice signal to be recognized to the decoder 205 as the traceback voice signal. It also selects the non-first voice signals to be recognized, according to the speech start points and speech end points detected by the voice endpoint detection module 204, and continuously transmits them to the decoder 205.
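The selection of the traceback voice signal can be sketched as a slice over buffered frames. This passage does not say how far back the traceback start point lies from the wake-up time point, so the fixed `lookback_frames` offset below is purely an assumption, as are the function name and the frame-index representation.

```python
from typing import List, Sequence


def traceback_signal(
    frames: Sequence[bytes],
    wake_index: int,
    first_end_index: int,
    lookback_frames: int = 3,
) -> List[bytes]:
    """Return the frames from the traceback start point to the end point
    of the first voice signal to be recognized (inclusive).

    wake_index      : frame index of the wake-up time point (the base point)
    first_end_index : frame index of the end point of the first signal
    lookback_frames : hypothetical offset used to place the traceback start point
    """
    if not 0 <= wake_index <= first_end_index < len(frames):
        raise ValueError("indices must satisfy 0 <= wake_index <= first_end_index < len(frames)")
    start = max(0, wake_index - lookback_frames)  # never go before the buffer start
    return list(frames[start:first_end_index + 1])
```

Starting slightly before the wake-up time point means that the beginning of the first command is not clipped if the user speaks immediately after the wake-up word.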

The decoder 205 is used to decode the received voice signal to be recognized, obtain a voice recognition result, and continuously send the voice recognition result to the recognition processing module 203. The decoder may decode using various related techniques, for example by extracting the speech features of the voice signal and producing the voice recognition result from the speech features and an offline speech recognition model. Decoding, for example, turns the spoken phrase "play music" into the text "play music".

The recognition processing module 203 is also used to continuously receive the voice recognition results sent by the decoder and to continuously respond to them. For example, if the voice recognition result is "play music", it calls the music playback interface to play music.

In the related art, only a single recognition is supported after the local terminal wakes up. In this embodiment, after the local terminal wakes up, the data collection module, voice endpoint detection module, recognition processing module and decoder support continuous voice transmission, continuous voice endpoint detection, continuous voice decoding and continuous response to voice recognition results, until an end instruction issued by the user is received. For example, suppose that after waking the local terminal with "Xiaodu Xiaodu", the user says "play music", "change the song" and "turn up the volume" in turn. The related art recognizes and responds only to "play music" and does not respond to "change the song" or "turn up the volume". This embodiment responds to "play music", "change the song" and the subsequent commands in sequence.

In the embodiments of this application, the first voice signal to be recognized is the first such signal spoken by the user after the local terminal wakes up, such as "play music" above. A non-first voice signal to be recognized is any voice signal to be recognized after the first one, such as "change the song" and "turn up the volume" above.

The end instruction is issued actively by the user. It may be a voice signal spoken by the user, or an operation instruction triggered when the user operates software or hardware on the local terminal. As a voice signal, the user may for example say "stop playing". As a software operation, an "exit" icon may be provided on the voice interaction interface, and the end instruction is generated when the user taps it. As a hardware operation, the end instruction may be generated when the user presses a preset end button on the local terminal. This application does not limit the specific form of the end instruction issued by the user.

After the recognition processing module 203 obtains the end instruction issued by the user, it ends the current continuous interaction. Ending the interaction means, for example, that it no longer sends voice signals to the decoder and no longer responds to voice recognition results. It may also reset its state and report to the application layer that the current continuous interaction has been exited successfully.
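The end-of-session behaviour can be sketched as a small state holder. The class `RecognitionSession`, the `notify_app` callback and the message text are hypothetical; the source only states that forwarding stops, the state is reset and the application layer is informed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecognitionSession:
    """Toy model of one continuous interaction in the recognition processing module."""

    notify_app: Callable[[str], None]  # hypothetical feedback hook to the application layer
    active: bool = True
    sent_to_decoder: List[bytes] = field(default_factory=list)

    def on_audio(self, chunk: bytes) -> bool:
        """Forward audio to the decoder only while the session is active."""
        if not self.active:
            return False
        self.sent_to_decoder.append(chunk)
        return True

    def on_end_instruction(self, source: str) -> None:
        """Handle an end instruction from voice, a software icon or a hardware button."""
        if not self.active:
            return  # already ended; nothing to do
        self.active = False  # state reset: stop forwarding and responding
        self.notify_app(f"session ended by {source}")
```

A new wake-up would create a fresh `RecognitionSession`, which reflects that only the current interaction is ended.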

The local terminal can be woken up more than once, so a new continuous interaction can follow a finished one. For example, after the current continuous interaction is ended with "stop playing" as above, if the user later wants to play music again or perform another operation such as making a phone call, the user can wake the local terminal again with the wake-up word "Xiaodu Xiaodu" and start a new continuous interaction. Receiving the end instruction therefore ends only the current continuous interaction, not all continuous interaction. The user can still wake the terminal again later and start a new continuous interaction.

In this embodiment, voice signals are continuously transmitted and processed after the local terminal wakes up, and the current voice interaction ends only when the user actively ends it. This supports continuous recognition after a single wake-up in an offline voice interaction scenario, which improves the user experience, avoids wasting resources and improves voice interaction efficiency.

In an offline voice interaction scenario, all related modules, including the decoder, are integrated in the local terminal. For example, each module of the offline voice interaction system shown in Figure 2 is integrated on a chip of the local terminal. Because the chip's space and processing capability are limited, the success rate of the first speech recognition may be low. This application therefore also provides some embodiments that improve the first-recognition success rate.

In some embodiments, the wake-up is determined from a wake-up voice signal issued by the user, and the voice signals to be recognized include a first voice signal to be recognized and non-first voice signals to be recognized. Continuously transmitting the voice signal to be recognized, issued by the user, to the decoder in the local terminal includes: determining a traceback start point in the wake-up voice signal, determining a traceback voice signal from the traceback start point and the first voice signal to be recognized, and transmitting the traceback voice signal to the decoder in the local terminal; and continuously obtaining the start point and end point of each non-first voice signal to be recognized, and continuously transmitting the non-first voice signal to be recognized between the start point and the end point to the decoder in the local terminal.

In this embodiment, tracing back to a point before the first voice signal to be recognized ensures that the first voice signal to be recognized is complete, which improves the first-recognition success rate.

图3是根据本申请第二实施例的示意图。本实施例提供一种离线语音交互方法，结合图2所示的系统，该方法包括：Figure 3 is a schematic diagram according to a second embodiment of the present application. This embodiment provides an offline voice interaction method. Combined with the system shown in Figure 2, the method includes:

301-302、数据采集模块采集到唤醒语音信号后，将唤醒语音信号发送给唤醒模块和识别处理模块。301-302. After the data acquisition module collects the wake-up voice signal, it sends the wake-up voice signal to the wake-up module and the recognition processing module.

唤醒语音信号比如是用户发出的包含唤醒词“小度小度”的语音信号。The wake-up voice signal is, for example, a voice signal sent by the user containing the wake-up word "Xiaodu Xiaodu".

可以理解的是，数据采集模块向唤醒模块和识别处理模块发送唤醒语音信号的时序关系不限定，比如，可以是同时向唤醒模块和识别处理模块发送，或者，也可以是先向唤醒模块发送，再向识别处理模块发送，或者，也可以先向识别处理模块发送再向唤醒模块发送。It can be understood that the timing relationship of the wake-up voice signal sent by the data acquisition module to the wake-up module and the recognition processing module is not limited. For example, it can be sent to the wake-up module and the recognition processing module at the same time, or it can be sent to the wake-up module first, Then send it to the recognition processing module, or you can also send it to the recognition processing module first and then send it to the wake-up module.

303. After receiving the wake-up voice signal, the wake-up module recognizes the wake-up word in it. Once the wake-up word is recognized, the wake-up module determines that the local terminal is awake and sends a wake-up identifier to the data collection module.

The wake-up identifier is, for example, a voice watermark value.

The data collection module may add voice watermarks to the wake-up voice signal and send the watermarked wake-up voice signal to the wake-up module and the recognition processing module. When adding voice watermarks, the data collection module may also assign a voice watermark value to each watermark; the values are, for example, counted sequentially from 0, that is, 0, 1, 2, and so on. The data collection module may use any of various related techniques to add voice watermarks to the voice signal; this embodiment does not limit how the watermarks are added.

When detecting the wake-up word, the wake-up module may process the signal frame by frame. That is, the voice signal is divided into voice frames, for example one frame every 32 ms, and each voice frame is checked for the wake-up word. Once the wake-up word is detected, the voice watermark on the voice frame containing the wake-up word can be parsed according to a preconfigured protocol to obtain the corresponding voice watermark value, which is then sent to the data collection module as the wake-up identifier.
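As an illustration only, the framing and sequential watermark numbering described above could be sketched in Python as follows. The embodiment does not limit how watermarks are embedded, so this sketch simply attaches the watermark value to each frame as metadata; the names Frame and split_into_frames, the 16 kHz sample rate, and the choice of one watermark per frame are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence

FRAME_MS = 32  # frame length given as an example in the text


@dataclass
class Frame:
    watermark: int          # sequential watermark value, counted from 0
    samples: Sequence[int]  # PCM samples belonging to this frame


def split_into_frames(samples: Sequence[int], sample_rate: int = 16000) -> List[Frame]:
    """Split a signal into 32 ms frames and number them 0, 1, 2, ...

    Hypothetical helper: a real terminal would embed the watermark in the
    audio itself according to its own preconfigured protocol.
    """
    frame_len = sample_rate * FRAME_MS // 1000  # 512 samples at 16 kHz
    return [
        Frame(watermark=i, samples=samples[start:start + frame_len])
        for i, start in enumerate(range(0, len(samples), frame_len))
    ]
```

With such a structure, the wake-up module would report the watermark value of the frame in which the wake-up word was detected, and any other module holding the same frames could look that frame up by its value.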

304-306. After receiving the wake-up identifier, the data collection module determines that the local terminal is awake. It may then forward the received wake-up identifier to the recognition processing module. In addition, the data collection module treats the voice signals collected after the local terminal wakes up as voice signals to be recognized and sends them to the recognition processing module and the voice endpoint detection module.

It can be understood that the order of steps 304-306 is likewise not limited.

307. After receiving a voice signal to be recognized, the voice endpoint detection module detects its start point and end point and sends them to the recognition processing module.

308. After receiving the wake-up identifier (that is, the voice watermark value), the recognition processing module determines the end point of the voice frame carrying the voice watermark that corresponds to that value as the wake-up time point. Taking the wake-up time point as the reference, it goes back by a preset duration to obtain the backtracking start point, and determines the voice signal between the backtracking start point and the end point of the first voice signal to be recognized as the backtracking voice signal. It then sends the backtracking voice signal to the decoder.

After collecting the wake-up voice signal, the data collection module sends it not only to the wake-up module but also to the recognition processing module, which may cache it on receipt. As described above, the data collection module may add voice watermarks to the wake-up voice signal it sends. After receiving the voice watermark value that serves as the wake-up identifier, the recognition processing module may parse the voice watermarks on the wake-up voice signal according to the preconfigured protocol, find the voice watermark corresponding to the received value, determine the voice frame carrying that watermark, and then take the end point of that voice frame as the wake-up time point.

In this embodiment, going back from the wake-up time point as the reference improves the accuracy of the backtracking start point, which in turn ensures the integrity of the first voice signal to be recognized.

In this embodiment, determining the wake-up time point from the voice watermark value makes it simple and accurate to determine.

For example, referring to Figure 4, the wake-up time point can be determined in the wake-up voice signal according to the above process.

The preset duration is generally longer than the duration of the wake-up word; for example, the preset duration is 2080 ms. Referring to Figure 4, going back 2080 ms from the wake-up time point gives the backtracking start point.
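The arithmetic behind the wake-up time point and the backtracking start point can be sketched as below. This is a minimal sketch under stated assumptions: frames are 32 ms long, watermark values count frames from 0 starting at the beginning of the cached signal, and the start point is clamped to 0 when less than the preset duration has been cached. The function names are hypothetical and the clamping is not stated in the text.

```python
FRAME_MS = 32               # example frame length from the text
PRESET_BACKTRACK_MS = 2080  # example preset duration from the text


def wake_time_ms(watermark_value: int, frame_ms: int = FRAME_MS) -> int:
    """End point (in ms) of the frame carrying the given watermark value.

    Assumes watermark value k marks the k-th frame of the cached signal,
    counted from 0, so that frame ends at (k + 1) * frame_ms.
    """
    return (watermark_value + 1) * frame_ms


def backtrack_start_ms(wake_ms: int, preset_ms: int = PRESET_BACKTRACK_MS) -> int:
    """Go back preset_ms from the wake-up time point.

    Clamping to 0 is an assumption for the case where the cache is
    shorter than the preset duration.
    """
    return max(0, wake_ms - preset_ms)
```

For instance, a wake-up identifier of 99 corresponds to a wake-up time point of 3200 ms, and going back 2080 ms gives a backtracking start point of 1120 ms.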

The voice signals to be recognized can be divided into a first voice signal to be recognized and non-first voice signals to be recognized. The voice endpoint detection module detects the start point and end point of the first voice signal to be recognized as well as those of each non-first voice signal to be recognized, and may then send the detected start points and end points to the recognition processing module.

For the first voice signal to be recognized, as shown in Figure 4, the voice signal between the backtracking start point and the end point of the first voice signal to be recognized is determined as the backtracking voice signal. For example, if the first voice signal to be recognized is "play music", the voice signal between the backtracking start point and the end point of "play music" is used as the backtracking voice signal.

For example, if the first voice signal to be recognized is "play music", the voice signal between the backtracking start point and the end point of "play music" is sent to the decoder.

In this embodiment, sending the voice signal between the backtracking start point and the end point of the first voice signal to be recognized to the decoder as the backtracking voice signal ensures the integrity of the first voice signal to be recognized at the decoder and improves the first-recognition success rate.
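Cutting the backtracking voice signal out of the cached audio could look like the following sketch. It assumes the recognition processing module holds one contiguous sample buffer whose time axis starts at 0 ms and that the audio is sampled at 16 kHz; both are assumptions for illustration, as are the function names.

```python
from typing import Sequence


def ms_to_samples(ms: int, sample_rate: int = 16000) -> int:
    """Convert a time offset in milliseconds to a sample index."""
    return ms * sample_rate // 1000


def backtrack_signal(
    buffer: Sequence[int],
    backtrack_start_ms: int,
    first_end_ms: int,
    sample_rate: int = 16000,
) -> Sequence[int]:
    """Return the samples between the backtracking start point and the
    end point of the first voice signal to be recognized."""
    if first_end_ms < backtrack_start_ms:
        raise ValueError("end point precedes backtracking start point")
    start = ms_to_samples(backtrack_start_ms, sample_rate)
    end = ms_to_samples(first_end_ms, sample_rate)
    return buffer[start:end]
```

The returned slice deliberately still contains the audio that precedes the wake-up time point, so the first instruction is not cut off even when the user speaks immediately after the wake-up word.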

309. The recognition processing module continuously obtains the start point and end point of each non-first voice signal to be recognized and continuously transmits the non-first voice signal between that start point and end point to the decoder.

For the non-first voice signals to be recognized, for example "change a song" and "turn up the volume", the voice endpoint detection module performs endpoint detection on each of them and sends the detected endpoint information (start point and end point) to the recognition processing module. Based on the endpoint information, the recognition processing module sends the non-first voice signal between the start point and the end point to the decoder.
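The text does not specify how the voice endpoint detection module finds start and end points. Purely as an illustration, the sketch below substitutes a simple per-frame energy threshold, which is not the method of this application, and then slices each utterance out of the frame sequence. The function names, the energy representation, and the threshold are all assumptions.

```python
from typing import List, Sequence, Tuple


def detect_endpoints(frame_energies: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive, for every run
    of consecutive frames whose energy is at or above the threshold."""
    segments = []
    start = None
    for i, energy in enumerate(frame_energies):
        if energy >= threshold and start is None:
            start = i                    # start point of an utterance
        elif energy < threshold and start is not None:
            segments.append((start, i))  # end point of an utterance
            start = None
    if start is not None:                # utterance runs to the end of input
        segments.append((start, len(frame_energies)))
    return segments


def slice_utterances(frames: Sequence, endpoints: Sequence[Tuple[int, int]]) -> List[list]:
    """Cut out the frames between each start point and end point; these
    are the segments that would be forwarded to the decoder."""
    return [list(frames[s:e]) for s, e in endpoints]
```

A production endpoint detector would typically add smoothing or hangover frames so that short pauses inside an instruction do not split it into two segments.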

310. The decoder continuously decodes the voice signals to be recognized to obtain voice recognition results and continuously transmits the results to the recognition processing module.

Because the voice signal first received by the decoder, namely the backtracking voice signal, contains some redundancy, the decoder needs to remove part of it: it removes the preset duration (for example, 2080 ms) from the beginning and then decodes the remaining voice signal.
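The trimming step on the decoder side amounts to dropping a fixed number of leading samples. A minimal sketch, assuming 16 kHz audio held as a flat sample sequence (the function name and sample rate are assumptions):

```python
from typing import Sequence


def trim_backtrack(signal: Sequence[int], preset_ms: int = 2080, sample_rate: int = 16000) -> Sequence[int]:
    """Drop the first preset_ms of the first received signal before decoding.

    At 16 kHz, 2080 ms corresponds to 33280 samples. If the signal is
    shorter than that, the result is empty.
    """
    n = preset_ms * sample_rate // 1000
    return signal[n:]
```

Because the backtracking start point lies exactly the preset duration before the wake-up time point, removing the same duration from the head leaves the audio that follows the wake-up time point.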

After decoding a voice recognition result, the decoder may add sequence identifiers to the results in order, so that the recognition processing module can respond to the results in order. The sequence identifiers may share the same prefix; for example, they are sn_1, sn_2, sn_3, and so on.

In this embodiment, responding to the voice recognition results in order ensures the accuracy of the responses and improves user experience.

In this embodiment, giving the sequence identifiers the same prefix facilitates unified identification.

For example, the voice recognition result "play music" has the sequence identifier sn_1, "change a song" has sn_2, "turn up the volume" has sn_3, "stop playing" has sn_4, and so on.

311. The recognition processing module continuously responds to the voice recognition results.

The recognition processing module may respond to the voice recognition results in order according to the sequence identifiers. For example, it first responds to the result whose sequence identifier is sn_1, then to the result whose sequence identifier is sn_2, and so on.
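One way the sequence identifiers and in-order response could fit together is sketched below. The decoder side tags each result with sn_1, sn_2, and so on; the recognition side holds back any result that arrives early until its predecessors have been handled. The buffering of out-of-order results, the dictionary layout of a result, and the names make_tagger and OrderedResponder are assumptions for illustration, not details given in the text.

```python
from typing import Callable, Dict

PREFIX = "sn_"  # common identifier prefix, as in the example sn_1, sn_2, ...


def make_tagger(prefix: str = PREFIX) -> Callable[[str], dict]:
    """Return a function that tags results with sn_1, sn_2, ... in order."""
    counter = 0

    def tag(text: str) -> dict:
        nonlocal counter
        counter += 1
        return {"sn": f"{prefix}{counter}", "text": text}

    return tag


class OrderedResponder:
    """Respond to results strictly in sequence-identifier order."""

    def __init__(self, handler: Callable[[str], None], prefix: str = PREFIX):
        self._handler = handler
        self._prefix = prefix
        self._next = 1                      # index of the next result to handle
        self._pending: Dict[int, str] = {}  # results that arrived early

    def receive(self, result: dict) -> None:
        index = int(result["sn"][len(self._prefix):])
        self._pending[index] = result["text"]
        # Handle every result that is now next in line.
        while self._next in self._pending:
            self._handler(self._pending.pop(self._next))
            self._next += 1
```

The shared prefix lets the receiver recover the ordinal with a single fixed-length strip, which is one reading of the "unified identification" benefit mentioned above.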

312. After receiving the end instruction issued by the user, the recognition processing module ends the current continuous interaction.

For example, if the voice recognition result received by the recognition processing module is "stop playing", the current continuous interaction ends once that result is received.

A specific example of the interaction between the user and the local terminal follows. Take a vehicle-mounted terminal as the local terminal. Inside a vehicle there is often no network or only a poor one; to improve user experience, avoid wasting resources, and improve voice interaction efficiency, the embodiments of this application can support continuous offline voice interaction.

The voice instructions the user issues to the local terminal are, in order: "Xiaodu Xiaodu", "play music", "a bit louder", "stop playing".

1) The user speaks the wake-up word, such as "Xiaodu Xiaodu", to the vehicle-mounted terminal; the vehicle-mounted terminal wakes up based on the wake-up word;

2) The vehicle-mounted terminal plays the response prompt "I'm here" (在呢) and then starts the current continuous interaction; the recognition processing module sends the voice signal, including the backtracked portion, to the decoder;

3) The user goes on to say "play music"; the decoder returns the recognition result, and the recognition processing module calls the music resources of the vehicle-mounted terminal to play music; the recognition processing module keeps transmitting voice signals to the decoder;

4) The user goes on to say "a bit louder"; the decoder returns the recognition result, and the recognition processing module calls the volume resource of the vehicle-mounted terminal to turn up the volume; the recognition processing module keeps transmitting data to the decoder;

5) The user goes on to say "stop playing"; the decoder returns the recognition result, and the recognition processing module stops the music and ends the current continuous interaction.
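The five steps above can be condensed into a small loop that keeps responding to recognition results until an end instruction arrives. This is a sketch only: the set of end instructions, the plain-string results, and the name run_session are assumptions, and wake-up, backtracking, and decoding are taken as already done.

```python
from typing import Callable, Iterable

END_COMMANDS = {"stop playing"}  # assumed set of end instructions


def run_session(results: Iterable[str], respond: Callable[[str], None]) -> int:
    """Respond to each recognition result until an end instruction is seen.

    The end instruction itself is responded to (for example, the music is
    stopped) before the continuous interaction ends. Returns the number of
    results handled.
    """
    handled = 0
    for text in results:
        respond(text)
        handled += 1
        if text in END_COMMANDS:
            break  # end the current continuous interaction
    return handled
```

Called with the results "play music", "a bit louder", "stop playing", the loop handles all three and then stops consuming further results, so a new wake-up word is needed before the next interaction.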

In this embodiment, voice signals are continuously transmitted and processed after the local terminal wakes up, and the voice interaction ends only when the user actively ends it. This supports continuous recognition after a single wake-up in offline voice interaction scenarios, improving user experience, avoiding wasted resources, and improving voice interaction efficiency. Backtracking to a point before the first voice signal to be recognized ensures the integrity of that signal and improves the first-recognition success rate. Going back from the wake-up time point as the reference improves the accuracy of the backtracking start point, which in turn ensures the integrity of the first voice signal to be recognized. Determining the wake-up time point from the voice watermark value makes it simple and accurate to determine. Responding to the voice recognition results in order ensures the accuracy of the responses and improves user experience. Giving the sequence identifiers the same prefix facilitates unified identification.

Figure 5 is a schematic diagram according to a third embodiment of the present application. As shown in Figure 5, this embodiment provides an offline voice interaction device. The offline voice interaction device 500 may include a transmission unit 501, a response unit 502, and an end unit 503. The transmission unit 501 is configured to, after the local terminal wakes up, continuously transmit the voice signals to be recognized uttered by the user to the decoder in the local terminal, so that the decoder continuously decodes them to obtain voice recognition results; the response unit 502 is configured to continuously receive the voice recognition results sent by the decoder and continuously respond to them until an end instruction issued by the user is received; the end unit 503 is configured to end the current continuous interaction after the end instruction is received.

In some embodiments, the wake-up is determined from a wake-up voice signal uttered by the user, and the voice signals to be recognized include a first voice signal to be recognized and non-first voice signals to be recognized. Referring to Figure 6, the device 600 includes a transmission unit 601, a response unit 602, and an end unit 603, and the transmission unit 601 may include a first transmission module 6011 and a second transmission module 6012.

The first transmission module 6011 is configured to determine a backtracking start point in the wake-up voice signal, determine a backtracking voice signal based on the backtracking start point and the first voice signal to be recognized, and transmit the backtracking voice signal to the decoder in the local terminal. The second transmission module 6012 is configured to continuously obtain the start point and end point of each non-first voice signal to be recognized and continuously transmit the non-first voice signal between that start point and end point to the decoder in the local terminal.

In some embodiments, the first transmission module 6011 is specifically configured to: determine, in the wake-up voice signal, the wake-up time point corresponding to the wake-up; and, taking the wake-up time point as the reference, go back by a preset duration to obtain the backtracking start point.

In some embodiments, the first transmission module 6011 is further specifically configured to: receive a wake-up identifier, the wake-up identifier including a voice watermark value; and determine the end point of the voice frame carrying the voice watermark that corresponds to the voice watermark value as the wake-up time point.

In some embodiments, the first transmission module 6011 is specifically configured to: obtain the end point of the first voice signal to be recognized; and determine the voice signal between the backtracking start point and the end point of the first voice signal to be recognized as the backtracking voice signal.

In some embodiments, the voice recognition results include sequence identifiers, and the response unit 602 is specifically configured to respond to the voice recognition results in order according to the sequence identifiers.

In some embodiments, the sequence identifiers have the same identifier prefix.

Figure 7 is a schematic diagram according to a fifth embodiment of the present application. This embodiment provides an offline voice interaction system. The system 700 includes an offline voice interaction device 701, which may be as shown in Figure 5 or Figure 6 and is not described in detail again here. The system 700 may also include a decoder 702, which is configured to remove a preset duration of voice signal from the beginning of the first received voice signal and to decode the voice signal that remains.

In some embodiments, the decoder 702 is further configured to add sequence identifiers to the voice recognition results in order.

According to embodiments of the present application, the present application also provides an electronic device, a readable storage medium, and a computer program product.

Figure 8 is a block diagram of an electronic device for implementing the offline voice interaction method according to an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present application described and/or claimed herein.

As shown in Figure 8, the electronic device includes one or more processors 801, a memory 802, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as needed. The processor can process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories, if needed. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). Figure 8 takes one processor 801 as an example.

The memory 802 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the offline voice interaction method provided by the present application.

As a non-transitory computer-readable storage medium, the memory 802 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the offline voice interaction method in the embodiments of the present application. By running the non-transitory software programs, instructions, and modules stored in the memory 802, the processor 801 performs the various functional applications and data processing of the server, that is, implements the offline voice interaction method of the above method embodiments.

The memory 802 may include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function; the data storage area may store data created through use of the electronic device that performs the offline voice interaction method, and the like. In addition, the memory 802 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 802 optionally includes memories located remotely from the processor 801, and these remote memories may be connected over a network to the electronic device that performs the offline voice interaction method. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device that performs the offline voice interaction method may also include an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or in other ways; Figure 8 takes connection by a bus as an example.

The input device 803 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device that performs the offline voice interaction method. Examples of input devices include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 804 may include a display device, an auxiliary lighting device (for example, an LED), a tactile feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.

Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be special-purpose or general-purpose, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

These computer programs (also called programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented in high-level procedural and/or object-oriented programming languages and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide interaction with a user, the systems and techniques described herein can be implemented on a computer that has a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual, auditory, or tactile feedback), and input from the user can be received in any form (including acoustic, voice, or tactile input).

The systems and techniques described herein can be implemented in a computing system that includes back-end components (for example, as a data server), or that includes middleware components (for example, an application server), or that includes front-end components (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), the Internet, and blockchain networks.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.

It can be understood that although this application is directed to an offline voice interaction system, this does not rule out that the terminal on which the system is deployed has networking capability. For example, the offline voice interaction system may be deployed on a mobile phone that is offline under certain conditions, such as inside a car where the network signal is poor. The terminal is not limited to being offline at all times; when the network signal is good, the terminal can have networking capability. What this application addresses is the offline voice interaction solution for a terminal (such as a mobile phone) in an offline state (such as inside a car with no network signal).

According to the technical solutions of the embodiments of this application, voice signals are continuously transmitted and processed after the local terminal wakes up, and the voice interaction ends only when the user actively ends it. This supports continuous recognition after a single wake-up in an offline voice interaction scenario, which improves user experience, avoids wasted resources, and improves voice interaction efficiency. Backtracking before the first voice signal to be recognized ensures the integrity of that signal and improves the success rate of the first recognition. Backtracking from the wake-up time point improves the accuracy of the backtracking start point, which in turn ensures the integrity of the first voice signal to be recognized. Determining the wake-up time point from the voice watermark value allows it to be determined simply and accurately. Responding to the voice recognition results in sequence ensures the accuracy of the responses and improves user experience. Giving the sequence identifiers the same identifier prefix facilitates uniform identification.
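The effects summarized above can be illustrated with a minimal Python sketch. It is not the implementation of this application: the function names, the `ID_PREFIX` value, the use of frame indices for time points, and the stand-in `decode` callable are all assumptions made for illustration. The sketch shows (1) a backtracking start point computed relative to the wake-up time point, (2) continuous decoding until an ending instruction is recognized, and (3) responses ordered by sequence identifiers that share one prefix.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical shared prefix for all sequence identifiers in one session.
ID_PREFIX = "session1-"


def backtrack_start(wake_index: int, backtrack_frames: int) -> int:
    """Return the index of the first frame to send to the decoder.

    The start point lies a fixed number of frames before the wake-up
    point and is clamped so that it never precedes the buffer start.
    """
    return max(0, wake_index - backtrack_frames)


def run_session(
    utterances: Sequence[str],
    end_command: str,
    decode: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Decode utterances one after another until the ending instruction.

    Each result is tagged with an identifier made of the shared prefix
    and an increasing sequence number. No new wake-up is needed between
    utterances; the loop stops only when end_command is recognized.
    """
    results: List[Tuple[str, str]] = []
    for seq, audio in enumerate(utterances):
        text = decode(audio)
        if text == end_command:
            break
        results.append((f"{ID_PREFIX}{seq}", text))
    return results


def order_results(results: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort results by the numeric part that follows the shared prefix."""
    return sorted(results, key=lambda r: int(r[0][len(ID_PREFIX):]))
```

In this sketch, `run_session(["play music", "next song", "exit"], "exit", str)` uses `str` as a placeholder decoder and returns two tagged results before stopping at the ending instruction. Sorting on the numeric suffix rather than on the whole identifier keeps `session1-2` ahead of `session1-10`.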

It should be understood that the various forms of the processes shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.

The specific embodiments described above do not limit the scope of protection of this application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the scope of protection of this application.