CN110837758A

Info

Publication number: CN110837758A
Application number: CN201810939640.7A
Authority: CN
Inventors: 董勤波
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Guangzhou Gaohang Technology Transfer Co ltd
Priority date: 2018-08-17
Filing date: 2018-08-17
Publication date: 2020-02-25
Anticipated expiration: 2038-08-17
Also published as: CN110837758B

Abstract

Embodiments of the present invention provide a keyword input method, an apparatus, and an electronic device. The method includes: acquiring an audio signal input by a user and a video signal captured while the user inputs the audio signal, the video signal including a lip video image; performing keyword recognition on the audio signal to obtain a first keyword and a confidence of the first keyword; performing lip-reading recognition on the lip video image to obtain a second keyword and a confidence of the second keyword; determining a weighted confidence of the first keyword according to a relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, where the relative quality indicates how good the signal quality of the audio signal is relative to the signal quality of the video signal; and taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword. This can effectively improve the accuracy of keyword input.

Description

A keyword input method, apparatus, and electronic device

Technical Field

The present invention relates to the field of audio technology, and in particular to a keyword input method, apparatus, and electronic device.

Background

Keyword recognition technology converts words in a user's speech into computer-readable input and can be used to implement intelligent human-computer interaction. Keyword recognition identifies a small number of specific words in the user's continuous speech and takes them as the keywords of that speech. However, limited by the accuracy of keyword recognition technology, the recognized keywords may not be very accurate.

In the prior art, to improve the accuracy of the keyword recognition result, a camera may capture the user's face when voice input is detected, yielding a facial video. Keyword recognition is performed on the speech input by the user to obtain a keyword recognition result. The user's lips are located in the facial video and lip-reading recognition is performed to obtain a lip-reading recognition result. Whichever of the keyword recognition result and the lip-reading recognition result has the higher confidence is taken as the recognition result. The confidence of the keyword recognition result depends on how similar the speech features are to the acoustic model during keyword recognition, and the confidence of the lip-reading recognition result depends on how similar the lip features are to a preset template during lip-reading recognition.

However, practical application scenarios may contain factors that interfere with voice input and/or with camera capture, such as background noise or insufficient light. These factors may cause the confidence of the keyword recognition result and/or of the lip-reading recognition result to deviate considerably from the actual likelihood, so the recognition result obtained may not be accurate enough.

Summary of the Invention

An objective of the embodiments of the present invention is to provide a keyword input method, apparatus, and electronic device that adaptively adjust weights according to the signal quality of the audio signal and of the video signal, so as to improve the accuracy of keyword input. The specific technical solutions are as follows:

In a first aspect of the embodiments of the present invention, a keyword input method is provided, the method including:

acquiring an audio signal input by a user, and a video signal captured while the user inputs the audio signal, the video signal including a lip video image of the user;

performing keyword recognition on the audio signal to obtain a first keyword and a confidence of the first keyword, where the confidence of the first keyword indicates how credible it is that the keyword input by the user is the first keyword;

performing lip-reading recognition on the lip video image to obtain a second keyword and a confidence of the second keyword, where the confidence of the second keyword indicates how credible it is that the keyword input by the user is the second keyword;

determining a weighted confidence of the first keyword according to a relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, where the relative quality indicates how good the signal quality of the audio signal is relative to the signal quality of the video signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality;

taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.

With reference to the first aspect, in a first possible implementation, after calculating the product of the first weight and the confidence of the first keyword as the weighted confidence of the first keyword, and calculating the product of the second weight and the confidence of the second keyword as the weighted confidence of the second keyword, the method further includes:

determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold;

if the larger value is greater than the first preset confidence threshold, performing the step of taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.

With reference to the first possible implementation of the first aspect, in a second possible implementation, after determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, the method further includes:

if the larger value is not greater than the first preset confidence threshold, determining that no keyword has been recognized.

With reference to the first possible implementation of the first aspect, in a third possible implementation, before determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, the method further includes:

determining whether the first keyword and the second keyword are consistent;

if the first keyword and the second keyword are inconsistent, performing the step of determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold;

if the first keyword and the second keyword are consistent, determining whether the larger value is greater than a second preset confidence threshold, the second preset confidence threshold being smaller than the first preset confidence threshold;

if the larger value is greater than the second preset confidence threshold, taking the first keyword or the second keyword as the input keyword.
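The two-threshold decision described in the first to third possible implementations can be sketched roughly as follows. This is only an illustrative sketch: the function name `decide_keyword`, the threshold values 0.6 and 0.4, the tie-breaking in favor of the audio result, and the choice to report no keyword when the keywords agree but the second threshold is not exceeded are all assumptions that the text does not specify.

```python
from typing import Optional


def decide_keyword(
    kw_audio: str,
    wc_audio: float,
    kw_lip: str,
    wc_lip: float,
    first_threshold: float = 0.6,   # assumed value, not given in the text
    second_threshold: float = 0.4,  # assumed value; must be below the first
) -> Optional[str]:
    """Return the accepted keyword, or None if no keyword is recognized."""
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be smaller than the first")

    # Pick the candidate with the larger weighted confidence
    # (ties go to the audio result; this is an assumption).
    if wc_audio >= wc_lip:
        best_kw, best_conf = kw_audio, wc_audio
    else:
        best_kw, best_conf = kw_lip, wc_lip

    # Agreement between the two recognizers relaxes the threshold.
    threshold = second_threshold if kw_audio == kw_lip else first_threshold

    return best_kw if best_conf > threshold else None
```

With these assumed thresholds, two agreeing results with a best weighted confidence of 0.5 are accepted, while two disagreeing results with the same best value are rejected.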

With reference to the first aspect, in a fourth possible implementation, performing keyword recognition on the audio signal includes:

inputting the audio signal into a preset detection neural network to remove noise signals and silence signals from the audio signal and obtain the speech signal of the audio signal, where the detection neural network has been trained in advance on a plurality of sample audio signals, each sample audio signal being labeled in advance with its speech signal;

inputting the audio signal into a preset recognition neural network, where the recognition neural network has been trained in advance on a plurality of sample speech signals, each sample speech signal being labeled in advance with its corresponding speech content;

obtaining the first keyword output by the recognition neural network and the confidence of the first keyword.

performing keyword recognition on the speech content to obtain the first keyword and the confidence of the first keyword.

With reference to the first aspect, in a fifth possible implementation, performing lip-reading recognition on the lip video image includes:

inputting the lip video image into a preset lip-reading neural network, where the lip-reading neural network has been trained in advance on a plurality of sample lip video images, each sample lip video image being labeled in advance with the keyword corresponding to that sample lip image;

obtaining the keyword output by the lip-reading neural network and the confidence of that keyword as the second keyword and the confidence of the second keyword.

With reference to the first aspect, in a sixth possible implementation, the method is applied to an unmanned aerial vehicle (UAV) controller, and after taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword, the method further includes:

obtaining a control instruction corresponding to the input keyword;

controlling the UAV bound to the UAV controller to execute the control instruction.
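As a rough illustration of the keyword-to-instruction step, the lookup can be modeled as a table from keywords to control instructions. Everything named here is hypothetical: the `FakeDrone` class, the `COMMAND_TABLE` entries, and the instruction strings are placeholders, since the text does not define a command set or a UAV interface.

```python
from typing import Dict, List


class FakeDrone:
    """Hypothetical stand-in for the UAV bound to the controller."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def execute(self, instruction: str) -> None:
        # A real controller would transmit the instruction to the UAV.
        self.log.append(instruction)


# Hypothetical mapping from input keywords to control instructions.
COMMAND_TABLE: Dict[str, str] = {
    "take off": "CMD_TAKEOFF",
    "land": "CMD_LAND",
    "hover": "CMD_HOVER",
}


def dispatch(keyword: str, drone: FakeDrone,
             table: Dict[str, str] = COMMAND_TABLE) -> bool:
    """Look up the instruction for a keyword and have the drone execute it.

    Returns False, and does nothing, when the keyword has no instruction.
    """
    instruction = table.get(keyword)
    if instruction is None:
        return False
    drone.execute(instruction)
    return True
```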

With reference to the first aspect, in a seventh possible implementation, determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, includes:

determining a first weight and a second weight according to a preset weighting rule, the first weight being positively correlated with the relative quality and the second weight being negatively correlated with the relative quality;

calculating the product of the first weight and the confidence of the first keyword as the weighted confidence of the first keyword, and calculating the product of the second weight and the confidence of the second keyword as the weighted confidence of the second keyword.

With reference to the seventh possible implementation of the first aspect, in an eighth possible implementation, determining the first weight and the second weight according to the preset weighting rule includes:

determining an interpolation factor α according to the preset weighting rule, where α is positively correlated with the relative quality, α is greater than 0 and less than 1, and the relative quality is the ratio of the signal quality of the audio signal to the signal quality of the video signal;

taking α as the first weight;

taking 1-α as the second weight.
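The weighting in the seventh and eighth possible implementations can be sketched as below. The text only requires that α lie strictly between 0 and 1 and increase with the quality ratio; the specific mapping α = r / (1 + r) used here is an assumption chosen because it satisfies both conditions, and the function names and the positive scalar quality scores are likewise illustrative.

```python
from typing import Tuple


def interpolation_factor(audio_quality: float, video_quality: float) -> float:
    """Map the quality ratio r = audio/video to alpha in (0, 1).

    alpha = r / (1 + r) is an assumed weighting rule: it increases with r
    and stays strictly between 0 and 1 for any positive r.
    """
    if audio_quality <= 0 or video_quality <= 0:
        raise ValueError("quality scores must be positive")
    r = audio_quality / video_quality
    return r / (1.0 + r)


def weighted_confidences(c1: float, c2: float,
                         audio_quality: float,
                         video_quality: float) -> Tuple[float, float]:
    """Return (alpha * c1, (1 - alpha) * c2)."""
    alpha = interpolation_factor(audio_quality, video_quality)
    return alpha * c1, (1.0 - alpha) * c2


def select_keyword(kw1: str, c1: float, kw2: str, c2: float,
                   audio_quality: float,
                   video_quality: float) -> Tuple[str, float]:
    """Pick the keyword with the larger weighted confidence."""
    w1, w2 = weighted_confidences(c1, c2, audio_quality, video_quality)
    return (kw1, w1) if w1 >= w2 else (kw2, w2)
```

For example, with raw confidences 0.6 (audio) and 0.8 (lips), a quality ratio of 3 gives α = 0.75, so the audio keyword wins with 0.45 against 0.20; with the ratio inverted to 1/3, the lip keyword wins instead.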

In a second aspect of the embodiments of the present invention, a keyword input apparatus is provided, the apparatus including:

a signal acquisition module, configured to acquire an audio signal input by a user, and a video signal captured while the user inputs the audio signal, the video signal including a lip video image of the user;

a keyword recognition module, configured to perform keyword recognition on the audio signal to obtain a first keyword and a confidence of the first keyword, where the confidence of the first keyword indicates how credible it is that the keyword input by the user is the first keyword;

a lip-reading recognition module, configured to perform lip-reading recognition on the lip video image to obtain a second keyword and a confidence of the second keyword, where the confidence of the second keyword indicates how credible it is that the keyword input by the user is the second keyword;

a command decision module, configured to determine a weighted confidence of the first keyword according to a relative quality and the confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, where the relative quality indicates how good the signal quality of the audio signal is relative to the signal quality of the video signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality; and to take whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.

With reference to the second aspect, in a first possible implementation, the command decision module is further configured to, after determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, determine whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold;

With reference to the first possible implementation of the second aspect, in a second possible implementation, the command decision module is further configured to, after determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, determine that no keyword has been recognized if the larger value is not greater than the first preset confidence threshold.

With reference to the first possible implementation of the second aspect, in a third possible implementation, the command decision module is further configured to, before determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, determine whether the first keyword and the second keyword are consistent;

if the first keyword and the second keyword are inconsistent, perform the step of determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold;

if the larger value is greater than the second preset confidence threshold, take the first keyword or the second keyword as the input keyword.

With reference to the second aspect, in a fourth possible implementation, the keyword recognition module is specifically configured to input the audio signal into a preset detection neural network to remove noise signals and silence signals from the audio signal and obtain the speech signal of the audio signal, where the detection neural network has been trained in advance on a plurality of sample audio signals, each sample audio signal being labeled in advance with its speech signal;

With reference to the second aspect, in a fifth possible implementation, the lip-reading recognition module is specifically configured to input the lip video image into a preset lip-reading neural network, where the lip-reading neural network has been trained in advance on a plurality of sample lip video images, each sample lip video image being labeled in advance with the keyword corresponding to that sample lip image;

With reference to the second aspect, in a sixth possible implementation, the apparatus is applied to a UAV controller, and the command decision module, after taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword, obtains a control instruction corresponding to the input keyword and controls the UAV bound to the UAV controller to execute the control instruction.

With reference to the second aspect, in a seventh possible implementation, the command decision module is specifically configured to determine a first weight and a second weight according to a preset weighting rule, the first weight being positively correlated with the relative quality and the second weight being negatively correlated with the relative quality;

With reference to the seventh possible implementation of the second aspect, in an eighth possible implementation, the command decision module is specifically configured to determine an interpolation factor α according to the preset weighting rule, where α is positively correlated with the relative quality, α is greater than 0 and less than 1, and the relative quality is the ratio of the signal quality of the audio signal to the signal quality of the video signal;

take α as the first weight;

take 1-α as the second weight.

In a third aspect of the embodiments of the present invention, an electronic device is provided, the electronic device including:

a memory, configured to store a computer program;

a processor, configured to implement any one of the keyword input methods described above when executing the program stored in the memory.

In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements any one of the keyword input methods described above.

The keyword input method, apparatus, and electronic device provided by the embodiments of the present invention can adaptively adjust the first weight and the second weight according to the signal quality of the audio signal and of the video signal: when the signal quality of the audio signal is better, the weighted confidence of the first keyword is raised, and when the signal quality of the video signal is better, the weighted confidence of the second keyword is raised. The weighted confidence of a keyword therefore better reflects the likelihood that the audio signal input by the user contains that keyword, which effectively improves the accuracy of keyword input. Of course, a product or method implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.

Brief Description of the Drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a keyword input method provided by an embodiment of the present invention;

FIG. 2 is another schematic flowchart of a keyword input method provided by an embodiment of the present invention;

FIG. 3 is another schematic flowchart of a keyword input method provided by an embodiment of the present invention;

FIG. 4 is a schematic flowchart of a UAV control method provided by an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a neural network provided by an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a keyword input apparatus provided by an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a keyword input method provided by an embodiment of the present invention, which may include:

S101: acquire an audio signal input by a user and a video signal captured while the user inputs the audio signal, the video signal including a lip video image of the user.

The method may be applied to an electronic device that has both video capture and voice recording functions, such as a mobile terminal with a microphone and a camera. After detecting that the user is inputting an audio signal, the mobile terminal may turn on the camera to capture the video signal. The captured video signal may include only the lip video image. Alternatively, the video signal may be obtained by the camera capturing the user's face image based on face recognition technology, in which case the video signal may include, in addition to the lip video image, video images of other regions of the user's face.

S102: perform keyword recognition on the audio signal to obtain a first keyword and a confidence of the first keyword.

Here, the confidence of the first keyword indicates how credible it is that the keyword input by the user is the first keyword. The keyword recognition process may be as follows: remove the non-speech signal from the audio signal to obtain the speech signal of the audio signal; extract acoustic features from the speech signal; match the extracted acoustic features against the preset acoustic model of each keyword to determine the keyword corresponding to the extracted acoustic features as the first keyword; and calculate the confidence of the first keyword according to the degree of match between the acoustic features and the acoustic model of that keyword.
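The matching step can be illustrated with a deliberately simplified sketch in which each keyword's acoustic model is reduced to a single template feature vector and the degree of match is the cosine similarity. This is a stand-in chosen for brevity, not the acoustic model of the embodiment; the function names, the fixed-length feature vectors, and the use of the similarity itself as the confidence are all assumptions.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_keyword(
    features: Sequence[float],
    templates: Dict[str, Sequence[float]],
) -> Tuple[Optional[str], float]:
    """Return the best-matching keyword and a confidence in [0, 1].

    The confidence is the cosine similarity clipped at 0 (an assumed
    definition of "degree of match").
    """
    best_kw: Optional[str] = None
    best_score = -1.0
    for keyword, template in templates.items():
        score = cosine_similarity(features, template)
        if score > best_score:
            best_kw, best_score = keyword, score
    return best_kw, max(0.0, best_score)
```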

Further, in an optional embodiment, the non-speech signal may be removed from the audio signal by a preset detection neural network. The detection neural network has been trained in advance on a plurality of sample audio signals, each sample audio signal being labeled in advance with its speech signal; for example, each data frame in a sample audio signal is labeled in advance to indicate whether that data frame belongs to the speech signal. Further, the labeling may be performed by a preset recognition neural network.

Once trained, the detection neural network can remove the non-speech signal from an input audio signal to obtain the corresponding speech signal. While the user is inputting the audio signal, causes such as a noisy environment or unstable current inside the recording device may introduce noise signals into the audio signal. The user may also not be speaking continuously during input, so parts of the audio signal are silence signals with no human voice. These noise signals and silence signals are not what the user intends to input, so removing them from the audio signal yields a cleaner speech signal that is easier to analyze and process in later steps. Using a detection neural network, noise signals and silence signals can be removed from the audio signal more accurately on the basis of deep learning.
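The frame-level filtering described above can be sketched as follows. In the embodiment the speech/non-speech decision for each data frame comes from a trained detection neural network; the energy threshold used here is only an assumed stand-in so that the example is self-contained, and the names `remove_non_speech` and `energy_detector` are hypothetical.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]


def remove_non_speech(
    frames: List[Frame],
    is_speech: Callable[[Frame], bool],
) -> List[Frame]:
    """Keep only the frames that the detector classifies as speech."""
    return [frame for frame in frames if is_speech(frame)]


def energy_detector(threshold: float) -> Callable[[Frame], bool]:
    """Build a toy per-frame detector based on mean squared amplitude.

    This replaces the trained detection network purely for illustration.
    """
    def is_speech(frame: Frame) -> bool:
        if not frame:
            return False
        energy = sum(sample * sample for sample in frame) / len(frame)
        return energy > threshold

    return is_speech
```

Any callable that maps a frame to a boolean, including a wrapper around a trained model, could be passed as `is_speech` in place of the toy detector.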

Furthermore, after the speech signal is obtained, a preset recognition neural network may be used to perform keyword recognition on the speech signal. The speech signal may be input into the recognition neural network, which has been trained in advance on a plurality of sample speech signals, each sample speech signal being labeled in advance with its corresponding speech content. This realizes an end-to-end mapping from speech signal to keyword, without first recognizing all the text in the speech signal and then determining the keyword from the recognized text.
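One common way for an end-to-end classifier to yield both a keyword and a confidence is to apply a softmax to its raw output scores and take the most probable class. The text does not state how the recognition neural network computes its confidence, so the sketch below is an assumption; the `vocabulary` list and the example scores are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_keyword(logits: Sequence[float],
                vocabulary: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable keyword and its softmax probability."""
    if not logits or len(logits) != len(vocabulary):
        raise ValueError("need one score per vocabulary entry")
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[index], probs[index]
```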

S103: Perform lip reading on the lip video image to obtain a second keyword and a confidence of the second keyword.

The confidence of the second keyword indicates how credible it is that the keyword input by the user is the second keyword. If the video signal is obtained by capturing images of the user's face, the lip region can be located in the captured face video image using lip recognition technology, and the video image of the lip region is used as the lip video image.

In one optional embodiment, the movement of the lips in the lip video image may be analyzed, lip features extracted from that movement, and the extracted lip features matched against preset lip-reading models to determine the keyword corresponding to the extracted features, which is taken as the second keyword. The confidence of the second keyword is then calculated from the degree of matching between the lip features and the lip-reading model of that keyword. However, because the number of preset lip-reading models is usually limited, this lip-reading method is not flexible enough and may produce large errors.

In another optional embodiment, the lip video image may be input to a preset lip-reading neural network. The lip-reading neural network has been trained in advance on a plurality of sample lip video images, each labeled in advance with its corresponding keyword. For example, while a person reads a keyword aloud, a lip video image of that person is captured as a sample lip video image and labeled with the keyword that was read.

Once trained, the lip-reading neural network provides an end-to-end mapping from lip video images to keywords: given an input lip video image, it outputs the corresponding keyword and the confidence of that keyword. The keyword is taken as the second keyword, and its confidence as the confidence of the second keyword. Compared with the model-matching method above, the lip-reading neural network can perform deep learning on a large number of sample lip video images and, by adjusting its network parameters, better approximate the true mapping from lip video images to keywords, which improves the accuracy of lip reading.
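The text does not say how the network turns its raw outputs into a keyword and a confidence. One common choice, assumed here purely for illustration, is a softmax over per-keyword scores, taking the top probability as the confidence. The function name `keyword_with_confidence` and the example vocabulary are hypothetical.

```python
import math


def keyword_with_confidence(scores, keywords):
    """Map raw per-keyword scores to (keyword, confidence).

    Assumption: confidence is the softmax probability of the top keyword.
    The source does not specify how the confidence is computed.
    """
    if not scores or len(scores) != len(keywords):
        raise ValueError("scores and keywords must be non-empty and equal in length")
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return keywords[best], probs[best]


# Hypothetical three-keyword vocabulary and scores.
keyword, confidence = keyword_with_confidence([2.0, 0.5, 0.1], ["rise", "descend", "hover"])
```

The same post-processing could apply to the audio recognition network, since both are described as producing a keyword together with its confidence.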

It can be understood that FIG. 1 is only one schematic flowchart of the keyword input method provided by an embodiment of the present invention. In other embodiments, S103 may be performed before S102, or performed concurrently or alternately with S102.

S104: Determine a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword.

The relative quality indicates how good the signal quality of the audio signal is relative to the signal quality of the video signal. The weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality.

Further, in one optional embodiment, the relative quality may be the ratio of the signal quality of the audio signal to the signal quality of the video signal. The higher the relative quality, the better the signal quality of the audio signal is relative to that of the video signal; the lower the relative quality, the worse it is. That the weighted confidence of the first keyword is positively correlated with the relative quality means that, among all the parameters that determine the weighted confidence of the first keyword, if every parameter other than the relative quality is held constant, a larger relative quality gives a larger weighted confidence of the first keyword. Similarly, that the weighted confidence of the second keyword is negatively correlated with the relative quality means that, with every parameter other than the relative quality held constant, a larger relative quality gives a smaller weighted confidence of the second keyword.

In one optional embodiment, a first weight and a second weight may be determined according to a preset weighting rule, where the first weight is positively correlated with the relative quality and the second weight is negatively correlated with it. The product of the first weight and the confidence of the first keyword is then calculated as the weighted confidence of the first keyword, and the product of the second weight and the confidence of the second keyword as the weighted confidence of the second keyword. Because the first weight is positively correlated with the relative quality, the weighted confidence of the first keyword reflects not only how well the acoustic features of the audio signal match the acoustic model, but also how good the signal quality of the audio signal is relative to that of the video signal. Likewise, the weighted confidence of the second keyword also reflects how good the signal quality of the video signal is relative to that of the audio signal.

Provided that the first weight is positively correlated with the relative quality and the second weight is negatively correlated with it, the weighting rule can be set according to actual requirements. For example, a quality score Score_voice of the audio signal and a quality score Score_video of the video signal may be determined according to a preset quality scoring rule, with Score_voice/Score_video used as the first weight and 1 - Score_voice/Score_video as the second weight. A higher signal quality gives a higher quality score, and the signal quality may be determined from the bit rate of the signal.
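The ratio-based weighting rule can be sketched as follows. The function names are hypothetical, and the quality scores are assumed to be positive numbers supplied by whatever scoring rule is in use (the text mentions bit rate as one possible basis).

```python
def ratio_weights(score_voice, score_video):
    """Return (first_weight, second_weight) from the two quality scores.

    first_weight  = Score_voice / Score_video
    second_weight = 1 - Score_voice / Score_video
    """
    if score_video <= 0:
        raise ValueError("score_video must be positive")
    relative_quality = score_voice / score_video
    return relative_quality, 1.0 - relative_quality


def weighted_confidences(conf_first, conf_second, score_voice, score_video):
    """Multiply each keyword's confidence by its weight."""
    first_weight, second_weight = ratio_weights(score_voice, score_video)
    return first_weight * conf_first, second_weight * conf_second


# Example with assumed scores: audio quality 3.0, video quality 4.0.
weighted_first, weighted_second = weighted_confidences(0.8, 0.6, 3.0, 4.0)
```

With this rule the second weight becomes negative whenever Score_voice exceeds Score_video, so a practical scoring rule would presumably keep the ratio below 1 or normalize the scores first; the sketch leaves that choice open.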

Further, in one optional embodiment, an interpolation factor α may be determined according to a preset weighting rule, where α is positively correlated with the relative quality and takes values in the interval (0, 1). α is used as the first weight and (1-α) as the second weight. Since α is positively correlated with the relative quality, (1-α) is negatively correlated with it. For example, α may be calculated by the following formula:

S105: Take whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.

If the signal quality of the audio signal is poor, the speech content it contains may be incomplete compared with what the user actually said; that is, the audio signal may carry little useful information. Even if the acoustic features match the acoustic model well, the lack of sufficient useful information means that the actual credibility of the first keyword being the keyword input by the user is still low, so the confidence of the first keyword may differ considerably from its actual credibility. Similarly, when the signal quality of the video signal is poor, the confidence of the second keyword may differ considerably from its actual credibility. In such cases it is difficult to determine from the two confidences alone which keyword is actually more credible. For example, the actual credibility of the first keyword may be higher than that of the second keyword while its confidence is lower. Judging by confidence alone would then select the second keyword, which is actually less credible, as the input keyword, lowering the accuracy of keyword input.

In this embodiment, the weighted confidence reflects how good the signal quality of the audio signal is relative to that of the video signal: when the audio signal quality is relatively better, the keyword recognition result is treated as more credible, and when the video signal quality is relatively better, the lip-reading result is treated as more credible. The weighted confidence is therefore closer to the actual credibility, and selecting the keyword with the larger weighted confidence as the input keyword is more accurate.
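A minimal sketch of S104 and S105 using the interpolation-factor variant, with α as the first weight and (1-α) as the second. How α is derived from the relative quality is not shown; it is simply passed in. The function name and the tie-breaking rule (prefer the audio result when the weighted confidences are equal) are assumptions, since the text does not address ties.

```python
def select_keyword(first_kw, first_conf, second_kw, second_conf, alpha):
    """Return (keyword, weighted_confidence) for the more credible result.

    alpha weights the audio keyword, (1 - alpha) weights the lip-reading keyword.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    weighted_first = alpha * first_conf
    weighted_second = (1.0 - alpha) * second_conf
    # Assumption: on a tie, keep the audio keyword.
    if weighted_first >= weighted_second:
        return first_kw, weighted_first
    return second_kw, weighted_second


# Poor audio relative to video (small alpha): the lip-reading result wins
# even though its raw confidence is lower.
print(select_keyword("rise", 0.9, "descend", 0.6, alpha=0.3))
```

With a larger α, for instance 0.7, the same inputs select the audio keyword instead, which is the behavior the paragraph above describes.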

Referring to FIG. 2, which is another schematic flowchart of the keyword input method provided by an embodiment of the present invention, the method may include:

S201: Acquire an audio signal input by a user and a video signal captured while the user inputs the audio signal, the video signal including a video image of the user's lips.

This step is the same as S101; see the foregoing description of S101, which is not repeated here.

S202: Perform keyword recognition on the audio signal to obtain a first keyword and a confidence of the first keyword.

This step is the same as S102; see the foregoing description of S102, which is not repeated here.

S203: Perform lip reading on the lip video image to obtain a second keyword and a confidence of the second keyword.

This step is the same as S103; see the foregoing description of S103, which is not repeated here.

S204: Determine a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword.

This step is the same as S104; see the foregoing description of S104, which is not repeated here.

S205: Determine whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold. If it is, perform S206; if it is not, perform S207.

The first preset confidence threshold can be set according to actual requirements. The higher the first preset confidence threshold, the higher the accuracy of keyword input; the lower it is, the higher the recognition rate of keyword input, where the recognition rate is the probability that a keyword is successfully recognized from the audio signal or the video signal.

S206: Take whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.

This step is the same as S105; see the foregoing description of S105, which is not repeated here.

S207: Determine that no keyword has been recognized.

If neither the weighted confidence of the first keyword nor that of the second keyword is greater than the first preset confidence threshold, the credibility that either keyword is the keyword input by the user is not high. To avoid inputting a wrong keyword, it can be determined that no keyword was recognized this time. In other embodiments, if neither weighted confidence is greater than the first preset confidence threshold, the user may additionally be prompted that this keyword input failed or is invalid.
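The thresholded decision of S205 to S207 can be sketched as below. The function name is hypothetical, `None` stands in for "no keyword recognized", and ties between the two weighted confidences are resolved in favor of the first keyword as an assumption.

```python
from typing import Optional


def input_keyword_with_threshold(
    first_kw: str,
    first_weighted: float,
    second_kw: str,
    second_weighted: float,
    first_threshold: float,
) -> Optional[str]:
    """Return the input keyword, or None if no keyword is recognized."""
    if first_weighted >= second_weighted:
        best_kw, best = first_kw, first_weighted
    else:
        best_kw, best = second_kw, second_weighted
    # S205: the larger weighted confidence must be strictly greater than the threshold.
    if best > first_threshold:
        return best_kw  # S206
    return None  # S207


result = input_keyword_with_threshold("rise", 0.63, "descend", 0.18, first_threshold=0.5)
```

A caller could treat a `None` result as the cue to prompt the user that the input failed, as the other embodiments mentioned above suggest.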

Referring to FIG. 3, which is another schematic flowchart of the keyword input method provided by an embodiment of the present invention, the method may include:

S301: Acquire an audio signal input by a user and a video signal captured while the user inputs the audio signal, the video signal including a video image of the user's lips.

S302: Perform keyword recognition on the audio signal to obtain a first keyword and a confidence of the first keyword.

S303: Perform lip reading on the lip video image to obtain a second keyword and a confidence of the second keyword.

S304: Determine a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword.

S305: Determine whether the first keyword and the second keyword are the same. If they are not the same, perform S306; if they are the same, perform S307.

S306: Determine whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold. If it is, perform S308; if it is not, perform S309.

This step is the same as S205; see the foregoing description of S205, which is not repeated here.

S307: Determine whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a second preset confidence threshold. If it is, perform S310; if it is not greater than the second preset confidence threshold, perform S309.

The second preset confidence threshold is smaller than the first preset confidence threshold. Suppose the weighted confidence of the first keyword is 0.8 and that of the second keyword is 0.7. If the first keyword and the second keyword are the same, the theoretical probability that neither is the keyword input by the user is 0.06, and the theoretical probability that they are the keyword input by the user is 0.94, which is greater than the weighted confidence of either keyword. Thus, when the first keyword and the second keyword are the same, their actual credibility is higher than both the weighted confidence of the first keyword and that of the second keyword. The accuracy of keyword input can then be regarded as assured, and the confidence threshold can be lowered appropriately to improve the keyword recognition rate.
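The figures 0.06 and 0.94 are stated without derivation. They are consistent with treating the two weighted confidences as independent probabilities that each recognizer is correct, which is an assumption made here to reproduce the arithmetic:

```latex
P(\text{both wrong}) = (1 - 0.8)\,(1 - 0.7) = 0.2 \times 0.3 = 0.06,
\qquad
P(\text{keyword correct}) = 1 - 0.06 = 0.94 > \max(0.8,\ 0.7).
```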

S308: Take whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.

S309: Determine that no keyword has been recognized.

This step is the same as S207; see the foregoing description of S207, which is not repeated here.

S310: Take the first keyword or the second keyword as the input keyword.

Because the first keyword and the second keyword are the same, it makes no substantive difference which of them is taken as the input keyword.
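The two-threshold decision of FIG. 3 can be sketched as follows: when the two recognizers agree, the lower second threshold applies; when they disagree, the higher first threshold applies. The function name, the use of `None` for "no keyword recognized", and the tie-breaking toward the first keyword are assumptions.

```python
from typing import Optional


def decide_keyword(
    first_kw: str,
    first_weighted: float,
    second_kw: str,
    second_weighted: float,
    first_threshold: float,
    second_threshold: float,
) -> Optional[str]:
    """Return the input keyword, or None if no keyword is recognized."""
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be smaller than first_threshold")
    best = max(first_weighted, second_weighted)
    if first_kw == second_kw:
        # S305 -> S307: agreeing results only need to clear the lower threshold.
        return first_kw if best > second_threshold else None
    # S305 -> S306: disagreeing results must clear the higher threshold.
    if best > first_threshold:
        # S308: keep the keyword with the larger weighted confidence.
        return first_kw if first_weighted >= second_weighted else second_kw
    return None


# Agreement at moderate confidence is accepted; disagreement at the same
# confidence is rejected.
agreed = decide_keyword("rise", 0.6, "rise", 0.5, first_threshold=0.7, second_threshold=0.5)
disagreed = decide_keyword("rise", 0.6, "descend", 0.5, first_threshold=0.7, second_threshold=0.5)
```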

Referring to FIG. 4, which is a schematic flowchart of a drone control method provided by an embodiment of the present invention, the method is applied to a drone controller and may include:

S401: Acquire an audio signal input by a user and a video signal captured while the user inputs the audio signal, the video signal including a video image of the user's lips.

S402: Perform keyword recognition on the audio signal to obtain a first keyword and a confidence of the first keyword.

S403: Perform lip reading on the lip video image to obtain a second keyword and a confidence of the second keyword.

S404: Determine a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword.

S405: Determine whether the first keyword and the second keyword are the same. If they are not the same, perform S406; if they are the same, perform S407.

This step is the same as S305; see the foregoing description of S305, which is not repeated here.

S406: Determine whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold. If it is, perform S408; if it is not, perform S409.

S407: Determine whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the second preset confidence threshold. If it is, perform S410; if it is not greater than the second preset confidence threshold, perform S409.

This step is the same as S307; see the foregoing description of S307, which is not repeated here.

S408: Take whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.

S409: Determine that no keyword has been recognized.

S410: Take the first keyword or the second keyword as the input keyword.

This step is the same as S310; see the foregoing description of S310, which is not repeated here.

S411: Acquire the control instruction corresponding to the input keyword.

The correspondence between control instructions and keywords is set in advance and may be stored in the memory of the drone controller in the form of a mapping table. Further, the correspondence can be changed according to the user's actual needs. For example, if the keyword "rise" has been associated in advance with the control instruction for increasing the drone's flight altitude, then when the input keyword is "rise", the control instruction for increasing the drone's flight altitude is acquired.
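The mapping table of S411 can be sketched as a dictionary. Only the "rise" entry comes from the example in the text; the instruction identifiers and the "descend" entry are hypothetical placeholders, not real drone commands.

```python
from typing import Dict, Optional

# Hypothetical keyword-to-instruction table; only "rise" is taken from the text.
COMMAND_TABLE: Dict[str, str] = {
    "rise": "INCREASE_ALTITUDE",
    "descend": "DECREASE_ALTITUDE",
}


def lookup_instruction(keyword: str, table: Dict[str, str] = COMMAND_TABLE) -> Optional[str]:
    """Return the control instruction for a keyword, or None if it is unmapped."""
    return table.get(keyword)


def remap(table: Dict[str, str], keyword: str, instruction: str) -> Dict[str, str]:
    """Return a copy of the table with one correspondence changed by the user."""
    updated = dict(table)
    updated[keyword] = instruction
    return updated


instruction = lookup_instruction("rise")
```

Returning a copy from `remap` keeps the preset table intact, which is one way to let users customize the correspondence without losing the defaults.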

S412: Control the drone bound to the drone controller to execute the control instruction.

The drone controller may be a mobile terminal or a remote control server. The drone controller either has its own audio recording and video capture functions or is connected to an external device that provides them. For example, the drone controller may be a smartphone with a drone control program installed. In the prior art, a user can control a drone by voice, but drones often operate outdoors, where loud background noise may make the keyword recognition result inaccurate, so that the user cannot control the drone accurately by voice. In this embodiment, keyword recognition is combined with lip reading, and the confidences are weighted according to signal quality, so that the user can input keywords more accurately and control the drone through those keywords. This effectively improves the accuracy of voice control of the drone in noisy scenes.

The network structures of the detection neural network, the recognition neural network, and the lip-reading neural network provided by the embodiments of the present invention can be set according to actual requirements, for example as a recurrent neural network, a convolutional neural network, or a deep neural network. The three networks may have the same structure or different structures. Further, referring to FIG. 5, which is a schematic structural diagram of a neural network provided by an embodiment of the present invention, the detection neural network, the recognition neural network, and the lip-reading neural network may each be a neural network of this structure. The neural network includes an input layer 510, a hidden layer 520, and an output layer 530. The input layer 510 may include a plurality of input layer neurons 511, each used to input one feature; for example, in the lip-reading neural network, each input layer neuron may be used to input one feature of the lip video image. The hidden layer 520 performs a nonlinear mapping on the features input through the input layer 510. The hidden layer 520 may consist of a single layer of hidden layer neurons 521 or, in other embodiments, of multiple layers of hidden layer neurons. The output layer 530 may consist of a single output node 531 or, in other embodiments, may include multiple output nodes.
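A forward pass through the input, hidden, and output layers of FIG. 5 can be sketched in plain Python. The tanh activation, the softmax output, and the use of several output nodes are assumptions chosen for illustration; the text only requires that the hidden layer perform a nonlinear mapping.

```python
import math


def forward(features, w_hidden, b_hidden, w_out, b_out):
    """Run one forward pass and return a probability per output node.

    features: one value per input layer neuron
    w_hidden, b_hidden: weights and biases of a single hidden layer (tanh assumed)
    w_out, b_out: weights and biases of the output layer (softmax assumed)
    """
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two input features, three hidden neurons, two output nodes (arbitrary weights).
probs = forward(
    [1.0, 2.0],
    [[0.1, -0.2], [0.3, 0.4], [0.0, 0.5]],
    [0.0, 0.0, 0.0],
    [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]],
    [0.0, 0.0],
)
```

In a real system the weights would come from training on the labeled samples described earlier, and the recurrent or convolutional structures the text mentions would replace this fully connected layout.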

Referring to FIG. 6, which is a schematic structural diagram of a keyword input apparatus provided by an embodiment of the present invention, the apparatus may include:

a signal acquisition module 601, configured to acquire an audio signal input by a user and a video signal captured while the user inputs the audio signal, the video signal including a video image of the user's lips;

a keyword recognition module 602, configured to perform keyword recognition on the audio signal to obtain a first keyword and a confidence of the first keyword, the confidence of the first keyword indicating how credible it is that the keyword input by the user is the first keyword;

a lip-reading module 603, configured to perform lip reading on the lip video image to obtain a second keyword and a confidence of the second keyword, the confidence of the second keyword indicating how credible it is that the keyword input by the user is the second keyword;

a command decision module 604, configured to determine a weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, where the relative quality indicates how good the signal quality of the audio signal is relative to the signal quality of the video signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality; and to take whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.

进一步的，命令判决模块604还用于在根据相对质量与第一关键词的置信度，确定第一关键词的加权置信度，以及根据相对质量与第二关键词的置信度，确定第二关键词的加权置信度之后，确定第一关键词的加权置信度和第二关键词的加权置信度中的较大值，是否大于第一预设置信度阈值；Further, thecommand decision module 604 is further configured to determine the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and to determine the second key according to the relative quality and the confidence of the second keyword. After the weighted confidence of the word, determine whether the larger value of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold;

如果较大值大于第一预设置信度阈值，执行将第一关键词和第二关键词中加权置信度较大的关键词，作为输入的关键词的步骤。If the larger value is greater than the first preset confidence threshold, the step of using the keyword with the larger weighted confidence among the first keyword and the second keyword as the input keyword is performed.

进一步的，命令判决模块604还用于在确定第一关键词的加权置信度和第二关键词的加权置信度中的较大值，是否大于第一预设置信度阈值之后，如果较大值不大于第一预设置信度阈值，确定未识别出关键词。Further, thecommand decision module 604 is also used to determine whether the larger value of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, if the larger value is If it is not greater than the first preset reliability threshold, it is determined that the keyword is not identified.

Further, the command decision module 604 is further configured to, before determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, determine whether the first keyword and the second keyword are consistent;

if the first keyword and the second keyword are inconsistent, perform the step of determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold;

if the first keyword and the second keyword are consistent, determine whether the larger value is greater than a second preset confidence threshold, the second preset confidence threshold being smaller than the first preset confidence threshold;

if the larger value is greater than the second preset confidence threshold, take the first keyword or the second keyword as the input keyword.
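The decision flow of the command decision module can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `decide_keyword`, the threshold values 0.6 and 0.4, the tie-breaking in favor of the audio result, and the choice to return `None` when the keywords agree but the larger value does not exceed the second threshold are all assumptions, since the text does not specify them.

```python
from typing import Optional


def decide_keyword(
    first_keyword: str,
    first_weighted: float,
    second_keyword: str,
    second_weighted: float,
    first_threshold: float = 0.6,   # hypothetical value
    second_threshold: float = 0.4,  # hypothetical value, must be below first_threshold
) -> Optional[str]:
    """Return the input keyword, or None when no keyword is recognized."""
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be smaller than first_threshold")

    larger = max(first_weighted, second_weighted)
    # Assumption: ties go to the audio (first) keyword.
    winner = first_keyword if first_weighted >= second_weighted else second_keyword

    if first_keyword == second_keyword:
        # Both recognizers agree, so the lower (second) threshold applies.
        # Assumption: failing this check means no keyword is recognized.
        return first_keyword if larger > second_threshold else None

    # The recognizers disagree, so the stricter (first) threshold applies.
    return winner if larger > first_threshold else None
```

With these assumed thresholds, a weighted confidence of 0.5 is accepted when audio and lip recognition agree, but rejected when they disagree, which reflects the idea that agreement between the two modalities is itself evidence of a correct recognition.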

Further, the keyword recognition module 602 is specifically configured to input the audio signal into a preset detection neural network to remove the noise signal and the silence signal from the audio signal and obtain the voice signal of the audio signal, where the detection neural network has been trained in advance on a plurality of sample audio signals, and each sample audio signal is labeled in advance with the voice content of that sample audio signal;

input the audio signal into a preset recognition neural network, where the recognition neural network has been trained in advance on a plurality of sample voice signals, and each sample voice signal is labeled in advance with the voice content corresponding to that sample voice signal;

obtain the first keyword output by the recognition neural network and the confidence of the first keyword.

Keyword recognition is performed on the voice content to obtain the first keyword and the confidence of the first keyword.

Further, the lip language recognition module 603 is specifically configured to input the lip video image into a preset lip language neural network, where the lip language neural network has been trained in advance on a plurality of sample lip video images, and each sample lip video image is labeled in advance with the keyword corresponding to that sample lip image;

obtain the keyword output by the lip language neural network and the confidence of that keyword as the second keyword and the confidence of the second keyword.

Further, the apparatus is applied to a UAV controller. After taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword, the command decision module 604 obtains the control instruction corresponding to the input keyword, and controls the UAV bound to the UAV controller to execute the control instruction.
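The lookup from input keyword to control instruction can be sketched as a simple table. The table contents, the instruction names, and the `send_to_drone` callback are hypothetical; the text only states that a control instruction corresponding to the keyword is obtained and executed by the bound UAV.

```python
from typing import Callable, Dict, Optional

# Hypothetical keyword-to-instruction table; a real controller would define its own.
CONTROL_INSTRUCTIONS: Dict[str, str] = {
    "take off": "CMD_TAKEOFF",
    "land": "CMD_LAND",
    "hover": "CMD_HOVER",
}


def dispatch_keyword(
    keyword: Optional[str],
    send_to_drone: Callable[[str], None],
) -> bool:
    """Send the instruction for `keyword` to the bound UAV.

    Returns True if an instruction was sent, False if no keyword was
    recognized or the keyword has no associated instruction.
    """
    if keyword is None:
        return False
    instruction = CONTROL_INSTRUCTIONS.get(keyword)
    if instruction is None:
        return False
    send_to_drone(instruction)
    return True
```

Passing the transport as a callback keeps the sketch independent of any particular radio link or UAV SDK.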

Further, the command decision module 604 is specifically configured to determine a first weight and a second weight according to a preset weighting rule, where the first weight is positively correlated with the relative quality and the second weight is negatively correlated with the relative quality;

calculate the product of the first weight and the confidence of the first keyword as the weighted confidence of the first keyword, and calculate the product of the second weight and the confidence of the second keyword as the weighted confidence of the second keyword.

Further, the command decision module 604 is specifically configured to determine an interpolation factor α according to the preset weighting rule, where α is positively correlated with the relative quality, α is greater than 0 and less than 1, and the relative quality is the ratio of the signal quality of the audio signal to the signal quality of the video signal;

take α as the first weight;

take 1-α as the second weight.
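The weighting step can be illustrated with one possible weighting rule. The mapping α = r / (1 + r), where r is the relative quality, is an assumption chosen only because it is increasing in r and stays strictly between 0 and 1 for positive r; the text does not fix the preset weighting rule or how signal quality is measured. The function names are likewise hypothetical.

```python
from typing import Tuple


def interpolation_factor(audio_quality: float, video_quality: float) -> float:
    """Map the relative quality r = audio_quality / video_quality to alpha in (0, 1).

    Assumption: alpha = r / (1 + r). Any rule that increases with r and
    stays between 0 and 1 would satisfy the stated constraints.
    """
    if audio_quality <= 0 or video_quality <= 0:
        raise ValueError("signal qualities must be positive")
    relative_quality = audio_quality / video_quality
    return relative_quality / (1.0 + relative_quality)


def weighted_confidences(
    first_confidence: float,
    second_confidence: float,
    audio_quality: float,
    video_quality: float,
) -> Tuple[float, float]:
    """Return (weighted confidence of first keyword, weighted confidence of second keyword)."""
    alpha = interpolation_factor(audio_quality, video_quality)
    # First weight is alpha, second weight is 1 - alpha.
    return alpha * first_confidence, (1.0 - alpha) * second_confidence
```

Under this assumed rule, equal audio and video quality gives α = 0.5, so neither modality is favored, while audio that is three times better than video gives α = 0.75.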

An embodiment of the present invention further provides an electronic device which, as shown in FIG. 7, may include:

a memory 701, configured to store a computer program;

a processor 702, configured to implement the following steps when executing the program stored in the memory 701:

acquiring an audio signal input by a user and a video signal collected while the user inputs the audio signal, where the video signal includes a lip video image of the user;

performing keyword recognition on the audio signal to obtain a first keyword and a confidence of the first keyword, where the confidence of the first keyword indicates how credible it is that the keyword input by the user is the first keyword;

performing lip language recognition on the lip video image to obtain a second keyword and a confidence of the second keyword, where the confidence of the second keyword indicates how credible it is that the keyword input by the user is the second keyword;

determining a weighted confidence of the first keyword according to a relative quality and the confidence of the first keyword, and determining a weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, where the relative quality indicates how good the signal quality of the audio signal is relative to the signal quality of the video signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality;

taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword.
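The final selection step executed by the processor can be sketched as below, taking the two recognition results as given. The `Recognition` type and the function name are hypothetical, the weights follow the α and 1-α interpolation described for the command decision module, and ties are resolved in favor of the audio result as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recognition:
    """A recognized keyword together with its confidence in [0, 1]."""
    keyword: str
    confidence: float


def select_input_keyword(audio: Recognition, lips: Recognition, alpha: float) -> str:
    """Pick the keyword whose weighted confidence is larger.

    `alpha` weights the audio result and `1 - alpha` weights the lip result;
    a larger alpha corresponds to audio of better relative quality.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    audio_weighted = alpha * audio.confidence
    lips_weighted = (1.0 - alpha) * lips.confidence
    # Assumption: a tie is resolved in favor of the audio keyword.
    return audio.keyword if audio_weighted >= lips_weighted else lips.keyword
```

For example, with a noisy audio channel (α = 0.2) an audio result of 0.9 is outweighed by a lip result of 0.6, whereas with clean audio (α = 0.8) the audio keyword is selected.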

Further, after determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, the method further includes:

determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold;

Further, after determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, the method further includes:

if the larger value is not greater than the first preset confidence threshold, determining that no keyword has been recognized.

Further, before determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, the method further includes:

determining whether the first keyword and the second keyword are consistent;

if the first keyword and the second keyword are inconsistent, performing the step of determining whether the larger of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold.

If the larger value is greater than the second preset confidence threshold, taking the first keyword or the second keyword as the input keyword.

Further, performing keyword recognition on the audio signal includes:

inputting the audio signal into a preset detection neural network to remove the noise signal and the silence signal from the audio signal and obtain the voice signal of the audio signal, where the detection neural network has been trained in advance on a plurality of sample audio signals, and each sample audio signal is labeled in advance with the voice signal of that sample audio signal;

obtaining the first keyword output by the recognition neural network and the confidence of the first keyword.

Further, performing lip language recognition on the lip video image includes:

inputting the lip video image into a preset lip language neural network, where the lip language neural network has been trained in advance on a plurality of sample lip video images, and each sample lip video image is labeled in advance with the keyword corresponding to that sample lip image;

Further, when applied to a UAV controller, after taking whichever of the first keyword and the second keyword has the larger weighted confidence as the input keyword, the method further includes:

obtaining the control instruction corresponding to the input keyword;

controlling the UAV bound to the UAV controller to execute the control instruction.

Further, determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, includes:

determining a first weight and a second weight according to a preset weighting rule, where the first weight is positively correlated with the relative quality and the second weight is negatively correlated with the relative quality;

Further, determining the first weight and the second weight according to the preset weighting rule includes:

determining an interpolation factor α according to the preset weighting rule, where α is positively correlated with the relative quality, α is greater than 0 and less than 1, and the relative quality is the ratio of the signal quality of the audio signal to the signal quality of the video signal;

taking α as the first weight, and taking 1-α as the second weight.

The memory of the above electronic device may include random access memory (RAM), and may also include non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the keyword input method of any one of the above embodiments.

In yet another embodiment of the present invention, a computer program product containing instructions is further provided. When run on a computer, the computer program product causes the computer to execute the keyword input method of any one of the above embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.

The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the embodiments of the apparatus, the electronic device, the computer-readable storage medium, and the computer program product are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A method for inputting a keyword, the method comprising:

acquiring an audio signal input by a user and a video signal acquired during the audio signal input by the user, wherein the video signal comprises a lip video image of the user;

performing keyword recognition on the audio signal to obtain a first keyword and a confidence coefficient of the first keyword, wherein the confidence coefficient of the first keyword is used for representing the credibility of the keyword input by the user as the first keyword;

performing lip language recognition on the lip video image to obtain a second keyword and a confidence coefficient of the second keyword, wherein the confidence coefficient of the second keyword is used for representing the credibility of the keyword input by the user as the second keyword;

determining a weighted confidence coefficient of the first keyword according to a relative quality and a confidence coefficient of the first keyword, and determining a weighted confidence coefficient of the second keyword according to the relative quality and a confidence coefficient of the second keyword, wherein the relative quality is used for representing the quality of the audio signal relative to the quality of the video signal, the weighted confidence coefficient of the first keyword is positively correlated with the relative quality, and the weighted confidence coefficient of the second keyword is negatively correlated with the relative quality;

and taking the keyword with higher weighted confidence coefficient in the first keyword and the second keyword as an input keyword.

2. The method of claim 1, wherein after determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, the method further comprises:

determining whether the greater value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is greater than a first preset confidence coefficient threshold value;

and if the larger value is larger than the first preset confidence coefficient threshold value, executing the step of taking the keyword with larger weighted confidence coefficient in the first keyword and the second keyword as the input keyword.

3. The method of claim 2, wherein after determining whether a greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold, the method further comprises:

and if the larger value is not larger than the first preset confidence coefficient threshold value, determining that no keyword is identified.

4. The method of claim 2, wherein prior to said determining whether a greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold, the method further comprises:

determining whether the first keyword and the second keyword are consistent;

if the first keyword and the second keyword are inconsistent, executing the step of determining whether the greater value of the weighted confidence coefficient of the first keyword and the weighted confidence coefficient of the second keyword is greater than a first preset confidence coefficient threshold value;

if the first keyword and the second keyword are consistent, determining whether the larger value is larger than a second preset confidence threshold value, wherein the second preset confidence threshold value is smaller than the first preset confidence threshold value;

and if the larger value is larger than the second preset confidence coefficient threshold value, taking the first keyword or the second keyword as the input keyword.

5. The method of claim 1, wherein the performing keyword recognition on the audio signal comprises:

inputting the audio signal to a preset detection neural network to remove a noise signal and a mute signal in the audio signal to obtain a voice signal of the audio signal, wherein the detection neural network is trained by a plurality of sample audio signals in advance, and for each sample audio signal, the voice signal of the sample audio signal is used for labeling in advance;

inputting the audio signal into a preset recognition neural network, wherein the recognition neural network is trained by a plurality of sample voice signals in advance, and for each sample voice signal, voice content corresponding to the sample voice signal is labeled in advance;

and acquiring a first keyword output by the recognition neural network and the confidence coefficient of the first keyword.

6. The method according to claim 1, wherein the performing lip language recognition on the lip video image comprises:

inputting the lip video image into a preset lip language neural network, wherein the lip language neural network is trained by a plurality of sample lip video images in advance, and for each sample lip video image, a keyword corresponding to the sample lip image is used for labeling in advance;

and acquiring the keywords output by the lip language neural network and the confidence coefficient of the keywords as second keywords and the confidence coefficient of the second keywords.

7. The method of claim 1, applied to a drone controller, wherein after the keyword with a higher weighted confidence of the first keyword and the second keyword is taken as an input keyword, the method further comprises:

acquiring a control instruction corresponding to the input keyword;

and controlling the unmanned aerial vehicle bound by the unmanned aerial vehicle controller to execute the control instruction.

8. The method of claim 1, wherein determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword, and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, comprises:

determining a first weight and a second weight according to a preset weighting rule, wherein the first weight is positively correlated with the relative quality, and the second weight is negatively correlated with the relative quality;

and calculating the product of the first weight and the confidence coefficient of the first keyword as the weighted confidence coefficient of the first keyword, and calculating the product of the second weight and the confidence coefficient of the second keyword as the weighted confidence coefficient of the second keyword.

9. The method of claim 8, wherein determining the first weight and the second weight according to the preset weighting rule comprises:

determining an interpolation factor α according to the preset weighting rule, wherein α is positively correlated with the relative quality, α is greater than 0 and less than 1, and the relative quality is a ratio of the signal quality of the audio signal to the signal quality of the video signal;

taking α as the first weight;

and taking 1-α as the second weight.

10. A keyword input apparatus, characterized in that the apparatus comprises:

the signal acquisition module is used for acquiring an audio signal input by a user and a video signal acquired during the period that the user inputs the audio signal, wherein the video signal comprises a lip video image of the user;

the keyword identification module is used for carrying out keyword identification on the audio signal to obtain a first keyword and the confidence coefficient of the first keyword, wherein the confidence coefficient of the first keyword is used for representing the credibility of the keyword input by the user as the first keyword;

the lip language identification module is used for carrying out lip language identification on the lip video image to obtain a second keyword and the confidence coefficient of the second keyword, wherein the confidence coefficient of the second keyword is used for representing the credibility degree that the keyword input by the user is the second keyword;

a command decision module, configured to determine a weighted confidence of the first keyword according to a relative quality and a confidence of the first keyword, and determine a weighted confidence of the second keyword according to the relative quality and a confidence of the second keyword, where the relative quality is used to indicate a quality of the audio signal relative to a quality of the video signal, the weighted confidence of the first keyword is positively correlated with the relative quality, and the weighted confidence of the second keyword is negatively correlated with the relative quality; and using the keyword with higher weighted confidence coefficient in the first keyword and the second keyword as an input keyword.

11. The apparatus of claim 10, wherein the command decision module is further configured to, after determining the weighted confidence of the first keyword according to the relative quality and the confidence of the first keyword and determining the weighted confidence of the second keyword according to the relative quality and the confidence of the second keyword, determine whether the greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold;

12. The apparatus of claim 11, wherein the command decision module is further configured to, after determining whether the greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than the first preset confidence threshold, determine that no keyword is recognized if the greater value is not greater than the first preset confidence threshold.

13. The apparatus of claim 11, wherein the command decision module is further configured to determine whether the first keyword and the second keyword are consistent before the determining whether the greater of the weighted confidence of the first keyword and the weighted confidence of the second keyword is greater than a first preset confidence threshold;

and if the larger value is larger than the second preset confidence coefficient threshold value, taking the first keyword or the second keyword as an input keyword.

14. The apparatus according to claim 10, wherein the keyword recognition module is specifically configured to input the audio signal into a preset detection neural network, so as to remove a noise signal and a silence signal in the audio signal to obtain a speech content of the audio signal, and the detection neural network is trained in advance through a plurality of sample audio signals, wherein for each sample audio signal, the speech content of the sample audio signal is labeled in advance;

15. The apparatus according to claim 10, wherein the lip language recognition module is specifically configured to input the lip video image into a preset lip language neural network, and the lip language neural network is trained in advance through a plurality of sample lip video images, wherein each sample lip video image is labeled in advance using a keyword corresponding to the sample lip image;

16. The apparatus of claim 10, applied to an unmanned aerial vehicle controller, wherein the command decision module obtains a control instruction corresponding to the input keyword after the keyword with the higher weighted confidence coefficient is taken as the input keyword, and controls the unmanned aerial vehicle bound to the unmanned aerial vehicle controller to execute the control instruction.

17. The apparatus according to claim 10, wherein the command decision module is specifically configured to determine a first weight and a second weight according to a preset weighting rule, wherein the first weight is positively correlated with the relative quality, and the second weight is negatively correlated with the relative quality;

18. The apparatus of claim 17, wherein the command decision module is specifically configured to determine an interpolation factor α according to a preset weighting rule, wherein α is positively correlated with the relative quality, α is greater than 0 and less than 1, and the relative quality is a ratio of the signal quality of the audio signal to the signal quality of the video signal;

take α as the first weight;

and take 1-α as the second weight.

19. An electronic device, characterized in that the electronic device comprises:

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 1-9 when executing a program stored in the memory.

20. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-9.