Disclosure of Invention
The invention mainly aims to provide a new voice command recognition method, device, system, and storage medium. They address the prior-art problem that, because a user's conversational statements to other people can resemble appliance control statements, a household appliance mistakenly recognizes a conversational statement as a control command and executes it. The invention thus helps the household appliance accurately determine whether a statement is intended as a control command, improving both the accuracy and the speed of its response.
The invention provides, in a first aspect, a voice command recognition method comprising the following steps: acquiring voice information and behavior image data of the voice user captured while the voice user sends out the voice information; performing behavior feature recognition on the voice user in the behavior image data to determine whether the voice user sent out the voice information as a control instruction; and, when it is determined that the voice information was sent out as a control instruction, generating a corresponding control instruction according to the voice information, where the control instruction is used to control a target device to execute a corresponding action according to the voice information.
Optionally, performing behavior feature recognition on the voice user in the behavior image data to determine whether the voice user sends the voice information as a control instruction, including: dividing the behavior image data to obtain a video frame sequence; performing video analysis on the video frame sequence, and determining face orientation information when a voice user sends voice information and behavior types when the voice user sends the voice information; and judging whether the voice user sends the voice information as a control instruction or not according to the face orientation information and the behavior category.
Optionally, performing video analysis on the video frame sequence to determine face orientation information when a voice user sends out voice information, including: analyzing the video frame sequence to obtain head coordinate information of the voice user on each video frame; based on the head coordinate information of the voice user, positioning the head area of the voice user on each video frame, and acquiring head posture information and face feature information in the head area corresponding to each video frame; and determining face orientation information of the voice user when the voice information is sent by the voice user based on the head posture information and the face characteristic information on each video frame by using the trained face orientation recognition model.
Optionally, performing video analysis on the video frame sequence to determine a behavior category when a voice user sends out voice information, including: screening out video frames containing voice users from the video frame sequence; and determining the behavior type of the voice user when the voice user sends out voice information based on the video frame containing the voice user by using the trained preset behavior recognition model.
Optionally, determining, by using the trained preset behavior recognition model and based on the video frame containing the voice user, the behavior category when the voice user sends out the voice information includes: positioning and analyzing the voice user in the video frame containing the voice user to obtain the position information of the voice user in that video frame; extracting human behavior feature information of the voice user in the video frame sequence containing the voice user according to the position information; and determining the behavior category of the voice user based on the human behavior feature information of the voice user in the video frame containing the voice user.
Optionally, the preset behavior recognition model is trained through the following steps: performing individual positioning analysis on the on-site persons in the video frames to obtain position information of the on-site persons on the corresponding video frames; acquiring human behavior feature information of each on-site person in the corresponding video frames according to that person's position information; and training the preset behavior recognition model based on the human behavior feature information of the on-site persons in the corresponding video frames.
Optionally, in a case that it is determined that the voice information is sent as a control instruction, generating a corresponding control instruction according to the voice information, where the control instruction is used to control the target device to execute a corresponding action according to the voice information, includes: preprocessing the voice information and extracting voice keywords from the processing result; recognizing control information from the voice keywords by using a preset voice recognition model; and, when it is determined that the voice information was sent out as a control instruction, generating a corresponding control instruction according to the control information, the corresponding control instruction being used to control the target device to execute a corresponding action.
Optionally, the pre-processing comprises one or more of: denoising, pre-emphasis, framing, windowing, and endpoint detection.
Optionally, after determining that the voice information is sent as a control instruction, and before generating a corresponding control instruction according to the control information to control the target device to execute a corresponding action, the method further includes: judging, based on a preset instruction list, whether the control information is included in the preset instruction list, where the preset instruction list comprises a plurality of items of executable control information; and, when the preset instruction list includes the control information, determining to generate a corresponding control instruction according to the control information so as to control the target device to execute a corresponding action.
A second aspect of the present invention provides a storage medium storing one or more programs executable by one or more processors to implement the voice command recognition method described above.
A third aspect of the present invention provides a voice command recognition apparatus comprising a processor and a memory; the memory is used for storing computer instructions, and the processor is used for running the computer instructions stored in the memory so as to implement the voice command recognition method.
A fourth aspect of the present invention provides a voice command recognition system, the system comprising: the terminal is used for acquiring voice information and behavior image data of a voice user sending the voice information when sending the voice information; the server is in communication connection with the terminal and is used for receiving the voice information sent by the terminal and behavior image data of a voice user sending the voice information when sending the voice information; wherein the server further comprises a processor and a memory; the memory is used for storing computer instructions, and the processor is used for operating the computer instructions stored by the memory so as to realize the voice command recognition method.
The invention further provides an intelligent household appliance system, which comprises the voice command recognition apparatus or system described above and a household appliance communicatively connected to the apparatus or system, wherein the household appliance receives a control instruction sent by the apparatus or system and executes a corresponding action according to the control instruction.
Compared with the prior art, the invention has the following beneficial effects: by performing behavior feature recognition on the behavior image data captured while the voice user sends out the voice information, it can be determined whether the voice user sent out the voice information as a control instruction. Accordingly, when the behavior image data shows that the voice information was not intended as a control instruction, the target device does not respond to it, even if the voice information is similar or identical to the language used for man-machine interaction, because the voice information was part of a conversation between the voice user and another person. Conversely, when the behavior image data shows that the voice information was intended as a control instruction, it can be determined that the voice user is interacting with a household appliance such as a smart air conditioner, and the target device may respond to the voice information. The target device is thereby prevented from responding mistakenly, and the accuracy of its responses is improved.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The voice command recognition method provided by the invention can be applied to the application environment of the voice command recognition system shown in FIG. 1. The terminal 102 and the server 104 are in communication connection through a wireless network. The terminal 102 is used for collecting voice information and behavior image data of the voice user captured while the voice user sends out the voice information; the collected voice information and behavior image data are transmitted to the server 104, and the server 104 performs behavior feature recognition on the voice user in the behavior image data to determine whether the voice user sent out the voice information as a control instruction. When it is determined that the voice information was sent out as a control instruction, a corresponding control instruction is generated according to the voice information, where the control instruction is used to control the target device to execute a corresponding action according to the voice information. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
Wherein, the target device includes, but is not limited to, intelligent household appliances. For convenience of description, an intelligent household appliance is hereinafter referred to as a home appliance.
In this embodiment, the terminal 102 includes, but is not limited to, collecting devices such as a voice collecting device and a video collecting device; the server 104 includes, but is not limited to, a signal processing device applied to the home appliance and configured to process the data signals uploaded by the collecting devices.
In this embodiment, the server 104 forms a data connection with the home appliance, but in another embodiment, the terminal 102 may form the data connection with the home appliance.
In this embodiment, the home appliance includes, but is not limited to, a smart refrigerator and a smart air conditioner, for example.
In another embodiment, as shown in FIG. 2, a voice command recognition method is provided, which is exemplified here by applying the method to the server 104 in FIG. 1, and includes the following steps:
step S201: acquiring voice information and behavior image data of a voice user who sends the voice information when sending the voice information;
step S202: performing behavior feature recognition on the voice user in the behavior image data to determine whether the voice user sends the voice information as a control instruction; if the voice information is determined to be sent out as a control instruction, the following step S203 is executed, otherwise, the following step S204 is executed;
step S203: and generating a corresponding control instruction according to the voice information so as to control the target equipment to execute a corresponding action according to the voice information.
Step S204: no treatment is performed.
In this embodiment, the terminal 102 includes a voice collecting device and a video collecting device; when the voice collecting device collects voice information, the video collecting device collects video data in its shooting area at that time. That is, the video collecting device collects behavior image data of the voice user while the voice user sends out the voice information. Of course, in this embodiment, the voice collecting device and the video collecting device continuously collect their corresponding data information.
After the behavior image data is obtained, behavior feature recognition is performed on the voice user in the behavior image data to determine whether the voice user sent out the voice information as a control instruction. That is, in this embodiment, behavior feature recognition on the voice user in the behavior image data determines whether, at the time the voice information was acquired, the voice user's behavior indicates that the voice information was issued as a control instruction; if so, it can be determined that the voice user issued the voice information as a control instruction. In this case, a corresponding control instruction may be generated according to the voice information to control the target device to perform a corresponding action.
Therefore, in the present embodiment, by performing behavior feature recognition on the behavior image data captured while the voice user utters the voice information, it can be determined whether the voice user uttered the voice information as a control instruction. When the behavior image data shows that the voice information was not intended as a control instruction, the target device does not respond to it, even if the voice information is similar or identical to the language used for man-machine interaction, because the voice information was part of a conversation between the voice user and another person. When the behavior image data shows that the voice information was intended as a control instruction, it can be determined that the voice user is interacting with a household appliance such as a smart air conditioner, and the target device may respond to the voice information, so that mistaken responses by the target device are avoided.
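For illustration only, the decision flow of steps S201 to S204 may be sketched as follows in Python; the helper functions recognize_behavior and generate_command are hypothetical stand-ins for the recognition models described in the embodiments below, not part of the disclosure.

    from typing import Optional

    def recognize_behavior(behavior_frames) -> bool:
        """Stand-in for step S202: behavior feature recognition on the frames.
        A real implementation would combine the face-orientation and behavior
        models described below; returning False here is only a placeholder."""
        return False

    def generate_command(voice_info: str) -> dict:
        """Stand-in for step S203: build a control instruction from the voice information."""
        return {"command": voice_info}

    def handle_voice_event(voice_info: str, behavior_frames) -> Optional[dict]:
        if recognize_behavior(behavior_frames):     # step S202
            return generate_command(voice_info)     # step S203
        return None                                 # step S204: no processing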
In another embodiment, an implementation manner of the step S202 is as follows:
step S221: dividing the behavior image data to obtain a video frame sequence;
step S222: performing video analysis on the video frame sequence, and determining face orientation information when a voice user sends voice information and behavior types when the voice user sends the voice information;
step S223: judging whether the voice user sent out the voice information as a control instruction according to the face orientation information and the behavior category.
In the present embodiment, the behavior image data is divided uniformly to obtain, for example, continuous video frames. Video analysis is then performed on the continuous video frames to determine the face orientation information and the behavior category of the voice user when the voice information is sent out. In this way it can be determined which way the voice user's face was oriented when the voice information was sent out, and whether the voice user was talking with a person or engaging in man-machine communication. Whether the voice user sent out the voice information as a control instruction is then judged according to the face orientation information and the behavior category, so that face orientation and behavior category are considered together when judging whether the voice user is interacting with a household appliance such as a smart air conditioner, which improves the accuracy of determining whether the voice information was sent out as a control instruction.
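As a minimal sketch (assuming OpenCV is available and the behavior image data is stored as a video file), the division into a video frame sequence could look like this:

    import cv2

    def split_into_frames(video_path: str, step: int = 1):
        """Divide behavior image data uniformly into a sequence of video frames.
        `step` controls the sampling interval; step=1 keeps every frame."""
        cap = cv2.VideoCapture(video_path)
        frames = []
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                frames.append(frame)
            idx += 1
        cap.release()
        return frames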
Therefore, in another embodiment, one implementation manner of the step S222 includes:
step S2221: analyzing the continuous video frame sequence to obtain the head coordinate information of the voice user on each video frame;
step S2222: based on the head coordinate information of the voice user, positioning the head area of the voice user on each video frame, and acquiring head posture information and face feature information in the head area corresponding to each video frame;
step S2223: and determining face orientation information of the voice user when the voice information is sent by the voice user based on the head posture information and the face characteristic information on each video frame by using the trained face orientation recognition model.
In this embodiment, video analysis is performed on the consecutive video frames one by one to obtain the head coordinate information of the voice user on each video frame, yielding a set of head coordinate information, and the head area of the voice user on each video frame is located through that head coordinate information. The head posture information and the face feature information in the head area corresponding to each video frame are then obtained by cropping the head area from the corresponding video frame; this information is passed to the trained face orientation recognition model for recognition, which determines the face orientation of the voice user when the voice information was sent out.
Specifically, the YOLOv3 algorithm may be used to detect the voice user on the sequence of consecutive video frames and to detect the head coordinate information of each person, so as to determine the face orientation. If two persons are talking, their faces are generally oriented toward each other, whereas if a person wants to control an air conditioner by voice, the person generally faces the position of the intelligent appliance, such as a smart air conditioner.
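A sketch of this detection step, using OpenCV's DNN module to run YOLOv3, is shown below; the configuration/weight file names and the top-quarter head heuristic are illustrative assumptions, not part of the disclosure:

    import cv2
    import numpy as np

    net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")  # assumed file names

    def detect_people(frame, conf_threshold: float = 0.5):
        """Return (x, y, w, h) boxes for persons detected in one video frame."""
        h, w = frame.shape[:2]
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
        net.setInput(blob)
        boxes = []
        for output in net.forward(net.getUnconnectedOutLayersNames()):
            for det in output:
                scores = det[5:]
                # class 0 is "person" in the COCO labels YOLOv3 is commonly trained on
                if int(np.argmax(scores)) == 0 and scores[0] > conf_threshold:
                    cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                    boxes.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)))
        return boxes

    def head_region(frame, box):
        """Crude head crop: the top quarter of a person box (an illustrative heuristic);
        the crop would then be passed to the trained face orientation recognition model."""
        x, y, bw, bh = box
        return frame[max(y, 0):y + bh // 4, max(x, 0):x + bw]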
Of course, in this embodiment, a large number of labeled face orientation samples may be collected to train the face orientation recognition model, so that the trained model can be used for face orientation detection.
Specifically, in another embodiment, the implementation manner of step S222 further includes:
step S2224: screening out video frames containing voice users from the video frame sequence;
step S2225: and determining the behavior type of the voice user when the voice user sends out voice information based on the video frame containing the voice user by using the trained preset behavior recognition model.
Of course, in this embodiment, the execution sequence between step S2221 to step S2223 and step S2224 to step S2225 is not limited, and step S2221 to step S2223 may be executed first, and step S2224 to step S2225 may also be executed first.
In this embodiment, human body detection is performed on all voice users in the video frame sequence, for example by detecting humans as targets in the video frame sequence, so as to screen out the video frames containing the voice user; the trained preset behavior recognition model then determines, from the video frames containing the voice user, the behavior category of the voice user when the voice information is sent out.
Specifically, in another embodiment, one implementation manner of the step S2225 includes:
step S22251: positioning and analyzing the voice user in the video frame containing the voice user to obtain the position information of the voice user in the corresponding video frame containing the voice user;
step S22252: extracting human behavior feature information of the voice user in a video frame sequence containing the voice user according to the position information;
step S22253: and determining the behavior category of the voice user based on the human behavior characteristic information of the voice user in the video frame sequence containing the voice user.
In this embodiment, by locating the voice user on each video frame, the human behavior feature information of the voice user on the corresponding video frame can be accurately extracted, and the behavior category of the voice user can be determined by combining the extracted human behavior feature information.
Of course, in another embodiment, the training method of the preset behavior recognition model includes:
performing individual positioning analysis on the on-site persons in the continuous video frames to obtain the position information of the on-site persons on the corresponding video frames; acquiring human behavior feature information of each on-site person in the corresponding video frames according to that person's position information; and training the preset behavior recognition model based on the human behavior feature information of the on-site persons in the corresponding video frames.
Therefore, in this embodiment, individual positioning analysis can be performed on each person in the video frames, so as to determine whether each person is communicating with another person, engaging in man-machine communication, or performing some other behavior.
Of course, in this embodiment, the preset behavior recognition model may adopt a CNN-BiLSTM model. Specifically, through human target detection, the detected video frames containing humans are input into the preset behavior recognition model for classification and recognition. The CNN-BiLSTM model distinguishes 3 classes: person-to-person communication, a person controlling the air conditioner by voice, and other behaviors. In addition, in this embodiment, human behavior feature information of on-site persons may be collected to train the model; this feature information covers person-to-person communication in actual scenes, persons controlling the air conditioner by voice, and other behaviors, and is input into the model for training.
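A minimal PyTorch sketch of such a CNN-BiLSTM classifier follows; the layer sizes are illustrative assumptions, since the disclosure specifies only the model type and the 3 classes:

    import torch
    import torch.nn as nn

    class CNNBiLSTM(nn.Module):
        """Per-frame CNN features aggregated over time by a bidirectional LSTM,
        classifying 3 behavior categories: person-to-person communication,
        voice control of the air conditioner, and other behaviors."""
        def __init__(self, num_classes: int = 3, feat_dim: int = 128, hidden: int = 64):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, feat_dim),
            )
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * hidden, num_classes)

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            # clips: (batch, time, 3, H, W) -- the screened frames containing the voice user
            b, t = clips.shape[:2]
            feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
            out, _ = self.lstm(feats)
            return self.fc(out[:, -1])  # logits over the 3 behavior classes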
Furthermore, in another embodiment, one implementation manner of the step S203 includes:
step S301: preprocessing the voice information and extracting voice keywords from a processing result;
of course, in this embodiment, the pre-processing includes, but is not limited to, one or more of the following: denoising, pre-emphasis, framing, windowing, and endpoint detection.
Step S302: identifying control information from the voice keywords through a preset voice identification model; (ii) a
Step S303: and under the condition that the voice information is determined to be sent out as a control instruction, generating a corresponding control instruction according to the control information so as to control the target equipment to execute a corresponding action.
In the present embodiment, in steps S301 to S303, the voice information is preprocessed so that speech feature parameters can be extracted from the processed voice information; the extracted speech feature parameters are then input into the preset voice recognition model to recognize the control information. When it is determined that the voice information was sent out as a control instruction, a corresponding control instruction is generated according to the control information to control the target device to execute a corresponding action. Specifically, in this embodiment, the voice information is converted into text information, and the voice keywords in the text information are extracted; the semantics and attributes of the extracted keywords are then looked up in a preset dictionary database. These keywords include, but are not limited to, auxiliary words, exclamations, and verbs. The preset voice recognition model then performs a combined analysis of the semantics and attributes of the keywords so as to recognize the control information.
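The preprocessing chain mentioned in step S301 (pre-emphasis, framing, windowing, and a simple endpoint check) can be sketched with NumPy as follows; the coefficient, frame sizes, and energy threshold are illustrative assumptions:

    import numpy as np

    def preprocess(signal: np.ndarray, sr: int = 16000, alpha: float = 0.97,
                   frame_ms: int = 25, hop_ms: int = 10):
        """Pre-emphasis, framing, Hamming windowing, and crude energy-based
        endpoint detection; assumes `signal` is at least one frame long."""
        emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])  # pre-emphasis
        frame_len = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        n_frames = 1 + (len(emphasized) - frame_len) // hop
        frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(n_frames)])
        frames = frames * np.hamming(frame_len)            # windowing
        energy = (frames ** 2).sum(axis=1)
        return frames[energy > 0.1 * energy.max()]         # keep voiced frames only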
Moreover, in this embodiment, after it is determined that the voice information was sent as a control instruction, and before a corresponding control instruction is generated according to the control information to control the target device to perform a corresponding action, the voice command recognition method further includes the following step: judging, based on a preset instruction list, whether the control information is included in the preset instruction list, where the preset instruction list comprises a plurality of items of executable control information. If so, a corresponding control instruction is generated according to the control information so as to control the target device to execute a corresponding action; otherwise, no processing is performed. A minimal sketch of this check appears below.
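In the following sketch of the whitelist check, the instruction names are invented for illustration and are not taken from the disclosure:

    # The preset instruction list: a set of executable control information items.
    PRESET_INSTRUCTIONS = {"power_on", "power_off", "raise_temperature", "lower_temperature"}

    def to_control_instruction(control_info: str):
        """Generate a control instruction only if the recognized control information
        appears in the preset instruction list; otherwise perform no processing."""
        if control_info in PRESET_INSTRUCTIONS:
            return {"command": control_info}
        return None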
It should be understood that, although the steps in the flowchart of FIG. 2 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not bound to a strict order and may be performed in other orders. Moreover, at least a portion of the steps in FIG. 2 may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times; the order of performing these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In another embodiment of the present invention, a storage medium is provided that stores one or more programs executable by one or more processors to implement the voice command recognition method described above.
The nouns and the implementation principle related to a storage medium in this embodiment may specifically refer to a voice command recognition method in the foregoing embodiment, and are not described herein again.
In another embodiment of the present invention, a voice command recognition apparatus is provided that includes a processor and a memory; the memory is used for storing computer instructions, and the processor is used for operating the computer instructions stored by the memory to realize the voice command recognition method.
The nouns and the implementation principle related to the voice command recognition apparatus in this embodiment may specifically refer to a voice command recognition method in the foregoing embodiment, and are not described herein again.
In another embodiment of the present invention, there is provided a voice command recognition system, as shown in fig. 1, including:
the terminal is used for acquiring voice information and behavior image data of a voice user sending the voice information when sending the voice information;
the server is in communication connection with the terminal and is used for receiving the voice information sent by the terminal and behavior image data of a voice user sending the voice information when sending the voice information;
wherein the server further comprises a processor and a memory; the memory is used for storing computer instructions, and the processor is used for operating the computer instructions stored by the memory to realize the voice command recognition method.
The nouns and the implementation principle related to the voice command recognition system in this embodiment may specifically refer to a voice command recognition method in the foregoing embodiment, and are not described herein again.
In another embodiment of the present invention, an intelligent home appliance system is provided, which includes the voice command recognition apparatus or system described above, and a home appliance in communication connection with the apparatus or system, wherein the home appliance receives a control instruction sent by the apparatus or system and executes a corresponding action according to the control instruction.
The term and the implementation principle related to an intelligent household appliance system in this embodiment may specifically refer to a voice command recognition device or a voice command recognition system in the embodiment of the present invention, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.