Disclosure of Invention
In order to overcome the defects of existing intelligent household equipment control, the invention provides an audio-and-video-based intelligent home interaction system. Compared with existing home equipment control and interaction systems, the invention combines voice and images to achieve a more natural and robust human-computer interaction experience. It also provides a unified information analysis and fusion platform, so that products of other smart-home manufacturers can be readily integrated and kept compatible, and user operation is more natural and convenient.
The specific technical scheme adopted by the invention to solve the problems is as follows:
An intelligent home interaction control system based on audio and video mainly comprises a front end, a central processing unit, a back end and a cloud end. The front end comprises the audio and video information collection modules (a microphone system and a camera system), a third-party sensor interface and a feedback display module. The central processing unit comprises an audio signal processing and information extraction module, a video signal processing and information extraction module, a third-party signal processing and information extraction interface module and an information fusion module. The back end comprises a control signal transmitting module and a cloud server communication module. The cloud end is a cloud server.
The microphone system is a microphone array. It collects sound information in the home environment in real time at a specified sampling frequency and with a specified coding mode, and transmits the original audio signal to the audio signal analysis and information extraction module.
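Purely as an illustration, the capture loop might look as follows in Python using the `sounddevice` library; the 16 kHz rate, 16-bit encoding and four-channel array are assumptions of this sketch, since the invention does not fix these values:

```python
# Minimal capture sketch (assumed parameters: 16 kHz, 16-bit PCM, 4-mic array).
import sounddevice as sd

SAMPLE_RATE = 16000   # assumed sampling frequency in Hz
CHANNELS = 4          # assumed number of microphones in the array
BLOCK_SECONDS = 1.0   # length of each capture block

def capture_block():
    """Record one block of multi-channel audio and return it as int16 samples."""
    frames = int(SAMPLE_RATE * BLOCK_SECONDS)
    block = sd.rec(frames, samplerate=SAMPLE_RATE, channels=CHANNELS, dtype="int16")
    sd.wait()  # block until the recording finishes
    return block  # shape: (frames, CHANNELS)
```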
The audio signal analysis and information extraction module performs preprocessing such as noise reduction, echo cancellation and sound source separation on the collected sound signals, and then carries out sound source localization, speaker recognition, voice wake-up, speech recognition and instruction detection.
Firstly, a Kalman filter performs preliminary denoising on the signal of each channel, followed by endpoint detection and signal segmentation. Since the segmented signals may contain a mixture of several sound sources, the module separates the different sources with a non-negative matrix factorization (NMF) algorithm and extracts the target source. The signals are then passed through a GCC-based delay-and-sum beamforming stage for multi-channel noise reduction and echo cancellation, suppressing noise and echo.
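A minimal sketch of the delay-and-sum step is given below, assuming the per-channel delays are already known (for example from the GCC step described next) and rounding them to whole samples; a real beamformer would use fractional delays and adaptive weights:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its integer-sample delay and average them.

    channels: (num_mics, num_samples) float array
    delays:   per-mic arrival delays in samples, relative to a reference mic
    """
    num_mics, num_samples = channels.shape
    out = np.zeros(num_samples)
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -int(d))  # advance the channel so all channels align
    # Coherent speech adds constructively; diffuse noise partially averages out.
    return out / num_mics
```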
The sound source localization system determines the location of a sound source from the time difference of arrival (TDOA) of the received signals across different channels, while applying multi-channel noise and echo suppression. Once the sound source is located, the system automatically adjusts its orientation according to the speaker's position, so that the system faces the user at a suitable angle.
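A common way to estimate the TDOA between two microphones is generalized cross-correlation with phase transform (GCC-PHAT); the sketch below is one standard formulation, not necessarily the exact variant used by the invention:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time difference of arrival between sig and ref via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                        # TDOA in seconds
```

Given TDOAs for several microphone pairs and the known array geometry, the azimuth of the speaker can be solved and the system rotated toward it.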
The signal, after noise reduction and echo cancellation, is then input into the speaker verification module. This module judges whether the user has the right to use the system, identifying the speaker with an i-vector algorithm. An unauthorized user has no control rights over the system.
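Extracting an i-vector requires a trained universal background model and total-variability matrix, which are beyond the scope of a sketch; assuming i-vectors are already available, a simple cosine-scoring check against enrolled household members might look like this (the 0.6 threshold is illustrative):

```python
import numpy as np

def cosine_score(ivec_test, ivec_enrolled):
    """Cosine similarity between a test i-vector and an enrolled speaker i-vector."""
    a = ivec_test / np.linalg.norm(ivec_test)
    b = ivec_enrolled / np.linalg.norm(ivec_enrolled)
    return float(np.dot(a, b))

def is_authorized(ivec_test, enrolled_ivectors, threshold=0.6):
    """Grant control only if the utterance matches some enrolled household member."""
    return any(cosine_score(ivec_test, ev) >= threshold
               for ev in enrolled_ivectors.values())
```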
If the user has usage rights, the voice wake-up module judges whether the detected sound contains the wake-up keyword. If so, the system enters the active interaction mode from the sleep mode, and subsequently detected sound signals are sent directly to the speech recognition and natural semantic understanding module.
The speech recognition module converts the voice signal into text, and a natural language understanding step then analyzes the text to detect a control or interaction instruction.
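As an illustration only, wake-word detection and instruction detection on the recognized text could be approximated by simple keyword spotting; the vocabularies below are hypothetical, and a production system would use a trained language-understanding model:

```python
def detect_wake(transcript, wake_words=("turn on the light",)):
    """Return True if the recognized text contains a wake-up keyword."""
    return any(w in transcript.lower() for w in wake_words)

def parse_command(transcript, action_words, device_words):
    """Tiny keyword-spotting stand-in for the natural-language-understanding step."""
    text = transcript.lower()
    action = next((a for a in action_words if a in text), None)
    device = next((d for d in device_words if d in text), None)
    if action and device:
        return {"action": action, "device": device}
    return None

# e.g. parse_command("please turn on the lamp",
#                    ["turn on", "turn up"], ["lamp", "air conditioning"])
# -> {"action": "turn on", "device": "lamp"}
```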
The camera system comprises a common camera and a depth camera. It is responsible for collecting the user's action and activity information. In particular, it is used to detect face, gesture, and motion information of a user.
Firstly, face detection is carried out on the RGB images obtained by the common camera. Once a human face is detected, the relevant image undergoes face recognition and identity verification: the system compares the detected face with the pre-stored faces of authorized users (based on facial features and machine learning). If verification succeeds, the motion recognition module is activated. The input of this module is the depth image acquired by the depth camera, which is first used for real-time skeleton tracking to obtain information such as human joint positions. The skeleton tracking information can also be used to locate the user, and the system automatically adjusts its orientation according to the user's position so that it faces the user at a suitable angle.
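A minimal sketch of the identity-verification step follows; the feature extractor that produces the face embedding is assumed to exist elsewhere, and the distance threshold is illustrative:

```python
import numpy as np

def verify_face(face_embedding, authorized_embeddings, max_distance=0.8):
    """Compare a detected face's feature vector with pre-stored authorized users.

    face_embedding:        vector from some face-feature extractor (not shown)
    authorized_embeddings: {user_name: enrolled feature vector}
    Returns the matched user name, or None if no authorized user matches.
    """
    best_user, best_dist = None, float("inf")
    for user, enrolled in authorized_embeddings.items():
        dist = np.linalg.norm(face_embedding - enrolled)  # Euclidean distance
        if dist < best_dist:
            best_user, best_dist = user, dist
    return best_user if best_dist <= max_distance else None
```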
The body joint information is then compared with the actions in the system's action library. Once a matching action is found, the instruction information associated with that action is generated.
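One plausible way to compare a tracked joint sequence against the templates in the action library is dynamic time warping (DTW); the invention does not specify the matching algorithm, so the following is a sketch under that assumption:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic-time-warping distance between two joint-position sequences.

    seq_a, seq_b: arrays of shape (frames, features), e.g. flattened joint coordinates.
    """
    la, lb = len(seq_a), len(seq_b)
    cost = np.full((la + 1, lb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[la, lb]

def match_action(observed, action_library, threshold=50.0):
    """Return the library action closest to the observed skeleton sequence."""
    name, dist = min(((n, dtw_distance(observed, t)) for n, t in action_library.items()),
                     key=lambda x: x[1])
    return name if dist <= threshold else None
```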
The third-party sensor interface and the third-party signal processing and information extraction interface module are reserved for function expansion, providing corresponding interfaces so that other developers can implement customized functions in the future.
The feedback display module is used for communication and interaction between the system and the user. When an instruction is recognized ambiguously or incorrectly, the user can confirm or correct it through the feedback display module.
The information fusion module fuses the detected voice instruction, gesture instruction and other instruction information, and judges the user's instruction by probability. The mathematical description is as follows:

P(I) = w_a · P_a(I) + w_v · P_v(I) + w_o · P_o(I)

wherein P(I) is the predicted probability value of instruction I; P_a(I), P_v(I) and P_o(I) are respectively the probabilities of instruction I predicted from the voice, the video and the other sensors; and w_a, w_v and w_o are respectively the voice, video and other sensor signal weights.
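Under the formula above, fusion reduces to a weighted sum per candidate instruction; the weight values in this sketch are illustrative placeholders, as the invention leaves them open:

```python
def fuse(candidates, audio_probs, video_probs, other_probs,
         w_audio=0.5, w_video=0.4, w_other=0.1):
    """Weighted fusion of per-modality instruction probabilities.

    Each *_probs maps an instruction to that modality's predicted probability
    (0.0 when the modality saw nothing). Weights are illustrative and sum to 1.
    """
    scores = {}
    for inst in candidates:
        scores[inst] = (w_audio * audio_probs.get(inst, 0.0)
                        + w_video * video_probs.get(inst, 0.0)
                        + w_other * other_probs.get(inst, 0.0))
    best = max(scores, key=scores.get)
    return best, scores[best]  # the instruction with maximum fused probability
```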
The control signal transmitting module converts a control command into a signal that can actually control the household appliance, using wireless communication modes such as infrared, RF (radio frequency), Bluetooth, Wi-Fi, Zigbee and Z-Wave.
The cloud server communication module handles communication between the information fusion module and the cloud server. The local end can send a resource acquisition instruction to the cloud, and the corresponding resources are returned to the local end through this module. The cloud can also send instructions to the local end through this module, enabling remote control of household appliances or transmission of home information to the cloud.
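The patent does not fix a transport protocol; purely as an assumption, a resource request over HTTPS with JSON payloads might look as follows (the endpoint URL is hypothetical):

```python
import requests  # assumed HTTPS transport; the invention leaves the protocol open

CLOUD_URL = "https://cloud.example.com/api"  # hypothetical endpoint

def request_resource(resource_query):
    """Send a resource-acquisition instruction to the cloud and return the result."""
    resp = requests.post(f"{CLOUD_URL}/resource",
                         json={"query": resource_query}, timeout=10)
    resp.raise_for_status()
    return resp.json()
```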
The cloud server is used to: a) provide additional computing resources for the local end; b) provide additional storage space or data backup for the local end; c) provide an information exchange platform for user terminals such as mobile phones; d) provide other information to the user, such as query searches or music.
The invention has the following beneficial effects: 1) the front end adopts a voice and gesture recognition interaction mode, improving the naturalness of interaction; 2) the voice and visual interaction modes are independent and complementary, able to work separately or cooperatively, which overcomes the limitations of any single interaction mode in the home and improves the robustness of human-computer interaction; 3) a third-party interface is provided, so third-party developers can add signal processing and information extraction functions for other sensors as required, giving the system good extensibility; 4) the back end supports multiple wireless communication modes, providing good compatibility; 5) both local and remote working modes are provided: the local mode physically ensures the security and privacy of the user's system, while the remote mode provides the user with additional information and more advanced services.
Detailed Description
Aiming at the problems in the prior art, the invention provides an intelligent home interaction system which is based on intelligent audio and video analysis and processing technology, improves the convenience, comfort and control accuracy of human-computer interaction, and has high compatibility and expandability.
In order to make the technical solution of the present invention clearer, the following detailed description of the present invention is made with reference to the accompanying drawings and examples, and the description is to be considered as exemplary.
As shown in fig. 1, the system includes a front end, a central processing unit, a back end and a cloud end. The front end is mainly responsible for collecting sound, image and other signals and for displaying the system's feedback; the central processing unit is mainly responsible for processing the collected sound and visual signals and extracting useful instruction information with machine learning and pattern recognition methods; the back end is mainly responsible for converting the acquired instructions into transmittable signals to control home appliances and the like; meanwhile, information can be obtained from and exchanged with the cloud server at the cloud end.
While powered on, the invention detects the sound and image signals in the home in real time.
A detailed flow chart of the audio signal processing and information extraction of the present invention is shown in fig. 2. Suppose a user at home says "turn on the light". The sound is detected by the microphone system (step 201); after preliminary denoising of the multi-channel audio signal (step 202), endpoint detection and segmentation are performed (step 203), and the audio segment containing "turn on the light" is extracted. When multiple sound sources are active simultaneously (e.g., several users speaking at once, or music playing while the user speaks), the system separates the sound sources (step 204) and strips off the background sound. Meanwhile, the invention analyzes the direction of the sound (step 205) so as to adjust the orientation of the system in time (step 206): for example, when the user is behind the system, the system can rotate 180 degrees to face the user. After further noise reduction and echo cancellation (step 207), the system verifies the user and ignores the input if the speaker is not an authorized member; if the speaker is authorized, the input sound is processed further (step 208) and wake-up detection is performed (step 209). If the user's voice matches a wake-up keyword such as "turn on the light", the system switches from the sleep state to the awake state; otherwise it keeps listening for the wake-up instruction. After the system wakes up, speech recognition is performed on the subsequent user's voice (step 210). For example, when the recognition result is "please turn on the lamp", "turn up the air conditioning temperature", "play Zhou Jielun's Blue and White Porcelain" or "view my unread mail", the system extracts the keywords through natural semantic understanding (step 211), such as "turn on", "lamp", "turn up", "air conditioning", "temperature", "play", "Zhou Jielun", "Blue and White Porcelain", "view", "my" and "unread mail". These keywords are sent to the information fusion module (module 15) for further processing.
The invention detects the video signal in real time while detecting the audio signal. The detailed flow of video signal processing and information extraction is shown in fig. 3. The input of the module is a video signal of two types: an ordinary RGB image signal (301) and a depth image signal (302). Firstly, the module performs face detection (303) on the RGB image in real time, and performs face recognition and identity confirmation (304) when a face is detected. Once the identity is confirmed and has the corresponding usage rights, further operations are permitted; otherwise the flow returns to the face detection step. At the same time, the module performs real-time skeleton tracking (305) on the depth image; the tracking information can be used to locate the user (306) and to adjust the orientation of the system in real time for the best detection effect (307). Once the user's identity is confirmed, the user's skeleton information is used for action recognition (309), with the recognizable actions stored in an action library (308). Finally, the recognized action is translated into an instruction (311) from an instruction library (310), and the instruction is sent to the information fusion module for further processing.
When the system detects a voice or gesture command signal, the information fusion module of the present invention (shown in fig. 4) decides the final command by maximum probability. Some typical application scenarios are given below.
1) Only the audio system is active. For example, when a user is cooking, both hands are occupied. If the user wants to listen to music, the system can be woken by voice and told which song to play.
2) Only the video system is active. For example, at a noisy family party, the host can control the home devices through gesture instructions.
3) The audio and the video are active simultaneously. In this case, the audio and video information complement each other and improve the recognition accuracy of the instruction. For example, when a user says "turn off the light" while pointing at a particular lamp, the invention combines the voice and gesture commands to turn off that particular lamp.
As mentioned above, the audio system and the video system of the invention can work independently or jointly, achieving highly integrated human-computer interaction while improving the robustness of instruction recognition. If the maximum probability obtained by the information fusion module is below a specified threshold, or the audio and video commands conflict (that is, the instruction recognition is uncertain), the system seeks confirmation from the user through the feedback display module (module 14). The invention supports three feedback modes: speech, images and text. Text feedback is displayed directly on the feedback display module, while speech is synthesized first and then played through the user feedback module. For example, the invention can feed back "Are you sure you want to turn off the lamp?" Similarly, images can be shown on the feedback display module to improve the interactivity of the system. The user can then confirm by voice or gesture, avoiding erroneous operation.
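The confirmation trigger can be stated compactly; the 0.7 threshold below is a placeholder, since the patent leaves the threshold value unspecified:

```python
def needs_confirmation(best_prob, audio_cmd, video_cmd, threshold=0.7):
    """Ask the user to confirm when fusion is uncertain or the modalities conflict."""
    conflict = (audio_cmd is not None and video_cmd is not None
                and audio_cmd != video_cmd)
    return best_prob < threshold or conflict
```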
The information fusion module then delivers the command, according to its type, to the control signal transmitting module (module 16) or to the cloud server communication module (module 17) for processing.
Commands related to household appliances, such as "turn on the lamp", are sent to the control signal transmitting module. The module converts "turn on the lamp" into a specific signal that the lamp controller can receive, and transmits it. The signal may be infrared, RF, Bluetooth, Wi-Fi, Zigbee, Z-Wave, etc. Similarly, the user may use motion commands, such as left and right hand gestures to switch the music being played, and up and down gestures to adjust the volume.
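How a fused command is routed to the right radio is left open by the patent; one plausible arrangement keeps a table from device to transport, as sketched below (the send functions wrapping the actual IR/Zigbee/etc. drivers are assumed to exist elsewhere in the back end):

```python
def dispatch(command, device_table):
    """Route a fused command to the transport its target device understands.

    device_table maps a device name to (protocol, send_function); the send
    functions (send_ir, send_zigbee, ...) are assumed wrappers around the
    actual radio drivers.
    """
    protocol, send = device_table[command["device"]]
    payload = {"action": command["action"], "protocol": protocol}
    send(payload)  # e.g. an IR blaster for the lamp, Zigbee for the thermostat

# e.g. dispatch({"action": "turn on", "device": "lamp"},
#               {"lamp": ("infrared", send_ir)})
```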
Instructions related to the internet, such as information queries, are sent to the cloud through the cloud server communication module. For example, for "view my unread mail", the instruction is sent to the cloud server, which fetches the unread mail and returns it to the local end; for "download Zhou Jielun's Blue and White Porcelain", the module downloads the song through a music library on a network-connected server.
The cloud server mentioned above is connected to the local end. Its functions include, but are not limited to, the following examples.
1) Providing additional computing resources for the local end. For the speech recognition, face recognition and similar tasks involved in the invention, transferring part or all of the computation to the cloud server saves local computing resources and improves recognition accuracy.
2) Providing information backup and storage space for the local end. Users can save documents, pictures, videos and other data to the cloud as needed. The advantage of this example is that users can retrieve the data anywhere, at any time, via the internet.
3) Providing a resource portal for third parties. For example, by connecting the system's cloud server with a third-party music library, songs can be obtained and returned for playback, meeting the user's entertainment needs. For another example, through the cloud server the user can query goods online, providing an entrance for electronic commerce.
4) Providing an entrance for information exchange for mobile terminals (such as mobile phones and tablets). The user can connect to the cloud server through a mobile phone APP and have the cloud server forward control signals to the local end, thereby controlling the household appliances. This embodiment meets the user's need to control household appliances remotely. For another example, the mobile terminal may query the situation at home through the cloud server: the invention can send a request for images or video to the local end through the cloud server.
The cloud server communicates bidirectionally with the local end: it provides the user at home with an internet entrance for obtaining external information, and provides outside users with an entrance to the local end for learning about and monitoring conditions in the house.
In addition, the cloud server is a user-selectable module. When the cloud server module is disabled, the invention works in local mode and the communication channel to external information is cut off. This guarantees the user's information security, at the cost of losing the functions provided by the cloud server.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.