Disclosure of Invention
In order to overcome the defects of existing intelligent household equipment control, the invention provides an audio-and-video-based intelligent home interaction system. Compared with existing home equipment control and interaction systems, the invention combines voice and images to achieve a more natural and robust human-computer interaction experience. It also provides a unified information analysis and fusion platform, so that products of other smart-home manufacturers can be readily integrated and kept compatible, and user operation is more natural and convenient.
The specific technical scheme adopted by the invention to solve the problems is as follows:
An intelligent home interaction control system based on audio and video mainly comprises a front end, a central processing unit, a back end and a cloud end. The front end comprises the audio and video information collection modules (a microphone system and a camera system), a third-party sensor interface and a feedback display module. The central processing unit comprises an audio signal processing and information extraction module, a video signal processing and information extraction module, a third-party signal processing and information extraction interface module and an information fusion module. The back end comprises a control signal transmitting module and a cloud server communication module. The cloud end is a cloud server.
The microphone system is a microphone array. It collects sound information in the home environment in real time at a specified sampling frequency and with a specified coding mode, and transmits the original audio signal to the audio signal analysis and information extraction module.
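Purely as an illustration, the capture loop might look as follows in Python using the `sounddevice` library; the 16 kHz rate, 16-bit encoding and four-channel array are assumptions of this sketch, since the invention does not fix these values:

```python
# Minimal capture sketch (assumed parameters: 16 kHz, 16-bit PCM, 4-mic array).
import sounddevice as sd

SAMPLE_RATE = 16000   # assumed sampling frequency in Hz
CHANNELS = 4          # assumed number of microphones in the array
BLOCK_SECONDS = 1.0   # length of each capture block

def capture_block():
    """Record one block of multi-channel audio and return it as int16 samples."""
    frames = int(SAMPLE_RATE * BLOCK_SECONDS)
    block = sd.rec(frames, samplerate=SAMPLE_RATE, channels=CHANNELS, dtype="int16")
    sd.wait()  # block until the recording finishes
    return block  # shape: (frames, CHANNELS)
```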
The audio signal analysis and information extraction module performs preprocessing such as noise reduction, echo cancellation and sound source separation on the collected sound signals, and then carries out sound source localization, speaker recognition, voice wake-up, speech recognition and instruction detection.
Firstly, a Kalman filter performs preliminary denoising on the signal of each channel, followed by endpoint detection and signal segmentation. Since the segmented signals may contain a mixture of several sound sources, the module separates the different sources with a non-negative matrix factorization (NMF) algorithm and extracts the target source. The signals are then passed through a GCC-based delay-and-sum beamforming stage for multi-channel noise reduction and echo cancellation, suppressing noise and echo.
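A minimal sketch of the delay-and-sum step is given below, assuming the per-channel delays are already known (for example from the GCC step described next) and rounding them to whole samples; a real beamformer would use fractional delays and adaptive weights:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its integer-sample delay and average them.

    channels: (num_mics, num_samples) float array
    delays:   per-mic arrival delays in samples, relative to a reference mic
    """
    num_mics, num_samples = channels.shape
    out = np.zeros(num_samples)
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -int(d))  # advance the channel so all channels align
    # Coherent speech adds constructively; diffuse noise partially averages out.
    return out / num_mics
```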
The sound source localization system determines the location of a sound source from the time difference of arrival (TDOA) of the received signals across different channels, while applying multi-channel noise and echo suppression. Once the sound source is located, the system automatically adjusts its orientation according to the speaker's position, so that the system faces the user at a suitable angle.
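A common way to estimate the TDOA between two microphones is generalized cross-correlation with phase transform (GCC-PHAT); the sketch below is one standard formulation, not necessarily the exact variant used by the invention:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time difference of arrival between sig and ref via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                        # TDOA in seconds
```

Given TDOAs for several microphone pairs and the known array geometry, the azimuth of the speaker can be solved and the system rotated toward it.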
The signal, after noise reduction and echo cancellation, is then input into the speaker verification module. This module judges whether the user has the right to use the system, identifying the speaker with an i-vector algorithm. An unauthorized user has no control rights over the system.
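Extracting an i-vector requires a trained universal background model and total-variability matrix, which are beyond the scope of a sketch; assuming i-vectors are already available, a simple cosine-scoring check against enrolled household members might look like this (the 0.6 threshold is illustrative):

```python
import numpy as np

def cosine_score(ivec_test, ivec_enrolled):
    """Cosine similarity between a test i-vector and an enrolled speaker i-vector."""
    a = ivec_test / np.linalg.norm(ivec_test)
    b = ivec_enrolled / np.linalg.norm(ivec_enrolled)
    return float(np.dot(a, b))

def is_authorized(ivec_test, enrolled_ivectors, threshold=0.6):
    """Grant control only if the utterance matches some enrolled household member."""
    return any(cosine_score(ivec_test, ev) >= threshold
               for ev in enrolled_ivectors.values())
```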
If the user has usage rights, the voice wake-up module judges whether the detected sound contains the wake-up keyword. If so, the system enters the active interaction mode from the sleep mode, and subsequently detected sound signals are sent directly to the speech recognition and natural semantic understanding module.
The speech recognition module converts the voice signal into text, and a natural language understanding step then analyzes the text to detect a control or interaction instruction.
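As an illustration only, wake-word detection and instruction detection on the recognized text could be approximated by simple keyword spotting; the vocabularies below are hypothetical, and a production system would use a trained language-understanding model:

```python
def detect_wake(transcript, wake_words=("turn on the light",)):
    """Return True if the recognized text contains a wake-up keyword."""
    return any(w in transcript.lower() for w in wake_words)

def parse_command(transcript, action_words, device_words):
    """Tiny keyword-spotting stand-in for the natural-language-understanding step."""
    text = transcript.lower()
    action = next((a for a in action_words if a in text), None)
    device = next((d for d in device_words if d in text), None)
    if action and device:
        return {"action": action, "device": device}
    return None

# e.g. parse_command("please turn on the lamp",
#                    ["turn on", "turn up"], ["lamp", "air conditioning"])
# -> {"action": "turn on", "device": "lamp"}
```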
The camera system comprises a common camera and a depth camera. It is responsible for collecting the user's action and activity information. In particular, it is used to detect face, gesture, and motion information of a user.
Firstly, face detection is carried out on the RGB images obtained by the common camera. Once a human face is detected, the relevant image undergoes face recognition and identity verification: the system compares the detected face with the pre-stored faces of authorized users (based on facial features and machine learning). If verification succeeds, the motion recognition module is activated. The input of this module is the depth image acquired by the depth camera, which is first used for real-time skeleton tracking to obtain information such as human joint positions. The skeleton tracking information can also be used to locate the user, and the system automatically adjusts its orientation according to the user's position so that it faces the user at a suitable angle.
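A minimal sketch of the identity-verification step follows; the feature extractor that produces the face embedding is assumed to exist elsewhere, and the distance threshold is illustrative:

```python
import numpy as np

def verify_face(face_embedding, authorized_embeddings, max_distance=0.8):
    """Compare a detected face's feature vector with pre-stored authorized users.

    face_embedding:        vector from some face-feature extractor (not shown)
    authorized_embeddings: {user_name: enrolled feature vector}
    Returns the matched user name, or None if no authorized user matches.
    """
    best_user, best_dist = None, float("inf")
    for user, enrolled in authorized_embeddings.items():
        dist = np.linalg.norm(face_embedding - enrolled)  # Euclidean distance
        if dist < best_dist:
            best_user, best_dist = user, dist
    return best_user if best_dist <= max_distance else None
```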
The body joint information is then compared with the actions in the system's action library. Once a matching action is found, the instruction information associated with that action is generated.
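One plausible way to compare a tracked joint sequence against the templates in the action library is dynamic time warping (DTW); the invention does not specify the matching algorithm, so the following is a sketch under that assumption:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic-time-warping distance between two joint-position sequences.

    seq_a, seq_b: arrays of shape (frames, features), e.g. flattened joint coordinates.
    """
    la, lb = len(seq_a), len(seq_b)
    cost = np.full((la + 1, lb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[la, lb]

def match_action(observed, action_library, threshold=50.0):
    """Return the library action closest to the observed skeleton sequence."""
    name, dist = min(((n, dtw_distance(observed, t)) for n, t in action_library.items()),
                     key=lambda x: x[1])
    return name if dist <= threshold else None
```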
The third-party sensor interface and the third-party signal processing and information extraction interface module are reserved for function expansion, providing corresponding interfaces so that other developers can implement customized functions in the future.
The feedback display module is used for communication and interaction between the system and the user. When an instruction is recognized ambiguously or incorrectly, the user can confirm or correct it through the feedback display module.
The information fusion module fuses the detected voice instruction, gesture instruction and other instruction information, and judges the user's instruction by probability. The mathematical description is as follows:

P(I) = w_a · P_a(I) + w_v · P_v(I) + w_o · P_o(I)

wherein P(I) is the predicted probability value of instruction I; P_a(I), P_v(I) and P_o(I) are respectively the probabilities of instruction I predicted from the voice, the video and the other sensors; and w_a, w_v and w_o are respectively the voice, video and other sensor signal weights.
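Under the formula above, fusion reduces to a weighted sum per candidate instruction; the weight values in this sketch are illustrative placeholders, as the invention leaves them open:

```python
def fuse(candidates, audio_probs, video_probs, other_probs,
         w_audio=0.5, w_video=0.4, w_other=0.1):
    """Weighted fusion of per-modality instruction probabilities.

    Each *_probs maps an instruction to that modality's predicted probability
    (0.0 when the modality saw nothing). Weights are illustrative and sum to 1.
    """
    scores = {}
    for inst in candidates:
        scores[inst] = (w_audio * audio_probs.get(inst, 0.0)
                        + w_video * video_probs.get(inst, 0.0)
                        + w_other * other_probs.get(inst, 0.0))
    best = max(scores, key=scores.get)
    return best, scores[best]  # the instruction with maximum fused probability
```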
The control signal transmitting module converts a control command into a signal that can actually control the household appliance, using wireless communication modes such as infrared, RF (radio frequency), Bluetooth, Wi-Fi, Zigbee and Z-Wave.
The cloud server communication module handles communication between the information fusion module and the cloud server. The local end can send a resource acquisition instruction to the cloud, and the corresponding resources are returned to the local end through this module. The cloud can also send instructions to the local end through this module, enabling remote control of household appliances or transmission of home information to the cloud.
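The patent does not fix a transport protocol; purely as an assumption, a resource request over HTTPS with JSON payloads might look as follows (the endpoint URL is hypothetical):

```python
import requests  # assumed HTTPS transport; the invention leaves the protocol open

CLOUD_URL = "https://cloud.example.com/api"  # hypothetical endpoint

def request_resource(resource_query):
    """Send a resource-acquisition instruction to the cloud and return the result."""
    resp = requests.post(f"{CLOUD_URL}/resource",
                         json={"query": resource_query}, timeout=10)
    resp.raise_for_status()
    return resp.json()
```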
The cloud server is used to: a) provide additional computing resources for the local end; b) provide additional storage space or data backup for the local end; c) provide an information exchange platform for user terminals such as mobile phones; d) provide other information to the user, such as query searches or music.
The invention has the following beneficial effects: 1) the front end adopts a voice and gesture recognition interaction mode, improving the naturalness of interaction; 2) the voice and visual interaction modes are independent and complementary, able to work separately or cooperatively, which overcomes the limitations of any single interaction mode in the home and improves the robustness of human-computer interaction; 3) a third-party interface is provided, so third-party developers can add signal processing and information extraction functions for other sensors as required, giving the system good extensibility; 4) the back end supports multiple wireless communication modes, providing good compatibility; 5) both local and remote working modes are provided: the local mode physically ensures the security and privacy of the user's system, while the remote mode provides the user with additional information and more advanced services.
Detailed Description
Aiming at the problems in the prior art, the invention provides an intelligent home interaction system which is based on intelligent audio and video analysis and processing technology, improves the convenience, comfort and control accuracy of human-computer interaction, and has high compatibility and expandability.
In order to make the technical solution of the present invention clearer, the following detailed description of the present invention is made with reference to the accompanying drawings and examples, and the description is to be considered as exemplary.
As shown in fig. 1, the system includes a front end, a central processing unit, a back end and a cloud end. The front end is mainly responsible for collecting sound, image and other signals and for displaying the system's feedback; the central processing unit is mainly responsible for processing the collected sound and visual signals and extracting useful instruction information with machine learning and pattern recognition methods; the back end is mainly responsible for converting the acquired instructions into transmittable signals to control home appliances and the like; meanwhile, information can be obtained from and exchanged with the cloud server at the cloud end.
While powered on, the invention detects the sound and image signals in the home in real time.
A detailed flow chart of the audio signal processing and information extraction of the present invention is shown in fig. 2. Suppose a user at home says "turn on the light". The sound is detected by the microphone system (step 201); after preliminary denoising of the multi-channel audio signal (step 202), endpoint detection and segmentation are performed (step 203), and the audio segment containing "turn on the light" is extracted. When multiple sound sources are active simultaneously (e.g., several users speaking at once, or music playing while the user speaks), the system separates the sound sources (step 204) and strips off the background sound. Meanwhile, the invention analyzes the direction of the sound (step 205) so as to adjust the orientation of the system in time (step 206): for example, when the user is behind the system, the system can rotate 180 degrees to face the user. After further noise reduction and echo cancellation (step 207), the system verifies the user and ignores the input if the speaker is not an authorized member; if the speaker is authorized, the input sound is processed further (step 208) and wake-up detection is performed (step 209). If the user's voice matches a wake-up keyword such as "turn on the light", the system switches from the sleep state to the awake state; otherwise it keeps listening for the wake-up instruction. After the system wakes up, speech recognition is performed on the subsequent user's voice (step 210). For example, when the recognition result is "please turn on the lamp", "turn up the air conditioning temperature", "play Zhou Jielun's Blue and White Porcelain" or "view my unread mail", the system extracts the keywords through natural semantic understanding (step 211), such as "turn on", "lamp", "turn up", "air conditioning", "temperature", "play", "Zhou Jielun", "Blue and White Porcelain", "view", "my" and "unread mail". These keywords are sent to the information fusion module (module 15) for further processing.
The invention detects the video signal in real time while detecting the audio signal. The detailed flow of video signal processing and information extraction is shown in fig. 3. The input of the module is a video signal of two types: an ordinary RGB image signal (301) and a depth image signal (302). Firstly, the module performs face detection (303) on the RGB image in real time, and performs face recognition and identity confirmation (304) when a face is detected. Once the identity is confirmed and has the corresponding usage rights, further operations are permitted; otherwise the flow returns to the face detection step. At the same time, the module performs real-time skeleton tracking (305) on the depth image; the tracking information can be used to locate the user (306) and to adjust the orientation of the system in real time for the best detection effect (307). Once the user's identity is confirmed, the user's skeleton information is used for action recognition (309), with the recognizable actions stored in an action library (308). Finally, the recognized action is translated into an instruction (311) from an instruction library (310), and the instruction is sent to the information fusion module for further processing.
When the system detects a voice or gesture command signal, the information fusion module of the present invention (shown in fig. 4) decides the final command by maximum probability. Some typical application scenarios are given below.
1) Only the audio system is active. For example, when a user is cooking, both hands are occupied. If the user wants to listen to music, the system can be woken by voice and told which song to play.
2) Only the video system is active. For example, at a noisy family party, the host can control the home devices through gesture instructions.
3) The audio and the video are active simultaneously. In this case, the audio and video information complement each other and improve the recognition accuracy of the instruction. For example, when a user says "turn off the light" while pointing at a particular lamp, the invention combines the voice and gesture commands to turn off that particular lamp.
As mentioned above, the audio system and the video system of the invention can work independently or jointly, achieving highly integrated human-computer interaction while improving the robustness of instruction recognition. If the maximum probability obtained by the information fusion module is below a specified threshold, or the audio and video commands conflict (that is, the instruction recognition is uncertain), the system seeks confirmation from the user through the feedback display module (module 14). The invention supports three feedback modes: speech, images and text. Text feedback is displayed directly on the feedback display module, while speech is synthesized first and then played through the user feedback module. For example, the invention can feed back "Are you sure you want to turn off the lamp?" Similarly, images can be shown on the feedback display module to improve the interactivity of the system. The user can then confirm by voice or gesture, avoiding erroneous operation.
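The confirmation trigger can be stated compactly; the 0.7 threshold below is a placeholder, since the patent leaves the threshold value unspecified:

```python
def needs_confirmation(best_prob, audio_cmd, video_cmd, threshold=0.7):
    """Ask the user to confirm when fusion is uncertain or the modalities conflict."""
    conflict = (audio_cmd is not None and video_cmd is not None
                and audio_cmd != video_cmd)
    return best_prob < threshold or conflict
```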
The information fusion module then delivers the command, according to its type, to the control signal transmitting module (module 16) or to the cloud server communication module (module 17) for processing.
Commands related to household appliances, such as "turn on the lamp", are sent to the control signal transmitting module. The module converts "turn on the lamp" into a specific signal that the lamp controller can receive, and transmits it. The signal may be infrared, RF, Bluetooth, Wi-Fi, Zigbee, Z-Wave, etc. Similarly, the user may use motion commands, such as left and right hand gestures to switch the music being played, and up and down gestures to adjust the volume.
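How a fused command is routed to the right radio is left open by the patent; one plausible arrangement keeps a table from device to transport, as sketched below (the send functions wrapping the actual IR/Zigbee/etc. drivers are assumed to exist elsewhere in the back end):

```python
def dispatch(command, device_table):
    """Route a fused command to the transport its target device understands.

    device_table maps a device name to (protocol, send_function); the send
    functions (send_ir, send_zigbee, ...) are assumed wrappers around the
    actual radio drivers.
    """
    protocol, send = device_table[command["device"]]
    payload = {"action": command["action"], "protocol": protocol}
    send(payload)  # e.g. an IR blaster for the lamp, Zigbee for the thermostat

# e.g. dispatch({"action": "turn on", "device": "lamp"},
#               {"lamp": ("infrared", send_ir)})
```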
Instructions related to the internet, such as information queries, are sent to the cloud through the cloud server communication module. For example, for "view my unread mail", the instruction is sent to the cloud server, which fetches the unread mail and returns it to the local end; for "download Zhou Jielun's Blue and White Porcelain", the module downloads the song through a music library on a network-connected server.
The cloud server mentioned above is connected to the local end. Its functions include, but are not limited to, the following examples.
1) Providing additional computing resources for the local end. For the speech recognition, face recognition and similar tasks involved in the invention, transferring part or all of the computation to the cloud server saves local computing resources and improves recognition accuracy.
2) Providing information backup and storage space for the local end. Users can save documents, pictures, videos and other data to the cloud as needed. The advantage of this example is that users can retrieve the data anywhere, at any time, via the internet.
3) Providing a resource portal for third parties. For example, by connecting the system's cloud server with a third-party music library, songs can be obtained and returned for playback, meeting the user's entertainment needs. For another example, through the cloud server the user can query goods online, providing an entrance for electronic commerce.
4) Providing an entrance for information exchange for mobile terminals (such as mobile phones and tablets). The user can connect to the cloud server through a mobile phone APP and have the cloud server forward control signals to the local end, thereby controlling the household appliances. This embodiment meets the user's need to control household appliances remotely. For another example, the mobile terminal may query the situation at home through the cloud server: the invention can send a request for images or video to the local end through the cloud server.
The cloud server communicates bidirectionally with the local end: it provides the user at home with an internet entrance for obtaining external information, and provides outside users with an entrance to the local end for learning about and monitoring conditions in the house.
In addition, the cloud server is a user-selectable module. When the cloud server module is disabled, the invention works in local mode and the communication channel to external information is cut off. This guarantees the user's information security, at the cost of losing the functions provided by the cloud server.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.