Detailed Description
For the purpose of making the objects and embodiments of the present invention more apparent, exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings, in which exemplary embodiments of the present invention are illustrated. It is apparent that the described exemplary embodiments are only some, rather than all, of the embodiments of the present invention.
It should be noted that the brief description of terminology in the present invention is provided only to facilitate understanding of the embodiments described below and is not intended to limit the embodiments of the present invention. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first, second, third and the like in the description, in the claims, and in the above-described figures are used for distinguishing between similar or like objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and any variations thereof herein are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
In the related art, a user may trigger a voice interaction function of an electronic device through a preset wake-up word corresponding to the electronic device. For example, after the user speaks a preset wake-up word such as "small-focus" or "hello, small-focus", the electronic device responds to the wake-up audio corresponding to the preset wake-up word and triggers the voice interaction function.
After the voice interaction function is triggered, the electronic device may output a prompt voice to inform the user that the voice interaction function has been triggered and wait for the user to input a further voice instruction; after receiving the voice instruction input by the user, the electronic device responds to the voice instruction and performs the corresponding operation. For example, after the voice interaction function is triggered, the electronic device may output prompt audio such as "please speak, I am listening" or "I am listening", and after hearing the prompt audio the user may input a voice instruction such as "please play a video" or "please play a song", so that the electronic device, upon receiving the voice instruction, performs the corresponding operation such as playing the video or the song.
It can be understood that the wake-up module of an existing voice interaction function is limited by the computing resources of the electronic device and generally cannot achieve an ideal wake-up effect, that is, false wake-up occurs. For example, if the preset wake-up word of the electronic device is "small-focus" and a user chats beside the electronic device, and the chat contains "small-focus" or words similar to the preset wake-up word, the electronic device may identify that content as wake-up audio; the electronic device may also identify environmental noise as wake-up audio. The voice interaction function is thereby triggered by mistake, resulting in a poor user experience.
Therefore, in the related art, information about false wake-up audio is collected and reported so that the cause of the false wake-up can be analyzed and determined based on that information, and the voice recognition function of the electronic device can be improved. However, collecting information about false wake-up audio usually depends on approaches such as active reporting by users, customer service inquiries, and laboratory tests, so the amount of information obtained about false wake-up audio is usually small and the information usually lags, making the statistics on false wake-up audio inefficient and adversely affecting the improvement of the voice recognition function. Therefore, the related art for determining and reporting false wake-up audio needs further improvement.
The method for determining the false wake-up audio provided by the invention is described in detail below with reference to the accompanying drawings.
The electronic device provided in the embodiment of the invention may have various implementation forms, for example, may be a mobile terminal, a tablet computer, a notebook computer, a television, an electronic whiteboard (electronic bulletin board), an electronic desktop (electronic table), and the like, and the specific form of the electronic device is not limited in the embodiment of the invention.
Fig. 1 shows an interaction schematic diagram of an electronic device and a control device according to an embodiment of the present invention. As shown in fig. 1, a user may operate the electronic device 200 through the mobile terminal 300 or the control device 100. The control device 100 may be a remote controller, and the remote controller and the electronic device 200 may communicate through an infrared protocol or a Bluetooth protocol, or the remote controller may control the electronic device 200 in another wireless or wired manner.
The user may control the electronic device 200 by inputting user instructions through keys on a remote control, voice input, a control panel, etc. For example, the user may control the electronic device 200 to switch the displayed page through up-down keys on the remote controller, control the video played by the electronic device 200 to play or pause through play pause keys, and input a voice command through a voice input key to control the electronic device 200 to perform a corresponding operation.
In some embodiments, the user may also control the electronic device 200 using a mobile terminal, a tablet computer, a computer, a notebook computer, and other smart devices. For example, a user may control the electronic device 200 through an application installed on the smart device that, by configuration, may provide the user with various controls in an intuitive user interface on a screen associated with the smart device.
In some embodiments, the mobile terminal 300 may implement connection communication with a software application installed on the electronic device 200 through a network communication protocol for the purpose of one-to-one control operation and data communication. For example, a control command protocol may be established between the mobile terminal 300 and the electronic device 200, a remote control keyboard may be synchronized with the mobile terminal 300, a function of controlling the electronic device 200 may be implemented by controlling a user interface on the mobile terminal 300, or a function of transmitting content displayed on the mobile terminal 300 to the electronic device 200, and a synchronous display may be implemented.
As shown in fig. 1, the electronic device 200 and the server 400 may communicate data via a variety of communication means, which may allow the electronic device 200 to be communicatively coupled via a local area network (Local Area Network, LAN), a wireless local area network (Wireless Local Area Network, WLAN), and other networks. The server 400 may provide various content and interactions to the electronic device 200. For example, the electronic device 200 may receive software program updates or access a remotely stored digital media library by sending and receiving messages and Electronic Program Guide (EPG) interactions. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
The electronic device 200 may be a liquid crystal display, an Organic Light-Emitting Diode (OLED) display, a projection electronic device, a smart terminal, such as a mobile phone, a tablet computer, a smart television, a laser projection device, an electronic desktop (electronic table), etc. The specific electronic device type, size, resolution, etc. are not limited.
Fig. 2 shows a block diagram of a configuration of the control device 100 in an exemplary embodiment of the present invention. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control device 100 may receive an operation instruction input by a user and convert the operation instruction into an instruction that the electronic device 200 can recognize and respond to, thereby mediating interaction between the user and the electronic device 200.
Taking an electronic device as an example, fig. 3 shows a hardware configuration block diagram of an electronic device 200 according to an embodiment of the present invention. As shown in fig. 3, the electronic device 200 includes: a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, and at least one of a memory, a power supply, and a user interface.
The modem 210 may receive broadcast television signals through a wired or wireless reception manner and demodulate an audio/video signal, such as an EPG data signal, from a plurality of wireless or wired broadcast television signals. The detector 230 may be used to collect signals of the external environment or interaction with the outside.
In some embodiments, the frequency point demodulated by the modem 210 is controlled by the controller 250, and the controller 250 may issue a control signal according to the user selection, so that the modem 210 responds to the television signal frequency selected by the user and demodulates the television signal carried at that frequency.
Broadcast television signals may be classified into terrestrial broadcast signals, cable broadcast signals, satellite broadcast signals, internet broadcast signals, and the like according to the broadcasting system of the television signal; into digital modulation signals, analog modulation signals, and the like according to the type of modulation; and into digital signals, analog signals, and the like according to the type of signal.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
In some embodiments, communicator 220 may be a component for communicating with external devices or external servers according to various communication protocol types. For example, the communicator 220 may include at least one of a Wi-Fi chip, a Bluetooth communication protocol chip, a wired Ethernet communication protocol chip or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver.
In some embodiments, the detector 230 may be used to collect signals of or interact with the external environment, may include an optical receiver and a temperature sensor, etc.
The light receiver is a sensor that may be used to acquire the intensity of ambient light, so that display parameters and the like can be adaptively adjusted according to the ambient light intensity. The temperature sensor may be used to sense the ambient temperature, so that the electronic device 200 can adaptively adjust the display color temperature of the image; for example, the image displayed by the electronic device 200 may be adjusted toward a colder color temperature when the ambient temperature is high, or toward a warmer color temperature when the ambient temperature is low.
In some embodiments, the detector 230 may further include an image collector, such as a camera, a video camera, etc., which may be used to collect external environmental scenes, collect attributes of a user or interact with a user, adaptively change display parameters, and recognize a user gesture to realize an interaction function with the user.
In some embodiments, the detector 230 may also include a sound collector, such as a microphone, that may be used to receive the user's voice. For example, it may receive a voice signal containing a control instruction from the user for controlling the electronic device 200, or collect ambient sound for identifying the ambient scene type, so that the electronic device 200 can adapt to ambient noise.
In some embodiments, the external device interface 240 may include, but is not limited to, any one or more of the following interfaces: a high-definition multimedia interface (High Definition Multimedia Interface, HDMI), an analog or data high-definition component input interface, a composite video input interface, a universal serial bus (Universal Serial Bus, USB) input interface, an RGB port, and the like, or a composite input/output interface formed by a plurality of the above interfaces.
As shown in fig. 3, the controller 250 may include at least one of a central processor, a video processor, an audio processor, a graphic processor, a random access Memory (Random Access Memory, RAM), a Read-Only Memory (ROM), and a first interface to an nth interface for input/output. Wherein the communication bus connects the various components.
In some embodiments, the controller 250 may control the operation of the electronic device and respond to user operations through various software control programs stored on an external memory. For example, a user may input a user command through a graphical user interface (Graphic User Interface, GUI) displayed on the display 260, the user input interface receives the user input command through the graphical user interface, or the user may input the user command by inputting a specific sound or gesture, the user input interface recognizes the sound or gesture through the sensor, and receives the user input command.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user that enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form of a user interface is a graphical user interface, which refers to a user interface related to computer operations that is displayed in a graphical manner. The control can comprise at least one of an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget (short for Widget) and other visual interface elements.
In some embodiments, RAM may be used to store temporary data for the operating system or other on-the-fly programs; ROM may be used to store instructions for various system starts, for example, may be used to store instructions for a basic input output system, referred to as a basic input output system (Basic Input Output System, BIOS) start. ROM can be used to complete the power-on self-test of the system, the initialization of each functional module in the system, the driving program of the basic input/output of the system and the booting of the operating system.
In some embodiments, upon receipt of the power-on signal, the electronic device 200 power begins to boot, and the central processor runs system boot instructions in ROM, copying temporary data of the operating system stored in memory into RAM for booting or running the operating system. When the starting of the operating system is completed, the CPU copies the temporary data of various application programs in the memory into the RAM, and then the temporary data are convenient for starting or running the various application programs.
In some embodiments, the central processor may be configured to execute operating system and application instructions stored in memory, and to execute various applications, data, and content in accordance with various interactive instructions received from external inputs, to ultimately display and play various audio-visual content.
In some example embodiments, the central processor may include a plurality of processors. The plurality of processors may include one main processor and one or more sub-processors. The main processor is configured to perform some operations of the electronic device 200 in a pre-power-up mode and/or to display a picture in a normal mode. The one or more sub-processors are configured to perform operations in a standby mode or the like.
In some embodiments, the video processor may be configured to receive an external video signal and perform video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, transparency setting, and image composition in accordance with the standard codec protocol of the input signal, to obtain a signal that can be directly displayed or played on the electronic device 200.
In some embodiments, the video processor may include a demultiplexing module, a video decoding module, an image compositing module, a frame rate conversion module, a display formatting module, and the like.
The demultiplexing module is configured to demultiplex an input audio/video data stream, such as an MPEG-2 (Moving Picture Experts Group-2) stream, into a video signal, an audio signal, and the like. The video decoding module is configured to process the demultiplexed video signal, including decoding, scaling, transparency setting, and the like.
The image composition module, such as an image synthesizer, is configured to superimpose and mix the GUI signal that is input by the user or generated by a graphics generator with the scaled video image, to generate an image signal for display. The frame rate conversion module is configured to convert the frame rate of the input video, for example converting a 60 Hz frame rate into a 120 Hz or 240 Hz frame rate, usually by frame interpolation. The display formatting module is configured to convert the frame-rate-converted video into a video output signal conforming to the display format, such as an RGB data signal.
In some embodiments, the audio processor may be configured to receive an external audio signal, decompress and decode the audio signal according to a standard codec protocol of the input signal, and perform noise reduction, digital-to-analog conversion, and amplification processes to obtain a sound signal that may be played in a speaker.
In some embodiments, the video processor may comprise one or more chips. The audio processor may also comprise one or more chips. Meanwhile, the video processor and the audio processor may be a single chip, or may be integrated with the controller in one or more chips.
In some embodiments, the interface for input/output may be used for audio output, that is, to receive, under the control of the controller 250, the sound signal output by the audio processor and output it to a sound output device such as a speaker. In addition to the speaker carried by the electronic device 200 itself, the sound signal may also be output to a sound output terminal of an external device, for example an external sound interface or an earphone interface. The audio output may also include a near field communication module in the communication interface, for example a Bluetooth module for outputting sound through a speaker connected via Bluetooth.
In some embodiments, the graphics processor may be used to generate various graphical objects, such as icons, operation menus, and graphics displayed in response to user input instructions. The graphics processor may include an arithmetic unit that performs operations by receiving the various interactive instructions input by the user and displays various objects according to their display attributes, and a renderer that renders the objects produced by the arithmetic unit for display on a display.
In some embodiments, the graphics processor and the video processor may be integrated or may be configured separately. An integrated configuration may perform processing of the graphics signals output to the display, while a separate configuration may perform different functions respectively, such as a Graphics Processing Unit (GPU) + Frame Rate Conversion (FRC) architecture.
The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device and a projection screen.
In some embodiments, the display 260 may be used to display a user interface, such as may be used to display an interface corresponding to an electronic device, for example, the display interface may be a channel search interface in an electronic device, or may also be a display interface of some application program, etc.
In some embodiments, the display 260 may be used to receive audio and video signals output by the audio processor and video processor, display video content and images, play audio of the video content, and display components of a menu manipulation interface.
In some embodiments, the display 260 may be used to present a user-operated UI interface generated in the electronic device 200 and used to control the electronic device 200.
In some embodiments, the electronic device 200 may transmit and receive control signals and data signals to and from the control device 100 or a content providing device through the communicator 220.
In some embodiments, the memory may include storage of various software modules for driving the electronic device 200. Such as: various software modules stored in the first memory, including: at least one of a basic module, a detection module, a communication module, a display control module, a browser module, various service modules and the like.
The basic module is a bottom-layer software module for signal communication among the hardware of the electronic device 200 and for sending processing and control signals to upper-layer modules. The detection module is configured to collect various information from sensors or the user input interface and perform digital-to-analog conversion and analysis management.
The display control module may be used to control the display to display image content and to play information such as multimedia image content and UI interfaces. The communication module may be used for control and data communication with external devices. The browser module may be used to perform data communication with browsing servers. The service module is used to provide various services and application programs. Meanwhile, the memory may also store images of various items in various user interfaces, visual effect patterns of a focus object, and the like, as well as received external data and user data.
In some embodiments, the user interface may be used to receive a signal from the control device 100, for example an infrared control signal transmitted by an infrared remote controller.
The power supply may supply power to the electronic device 200 through power input from an external power source under the control of the controller 250.
In some embodiments, the electronic device 200 may receive a query instruction entered by a user through the communicator 220. For example, when communicator 220 is a touch component, the touch component may together with display 260 form a touch screen. On the touch screen, a user can input different control instructions through touch operation, for example, the user can input touch instructions such as clicking, sliding, long pressing, double clicking and the like, and different touch instructions can represent different control functions.
To implement the different touch actions, the touch assembly may generate different electrical signals when the user inputs the different touch actions, and transmit the generated electrical signals to the controller 250. The controller 250 may perform feature extraction on the received electrical signal to determine a control function to be performed by the user based on the extracted features.
For example, when a user inputs a click touch action at a search location in the display interface, the touch component will sense the touch action to generate an electrical signal. After receiving the electrical signal, the controller 250 may determine the duration of the level corresponding to the touch action in the electrical signal, and recognize that the user inputs the click command when the duration is less than the preset time threshold. The controller 250 then extracts the location features generated by the electrical signals to determine the touch location. When the touch position is within the search position range, it is determined that the user has input a click touch instruction at the search position. Then, the controller 250 may start a media search function and receive a search instruction input by the user, such as a search keyword, a voice search instruction, etc.
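For illustration only, the click-recognition logic described above may be sketched as follows; the function, field names, and the specific time threshold are assumptions introduced for this example and do not limit the embodiments.

```python
# Illustrative sketch of the click recognition described above.
# Field names and the time threshold are assumptions, not a fixed implementation.

CLICK_MAX_DURATION_MS = 200          # preset time threshold for a click

def recognize_touch(event, search_region):
    """Return the control function implied by a touch electrical signal."""
    # Duration of the level corresponding to the touch action
    duration_ms = event["level_end_ms"] - event["level_start_ms"]
    if duration_ms >= CLICK_MAX_DURATION_MS:
        return None                              # not a click (e.g. a long press)

    # Location feature extracted from the electrical signal
    x, y = event["x"], event["y"]
    left, top, right, bottom = search_region
    if left <= x <= right and top <= y <= bottom:
        return "start_media_search"              # click within the search position
    return "click_outside_search"
```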
In some embodiments, the user may trigger the query operation through a specific gesture operation on the touch screen, for example, when the user performs two continuous double-click operations on the display interface, the controller 250 may determine an interval time between two continuous double-clicks, and when the interval time is less than a preset time threshold, recognize that the user inputs the continuous double-click operation, and determine that the user triggers the media resource search operation.
In some embodiments, a user may enter voice instructions on a touch screen via a touch operation, such as a user may trigger a voice query operation on display 260 via a voice-triggered gesture.
In some embodiments, the communicator 220 may also be an external control component, such as a mouse, remote control, or the like, which may establish a communication connection with an electronic device. When the user performs different control operations on the external control component, the external control component may generate different control signals in response to the control operations of the user and transmit the generated control signals to the controller 250. The controller 250 may perform feature extraction on the received control signal to determine a control function to be performed by the user according to the extracted features.
For example, when a user clicks the left mouse button at any position in the channel display interface through the external control component, the external control component senses the control action and generates a control signal. After receiving the control signal, the controller 250 may determine, according to the control signal, the dwell time of the action at that position, and identify that a click command has been input by the user through the external control component when the dwell time is less than a preset time threshold. The click command is used to trigger the input function of the query instruction or to switch the media resource page in the current scenario.
For another example, when the user presses a voice key on the remote control, the remote control may initiate a voice entry function, and during the process of the user entering a voice command, the remote control may synchronize the voice command to the display 260, at which time the display 260 may display a voice entry identifier to indicate that the user is entering a voice command.
In some embodiments, the communicator 220 may also be a control component coupled to the display 260; for example, when the display 260 is used with a desktop computer, the control component may be a keyboard coupled to the display. The user can input different control instructions through the keyboard, such as a media information switching instruction, a query instruction, and the like.
Illustratively, the user may input a click command, a voice command, etc. through the corresponding shortcut key. For example, the user may trigger the sliding operation by selecting the "Tab" key and the direction key, that is, when the user selects the "Tab" key and the direction key on the keyboard at the same time, the controller 250 may receive the key signal, determine that the user triggers the operation of performing the switching operation in the direction corresponding to the direction key, and then, the controller 250 may control to turn or scroll the display interface in the media presentation page to display the corresponding media options.
Correspondingly, the user can also input voice instructions through corresponding shortcut keys. For example, when the user selects the "Ctrl" key and the "V" key, the controller 250 may receive a key signal to determine that the user triggers a voice search operation, and then the controller 250 may receive a voice command input by the user and control the display 260 to perform a corresponding operation, such as displaying a query result page corresponding to the voice command, according to the voice command.
In order to facilitate the detailed description of the method for determining the false wake-up audio provided by the embodiment of the present invention, fig. 4 shows a flowchart of a method for determining the false wake-up audio provided by the embodiment of the present invention, and the method may be applied to the electronic device 200 shown in fig. 1.
Among other things, the electronic device 200 may include a communicator 220 and a controller 250 coupled with the communicator 220.
In some embodiments, the communicator 220 may be configured to receive wake-up audio and audio to be detected, and the controller 250 may perform a wake-up operation in response to the wake-up audio and control the communicator 220 to receive the audio to be detected.
According to the method for determining false wake-up audio provided by the embodiment of the present invention, whether a target history detection text matching the currently acquired text to be detected exists can be determined based on a history detection text database; if the target history detection text exists and the target access times of the target history detection text are greater than a target preset access threshold, context information of the audio to be detected is further detected; and if the context audio corresponding to the audio to be detected includes preset instruction audio, or the user triggers a preset operation on the electronic device after the audio to be detected is acquired, the wake-up audio can be determined to be false wake-up audio.
By applying the technical solution of the present invention, after the electronic device acquires the wake-up audio and the audio to be detected at the current moment, whether the wake-up audio is false wake-up audio can be determined based on the historical statistical data (i.e., the history detection text database) and the audio to be detected. This reduces the resource consumption of the electronic device when identifying false wake-up audio, allows sufficient false wake-up data to be acquired, and ensures the timeliness and validity of the false wake-up data.
As shown in fig. 4, the controller 250 is configured to perform the following steps S410 to S440:
S410: in response to the wake-up audio, a wake-up operation is performed and audio to be detected is received.
In some embodiments, the electronic device may store information related to when the user uses the voice interaction function of the electronic device and send the information to a server that establishes a communication connection with the electronic device. The related information when the user uses the voice interaction function may include wake-up audio information, detection audio information, link log information, electronic device information, user portrait information, and the like.
The electronic device may record a voice interaction between the user and the electronic device as a session and determine identification information (also referred to as an identification number, or sessionid) of the session; the wake-up audio information, detection audio information, and link log information corresponding to the same session carry the same identification information.
The contents included in the wake-up audio information, the detection audio information, the link log information, the electronic device information, and the user portrait information may be shown in table 1:
TABLE 1
Illustratively, as shown in Table 1, the wake-up audio information may include identification information sessionid corresponding to the session, a number of the wake-up audio, a time stamp (e.g., a time at which the wake-up audio was received by the electronic device), a storage location of the wake-up audio in the electronic device, and so forth. Based on the identification information sessionid in the wake-up audio information and the storage location of the wake-up audio in the electronic device, the corresponding wake-up audio may be obtained.
The detected audio information may include identification information sessionid corresponding to the session, a detected audio number, a time stamp (e.g., the time the detected audio was received by the electronic device), a storage location of the detected audio in the electronic device, and so forth. Based on the identification information sessionid in the detected audio information and the storage location of the detected audio in the electronic device, corresponding detected audio may be obtained. The detected audio includes audio to be detected and historical detected audio.
Here, detected audio refers to audio acquired after the electronic device receives and responds to the wake-up audio. It will be appreciated that, since the wake-up audio may be correct wake-up audio or false wake-up audio, the detected audio may be instruction audio (for example, an instruction input by the user to the electronic device, such as playing a video or querying a message) or non-instruction audio (for example, audio played by the electronic device or another device, content spoken by the user that is unrelated to the voice interaction, ambient noise, and the like). The audio to be detected is the detected audio acquired at the current time, and the historical detected audio is detected audio acquired before the current time.
The link log information may include identification information sessionid corresponding to the session, an encoding of the electronic device, a time stamp (e.g., time the electronic device received wake-up audio or detected audio), detected text, semantic recognition results for the detected text, event encoding, event description, and foreground application information, etc.
The detection text is a text recognition result obtained after the voice recognition processing is carried out on the detection audio.
The events include instructions received by the electronic device during a period of power-on operation or operations triggered by a user, such as the television programs watched and the applications opened by the user during each period, the time periods during which the user uses the voice interaction function, and the related operations performed by the user during voice interaction with the electronic device. User portrait information regarding the voice interaction function may be generated based on the event descriptions, such as the high-frequency scenarios in which the user uses the voice interaction function, the high-frequency time periods in which the user uses the voice interaction function, and the service area in which the user uses the voice interaction function.
Wherein the foreground application information includes information about applications opened by the user or television programs watched during each period, for example, the user watched morning news at 8 am; or in the process of voice interaction, a certain media resource playing application is opened in the electronic equipment, and the media resource played by the application is a certain television play, and the like.
The device basic information may include a uuid corresponding to the electronic device, a model number of the electronic device, a chip type of the electronic device, version information of the electronic device, and the like.
The user information may include uuid corresponding to the electronic device, user portrait information (or user behavior preferences), and so forth. Wherein the user profile information may be generated based on the link log information.
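For illustration only, the per-session records described above (and summarized in Table 1) may be organized along the lines of the following sketch; all field names are assumptions and do not limit the embodiments.

```python
# Illustrative sketch of the per-session records described above and in Table 1.
# All field names are assumptions; the actual storage format is not specified here.
from dataclasses import dataclass

@dataclass
class WakeupAudioInfo:
    sessionid: str           # identification information of the session
    number: int              # number of the wake-up audio
    timestamp: float         # time the wake-up audio was received
    storage_path: str        # storage location of the wake-up audio on the device

@dataclass
class DetectedAudioInfo:
    sessionid: str
    number: int
    timestamp: float
    storage_path: str        # storage location of the detected audio on the device

@dataclass
class LinkLogEntry:
    sessionid: str
    device_code: str         # encoding of the electronic device
    timestamp: float
    detected_text: str       # speech recognition result of the detected audio
    semantics: str           # semantic recognition result for the detected text
    event_code: str
    event_description: str
    foreground_app: str      # foreground application information
```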
In some embodiments, as shown in fig. 5, the controller 250 is further configured to perform the following steps S510-S520:
S510: and acquiring a plurality of history detection texts corresponding to the time periods of the plurality of users in at least one period.
In some embodiments, the electronic device may build a history detection text database based on history detection text (e.g., history detection text in the link log information described above). The history detection texts are obtained by performing voice recognition processing on the history detection audios.
In some examples, link log information of other electronic devices corresponding to N days (i.e., one or more periods) may be obtained, and preferred history detection texts whose access times are greater than a preset threshold may be selected from the history detection texts in each day's link log information and stored in a data table (e.g., a preferred result list). N is a positive integer greater than or equal to 1, and the specific value of N is not limited in the embodiment of the present invention; it may be, for example, 1 day, 4 days, or 7 days. The link log information corresponding to the other electronic devices may be obtained by the electronic device from a server, which is not limited in the embodiment of the present invention.
Illustratively, take the history detection texts in Tables 2 and 3 as an example. Table 2 shows the history detection texts corresponding to the historical detected audio received by a plurality of electronic devices on February 3, 2024; Table 3 shows the history detection texts corresponding to the historical detected audio received by a plurality of electronic devices on February 4, 2024; and Table 4 is the preferred result list.
TABLE 2

| No. | History detection text | Number of accesses | Date |
|---|---|---|---|
| 1 | Cartoon | 23030 | 2024-02-03 |
| 2 | Movie | 21566 | 2024-02-03 |
| 3 | Play | 19500 | 2024-02-03 |
| 4 | Weather forecast | 18500 | 2024-02-03 |
| 5 | The spring festival brings our reporter to the mind | 15700 | 2024-02-03 |
| 6 | Shutdown | 14005 | 2024-02-03 |
TABLE 3

| No. | History detection text | Number of accesses | Date |
|---|---|---|---|
| 1 | Play | 24130 | 2024-02-04 |
| 2 | Cartoon | 22208 | 2024-02-04 |
| 3 | Shutdown | 18705 | 2024-02-04 |
| 4 | Gala evening | 15983 | 2024-02-04 |
| 5 | Movie | 12050 | 2024-02-04 |
| 6 | Music | 10028 | 2024-02-04 |
TABLE 4

| No. | History detection text | Number of accesses | Date |
|---|---|---|---|
| 1 | Cartoon | 23030 | 2024-02-03 |
| 2 | Movie | 21566 | 2024-02-03 |
| 3 | Play | 19500 | 2024-02-03 |
| 4 | Weather forecast | 18500 | 2024-02-03 |
| 5 | The spring festival brings our reporter to the mind | 15700 | 2024-02-03 |
| 6 | Shutdown | 14005 | 2024-02-03 |
| 7 | Play | 24130 | 2024-02-04 |
| 8 | Cartoon | 22208 | 2024-02-04 |
| 9 | Shutdown | 18705 | 2024-02-04 |
| 10 | Gala evening | 15983 | 2024-02-04 |
Taking a preset threshold of 14000 as an example, the access times of the history detection texts numbered 1 to 6 in Table 2 are all greater than the preset threshold of 14000, so the history detection texts numbered 1 to 6 can be used as the preferred history detection texts for that day and stored in the preferred result list shown in Table 4. The access times of the history detection texts numbered 1 to 4 in Table 3 are greater than the preset threshold of 14000, so the history detection texts numbered 1 to 4 may be selected as the preferred history detection texts for that day and stored in the preferred result list shown in Table 4.
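For illustration only, the per-day filtering into the preferred result list may be sketched as follows; the tuple layout of the input rows and the helper name are assumptions introduced for this example.

```python
# Sketch: select the per-day preferred history detection texts whose access times
# exceed the preset threshold and append them to the preferred result list.
# The (text, access_times, date) row layout is an assumption for illustration.

PRESET_THRESHOLD = 14000

def build_preferred_result_list(daily_tables):
    """daily_tables: list of per-day lists of (text, access_times, date) rows."""
    preferred = []
    for day_rows in daily_tables:
        for text, access_times, date in day_rows:
            if access_times > PRESET_THRESHOLD:
                preferred.append({"text": text,
                                  "access_times": access_times,
                                  "date": date})
    return preferred

# Rows of Tables 2 and 3 above:
day1 = [("Cartoon", 23030, "2024-02-03"), ("Movie", 21566, "2024-02-03"),
        ("Play", 19500, "2024-02-03"), ("Weather forecast", 18500, "2024-02-03"),
        ("The spring festival brings our reporter to the mind", 15700, "2024-02-03"),
        ("Shutdown", 14005, "2024-02-03")]
day2 = [("Play", 24130, "2024-02-04"), ("Cartoon", 22208, "2024-02-04"),
        ("Shutdown", 18705, "2024-02-04"), ("Gala evening", 15983, "2024-02-04"),
        ("Movie", 12050, "2024-02-04"), ("Music", 10028, "2024-02-04")]
print(len(build_preferred_result_list([day1, day2])))    # 10 entries, as in Table 4
```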
S520: a history detection text database is determined based on a plurality of history detection texts corresponding to each time period.
In some embodiments, after obtaining the preferred result list, the electronic device may determine, for each time period, the number of accesses of the plurality of history detection texts corresponding to each day in the preferred result list, so as to determine the history detection text database. The length of each time period may be 1 minute, 3 minutes, 5 minutes, or the like; the embodiment of the present invention does not limit the specific length of each time period.
In some examples, a hot interval in which the user uses the electronic device with high frequency, for example 11:00 to 13:00 or 17:00 to 23:00, may be selected, and the number of accesses of the plurality of history detection texts corresponding to each day may be obtained for each time period within the hot interval.
Illustratively, Table 5 is one possible history detection text database, taking as an example the number of accesses of a plurality of history detection texts in each time period within the above hot interval on February 3, 2024. Here, "whether high frequency" indicates whether the history detection text occurs at high frequency in a certain time period relative to the total number of accesses of that day (i.e., whether its number of accesses is greater than or equal to a preset high-frequency access threshold).
For example, if the preset high-frequency access threshold is 0.5% of the total number of accesses of the day, then for the time period 11:50-11:51 the number of accesses of the history detection text "Play" is 603. Compared with the number of accesses of all the history detection texts of that day (for example, 23030+21566+19500+18500+15700+14005=112301), the number of accesses in this time period is greater than the preset high-frequency access threshold (112301 × 0.5% ≈ 561). Therefore, the history detection text "Play" is a text that occurs at high frequency in the time period 11:50-11:51.
TABLE 5
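For illustration only, the high-frequency determination described above may be sketched as follows; the 0.5% ratio follows the example, and the function name is an assumption.

```python
# Sketch: decide whether a history detection text occurs at high frequency in a
# given time period, relative to the total number of accesses of that day.

HIGH_FREQ_RATIO = 0.005    # preset high-frequency access threshold: 0.5% of the day

def is_high_frequency(access_times_in_period, total_access_times_of_day):
    threshold = total_access_times_of_day * HIGH_FREQ_RATIO
    return access_times_in_period >= threshold

total_day = 23030 + 21566 + 19500 + 18500 + 15700 + 14005    # 112301
print(is_high_frequency(603, total_day))    # True: 603 > 112301 * 0.5% ≈ 561
```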
In some embodiments, after the history detection text database is determined, the preset access threshold corresponding to each time period may be determined based on the number of accesses of the plurality of history detection texts corresponding to that time period within at least one period (i.e., within the N days). For example, the mean and variance of the text access times corresponding to each time period may first be determined for each day; the preset access threshold corresponding to each time period is then determined based on the means and variances corresponding to that time period over the N days.
For example, for the time period 17:31-17:32 on February 3, 2024, the text access times corresponding to this time period are 1188 in total and involve two history detection texts, "Play" and "Cartoon". The mean and variance of the text access times corresponding to this time period on February 3, 2024 can be determined based on the access times and the number of history detection texts, and the means and variances corresponding to this time period on other days can be obtained in the same way.
Based on the means and variances of the text access times corresponding to this time period over the N days (for example, 7 days), the preset access threshold corresponding to this time period can be obtained; for example, the mean of the 7 daily means and the mean of the 7 daily variances may be taken, and the preset access threshold determined from them. The preset access thresholds corresponding to the other time periods can be obtained in the same way and are not described in detail here.
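For illustration only, one possible way to compute the preset access threshold from the per-day means and variances is sketched below; combining them as the mean plus the square root of the averaged variance is only an assumption, since the embodiment does not fix the exact combination.

```python
# Sketch: derive the preset access threshold for one time period from the per-day
# means and variances over N days. Using mean + sqrt(averaged variance) is only
# one possible combination and is an assumption made for this example.
from math import sqrt
from statistics import mean, pvariance

def preset_access_threshold(daily_access_times_per_text):
    """daily_access_times_per_text: one list per day, each containing the access
    times of the individual history detection texts in this time period."""
    daily_means = [mean(day) for day in daily_access_times_per_text]
    daily_vars = [pvariance(day) for day in daily_access_times_per_text]
    avg_mean = mean(daily_means)       # mean of the N daily means
    avg_var = mean(daily_vars)         # mean of the N daily variances
    return avg_mean + sqrt(avg_var)

# Example: time period 17:31-17:32, two texts ("Play", "Cartoon") per day, 7 days.
seven_days = [[603, 585], [590, 610], [620, 575], [600, 600],
              [640, 560], [580, 615], [605, 595]]
print(round(preset_access_threshold(seven_days)))
```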
In some embodiments, after determining the history detection text database and the preset access threshold corresponding to each time period, when the electronic device receives the wake-up audio at the current time, the electronic device may respond to the wake-up audio and perform a wake-up operation. For example, the electronic device may output relevant alert audio to the user after receiving the wake-up audio to alert the user that the voice interactive function has been turned on and wait to receive voice instructions from the user.
It will be appreciated that the wake-up audio may be correct wake-up audio, or may be false wake-up audio such as audio played by the electronic device or another device, content spoken by the user that is unrelated to the voice interaction, or ambient noise.
Referring to the voice interaction scene diagrams shown in fig. 6A-6D, take the preset wake-up word of the electronic device 200 being "small-focus" as an example. As shown in fig. 6A, if the wake-up audio 601 input by the user is "small-focus", the electronic device 200 may respond to the wake-up audio and output a related prompt; for example, a prompt control of "Hello, I am listening" may be displayed in the user interface 602 of the electronic device 200, or a voice message of "Hello, I am listening" may be output. As shown in fig. 6B, if the chat content of the user, or the audio being played by the electronic device or another device, contains the preset wake-up word "small-focus" (e.g., the false wake-up audio 604), the electronic device 200 may still respond to that wake-up audio.
S420: and performing voice recognition processing on the audio to be detected to obtain a text to be detected corresponding to the audio to be detected.
In some embodiments, if the electronic device receives a piece of audio (i.e., audio to be detected) after performing the wake-up operation, the audio to be detected may be subjected to a voice recognition process, and a text to be detected corresponding to the audio to be detected is obtained.
It will be appreciated that, if the electronic device receives correct wake-up audio, the audio to be detected is more likely to be a voice instruction input by the user, such as playing a certain video, playing a certain television program, powering off, or pausing. If the electronic device receives false wake-up audio, the audio to be detected is more likely to be chat audio of the user, audio being played by the electronic device or another device, and the like, that is, audio unrelated to the voice interaction process.
In some embodiments, the electronic device may search the history detection text database for whether there is a target history detection text that matches the text to be detected. If there is a target history detection text matching the above-mentioned text to be detected, S430 is performed. The history detection text database comprises a plurality of history detection texts, and the plurality of history detection texts comprise target history detection texts.
The condition that the text to be detected matches the target history detection text may include: the text to be detected exactly matches the target history detection text, or the matching degree between the text to be detected and the target history detection text is greater than a certain matching-degree threshold.
The method for determining whether the text to be detected matches a history detection text in the history detection text database may use a similarity algorithm, a similarity matching model, or the like. For example, the similarity matching model may include a Cross-encoder model, a Bi-encoder model, a Late Interaction model, an Attention-based Aggregator model, and the like; the similarity algorithm may include the Jaccard similarity coefficient, the Euclidean distance, the Manhattan distance, the cosine of the included angle, the Tanimoto coefficient, and the like.
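For illustration only, a character-level Jaccard match, one of the similarity algorithms listed above, may be sketched as follows; the 0.8 matching-degree threshold is an assumption.

```python
# Sketch: match the text to be detected against the history detection text
# database using the Jaccard similarity coefficient. The 0.8 matching-degree
# threshold is an assumption; an exact match yields a similarity of 1.0.

MATCH_THRESHOLD = 0.8

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)              # character-level sets
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def find_target_history_text(text_to_detect, history_texts):
    """Return the best-matching target history detection text, or None."""
    best, best_score = None, 0.0
    for h in history_texts:
        score = jaccard(text_to_detect, h)
        if score > best_score:
            best, best_score = h, score
    return best if best_score >= MATCH_THRESHOLD else None
```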
S430: if the history detection text database contains a target history detection text matched with the text to be detected, and the target access times of the target history detection text are larger than a target preset access threshold, detecting context information of the audio to be detected.
In some embodiments, if there is a target history detection text in the history detection text database that matches the text to be detected, it may then be determined whether the target access times of the target history detection text are greater than the target preset access threshold. If the target access times of the target history detection text are greater than the target preset access threshold, the wake-up audio corresponding to the audio to be detected can be preliminarily determined to be false wake-up audio, and context information detection is then performed on the audio to be detected to further determine whether the wake-up audio is false wake-up audio.
In some embodiments, the target time period and the target access number corresponding to the target history detection text may be determined based on the history detection text database, and whether the target access number is greater than the target preset access threshold may be determined according to the target preset access threshold corresponding to the target time period.
Taking the text to be detected as "spring festival bringing our reporter" as an example, it can be determined that the target history detection text matched with the text to be detected is "spring festival bringing our reporter" in the history detection text database shown in the above table 5; the target time period corresponding to the target history detection text is 21:23-21:24, and the target access times are 300. If the target preset access threshold corresponding to the target time period is 21:23-21:24 and 205 times, the target access times are larger than the target preset access threshold.
In some embodiments, in the case where there is a target history detection text that matches the text to be detected, it may be determined whether the target access number of the target history detection text is less than a preset high-frequency threshold in addition to determining whether the target access number of the target history detection text is greater than a target preset access threshold.
If the access times of the detected text are larger than a preset high-frequency threshold value, the text corresponding to the instruction audio, which indicates that the high probability of the detected text is correct, is indicated; if the access times of the detected text are smaller than the preset high-frequency threshold value, the detected text is possibly the text acquired by waking up the audio by mistake.
And primarily determining wake-up audio corresponding to the text to be detected (audio to be detected) of which the target access times are larger than a target preset access threshold and smaller than a preset high-frequency threshold as false wake-up audio, namely primarily determining wake-up audio which is easier to cause false wake-up and wake-up audio corresponding to non-instruction audio as false wake-up audio.
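For illustration only, the preliminary determination combining the two checks may be sketched as follows; the database layout and the example threshold values are assumptions.

```python
# Sketch of the preliminary determination: the target access times must be greater
# than the target preset access threshold of the target time period and less than
# the preset high-frequency threshold. The database layout is an assumption.

def is_preliminary_false_wakeup(target_text, time_period, db,
                                preset_access_thresholds, high_freq_threshold):
    """db: {time_period: {history_text: access_times}};
    preset_access_thresholds: {time_period: threshold}."""
    target_access_times = db.get(time_period, {}).get(target_text, 0)
    target_threshold = preset_access_thresholds.get(time_period, float("inf"))
    return target_threshold < target_access_times < high_freq_threshold

# Values follow the example above (access times 300, preset threshold 205);
# the high-frequency threshold of 561 is illustrative.
db = {"21:23-21:24": {"the spring festival brings our reporter to the mind": 300}}
thresholds = {"21:23-21:24": 205}
print(is_preliminary_false_wakeup(
    "the spring festival brings our reporter to the mind",
    "21:23-21:24", db, thresholds, high_freq_threshold=561))    # True
```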
It will be appreciated that, as shown in fig. 6C, if the user inputs correct wake-up audio, the electronic device will with high probability receive further instruction audio 603 input by the user (for example, "play XXX"), that is, the audio to be detected is instruction audio. The electronic device then outputs a related prompt (e.g., "OK, playing it for you right away") in the user interface 602 to inform the user that the instruction audio has been successfully responded to, and performs the related function indicated by the instruction audio.
As shown in fig. 6D, if the wake-up audio is a false wake-up audio (e.g., the false wake-up audio 604 in fig. 6B), and the electronic device 200 responds to the audio to be detected, the electronic device 200 may output a related prompt message to inform the user that the currently acquired audio to be detected does not include a valid instruction; after learning the prompt information output by the electronic device, the user typically exits the voice interaction function through a voice command 605 (e.g., exit the voice interaction function) or the control device 100.
That is, in the case where the audio to be detected is a non-instruction audio, the context information corresponding to the audio to be detected may include that the user exits the voice interaction function through a voice instruction (i.e., a preset instruction audio), or triggers an operation of exiting the voice interaction function of the electronic device through a control device or the like (i.e., a preset operation), or the like.
S440: if the context audio corresponding to the audio to be detected comprises preset instruction audio or a user triggers preset operation on the electronic equipment after the audio to be detected is acquired, the wake-up audio is determined to be false wake-up audio.
In some embodiments, if the context audio corresponding to the audio to be detected includes the preset instruction audio, or the user triggers the preset operation on the electronic device after the audio to be detected is acquired, this indicates that the user does not currently need to use the voice interaction function. It can therefore be inferred that the voice interaction function was most likely woken up by mistake, so the previously received wake-up audio is false wake-up audio, that is, the wake-up audio is determined to be false wake-up audio.
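For illustration only, the context check of step S440 may be sketched as follows; the way context audio and user operations are represented, as well as the example values, are assumptions.

```python
# Sketch of step S440: the wake-up audio is determined to be false wake-up audio
# if the context audio contains preset instruction audio (e.g., "exit the voice
# interaction function") or the user triggers a preset operation (e.g., exiting
# through the control device). The event representation is an assumption.

PRESET_EXIT_INSTRUCTIONS = {"exit the voice interaction function"}
PRESET_EXIT_OPERATIONS = {"exit_via_control_device", "close_voice_ui"}

def is_false_wakeup(context_instruction_texts, user_operations):
    has_exit_instruction = any(t in PRESET_EXIT_INSTRUCTIONS
                               for t in context_instruction_texts)
    has_exit_operation = any(op in PRESET_EXIT_OPERATIONS
                             for op in user_operations)
    return has_exit_instruction or has_exit_operation
```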
In some embodiments, after determining wake-up audio to be false wake-up audio, the electronic device may store the identification information sessionid corresponding to each false wake-up audio. In the subsequent process of reporting false wake-up data, the electronic device may obtain, from the link log information according to the stored identification information, the false wake-up audio and the detection texts corresponding to the false wake-up audio. In addition, at least one piece of dimension information corresponding to each false wake-up audio may be obtained from the device basic information and the user information; the dimension information is used to determine the cause of the false wake-up audio and may include the region and language in which the false wake-up occurred, the electronic device model, the chip type, version information, and the like.
Then, the electronic device may send the false wake-up data (including the false wake-up audio, the detection texts corresponding to the false wake-up audio, and the at least one piece of dimension information corresponding to each false wake-up audio) to a terminal device, so that the terminal device or related personnel can determine the cause of the false wake-up audio based on the false wake-up audio, the corresponding texts to be detected, and the dimension information. Moreover, the false wake-up data can be used as negative samples for training the voice interaction model of the electronic device, which can effectively reduce the probability of false wake-up of the electronic device and improve the user experience.
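For illustration only, the reported false wake-up data might be assembled as follows; the field names are assumptions and no specific reporting interface is implied.

```python
# Sketch: assemble the false wake-up data to be reported in batches. Field names
# are assumptions for illustration; no specific reporting API is implied.

def build_false_wakeup_report(sessionids, link_logs, device_info, user_info):
    """Collect, for each stored sessionid, the false wake-up audio reference,
    its detection text, and the dimension information."""
    report = []
    for sid in sessionids:
        log = link_logs[sid]
        report.append({
            "sessionid": sid,
            "wakeup_audio_path": log["wakeup_audio_path"],
            "detected_text": log["detected_text"],
            "dimensions": {                       # used to analyze the cause
                "region": user_info.get("region"),
                "language": user_info.get("language"),
                "device_model": device_info.get("model"),
                "chip_type": device_info.get("chip_type"),
                "version": device_info.get("version"),
            },
        })
    return report
```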
By applying the technical solution of the present invention, after the electronic device acquires the wake-up audio and the audio to be detected at the current moment, whether the wake-up audio is false wake-up audio can be determined in real time based on the historical statistical data (i.e., the history detection text database) and the audio to be detected, which reduces the resource consumption of the electronic device when identifying false wake-up audio and increases the amount of false wake-up data acquired. After the false wake-up audio is determined, the false wake-up data (for example, the false wake-up audio, the detection texts corresponding to the false wake-up audio, and the at least one piece of dimension information corresponding to each false wake-up audio) can be reported in batches to the terminal device or the server, which ensures the timeliness and validity of the false wake-up data, saves a large number of individual reports of false wake-up problems, and allows developers to discover and handle false wake-up problems of the electronic device in a timely manner.
An embodiment of the present invention further provides an apparatus for determining false wake-up audio. Referring to fig. 7, the apparatus 700 for determining false wake-up audio may be applied to an electronic device and may include: a receiving module 710, a processing module 720, a detecting module 730, and a determining module 740.
The receiving module 710 is configured to receive wake-up audio and audio to be detected.
The processing module 720 is configured to perform voice recognition processing on the audio to be detected, so as to obtain a text to be detected corresponding to the audio to be detected.
The detecting module 730 is configured to perform context information detection on the audio to be detected if the history detection text database contains a target history detection text matching the text to be detected and the target access times of the target history detection text are greater than a target preset access threshold.
The history detection text database comprises a plurality of history detection texts, and the plurality of history detection texts include the target history detection text.
The determining module 740 is configured to determine the wake-up audio to be false wake-up audio if the context audio corresponding to the audio to be detected includes preset instruction audio, or if the user triggers a preset operation on the electronic device after the audio to be detected is acquired.
As shown in fig. 7, the apparatus 700 for determining false wake-up audio may further include an acquisition module 750.
In some embodiments, the acquisition module 750 is configured to acquire, for a plurality of users, a plurality of history detection audios corresponding to each time period in at least one period; the processing module 720 is further configured to perform voice recognition processing on the plurality of history detection audios to obtain a plurality of history detection texts corresponding to the plurality of history detection audios; and the determining module 740 is further configured to determine the history detection text database based on the plurality of history detection texts corresponding to each time period.
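By way of illustration only, building such a per-time-period history detection text database could look like the following sketch; the record format and the period granularity are assumptions.

```python
# Illustrative construction of the history detection text database: recognized
# history detection texts are grouped by the time period in which they occurred,
# and the number of occurrences (access times) of each text is counted.
from collections import Counter, defaultdict

def build_history_text_database(history_records, period_hours=1):
    """history_records: iterable of (hour_of_day, history_detection_text) pairs
    gathered from a plurality of users over at least one period.

    Returns {time_period: Counter mapping text -> access times}.
    """
    database = defaultdict(Counter)
    for hour, text in history_records:
        time_period = hour // period_hours  # group hours into fixed time periods
        database[time_period][text] += 1
    return dict(database)
```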
In some embodiments, the determining module 740 is further configured to determine a preset access threshold corresponding to each time period based on the access times of the plurality of history detection texts corresponding to each time period in the at least one period in the history detection text database, where the target preset access threshold is the preset access threshold corresponding to a target time period among the time periods.
In some embodiments, the determining module 740 is specifically configured to: determine an access times mean and an access times variance corresponding to each time period based on the access times of the plurality of history detection texts corresponding to that time period; and determine the preset access threshold corresponding to each time period based on the access times mean and the access times variance corresponding to that time period.
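One plausible (but not mandated) realization of such a threshold is the mean of the access times plus a multiple of their standard deviation, as sketched below; the coefficient k is an assumption.

```python
# Per-period threshold from the access-times mean and variance; the combination
# mean + k * sqrt(variance) is one assumed choice consistent with the description.
import statistics

def preset_access_threshold(access_times, k=2.0):
    """access_times: access times of the history detection texts in one time period."""
    mean = statistics.mean(access_times)
    variance = statistics.pvariance(access_times)
    return mean + k * variance ** 0.5

def thresholds_per_period(database, k=2.0):
    """database: {time_period: Counter(text -> access times)} as in the sketch above."""
    return {
        time_period: preset_access_threshold(list(counts.values()), k)
        for time_period, counts in database.items()
    }
```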
In some embodiments, the determining module 740 is further configured to determine whether the target access times of the target history detection text are greater than the target preset access threshold. Specifically, the determining module 740 is configured to: determine, based on the history detection text database, the target time period and the target access times corresponding to the target history detection text; and determine whether the target access times are greater than the target preset access threshold according to the target preset access threshold corresponding to the target time period.
In some embodiments, the detecting module 730 is specifically configured to perform context information detection on the audio to be detected if the target access times are greater than the target preset access threshold and less than a preset high-frequency access threshold.
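Taken together, these checks might be sketched as follows; the lookup of the target time period and the handling of the high-frequency threshold are assumptions made for illustration.

```python
# Sketch of the gating logic: look up the target access times and the threshold
# of the corresponding time period, then run context detection only when the
# access times fall strictly between the target preset access threshold and the
# preset high-frequency access threshold.
def needs_context_detection(target_text, database, thresholds, target_time_period,
                            high_frequency_threshold):
    access_times = database.get(target_time_period, {}).get(target_text, 0)
    target_threshold = thresholds.get(target_time_period, float("inf"))
    return target_threshold < access_times < high_frequency_threshold
```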
As shown in fig. 7, the apparatus 700 for determining false wake-up audio may further include a storage module 760 and a sending module 770.
In some embodiments, the storage module 760 is configured to store a plurality of pieces of identification information corresponding to a plurality of false wake-up audios; the acquisition module 750 is further configured to acquire, according to the plurality of pieces of identification information, the plurality of false wake-up audios and the plurality of texts to be detected corresponding to them, and to acquire at least one piece of dimension information corresponding to each false wake-up audio, where the dimension information is used to determine the cause of the false wake-up audio; and the sending module 770 is configured to send the plurality of false wake-up audios, the plurality of texts to be detected corresponding to them, and the at least one piece of dimension information corresponding to each false wake-up audio to the terminal device, so that the terminal device determines the cause of the false wake-up audio based on this data.
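For orientation only, the composition of the apparatus 700 can be outlined as a simple container of the modules named above; the constructor signature and attribute names are assumptions, not the claimed structure.

```python
# Schematic outline of the apparatus 700 as a composition of the modules above;
# the attribute names mirror the module reference numerals for readability.
class FalseWakeUpAudioDeterminationApparatus:
    def __init__(self, receiver, processor, detector, determiner,
                 acquirer, storage, sender):
        self.receiving_module = receiver      # 710: receives wake-up audio and audio to be detected
        self.processing_module = processor    # 720: speech recognition -> text to be detected
        self.detecting_module = detector      # 730: context information detection
        self.determining_module = determiner  # 740: decides whether the wake-up audio is false
        self.acquisition_module = acquirer    # 750: acquires history audio and false wake-up data
        self.storage_module = storage         # 760: stores identification information
        self.sending_module = sender          # 770: reports false wake-up data to the terminal device
```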
Correspondingly, the specific details of each part of the apparatus for determining false wake-up audio have already been described in detail in the embodiment of the electronic device; for details not disclosed here, reference may be made to that embodiment, and they are not repeated.
An embodiment of the present invention further provides a computer-readable storage medium storing at least one executable instruction which, when run on an electronic device/apparatus for determining false wake-up audio, causes the electronic device/apparatus for determining false wake-up audio to perform the method for determining false wake-up audio in any of the above method embodiments.
The executable instruction may be specifically configured to cause the electronic device/apparatus for determining false wake-up audio to perform the above-described method for determining false wake-up audio.
In this embodiment, the computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. In addition, embodiments of the present invention are not directed to any particular programming language.
In the description provided herein, numerous specific details are set forth. It will be appreciated, however, that embodiments of the invention may be practiced without these specific details. Similarly, in the above description of exemplary embodiments of the invention, various features of the embodiments are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one or more of the various inventive aspects. The claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from those of the embodiments. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and they may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components, except where at least some of such features and/or processes or elements are mutually exclusive.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.