TECHNICAL FIELD
This disclosure relates to adapting hotword recognition based on personalized negatives.
BACKGROUND
A speech-enabled environment (e.g., home, workplace, school, automobile, etc.) allows a user to speak a query or a command out loud to a computer-based system that fields and answers the query and/or performs a function based on the command. The speech-enabled environment can be implemented using a network of connected microphone devices distributed through various rooms or areas of the environment. These devices may use hotwords to help discern when a given utterance is directed at the system, as opposed to an utterance that is directed to another individual present in the environment. Accordingly, the devices may operate in a sleep state or a hibernation state and wake-up only when a detected utterance includes a hotword. Typically, systems used to detect hotwords in streaming audio generate a probability score indicative of a probability that a hotword is present in the streaming audio. When the probability score satisfies a predetermined threshold, the device initiates the wake-up process.
SUMMARY
One aspect of the disclosure provides a method for adapting hotword recognition based on personalized negatives. The method includes receiving, at data processing hardware, audio data characterizing a hotword event detected by a first stage hotword detector in streaming audio captured by a user device. The method also includes processing, by the data processing hardware, using a second stage hotword detector, the audio data to determine whether a hotword is detected by the second stage hotword detector in a first segment of the audio data. When the hotword is not detected by the second stage hotword detector in the first segment of the audio data, the method includes classifying, by the data processing hardware, the first segment of the audio data as containing a negative hotword that caused a false detection of the hotword event in the streaming audio by the first stage hotword detector. Based on the first segment of the audio data classified as containing the negative hotword, the method includes updating, by the data processing hardware, the first stage hotword detector to prevent triggering the hotword event in subsequent audio data that contains the negative hotword.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the method further includes, when the hotword is not detected by the second stage hotword detector in the first segment of the audio data: suppressing, by the data processing hardware, a wake-up process on the user device for processing the hotword and one or more other terms following the hotword in the streaming audio; and determining, by the data processing hardware, whether an immediate follow-up query was provided by a user of the user device after suppressing the wake-up process on the user device. In these implementations, classifying the first segment of the audio data as containing the negative hotword is further based on determining that no follow-up query was provided by the user of the user device after suppressing the wake-up process.
In some examples, when the hotword is detected by the second stage hotword detector in the first segment of the audio data, the method further includes processing, by the data processing hardware, a second segment of the audio data that follows the first segment of the audio data to determine whether the second segment of the audio data is indicative of a spoken query-type utterance. In these examples, when the second segment of the audio data is not indicative of the spoken query-type utterance, the method also includes: classifying, by the data processing hardware, the first segment of the audio data as containing the negative hotword; and based on the first segment of the audio data classified as containing the negative hotword, updating, by the data processing hardware, the first stage hotword detector to prevent triggering the hotword event in subsequent audio data that includes the negative hotword. In some implementations, the method further includes, when the second segment of the audio data is not indicative of the spoken query-type utterance, determining, by the data processing hardware, whether an immediate follow-up query was provided by a user of the user device. Here, classifying the first segment of the audio data as containing the negative hotword is further based on determining that no follow-up query was provided by the user of the user device.
When the second segment of the audio data is indicative of the spoken query-type utterance, the method may also include: receiving, at the data processing hardware, a negative interaction result indicating that a user of the user device negatively interacted with results for the spoken query-type utterance provided to the user device; classifying, by the data processing hardware, based on the received negative interaction result, the first segment of the audio data as containing the negative hotword; and based on the first segment of the audio data classified as containing the negative hotword, updating, by the data processing hardware, the first stage hotword detector to prevent detecting the hotword event in subsequent audio data that contains the negative hotword.
In some examples, after receiving the audio data characterizing the hotword event detected by the first stage hotword detector, the method further includes receiving, at the data processing hardware, a negative user interaction indicating user suppression of a wake-up process on the user device. Here, classifying the first segment of the audio data as containing the negative hotword is further based on the negative user interaction indicating user suppression of the wake-up process.
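The classification conditions summarized above can be illustrated with a minimal sketch. The sketch is not the claimed implementation: the `HotwordEvent` record, its field names, and the order in which the optional conditions are combined are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass
class HotwordEvent:
    """Hypothetical record of signals gathered for one hotword event."""
    second_stage_detected: bool               # second stage confirmed the hotword
    query_like_second_segment: bool = False   # second segment looks like a query
    follow_up_query: bool = False             # user immediately followed up
    negative_result_interaction: bool = False # user rejected the query results
    user_suppressed_wakeup: bool = False      # user manually stopped the wake-up


def is_negative_hotword(event: HotwordEvent) -> bool:
    """Return True when the first segment should be classified as a negative hotword.

    Assumption: the optional features are applied together, in this order.
    """
    # A negative user interaction that suppresses the wake-up process.
    if event.user_suppressed_wakeup:
        return True
    # Second stage rejected the hotword and the user did not follow up.
    if not event.second_stage_detected:
        return not event.follow_up_query
    # Second stage accepted, but no query-type utterance and no follow-up.
    if not event.query_like_second_segment:
        return not event.follow_up_query
    # A query was processed; rely on the negative interaction result.
    return event.negative_result_interaction
```

For instance, an event rejected by the second stage with no follow-up query is classified as a negative hotword, whereas the same event followed by an immediate query is not.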
Optionally, updating the first stage hotword detector to prevent triggering the hotword event in subsequent audio data may include providing the first segment of the audio data classified as containing the negative hotword to the user device. The user device is configured to retrain the first stage hotword detector using the first segment of audio data classified as containing the negative hotword. In some implementations, the user device is configured to retrain the first stage hotword detector by storing, in memory hardware of the user device, each instance of the first segment of the audio data classified as containing the negative hotword and retraining the first stage hotword detector based on an aggregation of the number of instances of the first segment of the audio data classified as containing the negative hotword stored in the memory hardware. In these implementations, the user device is further configured to, prior to retraining the first stage hotword detector, determine that a corresponding confidence score associated with each instance of the first segment of the audio data classified as containing the negative hotword fails to satisfy a negative hotword threshold score and determine that the number of instances exceeds a threshold number of instances.
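As a non-limiting sketch, the two checks performed prior to retraining may be expressed as follows. The class names, the default threshold values, and the reading of "fails to satisfy" as "is strictly below" are assumptions made for illustration and are not specified by the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NegativeInstance:
    """Hypothetical stored instance of a first segment classified as negative."""
    audio: bytes
    confidence: float  # confidence score associated with this instance


@dataclass
class NegativeStore:
    """Hypothetical on-device store that gates retraining of the first stage."""
    negative_threshold: float = 0.3  # assumed negative hotword threshold score
    min_instances: int = 5           # assumed threshold number of instances
    instances: List[NegativeInstance] = field(default_factory=list)

    def add(self, instance: NegativeInstance) -> None:
        self.instances.append(instance)

    def ready_to_retrain(self) -> bool:
        # Every stored confidence score must fail to satisfy the threshold
        # (assumed here to mean "is strictly below it").
        all_fail = all(i.confidence < self.negative_threshold for i in self.instances)
        # The number of stored instances must exceed the threshold count.
        return all_fail and len(self.instances) > self.min_instances
```

Under these assumptions, retraining is deferred until enough consistent negative instances have accumulated, which avoids adapting the detector on a single spurious event.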
Updating the first stage hotword detector to prevent triggering the hotword event in subsequent audio data may include providing the first segment of the audio data classified as containing the negative hotword to the user device. The user device is configured to obtain an embedding representation of the first segment of the audio data and store, in memory hardware of the user device, the embedding representation of the first segment of the audio data. Additionally, the user device is configured to determine when subsequent audio data characterizing the hotword event detected by the first stage hotword detector includes the negative hotword by: computing an evaluation embedding representation for the subsequent audio data; determining a similarity score between the embedding representation of the first segment of the audio data classified as the negative hotword and the evaluation embedding representation for the subsequent audio data; and when the similarity score satisfies a similarity score threshold, determining that the subsequent audio data includes the negative hotword.
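The embedding comparison may be sketched as follows. The disclosure does not name a similarity measure, so cosine similarity is used here purely as an assumption, as are the function names, the default threshold of 0.9, and the reading of "satisfies" as "greater than or equal to".

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def contains_negative_hotword(
    evaluation_embedding: Sequence[float],
    stored_embeddings: Sequence[Sequence[float]],
    similarity_threshold: float = 0.9,  # assumed value
) -> bool:
    """True when the evaluation embedding matches any stored negative embedding."""
    return any(
        cosine_similarity(evaluation_embedding, stored) >= similarity_threshold
        for stored in stored_embeddings
    )
```

In this sketch, a match against any stored embedding is sufficient to treat the subsequent audio data as including the negative hotword, so the hotword event can be suppressed without retraining the detector.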
In some implementations, the data processing hardware resides on a server in communication with the user device and the first stage hotword detector executes on a processor of the user device. Processing the audio data to determine whether the hotword is detected by the second stage hotword detector in the first segment of the audio data may include performing automated speech recognition to determine whether the hotword is recognized in the first segment of the audio data.
In some examples, the data processing hardware resides on the user device. In these examples, the first stage hotword detector may execute on a digital signal processor (DSP) of the data processing hardware and the second stage hotword detector may execute on an application processor of the data processing hardware. The first stage hotword detector may be configured to generate a probability score that indicates a presence of the hotword in audio features of the streaming audio captured by the user device and detect the hotword event in the streaming audio when the probability score satisfies a hotword detection threshold of the first stage hotword detector.
Another aspect of the disclosure provides a system for adapting hotword recognition based on personalized negatives. The system includes data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that when executed on the data processing hardware cause the data processing hardware to perform operations. The operations include receiving audio data characterizing a hotword event detected by a first stage hotword detector in streaming audio captured by a user device. The operations also include processing, using a second stage hotword detector, the audio data to determine whether a hotword is detected by the second stage hotword detector in a first segment of the audio data. When the hotword is not detected by the second stage hotword detector in the first segment of the audio data, the operations include classifying the first segment of the audio data as containing a negative hotword that caused a false detection of the hotword event in the streaming audio by the first stage hotword detector. Based on the first segment of the audio data classified as containing the negative hotword, the operations include updating the first stage hotword detector to prevent triggering the hotword event in subsequent audio data that contains the negative hotword.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the operations further include, when the hotword is not detected by the second stage hotword detector in the first segment of the audio data: suppressing a wake-up process on the user device for processing the hotword and one or more other terms following the hotword in the streaming audio; and determining whether an immediate follow-up query was provided by a user of the user device after suppressing the wake-up process on the user device. In these implementations, classifying the first segment of the audio data as containing the negative hotword is further based on determining that no follow-up query was provided by the user of the user device after suppressing the wake-up process.
In some examples, when the hotword is detected by the second stage hotword detector in the first segment of the audio data, the operations further include processing a second segment of the audio data that follows the first segment of the audio data to determine whether the second segment of the audio data is indicative of a spoken query-type utterance. In these examples, when the second segment of the audio data is not indicative of the spoken query-type utterance, the operations also include: classifying the first segment of the audio data as containing the negative hotword; and based on the first segment of the audio data classified as containing the negative hotword, updating the first stage hotword detector to prevent triggering the hotword event in subsequent audio data that includes the negative hotword. In some implementations, the operations further include, when the second segment of the audio data is not indicative of the spoken query-type utterance, determining whether an immediate follow-up query was provided by a user of the user device. Here, classifying the first segment of the audio data as containing the negative hotword is further based on determining that no follow-up query was provided by the user of the user device. When the second segment of the audio data is indicative of the spoken query-type utterance, the operations may also include: receiving a negative interaction result indicating that a user of the user device negatively interacted with results for the spoken query-type utterance provided to the user device; classifying, based on the received negative interaction result, the first segment of the audio data as containing the negative hotword; and based on the first segment of the audio data classified as containing the negative hotword, updating the first stage hotword detector to prevent detecting the hotword event in subsequent audio data that contains the negative hotword.
In some examples, after receiving the audio data characterizing the hotword event detected by the first stage hotword detector, the operations further include receiving a negative user interaction indicating user suppression of a wake-up process on the user device. Here, classifying the first segment of the audio data as containing the negative hotword is further based on the negative user interaction indicating user suppression of the wake-up process.
Optionally, updating the first stage hotword detector to prevent triggering the hotword event in subsequent audio data may include providing the first segment of the audio data classified as containing the negative hotword to the user device. The user device is configured to retrain the first stage hotword detector using the first segment of audio data classified as containing the negative hotword. In some implementations, the user device is configured to retrain the first stage hotword detector by storing, in memory hardware of the user device, each instance of the first segment of the audio data classified as containing the negative hotword and retraining the first stage hotword detector based on an aggregation of the number of instances of the first segment of the audio data classified as containing the negative hotword stored in the memory hardware. In these implementations, the user device is further configured to, prior to retraining the first stage hotword detector, determine that a corresponding confidence score associated with each instance of the first segment of the audio data classified as containing the negative hotword fails to satisfy a negative hotword threshold score and determine that the number of instances exceeds a threshold number of instances.
Updating the first stage hotword detector to prevent triggering the hotword event in subsequent audio data may include providing the first segment of the audio data classified as containing the negative hotword to the user device. The user device is configured to obtain an embedding representation of the first segment of the audio data and store, in memory hardware of the user device, the embedding representation of the first segment of the audio data. Additionally, the user device is configured to determine when subsequent audio data characterizing the hotword event detected by the first stage hotword detector includes the negative hotword by: computing an evaluation embedding representation for the subsequent audio data; determining a similarity score between the embedding representation of the first segment of the audio data classified as the negative hotword and the evaluation embedding representation for the subsequent audio data; and when the similarity score satisfies a similarity score threshold, determining that the subsequent audio data includes the negative hotword.
In some implementations, the data processing hardware resides on a server in communication with the user device and the first stage hotword detector executes on a processor of the user device. Processing the audio data to determine whether the hotword is detected by the second stage hotword detector in the first segment of the audio data may include performing automated speech recognition to determine whether the hotword is recognized in the first segment of the audio data.
In some examples, the data processing hardware resides on the user device. In these examples, the first stage hotword detector may execute on a digital signal processor (DSP) of the data processing hardware and the second stage hotword detector may execute on an application processor of the data processing hardware. The first stage hotword detector may be configured to generate a probability score indicating a presence of the hotword in audio features of the streaming audio captured by the user device and detect the hotword event in the streaming audio when the probability score satisfies a hotword detection threshold of the first stage hotword detector.
The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic view of an example system for classifying negative hotwords and updating a first stage hotword detector to prevent detecting hotword events in audio containing negative hotwords.
FIG. 2 is a schematic view of a hotword detection architecture.
FIG. 3 is a schematic view of an example negative hotword classifier of the system of FIG. 1.
FIG. 4 is a schematic view of a user device storing classification results for audio data classified as containing a negative hotword.
FIG. 5 is a schematic view of an example user device identifying the presence of a personal negative hotword in captured audio data to prevent triggering a hotword event in the audio data.
FIG. 6 is a flowchart of an example arrangement of operations for classifying personalized negative hotwords and updating a first stage hotword detector to prevent triggering a hotword event in audio data containing a personalized negative hotword.
FIG. 7 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.
Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
Speech-based interfaces such as digital assistants are becoming increasingly prevalent across a variety of devices including, without limitation, mobile phones and smart speakers/displays that include microphones for capturing speech. The general way of initiating voice interaction with a speech-enabled device is to speak a fixed phrase, e.g., a hotword, that, when detected by the speech-enabled device in streaming audio, triggers the speech-enabled device to initiate a wake-up process to begin recording and processing subsequent speech to ascertain a query spoken by the user. Thus, as a primary entry point for a speech-based interface, it is critical that hotword detection/recognition works reliably in terms of both recall and precision, so that the number of false wake-up events is minimized.
A false negative (also referred to as ‘false rejection’) refers to failing to detect a hotword in streaming audio that the user spoke when intending to interact with the speech-based interface (e.g., digital assistant). Here, the speech-enabled device fails to react to the user and requires the user to attempt to invoke the interface again by speaking the hotword again, often louder and with different enunciation, to ensure that the hotword is detected. On the other hand, a false positive (also referred to as ‘false acceptance’) refers to detecting a hotword in streaming audio when the streaming audio did not actually contain the hotword, generally due to the streaming audio containing a word/phrase that, when spoken, sounds phonetically similar to the hotword. The false positive causes the speech-enabled device to initiate the wake-up process even though the user did not intend to invoke the system, thereby surprising and/or confusing the user by reacting when the speech-enabled device should have remained in a sleep state.
A cascade hotword detection architecture incorporates a first stage hotword detector that runs on device to detect the presence of a hotword in streaming audio, and a second stage hotword detector that confirms the presence of the hotword detected by the first stage hotword detector. The second stage hotword detector is associated with higher accuracy in detecting the presence of hotwords in streaming audio, and thus has higher power requirements than the first stage hotword detector. Often, the second stage hotword detector is implemented on a server in communication with the first stage hotword detector implemented on the speech-enabled device. Even a partial false positive, where the first stage hotword detector detects the presence of a hotword locally but the second stage hotword detector implemented at the server rejects the presence of the hotword, has a negative effect on user experience even though the server ultimately suppresses the wake-up process. That is, the detection of the hotword by the first stage hotword detector still causes the device to wake up and connect to the server, which is noticeable to the user (e.g., visible notification or flashing light) and is further undesirable from the privacy and power-preserving perspectives. Accordingly, eliminating the occurrence of partial false positive instances is desirable to improve user experience.
Conventionally, speech-enabled devices all use a same, fixed hotword model for all users of a given language (or locale) that is updated periodically with new versions pushed from server to device. That is, the same hotword model is used to detect hotwords in streaming audio for all users despite the existence of huge variations across user speech, accents, vocabulary, and/or acoustic environments in which the speech-enabled devices are operating. As a result, it is nearly impossible to implement stringent precision/recall requirements for detecting hotwords when a single hotword model is shared across all users of a given language/locale.
For a given user and/or environment, hotword false positive instances are very likely to cluster together. In a non-limiting example, a particular user speaking the term “poodle” may cause a hotword detection model on a speech-enabled device to incorrectly detect the presence of the designated hotword “Hey Google”, whereas a different user may cause the same hotword detection model implemented on another speech-enabled device to detect the designated hotword when that user speaks “doodle”. The variation in these false positive instances across different users can be attributed to pronunciation differences of the users and/or frequency of those terms in the respective vocabularies of those users. Since the same false positive instances are likely to reoccur based on similar acoustic patterns for a same user in a same environment, a hotword detector should ideally learn to adapt to avoid repeating the same false positives over and over when the user speaks a given term with a pronunciation similar to the designated hotword.
Implementations herein are directed toward personalizing a hotword detector on a speech-enabled device of a user based on specific terms classified as negative hotwords which caused previous instances of hotword detection false positives. A specific term classified as a negative hotword may be user-specific such that any hotword detection is suppressed when an audio segment derived from a particular user speaking the term is detected by the hotword detector. Additionally or alternatively, a specific term classified as a negative hotword may be device-specific such that hotword detection is suppressed when the audio segment derived from users speaking the term is detected by the hotword detector implemented on a particular device but not on other speech-enabled devices associated with the same user(s). This follows the notion that devices located in some environments may be more prone to hotword detection false positives than devices located in other environments due to acoustic variation, variation in vocabulary, and variation across users who typically speak in the environments.
Referring to FIG. 1, in some implementations, an example system 100 includes a user device 102 associated with one or more users 10 and in communication with a remote system 110 via a network 104. The user device 102 may correspond to a computing device, such as a mobile phone, computer (laptop or desktop), tablet, smart speaker/display, smart appliance, smart headphones, wearable, vehicle infotainment system, etc., and is equipped with data processing hardware 103 and memory hardware 105. The user device 102 includes or is in communication with one or more microphones 106 for capturing utterances from the respective user 10. The remote system 110 may be a single computer, multiple computers, or a distributed system (e.g., a cloud environment) having scalable/elastic computing resources 112 (e.g., data processing hardware) and/or storage resources 114 (e.g., memory hardware).
The user device 102 includes a first stage hotword detector 210 (also referred to as a hotword detection model) configured to detect the presence of a hotword in streaming audio 118 without performing semantic analysis or speech recognition processing on the streaming audio 118. In some implementations, the user device 102 also includes an initial coarse hotword detector 205 (FIG. 2) that is trained to initially detect the presence of the hotword before invoking the first stage (fine) hotword detector 210 to receive the streaming audio 118 and confirm whether or not the hotword is detected in the streaming audio 118. In some implementations, the first stage hotword detector 210 includes a trained neural network (e.g., a memorialized neural network) received from the remote system 110 via the network 104. The remote system 110 may push updates or new versions of the first stage hotword detector 210 to a population of user devices 102 of a given locale and/or associated with users of a specific language. That is, all user devices 102 associated with United States-English speaking users receive the same first stage hotword detector 210, which may include a neural network trained to detect the presence of the hotword in streaming audio 118 when spoken by English speaking users in the United States.
In some examples, the first stage hotword detector 210 executing on the user device 102 is configured to detect a presence of the hotword “Hey Google” in the streaming audio 118 to initiate a wake-up process on the user device 102 for processing the hotword and/or one or more other terms (e.g., query or command) following the hotword in the streaming audio 118. The first stage hotword detector 210 may be configured to generate a probability score that indicates a presence of the hotword in audio features of the streaming audio 118 captured by the user device 102, and detect the hotword in the streaming audio 118 when the probability score satisfies a hotword detection threshold of the first stage hotword detector 210. Accordingly, the first stage hotword detector 210 may detect a hotword event in streaming audio 118 captured by the user device 102 when a probability score generated for audio features of the streaming audio 118 satisfies the hotword detection threshold.
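The thresholding performed by the first stage can be pictured with a minimal sketch. The function name, the representation of the streaming audio as a sequence of per-frame probability scores, the default threshold of 0.8, and the reading of "satisfies" as "greater than or equal to" are assumptions for illustration; the neural network that produces the scores is not modeled.

```python
from typing import Iterable, Optional


def first_hotword_event(
    probability_scores: Iterable[float],
    detection_threshold: float = 0.8,  # assumed hotword detection threshold
) -> Optional[int]:
    """Return the index of the first frame whose score satisfies the threshold.

    Returns None when no frame triggers a hotword event, in which case the
    device would remain in its sleep state.
    """
    for index, score in enumerate(probability_scores):
        if score >= detection_threshold:
            return index
    return None
```

For example, with per-frame scores of 0.10, 0.50, 0.85, and 0.90, the sketch reports a hotword event at the third frame (index 2).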
In the example shown, the user 10 speaks an utterance 119 that includes a term/phrase (e.g., “Hey Poodle”) captured as streaming audio 118 by the user device 102 that has a similar pronunciation as a fixed term/phrase (e.g., “Hey Google”) designated as the hotword the first stage hotword detector 210 is trained to detect. Notably, the user 10 may pronounce the term “Hey Poodle” such that it is more difficult for the first stage hotword detector 210 to distinguish from the designated hotword “Hey Google” than it would be when spoken by other users that use a slightly different pronunciation. As a result, the first stage hotword detector 210 executing on the user device 102 falsely detects the presence of the hotword by generating a probability score for audio features associated with “Hey Poodle” that satisfies the hotword detection threshold, thereby triggering initiation of a wake-up process that was not intended by the user 10.
When the probability score for the audio features associated with “Hey Poodle” satisfies the hotword detection threshold, the first stage hotword detector 210 further transmits audio data 120 characterizing the hotword event to a second stage hotword detector 220 executing on the remote system 110. In some examples, the audio data 120 is a direct representation of the streaming audio 118, while in other examples, the audio data 120 represents the streaming audio 118 after processing by the first stage hotword detector 210 (e.g., to identify and/or isolate specific audio characteristics of the streaming audio 118 or to convert the streaming audio 118 to a format suitable for transmission and/or processing by the second stage hotword detector 220). For instance, the audio data 120 includes a first segment 121 chopped from the streaming audio 118 that includes the relevant audio features associated with the presence of the hotword detected by the first stage hotword detector 210. The audio data 120 also includes a second segment 122 that includes audio features captured by the user device 102 in the streaming audio 118 that follow the first segment 121. Typically, the first segment 121 is of a fixed duration sufficient for containing audio features associated with the designated hotword. The second segment 122, however, may have a variable duration containing audio captured by the user device 102 while the microphone 106 is open. The second segment 122 may capture a query-type utterance that requires further processing (e.g., automated speech recognition and/or semantic interpretation) on one or more terms to identify a query or command in the audio data.
In the example scenario, since the first segment 121 includes audio features captured by the user device 102 that are not associated with the user 10 speaking the hotword, but are rather associated with another term/phrase that the user 10 pronounces similarly to the hotword, the user 10 did not intend to invoke the device 102 through speech, and therefore, the second segment 122 is likely not to include a query-type utterance, but instead to include background noise captured in the streaming audio 118 from the environment of the user device 102.
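The division of the audio data into a fixed-duration first segment and a variable-duration second segment may be sketched as follows. The function name, the use of a sample index marking the end of the hotword event, and the window length parameter are hypothetical and are introduced only to illustrate the segmentation.

```python
from typing import List, Sequence, Tuple


def split_audio_data(
    samples: Sequence[float],
    event_end: int,            # assumed sample index where the hotword event ends
    first_segment_len: int,    # assumed fixed length of the first segment
) -> Tuple[List[float], List[float]]:
    """Split captured samples into (first_segment, second_segment).

    The first segment is a fixed-length window ending at the hotword event;
    the second segment is everything captured afterwards, so its length
    depends on how long the microphone remained open.
    """
    start = max(0, event_end - first_segment_len)
    first_segment = list(samples[start:event_end])
    second_segment = list(samples[event_end:])
    return first_segment, second_segment
```

In this sketch the first segment is shortened only when the hotword event occurs before a full window of audio has been captured.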
The first stage hotword detector 210 and the second stage hotword detector 220 cooperate to form a cascade hotword detection architecture 200 whereby the second stage hotword detector 220 is configured to confirm whether or not a hotword detected by the first stage hotword detector 210 is present in the audio data 120. Specifically, the second stage hotword detector 220 executing on the remote system 110 processes the audio data 120 to determine whether a hotword is detected by the second stage hotword detector 220 in the first segment 121 of the audio data 120. In some examples, the second stage hotword detector 220 is implemented as an automated speech recognition (ASR) engine that performs speech recognition on the first segment 121 of the audio data 120 to determine whether the hotword is present. The second stage hotword detector 220 may detect the presence of the hotword in the first segment 121 when a probability of recognizing the hotword satisfies a hotword detection threshold.
In other examples, the second stage hotword detector 220 is similar to the first stage hotword detector 210 in that the second stage hotword detector 220 is a model implemented as a trained neural network configured to detect the presence of the hotword in the first audio segment 121 without performing semantic analysis or speech recognition processing. In these examples, the second stage hotword detector 220 may be associated with a larger version of the hotword detection model used by the first stage hotword detector 210 and include a different neural network that is potentially more computationally intensive than the neural network of the first stage hotword detector 210, thereby offering an increased hotword detection accuracy over the first stage hotword detector 210, which is limited by resources of the user device 102. The second stage hotword detector 220 may generate a probability score indicating the presence of the hotword in the first segment 121 of the audio data 120 and detect the presence of the hotword when the probability score satisfies a hotword detection threshold of the second stage hotword detector 220. Here, a value of the hotword detection threshold at the second stage hotword detector 220 may be the same as or different from a value of the hotword detection threshold at the first stage hotword detector 210. In some examples, the value of the hotword detection threshold at the second stage hotword detector 220 is set higher to require the second stage hotword detector 220 to be more confident when determining whether or not a hotword is present in the first audio segment 121.
In some implementations, the second stage hotword detector 220 executes on the user device 102 (e.g., the data processing hardware 103) to implement the entire cascade hotword detection architecture 200 on-device without the use of the remote system 110. The second stage hotword detector 220, when executing on the user device 102, can be implemented as an on-device ASR engine to detect the presence of the hotword by performing speech recognition on the first audio segment 121, or as a larger version of the hotword detection model implemented by the first stage hotword detector 210 to detect the presence of the hotword in the first audio segment 121 without performing speech recognition.
FIG. 2 provides an example of the cascade hotword detection architecture 200 of FIG. 1 including the first stage hotword detector 210, the second stage hotword detector 220, and optionally the initial coarse hotword detector 205. In some examples, when the user device 102 is a battery-powered device, the data processing hardware 103 of the user device 102 collectively includes a first processor 60 (e.g., digital signal processor (DSP)) and a second processor 70 (e.g., application processor (AP)). The first processor 60 consumes less power while operating than the second processor 70 consumes while operating. As used herein, the first processor 60 may be interchangeably referred to as a “DSP” and the second processor 70 may be interchangeably referred to as an “AP” or a “device SoC”. The initial coarse hotword detector 205 may run on the first processor 60 and the first stage hotword detector 210 may run on the second processor 70. The second stage hotword detector 220 may execute on a server (e.g., remote system 110) in communication with the user device 102 to provide server-side hotword confirmation that leverages increased processing capabilities at the server. Alternatively, the second stage hotword detector 220 may run on the second processor 70 of the user device 102 to implement the entire cascade hotword detection architecture 200 on-device.
Generally, the coarse hotword detector 205 resides on a specialized DSP 60, includes a smaller model size than a model associated with the first stage hotword detector 210, and is computationally efficient for coarsely screening input streaming audio 118 for hotword detection. Accordingly, the specialized DSP 60 (e.g., first processor) may be “always-on” so that the coarse hotword detector 205 is always running to coarsely screen for hotword candidates in multi-channel audio 118, while all other components of the user device 102, including the main AP 70 (e.g., second processor), are in a sleep state/mode to conserve battery life. On the other hand, the first stage hotword detector 210 resides on the main AP 70, includes a larger model size, and provides more computational output than the coarse hotword detector 205 for providing a more accurate detection of the hotword that was initially detected by the coarse hotword detector 205. Thus, the first stage hotword detector 210 may be more stringent in determining whether or not the hotword is present in the audio 118. While the DSP 60 is “always-on”, the more power-consuming main AP 70 operates in a sleep mode to conserve battery life until the coarse hotword detector 205 at the DSP 60 detects the candidate hotword in the streaming audio 118. Thus, only once the candidate hotword is detected does the DSP 60 trigger the main AP 70 to transition from the sleep mode into a hotword detection mode for running the first stage hotword detector 210.
Upon receiving the streaming audio 118, the always-on DSP 60 executes/runs the coarse hotword detector 205 for determining whether a hotword is detected in the respective audio features of the streaming audio 118. Notably, the AP 70 may operate in the sleep mode when the multi-channel audio is received at the DSP 60 and while the DSP 60 processes the respective audio features of the streaming audio 118.
When the coarse hotword detector 205 detects the hotword in the streaming audio 118, the DSP 60 provides chomped audio data 120 to the AP 70. In some examples, the DSP 60 providing the chomped audio data 120 to the AP 70 triggers/invokes the AP 70 to transition from the sleep mode to the hotword detection mode. In some implementations, the audio data 120 chomped from the streaming audio 118 includes a first segment 121 characterizing the hotword detected by the coarse hotword detector 205 in the streaming audio 118. That is, the first segment 121 includes a duration sufficient to safely contain the detected hotword. Additionally, the audio data 120 includes a second segment 122 following the first segment 121 that may include a duration of audio containing a spoken query. The coarse hotword detector 205 is optional, and the first stage hotword detector 210 may initially detect the hotword event in the streaming audio 118, chomp the audio data 120 including the first and second segments 121, 122, and provide the chomped audio data 120 to the second stage hotword detector 220.
When the hotword is detected by the first stage hotword detector 210 in the first segment 121 of the audio data 120, the AP 70 initiates a wake-up process on the user device 102 and provides the audio data 120 to the second stage hotword detector 220 for processing to determine/confirm whether the hotword is detected by the second stage hotword detector 220 in the first segment 121 of the audio data 120. In examples where the cascade hotword detection architecture 200 is implemented entirely on-device, the AP 70 simply passes the audio data 120 to the second stage hotword detector 220 also running on the AP 70. In other examples, where the second stage hotword detector 220 is implemented at a server 110, the AP 70 instructs the user device 102 to transmit the audio data 120 via a network 104 to the second stage hotword detector 220.
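The control flow of the cascade described with reference to FIG. 2 can be summarized in a short sketch. It assumes each detector is a callable returning a probability score and that a score satisfies a threshold when it is greater than or equal to that threshold; the function name `run_cascade`, the returned status strings, and all threshold values are hypothetical.

```python
from typing import Callable, Optional, Sequence

# A detector maps the first segment's samples to a probability score (assumption).
Detector = Callable[[Sequence[float]], float]


def run_cascade(
    first_segment: Sequence[float],
    first_stage: Detector,
    second_stage: Detector,
    first_threshold: float,
    second_threshold: float,
    coarse: Optional[Detector] = None,
    coarse_threshold: float = 0.0,
) -> str:
    """Run the optional coarse detector, then the first and second stages."""
    # Optional always-on coarse screen (e.g., on the DSP); the AP stays asleep on a miss.
    if coarse is not None and coarse(first_segment) < coarse_threshold:
        return "ap_asleep"
    # First stage detector (e.g., on the AP) decides whether a hotword event occurred.
    if first_stage(first_segment) < first_threshold:
        return "rejected_by_first_stage"
    # Second stage confirms or rejects the hotword event.
    if second_stage(first_segment) < second_threshold:
        return "rejected_by_second_stage"  # candidate negative hotword
    return "confirmed"
```

A result of `"rejected_by_second_stage"` corresponds to the case of interest in this disclosure: the first stage triggered a hotword event that the second stage did not confirm.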
Referring back to FIG. 1, in some examples, when the hotword is not detected by the second stage hotword detector 220 in the first segment 121 of the audio data 120 (i.e., the probability score fails to satisfy the hotword detection threshold), a negative hotword classifier 300 executing on the remote system 110 classifies the first segment 121 of the audio data 120 as containing a negative hotword (e.g., “Hey Poodle”) that caused a false detection of the hotword event in the streaming audio 118 by the first stage hotword detector 210. In configurations where the second stage hotword detector 220 runs on the user device 102, the negative hotword classifier 300 may also execute on the user device 102. The negative hotword classifier 300 may provide a classification result 170 to a negative hotword updater 400 that indicates the classification of the first segment 121 of the audio data 120 as containing the negative hotword. The classification result 170 may provide the probability score 171 generated by the second stage hotword detector 220 for the first audio segment 121 and/or other pertinent information 172 associated with the corresponding classification result 170 that may be useful to the negative hotword updater 400 for updating the first stage hotword detector 210. The other pertinent information 172 may include, without limitation, a transcript of the utterance 119, a speaker identification score (e.g., a speaker embedding) identifying the speaker characteristics of the user 10 that spoke the utterance 119, a time stamp of the hotword event indicating the time of day and/or a day of the week, and a negative hotword confidence score 304 (FIG. 3) indicating a confidence for classifying the first segment 121 as the negative hotword. In the example shown, the negative hotword updater 400 executes on the user device 102 for personalizing hotword detection on the user device 102 by updating the first stage hotword detector 210 to prevent triggering the hotword event in subsequent audio data that contains the negative hotword.
Sending the classification result 170 to the negative hotword updater 400 may cause the negative hotword updater 400 to update the first stage hotword detector 210 to prevent triggering the hotword event in subsequent audio data that contains the negative hotword (e.g., “Hey Poodle”). In some implementations, when the hotword is not detected by the second stage hotword detector 220 in the first segment 121 of the audio data 120, the remote system 110 (or the user device 102) suppresses a wake-up process on the user device 102 for processing the hotword and/or one or more other terms following the hotword in the streaming audio 118. In some implementations, the remote system 110 suppresses the wake-up process by sending suppression instructions 160 to the user device 102 that cause the user device 102 to suppress the wake-up process. In other implementations, the providing of the classification result 170 indicating that the first segment 121 of the audio data 120 contains the negative hotword causes the user device 102 to suppress the wake-up process. In yet other implementations, the remote system 110 suppresses the wake-up process by not responding to the user device 102 (e.g., by closing the network connection) after receiving the audio data 120. A lack of response from the remote system 110 may cause the user device 102 to suppress the wake-up process. That is, the user device 102, in some examples, only initiates the wake-up process upon receiving confirmation from the second stage hotword detector 220 that the hotword was present in the streaming audio 118. The user device 102 may also independently suppress the wake-up process. For example, when the query or command following the hotword is empty (i.e., the streaming audio 118 following the hotword fails to include a command or query directed at the user device 102), the user device 102 may automatically suppress the wake-up process.
In some scenarios, after suppressing the wake-up process due to the second stage hotword detector 220 not detecting the presence of the hotword in the first segment 121 of the audio data 120, the negative hotword classifier 300 determines whether an immediate follow-up query was provided by the user 10 of the user device 102. This determination may be made when no subsequent hotword event detected by the first stage hotword detector 210 is received by the second stage hotword detector 220. Here, a determination that the user 10 did not provide an immediate follow-up query serves as additional confirmation that the user 10 did not previously intend to speak the hotword in the utterance 119, but rather spoke a term (“Hey Poodle”) with a similar pronunciation to the particular term/phrase (“Hey Google”) designated as the hotword. Accordingly, classifying the first segment 121 of the audio data 120 as containing the negative hotword may be further based on a determination that no follow-up query was received from the user device 102 after suppressing the wake-up process.
In additional examples, when the second stage hotword detector 220 detects the presence of the hotword (“Hey Google”) in the first segment 121 of the audio data 120 (i.e., the probability score satisfies the hotword detection threshold) despite the user 10 actually speaking another similar-sounding phrase (“Hey Poodle”), the second segment 122 (and optionally the first segment 121) of the audio data 120 is provided to the query processor 180. Here, the query processor 180 processes the second segment 122 of the audio data 120 to determine whether the second segment 122 of the audio data 120 is indicative of a spoken query-type utterance. In examples where the second stage hotword detector 220 is implemented as an ASR engine, the query processor 180 processes the resulting speech recognition result by performing semantic analysis to determine whether the second segment 122 is indicative of the query-type utterance. In other examples, where the second stage hotword detector 220 is implemented as a hotword detection model, the query processor 180 is implemented as the ASR engine that processes the second segment 122 of the audio data 120 by performing speech recognition and then performing semantic analysis on the speech recognition result. As used herein, a query-type utterance corresponds to an utterance that was directed to the user device 102, e.g., an utterance directed to a digital assistant interface for querying the digital assistant to perform an operation or action. Thus, when the second segment 122 of the audio data 120 is indicative of the query-type utterance, there is a strong likelihood that the second stage hotword detector 220 was accurate in detecting the presence of the hotword in the first segment 121 of the audio data 120. Otherwise, when the query processor 180 determines that the second segment 122 is not indicative of the query-type utterance, there exists a strong likelihood that the second stage hotword detector 220 was incorrect in detecting the presence of the hotword in the first segment 121.
The query processor 180 may provide a score 182 indicating whether or not the second segment 122 is indicative of a query-type utterance. In some examples, the score 182 is binary, where one of the values zero (0) or one (1) indicates the query-type utterance and the other value indicates that the second segment 122 is not a query-type utterance. In other examples, the score 182 provides a likelihood (e.g., probability) that the second segment 122 is indicative of a query-type utterance. Here, when the score 182 fails to satisfy a query-type utterance threshold, the second segment 122 may not be indicative of the query-type utterance. In the example shown, the negative hotword classifier 300 may receive the score 182 as an input in addition to the determination made by the second stage hotword detector 220 for determining whether or not the first segment 121 of the audio data 120 should be classified as containing the negative hotword.
Thus, when the negative hotword classifier 300 receives an indication from the query processor 180 that the second audio segment 122 of the audio data 120 is not indicative of the spoken query-type utterance, the negative hotword classifier 300 may classify the first segment 121 of the audio data 120 as containing the negative hotword to indicate that the second stage hotword detector 220 provided a false acceptance. The negative hotword classifier 300 may additionally receive the probability score 171 generated by the second stage hotword detector 220 for the first segment 121, whereby a probability score satisfying the hotword detection threshold by only a narrow margin may further bias the negative hotword classifier 300 to classify the first segment 121 as containing the negative hotword. Moreover, after the query processor 180 determines that the second segment 122 is not indicative of the query-type utterance, the negative hotword classifier 300 may also determine whether an immediate follow-up query was provided by the user 10 of the user device 102. As discussed above, a determination that the user 10 did not provide an immediate follow-up query serves as additional confirmation that the user 10 did not previously intend to speak the hotword in the utterance 119, but rather spoke a term (“Hey Poodle”) with a similar pronunciation to the particular term/phrase (“Hey Google”) designated as the hotword. Accordingly, classifying the first segment 121 of the audio data 120 as containing the negative hotword may be further based on the determination that no follow-up query was received from the user device 102.
In some examples, after receiving the audio data 120 characterizing the hotword event detected by the first stage hotword detector 210, the remote system 110 receives a negative user interaction 162 indicating user suppression of a wake-up process on the user device 102. That is, the false acceptance instance by the first stage hotword detector 210 in detecting the hotword event when the user 10 spoke the negative hotword “Hey Poodle” may trigger the user device 102 to initially wake up while waiting for the second stage hotword detector 220 to confirm or reject the presence of the hotword. Here, the user device 102 may provide an audible and/or visual notification to inform the user 10 that the user device 102 is awake, and the user 10 may provide a negative user interaction 162 to revert the user device 102 back to the sleep state since the user 10 did not intend to trigger the wake-up process. For instance, the user 10 may press a physical button on the user device 102, provide a gesture, or, when the user device 102 includes a display, select a graphic rendered in a graphical user interface displayed on the display that causes the user device 102 to revert back to the sleep state. In some implementations, the negative hotword classifier 300 uses the negative user interaction 162 indicating user suppression of the wake-up process on the user device 102 as an input for classifying the first segment 121 of the audio data 120 as containing the negative hotword.
In some additional examples, when the query processor 180 determines that the second segment 122 of the audio data 120 is indicative of the spoken query-type utterance, the query processor 180 provides a query 185 to a search engine 190 (or other downstream application) that contains a transcription of the second segment 122 of the audio data 120. Here, the search engine 190 provides results 192 responsive to the query 185 back to the user device 102. In this scenario, the query processor 180 may have identified the second segment 122 as being indicative of a query-type utterance even though the second segment 122 of the audio data 120 corresponded to background speech or other background audio that was captured by the user device 102 in the streaming audio 118 after the first stage hotword detector 210 detected the false positive hotword event when the user 10 spoke “Hey Poodle”. This background audio may be captured in the streaming audio 118, and the query processor 180 may identify a query-type utterance and provide a corresponding query 185 to the search engine 190 to obtain a result 192. The result 192 may be audibly and/or visually output by the user device 102 to the user 10 even though the user 10 never intended to invoke the user device 102. As a result, the user 10 may provide the negative user interaction 162 indicating that the user 10 negatively interacted with the results 192. For instance, the user 10 may provide a spoken input indicating that the user 10 is confused by the results or a statement that the user 10 did not provide a query. Additionally or alternatively, the user 10 may provide an input indication indicating an instruction/command to dismiss the results 192.
In other scenarios, the result 192 is a prompt from the digital assistant, audibly output from the user device 102, stating that the user 10 needs to provide confirmation for performing an action, e.g., “You asked for the current weather, is that correct?”, whereby the negative user interaction can be the user 10 speaking “No, I did not ask about the weather”. Similarly, the result 192 can be a prompt requesting the user 10 to repeat a query because the query processor 180 was not confident in the query, e.g., “I did not understand your question, please repeat?”, whereby the negative user interaction can be the user 10 expressing confusion by uttering “Huh”, the user 10 affirmatively dismissing the prompt by speaking “I did not ask anything”, or simply the user 10 not responding within a predetermined period of time. Thus, the negative user interaction 162 may be provided to the negative hotword classifier 300 in addition to one or more of the other inputs discussed above, such as an indication that the hotword was not detected by the second stage hotword detector 220 in the first segment 121 of the audio data 120, an indication that the second segment 122 of the audio data 120 is not associated with a query-type utterance, or an indication that the user 10 did not provide an immediate follow-up query after the wake-up process was suppressed.
FIG. 3 shows an example of the negative hotword classifier 300 of FIG. 1 receiving one or more input features 302 for making the determination of whether or not a first segment 121 of audio data 120 should be classified as a negative hotword. When the negative hotword classifier 300 determines that the first segment 121 of the audio data 120 should be classified as the negative hotword based on the one or more input features 302, the negative hotword classifier 300 generates the classification result 170 indicating classification of the first segment 121 of the audio data 120 as the negative hotword, as discussed above with reference to FIG. 1. The one or more input features 302 received by the negative hotword classifier 300 may include, without limitation: whether or not the second stage hotword detector 220 detected the presence of the hotword in the first segment 121 of the audio data 120 and/or the probability score 171; whether an immediate follow-up query was received from the user device 102; whether or not the second segment 122 of the audio data 120 includes a query-type utterance (e.g., as indicated by the score 182); a transcription of the first segment 121 and/or the second segment 122 of the audio data 120; and whether or not a negative user interaction 162 is received indicating user suppression of a wake-up process on the user device 102 and/or that the user 10 negatively interacted with results 192 responsive to processing the second segment 122 (and/or optionally the first segment 121) of the audio data 120 as a query-type utterance (e.g., providing the query 185 to the search engine 190 or other downstream application).
Some input features 302 may be weighted more heavily when determining whether the first segment 121 of the audio data 120 should be classified as the negative hotword. For instance, the second stage hotword detector 220 failing to detect the presence of the hotword in the first segment 121 is a strong indication that the first segment 121 includes a negative hotword that caused the false acceptance instance at the first stage hotword detector 210. The magnitude of the probability score 171 may bias the classification result 170. For instance, the probability score 171 failing to satisfy the hotword detection threshold at the second stage hotword detector 220 by a wide margin indicates a greater likelihood of a negative hotword than if the probability score 171 missed satisfying the hotword detection threshold by a small margin.
In some configurations, the negative hotword classifier 300 includes a trained classifier (which may include a neural network model trained via machine learning) configured to generate a negative hotword confidence score 304 indicating a likelihood that the first segment 121 of the audio data 120 includes a negative hotword. The classifier 300 may classify the first segment 121 as containing the negative hotword when the negative hotword confidence score 304 satisfies a confidence threshold. The negative hotword confidence score 304 may be included in the classification result 170 received by the negative hotword updater 400 of FIG. 1 for use in updating the first stage hotword detector 210 to not detect the hotword event in subsequent audio containing the negative hotword. In some examples, the negative hotword confidence score 304 is a binary score indicating that the first segment 121 of the audio data 120 includes the negative hotword, and thus should be classified as the negative hotword, or that the first segment 121 does not include the negative hotword.
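How weighted input features 302 might combine into a negative hotword confidence score 304 can be sketched as follows. The disclosure describes a trained classifier; the hand-picked weights and the logistic combination below are only a stand-in for illustration, and every name and numeric value (`InputFeatures`, `negative_hotword_confidence`, the weights, the 0.5 confidence threshold) is an assumption rather than part of the disclosure.

```python
import math
from dataclasses import dataclass


@dataclass
class InputFeatures:
    """Hypothetical subset of the input features 302."""
    second_stage_detected: bool      # did the second stage detect the hotword?
    probability_score: float         # second stage score for the first segment
    detection_threshold: float       # second stage hotword detection threshold
    query_type_utterance: bool       # is the second segment a query-type utterance?
    follow_up_query: bool            # did an immediate follow-up query arrive?
    negative_user_interaction: bool  # did the user suppress wake-up or dismiss results?


def negative_hotword_confidence(f: InputFeatures) -> float:
    """Combine features with illustrative weights and squash to (0, 1)."""
    # Signed margin: positive when the score missed the threshold, and larger
    # for a wider miss; slightly negative for a narrow pass.
    margin = f.detection_threshold - f.probability_score
    z = -3.0                                          # bias (assumed)
    z += 3.0 if not f.second_stage_detected else 0.0  # heaviest single feature
    z += 2.0 * margin
    z += 1.5 if not f.query_type_utterance else 0.0
    z += 0.5 if not f.follow_up_query else 0.0
    z += 2.0 if f.negative_user_interaction else 0.0
    return 1.0 / (1.0 + math.exp(-z))


def is_negative_hotword(f: InputFeatures, confidence_threshold: float = 0.5) -> bool:
    """Classify as a negative hotword when the confidence satisfies the threshold."""
    return negative_hotword_confidence(f) >= confidence_threshold
```

With these assumed weights, a wide miss at the second stage yields a higher confidence than a narrow miss, matching the bias described above; a trained model would learn such weighting from data instead.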
Referring to FIGS. 1 and 4, in some examples, the negative hotword updater 400 updates the first stage hotword detector 210 to prevent triggering the hotword event in subsequent audio data that includes a negative hotword by providing the classification result 170 that includes the first segment 121 of the audio data 120 to the user device 102. Here, the user device 102 may be configured to retrain the first stage hotword detector 210 using the first segment 121 of the audio data 120 classified as containing the negative hotword. For example, the first segment 121 of the audio data 120 may be labeled as a negative hotword and provided as a training input to the first stage hotword detector 210 so that the first stage hotword detector 210 learns to not detect the presence of the hotword (“Hey Google”) in subsequent audio data containing the negative hotword (“Hey Poodle”). As used herein, retraining the first stage hotword detector 210 may include retraining an existing hotword detector 210 running on the user device 102 that initially detected the hotword event incorrectly or may include a new first stage hotword detector 210 pushed to the user device 102 at a later time.
Moreover, updating the first stage hotword detector 210 may also include updating the optional initial coarse hotword detector 205 running on a DSP 60 (FIG. 2) when the user device 102 is a battery-powered device. As with the first stage hotword detector 210, updating the initial coarse hotword detector 205 may include retraining the initial coarse hotword detector 205 to prevent triggering hotword events for audio data including a negative hotword. In some examples, only the initial coarse hotword detector 205 is updated to prevent triggering the hotword event in subsequent audio data that includes a negative hotword.
As shown in FIG. 4, the negative hotword updater 400 executes on the user device 102 to store, in the memory hardware 105, each instance of a first segment 121 of audio data 120 classified by the negative hotword classifier 300 as containing a corresponding negative hotword. In the example shown, the user device 102 stores each instance where a first segment 121 of audio data 120 was classified as a negative hotword by storing the corresponding classification result 170 for each instance. Here, the classification result 170 includes the first segment 121 classified as containing the negative hotword, the probability score 171 indicating the likelihood that the first segment 121 included the actual hotword, and the other pertinent information 172, such as the transcript of the utterance 119, a speaker identification score (e.g., a speaker embedding) identifying the speaker characteristics of the user 10 that spoke the utterance 119, a time stamp of the hotword event indicating the time of day and/or a day of the week, and a negative hotword confidence score 304 (FIG. 3) indicating a confidence for classifying the first segment 121 as the negative hotword. In the example shown, the user device 102 stores one or more classification results 170 each associated with a different corresponding negative hotword. For instance, one or more classification results 170Aa-n, 170Ba-n, 170Ca-n may be stored for each of the negative hotwords “Poodle”, “Doodle”, and “Noodle” that, when spoken by the user 10, are pronounced similarly to the designated hotword “Hey Google”.
In some implementations, the user device 102 (via the negative hotword updater 400) is configured to retrain the first stage hotword detector 210 based on an aggregation of the number of instances (e.g., number of classification results 170) of the first segment 121 of the audio data 120 classified as containing the negative hotword stored in the memory hardware 105. Here, the number of instances of audio data classified as containing the same negative hotword satisfying a threshold number of instances may establish a pattern that the user 10 is regularly speaking the negative hotword falsely detected as the designated hotword by the first stage hotword detector 210. In some examples, the user device 102 requires a specified number of false acceptance instances resulting from the user 10 speaking the same term when the negative hotword confidence scores 304 associated with those instances are relatively low, e.g., the negative hotword confidence scores 304 only satisfied the confidence threshold by a narrow margin.
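The aggregation rule can be sketched as a small bookkeeping class. This is one possible reading, not the disclosed implementation: it assumes instances are keyed by a transcribed term, that retraining is triggered once a per-term count is reached, and that a larger count is required when the mean negative hotword confidence score clears the confidence threshold only narrowly. The class name `NegativeHotwordStore` and all numeric defaults are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class NegativeHotwordStore:
    """Hypothetical per-term store of negative hotword confidence scores."""

    def __init__(
        self,
        min_instances: int = 3,                 # assumed default count
        low_confidence_min_instances: int = 5,  # assumed count for narrow margins
        confidence_threshold: float = 0.5,
        narrow_margin: float = 0.1,
    ) -> None:
        self.min_instances = min_instances
        self.low_confidence_min_instances = low_confidence_min_instances
        self.confidence_threshold = confidence_threshold
        self.narrow_margin = narrow_margin
        self._scores: Dict[str, List[float]] = defaultdict(list)

    def add(self, term: str, confidence: float) -> None:
        """Record one classified instance of the given negative hotword."""
        self._scores[term].append(confidence)

    def should_retrain(self, term: str) -> bool:
        """True when enough instances of the same term have accumulated."""
        scores = self._scores.get(term, [])
        if not scores:
            return False
        mean = sum(scores) / len(scores)
        # Require more evidence when confidence only narrowly cleared the threshold.
        narrow = (mean - self.confidence_threshold) < self.narrow_margin
        needed = self.low_confidence_min_instances if narrow else self.min_instances
        return len(scores) >= needed
```

For example, three high-confidence instances of “poodle” would trigger retraining under these defaults, while low-confidence instances of another term would need five.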
With continued reference to FIG. 4, in some examples, the negative hotword updater 400 may append an embedding representation 12 to each classification result 170 stored in the memory hardware 105. Here, the first stage hotword detector 210 may compute an embedding representation 12 for any audio data characterizing a hotword event detected by the first stage hotword detector 210, and when the audio data 120 is subsequently classified by the negative hotword classifier 300 as containing a negative hotword, the negative hotword updater 400 may append the embedding representation 12 to the corresponding instance of the classification result 170. In some implementations, the negative hotword updater 400 aggregates/averages the embedding representations 12 stored in the memory hardware 105 for each corresponding negative hotword to generate a reference embedding 15 for each corresponding negative hotword. For example, a corresponding reference embedding 15 may be generated for each of the negative hotwords “Poodle”, “Noodle”, and “Doodle”.
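Averaging stored embedding representations 12 into a reference embedding 15 can be illustrated with a minimal sketch. It assumes embeddings are equal-length lists of floats and uses a plain element-wise mean; the function name `reference_embedding` is hypothetical, and the disclosure does not specify the aggregation beyond "aggregates/averages".

```python
from typing import List


def reference_embedding(embeddings: List[List[float]]) -> List[float]:
    """Element-wise mean of the embeddings stored for one negative hotword."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]
```

One such reference embedding would be computed per negative hotword, e.g., one from the instances stored for “Poodle” and another from those stored for “Noodle”.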
FIG. 5 shows a schematic view 500 depicting an example where the user device 102 captures subsequent audio data 120 corresponding to another utterance 519 spoken by the user 10 that includes the term “My Poodle” and that causes the first stage hotword detector 210 to falsely detect another hotword event. The first stage hotword detector 210 executing on the user device 102 computes an evaluation embedding representation 18 for the subsequent audio data 120 (e.g., the portion of the subsequent audio data 120 characterizing the hotword event). The user device 102 may simultaneously access the classification result(s) 170 stored in the memory hardware 105 that may each include an embedding representation 12 of the corresponding first segment 121 of audio data 120 that was classified as one of the negative hotwords (e.g., “Poodle”, “Noodle”, and “Doodle”). Additionally or alternatively, the user device 102 may access a corresponding reference embedding 15 generated for each of the negative hotwords as described above with reference to FIG. 4.
In some implementations, a scorer 510 compares the evaluation embedding representation 18 computed for the subsequent audio data 120 with all of the reference embeddings 15 generated and stored for each of the negative hotwords. In these implementations, the reference embedding representations 12, 15 associated with each one of the negative hotwords “Poodle”, “Doodle”, and “Noodle” will be clustered together in an embedding representation space, distinct from the clusters of the reference embeddings associated with the other negative hotwords. Accordingly, the scorer 510 may determine a similarity score 515 between the evaluation embedding representation 18 for the subsequent audio data 120 and the reference embedding representation 12 computed for each first segment 121 of audio data 120 that was classified as any of the negative hotwords. Additionally or alternatively, the scorer 510 may determine a similarity score 515 between the evaluation embedding representation 18 and each corresponding reference embedding representation 15 that represents an aggregate/average embedding representation for a corresponding one of the negative hotwords (e.g., “Poodle”, “Doodle”, and “Noodle”). In some examples, each similarity score 515 is associated with a distance (e.g., a cosine distance) between the evaluation embedding representation 18 and the reference embedding representation 12, 15 in the embedding representation space.
After the scorer 510 determines/generates the similarity score(s) 515, a classifier 520 may compare each similarity score 515 to a similarity score threshold and determine/classify the subsequent audio data 120 as including the negative hotword when the similarity score 515 satisfies the similarity score threshold. In some examples, the similarity score threshold represents a maximum allowable cosine distance between embedding representations associated with a same negative hotword. In some scenarios, when a similarity score 515 is computed between the evaluation embedding 18 for the subsequent audio data 120 and a corresponding reference embedding representation 12 computed for each instance of a first segment 121 of audio data 120 being classified as a negative hotword, multiple similarity scores 515 may satisfy the similarity score threshold. To illustrate, in the example shown, the similarity scores 515 between the evaluation embedding 18 and the corresponding reference embedding representations 12 classified as the negative hotword “Poodle” will satisfy the similarity score threshold to indicate that the evaluation embedding 18 falls into a cluster of the embedding representations 12 classified as the negative hotword “Poodle” and outside the clusters of embedding representations 12 classified as the other negative hotwords “Doodle” and “Noodle”. Thus, when the classifier 520 determines that the similarity score 515 satisfies the similarity score threshold, the classifier 520 determines that the subsequent audio data 120 includes the negative hotword. As a result, the classifier 520 may instruct the first stage hotword detector 210 to suppress detecting the hotword event in the subsequent audio data 120, or may instruct the user device 102 to revert back to a sleep state if the first stage hotword detector 210 falsely detected the hotword event and triggered initiation of a wake-up process on the user device 102.
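As a non-limiting sketch, the decision made by the classifier 520 may be expressed as follows. The sketch assumes that each similarity score is a cosine distance and that a score satisfies the threshold when it does not exceed a maximum allowable distance. The function names, the returned action strings, and the example values are hypothetical and are chosen only for illustration.

```python
from typing import Dict, Optional


def classify_negative_hotword(
    scores: Dict[str, float], max_distance: float
) -> Optional[str]:
    """Return the closest negative hotword within the threshold, else None."""
    matches = {term: d for term, d in scores.items() if d <= max_distance}
    if not matches:
        return None
    # Several scores may satisfy the threshold; keep the nearest one.
    return min(matches, key=matches.get)


def handle_hotword_event(
    scores: Dict[str, float], max_distance: float, wake_up_started: bool
) -> str:
    """Map the classification to an action (action names are hypothetical)."""
    if classify_negative_hotword(scores, max_distance) is None:
        return "proceed"  # no negative hotword matched; continue normally
    # A negative hotword matched: undo a wake-up already in progress,
    # otherwise suppress the hotword event before it triggers one.
    return "revert_to_sleep" if wake_up_started else "suppress"


# Example with made-up distances for the utterance "My Poodle".
example_scores = {"poodle": 0.05, "noodle": 0.40, "doodle": 0.50}
print(classify_negative_hotword(example_scores, max_distance=0.2))
print(handle_hotword_event(example_scores, 0.2, wake_up_started=False))
```

Returning the nearest matching term is one possible way to resolve the case in which multiple similarity scores satisfy the threshold; other tie-breaking policies are equally possible.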
FIG. 6 provides a flowchart of example operations for a method 600 of personalizing a hotword detector on a user device based on classifying specific terms as negative hotwords that caused previous false positive hotword detections by the hotword detector. At operation 602, the method 600 includes receiving, at data processing hardware 103, 112, audio data 120 characterizing a hotword event detected by a first stage hotword detector 210 in streaming audio 118 captured by a user device 102. The first stage hotword detector 210 may execute on a digital signal processor (DSP) of the data processing hardware 103 of the user device 102 or execute on an application processor of the data processing hardware 103 of the user device 102.
At operation 604, the method 600 includes processing, by the data processing hardware 103, 112, using a second stage hotword detector 220, the audio data 120 to determine whether a hotword is detected by the second stage hotword detector 220 in a first segment 121 of the audio data 120. The second stage hotword detector 220 may be implemented as an ASR engine that performs automated speech recognition to determine whether the hotword is recognized in the first segment 121. In other configurations, the second stage hotword detector 220 may be implemented as a hotword detection model that determines whether or not the hotword is detected in the first segment 121 without performing speech recognition.
At operation 606, when the hotword is not detected by the second stage hotword detector 220 in the first segment 121 of the audio data 120, the method 600 includes classifying, by the data processing hardware 103, 112, the first segment 121 of the audio data 120 as containing a negative hotword that caused a false detection of the hotword event in the streaming audio 118 by the first stage hotword detector 210. At operation 608, the method 600 includes updating, by the data processing hardware 103, 112, the first stage hotword detector 210 to prevent triggering the hotword event in subsequent audio data 120 that contains the negative hotword.
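For illustration, operations 602 through 608 may be sketched as the following control flow. The sketch assumes the second stage hotword detector is supplied as a callable and that updating the first stage hotword detector amounts to recording the segment as a negative example. The class FirstStageDetector and the function run_method_600 are hypothetical names, and the actual update mechanism may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class FirstStageDetector:
    """Hypothetical stand-in for the first stage hotword detector."""

    negatives: List[Sequence[float]] = field(default_factory=list)

    def update(self, segment: Sequence[float]) -> None:
        # Assumption: the update simply stores the negative segment.
        self.negatives.append(segment)


def run_method_600(
    first_segment: Sequence[float],
    second_stage_detects_hotword: Callable[[Sequence[float]], bool],
    first_stage: FirstStageDetector,
) -> bool:
    """Return True if the hotword is confirmed, False if it was a negative."""
    # Operation 602: the audio data for the hotword event is received
    # (here, passed in as first_segment).
    # Operation 604: the second stage detector processes the segment.
    if second_stage_detects_hotword(first_segment):
        return True
    # Operation 606: the segment is classified as a negative hotword.
    # Operation 608: the first stage detector is updated accordingly.
    first_stage.update(first_segment)
    return False
```

A caller might pass a function wrapping an ASR engine or a hotword detection model as second_stage_detects_hotword, consistent with the two configurations of the second stage hotword detector described above.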
A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an “application,” an “app,” or a “program.” Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.
FIG. 7 is a schematic view of an example computing device 700 that may be used to implement the systems and methods described in this document. The computing device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
The computing device 700 includes a processor 710, memory 720, a storage device 730, a high-speed interface/controller 740 connecting to the memory 720 and high-speed expansion ports 750, and a low-speed interface/controller 760 connecting to a low-speed bus 770 and the storage device 730. Each of the components 710, 720, 730, 740, 750, and 760 is interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 710 can process instructions for execution within the computing device 700, including instructions stored in the memory 720 or on the storage device 730 to display graphical information for a graphical user interface (GUI) on an external input/output device, such as a display 780 coupled to the high-speed interface 740. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 700 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 720 stores information non-transitorily within the computing device 700. The memory 720 may be a computer-readable medium, a volatile memory unit(s), or non-volatile memory unit(s). The non-transitory memory 720 may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 700. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM) as well as disks or tapes.
The storage device 730 is capable of providing mass storage for the computing device 700. In some implementations, the storage device 730 is a computer-readable medium. In various different implementations, the storage device 730 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 720, the storage device 730, or memory on the processor 710.
The high-speed controller 740 manages bandwidth-intensive operations for the computing device 700, while the low-speed controller 760 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 740 is coupled to the memory 720, the display 780 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 750, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 760 is coupled to the storage device 730 and a low-speed expansion port 790. The low-speed expansion port 790, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
The computing device 700 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 700a or multiple times in a group of such servers 700a, as a laptop computer 700b, or as part of a rack server system 700c.
Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.