US12094489B2 - Voice onset detection - Google Patents

Voice onset detection

Info

Publication number
US12094489B2
Authority
US
United States
Prior art keywords
voice
audio signal
probability
microphone
voice activity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US18/459,342
Other versions
US20230410835A1 (en)
Inventor
Jung-Suk Lee
Jean-Marc Jot
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Magic Leap Inc
Original Assignee
Magic Leap Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Magic Leap Inc
Priority to US18/459,342
Publication of US20230410835A1
Assigned to MAGIC LEAP, INC. (assignment of assignors interest; see document for details). Assignors: LEE, JUNG-SUK; JOT, JEAN-MARC
Priority to US18/764,006 (published as US20250006219A1)
Application granted
Publication of US12094489B2
Legal status: Active
Anticipated expiration

Abstract

In some embodiments, a first audio signal is received via a first microphone, and a first probability of voice activity is determined based on the first audio signal. A second audio signal is received via a second microphone, and a second probability of voice activity is determined based on the first and second audio signals. Whether a first threshold of voice activity is met is determined based on the first and second probabilities of voice activity. In accordance with a determination that a first threshold of voice activity is met, it is determined that a voice onset has occurred, and an alert is transmitted to a processor based on the determination that the voice onset has occurred. In accordance with a determination that a first threshold of voice activity is not met, it is not determined that a voice onset has occurred.

Description

RELATED APPLICATIONS
This patent application is a continuation of U.S. patent application Ser. No. 17/714,708, filed Apr. 6, 2022, which is a continuation of U.S. patent application Ser. No. 16/987,267, filed Aug. 6, 2020, now U.S. Pat. No. 11,328,740, which claims priority to U.S. Provisional Patent Application No. 62/884,143, filed Aug. 7, 2019, and U.S. Provisional Patent Application No. 63/001,118, filed Mar. 27, 2020, all of which are hereby incorporated by reference in their entirety.
FIELD
This disclosure relates in general to systems and methods for processing speech signals, and in particular to systems and methods for processing a speech signal to determine the onset of voice activity.
BACKGROUND
Systems for speech recognition are tasked with receiving audio input representing human speech, typically via one or more microphones, and processing the audio input to determine words, logical structures, or other outputs corresponding to that audio input. For example, automatic speech recognition (ASR) systems may generate a text output based on the human speech corresponding to an audio input signal; and natural language processing (NLP) tools may generate logical structures, or computer data, corresponding to the meaning of that human speech. While such systems may contain any number of components, at the heart of such systems is a speech processing engine, which is a component that accepts an audio signal as input, performs some recognition logic on the input, and outputs some text corresponding to that input.
Historically, audio input was provided to speech processing engines in a structured, predictable manner. For example, a user might speak directly into a microphone of a desktop computer in response to a first prompt (e.g., “Begin Speaking Now”); immediately after pressing a first button input (e.g., a “start” or “record” button, or a microphone icon in a software interface); or after a significant period of silence. Similarly, a user might stop providing microphone input in response to a second prompt (e.g., “Stop Speaking”); immediately before pressing a second button input (e.g., a “stop” or “pause” button); or by remaining silent for a period of time. Such structured input sequences left little doubt as to when the user was providing input to a speech processing engine (e.g., between a first prompt and a second prompt, or between pressing a start button and pressing a stop button). Moreover, because such systems typically required deliberate action on the part of the user, it could generally be assumed that a user's speech input was directed to the speech processing engine, and not to some other listener (e.g., a person in an adjacent room). Accordingly, many speech processing engines of the time may not have had any particular need to identify, from microphone input, which portions of the input were directed to the speech processing engine and were intended to provide speech recognition input, and conversely, which portions were not.
The ways in which users provide speech recognition input have changed as speech processing engines have become more pervasive and more fully integrated into users' everyday lives. For example, some automated voice assistants are now housed in or otherwise integrated with household appliances, automotive dashboards, smart phones, wearable devices, "living room" devices (e.g., devices with integrated "smart" voice assistants), and other environments far removed from the conventional desktop computer. In many cases, speech processing engines are made more broadly usable by this level of integration into everyday life. However, these systems would be made cumbersome by system prompts, button inputs, and other conventional mechanisms for demarcating microphone input to the speech processing engine. Instead, some such systems place one or more microphones in an "always on" state, in which the microphones listen for a "wake-up word" (e.g., the "name" of the device or any other predetermined word or phrase) that denotes the beginning of a speech recognition input sequence. Upon detecting the wake-up word, the speech processing engine can process the following sequence of microphone input as input to the speech processing engine.
While the wake-up word system replaces the need for discrete prompts or button inputs for speech processing engines, it can be desirable to minimize the amount of time the wake-up word system is required to be active. For example, mobile devices operating on battery power benefit from both power efficiency and the ability to invoke a speech processing engine (e.g., invoking a smart voice assistant via a wake-up word). For mobile devices, constantly running the wake-up word system to detect the wake-up word may undesirably reduce the device's power efficiency. Ambient noises or speech other than the wake-up word may be continually processed and transcribed, thereby continually consuming power. However, processing and transcribing ambient noises or speech other than the wake-up word may not justify the required power consumption. It therefore can be desirable to minimize the amount of time the wake-up word system is required to be active without compromising the device's ability to invoke a speech processing engine.
In addition to reducing power consumption, it is also desirable to improve the accuracy of speech recognition systems. For example, a user who wishes to invoke a smart voice assistant may become frustrated if the smart voice assistant does not accurately respond to the wake-up word. The smart voice assistant may respond to an acoustic event that is not the wake-up word (i.e., false positives), the assistant may fail to respond to the wake-up word (i.e., false negatives), or the assistant may respond too slowly to the wake-up word (i.e., lag). Inaccurate responses to the wake-up word like the above examples may frustrate the user, leading to a degraded user experience. The user may further lose trust in the reliability of the product's speech processing engine interface. It therefore can be desirable to develop a speech recognition system that accurately responds to user input.
BRIEF SUMMARY
Examples of the disclosure describe systems and methods for determining a voice onset. According to an example method, a first audio signal is received via a first microphone, and a first probability of voice activity is determined based on the first audio signal. A second audio signal is received via a second microphone, and a second probability of voice activity is determined based on the first and second audio signals. Whether a first threshold of voice activity is met is determined based on the first and second probabilities of voice activity. In accordance with a determination that a first threshold of voice activity is met, it is determined that a voice onset has occurred and an alert is transmitted to a processor based on the determination that the voice onset has occurred. In accordance with a determination that a first threshold of voice activity is not met, it is not determined that a voice onset has occurred.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an example wearable system according to some embodiments of the disclosure.
FIG. 2 illustrates an example handheld controller according to some embodiments of the disclosure.
FIG. 3 illustrates an example auxiliary unit according to some embodiments of the disclosure.
FIG. 4 illustrates an example functional block diagram for an example wearable system according to some embodiments of the disclosure.
FIG. 5 illustrates an example flow chart for determining an onset of voice activity according to some embodiments of the disclosure.
FIGS. 6A-6C illustrate examples of processing input audio signals according to some embodiments of the disclosure.
FIGS. 7A-7E illustrate examples of processing input audio signals according to some embodiments of the disclosure.
FIG. 8 illustrates an example of determining an onset of voice activity according to some embodiments of the disclosure.
FIG. 9 illustrates an example of determining an onset of voice activity according to some embodiments of the disclosure.
FIG. 10 illustrates an example mixed reality (MR) system, according to some embodiments of the disclosure.
FIGS. 11A-11C illustrate example signal processing steps, according to some embodiments of the disclosure.
DETAILED DESCRIPTION
In the following description of examples, reference is made to the accompanying drawings which form a part hereof, and in which it is shown by way of illustration specific examples that can be practiced. It is to be understood that other examples can be used and structural changes can be made without departing from the scope of the disclosed examples.
Example Wearable System
FIG. 1 illustrates an example wearable head device 100 configured to be worn on the head of a user. Wearable head device 100 may be part of a broader wearable system that comprises one or more components, such as a head device (e.g., wearable head device 100), a handheld controller (e.g., handheld controller 200 described below), and/or an auxiliary unit (e.g., auxiliary unit 300 described below). In some examples, wearable head device 100 can be used for virtual reality, augmented reality, or mixed reality systems or applications. Wearable head device 100 can comprise one or more displays, such as displays 110A and 110B (which may comprise left and right transmissive displays, and associated components for coupling light from the displays to the user's eyes, such as orthogonal pupil expansion (OPE) grating sets 112A/112B and exit pupil expansion (EPE) grating sets 114A/114B); left and right acoustic structures, such as speakers 120A and 120B (which may be mounted on temple arms 122A and 122B, and positioned adjacent to the user's left and right ears, respectively); one or more sensors such as infrared sensors, accelerometers, GPS units, inertial measurement units (IMUs, e.g., IMU 126), acoustic sensors (e.g., microphones 150); orthogonal coil electromagnetic receivers (e.g., receiver 127 shown mounted to the left temple arm 122A); left and right cameras (e.g., depth (time-of-flight) cameras 130A and 130B) oriented away from the user; and left and right eye cameras oriented toward the user (e.g., for detecting the user's eye movements) (e.g., eye cameras 128A and 128B). However, wearable head device 100 can incorporate any suitable display technology, and any suitable number, type, or combination of sensors or other components without departing from the scope of the invention. In some examples, wearable head device 100 may incorporate one or more microphones 150 configured to detect audio signals generated by the user's voice; such microphones may be positioned adjacent to the user's mouth. In some examples, wearable head device 100 may incorporate networking features (e.g., Wi-Fi capability) to communicate with other devices and systems, including other wearable systems. Wearable head device 100 may further include components such as a battery, a processor, a memory, a storage unit, or various input devices (e.g., buttons, touchpads); or may be coupled to a handheld controller (e.g., handheld controller 200) or an auxiliary unit (e.g., auxiliary unit 300) that comprises one or more such components. In some examples, sensors may be configured to output a set of coordinates of the head-mounted unit relative to the user's environment, and may provide input to a processor performing a Simultaneous Localization and Mapping (SLAM) procedure and/or a visual odometry algorithm. In some examples, wearable head device 100 may be coupled to a handheld controller 200 and/or an auxiliary unit 300, as described further below.
FIG. 2 illustrates an example mobile handheld controller component 200 of an example wearable system. In some examples, handheld controller 200 may be in wired or wireless communication with wearable head device 100 and/or auxiliary unit 300 described below. In some examples, handheld controller 200 includes a handle portion 220 to be held by a user, and one or more buttons 240 disposed along a top surface 210. In some examples, handheld controller 200 may be configured for use as an optical tracking target; for example, a sensor (e.g., a camera or other optical sensor) of wearable head device 100 can be configured to detect a position and/or orientation of handheld controller 200—which may, by extension, indicate a position and/or orientation of the hand of a user holding handheld controller 200. In some examples, handheld controller 200 may include a processor, a memory, a storage unit, a display, or one or more input devices, such as described above. In some examples, handheld controller 200 includes one or more sensors (e.g., any of the sensors or tracking components described above with respect to wearable head device 100). In some examples, sensors can detect a position or orientation of handheld controller 200 relative to wearable head device 100 or to another component of a wearable system. In some examples, sensors may be positioned in handle portion 220 of handheld controller 200, and/or may be mechanically coupled to the handheld controller. Handheld controller 200 can be configured to provide one or more output signals, corresponding, for example, to a pressed state of the buttons 240; or a position, orientation, and/or motion of the handheld controller 200 (e.g., via an IMU). Such output signals may be used as input to a processor of wearable head device 100, to auxiliary unit 300, or to another component of a wearable system. In some examples, handheld controller 200 can include one or more microphones to detect sounds (e.g., a user's speech, environmental sounds), and in some cases provide a signal corresponding to the detected sound to a processor (e.g., a processor of wearable head device 100).
FIG. 3 illustrates an example auxiliary unit 300 of an example wearable system. In some examples, auxiliary unit 300 may be in wired or wireless communication with wearable head device 100 and/or handheld controller 200. The auxiliary unit 300 can include a battery to provide energy to operate one or more components of a wearable system, such as wearable head device 100 and/or handheld controller 200 (including displays, sensors, acoustic structures, processors, microphones, and/or other components of wearable head device 100 or handheld controller 200). In some examples, auxiliary unit 300 may include a processor, a memory, a storage unit, a display, one or more input devices, and/or one or more sensors, such as described above. In some examples, auxiliary unit 300 includes a clip 310 for attaching the auxiliary unit to a user (e.g., a belt worn by the user). An advantage of using auxiliary unit 300 to house one or more components of a wearable system is that doing so may allow large or heavy components to be carried on a user's waist, chest, or back—which are relatively well suited to support large and heavy objects—rather than mounted to the user's head (e.g., if housed in wearable head device 100) or carried by the user's hand (e.g., if housed in handheld controller 200). This may be particularly advantageous for relatively heavy or bulky components, such as batteries.
FIG. 4 shows an example functional block diagram that may correspond to an example wearable system 400, such as may include example wearable head device 100, handheld controller 200, and auxiliary unit 300 described above. In some examples, the wearable system 400 could be used for virtual reality, augmented reality, or mixed reality applications. As shown in FIG. 4, wearable system 400 can include example handheld controller 400B, referred to here as a "totem" (and which may correspond to handheld controller 200 described above); the handheld controller 400B can include a totem-to-headgear six degree of freedom (6DOF) totem subsystem 404A. Wearable system 400 can also include example headgear device 400A (which may correspond to wearable head device 100 described above); the headgear device 400A includes a totem-to-headgear 6DOF headgear subsystem 404B. In the example, the 6DOF totem subsystem 404A and the 6DOF headgear subsystem 404B cooperate to determine six coordinates (e.g., offsets in three translation directions and rotation along three axes) of the handheld controller 400B relative to the headgear device 400A. The six degrees of freedom may be expressed relative to a coordinate system of the headgear device 400A. The three translation offsets may be expressed as X, Y, and Z offsets in such a coordinate system, as a translation matrix, or as some other representation. The rotation degrees of freedom may be expressed as a sequence of yaw, pitch, and roll rotations; as vectors; as a rotation matrix; as a quaternion; or as some other representation. In some examples, one or more depth cameras 444 (and/or one or more non-depth cameras) included in the headgear device 400A, and/or one or more optical targets (e.g., buttons 240 of handheld controller 200 as described above, or dedicated optical targets included in the handheld controller), can be used for 6DOF tracking. In some examples, the handheld controller 400B can include a camera, as described above, and the headgear device 400A can include an optical target for optical tracking in conjunction with the camera. In some examples, the headgear device 400A and the handheld controller 400B each include a set of three orthogonally oriented solenoids which are used to wirelessly send and receive three distinguishable signals. By measuring the relative magnitude of the three distinguishable signals received in each of the coils used for receiving, the 6DOF of the handheld controller 400B relative to the headgear device 400A may be determined. In some examples, 6DOF totem subsystem 404A can include an Inertial Measurement Unit (IMU) that is useful to provide improved accuracy and/or more timely information on rapid movements of the handheld controller 400B.
In some examples involving augmented reality or mixed reality applications, it may be desirable to transform coordinates from a local coordinate space (e.g., a coordinate space fixed relative to headgear device 400A) to an inertial coordinate space, or to an environmental coordinate space. For instance, such transformations may be necessary for a display of headgear device 400A to present a virtual object at an expected position and orientation relative to the real environment (e.g., a virtual person sitting in a real chair, facing forward, regardless of the position and orientation of headgear device 400A), rather than at a fixed position and orientation on the display (e.g., at the same position in the display of headgear device 400A). This can maintain an illusion that the virtual object exists in the real environment (and does not, for example, appear positioned unnaturally in the real environment as the headgear device 400A shifts and rotates). In some examples, a compensatory transformation between coordinate spaces can be determined by processing imagery from the depth cameras 444 (e.g., using a Simultaneous Localization and Mapping (SLAM) and/or visual odometry procedure) in order to determine the transformation of the headgear device 400A relative to an inertial or environmental coordinate system. In the example shown in FIG. 4, the depth cameras 444 can be coupled to a SLAM/visual odometry block 406 and can provide imagery to block 406. The SLAM/visual odometry block 406 implementation can include a processor configured to process this imagery and determine a position and orientation of the user's head, which can then be used to identify a transformation between a head coordinate space and a real coordinate space. Similarly, in some examples, an additional source of information on the user's head pose and location is obtained from an IMU 409 of headgear device 400A. Information from the IMU 409 can be integrated with information from the SLAM/visual odometry block 406 to provide improved accuracy and/or more timely information on rapid adjustments of the user's head pose and position.
In some examples, the depth cameras 444 can supply 3D imagery to a hand gesture tracker 411, which may be implemented in a processor of headgear device 400A. The hand gesture tracker 411 can identify a user's hand gestures, for example by matching 3D imagery received from the depth cameras 444 to stored patterns representing hand gestures. Other suitable techniques of identifying a user's hand gestures will be apparent.
In some examples, one or more processors 416 may be configured to receive data from headgear subsystem 404B, the IMU 409, the SLAM/visual odometry block 406, depth cameras 444, microphones 450, and/or the hand gesture tracker 411. The processor 416 can also send and receive control signals from the 6DOF totem system 404A. The processor 416 may be coupled to the 6DOF totem system 404A wirelessly, such as in examples where the handheld controller 400B is untethered. Processor 416 may further communicate with additional components, such as an audio-visual content memory 418, a Graphical Processing Unit (GPU) 420, and/or a Digital Signal Processor (DSP) audio spatializer 422. The DSP audio spatializer 422 may be coupled to a Head Related Transfer Function (HRTF) memory 425. The GPU 420 can include a left channel output coupled to the left source of imagewise modulated light 424 and a right channel output coupled to the right source of imagewise modulated light 426. GPU 420 can output stereoscopic image data to the sources of imagewise modulated light 424, 426. The DSP audio spatializer 422 can output audio to a left speaker 412 and/or a right speaker 414. The DSP audio spatializer 422 can receive input from processor 419 indicating a direction vector from a user to a virtual sound source (which may be moved by the user, e.g., via the handheld controller 400B). Based on the direction vector, the DSP audio spatializer 422 can determine a corresponding HRTF (e.g., by accessing a HRTF, or by interpolating multiple HRTFs). The DSP audio spatializer 422 can then apply the determined HRTF to an audio signal, such as an audio signal corresponding to a virtual sound generated by a virtual object. This can enhance the believability and realism of the virtual sound, by incorporating the relative position and orientation of the user relative to the virtual sound in the mixed reality environment—that is, by presenting a virtual sound that matches a user's expectations of what that virtual sound would sound like if it were a real sound in a real environment.
In some examples, such as shown in FIG. 4, one or more of processor 416, GPU 420, DSP audio spatializer 422, HRTF memory 425, and audio/visual content memory 418 may be included in an auxiliary unit 400C (which may correspond to auxiliary unit 300 described above). The auxiliary unit 400C may include a battery 427 to power its components and/or to supply power to headgear device 400A and/or handheld controller 400B. Including such components in an auxiliary unit, which can be mounted to a user's waist, can limit the size and weight of headgear device 400A, which can in turn reduce fatigue of a user's head and neck.
While FIG. 4 presents elements corresponding to various components of an example wearable system 400, various other suitable arrangements of these components will become apparent to those skilled in the art. For example, elements presented in FIG. 4 as being associated with auxiliary unit 400C could instead be associated with headgear device 400A or handheld controller 400B. Furthermore, some wearable systems may forgo entirely a handheld controller 400B or auxiliary unit 400C. Such changes and modifications are to be understood as being included within the scope of the disclosed examples.
Speech Processing Engines
Speech recognition systems in general include a speech processing engine that can accept an input audio signal corresponding to human speech (a source signal); process and analyze the input audio signal; and produce, as a result of the analysis, an output corresponding to the human speech. In the case of automatic speech recognition (ASR) systems, for example, the output of a speech processing engine may be a text transcription of the human speech. In the case of natural language processing systems, the output may be one or more commands or instructions indicated by the human speech; or some representation (e.g., a logical expression or a data structure) of the semantic meaning of the human speech. While reference is made herein to speech processing engines, other forms of speech processing besides speech recognition should also be considered within the scope of the disclosure. Other types of speech processing systems (e.g., automatic translation systems), including those that do not necessarily “recognize” speech, are contemplated and are within the scope of the disclosure.
Speech recognition systems are found in a diverse array of products and applications: conventional telephone systems; automated voice messaging systems; voice assistants (including standalone and smartphone-based voice assistants); vehicles and aircraft; desktop and document processing software; data entry; home appliances; medical devices; language translation software; closed captioning systems; and others. An advantage of speech recognition systems is that they may allow users to provide input to a computer system using natural spoken language, such as presented to one or more microphones, instead of conventional computer input devices such as keyboards or touch panels; accordingly, speech recognition systems may be particularly useful in environments where conventional input devices (e.g., keyboards) may be unavailable or impractical. Further, by permitting users to provide intuitive voice-based input, speech processing engines can heighten feelings of immersion. As such, speech recognition can be a natural fit for wearable systems, and in particular, for virtual reality, augmented reality, and/or mixed reality applications of wearable systems, in which user immersion is a primary goal; and in which it may be desirable to limit the use of conventional computer input devices, whose presence may detract from feelings of immersion.
Although speech processing engines allow users to naturally interface with a computer system through spoken language, constantly running the speech processing engine can pose problems. For example, one problem is that the user experience may be degraded if the speech processing engine responds to noise, or other sounds, that are not intended to be speech input. Background speech can be particularly problematic, as it could cause the computer system to execute unintended commands if the speech processing engine hears and interprets the speech. Because it can be difficult, if not impossible, to eliminate the presence of background speech in a user's environment (particularly for mobile devices), speech processing engines can benefit from a system that can ensure that the speech processing engine only responds to audio signals intended to be speech input for the computer system.
Such a system can also alleviate a second problem of continually running the speech processing engine: power efficiency. A continually running speech processing engine requires power to process a continuous stream of audio signals. Because automatic speech recognition and natural language processing can be computationally expensive tasks, the speech processing engine can be power hungry. Power constraints can be particularly acute for battery powered mobile devices, as continually running the speech processing engine can undesirably reduce the operating time of the mobile device. One way a system can alleviate this problem is by activating the speech processing engine only when the system has determined there is a high likelihood that the audio signal is intended as input for the speech processing engine and the computer system. By initially screening the incoming audio signal to determine if it is likely to be intended speech input, the system can ensure that the speech recognition system accurately responds to speech input while disregarding non-speech input. The system may also increase the power efficiency of the speech recognition system by reducing the amount of time the speech processing engine is required to be active.
One part of such a system can be a wake-up word system. A wake-up word system can rely upon a specific word or phrase to be at the beginning of any intended speech input. The wake-up word system can therefore require that the user first say the specific wake-up word or phrase and then follow the wake-up word or phrase with the intended speech input. Once the wake-up word system detects that the wake-up word has been spoken, the associated audio signal (that may or may not include the wake-up word) can be processed by the speech processing engine or passed to the computer system. Wake-up word systems with a well-selected wake-up word or phrase can reduce or eliminate unintended commands to the computer system from audio signals that are not intended as speech input. If the wake-up word or phrase is not typically uttered during normal conversation, the wake-up word or phrase may serve as a reliable marker that indicates the beginning of intended speech input. However, a wake-up word system still requires a speech processing engine to actively process audio signals to determine if any given audio signal includes the wake-up word.
It therefore can be desirable to create an efficient system that first determines if an audio signal is likely to be a wake-up word. In some embodiments, the system can first determine that an audio signal is likely to include a wake-up word. The system can then wake the speech processing engine and pass the audio signal to the speech processing engine. In some embodiments, the system comprises a voice activity detection system and further comprises a voice onset detection system.
The present disclosure is directed to systems and methods for improving the accuracy and power efficiency of a speech recognition system by filtering out audio signals that are not likely to be intended speech input. As described herein, such audio signals can first be identified (e.g., classified) by a voice activity detection system (e.g., as voice activity or non-voice activity). A voice onset detection system can then determine that an onset has occurred (e.g., of a voice activity event). The determination of an onset can then trigger subsequent events (e.g., activating a speech processing engine to determine if a wake-up word was spoken). “Gatekeeping” audio signals that the speech processing engine is required to process allows the speech processing engine to remain inactive when non-input audio signals are received. In some embodiments, the voice activity detection system and the voice onset detection system are configured to run on a low power, always-on processor.
Such capabilities may be especially important in mobile applications of speech processing, and even more so for wearable applications, such as virtual reality or augmented reality applications. In such wearable applications, the user may often speak without directing input speech to the wearable system. The user may also be in locations where significant amounts of background speech exist. Further, the wearable system may be battery-operated and have a limited operation time. Sensors of wearable systems (such as those described in this disclosure) are well suited to solving this problem, as described herein. However, it is also contemplated that systems and methods described herein can also be applied in non-mobile applications (e.g., a voice assistant running on a device connected to a power outlet or a voice assistant in a vehicle).
Voice Activity Detection
FIG. 5 illustrates an example system 500, according to some embodiments, in which one or more audio signals are classified as either voice activity or non-voice activity. In the depicted embodiment, voice activity detection is determined from both a single channel signal (step 502) and from a beamforming signal (step 503). At step 504, the single channel signal and the beamforming signal are combined to determine voice activity from the one or more audio signals. However, it is also contemplated that, in some cases, one of a single-channel audio signal or a beamforming audio signal can be used to determine voice activity detection. At step 506, the amount of voice activity is measured over a length of time (e.g., a predetermined period of time or a dynamic period of time). At step 508, if the amount of measured voice activity is less than a threshold (e.g., a predetermined period of time, a dynamic period of time), the system returns to step 506 and continues to measure the amount of voice activity over a length of time. If the amount of measured voice activity is greater than a threshold, the system will determine an onset at step 510.
FIG. 6A illustrates in more detail an example of how determining a probability of voice activity from a single channel signal (step 502) can be performed. In the depicted embodiment, input audio signals from microphones 601 and 602 (e.g., microphones 150) can be processed at steps 603 and 604. Microphones 601 and 602 can be placed in a broadside configuration with respect to a user's mouth. Microphones 601 and 602 can also be placed equidistant from a user's mouth for ease of subsequent signal processing. However, it is also contemplated that other microphone arrangements (e.g., number of microphones, placement of microphones, asymmetric microphone arrangements) can be used. In other embodiments, input audio signals can be provided from other suitable sources as well (e.g., a data file or a network audio stream). In some embodiments, signal processing steps 603 and 604 include applying a window function to the audio signal. A window function can be zero-valued outside of a chosen interval, and it can be useful to use window functions when the information-carrying portion of a signal is in a known range (e.g., human speech). A window function can also serve to deconstruct a signal into frames that approximate a signal from a stationary process. It can be beneficial to approximate a signal from a stationary process to apply signal processing techniques suited for stationary processes (e.g., a Fourier transform). In some embodiments, a Hann window can be used, with a frame overlapping percentage of 50%. However, it is contemplated that other window functions with different frame overlapping percentages and/or hop sizes can be used. In some embodiments, the audio signal can also be filtered at steps 603 and 604. Filtering can occur inside or outside the frequency domain to suppress parts of a signal that do not carry information (e.g., parts of a signal that do not carry human speech). In some embodiments, a bandpass filter can be applied to reduce noise in non-voice frequency bands. In some embodiments, a finite impulse response (FIR) filter can be used to preserve the phase of the signal. In some embodiments, both a window function and a filter are applied to an audio signal at steps 603 and 604. In some embodiments, a window function can be applied before a filter function is applied to an audio signal. However, it is also contemplated that only one of a window function or a filter is applied at step 603 or step 604, or that other signal processing steps are applied at step 603 or step 604.
In some embodiments, input audio signals can be summed together at step 605. For microphone configurations that are symmetric relative to a signal source (e.g., a user's mouth), a summed input signal can serve to reinforce an information signal (e.g., a speech signal) because the information signal can be present in both individual input signals, and each microphone can receive the information signal at the same time. In some embodiments, the noise signal in the individual input audio signals can generally not be reinforced because of the random nature of the noise signal. For microphone configurations that are not symmetric relative to a signal source, a summed signal can serve to increase a signal-to-noise ratio (e.g., by reinforcing a speech signal without reinforcing a noise signal). In some embodiments, a filter or delay process can be used for asymmetric microphone configurations. A filter or delay process can align input audio signals to simulate a symmetric microphone configuration by compensating for a longer or shorter path from a signal source to a microphone. Although the depicted embodiment illustrates two input audio signals summed together, it is also contemplated that a single input audio signal can be used, or more than two input audio signals can be summed together as well. It is also contemplated that signal processing steps 603 and/or 604 can occur after a summation step 605 on a summed input signal.
In some embodiments, input power can be estimated at step 606. In some embodiments, input power can be determined on a per-frame basis based on a windowing function applied at steps 603 and 604. At step 608, the audio signal can optionally be smoothed to produce a smoothed input power. In some embodiments, the smoothing process occurs over the frames provided by the windowing function. Although the depicted embodiment shows signal processing and smoothing steps 603, 604, and 608, it is also contemplated that the input audio signal can be processed at step 610.
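As a concrete illustration of this single-channel front end (windowing, bandpass filtering, summing the two microphone signals, per-frame power, and smoothing), the following Python sketch uses NumPy/SciPy. It is not the disclosure's implementation: the frame length, hop size, filter band, tap count, and smoothing coefficient are assumptions chosen for readability.

```python
import numpy as np
from scipy.signal import firwin, get_window, lfilter

def frame_signal(x, frame_len=512, hop=256):
    """Split x into 50%-overlapping frames and apply a Hann window.
    Assumes len(x) >= frame_len."""
    window = get_window("hann", frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return frames * window

def single_channel_front_end(mic1, mic2, fs=16000, smooth=0.7):
    """Smoothed per-frame power of the summed, bandpass-filtered input
    (roughly, steps 603-608 of FIG. 6A)."""
    # Bandpass FIR over an assumed speech band; the disclosure does not
    # specify band edges or filter length.
    bp = firwin(numtaps=101, cutoff=[100.0, 7000.0], pass_zero=False, fs=fs)
    x1 = lfilter(bp, 1.0, np.asarray(mic1, dtype=float))
    x2 = lfilter(bp, 1.0, np.asarray(mic2, dtype=float))

    # Symmetric (broadside) placement assumed, so summing reinforces speech
    # that reaches both microphones at the same time.
    frames = frame_signal(x1 + x2)
    power = np.mean(frames ** 2, axis=1)          # per-frame input power

    # First-order recursive smoothing across frames.
    smoothed = np.empty_like(power)
    acc = power[0]
    for i, p in enumerate(power):
        acc = smooth * acc + (1.0 - smooth) * p
        smoothed[i] = acc
    return smoothed
```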
At step 610, a ratio of the smoothed input power to the noise power estimate is calculated. In some embodiments, the noise power estimate is used to determine voice activity; however, the noise power estimate may also (in some embodiments) rely on information as to when speech is present or absent. Because of the interdependence between inputs and outputs, methods like minima controlled recursive averaging (MCRA) can be used to determine the noise power estimate (although other methods may be used).
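MCRA itself tracks spectral minima and a per-bin speech-presence probability; the sketch below deliberately substitutes a much simpler minima-tracking noise-floor estimate, purely to show where a noise power estimate fits relative to the smoothed input power. The update rates are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def track_noise_power(frame_power, fall=0.9, rise=1.005):
    """Crude noise-floor tracker: follow the per-frame power downward quickly
    and drift upward only slowly, so speech bursts do not inflate the estimate.
    A simplified stand-in for MCRA, not an implementation of it."""
    frame_power = np.asarray(frame_power, dtype=float)
    noise = np.empty_like(frame_power)
    est = frame_power[0]
    for i, p in enumerate(frame_power):
        if p < est:
            est = fall * est + (1.0 - fall) * p   # track minima quickly
        else:
            est = rise * est                      # recover slowly toward higher floors
        noise[i] = est
    return noise

def power_to_noise_ratio(smoothed_power, noise_power, eps=1e-12):
    """Ratio of smoothed input power to the noise power estimate (step 610)."""
    return np.asarray(smoothed_power) / (np.asarray(noise_power) + eps)
```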
FIG. 6B illustrates an example processing of an input signal to calculate the ratio of smoothed input power to the noise power estimate. In graph 613, a signal 614 is displayed that includes both an information signal (e.g., human speech) and noise (e.g., any signal that is not human speech). In graph 616, the noise power is estimated and displayed as signal 620, and the smoothed input power—which can be a result of processing signal 614—can be displayed as signal 618. In graph 622, the ratio of the smoothed input power signal 618 and the noise power estimate signal 620 is displayed as signal 624.
Referring back to FIG. 6A, the probability of voice activity in the single channel signal can be determined at step 612 from the ratio of the smoothed input power to the noise power estimate. In some embodiments, the presence of voice activity is determined by mapping the ratio of smoothed input power to the noise power estimate into probability space, as shown in FIG. 6C. In some embodiments, the ratio is mapped into probability space by using a function to assign a probability that a signal at any given time is voice activity based on the calculated ratio. For example, FIG. 6C depicts a logistic mapping function 626 where the x-axis is the calculated ratio at a given time of the smoothed input power to the noise power estimate, and the y-axis is the probability that the ratio at that time represents voice activity. In some embodiments, the type of function and the function parameters can be tuned to minimize false positives and false negatives. The tuning process can use any suitable methods (e.g., manual, semi-automatic, and/or automatic tuning, for example, machine learning based tuning). Although a logistic function is depicted for mapping input ratios to probability of voice activity, it is contemplated that other methods for mapping input ratios into probability space can be used as well.
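A logistic mapping like the one in FIG. 6C can be written directly. The midpoint and slope below are stand-in tuning parameters; the disclosure only notes that such parameters would be tuned (manually, semi-automatically, or automatically).

```python
import numpy as np

def voice_probability_from_ratio(ratio, midpoint=2.0, slope=4.0):
    """Map the power-to-noise ratio to a probability of voice activity
    with a logistic function; higher ratios map toward 1.0."""
    r = np.asarray(ratio, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (r - midpoint)))

# Ratios near the noise floor map close to 0; ratios well above it map close to 1.
print(voice_probability_from_ratio([0.5, 2.0, 6.0]))  # ~[0.0025, 0.5, 1.0]
```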
Referring now to FIGS. 5 and 7A-7E, FIG. 7A depicts in more detail an example of how determining a probability of voice activity from a beamforming signal (step 503) can be performed. In some embodiments, the beamforming signal can be determined from two or more microphones 702 and 703 (e.g., microphones 150). In some embodiments, the two or more microphones are spaced in a known manner relative to the user's mouth. The two or more microphones can optionally be spaced in a symmetric manner relative to a speech source (e.g., a user's mouth). The depicted embodiment shows that microphones 702 and 703 provide input audio signals. However, in other embodiments, audio signals can be provided from other suitable sources as well (e.g., one or more data files or one or more network audio streams).
In FIG. 7B, microphones 702 and 703 are shown on an example device 701 (e.g., a wearable head device) designed to be worn on a user's head. In the depicted embodiment, microphones 702 and 703 are placed in a broadside configuration with respect to the user's mouth. Microphones 702 and 703 can also be placed equidistant from the user's mouth for ease of subsequent signal processing. However, it is also contemplated that other microphone arrangements (e.g., number of microphones, placement of microphones, asymmetric microphone arrangements) can be used. For example, in some embodiments, one or more microphones can be placed outside of a device designed to be worn on a user's head. In some embodiments, one or more microphones may be placed in a room at known locations. One or more microphones placed in a room at known locations may be communicatively coupled to a wearable head device or to a processor communicatively coupled to a wearable head device. In some embodiments, a position of a user in the room can be used in conjunction with known locations of one or more microphones in the room for subsequent processing or calibration.
In another example, one or more microphones can be placed in a location that is generally but not completely fixed with respect to a user. In some embodiments, one or more microphones may be placed in a car (e.g., two microphones equally spaced relative to a driver's seat). In some embodiments, one or more microphones may be communicatively coupled to a processor. In some embodiments, a generally expected location of a user may be used in conjunction with a known location of one or more microphones for subsequent processing or calibration.
Referring back to the example shown in FIG. 7A, the input audio signals can be processed in blocks 704 and 705. In some embodiments, signal processing steps 704 and 705 include applying a window function to the audio signals. A window function can be zero-valued outside of a chosen interval, and it can be useful to use window functions when the information-carrying portion of a signal is in a known range (e.g., human speech). A window function can also serve to deconstruct a signal into frames that approximate a signal from a stationary process. It can be beneficial to approximate a signal from a stationary process to apply signal processing techniques suited for stationary processes (e.g., a Fourier transform). In some embodiments, a Hann window can be used, with a frame overlapping percentage of 50%. However, it is contemplated that other window functions with different frame overlapping percentages and/or hop sizes can be used as well. In some embodiments, the audio signals can also be filtered at steps 704 and 705. Filtering can occur inside or outside the frequency domain to suppress parts of a signal that do not carry information (e.g., parts of a signal that do not carry human speech). In some embodiments, a bandpass filter can be applied to reduce noise in non-voice frequency bands. In some embodiments, a FIR filter can be used to preserve the phase of the signals. In some embodiments, both a window function and a filter are applied to the audio signals at steps 704 and 705. In some embodiments, a window function can be applied before a filter function is applied to an audio signal. However, it is also contemplated that only one of a window function or a filter is applied to the audio signals at steps 704 and/or 705, or that other signal processing steps are applied at steps 704 and/or 705.
At step 706, the two or more audio signals are summed to produce a summation signal, as shown in more detail in FIG. 7C. FIG. 7C depicts an embodiment where microphone 702 records input audio signal 716 and microphone 703 records input audio signal 718. Input audio signals 716 and 718 are then summed to produce summation signal 720. For microphone configurations that are symmetric relative to a signal source (e.g., a user's mouth), a summed input signal can serve to reinforce an information signal (e.g., a speech signal) because the information signal can be present in both input audio signals 716 and 718. In some embodiments, the noise signal in input audio signals 716 and 718 can generally not be reinforced because of the random nature of the noise signal. For microphone configurations that are not symmetric relative to a signal source, a summed signal can still serve to increase a signal-to-noise ratio (e.g., by reinforcing a speech signal without reinforcing a noise signal). In some embodiments, a filter or delay process can be used for asymmetric microphone configurations. A filter or delay process can align input audio signals to simulate a symmetric microphone configuration by compensating for a longer or shorter path from a signal source to a microphone.
Referring back to the example shown in FIG. 7A, two or more audio signals are subtracted to produce a difference signal at step 707, as shown in more detail in FIG. 7D. FIG. 7D depicts an embodiment where the same input audio signals 716 and 718 recorded by microphones 702 and 703 are also subtracted to form a difference signal 722. For microphone configurations that are symmetric relative to a signal source, the difference signal 722 can contain a noise signal. In some embodiments, microphones 702 and 703 are located equidistant from a user's mouth in a wearable head device. Accordingly, input audio signals 716 and 718 recorded by microphones 702 and 703 receive the same speech signal at the same amplitude and at the same time when the user speaks. A difference signal 722 calculated by subtracting one of input audio signals 716 and 718 from the other would therefore generally remove the speech signal from the input audio signal, leaving the noise signal. For microphone configurations that are not symmetric relative to a signal source, a difference signal can contain a noise signal. In some embodiments, a filter or delay process can be used to simulate a symmetric microphone configuration by compensating for a longer or shorter path from a signal source to a microphone.
Referring back to the example shown in FIG. 7A, the difference signal can be normalized at step 709 in anticipation of calculating a ratio at step 714. In some embodiments, calculating the ratio of the summation signal and the normalized difference signal involves dividing the normalized difference signal by the summation signal (although the reverse order may also be used). A ratio of a summation signal and a normalized difference signal can have an improved signal-to-noise ratio as compared to either component signal. In some embodiments, calculating the ratio is simplified if the baseline noise signal is the same in both the summation signal and the difference signal. The ratio calculation can be simplified because the ratio will approximate "1" where noise occurs if the baseline noise signal is the same. However, in some embodiments, the summation signal and the difference signal will have different baseline noise signals. The summation signal, which can be calculated by summing the input audio signals received from two or more microphones, can have a baseline that is approximately double the baseline of each individual input audio signal. The difference signal, which can be calculated by subtracting the input audio signals from two or more microphones, can have a baseline that is approximately zero. Accordingly, in some embodiments, the baseline of the difference signal can be normalized to approximate the baseline of the summation signal (although the reverse is also contemplated).
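A frame-level sketch of the summation/difference path follows. For brevity the equalization step is collapsed into a single broadband gain; the frequency-dependent equalization filter derived from equations (1)-(4) below is the fuller treatment. The gain value and the assumption of already-windowed, already-filtered frames are illustrative.

```python
import numpy as np

def beamforming_ratio(mic1_frames, mic2_frames, eq_gain=2.0, eps=1e-12):
    """Per-frame ratio of (normalized) difference power to summation power.
    mic*_frames: arrays of shape (n_frames, frame_len), already windowed and
    filtered. With symmetric microphones, speech reinforces in the sum and
    largely cancels in the difference, so small ratios suggest voice."""
    mic1_frames = np.asarray(mic1_frames, dtype=float)
    mic2_frames = np.asarray(mic2_frames, dtype=float)

    summed = mic1_frames + mic2_frames            # speech reinforced
    diff = mic1_frames - mic2_frames              # speech cancelled, noise remains

    sum_power = np.mean(summed ** 2, axis=1)
    # Stand-in for the equalization filter that raises the difference-signal
    # noise baseline to match the summation-signal noise baseline.
    diff_power = np.mean(diff ** 2, axis=1) * eq_gain

    return diff_power / (sum_power + eps)
```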
In some embodiments, a baseline for a difference signal can be normalized to a baseline for a summation signal by using an equalization filter, which can be a FIR filter. A ratio of the power spectral density of a noise signal in a difference signal to that of a noise signal in a summation signal can be given as equation (1), where Γ_N12(ω) represents the coherence of a signal N1 (which can correspond to a noise signal from a first microphone) and a signal N2 (which can correspond to a noise signal from a second microphone), and where Re(·) represents the real portion of a complex number:
\[ \frac{\Phi_{\mathrm{diff,noise}}(\omega)}{\Phi_{\mathrm{sum,noise}}(\omega)} = \frac{1 - \mathrm{Re}\left(\Gamma_{N_{12}}(\omega)\right)}{1 + \mathrm{Re}\left(\Gamma_{N_{12}}(\omega)\right)} \tag{1} \]
Accordingly, a desired frequency response of an equalization filter can be represented as equation (2):
\[ H_{\mathrm{eq}}(\omega) = \frac{1 + \mathrm{Re}\left(\Gamma_{N_{12}}(\omega)\right)}{1 - \mathrm{Re}\left(\Gamma_{N_{12}}(\omega)\right)} \tag{2} \]
Determining Γ_N12(ω) can be difficult because it can require knowledge about which segments of a signal comprise voice activity. This can present a circular issue where voice activity information is required in part to determine voice activity information. One solution can be to model the noise signal as a diffuse field sound, as in equation (3), where d can represent a spacing between microphones, c can represent the speed of sound, and ω can represent a normalized frequency:
\[ \Gamma_{\mathrm{diffuse}}(\omega) = \frac{\sin\left(\frac{\omega d}{c}\right)}{\frac{\omega d}{c}} \tag{3} \]
Accordingly, a magnitude response using a diffuse field model for noise can be expressed as equation (4):
\[ H_{\mathrm{eq}}(\omega) = \frac{\frac{\omega d}{c} + \sin\left(\frac{\omega d}{c}\right)}{\frac{\omega d}{c} - \sin\left(\frac{\omega d}{c}\right)} \tag{4} \]
In some embodiments, Γ_N12(ω) can then be estimated using a FIR filter to approximate a magnitude response using a diffuse field model.
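One way to realize equations (2)-(4) in practice is to evaluate the diffuse-field coherence on a frequency grid, form the desired equalization magnitude response, and fit a linear-phase FIR filter to it. The microphone spacing, sample rate, tap count, and the cap on low-frequency gain below are assumptions of this sketch, not values from the disclosure.

```python
import numpy as np
from scipy.signal import firwin2

def design_equalization_filter(d=0.14, c=343.0, fs=16000, numtaps=129):
    """FIR approximation of the diffuse-field equalization response of
    equation (4). d: assumed microphone spacing (m); c: speed of sound (m/s)."""
    freqs = np.linspace(0.0, fs / 2.0, 257)       # design grid up to Nyquist
    x = 2.0 * np.pi * freqs * d / c               # omega * d / c

    # Diffuse-field coherence, equation (3): sin(x) / x.
    gamma = np.ones_like(x)
    nz = x > 1e-9
    gamma[nz] = np.sin(x[nz]) / x[nz]

    # Desired magnitude response, equations (2) and (4).
    gamma = np.clip(gamma, -0.99, 0.99)           # keep the response finite near DC
    h_eq = (1.0 + gamma) / (1.0 - gamma)
    h_eq = np.minimum(h_eq, 30.0)                 # cap the boost (sketch-level choice)

    # Linear-phase FIR fit to the desired magnitude response.
    return firwin2(numtaps, freqs, h_eq, fs=fs)
```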
In some embodiments, input power can be estimated at steps 710 and 711. In some embodiments, input power can be determined on a per-frame basis based on a windowing function applied at steps 704 and 705. At steps 712 and 713, the summation signal and the normalized difference signal can optionally be smoothed. In some embodiments, the smoothing process occurs over the frames provided by the windowing function.
In the depicted embodiment, the probability of voice activity in the beamforming signal is determined at step 715 from the ratio of the normalized difference signal to the summation signal. In some embodiments, the presence of voice activity is determined by mapping the ratio of the normalized difference signal to the summation signal into probability space, as shown in FIG. 7E. In some embodiments, the ratio is mapped into probability space by using a function to assign a probability that a signal at any given time is voice activity based on the calculated ratio. For example, FIG. 7E depicts a logistic mapping function 724 where the x-axis is the calculated ratio at a given time of the normalized difference signal to the summation signal, and the y-axis is the probability that the ratio at that time represents voice activity. In some embodiments, the type of function and the function parameters can be tuned to minimize false positives and false negatives. The tuning process can use any suitable methods (e.g., manual, semi-automatic, and/or automatic tuning, for example, machine learning based tuning). Although a logistic function is depicted for mapping input ratios to probability of voice activity, it is contemplated that other methods for mapping input ratios into probability space can be used as well.
Referring back to FIG. 5, the input signal can be classified into voice activity and non-voice activity at step 504. In some embodiments, a single combined probability of voice activity at a given time is determined from the probability of voice activity in the single channel signal and the probability of voice activity in the beamforming signal. In some embodiments, the probability from the single channel signal and the probability from the beamforming signal can be separately weighted during the combination process. In some embodiments, the combined probability can be determined according to equation (5), where ψ_VAD(l) represents the combined probability, p_BF(l) represents the probability of voice activity from the beamforming signal, p_OD(l) represents the probability of voice activity from the single channel signal, and α_BF and α_OD represent the weighting exponents for p_BF(l) and p_OD(l), respectively:
\[ \psi_{\mathrm{VAD}}(l) = p_{\mathrm{BF}}(l)^{\alpha_{\mathrm{BF}}} \cdot p_{\mathrm{OD}}(l)^{\alpha_{\mathrm{OD}}} \tag{5} \]
Based on the combined probability for a given time, the input signal can then be classified in some embodiments as voice activity or non-voice activity according to equation (6), where δ_VAD represents a threshold:
\[ \mathrm{VAD}(l) = \begin{cases} 1, & \psi_{\mathrm{VAD}}(l) > \delta_{\mathrm{VAD}} & \text{(speech)} \\ 0, & \text{otherwise} & \text{(non-speech)} \end{cases} \tag{6} \]
In some embodiments, δ_VAD is a tunable parameter that can be tuned by any suitable means (e.g., manually, semi-automatically, and/or automatically, for example, through machine learning). The binary classification of the input signal into voice activity or non-voice activity can be the voice activity detection (VAD) output.
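Equations (5) and (6) translate directly into code; the weighting exponents and the threshold δ_VAD below are placeholders standing in for tuned values.

```python
import numpy as np

def combined_vad(p_bf, p_od, alpha_bf=1.0, alpha_od=1.0, delta_vad=0.5):
    """Combine the beamforming and single-channel probabilities per equation (5)
    and threshold the combined probability per equation (6).
    Returns (psi, vad): the combined probability and the binary VAD output."""
    p_bf = np.asarray(p_bf, dtype=float)
    p_od = np.asarray(p_od, dtype=float)
    psi = (p_bf ** alpha_bf) * (p_od ** alpha_od)   # equation (5)
    vad = (psi > delta_vad).astype(int)             # equation (6)
    return psi, vad

# Example: both paths must agree reasonably strongly for a frame to count as speech.
psi, vad = combined_vad([0.9, 0.4, 0.95], [0.8, 0.9, 0.2])
print(psi, vad)   # [0.72 0.36 0.19] [1 0 0]
```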
Voice Onset Detection
Referring back to FIG. 5, voice onset detection can be determined from the VAD output at steps 506-510. In some embodiments, the VAD output is monitored at step 506 to determine whether a threshold amount of voice activity has occurred within a given time. In some embodiments, if the VAD output does not contain a threshold amount of voice activity within a given time, the VAD output continues to be monitored. In some embodiments, if the VAD output does contain a threshold amount of voice activity within a given time, an onset can be determined at step 510. In some embodiments, an input audio signal can be classified as voice activity or non-voice activity on a per-hop or on a per-frame basis. For example, each hop or each frame of an input audio signal can be classified as voice activity or non-voice activity. In some embodiments, onset detection can be based on a threshold number of hops or frames classified as voice activity within a threshold amount of time. However, voice activity can be classified using other means as well (e.g., on a per-sample basis).
FIG. 8 depicts an embodiment of determining an onset marker 802. In some embodiments, onset marker 802 is determined from the VAD output signal 804, which is determined from the input audio signal 806. In some embodiments, the onset marker 802 is determined based on one or more tunable parameters. In some embodiments, the T_LOOKBACK parameter can function as a buffer window through which the VAD output 804 and/or the input audio signal 806 is evaluated. In some embodiments, the T_LOOKBACK buffer window can progress through the VAD output 804 and/or the audio signal 806 in the time domain and evaluate the amount of voice activity detected within the T_LOOKBACK buffer window. In some embodiments, if the amount of voice activity detected within the T_LOOKBACK buffer window exceeds a threshold parameter T_VA_ACCUM, the system can determine an onset marker 802 at that time. In some embodiments, a larger T_LOOKBACK value decreases the likelihood of detecting a false onset for short-term noise at the expense of increasing latency. In some embodiments, T_VA_ACCUM should be less than or equal to the T_LOOKBACK buffer window size.
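The lookback/accumulation logic can be sketched as a sliding count over per-frame VAD decisions. Here T_LOOKBACK and T_VA_ACCUM are expressed in frames, and the values are placeholders; the hold-off behavior of T_HOLD is added in the sketch at the end of this section.

```python
from collections import deque

def first_onset(vad_frames, t_lookback=25, t_va_accum=20):
    """Return the index of the first frame at which the number of voice-active
    frames inside the trailing t_lookback-frame window reaches t_va_accum,
    or None if no onset occurs. (Hold-off between onsets is handled separately.)"""
    window = deque(maxlen=t_lookback)
    for i, v in enumerate(vad_frames):
        window.append(int(v))
        if sum(window) >= t_va_accum:
            return i
    return None
```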
FIG. 9 depicts the operation of a T_HOLD parameter. It can be desirable to determine no more than one onset for a single utterance. For example, if a wake-up word is determined to typically be no longer than 200 milliseconds, it can be desirable to not determine a second onset within 200 milliseconds of determining a first onset. Determining multiple onsets for a single utterance can result in unnecessary processing and power consumption. In some embodiments, T_HOLD is a duration of time during which a second onset marker should not be determined after a first onset marker has been determined. In the depicted embodiment, onset marker 802 is determined based on VAD output 804 and/or input audio signal 806. In some embodiments, after onset marker 802 is determined, another onset marker may not be determined for the duration of T_HOLD. In some embodiments, after the duration of T_HOLD has passed, another onset marker may be determined if the proper conditions are met (e.g., the amount of voice activity detected within the T_LOOKBACK buffer window exceeds a threshold parameter T_VA_ACCUM). T_HOLD can take any suitable form, such as a static value or a dynamic value (e.g., T_HOLD can be a function of the wake-up word length).
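Extending the previous sketch with a hold-off period suppresses repeated onsets for a single utterance, in the spirit of the T_HOLD behavior of FIG. 9. Again, t_hold_frames is an illustrative name, expressed in frames, and the code is a sketch rather than the embodiment itself.

```python
import numpy as np

def detect_onsets_with_hold(vad, t_lookback_frames, t_va_accum_frames, t_hold_frames):
    """Like detect_onsets, but after an onset fires, further onsets are
    suppressed for t_hold_frames frames."""
    vad = np.asarray(vad, dtype=int)
    onsets = []
    last_onset = None
    for i in range(len(vad)):
        if last_onset is not None and (i - last_onset) < t_hold_frames:
            continue  # still inside the hold-off period
        window = vad[max(0, i - t_lookback_frames + 1): i + 1]
        if window.sum() >= t_va_accum_frames:
            onsets.append(i)
            last_onset = i
    return onsets
```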
In some embodiments, an onset can be determined using parameters that can be tuned via any suitable means (e.g., manually, semi-automatically, and/or automatically, for example, through machine learning). For example, parameters can be tuned such that the voice onset detection system is sensitive to particular speech signals (e.g., a wake-up word). In some embodiments, a typical duration of a wake-up word is known (or can be determined for or by a user) and the voice onset detection parameters can be tuned accordingly (e.g., the T_HOLD parameter can be set to approximately the typical duration of the wake-up word, optionally with some additional padding). Although the embodiments discussed assume the unit of utterance to be detected by the voice onset detection system is a word (or one or more words), it is also contemplated that the target unit of utterance can be other suitable units, such as phonemes or phrases. In some embodiments, the T_LOOKBACK buffer window can be tuned to optimize for lag and accuracy. In some embodiments, the T_LOOKBACK buffer window can be tuned for or by a user. For example, a longer T_LOOKBACK buffer window can increase the system's sensitivity to onsets because the system can evaluate a larger window in which the T_VA_ACCUM threshold can be met. However, in some embodiments, a longer T_LOOKBACK window can increase lag because the system may have to wait longer to determine if an onset has occurred.
In some embodiments, the T_LOOKBACK buffer window size and the T_VA_ACCUM threshold can be tuned to yield the least amount of false negatives and/or false positives. For example, a longer buffer window size with the same threshold can make the system less likely to produce false negatives but more likely to produce false positives. In some embodiments, a larger threshold with the same buffer window size can make the system less likely to produce false positives but more likely to produce false negatives. In some embodiments, the onset marker can be determined at the moment the T_VA_ACCUM threshold is met. Accordingly, in some embodiments, the onset marker can be offset from the beginning of the detected voice activity by the duration T_VA_ACCUM. In some embodiments, it is desirable to introduce an offset to remove undesired speech signals that can precede desired speech signals (e.g., "uh" or "um" preceding a command). In some embodiments, once the T_VA_ACCUM threshold is met, the onset marker can be "back-dated" using suitable methods to the beginning of the detected voice activity such that there may be no offset. For example, the onset marker can be back-dated to the most recent beginning of detected voice activity. In some embodiments, the onset marker can be back-dated using one or more onset detection parameters (e.g., T_LOOKBACK and T_VA_ACCUM).
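The fragment below illustrates one possible way to back-date an onset marker to the most recent start of continuous voice activity. It builds on the earlier sketches and is an assumption about how back-dating could be done, not a description of the patented method.

```python
def backdate_onset(vad, onset_frame):
    """Walk backward from the frame at which the accumulation threshold was
    met to the first frame of the current run of voice activity."""
    start = onset_frame
    while start > 0 and vad[start - 1] == 1:
        start -= 1
    return start
```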
In some embodiments, onset detection parameters can be determined at least in part based on previous interactions. For example, the T_HOLD duration can be adjusted based on a determination of how long the user has previously taken to speak the wake-up word. In some embodiments, T_LOOKBACK or T_VA_ACCUM can be adjusted based on a likelihood of false positives or false negatives from a user or a user's environment. In some embodiments, signal processing steps 604 (in FIG. 6A), 704 (in FIG. 7A), and 705 (in FIG. 7A) can be determined at least in part based on previous interactions. For example, parameters for a windowing function or a filtering function can be adjusted according to a user's typical voice frequencies. In some embodiments, a device can be pre-loaded with a set of default parameters which can adjust based on a specific user's interactions with a device.
In some embodiments, voice onset detection can be used to trigger subsequent events. For example, the voice onset detection system can run on an always-on processor (e.g., a dedicated processor or a DSP) that consumes less power than a main processor. In some embodiments, the detection of an onset can wake a neighboring processor and prompt the neighboring processor to begin speech recognition. In some embodiments, the voice onset detection system can pass information to subsequent systems (e.g., the voice onset detection system can pass a timestamp of a detected onset to a speech processing engine running on a neighboring processor). In some embodiments, the voice onset detection system can use voice activity detection information to accurately determine the onset of speech without the aid of a speech processing engine. In some embodiments, the detection of an onset can serve as a trigger for a speech processing engine to activate; the speech processing engine can therefore remain inactive (reducing power consumption) until an onset has been detected. In some embodiments, a voice onset detector requires less processing (and therefore less power) than a speech processing engine because a voice onset detector analyzes input signal energy instead of analyzing the content of the speech.
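As a rough illustration of the trigger flow described above (not an API from the source), an always-on detector might package the onset timestamp into a notification for a more powerful processor. OnsetNotification and wake_main_processor are hypothetical names; the actual wake mechanism would be platform specific.

```python
import time
from dataclasses import dataclass

@dataclass
class OnsetNotification:
    onset_timestamp: float            # when the onset was detected
    request_speech_recognition: bool = True

def on_onset_detected(wake_main_processor):
    """Called by the low-power detector when an onset fires; wakes the main
    processor and hands it the timestamp so speech recognition can start."""
    note = OnsetNotification(onset_timestamp=time.time())
    wake_main_processor(note)         # hypothetical platform-specific wake call
```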
In some embodiments, sensors on a wearable head device can determine (at least in part) parameters for onset detection. For example, one or more sensors on a wearable head device may monitor a user's mouth movements in determining an onset event. In some embodiments, a user moving his or her mouth may indicate that an onset event is likely to occur. In some embodiments, one or more sensors on a wearable head device may monitor a user's eye movements in determining an onset event. For example, certain eye movements or patterns may be associated with preceding an onset event. In some embodiments, sensors on a wearable head device may monitor a user's vital signs to determine an onset event. For example, an elevated heartrate may be associated with preceding an onset event. It is also contemplated that sensors on a wearable head device may monitor a user's behavior in ways other than those described herein (e.g., head movement, hand movement).
In some embodiments, sensor data (e.g., mouth movement data, eye movement data, vital sign data) can be used as an additional parameter to determine an onset event (e.g., determination of whether a threshold of voice activity is met), or sensor data can be used exclusively to determine an onset event. In some embodiments, sensor data can be used to adjust other onset detection parameters. For example, mouth movement data can be used to determine how long a particular user takes to speak a wake-up word. In some embodiments, mouth movement data can be used to adjust a T_HOLD parameter accordingly. In some embodiments, a wearable head device with one or more sensors can be pre-loaded with instructions on how to utilize sensor data for determining an onset event. In some embodiments, a wearable head device with one or more sensors can also learn how to utilize sensor data for determining an onset event based on previous interactions. For example, it may be determined that, for a particular user, heartrate data is not meaningfully correlated with an onset event, but eye patterns are meaningfully correlated with an onset event. Heartrate data may therefore not be used to determine onset events, or a lower weight may be assigned to heartrate data. A higher weight may also be assigned to eye pattern data.
In some embodiments, the voice onset detection system functions as a wrapper around the voice activity detection system. In some embodiments, it is desirable to produce onset information because onset information may be more accurate than voice activity information. For example, onset information may be more robust against false positives than voice activity information (e.g., if a speaker briefly pauses during a single utterance, voice activity detection may show two instances of voice activity when one onset is desired). In some embodiments, it is desirable to produce onset information because it requires less processing in subsequent steps than voice activity information. For example, a cluster of multiple detected instances of voice activity may require a further determination of whether the cluster should be treated as a single instance of voice activity or as multiple instances.
Asymmetrical Microphone Placement
Symmetrical microphone configurations (such as the configuration shown in FIG. 7B) can offer several advantages in detecting voice onset events. Because a symmetrical microphone configuration may place two or more microphones equidistant from a sound source (e.g., a user's mouth), audio signals received from each microphone may be easily added and/or subtracted from each other for signal processing. For example, because the audio signals corresponding to a user speaking may be received by microphones 702 and 703 at the same time, the audio signals (e.g., the audio signal at microphone 702 and the audio signal at microphone 703) may begin at the same time, and offsets of the audio signals may be unnecessary to combine or subtract the signals.
In some embodiments, asymmetrical microphone configurations may be used because an asymmetrical configuration may be better suited to distinguishing a user's voice from other audio signals. In FIG. 10, a mixed reality (MR) system 1000 (which may correspond to wearable device 100 or system 400) can be configured to receive voice input from a user. In some embodiments, a first microphone may be placed at location 1002, and a second microphone may be placed at location 1004. In some embodiments, MR system 1000 can include a wearable head device, and a user's mouth may be positioned at location 1006. Sound originating from the user's mouth at location 1006 may take longer to reach microphone location 1002 than microphone location 1004 because of the larger travel distance between location 1006 and location 1002 than between location 1006 and location 1004.
In some embodiments, an asymmetrical microphone configuration (e.g., the microphone configuration shown in FIG. 10) may allow a MR system to more accurately distinguish a user's voice from other audio signals. For example, a person standing directly in front of a user may not be distinguishable from the user with a symmetrical microphone configuration on a wearable head device. A symmetrical microphone configuration (e.g., the configuration shown in FIG. 7B) may result in both microphones (e.g., microphones 702 and 703) receiving speech signals at the same time, regardless of whether the user is speaking or the person directly in front of the user is speaking. This may allow the person directly in front of the user to "hijack" a MR system by issuing voice commands that the MR system may not be able to determine as originating from someone other than the user. In some embodiments, an asymmetrical microphone configuration may more accurately distinguish a user's voice from other audio signals. For example, microphones placed at locations 1002 and 1004 may receive audio signals from the user's mouth at different times, and the difference may be determined by the spacing between locations 1002/1004 and location 1006. However, microphones at locations 1002 and 1004 may receive audio signals from a person speaking directly in front of a user at the same time. The user's speech may therefore be distinguishable from other sound sources (e.g., another person) because the user's mouth may be at a lower height than microphone locations 1002 and 1004, which can be determined from a sound delay at position 1002 as compared to position 1004.
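To illustrate the distinction that the asymmetrical placement makes possible, the sketch below estimates the inter-microphone delay by cross-correlation and compares it with the delay expected for the wearer's mouth. The function name, the expected-delay value, and the tolerance are illustrative assumptions, not parameters from the source.

```python
import numpy as np

def likely_wearer(sig_near, sig_far, expected_delay_samples, tol_samples=1):
    """Estimate the delay (in samples) of sig_far relative to sig_near via
    cross-correlation. A delay close to the expected mouth-to-microphone
    travel-time difference suggests the wearer is talking; a delay near zero
    suggests a source roughly equidistant from both microphones (e.g., a
    person standing directly in front of the wearer)."""
    corr = np.correlate(sig_far, sig_near, mode="full")
    lag = np.argmax(corr) - (len(sig_near) - 1)   # positive: sig_far lags sig_near
    return abs(lag - expected_delay_samples) <= tol_samples
```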
Although asymmetrical microphone configurations may provide additional information about a sound source (e.g., an approximate height of the sound source), a sound delay may complicate subsequent calculations. In some embodiments, adding and/or subtracting audio signals that are offset (e.g., in time) from each other may decrease a signal-to-noise ratio (“SNR”), rather than increasing the SNR (which may happen when the audio signals are not offset from each other). It can therefore be desirable to process audio signals received from an asymmetrical microphone configuration such that a beamforming analysis (e.g., noise cancellation) may still be performed to determine voice activity. In some embodiments, a voice onset event can be determined based on a beamforming analysis and/or single channel analysis. A notification may be transmitted to a processor (e.g., a DSP or x86 processor) in response to determining that a voice onset event has occurred. The notification may include information such as a timestamp of the voice onset event and/or a request that the processor begin speech recognition.
FIGS. 11A-11C illustrate examples of processing audio signals. In some embodiments, FIGS. 11A-11C illustrate example embodiments of processing audio signals such that a beamforming voice activity detection analysis can be performed on audio signals that may be offset (e.g., in time) from each other (e.g., due to an asymmetric microphone configuration).
FIG. 11A illustrates an example of implementing a time-offset in a bandpass filter. In some embodiments, an audio signal received at microphone 1102 may be processed at steps 1104 and/or 1106. In some embodiments, steps 1104 and 1106 together may correspond to processing step 704 and/or step 603. For example, microphone 1102 may be placed at position 1004. In some embodiments, a window function may be applied at step 1104 to a first audio signal (e.g., an audio signal corresponding to a user's voice) received by microphone 1102. In some embodiments, a first filter (e.g., a bandpass filter) may be applied to the first audio signal at step 1106.
In some embodiments, an audio signal received at microphone 1108 may be processed at steps 1110 and/or 1112. In some embodiments, steps 1110 and 1112 together may correspond to processing step 705 and/or step 604. For example, microphone 1108 may be placed at position 1002. In some embodiments, a window function may be applied at step 1110 to a second audio signal received by microphone 1108. In some embodiments, the window function applied at step 1110 can be the same window function applied at step 1104. In some embodiments, a second filter (e.g., a bandpass filter) may be applied to the second audio signal at step 1112. In some embodiments, the second filter may be different from the first filter because the second filter may account for a time delay between an audio signal received at microphone 1108 and an audio signal received at microphone 1102. For example, a user may speak while wearing MR system 1000, and the user's voice may be picked up by microphone 1108 at a later time than by microphone 1102 (e.g., because microphone 1108 may be further away from a user's mouth than microphone 1102). In some embodiments, a bandpass filter applied at step 1112 can be implemented in the time domain, and the bandpass filter can be shifted (as compared to a bandpass filter applied at step 1106) by a delay time, which may include the additional time required for sound to travel from position 1006 to position 1002, as compared to the travel time from position 1006 to position 1004. In some embodiments, a delay time may be approximately 3-4 samples at a 48 kHz sampling rate, although a delay time can vary depending on a particular microphone (and user) configuration. A delay time can be predetermined (e.g., using measuring equipment) and may be fixed across different MR systems (e.g., because the microphone configurations may not vary across different systems). In some embodiments, a delay time can be dynamically measured locally by individual MR systems. For example, a user may be prompted to generate an impulse (e.g., a sharp, short noise) with their mouth, and a delay time may be recorded as the impulse reaches asymmetrically positioned microphones. In some embodiments, a bandpass filter can be implemented in the frequency domain, and one or more delay times may be applied to different frequency domains (e.g., a frequency domain including human voices may be delayed by a first delay time, and all other frequency domains may be delayed by a second delay time).
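One way to realize a time-shifted filter of the kind described above is to prepend the delay (in whole samples) to the filter's impulse response, so that bandpass filtering and delay compensation happen in a single convolution. The sketch below is a simplified illustration under that assumption; the windowed-sinc design, the 100-4000 Hz passband, the tap count, and the 3-sample shift are stand-ins, not values from the embodiments, and here the shift is shown applied to the earlier-arriving (nearer) channel so the two channels align, which is an assumption about the intended sign convention.

```python
import numpy as np

def bandpass_fir(low_hz, high_hz, fs, num_taps=101):
    """Windowed-sinc bandpass FIR prototype (linear phase)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = (2 * high_hz / fs) * np.sinc(2 * high_hz * n / fs) \
        - (2 * low_hz / fs) * np.sinc(2 * low_hz * n / fs)
    return h * np.hamming(num_taps)

def delayed_filter(h, delay_samples):
    """Shift a time-domain filter by an integer number of samples by
    prepending zeros, building the inter-microphone delay into the filter."""
    return np.concatenate([np.zeros(delay_samples), h])

fs = 48_000
h_far = bandpass_fir(100, 4000, fs)                        # path without an extra shift
h_near = delayed_filter(bandpass_fir(100, 4000, fs), 3)    # path delayed by ~3 samples at 48 kHz
# y_near_aligned = np.convolve(x_near, h_near)[: len(x_near)]   # usage on a near-channel frame x_near
```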
FIG. 11B illustrates an example of implementing two filters. In some embodiments, an audio signal received at microphone 1114 may be processed at steps 1116 and/or 1118. In some embodiments, steps 1116 and 1118 together may correspond to processing step 704 and/or step 603. For example, microphone 1114 may be placed at position 1004. In some embodiments, a window function may be applied at step 1116 to a first audio signal (e.g., an audio signal corresponding to a user's voice) received by microphone 1114. In some embodiments, a first filter (e.g., a bandpass filter) may be applied to the first audio signal at step 1118. In some embodiments, a bandpass filter applied at step 1118 can have a lower tap count than a bandpass filter applied at step 1106 (e.g., the tap count may be half of that used at step 1106). A lower tap count may result in a lower memory and/or computation cost, but may yield a lower fidelity audio signal.
In some embodiments, an audio signal received at microphone 1120 may be processed at steps 1122 and/or 1124. In some embodiments, steps 1122 and 1124 together may correspond to processing step 705 and/or step 604. For example, microphone 1120 may be placed at position 1002. In some embodiments, a window function may be applied at step 1122 to a second audio signal received by microphone 1120. In some embodiments, the window function applied at step 1122 can be the same window function applied at step 1116. In some embodiments, a second filter (e.g., a bandpass filter) may be applied to the second audio signal at step 1124. In some embodiments, the second filter may be different from the first filter because the second filter may account for a time delay between an audio signal received at microphone 1120 and an audio signal received at microphone 1114. In some embodiments, the second filter may have the same tap count as the filter applied at step 1118. In some embodiments, the second filter may be configured to account for additional variations. For example, an audio signal originating from a user's mouth may be distorted as a result of, for example, additional travel time, reflections from additional material traversed (e.g., parts of MR system 1000), reverberations from additional material traversed, and/or occlusion from parts of MR system 1000. In some embodiments, the second filter may be configured to remove and/or mitigate distortions that may result from an asymmetrical microphone configuration.
FIG. 11C illustrates an example of implementing an additional FIR filter. In some embodiments, an audio signal received at microphone 1126 may be processed at steps 1128 and/or 1130. In some embodiments, steps 1128 and 1130 together may correspond to processing step 704 and/or step 603. For example, microphone 1126 may be placed at position 1004. In some embodiments, a window function may be applied at step 1128 to a first audio signal (e.g., an audio signal corresponding to a user's voice) received by microphone 1126. In some embodiments, a first filter (e.g., a bandpass filter) may be applied to the first audio signal at step 1130.
In some embodiments, an audio signal received at microphone 1132 may be processed at steps 1134, 1136, and/or 1138. In some embodiments, steps 1134, 1136, and 1138 together may correspond to processing step 705 and/or step 604. For example, microphone 1132 may be placed at position 1002. In some embodiments, a FIR filter can be applied to a second audio signal received by microphone 1132. In some embodiments, a FIR filter can be configured to filter out non-impulse responses. An impulse response can be pre-determined (and may not vary across MR systems with the same microphone configurations), or an impulse response can be dynamically determined at individual MR systems (e.g., by having the user utter an impulse and recording the response). In some embodiments, a FIR filter can provide better control in designing a frequency-dependent delay than an infinite impulse response (IIR) filter. In some embodiments, a FIR filter can guarantee a stable output. In some embodiments, a FIR filter can be configured to compensate for a time delay. In some embodiments, a FIR filter can be configured to remove distortions that may result from a longer and/or different travel path for an audio signal. In some embodiments, a window function may be applied at step 1136 to a second audio signal received by microphone 1132. In some embodiments, the window function applied at step 1136 can be the same window function applied at step 1128. In some embodiments, a second filter (e.g., a bandpass filter) may be applied to the second audio signal at step 1138. In some embodiments, the second filter may be the same as the filter applied at step 1130.
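As an illustration of how an FIR filter can implement a (possibly fractional) delay of the kind discussed above, the following windowed-sinc design is a common textbook approach. It is offered as a plausible sketch under stated assumptions, not the filter specified in the embodiments; the tap count and the 3.5-sample delay are arbitrary examples.

```python
import numpy as np

def fractional_delay_fir(delay_samples, num_taps=31):
    """Windowed-sinc FIR whose group delay is (num_taps - 1)/2 + delay_samples.

    delay_samples may be non-integer (e.g., 3.5 samples at 48 kHz), which is
    useful when the extra mouth-to-microphone travel time is not a whole
    number of samples."""
    n = np.arange(num_taps)
    center = (num_taps - 1) / 2 + delay_samples
    h = np.sinc(n - center)          # ideal shifted sinc
    h *= np.hamming(num_taps)        # taper to reduce ripple
    return h / h.sum()               # normalize DC gain to 1

# Example: apply to the earlier-arriving channel so both channels line up
# before the beamforming analysis (which channel that is depends on geometry):
# aligned = np.convolve(x, fractional_delay_fir(3.5), mode="same")
```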
With respect to the systems and methods described above, elements of the systems and methods can be implemented by one or more computer processors (e.g., CPUs or DSPs) as appropriate. The disclosure is not limited to any particular configuration of computer hardware, including computer processors, used to implement these elements. In some cases, multiple computer systems can be employed to implement the systems and methods described above. For example, a first computer processor (e.g., a processor of a wearable device coupled to one or more microphones) can be utilized to receive input microphone signals, and perform initial processing of those signals (e.g., signal conditioning and/or segmentation, such as described above). A second (and perhaps more computationally powerful) processor can then be utilized to perform more computationally intensive processing, such as determining probability values associated with speech segments of those signals. Another computer device, such as a cloud server, can host a speech processing engine, to which input signals are ultimately provided. Other suitable configurations will be apparent and are within the scope of the disclosure.
Although the disclosed examples have been fully described with reference to the accompanying drawings, it is to be noted that various changes and modifications will become apparent to those skilled in the art. For example, elements of one or more implementations may be combined, deleted, modified, or supplemented to form further implementations. Such changes and modifications are to be understood as being included within the scope of the disclosed examples as defined by the appended claims.

Claims (20)

What is claimed is:
1. A system comprising:
a wearable head device comprising:
a frame;
a first microphone disposed on the frame, the first microphone configured to rest at a first distance from a user's mouth when the frame is worn by the user; and
a second microphone disposed on the frame, the second microphone configured to rest at a second distance from the user's mouth when the frame is worn by the user, the second distance unequal to the first distance; and
one or more processors configured to perform a method comprising:
receiving, via the first microphone, a first voice audio signal;
determining a first probability of voice activity based on the first voice audio signal;
receiving, via the second microphone, a second voice audio signal;
determining a second probability of voice activity based on the first voice audio signal and the second voice audio signal;
determining whether a first threshold of voice activity is met based on the first probability of voice activity and the second probability of voice activity;
in accordance with a determination that the first threshold of voice activity is met, determining that a voice onset has occurred; and
in accordance with a determination that the first threshold of voice activity is not met, forgoing determining that a voice onset has occurred.
2. The system of claim 1, wherein the method further comprises determining a time offset associated with a difference between the first distance and the second distance, wherein determining the second probability of voice activity based on the first voice audio signal and the second voice audio signal comprises compensating for the time offset.
3. The system of claim 2, the method further comprising:
applying a window function to the first voice audio signal;
applying a bandpass filter to the first voice audio signal;
applying a finite-impulse response (FIR) filter to the second voice audio signal, the FIR filter associated with the time offset compensation;
applying a window function to the second voice audio signal; and
applying a bandpass filter to the second voice audio signal.
4. The system of claim 1, wherein determining the second probability of voice activity based on the first voice audio signal and the second voice audio signal comprises:
combining the first voice audio signal and the second voice audio signal to produce a third voice audio signal; and
determining the second probability of voice activity based on the third voice audio signal.
5. The system of claim 4, wherein the third voice audio signal comprises a beamforming signal.
6. The system of claim 1, wherein determining whether the first threshold of voice activity is met comprises:
weighting the first probability of voice activity with a first weight; and
weighting the second probability of voice activity with a second weight.
7. The system of claim 1, wherein the first microphone is disposed on a left eye portion of the frame and the second microphone is disposed on a right eye portion of the frame.
8. A method comprising:
receiving, via a first microphone disposed on a frame of a wearable head device, a first voice audio signal;
determining a first probability of voice activity based on the first voice audio signal;
receiving, via a second microphone disposed on the frame of the wearable head device, a second voice audio signal;
determining a second probability of voice activity based on the first voice audio signal and the second voice audio signal;
determining whether a first threshold of voice activity is met based on the first probability of voice activity and the second probability of voice activity;
in accordance with a determination that the first threshold of voice activity is met, determining that a voice onset has occurred; and
in accordance with a determination that the first threshold of voice activity is not met, forgoing determining that a voice onset has occurred,
wherein:
the first microphone is configured to rest at a first distance from a user's mouth when the frame is worn by the user; and
the second microphone is configured to rest at a second distance from the user's mouth when the frame is worn by the user, the second distance unequal to the first distance.
9. The method of claim 8, further comprising determining a time offset associated with a difference between the first distance and the second distance, wherein determining the second probability of voice activity based on the first voice audio signal and the second voice audio signal comprises compensating for the time offset.
10. The method of claim 9, the method further comprising:
applying a window function to the first voice audio signal;
applying a bandpass filter to the first voice audio signal;
applying a finite-impulse response (FIR) filter to the second voice audio signal, the FIR filter associated with the time offset compensation;
applying a window function to the second voice audio signal; and
applying a bandpass filter to the second voice audio signal.
11. The method of claim 8, wherein determining the second probability of voice activity based on the first voice audio signal and the second voice audio signal comprises:
combining the first voice audio signal and the second voice audio signal to produce a third voice audio signal; and
determining the second probability of voice activity based on the third voice audio signal.
12. The method of claim 11, wherein the third voice audio signal comprises a beamforming signal.
13. The method of claim 8, wherein determining whether the first threshold of voice activity is met comprises:
weighting the first probability of voice activity with a first weight; and
weighting the second probability of voice activity with a second weight.
14. The method of claim 8, wherein the first microphone is disposed on a left eye portion of the frame and the second microphone is disposed on a right eye portion of the frame.
15. A non-transitory computer-readable medium storing one or more instructions, which, when executed by one or more processors, cause the one or more processors to perform a method comprising:
receiving, via a first microphone disposed on a frame of a wearable head device, a first voice audio signal;
determining a first probability of voice activity based on the first voice audio signal;
receiving, via a second microphone disposed on the frame of the wearable head device, a second voice audio signal;
determining a second probability of voice activity based on the first voice audio signal and the second voice audio signal;
determining whether a first threshold of voice activity is met based on the first probability of voice activity and the second probability of voice activity;
in accordance with a determination that the first threshold of voice activity is met, determining that a voice onset has occurred; and
in accordance with a determination that the first threshold of voice activity is not met, forgoing determining that a voice onset has occurred,
wherein:
the first microphone is configured to rest at a first distance from a user's mouth when the frame is worn by the user; and
the second microphone is configured to rest at a second distance from the user's mouth when the frame is worn by the user, the second distance unequal to the first distance.
16. The non-transitory computer-readable medium of claim 15, further comprising determining a time offset associated with a difference between the first distance and the second distance, wherein determining the second probability of voice activity based on the first voice audio signal and the second voice audio signal comprises compensating for the time offset.
17. The non-transitory computer-readable medium of claim 15, wherein determining the second probability of voice activity based on the first voice audio signal and the second voice audio signal comprises:
combining the first voice audio signal and the second voice audio signal to produce a third voice audio signal; and
determining the second probability of voice activity based on the third voice audio signal.
18. The non-transitory computer-readable medium of claim 17, wherein the third voice audio signal comprises a beamforming signal.
19. The non-transitory computer-readable medium of claim 15, wherein determining whether the first threshold of voice activity is met comprises:
weighting the first probability of voice activity with a first weight; and
weighting the second probability of voice activity with a second weight.
20. The non-transitory computer-readable medium of claim 15, wherein the first microphone is disposed on a left eye portion of the frame and the second microphone is disposed on a right eye portion of the frame.

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
US18/459,342 (US12094489B2) | 2019-08-07 | 2023-08-31 | Voice onset detection
US18/764,006 (US20250006219A1) | 2019-08-07 | 2024-07-03 | Voice onset detection

Applications Claiming Priority (5)

Application Number | Priority Date | Filing Date | Title
US201962884143P | 2019-08-07 | 2019-08-07
US202063001118P | 2020-03-27 | 2020-03-27
US16/987,267 (US11328740B2) | 2019-08-07 | 2020-08-06 | Voice onset detection
US17/714,708 (US11790935B2) | 2019-08-07 | 2022-04-06 | Voice onset detection
US18/459,342 (US12094489B2) | 2019-08-07 | 2023-08-31 | Voice onset detection

Related Parent Applications (1)

Application Number | Relation | Priority Date | Filing Date | Title
US17/714,708 (US11790935B2) | Continuation | 2019-08-07 | 2022-04-06 | Voice onset detection

Related Child Applications (1)

Application Number | Relation | Priority Date | Filing Date | Title
US18/764,006 (US20250006219A1) | Continuation | 2019-08-07 | 2024-07-03 | Voice onset detection

Publications (2)

Publication Number | Publication Date
US20230410835A1 | 2023-12-21
US12094489B2 | 2024-09-17

Family

ID=74498600

Family Applications (4)

Application Number | Status | Priority Date | Filing Date | Title
US16/987,267 (US11328740B2) | Active, anticipated expiration 2040-08-13 | 2019-08-07 | 2020-08-06 | Voice onset detection
US17/714,708 (US11790935B2) | Active | 2019-08-07 | 2022-04-06 | Voice onset detection
US18/459,342 (US12094489B2) | Active | 2019-08-07 | 2023-08-31 | Voice onset detection
US18/764,006 (US20250006219A1) | Pending | 2019-08-07 | 2024-07-03 | Voice onset detection

Family Applications Before (2)

Application Number | Status | Priority Date | Filing Date | Title
US16/987,267 (US11328740B2) | Active, anticipated expiration 2040-08-13 | 2019-08-07 | 2020-08-06 | Voice onset detection
US17/714,708 (US11790935B2) | Active | 2019-08-07 | 2022-04-06 | Voice onset detection

Family Applications After (1)

Application Number | Status | Priority Date | Filing Date | Title
US18/764,006 (US20250006219A1) | Pending | 2019-08-07 | 2024-07-03 | Voice onset detection

Country Status (1)

Country | Link
US (4) | US11328740B2 (en)

US20160142830A1 (en)2013-01-252016-05-19Hai HuDevices And Methods For The Visualization And Localization Of Sound
US20180011534A1 (en)2013-02-192018-01-11Microsoft Technology Licensing, LlcContext-aware augmented reality object commands
WO2014159581A1 (en)2013-03-122014-10-02Nuance Communications, Inc.Methods and apparatus for detecting a voice command
US20140270202A1 (en)2013-03-122014-09-18Motorola Mobility LlcApparatus with Adaptive Audio Adjustment Based on Surface Proximity, Surface Type and Motion
US20140270244A1 (en)*2013-03-132014-09-18Kopin CorporationEye Glasses With Microphone Array
US20160112817A1 (en)*2013-03-132016-04-21Kopin CorporationHead wearable acoustic system with noise canceling microphone geometry apparatuses and methods
US20140337023A1 (en)2013-05-102014-11-13Daniel McCullochSpeech to text conversion
US20140379336A1 (en)2013-06-202014-12-25Atul BhatnagarEar-based wearable networking device, system, and method
US20150006181A1 (en)2013-06-282015-01-01Kopin CorporationDigital Voice Processing Method and System for Headset Computer
US20160019910A1 (en)2013-07-102016-01-21Nuance Communications,Inc.Methods and Apparatus for Dynamic Low Frequency Noise Suppression
US20180129469A1 (en)2013-08-232018-05-10Tobii AbSystems and methods for providing audio to a user based on gaze input
US20160217781A1 (en)2013-10-232016-07-28Google Inc.Methods And Systems For Implementing Bone Conduction-Based Noise Cancellation For Air-Conducted Sound
US9294860B1 (en)2014-03-102016-03-22Amazon Technologies, Inc.Identifying directions of acoustically reflective surfaces
WO2015169618A1 (en)2014-05-052015-11-12Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.System, apparatus and method for consistent acoustic scene reproduction based on informed spatial filtering
US20170078819A1 (en)2014-05-052017-03-16Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.System, apparatus and method for consistent acoustic scene reproduction based on adaptive functions
EP2950307A1 (en)2014-05-302015-12-02Apple Inc.Reducing the need for manual start/end-pointing and trigger phrases
US20150348572A1 (en)*2014-05-302015-12-03Apple Inc.Detecting a user's voice activity using dynamic probabilistic models of speech features
JP2016004270A (en)2014-05-302016-01-12アップル インコーポレイテッド Manual start / end point specification and reduced need for trigger phrase
US9715875B2 (en)2014-05-302017-07-25Apple Inc.Reducing the need for manual start/end-pointing and trigger phrases
US20170092276A1 (en)2014-07-312017-03-30Tencent Technology (Shenzhen) Company LimitedVoiceprint Verification Method And Device
US20160066113A1 (en)2014-08-282016-03-03Qualcomm IncorporatedSelective enabling of a component by a microphone circuit
WO2016063587A1 (en)2014-10-202016-04-28ソニー株式会社Voice processing system
EP3211918A1 (en)2014-10-202017-08-30Sony CorporationVoice processing system
CN105529033A (en)2014-10-202016-04-27索尼公司 sound processing system
US20170280239A1 (en)2014-10-202017-09-28Sony CorporationVoice processing system
US20160165340A1 (en)2014-12-052016-06-09Stages Pcs, LlcMulti-channel multi-domain source identification and tracking
US20160180837A1 (en)2014-12-172016-06-23Qualcomm IncorporatedSystem and method of speech recognition
US20180336902A1 (en)2015-02-032018-11-22Dolby Laboratories Licensing CorporationConference segmentation based on conversational dynamics
US20170330555A1 (en)2015-03-232017-11-16Sony CorporationInformation processing system and information processing method
WO2016151956A1 (en)2015-03-232016-09-29ソニー株式会社Information processing system and information processing method
WO2016153712A1 (en)2015-03-262016-09-29Intel CorporationMethod and system of environment sensitive automatic speech recognition
US20160284350A1 (en)2015-03-272016-09-29Qualcomm IncorporatedControlling electronic device based on direction of speech
US20200279552A1 (en)2015-03-302020-09-03Amazon Technologies, Inc.Pre-wakeword speech processing
US20160358598A1 (en)2015-06-072016-12-08Apple Inc.Context-based endpoint detection
US20160379629A1 (en)2015-06-252016-12-29Intel CorporationMethod and system of automatic speech recognition with dynamic vocabularies
US20160379638A1 (en)2015-06-262016-12-29Amazon Technologies, Inc.Input speech quality matching
JP2018523156A (en)2015-06-292018-08-16アマゾン テクノロジーズ インコーポレイテッド Language model speech end pointing
WO2017003903A1 (en)2015-06-292017-01-05Amazon Technologies, Inc.Language model speech endpointing
US20160379632A1 (en)2015-06-292016-12-29Amazon Technologies, Inc.Language model speech endpointing
US10134425B1 (en)2015-06-292018-11-20Amazon Technologies, Inc.Direction-based speech endpointing
WO2017017591A1 (en)2015-07-262017-02-02Vocalzoom Systems Ltd.Laser microphone utilizing mirrors having different properties
US20180077095A1 (en)2015-09-142018-03-15X Development LlcAugmentation of Communications with Emotional Data
US20170091169A1 (en)2015-09-292017-03-30Apple Inc.Efficient word encoding for recurrent neural network language models
US20170110116A1 (en)2015-10-192017-04-20Google Inc.Speech endpointing
US20170148429A1 (en)2015-11-242017-05-25Fujitsu LimitedKeyword detector and keyword detection method
US10289205B1 (en)2015-11-242019-05-14Google LlcBehind the ear gesture control for a head mountable device
US20180358021A1 (en)2015-12-232018-12-13Intel CorporationBiometric information for dialog system
US20170270919A1 (en)2016-03-212017-09-21Amazon Technologies, Inc.Anchored speech detection and speech recognition
US20170316780A1 (en)2016-04-282017-11-02Andrew William LovittDynamic speech recognition data evaluation
US20190129944A1 (en)2016-05-022019-05-02Sony CorporationControl device, control method, and computer program
WO2017191711A1 (en)2016-05-022017-11-09ソニー株式会社Control device, control method, and computer program
US20170332187A1 (en)2016-05-112017-11-16Htc CorporationWearable electronic device and virtual reality system
JP2017211596A (en)2016-05-272017-11-30トヨタ自動車株式会社Speech dialog system and utterance timing determination method
US20180227665A1 (en)2016-06-152018-08-09Mh Acoustics, LlcSpatial Encoding Directional Microphone Array
US20180053284A1 (en)2016-08-222018-02-22Magic Leap, Inc.Virtual, augmented, and mixed reality systems and methods
US20200064921A1 (en)2016-11-162020-02-27Samsung Electronics Co., Ltd.Electronic device and control method thereof
US20200027455A1 (en)2017-03-102020-01-23Nippon Telegraph And Telephone CorporationDialog system, dialog method, dialog apparatus and program
JP2018179954A (en)2017-04-182018-11-15株式会社バンザイ Vehicle inspection aid using head mounted display
US20180366114A1 (en)2017-06-162018-12-20Amazon Technologies, Inc.Exporting dialog-driven applications to digital communication platforms
US20210056966A1 (en)2017-11-162021-02-25Softbank Robotics EuropeSystem and method for dialog session management
US20200286465A1 (en)2018-01-312020-09-10Tencent Technology (Shenzhen) Company LimitedSpeech keyword recognition method and apparatus, computer-readable storage medium, and computer device
WO2019224292A1 (en)2018-05-232019-11-28Koninklijke Kpn N.V.Adapting acoustic rendering to image-based object
US20190362741A1 (en)2018-05-242019-11-28Baidu Online Network Technology (Beijing) Co., LtdMethod, apparatus and device for recognizing voice endpoints
US20190373362A1 (en)2018-06-012019-12-05Shure Acquisition Holdings, Inc.Pattern-forming microphone array
US20240087587A1 (en)2018-06-212024-03-14Magic Leap, Inc.Wearable system speech processing
US11854566B2 (en)2018-06-212023-12-26Magic Leap, Inc.Wearable system speech processing
US20210264931A1 (en)2018-06-212021-08-26Magic Leap, Inc.Wearable system speech processing
US20190392641A1 (en)2018-06-262019-12-26Sony Interactive Entertainment Inc.Material base rendering
US20200296521A1 (en)2018-10-152020-09-17Orcam Technologies Ltd.Systems and methods for camera and microphone-based device
US20200194028A1 (en)2018-12-182020-06-18Colquitt Partners, Ltd.Glasses with closed captioning, voice recognition, volume of speech detection, and translation capabilities
US20200213729A1 (en)2018-12-202020-07-02Sonos, Inc.Optimization of network microphone devices using noise classification
WO2020180719A1 (en)2019-03-012020-09-10Magic Leap, Inc.Determining input for speech processing engine
US11854550B2 (en)2019-03-012023-12-26Magic Leap, Inc.Determining input for speech processing engine
US11587563B2 (en)2019-03-012023-02-21Magic Leap, Inc.Determining input for speech processing engine
US20240087565A1 (en)2019-03-012024-03-14Magic Leap, Inc.Determining input for speech processing engine
US20230135768A1 (en)2019-03-012023-05-04Magic Leap, Inc.Determining input for speech processing engine
US20200335128A1 (en)2019-04-192020-10-22Magic Leap, Inc.Identifying input for speech recognition engine
WO2020214844A1 (en)2019-04-192020-10-22Magic Leap, Inc.Identifying input for speech recognition engine
US20220230658A1 (en)2019-08-072022-07-21Magic Leap, Inc.Voice onset detection
US11328740B2 (en)2019-08-072022-05-10Magic Leap, Inc.Voice onset detection
US11790935B2 (en)2019-08-072023-10-17Magic Leap, Inc.Voice onset detection
US20210125609A1 (en)2019-10-282021-04-29Apple Inc.Automatic speech recognition imposter rejection on a headphone with an accelerometer
US20240163612A1 (en)2020-03-272024-05-16Magic Leap, Inc.Method of waking a device using spoken voice commands
US20210306751A1 (en)2020-03-272021-09-30Magic Leap, Inc.Method of waking a device using spoken voice commands
US11917384B2 (en)2020-03-272024-02-27Magic Leap, Inc.Method of waking a device using spoken voice commands
WO2022072752A1 (en)2020-09-302022-04-07Magic Leap, Inc.Voice user interface using non-linguistic input
US20230386461A1 (en)2020-09-302023-11-30Colby Nelson LEIDERVoice user interface using non-linguistic input
WO2023064875A1 (en)2021-10-142023-04-20Magic Leap, Inc.Microphone array geometry

Non-Patent Citations (61)

* Cited by examiner, † Cited by third party
Title
Backstrom, T. (Oct. 2015). "Voice Activity Detection Speech Processing," Aalto University, vol. 58, No. 10; Publication [online], retrieved Apr. 19, 2020, retrieved from the Internet: URL: https://mycourses.aalto.fi/pluginfile.php/146209/mod_resource/content/1/slides_07_vad.pdf; pp. 1-36.
Bilac, M. et al. (Nov. 15, 2017). "Gaze and Filled Pause Detection for Smooth Human-Robot Conversations," www.angelicalim.com, retrieved on Jun. 17, 2020, retrieved from the internet URL: http://www.angelicalim.com/papers/humanoids2017_paper.pdf entire document, 8 pages. (20.40).
Chinese Office Action dated Dec. 21, 2023, for CN Application No. 201980050714.4, with English translation, eighteen pages.
Chinese Office Action dated Jun. 2, 2023, for CN Application No. 2020 571488, with English translation, 9 pages.
European Office Action dated Dec. 12, 2023, for EP Application No. 20766540.7, four pages.
European Office Action dated Jun. 1, 2023, for EP Application No. 19822754.8, six pages.
European Search Report dated Nov. 12, 2021, for EP Application No. 19822754.8, ten pages.
European Search Report dated Nov. 21, 2022, for EP Application No. 20791183.5, nine pages.
European Search Report dated Oct. 6, 2022, for EP Application No. 20766540.7, nine pages.
Final Office Action mailed Apr. 10, 2023, for U.S. Appl. No. 16/850,965, filed Apr. 16, 2020, sixteen pages.
Final Office Action mailed Apr. 15, 2022, for U.S. Appl. No. 16/850,965, filed Apr. 16, 2020, fourteen pages.
Final Office Action mailed Aug. 4, 2023, for U.S. Appl. No. 17/254,832, filed Dec. 21, 2020, seventeen pages.
Final Office Action mailed Aug. 5, 2022, for U.S. Appl. No. 16/805,337, filed Feb. 28, 2020, eighteen pages.
Final Office Action mailed Jan. 11, 2023, for U.S. Appl. No. 17/214,446, filed Mar. 26, 2021, sixteen pages.
Final Office Action mailed Jan. 23, 2024, for U.S. Appl. No. 16/850,965, filed Apr. 16, 2020, fifteen pages.
Final Office Action mailed Oct. 6, 2021, for U.S. Appl. No. 16/805,337, filed Feb. 28, 2020, fourteen pages.
Final Office Action mailed Sep. 7, 2023, for U.S. Appl. No. 17/214,446, filed Mar. 26, 2021, nineteen pages.
Harma, A. et al. (Jun. 2004). "Augmented Reality Audio for Mobile and Wearable Appliances," J. Audio Eng. Soc., vol. 52, No. 6, retrieved on Aug. 20, 2019, Retrieved from the Internet: URL:https://pdfs.semanticscholar.org/ae54/82c6a8d4add3e9707d780dfb5ce03d8e0120.pdf, 22 pages.
International Preliminary Report on Patentability mailed Dec. 22, 2020, for PCT Application No. PCT/US2019/038546, 13 pages.
International Preliminary Report and Written Opinion mailed Apr. 13, 2023, for PCT Application No. PCT/US2021/53046, filed Sep. 30, 2021, nine pages.
International Preliminary Report and Written Opinion mailed Apr. 25, 2024, for PCT Application No. PCT/US2022/078063, seven pages.
International Preliminary Report and Written Opinion mailed Oct. 28, 2021, for PCT Application No. PCT/US2020/028570, filed Apr. 16, 2020, 17 pages.
International Preliminary Report and Written Opinion mailed Sep. 16, 2021, for PCT Application No. PCT/US20/20469, filed Feb. 28, 2020, nine pages.
International Preliminary Report on Patentability and Written Opinion mailed Apr. 25, 2024, for PCT Application No. PCT/US2022/078073, seven pages.
International Preliminary Report on Patentability and Written Opinion mailed May 2, 2024, for PCT Application No. PCT/US2022/078298, twelve pages.
International Search Report and Written Opinion mailed Jan. 11, 2023, for PCT Application No. PCT/US2022/078298, seventeen pages.
International Search Report and Written Opinion mailed Jan. 17, 2023, for PCT Application No. PCT/US22/78073, thirteen pages.
International Search Report and Written Opinion mailed Jan. 24, 2022, for PCT Application No. PCT/US2021/53046, filed Sep. 30, 2021, 15 pages.
International Search Report and Written Opinion mailed Jan. 25, 2023, for PCT Application No. PCT/US2022/078063, nineteen pages.
International Search Report and Written Opinion mailed Jul. 2, 2020, for PCT Application No. PCT/US2020/028570, filed Apr. 16, 2020, nineteen pages.
International Search Report and Written Opinion mailed May 18, 2020, for PCT Application No. PCT/US20/20469, filed Feb. 28, 2020, twenty pages.
International Search Report and Written Opinion mailed Sep. 17, 2019, for PCT Application No. PCT/US2019/038546, sixteen pages.
Jacob, R. "Eye Tracking in Advanced Interface Design", Virtual Environments and Advanced Interface Design, Oxford University Press, Inc. (Jun. 1995).
Japanese Notice of Allowance mailed Dec. 15, 2023, for JP Application No. 2020-571488, with English translation, eight pages.
Japanese Office Action mailed Jan. 30, 2024, for JP Application No. 2021-551538, with English translation, sixteen pages.
Japanese Office Action mailed May 2, 2024, for JP Application No. 2021-562002, with English translation, sixteen pages.
Kitayama, K. et al. (Sep. 30, 2003). "Speech Starter: Noise-Robust Endpoint Detection by Using Filled Pauses." Eurospeech 2003, retrieved on Jun. 17, 2020, retrieved from the internet URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.1472&rep=rep1&type=pdf entire document, pp. 1237-1240.
Liu, Baiyang, et al. (Sep. 6, 2015). "Accurate Endpointing with Expected Pause Duration," Interspeech 2015, pp. 2912-2916, retrieved from: https://scholar.google.com/scholar?q=BAIYANG,+Liu+et+al.:+(September+6,+2015).+Accurate+endpointing+with+expected+pause+duration,&hl=en&as_sdt=0&as_vis=1&oi=scholart.
Non-Final Office Action mailed Apr. 13, 2023, for U.S. Appl. No. 17/714,708, filed Apr. 6, 2022, sixteen pages.
Non-Final Office Action mailed Apr. 27, 2023, for U.S. Appl. No. 17/254,832, filed Dec. 21, 2020, fourteen pages.
Non-Final Office Action mailed Aug. 10, 2022, for U.S. Appl. No. 17/214,446, filed Mar. 26, 2021, fifteen pages.
Non-Final Office Action mailed Jun. 23, 2023, for U.S. Appl. No. 18/148,221, filed Dec. 29, 2022, thirteen pages.
Non-Final Office Action mailed Jun. 24, 2021, for U.S. Appl. No. 16/805,337, filed Feb. 28, 2020, fourteen pages.
Non-Final Office Action mailed Mar. 17, 2022, for U.S. Appl. No. 16/805,337, filed Feb. 28, 2020, sixteen pages.
Non-Final Office Action mailed Mar. 27, 2024, for U.S. Appl. No. 16/850,965, filed Apr. 16, 2020, sixteen pages.
Non-Final Office Action mailed Nov. 17, 2021, for U.S. Appl. No. 16/987,267, filed Aug. 6, 2020, 21 pages.
Non-Final Office Action mailed Sep. 15, 2023, for U.S. Appl. No. 16/850,965, filed Apr. 16, 2020, fourteen pages.
Non-Final Office Action mailed Sep. 29, 2022, for U.S. Appl. No. 16/850,965, filed Apr. 16, 2020, fifteen pages.
Non-Final Office Action mailed Apr. 12, 2023, for U.S. Appl. No. 17/214,446, filed Mar. 26, 2021, seventeen pages.
Non-Final Office Action mailed Oct. 4, 2021, for U.S. Appl. No. 16/850,965, filed Apr. 16, 2020, twelve pages.
Notice of Allowance mailed Dec. 15, 2023, for U.S. Appl. No. 17/214,446, filed Mar. 26, 2021, seven pages.
Notice of Allowance mailed Jul. 31, 2023, for U.S. Appl. No. 17/714,708, filed Apr. 6, 2022, eight pages.
Notice of Allowance mailed Mar. 3, 2022, for U.S. Appl. No. 16/987,267, filed Aug. 6, 2020, nine pages.
Notice of Allowance mailed Oct. 12, 2023, for U.S. Appl. No. 18/148,221, filed Dec. 29, 2022, five pages.
Notice of Allowance mailed Oct. 17, 2023, for U.S. Appl. No. 17/254,832, filed Dec. 21, 2020, sixteen pages.
Notice of Allowance mailed Nov. 30, 2022, for U.S. Appl. No. 16/805,337, filed Feb. 28, 2020, six pages.
Rolland, J. et al., "High-resolution inset head-mounted display", Optical Society of America, vol. 37, No. 19, Applied Optics, (Jul. 1, 1998).
Shannon, Matt et al. (Aug. 20-24, 2017). "Improved End-of-Query Detection for Streaming Speech Recognition", Interspeech 2017, Stockholm, Sweden, pp. 1909-1913.
Tanriverdi, V. et al. (Apr. 2000). "Interacting With Eye Movements In Virtual Environments," Department of Electrical Engineering and Computer Science, Tufts University, Medford, MA 02155, USA, Proceedings of the SIGCHI conference on Human Factors in Computing Systems, eight pages.
Tonges, R. (Dec. 2015). "An augmented Acoustics Demonstrator with Realtime stereo up-mixing and Binaural Auralization," Technische University Berlin, Audio Communication Group, retrieved on Aug. 22, 2019, Retrieved from the Internet: URL: https://www2.ak.tu-berlin.de/˜akgroup/ak_pub/abschlussarbeiten/2015/ToengesRaffael_MasA.pdf 100 pages.
Yoshida, A. et al., "Design and Applications of a High Resolution Insert Head Mounted Display", (Jun. 1994).

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US12347448B2 (en) | 2018-06-21 | 2025-07-01 | Magic Leap, Inc. | Wearable system speech processing
US12243531B2 (en) | 2019-03-01 | 2025-03-04 | Magic Leap, Inc. | Determining input for speech processing engine
US12327573B2 (en) | 2019-04-19 | 2025-06-10 | Magic Leap, Inc. | Identifying input for speech recognition engine
US12238496B2 (en) | 2020-03-27 | 2025-02-25 | Magic Leap, Inc. | Method of waking a device using spoken voice commands
US12417766B2 (en) | 2020-09-30 | 2025-09-16 | Magic Leap, Inc. | Voice user interface using non-linguistic input

Also Published As

Publication number | Publication date
US11328740B2 (en) | 2022-05-10
US20250006219A1 (en) | 2025-01-02
US11790935B2 (en) | 2023-10-17
US20220230658A1 (en) | 2022-07-21
US20230410835A1 (en) | 2023-12-21
US20210043223A1 (en) | 2021-02-11

Similar Documents

Publication | Publication Date | Title
US12094489B2 (en) | - | Voice onset detection
US12238496B2 (en) | - | Method of waking a device using spoken voice commands
US12327573B2 (en) | - | Identifying input for speech recognition engine
JP7745603B2 (en) | - | Wearable System Speech Processing
EP3891729B1 (en) | - | Method and apparatus for performing speech recognition with wake on voice
US20160019886A1 (en) | - | Method and apparatus for recognizing whisper
JP2022522748A (en) | - | Input determination for speech processing engine
US11895474B2 (en) | - | Activity detection on devices with multi-modal sensing
CN118314883A (en) | - | Selectively adapting and utilizing noise reduction techniques in call phrase detection
US11683634B1 (en) | - | Joint suppression of interferences in audio signal
CN119107947A (en) | - | In-vehicle voice interaction method, device, equipment and storage medium
Moritz et al. | - | Ambient voice control for a personal activity and household assistant
EP4383752A1 (en) | - | Method and device for processing audio signal by using artificial intelligence model
CN114387984A (en) | - | Voice signal processing method, device, equipment and storage medium

Legal Events

Date | Code | Title | Description

FEPP | Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP | Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP | Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP | Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

AS | Assignment

Owner name: MAGIC LEAP, INC., FLORIDA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEE, JUNG-SUK;JOT, JEAN-MARC;SIGNING DATES FROM 20210201 TO 20210211;REEL/FRAME:067870/0500

STPP | Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED

STPP | Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF | Information on status: patent grant

Free format text: PATENTED CASE

