CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of, and claims priority to, U.S. Non-Provisional patent application Ser. No. 16/707,523, filed Dec. 9, 2019, and entitled “AUTONOMOUSLY MOTILE DEVICE WITH SPEECH COMMANDS,” the contents of which are herein incorporated by reference in their entirety.
BACKGROUND

An autonomously motile device may be capable of moving or performing other actions within an environment. Speech-recognition processing, combined with natural-language understanding processing, may enable speech-based user control of the autonomously motile device, allowing it to perform tasks and produce output in response to a user's spoken commands. The combination of speech-recognition processing and natural-language understanding processing is referred to herein as speech processing. Speech processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions.
BRIEF DESCRIPTION OF DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
FIG. 1 illustrates a system and method for controlling an autonomously motile device using speech according to embodiments of the present disclosure.
FIGS. 2A, 2B, and 2C illustrate views of an autonomously motile device according to embodiments of the present disclosure.
FIG. 2D illustrates a view of an autonomously motile device in an environment according to embodiments of the present disclosure.
FIGS. 2E and 2F illustrate images captured by an autonomously motile device in an environment according to embodiments of the present disclosure.
FIG. 3 illustrates a speech-processing system in accordance with embodiments of the present disclosure.
FIGS. 4A, 4B, 5, 6A, 6B, 7A, and 7B illustrate speech control of an autonomously motile device using a user device in accordance with embodiments of the present disclosure.
FIGS. 8A and 8B illustrate user devices and remote systems for speech control of an autonomously motile device in accordance with embodiments of the present disclosure.
FIG. 9 illustrates a user device for speech control of an autonomously motile device in accordance with embodiments of the present disclosure.
FIGS. 10A and 10B illustrate an environment of and a map for an autonomously motile device according to embodiments of the present disclosure.
FIG. 11A illustrates a block diagram of an autonomously motile device according to embodiments of the present disclosure.
FIG. 11B illustrates components that may be stored in a memory of an autonomously motile device according to embodiments of the present disclosure.
FIG. 11C illustrates data that may be stored in a storage of an autonomously motile device according to embodiments of the present disclosure.
FIG. 11D illustrates sensors that may be included as part of an autonomously motile device according to embodiments of the present disclosure.
FIG. 12 illustrates a block diagram of a server according to embodiments of the present disclosure.
FIG. 13 illustrates a network including an autonomously motile device according to embodiments of the present disclosure.
DETAILED DESCRIPTION

An autonomously motile device—e.g., a robot, herein abbreviated to “AMD”—may be capable of performing actions in an environment in response to commands of a user. Such actions may include movement; this movement may include one or more of movement of a component of the autonomously motile device, such as movement of a display screen or camera of the autonomously motile device; movement of the autonomously motile device at a position in the environment, such as rotation of the autonomously motile device at the position; and/or movement of the autonomously motile device within the environment, such as movement from a first position in the environment to a second position in the environment. Other actions may include outputting audio using one or more loudspeakers of the autonomously motile device, capturing audio using one or more microphones of the autonomously motile device, and/or displaying images or video using one or more display screens of the autonomously motile device. The present disclosure is not, however, limited to any particular actions of the autonomously motile device.
The autonomously motile device may be associated with a speech-processing component configured for processing audio data, which may include a representation of an utterance of a user, to identify a command represented by the utterance and act in response to the command. The utterance may be, for example, “go to the kitchen”; the speech-processing component may determine that the command corresponds to a directive for the autonomously motile device to move in its environment such that it is disposed in a room in the environment designated as the “kitchen.” This speech-processing component may be disposed on the autonomously motile device itself or on a different system, such as a remote system.
The speech-processing component may include components for automatic speech recognition and/or natural-language understanding. Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics involving transforming audio data associated with speech into text representative of that speech. Natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from input data representing natural language (e.g., text data and/or audio data). ASR and NLU are often used together as part of a speech-processing system.
The autonomously motile device and/or user device may be configured to perform the speech processing. In other embodiments, a distributed computing environment may be used to perform some or all of the speech processing. An example distributed environment may include the autonomously motile device, user device, and a remote device; the autonomously motile device and/or user device may send audio data representing the speech to the remote device, which then performs some or all of the speech processing. A command represented in the audio data may then be executed by a combination of the remote system, the autonomously motile device, and/or the user device.
The autonomously motile device, user device, and/or remote system may be configured to process audio data upon a user speaking a particular word, phrase, and/or nonspeech sound—referred to collectively herein as a “wakeword”—to, e.g., send and/or process the audio data in expectation of the user speaking further words and/or sounds representing a command. Some components of the autonomously motile device and/or user device, such as a voice-activity detection (VAD) component (discussed in greater detail below) may continually listen for speech; upon detection of speech, the VAD component may cause the processing of the audio data using other components, such as a wakeword-detection component (also discussed in greater detail below). The wakeword may represent an indication for the autonomously motile device, user device, and/or remote system to perform further processing. For example, the autonomously motile device, user device, and/or remote system may be configured to detect a wakeword or other input and then process any subsequent audio following the wakeword or other input (and, in some embodiments, some amount of pre-wakeword audio) to detect any commands in the subsequent audio.
As an example, a wakeword may include a name by which a user refers to a user device, a name of a different device, a name of an assistant, and/or a name of a skill. Thus, if the wakeword is “Alexa,” a user may command the autonomously motile device, user device, and/or remote system to play music by saying “Alexa, play some music now.” A first wakeword may be associated with the user device, while a second wakeword, such as “Robot,” may be associated with the autonomously motile device. The autonomously motile device and/or user device may continually receive and process audio to detect the wakeword. Upon the user device recognizing the wakeword “Alexa,” the autonomously motile device and/or user device may process the subsequent audio (in this example, “play some music”) to determine a command. Additionally or alternatively, the autonomously motile device and/or user device may send audio data representing the speech to the remote system to perform speech processing on that audio to determine the command for execution.
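The wakeword-to-device association above can be sketched as follows. This is an illustrative assumption only: the wakeword table and the text-level routing function are invented for this sketch, and a production system would detect the wakeword in audio rather than in a transcript.

```python
# Illustrative sketch only: route a transcribed utterance by its leading
# wakeword. The wakeword table and function names are assumptions; real
# wakeword detection operates on audio data, not text.

WAKEWORDS = {
    "alexa": "user device",
    "robot": "autonomously motile device",
}

def route_utterance(transcript):
    """Return (target device, remaining command) if the utterance begins
    with a known wakeword, else None."""
    words = transcript.lower().split()
    if not words:
        return None
    first = words[0].strip(",.")
    if first not in WAKEWORDS:
        return None
    return WAKEWORDS[first], " ".join(words[1:])
```

For example, `route_utterance("Alexa, play some music now")` yields the user device as the target along with the subsequent command text "play some music now".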
The autonomously motile device may, in addition to receiving commands directly from a user (via, for example, audio captured by a microphone of the autonomously motile device), receive commands via a user device, such as a smartphone or tablet computer. If, for example, the autonomously motile device is disposed in a location in the environment different from that of a user (in, e.g., a different room in the environment), the user may issue commands to the autonomously motile device via the user device such as a smartphone to, e.g., summon the autonomously motile device, cause the autonomously motile device to capture image data, and/or cause the autonomously motile device to output audio. The user device may communicate directly with the autonomously motile device via a local network, such as a Wi-Fi network, and/or communicate with the autonomously motile device via a remote system via a wide-area network, such as the Internet.
In various embodiments, the user device may be associated with a second speech-processing system. The first speech-processing system of the autonomously motile device may be associated with at least some commands or actions specific to the autonomously motile device, such as movement in the environment. The second speech-processing system of the user device may not be associated with such commands and may instead be associated with non-motile-device-specific commands, e.g., general commands, such as commands to output information like weather forecasts, requests for information, music, video, or other such output.
Some commands, such as movement commands, may be associated with only the autonomously motile device. The user device may be incapable of acting in accordance with these commands (e.g., the user device may be incapable of movement) and thus only the autonomously motile device is capable of acting in accordance with these commands. Other commands may be associated with only the user device; these commands may include commands to output particular audio and/or video and/or to control other devices in the environment, such as a network-connected electrical outlet. Still other commands may be associated with both the autonomously motile device and the user device; in other words, both the user device and the autonomously motile device are capable of acting in accordance with these commands. If both devices are so capable, the user device and/or autonomously motile device may determine that only one device is to act in accordance with the commands. In other embodiments, both devices act in accordance with the commands.
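One way to model this three-way split (device-specific, user-device-specific, and shared commands) is with per-device capability sets. The capability names and the tie-breaking rule below are assumptions made for illustration, not elements of the disclosed system.

```python
# Hypothetical capability sets; the capability names are illustrative.
DEVICE_CAPABILITIES = {
    "autonomously motile device": {"move", "capture image", "play audio"},
    "user device": {"play audio", "play video", "control outlet"},
}

def eligible_devices(command_type):
    """All devices capable of acting on the command type."""
    return [name for name, caps in DEVICE_CAPABILITIES.items()
            if command_type in caps]

def select_device(command_type, preferred="user device"):
    """Pick one device to act; prefer `preferred` when both are capable,
    mirroring the single-actor behavior described above."""
    devices = eligible_devices(command_type)
    if not devices:
        return None
    return preferred if preferred in devices else devices[0]
```

Under this sketch, a movement command resolves to the autonomously motile device alone, while a shared command such as audio output resolves to whichever device the tie-breaking rule prefers.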
FIG. 1 illustrates an environment 102 in which an autonomously motile device 110 is disposed. A user device 112 may be in communication with the autonomously motile device 110 via a network 199. The user device 112 may be disposed in the environment 102 or may be disposed in a different environment (e.g., a user 104 of the user device 112 may be in the same room or house as the autonomously motile device 110 or may be in a different room or house). The user device 112 and/or autonomously motile device 110 may further be in communication with one or more remote systems 120 via the network 199 (and/or other network). Although the figures and discussion of the present disclosure illustrate certain operational steps of a method in a particular order, the steps described may be performed in a different order (as well as certain steps removed or added) without departing from the intent of the disclosure.
The autonomously motile device 110 may be capable of autonomous motion using one or more motors powering one or more wheels, treads, robotic limbs, wings, propellers, or similar actuators, but the present disclosure is not limited to a particular method of autonomous movement/motion. The autonomously motile device 110 may further include one or more display screens for displaying information to a user and/or receiving touch (or other) input from a user. The autonomously motile device 110 may further include a microphone array including one or more microphones and one or more loudspeakers; the microphone array may be used to receive audio data, such as an utterance represented by user audio, from the user. The utterance may be, for example, a command or request. The loudspeaker of the autonomously motile device 110 may be used to output audio to the user, such as audio related to a response to a command or audio related to a response to a request.
The device 110 may further include one or more sensors; these sensors may include, but are not limited to, an accelerometer, a gyroscope, a magnetic field sensor, an orientation sensor, a weight sensor, a temperature sensor, and/or a location sensor (e.g., a global-positioning system (GPS) sensor or a Wi-Fi round-trip time sensor). The device may further include a computer memory, a computer processor, and one or more network interfaces. The device 110 may be, in some embodiments, a robotic assistant or “robot” that may move about a room or rooms to provide a user with requested information or services. In other embodiments, the device 110 is capable of rotation but not linear motion; the device 110 may be mounted or placed on a surface or floor, for example, and may rotate in place to face a user. The disclosure is not, however, limited to only these devices or components, and the device 110 may include additional components without departing from the disclosure.
In various embodiments, with reference to FIG. 1, audio data representing an utterance is received (130) at a user device. A first speech-processing component determines (132) that the audio data represents a command and sends (134), to a second speech-processing component, first data corresponding to the command. A first indication is received (136) from the second speech-processing component that the command corresponds to an autonomously motile device. The user device determines (138) that a network connection exists between the user device and the autonomously motile device. The user device sends (140), to a device manager component, a second indication of the network connection and receives (142), from the device manager component, a third indication of authorization to execute the command. The second speech-processing component sends (144), to the autonomously motile device, second data corresponding to the command, and causes (146) the autonomously motile device to execute the command.
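The numbered steps (130)–(146) can be paraphrased as the following control flow. Every class and attribute name here is a stand-in invented for this sketch; the disclosed components are not Python objects.

```python
# Stand-in paraphrase of steps (130)-(146); all interfaces are invented
# for illustration only.

class CommandFlow:
    def __init__(self, command, targets_amd, connection_exists, authorized):
        self.command = command                      # (130)-(132): command found in audio
        self.targets_amd = targets_amd              # (136): command is for the AMD
        self.connection_exists = connection_exists  # (138): network link to the AMD
        self.authorized = authorized                # (142): device-manager authorization
        self.executed = None

    def run(self):
        # (134): first data is sent to the second speech-processing component.
        if not self.targets_amd:          # (136): not an AMD command; stop.
            return None
        if not self.connection_exists:    # (138)-(140): no usable connection.
            return None
        if not self.authorized:           # (142): execution not authorized.
            return None
        self.executed = self.command      # (144)-(146): command sent and executed.
        return self.executed
```

The sketch makes explicit that execution proceeds only when all three gates (target device, network connection, authorization) succeed.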
FIGS. 2A-2C illustrate an autonomously motile device 110 according to various embodiments of the present disclosure. Referring first to FIG. 2A, the device 110 includes wheels 202 disposed on left and right sides of a lower structure. The wheels 202 may be canted inwards toward an upper structure. In other embodiments, however, the wheels 202 may be mounted vertically. A caster 204 (i.e., a smaller wheel) may be disposed along a midline of the device 110. The front section of the device 110 may include a variety of external sensors. A first set of optical sensors 206 may be disposed along the lower portion of the front, and a second set of optical sensors 208 may be disposed along an upper portion of the front. A microphone array 210 may be disposed on a top surface of the device 110; the microphone array 210 may, however, be disposed on any surface of the device 110.
The device 110 may have one or more cameras mounted on one or more surfaces of the device 110. The cameras may be capable of capturing image data; this image data may be still pictures or moving video. The cameras may be capable of capturing wavelengths of light outside of the spectrum visible to humans, such as infrared light. The device 110 may include a camera 212 mounted to a fixed surface of the device 110; the device 110 may further include a camera 274 mounted on a display 214. The display 214, and thus the camera 274 mounted thereon, may be capable of horizontal rotation (e.g., camera “pan” motion) and/or vertical rotation (e.g., camera “tilt” motion). The device 110 may also feature a camera 276 mounted on a mast 272; this camera 276 may also be capable of pan and tilt and may further change its vertical position with respect to the device 110 upon extension/retraction of the mast 272 (e.g., camera “pedestal” motion).
One or more cameras 212 may be mounted to the front of the device 110; two cameras 212 may, for example, be used to provide for stereo vision. The distance between two cameras 212 may be, for example, 5-15 centimeters; in some embodiments, the distance between the cameras 212 is 10 centimeters. In some embodiments, the cameras 212 may exhibit a relatively wide horizontal field-of-view. For example, the horizontal field-of-view may be between 90° and 110°. A relatively wide field-of-view may provide for easier detection of moving objects, such as users or pets, which may be in the path of the device 110. Also, the relatively wide field-of-view may provide for the device 110 to more easily detect objects when rotating or turning.
Cameras 212 used for navigation may be of different resolution from, or sensitive to different wavelengths than, other cameras 274, 276 used for other purposes, such as video communication. For example, navigation cameras 212 may be capable of capturing infrared light, allowing the device 110 to operate in darkness or semi-darkness, while a camera 274 mounted on the display 214 and/or a camera 276 mounted on a mast 272 may be capable of capturing visible light and may be used to generate images suitable for viewing by a human. The navigation camera(s) 212 may have a resolution of approximately 300 kilopixels, while the other cameras 274, 276 may have a resolution of approximately 10 megapixels. In some implementations, navigation may use a single camera 212.
The cameras 212 may operate to provide stereo images of the environment, the user, or other objects as shown in, for example, FIG. 2E. For example, images from the cameras 212 may be accessed and used to generate stereo-image data corresponding to a face of a user. This stereo-image data may then be used for facial recognition, user identification, gesture recognition, gaze tracking, and/or other uses.
The display 214 may be mounted on a movable mount. The movable mount may allow the display to move along one or more degrees of freedom. For example, the display 214 may tilt, pan, change elevation, and/or rotate. In some embodiments, the size of the display 214 may be approximately 8 inches as measured diagonally from one corner of the display 214 to another.
An ultrasonic sensor 218 may be mounted on the front (and/or other surface) of the device 110 and may be used to provide sensor data that represents objects in front of the device 110. One or more loudspeakers 220 may be mounted on the device 110; the loudspeakers 220 may have different audio properties. For example, low-range, mid-range, and/or high-range loudspeakers 220 may be mounted on the front of the device 110. The loudspeakers 220 may be used to provide audible output such as alerts, music, and/or human speech (such as during a communication session with another user).
Other output devices 222, such as one or more lights, may be disposed on an exterior of the device 110. For example, a running light may be arranged on a front of the device 110. The running light may provide light for operation of one or more of the cameras, a visible indicator to the user that the device 110 is in operation, or other such uses.
One or more floor optical-motion sensors 224, 226 may be disposed on the front and/or underside of the device 110. The floor optical-motion sensors 224, 226 may provide an indication of motion of the device 110 relative to the floor or other surface underneath the device 110. In some embodiments, the floor optical-motion sensors 224, 226 comprise a light source, such as a light-emitting diode (LED), and/or an array of photodiodes. In some implementations, the floor optical-motion sensors 224, 226 may utilize an optoelectronic sensor, such as an array of photodiodes. Several techniques may be used to determine changes in the data obtained by the photodiodes and translate this into data indicative of a direction of movement, velocity, acceleration, and so forth. In some implementations, the floor optical-motion sensors 224, 226 may provide other information, such as data indicative of a pattern present on the floor, composition of the floor, color of the floor, and so forth. For example, the floor optical-motion sensors 224, 226 may utilize an optoelectronic sensor that may detect different colors or shades of gray, and this data may be used to generate floor characterization data.
FIG. 2B illustrates a side view of the device 110 according to various embodiments of the present disclosure. In this side view, the left side of the device 110 is illustrated; the right side may include similar features. The mast 272 is extended to a first position; a camera 276 is disposed at an upper end of the mast 272. An ultrasonic sensor 228 and an optical sensor 230 may be disposed on either side of the device 110. The camera 276 may be capable of rotation, panning, and tilting, and may capture a panoramic image.
The disposition of components of the device 110 may be arranged such that a center of gravity 232 is located between a wheel axle 234 of the front wheels 202 and the caster 204. Such placement of the center of gravity 232 may result in improved stability of the device 110 and may also facilitate lifting by a carrying handle.
In this illustration, the caster is shown in a trailing configuration, in which the caster is located behind or aft of the wheel axle 234 and the center of gravity 232. In another implementation, the caster may be in front of the axle of the wheels 202. For example, the caster 204 may be a leading caster 204 positioned forward of the center of gravity 232.
The device 110 may encounter a variety of different floor surfaces of the environment 102 and may transition between different floor surfaces during the course of its operation. A contoured underbody 236 may transition from a first height 238 at the front of the device 110 to a second height 240 that is proximate to the caster 204. This curvature may provide a ramp effect such that, if the device 110 encounters an obstacle that is below the first height 238, the contoured underbody 236 helps direct the device 110 over the obstacle without lifting the driving wheels 202 from the floor.
FIG. 2C illustrates a rear view of the device 110 according to various embodiments of the present disclosure. In this view, as with the front view, a first pair of optical sensors 242 may be located along the lower edge of the rear of the device 110, while a second pair of optical sensors 244 are located along an upper portion of the rear of the device 110. An ultrasonic sensor 246 may provide proximity detection for objects that are behind the device 110.
Charging contacts 248 may be provided on the rear of the device 110. The charging contacts 248 may include electrically conductive components that may be used to provide power (to, e.g., charge a battery) from an external source such as a docking station to the device 110. In other implementations, wireless charging may be utilized. For example, wireless inductive or wireless capacitive charging techniques may be used to provide electrical power to the device 110.
In some embodiments, the wheels 202 may include an electrically conductive portion 250 and provide an electrically conductive pathway between the device 110 and a charging source disposed on the floor. One or more data contacts 252 may be arranged along the back of the device 110. The data contacts 252 may be configured to establish contact with corresponding base data contacts within the docking station. The data contacts 252 may provide optical, electrical, or other connections suitable for the transfer of data.
Other output devices 260, such as one or more lights, may be disposed on an exterior of the back of the device 110. For example, a brake light may be arranged on the back surface of the device 110 to provide users an indication that the device 110 is slowing or stopping.
The device 110 may include a modular payload bay 254. In some embodiments, the modular payload bay 254 is located within a lower structure of the device 110. The modular payload bay 254 may provide mechanical and/or electrical connectivity with the device 110. For example, the modular payload bay 254 may include one or more engagement features such as slots, cams, ridges, magnets, bolts, and so forth that are used to mechanically secure an accessory within the modular payload bay 254. In some embodiments, the modular payload bay 254 includes walls within which the accessory may sit. In other embodiments, the modular payload bay 254 may include other mechanical engagement features such as slots into which the accessory may be slid and engage. The device 110 may further include a mast 272, which may include a camera 276 and a light 258.
As shown in FIG. 2D, the autonomously motile device 110 may move in the environment 102. The motion of the autonomously motile device 110 may be described as a trajectory 280, as shown in FIG. 2D. In some implementations, the trajectory 280 may comprise a series of poses. Each pose may be indicative of a particular location with respect to a plurality of orthogonal axes and rotation with respect to individual ones of the axes. For example, the pose may comprise information with respect to six degrees of freedom indicative of coordinates in three-dimensional space with respect to a designated origin and rotation with respect to each of the three axes.
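A six-degree-of-freedom pose, and a trajectory as a series of poses, could be represented as below. The field layout is one plausible encoding chosen for illustration, not the disclosure's data format.

```python
from dataclasses import dataclass

@dataclass
class Pose:
    # Coordinates in three-dimensional space relative to a designated origin.
    x: float
    y: float
    z: float
    # Rotation about each of the three orthogonal axes, in radians.
    roll: float
    pitch: float
    yaw: float

# A trajectory is an ordered series of poses.
trajectory = [
    Pose(0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    Pose(1.0, 0.0, 0.0, 0.0, 0.0, 0.5),  # moved forward, then rotated in place
]
```

Note that a pose change may alter any subset of the six values: pure rotation in place changes only the rotational fields, while translation changes only the coordinates.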
As described above, one or more motors or other actuators enable the autonomously motile device 110 to move from one location in the environment 102 to another. For example, a motor may be used to drive a wheel attached to a chassis of the autonomously motile device 110, which causes the autonomously motile device 110 to move. The autonomously motile device 110 may turn, move forward, move backward, and so forth. In another example, actuators may move legs allowing the autonomously motile device 110 to walk.
The autonomously motile device 110 may include one or more sensors 1154 (shown below in FIG. 11D). For example, the sensors 1154 may include a first camera 212a, a second camera 212b, an inertial measurement unit (IMU) 1180, microphones, time-of-flight (TOF) sensors, and so forth. The first camera 212a and the second camera 212b may be mounted to a common rigid structure that maintains a relative distance between the cameras 212a, 212b. An IMU 1180 may be attached to this common rigid structure or to one of the cameras affixed thereto. The first camera 212a and the second camera 212b may be arranged such that a sensor field-of-view 285 of the first camera 212a overlaps at least in part a sensor field-of-view of the second camera 212b. The sensors 1154 may generate sensor data 1147 (which may be stored in storage 1108 as illustrated in FIG. 11C discussed below). The sensor data 1147 may include image data 1142 acquired by the first camera 212a and the second camera 212b. For example, as shown in FIG. 2E, a pair of images 282 may comprise image data 1142 from the first camera 212a and the second camera 212b that are acquired at the same time. For example, a first pair of images 282a may be acquired at time t1 and a second pair of images 282b may be acquired at time t2. Some or all of the image data 1142 and/or audio data 1143 may be sent to the user device 112 for output thereon. The sensors 1154 are discussed in more detail with regard to FIG. 11D.
During operation, the autonomously motile device 110 may determine input data. The input data may include or be based at least in part on sensor data 1147 from the sensors 1154 onboard the autonomously motile device 110. In one implementation, a speech-processing component 1137 (which may be the speech-processing component illustrated in FIG. 3) may process raw audio data obtained by a microphone on the autonomously motile device 110 and produce input data. For example, the user may say “robot, come here,” which may produce input data “come here.” In another implementation, the input data may comprise information such as a command provided by another computing device, such as a smartphone or tablet computer.
A mapping component 1130 (which may be included in memory 1106 as illustrated in FIG. 11B) determines a representation of the environment 102 that includes the obstacles 283 and their location in the environment 102. During operation, the mapping component 1130 uses the sensor data 1147 from various sensors 1154 to determine information such as where the autonomously motile device 110 is, how far the autonomously motile device 110 has moved, the presence of obstacles 283, where those obstacles 283 are, and so forth.
A feature module processes at least a portion of the image data 1142 to determine first feature data 1148. The first feature data 1148 is indicative of one or more features 286 that are depicted in the image data 1142. For example, as shown in FIG. 2F, the features 286 may be edges of doors, shadows on the wall, texture on the walls, portions of artwork in the environment 102, and so forth. The environment 102 may include display devices that are capable of changing the images they portray. For example, a television 288 may be present in the environment 102. The picture presented by the television 288 may also have features 286.
Various techniques may be used to determine the presence of features 286 in image data 1142. For example, one or more of a Canny detector, Sobel detector, difference of Gaussians, features from accelerated segment test (FAST) detector, scale-invariant feature transform (SIFT), speeded up robust features (SURF), trained convolutional neural network, or other detection methodologies may be used to determine features 286 in the image data 1142. A feature 286 that has been detected may have an associated descriptor that characterizes that feature 286. The descriptor may comprise a vector value in some implementations. For example, the descriptor may comprise data indicative of the feature with respect to 256 different dimensions.
The first feature data 1148 may comprise information such as the descriptor for the feature 286, the images that the feature 286 was detected in, location in the image data 1142 of the feature 286, and so forth. For example, the first feature data 1148 may indicate that in a first image the feature 286 is centered at row 994, column 312 in the first image. These data and operations, along with those discussed below, may be used by the autonomously motile device 110, and/or other devices, to perform the operations described herein.
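As a concrete (and deliberately simplified) illustration of one detector named above, the following applies a Sobel operator to a tiny synthetic image. This is a toy sketch of the edge-response idea only, not the disclosed feature pipeline; a real system would also compute a descriptor (e.g., a 256-dimension vector) for each detected feature.

```python
# Toy Sobel edge response on a synthetic grayscale image (nested lists).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(image):
    """Per-pixel gradient magnitude; border pixels are left at 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * image[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * image[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out

# A vertical step edge: the response is strong only where intensity changes.
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]
response = sobel_magnitude(image)  # large near the edge column, 0.0 elsewhere
```

Pixels with a strong response would be candidate features 286, each then characterized by a descriptor and recorded with its row/column location, as described above.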
Regarding types of movement, the device 110 may move a component of the device 110 (such as a mast, display, arm, or other such component) as described herein, the device 110 may move in a single location in the environment 102 (e.g., rotate at a location), and/or move within the environment 102 (from, e.g., a first location to a second location). These movements may collectively be referred to as a change in pose of the device 110 from a first pose to a second pose. The first pose of the device 110 may be a first arrangement of components of the device 110, a first orientation of the device at a position in the environment 102, and a first position in the environment 102; the second pose of the device 110 may be a second arrangement of components of the device 110, a second orientation of the device at a position in the environment 102, and a second position in the environment 102. Only one of the first arrangement, first orientation, and/or first position may vary between the first pose and the second pose (e.g., the change in pose may include only one of a change in arrangement, orientation, and/or position). The change in pose may, however, include a change in more than one of the first arrangement, first orientation, and/or first position.
A speech-processing system is illustrated in FIG. 3. As explained in further detail below, one or more speech-processing systems may be disposed on the user device 112, on the autonomously motile device 110, on the remote system 120, or be distributed on some combination thereof. If some or all of the speech-processing system is disposed on the remote system 120, the user device 112 and/or autonomously motile device 110 may send audio data representing an utterance to the remote system 120 for processing. The user device 112 and/or autonomously motile device 110 may begin sending the audio data upon detection of a voice, upon detection of a wakeword, or upon detection of other user input, such as a touch gesture on a computer display. If the user device 112, autonomously motile device 110, and/or remote system 120 determines that an utterance corresponds to the autonomously motile device 110, the user device 112 and/or remote system 120 may send a command represented by the utterance to the autonomously motile device 110, as explained herein.
An audio capture component(s), such as a microphone or array of microphones of the user device 112 and/or autonomously motile device 110, captures audio. The user device 112, autonomously motile device 110, and/or remote system 120 may process audio data representing the audio to determine whether speech is detected. The user device 112, autonomously motile device 110, and/or remote system 120 may use various techniques to determine whether audio data includes speech. In some examples, the user device 112, autonomously motile device 110, and/or remote system 120 may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in audio data based on various quantitative aspects of the audio data, such as the spectral slope between one or more frames of the audio data; the energy levels of the audio data in one or more spectral bands; the signal-to-noise ratios of the audio data in one or more spectral bands; or other quantitative aspects. In other examples, the user device 112, autonomously motile device 110, and/or remote system 120 may implement a limited classifier configured to distinguish speech from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other examples, the user device 112, autonomously motile device 110, and/or remote system 120 may apply Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) techniques to compare the audio data to one or more acoustic models in storage. Such acoustic models may include models corresponding to speech, noise (e.g., environmental noise or background noise), or silence. Still other techniques may be used to determine whether speech is present in audio data.
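One of the simplest VAD heuristics mentioned above — thresholding per-frame energy — can be sketched as follows. The threshold value and frame shapes are illustrative assumptions, not values from the disclosure.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(s * s for s in frame) / len(frame)

def detect_speech(frames, threshold=0.01):
    """Flag each frame as speech when its mean energy exceeds a
    threshold. Real VAD would also consider spectral slope,
    band energies, and SNR, as described above."""
    return [frame_energy(f) > threshold for f in frames]

silence = [0.001] * 160       # near-zero samples
voiced = [0.5, -0.4] * 80     # oscillating, high-energy samples
print(detect_speech([silence, voiced]))  # [False, True]
```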
Once speech is detected in audio data representing the audio, the user device 112, autonomously motile device 110, and/or remote system 120 may use a wakeword-detection component to perform wakeword detection to determine when a user intends to speak an input to the speech-processing system. An example wakeword is “Alexa.” As used herein, a “wakeword” may refer to a single word or more than one consecutive words in speech. Wakeword detection may be performed without performing linguistic analysis, textual analysis, or semantic analysis. Instead, the audio data, representing the audio, is analyzed to determine if specific characteristics of the audio data match preconfigured acoustic waveforms, audio signatures, or other data to determine if the audio data “matches” stored audio data corresponding to a wakeword.
Thus, the wakeword-detection component may compare audio data to stored models or data to detect a wakeword. One approach for wakeword detection applies general large vocabulary continuous speech recognition (LVCSR) systems to decode audio signals, with wakeword searching being conducted in the resulting lattices or confusion networks. LVCSR decoding may require relatively high computational resources. Another approach for wakeword detection builds HMMs for each wakeword and for non-wakeword speech signals, respectively. The non-wakeword speech includes other spoken words, background noise, etc. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search for the best path in the decoding graph, and the decoding output is further processed to make the decision on wakeword presence. This approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework. In another example, the wakeword-detection component may be built on deep neural network (DNN)/recurrent neural network (RNN) structures directly, without an HMM being involved. Such an architecture may estimate the posteriors of wakewords with context information, either by stacking frames within a context window for the DNN or by using an RNN. Follow-on posterior threshold tuning or smoothing is applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
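The posterior smoothing and thresholding step described above can be sketched as a moving average over per-frame wakeword posteriors followed by a threshold. The window size and threshold are illustrative assumptions; a real detector tunes both.

```python
def smooth_posteriors(posteriors, window=3):
    """Moving-average smoothing of per-frame wakeword posteriors."""
    out = []
    for i in range(len(posteriors)):
        lo = max(0, i - window + 1)
        chunk = posteriors[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def wakeword_detected(posteriors, threshold=0.8):
    """Fire only when the smoothed posterior crosses the threshold."""
    return max(smooth_posteriors(posteriors)) >= threshold

# A brief one-frame spike is smoothed away; a sustained run fires.
assert not wakeword_detected([0.1, 0.95, 0.1, 0.1])
assert wakeword_detected([0.1, 0.9, 0.9, 0.95, 0.2])
```

Smoothing suppresses spurious single-frame detections at the cost of a few frames of latency, which is why it is typically paired with threshold tuning.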
Once the wakeword is detected and/or other user input is received, the user device 112 and/or autonomously motile device 110 may “wake” and begin transmitting audio data representing the audio to the speech-processing system (and/or begin processing the audio data itself). The audio data may include data corresponding to the wakeword, or the user device 112 and/or autonomously motile device 110 may remove the portion of the audio corresponding to the wakeword prior to sending the audio data to the remote system 120 (and/or begin processing the audio data itself).
An orchestrator component 302 may receive the audio data. The orchestrator component 302 may include memory and logic that enables the orchestrator component 302 to transmit various pieces and forms of data to various components of the speech-processing system, as well as perform other operations. The orchestrator component 302 may send the audio data to an ASR component 304. The ASR component 304 transcribes the audio data into text data. The text data output by the ASR component 304 may represent one or more than one (e.g., in the form of an n-best list) ASR hypotheses representing speech represented in the audio data. The ASR component 304 interprets the speech in the audio data based on a similarity between the audio data and pre-established language models. For example, the ASR component 304 may compare the audio data with models for sounds (e.g., subword units, such as phonemes, etc.) and sequences of sounds to identify words that match the sequence of sounds of the speech represented in the audio data. The ASR component 304 outputs text data representing one or more ASR hypotheses. The text data output by the ASR component 304 may include a top-scoring ASR hypothesis or may include an n-best list of ASR hypotheses. Each ASR hypothesis may be associated with a respective score. Each score may indicate a confidence of the ASR processing performed to generate the ASR hypothesis with which the score is associated.
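The n-best output described above can be sketched as a list of scored hypotheses from which the top-scoring entry is selected. The class and field names are illustrative, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class AsrHypothesis:
    text: str
    score: float  # confidence of the ASR processing for this hypothesis

def top_hypothesis(n_best):
    """Pick the top-scoring entry from an n-best list of hypotheses."""
    return max(n_best, key=lambda h: h.score)

n_best = [
    AsrHypothesis("play spinal tap music", 0.92),
    AsrHypothesis("play spinal tap musing", 0.41),
]
assert top_hypothesis(n_best).text == "play spinal tap music"
```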
The orchestrator component 302 may send text data (e.g., text data output by the ASR component 304 or the received text data) to an NLU component 306. The NLU component 306 attempts to make a semantic interpretation of the phrase(s) or statement(s) represented in the received text data. That is, the NLU component 306 determines one or more meanings associated with the phrase(s) or statement(s) represented in the text data based on words represented in the text data. The NLU component 306 determines an intent representing an action that a user desires be performed as well as pieces of the text data that allow a device (e.g., the user device 112 and/or the autonomously motile device 110) to execute the intent. For example, if the text data corresponds to “play Spinal Tap music,” the NLU component 306 may determine an intent that the speech-processing system output music and may identify “Spinal Tap” as an artist. For further example, if the text data corresponds to “what is the weather,” the NLU component 306 may determine an intent that the speech-processing system output weather information associated with a geographic location of the autonomously motile device 110 and/or user device 112. In another example, if the text data corresponds to “go to the kitchen,” the NLU component 306 may determine an intent that the user wishes the autonomously motile device 110 to move within its environment 102 to a room designated as the kitchen. The NLU component 306 may output NLU results data (which may include tagged text data, indicators of intent, etc.). The NLU results data may include an indication of which device, such as the autonomously motile device 110 and/or user device 112, should execute the intent.
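The intent-and-slot output described above can be sketched with toy pattern rules. The patterns, intent names, and result shape below are illustrative stand-ins for the statistical NLU component; they are not the disclosed implementation.

```python
def interpret(text):
    """Toy rule-based NLU: map an utterance to an intent plus slots."""
    text = text.lower()
    if text.startswith("play ") and text.endswith(" music"):
        # e.g., "play Spinal Tap music" -> artist slot "spinal tap"
        return {"intent": "PlayMusic",
                "slots": {"artist": text[len("play "):-len(" music")]}}
    if text.startswith("go to the "):
        # Movement intents may be routed to the autonomously motile device.
        return {"intent": "MoveToRoom",
                "slots": {"room": text[len("go to the "):]},
                "device": "autonomously_motile_device"}
    return {"intent": "Unknown", "slots": {}}

assert interpret("play Spinal Tap music")["slots"]["artist"] == "spinal tap"
assert interpret("go to the kitchen")["slots"]["room"] == "kitchen"
```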
As described above, the speech-processing system may perform speech processing using two different components (e.g., the ASR component 304 and the NLU component 306). One skilled in the art will appreciate that the speech-processing system, in at least some examples, may further implement a spoken language understanding (SLU) component that is configured to process the audio data to generate NLU results data.
In some examples, the SLU component may be equivalent to the ASR component 304 and the NLU component 306. For example, the SLU component may process the audio data and generate NLU data. The NLU results data may include intent data and/or slot data. While the SLU component may be equivalent to a combination of the ASR component 304 and the NLU component 306, the SLU component may process audio data and directly generate the NLU results data, without an intermediate step of generating text data (as does the ASR component 304). As such, the SLU component may take the audio data representing natural language speech and attempt to make a semantic interpretation of the natural language speech. That is, the SLU component may determine a meaning associated with the natural language speech and then implement that meaning. For example, the SLU component may interpret the audio data representing natural language speech from the user in order to derive an intent or a desired action or operation from the user. In some examples, the SLU component outputs a most likely NLU hypothesis recognized in the audio data, or multiple NLU hypotheses in the form of a lattice or an N-best list with individual hypotheses corresponding to confidence scores or other scores (such as probability scores, etc.).
The speech-processing system may include one or more skills 314. A “skill” may be software running on the speech-processing system that is akin to a software application running on a traditional computing device. That is, a skill 314 may enable the speech-processing system to execute specific functionality in order to provide data or produce some other requested output. For example, one skill 314 may be used to control the autonomously motile device 110.
The speech-processing system may be configured with more than one skill 314. For example, a weather service skill may enable the speech-processing system to provide weather information, a car service skill may enable the speech-processing system to book a trip with respect to a taxi or ride-sharing service, a restaurant skill may enable the speech-processing system to order a pizza with respect to the restaurant's online ordering system, etc. A skill 314 may operate in conjunction between the speech-processing system and other devices, such as the user device 112, in order to complete certain functions. Inputs to a skill 314 may come from speech-processing interactions or through other interactions or input sources. A skill 314 may include hardware, software, firmware, or the like that may be dedicated to a particular skill 314 or shared among different skills 314.
Additionally or alternatively to being implemented by the speech-processing system, a skill 314 may be implemented by a skill system. Doing so may enable the skill system to execute specific functionality in order to provide data or perform some other action requested by a user. Skills may be associated with different domains, such as smart home, music, video, flash briefing, shopping, and custom (e.g., skills not associated with any pre-configured domain). The speech-processing system may be configured with a single skill 314 dedicated to interacting with more than one skill system. The functionality described herein as a skill may be referred to using many different terms, such as an action, bot, app, or the like.
The speech-processing system may include a TTS component 308. The TTS component 308 may generate audio data (e.g., synthesized speech) from text data using one or more different methods. Text data input to the TTS component 308 may come from a skill 314, the orchestrator component 302, or another component of the speech-processing system.
In one method of synthesis called unit selection, the TTS component 308 matches text data against a database of recorded speech. The TTS component 308 selects matching units of recorded speech and concatenates the units together to form audio data. In another method of synthesis called parametric synthesis, the TTS component 308 varies parameters such as frequency, volume, and noise to create audio data including an artificial speech waveform. Parametric synthesis uses a computerized voice generator, sometimes called a vocoder.
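The unit-selection step can be sketched as looking up recorded units and concatenating their samples. A real system selects sub-word units (e.g., diphones) and smooths the joins; the per-word database and integer "samples" below are illustrative simplifications.

```python
def unit_selection(text, unit_db):
    """Sketch of unit-selection synthesis: look up a recorded unit for
    each word and concatenate the matching units' samples."""
    samples = []
    for word in text.lower().split():
        # Missing units yield silence here; a real system backs off
        # to smaller units or parametric synthesis.
        samples.extend(unit_db.get(word, []))
    return samples

unit_db = {"hello": [1, 2, 3], "world": [4, 5]}  # illustrative "recordings"
assert unit_selection("Hello world", unit_db) == [1, 2, 3, 4, 5]
```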
The speech-processing system may include a user-recognition component 310 that recognizes one or more users associated with data input to the speech-processing system. The user-recognition component 310 may take as input the audio data and/or the text data. The user-recognition component 310 may perform user recognition by comparing speech characteristics in the audio data to stored speech characteristics of users. The user-recognition component 310 may additionally or alternatively perform user recognition by comparing biometric data (e.g., fingerprint data, iris data, etc.), received by the speech-processing system in correlation with a user input, to stored biometric data of users. The user-recognition component 310 may additionally or alternatively perform user recognition by comparing image data (e.g., including a representation of at least a feature of a user), received by the speech-processing system in correlation with a user input, with stored image data including representations of features of different users. The user-recognition component 310 may perform other or additional user recognition processes, including those known in the art. For a particular user input, the user-recognition component 310 may perform processing with respect to stored data of users associated with the user device 112 that captured the natural language user input.
The user-recognition component 310 determines whether a user input originated from a particular user. For example, the user-recognition component 310 may generate a first value representing a likelihood that a user input originated from a first user, a second value representing a likelihood that the user input originated from a second user, etc. The user-recognition component 310 may also determine an overall confidence regarding the accuracy of user recognition operations.
The user-recognition component 310 may output a single user identifier corresponding to the most likely user that originated the natural language user input. Alternatively, the user-recognition component 310 may output multiple user identifiers (e.g., in the form of an N-best list) with respective values representing likelihoods of respective users originating the natural language user input. The output of the user-recognition component 310 may be used to inform NLU processing, processing performed by a skill 314, as well as processing performed by other components of the speech-processing system and/or other systems.
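The per-user likelihood values and N-best list described above can be sketched with cosine similarity over stored speech characteristics. The embedding vectors and user identifiers below are illustrative stand-ins; real systems compare learned speaker embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_users(query_embedding, enrolled):
    """Return an N-best list of (user_id, similarity), most likely first."""
    scored = [(uid, cosine(query_embedding, emb))
              for uid, emb in enrolled.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)

enrolled = {"user_a": [1.0, 0.0], "user_b": [0.0, 1.0]}
best = rank_users([0.9, 0.1], enrolled)[0][0]
assert best == "user_a"
```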
The speech-processing system may include profile storage 312. The profile storage 312 may include a variety of information related to individual users, groups of users, devices, etc. that interact with the speech-processing system. A “profile” refers to a set of data associated with a user, group of users, device, etc. The data of a profile may include preferences specific to the user, group of users, device, etc.; input and output capabilities of one or more devices; internet connectivity information; user bibliographic information; subscription information; as well as other information.
The profile storage 312 may include one or more user profiles, with each user profile being associated with a different user identifier. Each user profile may include various user identifying information. Each user profile may also include preferences of the user and/or one or more device identifiers, representing one or more devices registered to the user. Each user profile may include identifiers of skills that the user has enabled. When a user enables a skill, the user is providing the speech-processing system with permission to allow the skill to execute with respect to the user's inputs. If a user does not enable a skill, the speech-processing system may not permit the skill to execute with respect to the user's inputs.
The profile storage 312 may include one or more group profiles. Each group profile may be associated with a different group profile identifier. A group profile may be specific to a group of users. That is, a group profile may be associated with two or more individual user profiles. For example, a group profile may be a household profile that is associated with user profiles associated with multiple users of a single household. A group profile may include preferences shared by all the user profiles associated therewith. Each user profile associated with a group profile may additionally include preferences specific to the user associated therewith. That is, each user profile may include preferences unique from one or more other user profiles associated with the same group profile. A user profile may be a stand-alone profile or may be associated with a group profile. A group profile may include one or more device profiles representing one or more devices associated with the group profile.
The profile storage 312 may include one or more device profiles. Each device profile may be associated with a different device identifier. Each device profile may include various device identifying information. Each device profile may also include one or more user identifiers, representing one or more user profiles associated with the device profile. For example, a household device's profile may include the user identifiers of users of the household.
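The user, group, and device profiles described above can be sketched as linked records. The field names, identifiers, and skill names below are illustrative assumptions about one possible representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class UserProfile:
    user_id: str
    enabled_skills: Set[str] = field(default_factory=set)
    group_id: Optional[str] = None  # optional link to a group profile

@dataclass
class GroupProfile:
    group_id: str
    user_ids: List[str] = field(default_factory=list)
    device_ids: List[str] = field(default_factory=list)

@dataclass
class DeviceProfile:
    device_id: str
    user_ids: List[str] = field(default_factory=list)  # household users

home = GroupProfile("household_1", ["alice", "bob"], ["amd_110"])
alice = UserProfile("alice", {"weather", "amd_control"}, home.group_id)
robot = DeviceProfile("amd_110", home.user_ids)
assert "amd_control" in alice.enabled_skills
assert "alice" in robot.user_ids
```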
The NLU component 306 may perform NLU processing on the text data to generate NLU results data. Part of this NLU processing may include entity resolution processing, whereby an entity represented in the text data is resolved to a corresponding entity known to the speech-processing system. In at least some examples, the speech-processing system may include one or more entity-resolution services, which may be implemented separately from the NLU component 306. In at least some examples, each entity-resolution service may correspond to a different domain. In such examples, the NLU component 306 may determine a domain to which the natural language user input most likely corresponds, and may send NLU results data (which may include a tagged and slotted representation of a user input) to an entity-resolution service corresponding to the domain. The entity-resolution service may resolve one or more tagged entities represented in the text data sent to the entity-resolution service. Thereafter, the entity-resolution service may send, to the NLU component 306, data representing the resolved entities. The NLU component 306 may incorporate the received data into NLU results data representing the natural language user input. The NLU component 306 may send the NLU results data to the orchestrator component 302 for further processing, for example for sending to an appropriate skill 314.
The orchestrator component 302 may determine a profile identifier corresponding to the natural-language user input. In at least some examples, the orchestrator component 302 may receive a user identifier(s) from the user-recognition component 310, and may determine a profile identifier associated with the user identifier (or top-scoring user identifier if more than one user identifier is received from the user-recognition component 310). The orchestrator component 302 may send, to the appropriate skill 314 or other downstream component, data representing the profile identifier, the NLU results data (or a portion thereof, such as a portion representing a domain to which the natural language user input corresponds), and an instruction to provide skill identifiers that are associated with the profile identifier and that correspond to the NLU results data (or portion thereof).
Using the NLU results data, the orchestrator component 302 may select a skill identifier corresponding to a skill 314 to be invoked with respect to the natural language user input. The orchestrator component 302 may send, to the skill 314 corresponding to the selected skill identifier, data representing at least a portion of the NLU results data as well as other data that may be used in performing the requested operation, for example data from a vocal-characteristic component. The skill 314 may then perform processing based on the received at least a portion of the NLU results data and/or other data.
The appropriate skill 314 may then perform operations in accordance with the NLU data and may return data to the orchestrator 302 for eventual output. The returned data may include text data that may be sent by the orchestrator 302 to the TTS component 308 for speech synthesis.
FIGS. 4A, 4B, 5, 6A, 6B, 7A, and 7B illustrate speech control of an autonomously motile device 110 using a user device 112 in accordance with embodiments of the present disclosure. Referring first to FIGS. 4A and 4B, the user device 112 determines (410) a command spoken by, for example, the user 104. As described herein, determination of the command may include processing, using a speech-processing component, audio data captured by a microphone of the user device 112 to determine an utterance that represents the command. As also described herein, determination of presence of a representation of a wakeword in the audio data (by, e.g., the user device 112) may precede determination of the command. The wakeword may be associated with the user device 112 and/or autonomously motile device 110. In some embodiments, the audio data includes the representation of the command but does not include the representation of the wakeword. In these embodiments, the user device 112, autonomously motile device 110, and/or remote system 120 may determine whether the command should be carried out by the user device 112 and/or autonomously motile device 110, as described herein. In some embodiments, determination of the command includes receiving a user input, such as a touch gesture, that corresponds to a speech command trigger element, such as an icon displayed on a display of the user device associated with speech input.
In some embodiments, the user device 112 includes the speech-processing component that determines the command. In other embodiments, as described herein, the remote system 120 includes the speech-processing component; the user device 112 may send, via the network 199, audio data to the remote system 120 upon detection of voice, the wakeword, and/or other user input. The remote system 120 may then send the command and/or data associated with the command to the user device 112.
In some embodiments, the user device 112 and/or remote system 120 determines that the command is associated with the user device 112, which may then perform an action responsive to the command, such as outputting audio. The user device 112 and/or remote system 120 may make this determination if, for example, the audio data includes a representation of a wakeword associated with the user device 112. In other embodiments, the user device 112 and/or remote system 120 may determine, using the NLU processing techniques described herein, that the command corresponds to an intent associated with the user device 112. For example, if the intent corresponds to “play music,” the user device 112 and/or remote system 120 may determine that the user device 112 should output the corresponding music.
In other embodiments, however, neither the user device 112 nor the remote system 120 may determine that the command is associated with the user device 112 and, in particular, may determine that the command may be associated with the autonomously motile device 110. In these embodiments, the user device (and/or remote system 120) sends (412) data representing the command to an AMD speech-processing component 402. As described herein, the AMD speech-processing component 402 may send commands to the autonomously motile device 110, such as movement commands; the user device 112 (and/or a speech-processing component disposed thereon or on the remote system 120) may not be capable of sending these commands. The data representing the command may be audio data that includes a representation of an utterance and/or text data corresponding to the utterance (as generated by, for example, the ASR component 304).
The AMD speech-processing component 402 may determine (414) that the command corresponds to the autonomously motile device 110. The AMD speech-processing component 402 may determine, for example, that the audio data includes a representation of a wakeword corresponding to the autonomously motile device 110, such as “Robot.” The AMD speech-processing component 402 may instead or in addition, using an NLU component 306, determine that an intent corresponding to the command, such as “movement,” is associated with the autonomously motile device 110.
Before taking an action in response to the command, the AMD speech-processing component 402 may send (416), to the user device 112, a request to create or confirm a network connection (418) (using, e.g., the network 199) between the user device 112 and the autonomously motile device 110. Creation or confirmation of the network connection may include identifying the autonomously motile device 110, such as determining its address on the network 199 and/or determining a user profile (stored in, e.g., profile storage 312) associated with the user device 112 and/or autonomously motile device 110. The network connection may be a direct connection between the user device 112 and the autonomously motile device 110 or may be relayed through one or more other systems, such as the remote system 120.
Creation or confirmation of the network connection (418) may further include determining which of a plurality of autonomously motile devices 110 should take action corresponding to the command. The environment 102 may, for example, include two or more autonomously motile devices 110. In these embodiments, the AMD speech-processing component 402 may select one or more autonomously motile devices 110 based on one or more factors. These factors may include, for example, proximity of the autonomously motile devices 110 to humans in the environment, proximity of the autonomously motile devices 110 to one or more locations associated with the command, battery levels of the autonomously motile devices 110, capabilities of the autonomously motile devices 110 to act in accordance with the command, and/or properties of a portion of the environment 102 proximate each autonomously motile device 110. In some embodiments, a device other than an autonomously motile device 110 is selected to carry out the command. If, for example, the command corresponds to “call Mom,” an autonomously motile device 110 closest to a person associated with the name “Mom” may be selected. If, on the other hand, another device is disposed closer to the person than is any autonomously motile device 110, that other device (e.g., a smart loudspeaker) may be selected in lieu of the autonomously motile device 110. If the command corresponds to “go to the kitchen,” an autonomously motile device 110 having a most detailed map of a room in the environment 102 associated with “kitchen” may be selected.
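The multi-factor device selection described above can be sketched as a scoring function over candidate devices. The particular factors (distance to the command's location, battery level, capability) come from the passage above; the weights, field names, and scoring formula are illustrative assumptions.

```python
def select_device(devices, command_location):
    """Score candidate devices and pick the best one for a command."""
    def score(d):
        if not d["capable"]:
            return float("-inf")  # incapable devices are never selected
        dx = d["position"][0] - command_location[0]
        dy = d["position"][1] - command_location[1]
        distance = (dx * dx + dy * dy) ** 0.5
        # Closer devices with fuller batteries win; weights are arbitrary.
        return d["battery"] - distance
    return max(devices, key=score)["id"]

devices = [
    {"id": "amd_1", "position": (0, 0), "battery": 0.9, "capable": True},
    {"id": "amd_2", "position": (9, 9), "battery": 1.0, "capable": True},
    {"id": "speaker", "position": (1, 0), "battery": 1.0, "capable": False},
]
assert select_device(devices, (1, 1)) == "amd_1"
```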
Creation or confirmation of the network connection may include determining that image data captured by the autonomously motile device 110 is sent to, and is displayed by, the user device 112. The user device 112 (and associated user 104) may be disposed in a different environment than the environment 102 of the autonomously motile device 110; the user 104 may thus wish to view one or more images of the environment 102 before the autonomously motile device 110 acts in accordance with the command, particularly if the command corresponds to movement. The image data may be video data (e.g., a live video feed). The user 104 of the user device 112 may select which camera of the autonomously motile device 110 is used to capture the image data and/or cause a component of the autonomously motile device 110, such as the mast, to move to change how the camera captures the image data. The user device 112 may instead or in addition prompt the user 104 for confirmation of carrying out the command via an audio and/or video prompt.
The user device 112 (and/or remote system 120) may then send (420), to an autonomously motile device manager component 404, an indication that the network connection exists. The indication may further include information identifying the user device 112, the autonomously motile device 110, and/or the user 104 (such as an indication of a user account or user identification number of the user 104). The indication may further include authentication information, such as a security token, that indicates the identity of the user 104 and/or user device 112.
The autonomously motile device manager component 404 may be disposed on the remote system 120 and/or the autonomously motile device 110. The autonomously motile device manager component 404 may include one or more systems for managing the autonomously motile device 110; such management may include authorizing network communication to or from the autonomously motile device 110, associating the autonomously motile device 110 with a user account (during, for example, device registration), and/or permitting or denying commands to be sent to or from the autonomously motile device 110. In various embodiments, the autonomously motile device manager component 404 authorizes (422) execution of the command. Authorization may include determining that the user account of the user indicates allowance of execution; some users, such as children, may have user accounts that do not permit certain commands, such as capturing images. Authorization may further include determining that both the user device 112 and the autonomously motile device 110 are associated with the same user account (e.g., a user, such as the owner of the devices 110, 112, is permitted to operate both devices 110, 112). The autonomously motile device manager component 404 may thereafter send (424), to the AMD speech-processing component 402, an indication of the authorization. This indication may be, for example, a security token associated with the command and/or autonomously motile device 110.
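The two authorization checks described above — a shared account and per-account command permissions — can be sketched as follows. The account shape, permission names, and token format are illustrative assumptions, not the disclosed mechanism.

```python
def authorize(command, user_account, device_account):
    """Authorize a command: the user device and the autonomously motile
    device must share an account, and that account's permissions must
    allow the command. Returns a token-like dict on success, else None."""
    if user_account["id"] != device_account["id"]:
        return None  # devices not associated with the same account
    if command not in user_account["permitted_commands"]:
        return None  # e.g., a child account may not capture images
    # Stand-in for a security token associated with the command.
    return {"token": f"auth-{user_account['id']}-{command}"}

adult = {"id": "acct_1", "permitted_commands": {"move", "capture_image"}}
child = {"id": "acct_1", "permitted_commands": {"move"}}
device = {"id": "acct_1"}
assert authorize("capture_image", adult, device) is not None
assert authorize("capture_image", child, device) is None
```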
Referring to FIG. 4B, the AMD speech-processing component 402 may then send (426), to the autonomously motile device 110, data corresponding to the command. The data may be one or more computer instructions that represent the command. For example, if the command was “go to the kitchen,” the data may include movement instructions such as “move forward 10 meters” or “turn left ninety degrees.” The AMD speech-processing component 402 may receive, in response, an indication that the data was received. The AMD speech-processing component 402 may send (428), to the user device 112, an indication of successfully sending the command. This indication may include audio data such as a tone and/or audio data representing speech. The user device 112 may thereafter output (430) the indication and the autonomously motile device 110 may execute (432) the command.
Referring to FIG. 5, as described above, the user device 112 may determine (510) a command and send (512) the command to the AMD speech-processing component 402, which may determine (514) that the command is for the autonomously motile device 110. As described above, the AMD speech-processing component 402 may send (516), to the user device 112, a connection request. Unlike the embodiments illustrated in FIGS. 4A and 4B, however, the user device 112 determines (518) that no connection exists and/or is capable of being made between the user device 112 and the autonomously motile device 110. The user device 112 may determine that no connection exists by sending a request for a response to the autonomously motile device 110 using the network 199 and not receiving said response. In other embodiments, the user device 112 may determine that no connection exists by sending a request to the remote system 120 for the status of the connection and receiving, from the remote system 120, a response indicating that no connection exists.
If no connection exists, the user device 112 may attempt to establish the connection by, for example, sending a corresponding request to the autonomously motile device 110 and/or autonomously motile device manager component 404. The user device 112 may determine, however, that it is not possible to establish the connection for one or more of a number of reasons. For example, the autonomously motile device 110 may be powered off, may have a low battery, or may not be connected to the network 199. Instead or in addition, the user 104 and/or user device 112 may not have permission to communicate with the autonomously motile device 110; the user device 112 may determine this lack of permission by, for example, sending a request to communicate to the remote system 120 and receiving a negative response. The user device 112 may thereafter send (518), to the autonomously motile device manager component 404, an indication of no connection. The autonomously motile device manager component 404 may, in response, send (520) an indication of the error to the AMD speech-processing component 402, which may output audio or video corresponding to the indication. The AMD speech-processing component 402 may instead or in addition send (522) the same or a corresponding indication to the user device 112, which may similarly output (524) corresponding audio and/or video.
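The two connection checks described above — first requesting a response from the device directly, then falling back to asking the remote system for the connection status — can be sketched as below. The callables are hypothetical stand-ins for the network requests; none of these names are from the disclosure.

```python
# Sketch of the connection-existence check: ping the device over the
# network; if no response arrives (modeled as TimeoutError), ask the
# remote system whether a connection exists.
def connection_exists(ping_device, query_remote) -> bool:
    """Return True if a connection to the device exists or can be confirmed."""
    try:
        if ping_device():          # request a response from the device
            return True
    except TimeoutError:
        pass                       # no response: fall through to remote check
    return query_remote()          # remote system reports connection status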
Referring to FIGS. 6A and 6B, as described above, the user device 112 may determine (610) a command and send (612) the command to the AMD speech-processing component 402, which may determine (614) that the command is for the autonomously motile device 110. As described above, the AMD speech-processing component 402 may send (616), to the user device 112, a connection request. As in the embodiments illustrated in FIG. 5, the user device 112 determines (618) that the connection does not exist.
The user device 112 sends (620), to the autonomously motile device 110, a request to establish the connection. The autonomously motile device 110 and/or autonomously motile device manager component 404 may determine, as described herein, that the user 104 and/or user device 112 is authorized to establish the connection and send (622) connection data to the user device 112. The connection data may include data necessary to establish the connection, such as a network address of the autonomously motile device 110. The user device 112 and the autonomously motile device 110 may thereafter establish (624) the connection. As described above, the user device 112 may then send (626) an indication of the connection to the autonomously motile device manager component 404, which may authorize (628) execution of the command and send (630) corresponding authorization to the AMD speech-processing component 402. The AMD speech-processing component 402 may then send (632) the command and send (634) the indication of success. The user device 112 may output (636) corresponding audio and/or video, and the autonomously motile device 110 may execute (638) the command.
Referring to FIGS. 7A and 7B, as described above, the user device 112 may determine (710) a command and send (712) the command to the AMD speech-processing component 402, which may determine (714) that the command is for the autonomously motile device 110. As described above, the AMD speech-processing component 402 may send (716), to the user device 112, a connection request. As in the embodiments illustrated in FIG. 5, the user device 112 determines (718) that the connection does not exist.
The user device 112 sends (720), to the AMD manager component 404, a request to establish the connection. The autonomously motile device 110 and/or autonomously motile device manager component 404 may determine, as described herein, that the user 104 and/or user device 112 is not authorized to establish the connection and send (722) a refusal to establish the connection to the user device 112. The autonomously motile device 110 and/or autonomously motile device manager component 404 may instead or in addition send the refusal if the autonomously motile device 110 is not capable of executing the command, for example if the autonomously motile device 110 lacks sufficient battery power, is damaged or inoperative, is stuck, and/or is blocked by one or more obstacles. The user device 112 and the autonomously motile device 110 may thereafter send (724) an indication that the connection is not established. The autonomously motile device manager component 404 may thereafter send (726) an error indication to the AMD speech-processing component 402, which may send (728) it or a corresponding indication to the user device 112. The user device 112 may then output (730) corresponding audio and/or video.
FIGS. 8A and 8B illustrate embodiments of the present disclosure. Referring first to FIG. 8A, the user device 112 may include one or more user application(s) 802 that perform some or all of the actions associated with the user device 112 as described herein. The user application(s) 802 may be one or more software program(s) installed in a memory of the user device 112; the user application(s) 802 may be installed via the network 199 by downloading it/them from a software repository such as an app store. The user 104 may associate the user application(s) 802 with a user account by providing account credentials such as a username and password. The user application(s) 802 may cause the user device 112 to capture audio and output audio and/or video as described herein.
The remote system 120 may include the autonomously motile device manager component 404, one or more AMD speech-processing component(s) 402, and one or more user speech-processing component(s) 804. The user speech-processing component(s) 804 may be used to process audio data to determine a command represented therein, as described herein. A first user speech-processing component 804 may be associated with a first set of commands and/or skills, while a second user speech-processing component 804 may be associated with a second, different set of commands and/or skills. For example, a first user speech-processing component 804 may be used to give speech commands to the autonomously motile device 110, while a second user speech-processing component 804 may be used to output content from a media streaming service. Similarly, a first AMD speech-processing component 402 may be associated with a first set of commands and/or skills, while a second AMD speech-processing component 402 may be associated with a second, different set of commands and/or skills. For example, a first AMD speech-processing component 402 may be used for navigation in an environment 102, while a second AMD speech-processing component 402 may be used for video communication. The AMD speech-processing component(s) 402 may be part of or otherwise associated with the AMD manager component 404. Referring to FIG. 8B, the user speech-processing component 812 may instead be disposed on the user device 112.
FIG. 9 illustrates one embodiment of a user device 112. As illustrated, the user device 112 is a tablet computer, but the present disclosure is not limited to only this embodiment. The user device 112 may include a display 902 for displaying images; the display 902 may include a window 904 that displays image data from the autonomously motile device 110 (e.g., a live view). The display 902 may further include a number of user-interface elements; one such element may be a speech-command trigger element 906. When a user input 908 (e.g., a touch event) occurs on an area of the display 902 corresponding to the element 906, the user device 112 may begin processing audio data (as described herein) and/or begin transmitting audio data to the remote system 120 for processing.
FIGS. 10A and 10B illustrate a representation of an environment 102 and a corresponding map of an autonomously motile device 110 according to embodiments of the present disclosure. Referring first to FIG. 10A, an example environment 102 includes three rooms 1002, 1004, 1006. A first room 1002 includes a kitchen countertop 1008a and a table and chairs 1010a. A second room 1004 includes bookshelves 1012a and a desk 1014a. A third room 1006 includes a sofa 1016a, a loveseat 1018a, and a wall-mounted television 1020a. In this example environment 102, some objects (such as the sofa 1016a) extend from the floor of the environment 102 to a point between the ceiling and the floor; some objects (such as the television 1020a) do not touch the floor; and other objects (such as the bookshelves 1012a) extend from floor to ceiling. The environment is bordered by exterior walls 1022a and may include one or more interior walls 1024a. The device 110 is capable of movement, as disclosed herein, within the environment 102. Environments 102 having any number of rooms and/or any types of objects, however, are within the scope of the present disclosure.
FIG. 10B illustrates a map 1026 of the environment 102. The device 110 may generate the map 1026 or may receive the map 1026 from the system 120. The map 1026 includes data representing the position 1022b of exterior walls 1022a and data representing the position 1024b of interior walls 1024a. The map data may be a set of (x,y) coordinates that indicate the positions 1022b, 1024b of the walls 1022a, 1024a with respect to a (0,0) origin point, such as a bottom-left point of the map 1026. For example, if an exterior wall 1022a extends from the (0,0) origin point to a point 10 meters to the right, the map data may include the coordinates (0,0)-(10,0).
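Wall positions stored as coordinate pairs relative to a (0,0) origin, as described above, can be sketched as below. The `walls` list and `wall_length` helper are hypothetical names for illustration; only the 10-meter bottom wall comes from the example in the text.

```python
# Sketch of map data: each wall is a segment of two (x, y) endpoints
# measured in meters from the bottom-left (0, 0) origin of the map.
walls = [
    ((0, 0), (10, 0)),   # exterior wall from the origin, 10 m to the right
    ((10, 0), (10, 8)),  # hypothetical right exterior wall
    ((0, 0), (0, 8)),    # hypothetical left exterior wall
]


def wall_length(segment):
    """Euclidean length of a wall segment in meters."""
    (x1, y1), (x2, y2) = segment
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
```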
The map 1026 may further include data representing the positions 1008b, 1010b, 1012b, 1014b, 1016b, 1018b, 1020b of the objects 1008a, 1010a, 1012a, 1014a, 1016a, 1018a, 1020a. The data representing the positions 1008b, 1010b, 1012b, 1014b, 1016b, 1018b, 1020b may similarly be a set of further (x,y) coordinates that represent the position and size of each object 1008a, 1010a, 1012a, 1014a, 1016a, 1018a, 1020a in the environment 102 with respect to the (0,0) origin point. For example, if the sofa 1016a has dimensions of 1 meter by 0.5 meters, and if it is positioned such that its lower-left corner is disposed at the grid point (10,1), the data representing its position may be (10,1)×(10.5, 2), denoting its lower-left corner and upper-right corner. Objects having more complicated shapes (with more than four sides) may be represented by additional sets of (x,y) coordinates, such that each pair of (x,y) coordinates defines a side of the object. Objects having curved or otherwise more complicated sides may be represented by data defining the curve, such as parameters defining an arc segment, or may be estimated as a set of straight lines.
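An object stored as its lower-left and upper-right corners, matching the sofa example above, can be sketched as follows. The `contains` helper is a hypothetical addition showing one use of such a representation (e.g., obstacle checks); it is not named in the disclosure.

```python
# Sketch of an object position stored as two corners, per the sofa
# example: lower-left (10, 1) and upper-right (10.5, 2).
sofa = ((10.0, 1.0), (10.5, 2.0))


def contains(box, point):
    """True if the (x, y) point lies inside the axis-aligned box."""
    (x1, y1), (x2, y2) = box
    x, y = point
    return x1 <= x <= x2 and y1 <= y <= y2
```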
The device 110 and/or system 120 may determine the map 1026 by processing input data, such as image data received from the camera 274 or infrared data received from one or more cameras 212. The device 110 may move within the environment 102 while it captures the image data. In some embodiments, the device 110 and/or system 120 processes the image data using image-processing techniques to determine objects therein and then determines the position data based thereon. For example, if the device 110 captures image data that includes a representation of the sofa 1016a, the device 110 and/or system 120 may determine, based on a likely size of the sofa 1016a, how far the sofa 1016a is from the device 110 and base the (x,y) coordinates of the representation of the sofa 1016b thereon. In other embodiments, the device 110 and/or system 120 uses the multiple cameras to capture binocular images of the environment 102 and, based on a known distance between the multiple cameras, determines the distance between the device 110 and an object depicted in the binocular images. Any method of determining the coordinates of the positions 1022b, 1024b of the walls 1022a, 1024a and the positions 1008b, 1010b, 1012b, 1014b, 1016b, 1018b, 1020b of the objects 1008a, 1010a, 1012a, 1014a, 1016a, 1018a, 1020a is within the scope of the present disclosure.
The map data may further include a grid made up of grid units 828. If the map data does not include the grid, the device 110 may create the grid. Each grid unit may have dimensions of any size, such as 100 centimeters in length and width. The grid units need not be square and need not all be the same size; they may be, for example, hexagonal. The system 120 and/or device 110 may create the grid by beginning at the (0,0) origin point and placing grid tiles adjacent in the positive x- and y-dimensions. In other embodiments, the system 120 and/or device 110 may determine the length and width of each grid unit by determining the length and width of the map 1026 and/or rooms 1002, 1004, 1006 and dividing by an integer, such as ten, so that no fractionally-sized grid units 828 are needed to fully populate the map 1026 with the grid units 828.
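The grid-unit sizing described above — dividing the map's length and width by an integer so that no fractional units are needed — can be sketched as below. The function name and default divisor are assumptions for illustration.

```python
# Sketch of choosing a grid-unit size that tiles the map exactly:
# divide each map dimension by an integer (the text suggests ten).
def grid_unit_size(map_width, map_height, divisions=10):
    """Return (unit_width, unit_height) in the map's units of measure."""
    return map_width / divisions, map_height / divisions
```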
When the device 110 determines a direction and distance of movement associated with a user input, as described herein, it may determine its position on the map 1026 and plot the distance in the direction. If an obstruction intersects with the plotted path, the device 110 may truncate its path to avoid hitting the obstruction, alter the path around the obstruction, or refuse to move altogether. The device 110 may send an indication of failure to the user device 112.
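The path-truncation behavior described above can be sketched by stepping along the plotted direction and stopping at the last unobstructed point. The names, the step size, and the `blocked` predicate are hypothetical; a real planner would intersect the path with the map geometry rather than sample it.

```python
# Sketch of truncating a straight path before an obstruction: walk in
# small steps along a unit direction and stop before any blocked point.
def truncate_path(start, direction, distance, blocked, step=0.1):
    """Return the farthest reachable (x, y) point and the distance traveled."""
    x, y = start
    dx, dy = direction          # assumed to be a unit vector
    traveled = 0.0
    while traveled + step <= distance:
        nx, ny = x + dx * step, y + dy * step
        if blocked((nx, ny)):   # next step would hit the obstruction
            break
        x, y, traveled = nx, ny, traveled + step
    return (x, y), traveled
```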
FIG. 11A is a block diagram conceptually illustrating an autonomously motile device 110 or user device 112 in accordance with the present disclosure. FIG. 12 is a block diagram conceptually illustrating example components of a system 120, such as a remote server, which may assist with creating a map of an environment 102, ASR processing, NLU processing, etc. The term "server" as used herein may refer to a traditional server as understood in a server/client computing structure but may also refer to a number of different computing components that may assist with the operations discussed herein. For example, a server may include one or more physical computing components (such as a rack server) that are connected to other devices/components either physically and/or over a network and are capable of performing computing operations. A server may also include one or more virtual machines that emulate a computer system and are run on one device or across multiple devices. A server may also include other combinations of hardware, software, firmware, or the like to perform operations discussed herein. The system 120 may be configured to operate using one or more of a client-server model, a computer bureau model, grid computing techniques, fog computing techniques, mainframe techniques, utility computing techniques, a peer-to-peer model, sandbox techniques, or other computing techniques.
Multiple servers may be included in the system 120, such as one or more servers for performing ASR processing, one or more servers for performing NLU processing, one or more skill system(s) for performing actions responsive to user inputs, etc. In operation, each of these devices (or groups of devices) may include computer-readable and computer-executable instructions that reside on the respective server.
FIG. 11A is a block diagram of some components of the autonomously motile device 110, such as network interfaces 1119, sensors 1154, and output devices, according to some implementations. The components illustrated here are provided by way of illustration and not necessarily as a limitation. For example, the autonomously motile device 110 may utilize a subset of the particular network interfaces 1119, output devices, or sensors 1154 depicted here, or may utilize components not pictured. One or more of the sensors 1154, output devices, or a combination thereof may be included on a moveable component that may be panned, tilted, rotated, or any combination thereof with respect to a chassis of the autonomously motile device 110.
The autonomously motile device 110 may include input/output device interfaces 1102 that connect to a variety of components such as an audio output component such as a speaker 1121, a wired headset or a wireless headset (not illustrated), or other component capable of outputting audio. The autonomously motile device 110 may also include an audio capture component. The audio capture component may be, for example, a microphone 1120 or array of microphones, a wired headset or a wireless headset, etc. If an array of microphones is included, approximate distance to a sound's point of origin may be determined by acoustic localization based on time and amplitude differences between sounds captured by different microphones of the array. The autonomously motile device 110 may additionally include a display 214 for displaying content. The autonomously motile device 110 may further include a camera 274/276/212, a light 1109, a button 1107, an actuator, and/or a sensor.
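The acoustic localization mentioned above can be illustrated with the simplest two-microphone case: under a far-field assumption, the arrival-time difference between the two microphones implies a path-length difference, which gives the bearing of the source. This is a generic sketch of the technique, not the device's actual algorithm; the names and the far-field simplification are assumptions.

```python
# Sketch of bearing estimation from a time-difference-of-arrival (TDOA)
# between two microphones a known distance apart (far-field assumption).
import math

SPEED_OF_SOUND = 343.0  # meters per second, roughly at room temperature


def arrival_angle(time_delta, mic_spacing):
    """Angle (radians) of the source from the array broadside.

    time_delta:  arrival-time difference in seconds between the mics
    mic_spacing: distance in meters between the two microphones
    """
    path_delta = SPEED_OF_SOUND * time_delta        # extra distance traveled
    ratio = max(-1.0, min(1.0, path_delta / mic_spacing))
    return math.asin(ratio)
```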
The network interfaces 1119 may include one or more of a WLAN interface, PAN interface, secondary radio frequency (RF) link interface, or other interface. The WLAN interface may be compliant with at least a portion of the Wi-Fi specification. For example, the WLAN interface may be compliant with at least a portion of the IEEE 802.11 specification as promulgated by the Institute of Electrical and Electronics Engineers (IEEE). The PAN interface may be compliant with at least a portion of one or more of the Bluetooth, wireless USB, Z-Wave, ZigBee, or other standards. For example, the PAN interface may be compliant with the Bluetooth Low Energy (BLE) specification.
The secondary RF link interface may comprise a radio transmitter and receiver that operate at frequencies different from or using modulation different from the other interfaces. For example, the WLAN interface may utilize frequencies in the 2.4 GHz and 5 GHz Industrial Scientific and Medicine (ISM) bands, while the PAN interface may utilize the 2.4 GHz ISM bands. The secondary RF link interface may comprise a radio transmitter that operates in the 900 MHz ISM band, within a licensed band at another frequency, and so forth. The secondary RF link interface may be utilized to provide backup communication between the autonomously motile device 110 and other devices in the event that communication fails using one or more of the WLAN interface or the PAN interface. For example, in the event the autonomously motile device 110 travels to an area within the environment 102 that does not have Wi-Fi coverage, the autonomously motile device 110 may use the secondary RF link interface to communicate with another device such as a specialized access point, docking station, or other autonomously motile device 110.
The other network interfaces may include other equipment to send or receive data using other wavelengths or phenomena. For example, the other network interface may include an ultrasonic transceiver used to send data as ultrasonic sounds, a visible light system that communicates by modulating a visible light source such as a light-emitting diode, and so forth. In another example, the other network interface may comprise a wireless wide area network (WWAN) interface or a wireless cellular data network interface. Continuing the example, the other network interface may be compliant with at least a portion of the 3G, 4G, Long Term Evolution (LTE), 5G, or other standards. The I/O device interface (1102/1202) may also include and/or communicate with communication components (such as network interface(s) 1119) that allow data to be exchanged between devices such as different physical servers in a collection of servers or other components.
The components of the device(s) 110 and/or the system(s) 120 may include their own dedicated processors, memory, and/or storage. Alternatively, one or more of the components of the device(s) 110 and/or the system(s) 120 may utilize the I/O interfaces (1102/1202), processor(s) (1104/1204), memory (1106/1206), bus (1124/1224), and/or storage (1108/1208) of the device(s) 110 and/or the system(s) 120, respectively.
FIG. 11B illustrates components that may be stored in a memory of an autonomously motile device according to embodiments of the present disclosure. Although illustrated as included in memory 1106, the components (or portions thereof) may also be included in hardware and/or firmware. FIG. 11C illustrates data that may be stored in a storage of an autonomously motile device according to embodiments of the present disclosure. Although illustrated as stored in storage 1108, the data may be stored in memory 1106 or in another component. FIG. 11D illustrates sensors that may be included as part of an autonomously motile device according to embodiments of the present disclosure.
A position determination component 1132 determines position data 1144 indicative of a position 284 of the feature 286 in the environment 102. In one implementation, the position 284 may be expressed as a set of coordinates with respect to the first camera 212a. The position determination component 1132 may use a direct linear transformation triangulation process to determine the position 284 of a feature 286 in the environment 102 based on the difference in apparent location of that feature 286 in two images acquired by two cameras 274/276/212 separated by a known distance.
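The idea behind triangulating from two cameras a known distance apart can be illustrated with the simplest rectified-stereo case: the difference in a feature's apparent horizontal location (the disparity) determines its depth. This is a simplified stand-in for the direct linear transformation process named above, not that process itself; the function name and parameters are assumptions.

```python
# Sketch of depth from rectified stereo: with focal length f (pixels),
# baseline B (meters), and disparity d (pixels) between the feature's
# locations in the two images, depth z = f * B / d.
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth in meters of a feature seen by two horizontally offset cameras."""
    if disparity_px <= 0:
        raise ValueError("feature must have positive disparity")
    return focal_px * baseline_m / disparity_px
```

Note that depth falls off as disparity shrinks, which is why distant features are localized less precisely than nearby ones.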
A movement determination module 1133 determines if the feature 286 is stationary or non-stationary. First position data 1144a indicative of a first position 284a of a feature 286 depicted in the first pair of images 282a acquired at time t1 is determined by the position determination component 1132. Second position data 1144b of the same feature 286, indicative of a second position 284b of the same feature 286 as depicted in the second pair of images 282b acquired at time t2, is determined as well. Similar determinations made for data relative to the first position 284a and second position 284b may also be made for a third position 284c, and so forth.
The movement determination module 1133 may use inertial data from the IMU 1180 or other sensors that provide information about how the autonomously motile device 110 moved between time t1 and time t2. The inertial data and the first position data 1144a are used to provide a predicted position of the feature 286 at the second time. The predicted position is compared to the second position data 1144b to determine if the feature is stationary or non-stationary. If the predicted position is less than a threshold value from the second position 284b in the second position data 1144b, then the feature 286 is deemed to be stationary.
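The stationary test above can be sketched in two dimensions as follows: predict where a stationary feature should appear after the device's own motion, then check whether the observed second position lies within a threshold of that prediction. All names, the 2-D simplification, and the default threshold are assumptions for illustration.

```python
# Sketch of the stationary/non-stationary decision: a feature whose
# observed second position matches the position predicted from the
# device's ego-motion (inertial data) is deemed stationary.
def is_stationary(first_pos, observed_second_pos, ego_motion, threshold=0.05):
    """Compare predicted vs. observed feature position (device-relative).

    If the device moved by ego_motion, a stationary feature's
    device-relative position shifts by -ego_motion.
    """
    predicted = (first_pos[0] - ego_motion[0], first_pos[1] - ego_motion[1])
    dx = predicted[0] - observed_second_pos[0]
    dy = predicted[1] - observed_second_pos[1]
    return (dx * dx + dy * dy) ** 0.5 < threshold
```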
Features 286 that have been deemed to be stationary may be included in the second feature data. The second feature data may thus exclude non-stationary features 286 and comprise a subset of the first feature data 1148 which comprises stationary features 286.
The second feature data may be used by a simultaneous localization and mapping (SLAM) component 1134. The SLAM component 1134 may use the second feature data to determine pose data 1145 that is indicative of a location of the autonomously motile device 110 at a given time based on the appearance of features 286 in pairs of images 282. The SLAM component 1134 may also provide trajectory data indicative of the trajectory 280 that is based on a time series of pose data 1145 from the SLAM component 1134.
Other information, such as depth data from a depth sensor, the position data 1144 associated with the features 286 in the second feature data, and so forth, may be used to determine the presence of obstacles 283 in the environment 102 as represented by an occupancy map, which is represented by occupancy map data 1149.
The occupancy map data 1149 may comprise data that indicates the location of one or more obstacles 283, such as a table, wall, stairwell, and so forth. In some implementations, the occupancy map data 1149 may comprise a plurality of cells with each cell of the plurality of cells representing a particular area in the environment 102. Data, such as occupancy values, may be stored that indicates whether an area of the environment 102 associated with the cell is unobserved, occupied by an obstacle 283, or is unoccupied. An obstacle 283 may comprise an object or feature that prevents or impairs traversal by the autonomously motile device 110. For example, an obstacle 283 may comprise a wall, stairwell, and so forth.
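A grid of cells holding one of the three occupancy values described above can be sketched as follows. The class name, the value encoding, and the traversability rule are assumptions for illustration; only the unobserved/occupied/unoccupied distinction comes from the text.

```python
# Sketch of an occupancy map: each cell is unobserved, unoccupied, or
# occupied by an obstacle, per the description above.
UNOBSERVED, UNOCCUPIED, OCCUPIED = -1, 0, 1


class OccupancyMap:
    def __init__(self, width_cells, height_cells):
        # all cells start unobserved until sensor data says otherwise
        self.cells = [[UNOBSERVED] * width_cells for _ in range(height_cells)]

    def mark(self, x, y, value):
        self.cells[y][x] = value

    def is_traversable(self, x, y):
        """Only cells observed to be free are safe to traverse."""
        return self.cells[y][x] == UNOCCUPIED
```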
The occupancy map data 1149 may be manually or automatically determined. For example, during a learning phase, the user may take the autonomously motile device 110 on a tour of the environment 102, allowing the mapping component 1130 of the autonomously motile device 110 to determine the occupancy map data 1149. The user may provide input data such as tags designating a particular obstacle type, such as "furniture" or "fragile". In another example, during subsequent operation, the autonomously motile device 110 may generate the occupancy map data 1149 that is indicative of locations and types of obstacles such as chairs, doors, stairwells, and so forth as it moves unattended through the environment 102.
Modules described herein, such as the mapping component 1130, may provide various processing functions such as de-noising, filtering, and so forth. Processing of sensor data 1147, such as image data from a camera 274/276/212, may be performed by a module implementing, at least in part, one or more of the following tools or techniques. In one implementation, processing of image data may be performed, at least in part, using one or more tools available in the OpenCV library as developed by Intel Corporation of Santa Clara, Calif., USA; Willow Garage of Menlo Park, Calif., USA; and Itseez of Nizhny Novgorod, Russia, with information available at www.opencv.org. In another implementation, functions available in the OKAO machine vision library as promulgated by Omron Corporation of Kyoto, Japan, may be used to process the sensor data 1147. In still another implementation, functions such as those in the Machine Vision Toolbox (MVTB) available using MATLAB as developed by MathWorks, Inc. of Natick, Mass., USA, may be utilized.
Techniques such as artificial neural networks (ANNs), convolutional neural networks (CNNs), active appearance models (AAMs), active shape models (ASMs), principal component analysis (PCA), cascade classifiers, and so forth, may also be used to process the sensor data 1147 or other data. For example, the ANN may be trained using a supervised learning algorithm such that object identifiers are associated with images of particular objects within training images provided to the ANN. Once trained, the ANN may be provided with the sensor data 1147 and produce output indicative of the object identifier.
A navigation map component 1135 uses the occupancy map data 1149 as input to generate a navigation map as represented by navigation map data 1150. For example, the navigation map component 1135 may produce the navigation map data 1150 by inflating or enlarging the apparent size of obstacles 283 as indicated by the occupancy map data 1149.
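The obstacle inflation described above can be sketched as a dilation over a 0/1 grid: every cell within a given radius of an occupied cell is also treated as occupied, giving the device clearance when planning. The function name, the square (Chebyshev) neighborhood, and the radius are assumptions for illustration.

```python
# Sketch of producing a navigation map from an occupancy grid by
# inflating obstacles: grow each occupied (1) cell by `radius` cells.
def inflate(grid, radius=1):
    """Return a copy of a 2-D 0/1 grid with obstacle cells dilated."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(h):
        for x in range(w):
            if grid[y][x] == 1:
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out
```

Planning against the inflated grid keeps the device's footprint clear of the true obstacle boundaries.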
An autonomous navigation component 1136 provides the autonomously motile device 110 with the ability to navigate within the environment 102 without real-time human interaction. The autonomous navigation component 1136 may implement, or operate in conjunction with, the mapping component 1130 to determine one or more of the occupancy map data 1149, the navigation map data 1150, or other representations of the environment 102.
The autonomous navigation component 1136 of the autonomously motile device 110 may generate path plan data 1152 that is indicative of a path through the environment 102 from the current location to a destination location. The autonomously motile device 110 may then begin moving along the path.
While moving along the path, the autonomously motile device 110 may assess the environment 102 and update or change the path as appropriate. For example, if an obstacle 283 appears in the path, the mapping component 1130 may determine the presence of the obstacle 283 as represented in the occupancy map data 1149 and navigation map data 1150. The now-updated navigation map data 1150 may then be used to plan an alternative path to the destination location.
The autonomously motile device 110 may utilize one or more task components 1141. A task component 1141 comprises instructions that, when executed, provide one or more functions. The task components 1141 may perform functions such as finding a user, following a user, presenting output on output devices of the autonomously motile device 110, performing sentry tasks by moving the autonomously motile device 110 through the environment 102 to determine the presence of unauthorized people, and so forth.
The autonomously motile device 110 includes one or more output devices, such as one or more of a motor, light, speaker, display, projector, printer, and so forth. One or more output devices may be used to provide output during operation of the autonomously motile device 110.
The autonomously motile device 110 may use the network interfaces 1119 to connect to a network 199. For example, the network 199 may comprise a wireless local area network that in turn is connected to a wide area network such as the Internet.
The autonomously motile device 110 may be configured to dock or connect to a docking station. The docking station may also be connected to the network 199. For example, the docking station may be configured to connect to the wireless local area network 199 such that the docking station and the autonomously motile device 110 may communicate. The docking station may provide external power which the autonomously motile device 110 may use to charge a battery of the autonomously motile device 110.
The autonomously motile device 110 may access one or more servers 120 via the network 199. For example, the autonomously motile device 110 may utilize a wakeword detection component to determine if the user is addressing a request to the autonomously motile device 110. The wakeword detection component may hear a specified word or phrase and transition the autonomously motile device 110 or portion thereof to the wake operating mode. Once in the wake operating mode, the autonomously motile device 110 may then transfer at least a portion of the audio spoken by the user to one or more servers 120 for further processing. The servers 120 may process the spoken audio and return to the autonomously motile device 110 data that may be subsequently used to operate the autonomously motile device 110.
The autonomously motile device 110 may also communicate with other devices. The other devices may include one or more devices that are within the physical space, such as a home, or associated with operation of one or more devices in the physical space. For example, the other devices may include a doorbell camera, a garage door opener, a refrigerator, a washing machine, and so forth. In some implementations the other devices may include other AMDs 110, vehicles, and so forth.
In other implementations, other types of autonomously motile devices (AMDs) may use the systems and techniques described herein. For example, the autonomously motile device 110 may comprise an autonomous ground vehicle that is moving on a street, an autonomous aerial vehicle in the air, an autonomous marine vehicle, and so forth.
The autonomously motile device 110 may include one or more batteries (not shown) to provide electrical power suitable for operating the components in the autonomously motile device 110. In some implementations other devices may be used to provide electrical power to the autonomously motile device 110. For example, power may be provided by wireless power transfer, capacitors, fuel cells, storage flywheels, and so forth.
One or more clocks may provide information indicative of date, time, ticks, and so forth. For example, the processor 1104 may use data from the clock to associate a particular time with an action, sensor data 1147, and so forth.
The autonomously motile device 110 may include one or more hardware processors 1104 (processors) configured to execute one or more stored instructions. The processors 1104 may comprise one or more cores. The processors 1104 may include microcontrollers, systems on a chip, field programmable gate arrays, digital signal processors, graphic processing units, general processing units, and so forth.
The autonomously motile device 110 may include one or more communication components 1140 such as input/output (I/O) interfaces 1102, network interfaces 1119, and so forth. The communication components 1140 enable the autonomously motile device 110, or components thereof, to communicate with other devices or components. The communication components 1140 may include one or more I/O interfaces 1102. The I/O interfaces 1102 may comprise Inter-Integrated Circuit (I2C), Serial Peripheral Interface bus (SPI), Universal Serial Bus (USB) as promulgated by the USB Implementers Forum, RS-232, and so forth.
The I/O interface(s)1102 may couple to one or more I/O devices. The I/O devices may include input devices such as one or more of asensor1154, keyboard, mouse, scanner, and so forth. The I/O devices may also include output devices such as one or more of a motor, light,speaker1121,display214, projector, printer, and so forth. In some embodiments, the I/O devices may be physically incorporated with the autonomouslymotile device110 or may be externally placed.
The I/O interface(s) 1102 may be configured to provide communications between the autonomously motile device 110 and other devices such as other AMDs 110, docking stations, routers, access points, and so forth, for example through antenna 1110 and/or other components. The I/O interface(s) 1102 may include devices configured to couple to personal area networks (PANs), local area networks (LANs), wireless local area networks (WLANs), wide area networks (WANs), and so forth. For example, the network interfaces 1119 may include devices compatible with Ethernet, Wi-Fi, Bluetooth, Bluetooth Low Energy, ZigBee, and so forth. The autonomously motile device 110 may also include one or more busses 1124 or other internal communications hardware or software that allow for the transfer of data between the various modules and components of the autonomously motile device 110.
As shown inFIG.11A, the autonomouslymotile device110 includes one ormore memories1106. Thememory1106 may comprise one or more non-transitory computer-readable storage media (CRSM). The CRSM may be any one or more of an electronic storage medium, a magnetic storage medium, an optical storage medium, a quantum storage medium, a mechanical computer storage medium, and so forth. Thememory1106 provides storage of computer-readable instructions, data structures, program modules, and other data for the operation of the autonomouslymotile device110. A few example functional modules are shown stored in thememory1106, although the same functionality may alternatively be implemented in hardware, firmware, or as a system on a chip (SoC).
The memory 1106 may include at least one operating system (OS) component 1139. The OS component 1139 is configured to manage hardware resource devices such as the I/O interfaces 1102, the I/O devices, and the communication components 1140, and to provide various services to applications or modules executing on the processors 1104. The OS component 1139 may implement a variant of the FreeBSD operating system as promulgated by the FreeBSD Project; other UNIX or UNIX-like variants; a variation of the Linux operating system as promulgated by Linus Torvalds; the Windows operating system from Microsoft Corporation of Redmond, Wash., USA; the Robot Operating System (ROS) as promulgated at www.ros.org; and so forth.
Also stored in the memory 1106, or elsewhere, may be a data store 1108 and one or more of the following modules. These modules may be executed as foreground applications, background tasks, daemons, and so forth. The data store 1108 may use a flat file, database, linked list, tree, executable code, script, or other data structure to store information. In some implementations, the data store 1108 or a portion of the data store 1108 may be distributed across one or more other devices including other AMDs 110, servers 120, network attached storage devices, and so forth.
Acommunication component1140 may be configured to establish communication with other devices, such asother AMDs110, anexternal server120, a docking station, and so forth. The communications may be authenticated, encrypted, and so forth.
Other modules within thememory1106 may include asafety component1129, themapping component1130, thenavigation map component1135, theautonomous navigation component1136, the one ormore components1141, aspeech processing component1137, or other components. The components may access data stored within thedata store1108, includingsafety tolerance data1146,sensor data1147, inflation parameters, other data, and so forth.
The safety component 1129 may access the safety tolerance data 1146 to determine within what tolerances the autonomously motile device 110 may operate safely within the environment 102. For example, the safety component 1129 may be configured to stop the autonomously motile device 110 from moving when an extensible mast of the autonomously motile device 110 is extended. In another example, the safety tolerance data 1146 may specify a maximum sound threshold which, when exceeded, stops all movement of the autonomously motile device 110. Continuing this example, detection of a sound such as a human yell would stop the autonomously motile device 110. In another example, the safety component 1129 may access safety tolerance data 1146 that specifies a minimum distance from an object that the autonomously motile device 110 is to maintain. Continuing this example, when a sensor 1154 detects that an object has approached to less than the minimum distance, all movement of the autonomously motile device 110 may be stopped. Movement of the autonomously motile device 110 may be stopped by one or more of inhibiting operation of one or more of the motors, issuing a command to stop motor operation, disconnecting power from one or more of the motors, and so forth. The safety component 1129 may be implemented as hardware, software, or a combination thereof.
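The stop conditions described above may be sketched as follows. This is a hypothetical illustration only; the function and parameter names are assumptions for exposition and are not part of this disclosure:

```python
def movement_inhibited(object_distance_m: float,
                       min_distance_m: float,
                       sound_level_db: float,
                       max_sound_db: float) -> bool:
    """Return True when safety tolerances require stopping all movement.

    Movement is inhibited when an object is closer than the minimum
    distance, or when the ambient sound level exceeds the threshold
    (for example, a human yell).
    """
    return object_distance_m < min_distance_m or sound_level_db > max_sound_db
```

When this predicate is true, the motors may be inhibited, commanded to stop, or disconnected from power as described above.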
Thesafety component1129 may control other factors, such as a maximum speed of the autonomouslymotile device110 based on information obtained by thesensors1154, precision and accuracy of thesensor data1147, and so forth. For example, detection of an object by an optical sensor may include some error, such as when the distance to an object comprises a weighted average between an object and a background. As a result, the maximum speed permitted by thesafety component1129 may be based on one or more factors such as the weight of the autonomouslymotile device110, nature of the floor, distance to the object, and so forth. In the event that the maximum permissible speed differs from the maximum speed permitted by thesafety component1129, the lesser speed may be utilized.
Thenavigation map component1135 uses theoccupancy map data1149 as input to generate thenavigation map data1150. Thenavigation map component1135 may produce thenavigation map data1150 to inflate or enlarge theobstacles283 indicated by theoccupancy map data1149. One or more inflation parameters may be used during operation. The inflation parameters provide information such as inflation distance, inflation adjustment values, and so forth. In some implementations the inflation parameters may be based at least in part on the sensor FOV, sensor blind spot, physical dimensions of the autonomouslymotile device110, and so forth.
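Obstacle inflation over a grid-based occupancy map may be sketched as follows. The cell-based representation and the names used here are illustrative assumptions, not the specific implementation of the navigation map component 1135:

```python
def inflate_obstacles(occupied_cells, inflation_radius_cells):
    """Grow each occupied grid cell by a circular inflation radius.

    occupied_cells: set of (x, y) grid coordinates marked as obstacles.
    Returns the enlarged set of cells a path planner should treat as
    blocked, so the device keeps a safety margin from obstacles.
    """
    r = inflation_radius_cells
    inflated = set()
    for (x, y) in occupied_cells:
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                if dx * dx + dy * dy <= r * r:
                    inflated.add((x + dx, y + dy))
    return inflated
```

The inflation radius here plays the role of the inflation distance parameter; it could, for instance, be derived from the physical dimensions of the autonomously motile device 110.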
Thespeech processing component1137 may be used to process utterances of the user. Microphones may acquire audio in the presence of the autonomouslymotile device110 and may sendraw audio data1143 to an acoustic front end (AFE). The AFE may transform the raw audio data1143 (for example, a single-channel, 16-bit audio stream sampled at 16 kHz), captured by the microphone, into audio feature vectors that may ultimately be used for processing by various components, such as awakeword detection module1138, speech recognition engine, or other components. The AFE may reduce noise in theraw audio data1143. The AFE may also perform acoustic echo cancellation (AEC) or other operations to account for output audio data that may be sent to a speaker of the autonomouslymotile device110 for output. For example, the autonomouslymotile device110 may be playing music or other audio that is being received from anetwork199 in the form of output audio data. To prevent the output audio interfering with the device's ability to detect and process input audio, the AFE or other component may perform echo cancellation to remove the output audio data from the inputraw audio data1143, or other operations.
The AFE may divide theraw audio data1143 into frames representing time intervals for which the AFE determines a number of values (i.e., features) representing qualities of theraw audio data1143, along with a set of those values (i.e., a feature vector or audio feature vector) representing features/qualities of theraw audio data1143 within each frame. A frame may be a certain period of time, for example a sliding window of 25 milliseconds of audio data taken every 10 ms, or the like. Many different features may be determined, as known in the art, and each feature represents some quality of the audio that may be useful for automatic speech recognition (ASR) processing, wakeword detection, presence detection, or other operations. A number of approaches may be used by the AFE to process theraw audio data1143, such as mel-frequency cepstral coefficients (MFCCs), log filter-bank energies (LFBEs), perceptual linear predictive (PLP) techniques, neural network feature vector techniques, linear discriminant analysis, semi-tied covariance matrices, or other approaches known to those skilled in the art.
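The sliding-window framing described above (for example, a 25 ms window advanced every 10 ms over 16 kHz audio) may be sketched as follows; the function and parameter names are illustrative assumptions:

```python
def frame_audio(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split raw audio samples into overlapping frames for feature extraction."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000      # 160 samples at 16 kHz
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]
```

Each returned frame would then be mapped to a feature vector by one of the techniques listed above, such as MFCCs or log filter-bank energies.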
The audio feature vectors (or the raw audio data1143) may be input into awakeword detection module1138 that is configured to detect keywords spoken in the audio. Thewakeword detection module1138 may use various techniques to determine whether audio data includes speech. Some embodiments may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in an audio input based on various quantitative aspects of the audio input, such as the spectral slope between one or more frames of the audio input; the energy levels of the audio input in one or more spectral bands; the signal-to-noise ratios of the audio input in one or more spectral bands; or other quantitative aspects. In other embodiments, the autonomouslymotile device110 may implement a limited classifier configured to distinguish speech from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other embodiments, Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) techniques may be applied to compare the audio input to one or more acoustic models in speech storage, which acoustic models may include models corresponding to speech, noise (such as environmental noise or background noise), or silence. Still other techniques may be used to determine whether speech is present in the audio input.
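A minimal energy-based voice activity check, one of the simplest of the quantitative approaches mentioned above, may be sketched as follows; the threshold value and names are illustrative assumptions:

```python
def frame_has_speech(frame, energy_threshold=0.01):
    """Flag a frame as possibly containing speech via mean energy.

    Practical VAD implementations combine several cues (spectral slope,
    per-band signal-to-noise ratios, trained classifiers, and so forth);
    energy alone is shown here for brevity.
    """
    energy = sum(s * s for s in frame) / len(frame)
    return energy > energy_threshold
```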
Once speech is detected in the audio received by the autonomously motile device110 (or separately from speech detection), the autonomouslymotile device110 may use thewakeword detection module1138 to perform wakeword detection to determine when a user intends to speak a command to the autonomouslymotile device110. This process may also be referred to as keyword detection, with the wakeword being a specific example of a keyword. Specifically, keyword detection is typically performed without performing linguistic analysis, textual analysis, or semantic analysis. Instead, incoming audio is analyzed to determine if specific characteristics of the audio match preconfigured acoustic waveforms, audio signatures, or other data to determine if the incoming audio “matches” stored audio data corresponding to a keyword.
Thus, the wakeword detection module 1138 may compare audio data to stored models or data to detect a wakeword. One approach for wakeword detection applies general large vocabulary continuous speech recognition (LVCSR) systems to decode the audio signals, with wakeword searching conducted in the resulting lattices or confusion networks. LVCSR decoding may require relatively high computational resources. Another approach for wakeword spotting builds HMMs for each wakeword and for non-wakeword speech signals, respectively. The non-wakeword speech includes other spoken words, background noise, and so forth. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search for the best path in the decoding graph, and the decoding output is further processed to make the decision on keyword presence. This approach can be extended to include discriminative information by incorporating a hybrid deep neural network (DNN)-Hidden Markov Model (HMM) decoding framework. In another embodiment, the wakeword spotting system may be built on DNN/recursive neural network (RNN) structures directly, without an HMM involved. Such a system may estimate the posteriors of wakewords with context information, either by stacking frames within a context window for the DNN or by using an RNN. Posterior threshold tuning or smoothing is then applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
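The posterior smoothing and thresholding step of the DNN/RNN approach may be sketched as follows; the window size, threshold, and names are illustrative assumptions:

```python
def smooth_posteriors(posteriors, window=5):
    """Average each per-frame wakeword posterior over a trailing window."""
    smoothed = []
    for i in range(len(posteriors)):
        segment = posteriors[max(0, i - window + 1):i + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

def wakeword_detected(posteriors, threshold=0.8, window=5):
    """Declare a detection when any smoothed posterior crosses the threshold."""
    return any(p >= threshold for p in smooth_posteriors(posteriors, window))
```

Smoothing suppresses single-frame spikes so that only a sustained run of high posteriors triggers a detection.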
Once the wakeword is detected, circuitry or applications of the local autonomouslymotile device110 may “wake” and begin transmitting audio data (which may include one or more of theraw audio data1143 or the audio feature vectors) to one or more server(s)120 for speech processing. The audio data corresponding to audio obtained by the microphone may be processed locally on one or more of theprocessors1104, sent to aserver120 for routing to a recipient device or may be sent to theserver120 for speech processing for interpretation of the included speech (either for purposes of enabling voice-communications and/or for purposes of executing a command in the speech). The audio data may include data corresponding to the wakeword, or the portion of the audio data corresponding to the wakeword may be removed by the autonomouslymotile device110 before processing by thenavigation map component1135, prior to sending to theserver120, and so forth.
Thespeech processing component1137 may include or access an automated speech recognition (ASR) module. The ASR module may accept as inputraw audio data1143, audio feature vectors, orother sensor data1147 and so forth and may produce as output the input data comprising a text string or other data representation. The input data comprising the text string or other data representation may be processed by thenavigation map component1135 to determine the command to be executed. For example, the utterance of the command “robot, come here” may result in input data comprising the text string “come here”. The wakeword “robot” may be omitted from the input data.
Theautonomous navigation component1136 provides the autonomouslymotile device110 with the ability to navigate within theenvironment102 without real-time human interaction. Theautonomous navigation component1136 may implement, or operate in conjunction with, themapping component1130 to determine theoccupancy map data1149, thenavigation map data1150, or other representation of theenvironment102. In one implementation, themapping component1130 may use one or more simultaneous localization and mapping (“SLAM”) techniques. The SLAM algorithms may utilize one or more of maps, algorithms, beacons, or other techniques to navigate. Theautonomous navigation component1136 may use thenavigation map data1150 to determine a set of possible paths along which the autonomouslymotile device110 may move. One of these may be selected and used to determinepath plan data1152 indicative of a path. For example, a possible path that is the shortest or has the fewest turns may be selected and used to determine the path. The path is then subsequently used to determine a set of commands that drive the motors connected to the wheels. For example, theautonomous navigation component1136 may determine the current location within theenvironment102 and determinepath plan data1152 that describes the path to a destination location such as the docking station.
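Selecting the shortest of several candidate paths, as mentioned above, may be sketched as follows; representing a path as a list of (x, y) waypoints is an illustrative assumption:

```python
import math

def path_length(path):
    """Total Euclidean length of a path given as a list of (x, y) waypoints."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def select_shortest_path(candidate_paths):
    """Choose the candidate path with the smallest total length."""
    return min(candidate_paths, key=path_length)
```

The selected path would then be translated into path plan data, and ultimately into motor commands, as described above.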
Theautonomous navigation component1136 may utilize various techniques during processing ofsensor data1147. For example,image data1142 obtained fromcameras274/276/212 on the autonomouslymotile device110 may be processed to determine one or more of corners, edges, planes, and so forth. In some implementations, corners may be detected and the coordinates of those corners may be used to produce point cloud data. This point cloud data may then be used for SLAM or other purposes associated with mapping, navigation, and so forth.
The autonomously motile device 110 may move responsive to a determination made by an onboard processor 1104, in response to a command received from one or more communication interfaces, as determined from the sensor data 1147, and so forth. For example, an external server 120 may send a command that is received using the network interface 1119. This command may direct the autonomously motile device 110 to proceed to find a particular user, follow a particular user, and so forth. The autonomously motile device 110 may then process this command and use the autonomous navigation component 1136 to determine the directions and distances associated with carrying out the command. For example, the command to “come here” may result in a task component 1141 sending a command to the autonomous navigation component 1136 to move the autonomously motile device 110 to a particular location near the user and orient the autonomously motile device 110 in a particular direction. Similar commands that do not include an explicit location and/or direction may also cause the autonomously motile device 110 to move. For example, the command “go away” may result in the autonomously motile device 110 moving away from a room or area.
The autonomouslymotile device110 may connect to thenetwork199 using one or more of the network interfaces1119. In some implementations, one or more of the modules or other functions described here may execute on theprocessors1104 of the autonomouslymotile device110, on theserver120, or a combination thereof. For example, one ormore servers120 may provide various functions, such as ASR, natural language understanding (NLU), providing content such as audio or video to the autonomouslymotile device110, and so forth.
The other components may provide other functionality, such as object recognition, speech synthesis, user identification, and so forth. The other components may comprise a speech synthesis module that is able to convert text data to human speech. For example, the speech synthesis module may be used by the autonomouslymotile device110 to provide speech that a user is able to understand.
Thedata store1108 may store the other data as well. For example, localization settings may indicate local preferences such as language, user identifier data may be stored that allows for identification of a particular user, and so forth.
As shown inFIG.11D, the autonomouslymotile device110 may include one or more of the followingsensors1154. Thesensors1154 depicted here are provided by way of illustration and not necessarily as a limitation. It is understood thatother sensors1154 may be included or utilized by the autonomouslymotile device110, while somesensors1154 may be omitted in some configurations.
Amotor encoder1155 provides information indicative of the rotation or linear extension of a motor. The motor may comprise a rotary motor, or a linear actuator. In some implementations, themotor encoder1155 may comprise a separate assembly such as a photodiode and encoder wheel that is affixed to the motor. In other implementations, themotor encoder1155 may comprise circuitry configured to drive the motor. For example, theautonomous navigation component1136 may utilize the data from themotor encoder1155 to estimate a distance traveled.
A suspension weight sensor1156 provides information indicative of the weight of the autonomouslymotile device110 on the suspension system for one or more of the wheels or the caster. For example, the suspension weight sensor1156 may comprise a switch, strain gauge, load cell, photodetector, or other sensing element that is used to determine whether weight is applied to a particular wheel, or whether weight has been removed from the wheel. In some implementations, the suspension weight sensor1156 may provide binary data such as a “1” value indicating that there is a weight applied to the wheel, while a “0” value indicates that there is no weight applied to the wheel. In other implementations, the suspension weight sensor1156 may provide an indication such as so many kilograms of force or newtons of force. The suspension weight sensor1156 may be affixed to one or more of the wheels or the caster. In some situations, thesafety component1129 may use data from the suspension weight sensor1156 to determine whether or not to inhibit operation of one or more of the motors. For example, if the suspension weight sensor1156 indicates no weight on the suspension, the implication is that the autonomouslymotile device110 is no longer resting on its wheels, and thus operation of the motors may be inhibited. In another example, if the suspension weight sensor1156 indicates weight that exceeds a threshold value, the implication is that something heavy is resting on the autonomouslymotile device110 and thus operation of the motors may be inhibited.
One ormore bumper switches1157 provide an indication of physical contact between a bumper or other member that is in mechanical contact with thebumper switch1157. Thesafety component1129 utilizessensor data1147 obtained by the bumper switches1157 to modify the operation of the autonomouslymotile device110. For example, if thebumper switch1157 associated with a front of the autonomouslymotile device110 is triggered, thesafety component1129 may drive the autonomouslymotile device110 backwards.
A floor optical motion sensor (FOMS) 1158 provides information indicative of motion of the autonomously motile device 110 relative to the floor or other surface underneath the autonomously motile device 110. In one implementation, the FOMS 1158 may comprise a light source such as a light-emitting diode (LED), an array of photodiodes, and so forth. In some implementations, the FOMS 1158 may utilize an optoelectronic sensor, such as a low-resolution two-dimensional array of photodiodes. Several techniques may be used to determine changes in the data obtained by the photodiodes and translate this into data indicative of a direction of movement, velocity, acceleration, and so forth. In some implementations, the FOMS 1158 may provide other information, such as data indicative of a pattern present on the floor, composition of the floor, color of the floor, and so forth. For example, the FOMS 1158 may utilize an optoelectronic sensor that may detect different colors or shades of gray, and this data may be used to generate floor characterization data. The floor characterization data may be used for navigation.
Anultrasonic sensor1159 utilizes sounds in excess of 20 kHz to determine a distance from thesensor1154 to an object. Theultrasonic sensor1159 may comprise an emitter such as a piezoelectric transducer and a detector such as an ultrasonic microphone. The emitter may generate specifically timed pulses of ultrasonic sound while the detector listens for an echo of that sound being reflected from an object within the field of view. Theultrasonic sensor1159 may provide information indicative of a presence of an object, distance to the object, and so forth. Two or moreultrasonic sensors1159 may be utilized in conjunction with one another to determine a location within a two-dimensional plane of the object.
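The echo-ranging computation may be sketched as follows; the speed-of-sound constant is a nominal value for dry air at roughly 20 °C, and the names are illustrative assumptions:

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # nominal, dry air at ~20 degrees C

def ultrasonic_distance_m(round_trip_s: float) -> float:
    """Distance to an object from the round-trip time of an ultrasonic pulse.

    The pulse travels to the object and back, so the one-way distance is
    half the total path length.
    """
    return SPEED_OF_SOUND_M_PER_S * round_trip_s / 2.0
```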
In some implementations, the ultrasonic sensor 1159 or a portion thereof may be used to provide other functionality. For example, the emitter of the ultrasonic sensor 1159 may be used to transmit data and the detector may be used to receive data transmitted as ultrasonic sound. In another example, the emitter of an ultrasonic sensor 1159 may be set to a particular frequency and used to generate a particular waveform such as a sawtooth pattern to provide a signal that is audible to an animal, such as a dog or a cat.
Anoptical sensor1160 may providesensor data1147 indicative of one or more of a presence or absence of an object, a distance to the object, or characteristics of the object. Theoptical sensor1160 may use time-of-flight (ToF), structured light, interferometry, or other techniques to generate the distance data. For example, ToF determines a propagation time (or “round-trip” time) of a pulse of emitted light from an optical emitter or illuminator that is reflected or otherwise returned to an optical detector. By dividing the propagation time in half and multiplying the result by the speed of light in air, the distance to an object may be determined. Theoptical sensor1160 may utilize one or more sensing elements. For example, theoptical sensor1160 may comprise a 4×4 array of light sensing elements. Each individual sensing element may be associated with a field of view (FOV) that is directed in a different way. For example, theoptical sensor1160 may have four light sensing elements, each associated with a different 10° FOV, allowing the sensor to have an overall FOV of 40°.
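The time-of-flight computation described above (halve the round-trip time and multiply by the speed of light in air) may be sketched as follows; the constant is an approximate value, and the names are illustrative assumptions:

```python
SPEED_OF_LIGHT_M_PER_S = 2.997e8  # approximate speed of light in air

def tof_distance_m(round_trip_s: float) -> float:
    """Distance from the propagation time of a reflected light pulse."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0
```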
In another implementation, a structured light pattern may be provided by the optical emitter. A portion of the structured light pattern may then be detected on the object using asensor1154 such as an image sensor orcamera274/276/212. Based on an apparent distance between the features of the structured light pattern, the distance to the object may be calculated. Other techniques may also be used to determine distance to the object. In another example, the color of the reflected light may be used to characterize the object, such as whether the object is skin, clothing, flooring, upholstery, and so forth. In some implementations, theoptical sensor1160 may operate as a depth camera, providing a two-dimensional image of a scene, as well as data that indicates a distance to each pixel.
Data from theoptical sensors1160 may be utilized for collision avoidance. For example, thesafety component1129 and theautonomous navigation component1136 may utilize thesensor data1147 indicative of the distance to an object in order to prevent a collision with that object.
Multiple optical sensors 1160 may be operated such that their FOVs overlap at least partially. To minimize or eliminate interference, the optical sensors 1160 may selectively control one or more of the timing, modulation, or frequency of the light emitted. For example, a first optical sensor 1160 may emit light modulated at 30 kHz while a second optical sensor 1160 emits light modulated at 33 kHz.
Alidar1161 sensor provides information indicative of a distance to an object or portion thereof by utilizing laser light. The laser is scanned across a scene at various points, emitting pulses which may be reflected by objects within the scene. Based on the time-of-flight distance to that particular point,sensor data1147 may be generated that is indicative of the presence of objects and the relative positions, shapes, and so forth that are visible to thelidar1161. Data from thelidar1161 may be used by various modules. For example, theautonomous navigation component1136 may utilize point cloud data generated by thelidar1161 for localization of the autonomouslymotile device110 within theenvironment102.
The autonomouslymotile device110 may include a mast. Amast position sensor1162 provides information indicative of a position of the mast of the autonomouslymotile device110. For example, themast position sensor1162 may comprise limit switches associated with the mast extension mechanism that indicate whether the mast is at an extended or retracted position. In other implementations, themast position sensor1162 may comprise an optical code on at least a portion of the mast that is then interrogated by an optical emitter and a photodetector to determine the distance to which the mast is extended. In another implementation, themast position sensor1162 may comprise an encoder wheel that is attached to a mast motor that is used to raise or lower the mast. Themast position sensor1162 may provide data to thesafety component1129. For example, if the autonomouslymotile device110 is preparing to move, data from themast position sensor1162 may be checked to determine if the mast is retracted, and if not, the mast may be retracted prior to beginning movement.
Amast strain sensor1163 provides information indicative of a strain on the mast with respect to the remainder of the autonomouslymotile device110. For example, themast strain sensor1163 may comprise a strain gauge or load cell that measures a side-load applied to the mast or a weight on the mast or downward pressure on the mast. Thesafety component1129 may utilizesensor data1147 obtained by themast strain sensor1163. For example, if the strain applied to the mast exceeds a threshold amount, thesafety component1129 may direct an audible and visible alarm to be presented by the autonomouslymotile device110.
The autonomouslymotile device110 may include a modular payload bay. Apayload weight sensor1165 provides information indicative of the weight associated with the modular payload bay. Thepayload weight sensor1165 may comprise one or more sensing mechanisms to determine the weight of a load. These sensing mechanisms may include piezoresistive devices, piezoelectric devices, capacitive devices, electromagnetic devices, optical devices, potentiometric devices, microelectromechanical devices, and so forth. The sensing mechanisms may operate as transducers that generate one or more signals based on an applied force, such as that of the load due to gravity. For example, thepayload weight sensor1165 may comprise a load cell having a strain gauge and a structural member that deforms slightly when weight is applied. By measuring a change in the electrical characteristic of the strain gauge, such as capacitance or resistance, the weight may be determined. In another example, thepayload weight sensor1165 may comprise a force sensing resistor (FSR). The FSR may comprise a resilient material that changes one or more electrical characteristics when compressed. For example, the electrical resistance of a particular portion of the FSR may decrease as the particular portion is compressed. In some implementations, thesafety component1129 may utilize thepayload weight sensor1165 to determine if the modular payload bay has been overloaded. If so, an alert or notification may be issued.
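A linear conversion from sensing-element resistance to payload weight, together with the overload check described above, may be sketched as follows; all constants and names are illustrative assumptions, not calibration data for any particular sensor:

```python
def payload_weight_kg(resistance_ohm: float,
                      unloaded_ohm: float = 10_000.0,
                      ohm_per_kg: float = -50.0) -> float:
    """Convert a measured sensing-element resistance to a payload weight.

    Assumes a linear calibration in which resistance decreases as the
    element is compressed, as with a force sensing resistor.
    """
    return (resistance_ohm - unloaded_ohm) / ohm_per_kg

def payload_overloaded(weight_kg: float, max_payload_kg: float) -> bool:
    """True when the measured payload weight exceeds the rated maximum."""
    return weight_kg > max_payload_kg
```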
One or more device temperature sensors 1166 may be utilized by the autonomously motile device 110. The device temperature sensors 1166 provide temperature data for one or more components within the autonomously motile device 110. For example, a device temperature sensor 1166 may indicate a temperature of one or more of the batteries, one or more motors, and so forth. In the event the temperature exceeds a threshold value, the component associated with that device temperature sensor 1166 may be shut down.
One ormore interlock sensors1167 may provide data to thesafety component1129 or other circuitry that prevents the autonomouslymotile device110 from operating in an unsafe condition. For example, theinterlock sensors1167 may comprise switches that indicate whether an access panel is open. Theinterlock sensors1167 may be configured to inhibit operation of the autonomouslymotile device110 until the interlock switch indicates a safe condition is present.
An inertial measurement unit (IMU)1180 may include a plurality ofgyroscopes1181 andaccelerometers1182 arranged along different axes. Thegyroscope1181 may provide information indicative of rotation of an object affixed thereto. For example, agyroscope1181 may generatesensor data1147 that is indicative of a change in orientation of the autonomouslymotile device110 or a portion thereof.
The accelerometer 1182 provides information indicative of a direction and magnitude of an imposed acceleration. Data such as rate of change, determination of changes in direction, speed, and so forth may be determined using the accelerometer 1182. The accelerometer 1182 may comprise mechanical, optical, micro-electromechanical, or other devices. For example, the gyroscope 1181 and the accelerometer 1182 may comprise a prepackaged solid-state unit.
A magnetometer 1168 may be used to determine an orientation by measuring ambient magnetic fields, such as the terrestrial magnetic field. For example, the magnetometer 1168 may comprise a Hall effect transistor that provides output compass data indicative of a magnetic heading.
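One common way such compass data can be derived, sketched here under assumed axis conventions, is to convert the two horizontal field components into a heading angle; a real device would additionally apply tilt compensation and hard/soft-iron calibration, which this illustration omits:

```python
import math

# Illustrative magnetic heading from two horizontal field components.
# The axis convention (x toward magnetic north) is an assumption.

def heading_deg(field_x: float, field_y: float) -> float:
    """Heading in degrees in [0, 360) from horizontal field components."""
    return math.degrees(math.atan2(field_y, field_x)) % 360.0

print(heading_deg(1.0, 0.0))  # 0 degrees: field aligned with the x axis
print(heading_deg(0.0, 1.0))  # 90 degrees: field aligned with the y axis
```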
The autonomously motile device 110 may include one or more location sensors 1169. The location sensors 1169 may comprise an optical, radio, or other navigational system such as a global positioning system (GPS) receiver. For indoor operation, the location sensors 1169 may comprise indoor positioning systems, such as Wi-Fi positioning systems (WPS). The location sensors 1169 may provide information indicative of a relative location, such as "living room," or an absolute location, such as particular coordinates indicative of latitude and longitude, or displacement with respect to a predefined origin.
A photodetector 1170 provides sensor data 1147 indicative of impinging light. For example, the photodetector 1170 may provide data indicative of a color, intensity, duration, and so forth.
A camera 274/276/212 generates sensor data 1147 indicative of one or more images. The camera 274/276/212 may be configured to detect light in one or more wavelengths including, but not limited to, terahertz, infrared, visible, ultraviolet, and so forth. For example, an infrared camera 274/276/212 may be sensitive to wavelengths between approximately 700 nanometers and 1 millimeter. The camera 274/276/212 may comprise charge coupled devices (CCD), complementary metal oxide semiconductor (CMOS) devices, microbolometers, and so forth. The autonomously motile device 110 may use image data acquired by the camera 274/276/212 for object recognition, navigation, collision avoidance, user communication, and so forth. For example, a pair of cameras 274/276/212 sensitive to infrared light may be mounted on the front of the autonomously motile device 110 to provide binocular stereo vision, with the sensor data 1147 comprising images being sent to the autonomous navigation component 1136. In another example, the camera 274/276/212 may comprise a 10 megapixel or greater camera that is used for videoconferencing or for acquiring pictures for the user.
The camera 274/276/212 may include a global shutter or a rolling shutter. The shutter may be mechanical or electronic. A mechanical shutter uses a physical device, such as a shutter vane or liquid crystal, to prevent light from reaching a light sensor. In comparison, an electronic shutter comprises a specific technique of how the light sensor is read out, such as progressive rows, interlaced rows, and so forth. With a rolling shutter, not all pixels are exposed at the same time. For example, with an electronic rolling shutter, rows of the light sensor may be read progressively, such that the first row on the sensor is read at a first time while the last row is read at a later time. As a result, a rolling shutter may produce various image artifacts, especially with regard to images in which objects are moving. In contrast, with a global shutter the light sensor is exposed all at a single time, and subsequently read out. In some implementations, the camera(s) 274/276/212, particularly those associated with navigation or autonomous operation, may utilize a global shutter. In other implementations, images for use by the autonomous navigation component 1136 may be acquired using a rolling shutter and subsequently processed to mitigate image artifacts.
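The timing difference between the two shutter types described above can be sketched as follows; the per-row readout interval is an assumed example value:

```python
# Illustrative per-row capture timing: a rolling shutter reads each row at
# a progressively later time, while a global shutter exposes all rows at
# one instant. The readout interval is an assumed value.

ROW_READOUT_S = 0.0001  # assumed 100-microsecond per-row readout interval

def row_capture_time(start_s: float, row: int, rolling: bool) -> float:
    """Timestamp at which a given sensor row is captured."""
    return start_s + row * ROW_READOUT_S if rolling else start_s

# With a rolling shutter, row 479 of a 480-row sensor is sampled roughly
# 48 ms after row 0; with a global shutter both rows share one timestamp.
print(row_capture_time(0.0, 479, rolling=True))
print(row_capture_time(0.0, 479, rolling=False))
```

The growing skew between the first and last rows is exactly what produces the motion artifacts noted above, and why a global shutter is preferred for navigation imagery.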
One or more microphones 1120 may be configured to acquire information indicative of sound present in the environment 102. In some implementations, arrays of microphones 1120 may be used. These arrays may implement beamforming techniques to provide for directionality of gain. The autonomously motile device 110 may use the one or more microphones 1120 to acquire information from acoustic tags, accept voice input from users, determine a direction of an utterance, determine ambient noise levels, provide voice communication with another user or system, and so forth.
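One simple beamforming technique of the kind mentioned above is delay-and-sum: each microphone's signal is shifted to compensate for its arrival delay and the results are averaged, reinforcing sound from the steered direction. This sketch uses integer-sample shifts and toy signals, all of which are illustrative assumptions:

```python
# Minimal delay-and-sum beamformer sketch: advance each microphone signal
# by its steering offset (in samples), then average. Samples shifted past
# the end of a signal are treated as silence.

def delay_and_sum(signals: list[list[float]], advances: list[int]) -> list[float]:
    """Average the signals after advancing each by its steering offset."""
    length = len(signals[0])
    out = []
    for n in range(length):
        acc = 0.0
        for sig, a in zip(signals, advances):
            idx = n + a
            acc += sig[idx] if 0 <= idx < length else 0.0
        out.append(acc / len(signals))
    return out

# Two mics hear the same pulse one sample apart; advancing mic 1 by one
# sample aligns the pulses so they add coherently.
mic0 = [0.0, 1.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 1.0, 0.0]
print(delay_and_sum([mic0, mic1], [0, 1]))  # [0.0, 1.0, 0.0, 0.0]
```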
An air pressure sensor 1172 may provide information indicative of an ambient atmospheric pressure or changes in ambient atmospheric pressure. For example, the air pressure sensor 1172 may provide information indicative of changes in air pressure due to opening and closing of doors, weather events, and so forth.
An air quality sensor 1173 may provide information indicative of one or more attributes of the ambient atmosphere. For example, the air quality sensor 1173 may include one or more chemical sensing elements to detect the presence of carbon monoxide, carbon dioxide, ozone, and so forth. In another example, the air quality sensor 1173 may comprise one or more elements to detect particulate matter in the air, such as a photoelectric detector, an ionization chamber, and so forth. In another example, the air quality sensor 1173 may include a hygrometer that provides information indicative of relative humidity.
An ambient light sensor 1174 may comprise one or more photodetectors or other light-sensitive elements that are used to determine one or more of the color, intensity, or duration of ambient lighting around the autonomously motile device 110.
An ambient temperature sensor 1175 provides information indicative of the temperature of the ambient environment 102 proximate to the autonomously motile device 110. In some implementations, an infrared temperature sensor may be utilized to determine the temperature of another object at a distance.
A floor analysis sensor 1176 may include one or more components that are used to generate at least a portion of floor characterization data. In one implementation, the floor analysis sensor 1176 may comprise circuitry that may be used to determine one or more of the electrical resistance, electrical inductance, or electrical capacitance of the floor. For example, two or more of the wheels in contact with the floor may include an electrically conductive pathway between the circuitry and the floor. By using two or more of these wheels, the circuitry may measure one or more of the electrical properties of the floor. Information obtained by the floor analysis sensor 1176 may be used by one or more of the safety component 1129, the autonomous navigation component 1136, the task component 1141, and so forth. For example, if the floor analysis sensor 1176 determines that the floor is wet, the safety component 1129 may decrease the speed of the autonomously motile device 110 and generate a notification alerting the user.
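The wet-floor response above can be sketched as a threshold on the measured floor resistance; the resistance threshold and slowdown factor are illustrative assumptions:

```python
# Hypothetical wet-floor safety response: a low measured floor resistance
# is taken to indicate moisture, so speed is reduced and a notification is
# generated. The threshold and slowdown factor are assumed values.

WET_RESISTANCE_OHMS = 1e5  # assumed: below this, treat the floor as wet
SLOWDOWN_FACTOR = 0.5      # assumed speed reduction for a wet floor

def adjust_speed_for_floor(current_speed: float, floor_resistance_ohms: float):
    """Return (new_speed, notification_or_None) from the floor measurement."""
    if floor_resistance_ohms < WET_RESISTANCE_OHMS:
        return current_speed * SLOWDOWN_FACTOR, "floor may be wet"
    return current_speed, None

print(adjust_speed_for_floor(1.0, 5e4))  # slowed, with a notification
print(adjust_speed_for_floor(1.0, 1e7))  # unchanged, no notification
```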
The floor analysis sensor 1176 may include other components as well. For example, a coefficient of friction sensor may comprise a probe that comes into contact with the surface and determines the coefficient of friction between the probe and the floor.
A caster rotation sensor 1177 provides data indicative of one or more of a direction of orientation, angular velocity, linear speed of the caster, and so forth. For example, the caster rotation sensor 1177 may comprise an optical encoder and corresponding target that is able to determine that the caster transitioned from an angle of 0° at a first time to 49° at a second time.
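A minimal decode of such an optical encoder, under the assumption of a hypothetical resolution of one count per degree, converts counts to an angle and differences between two timestamps to an angular velocity:

```python
# Illustrative optical-encoder decode: with N counts per revolution, a raw
# count maps to an orientation angle, and the change between two readings
# gives an angular velocity. The resolution is an assumed value.

COUNTS_PER_REV = 360  # assumed encoder resolution: 1 count per degree

def caster_angle_deg(count: int) -> float:
    """Orientation angle in [0, 360) implied by a raw encoder count."""
    return (count % COUNTS_PER_REV) * 360.0 / COUNTS_PER_REV

def angular_velocity_dps(count_t1: int, count_t2: int, dt_s: float) -> float:
    """Average angular velocity between two readings taken dt_s apart."""
    return (caster_angle_deg(count_t2) - caster_angle_deg(count_t1)) / dt_s

# Matches the example above: 0 degrees at a first time, 49 degrees at a
# second time one second later.
print(caster_angle_deg(49))              # 49.0
print(angular_velocity_dps(0, 49, 1.0))  # 49.0
```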
The sensors 1154 may include a radar 1178. The radar 1178 may be used to provide information as to a distance, lateral position, and so forth, to an object.
The sensors 1154 may include a passive infrared (PIR) sensor 1164. The PIR sensor 1164 may be used to detect the presence of users, pets, hotspots, and so forth. For example, the PIR sensor 1164 may be configured to detect infrared radiation with wavelengths between 8 and 14 micrometers.
The autonomously motile device 110 may include other sensors as well. For example, a capacitive proximity sensor may be used to provide proximity data to adjacent objects. Other sensors may include radio frequency identification (RFID) readers, near field communication (NFC) systems, coded aperture cameras, and so forth. For example, NFC tags may be placed at various points within the environment 102 to provide landmarks for the autonomous navigation component 1136. One or more touch sensors may be utilized to determine contact with a user or other objects.
The autonomously motile device 110 may include one or more output devices. A motor (not shown) may be used to provide linear or rotary motion. A light 258 may be used to emit photons. A speaker 1121 may be used to emit sound. A display 214 may comprise one or more of a liquid crystal display, light emitting diode display, electrophoretic display, cholesteric liquid crystal display, interferometric display, and so forth. The display 214 may be used to present visible information such as graphics, pictures, text, and so forth. In some implementations, the display 214 may comprise a touchscreen that combines a touch sensor and a display 214.
In some implementations, the autonomouslymotile device110 may be equipped with a projector. The projector may be able to project an image on a surface, such as the floor, wall, ceiling, and so forth.
A scent dispenser may be used to emit one or more smells. For example, the scent dispenser may comprise a plurality of different scented liquids that may be evaporated or vaporized in a controlled fashion to release predetermined amounts of each.
One or more moveable component actuators may comprise an electrically operated mechanism such as one or more of a motor, solenoid, piezoelectric material, electroactive polymer, shape-memory alloy, and so forth. The actuator controller may be used to provide a signal or other input that operates one or more of the moveable component actuators to produce movement of the moveable component.
In other implementations, other output devices may be utilized. For example, the autonomouslymotile device110 may include a haptic output device that provides output that produces particular touch sensations to the user. Continuing the example, a motor with an eccentric weight may be used to create a buzz or vibration to allow the autonomouslymotile device110 to simulate the purr of a cat.
As noted above, multiple devices may be employed in a single system. In such a multi-device system, each of the devices may include different components for performing different aspects of the system's processing. The multiple devices may include overlapping components. The components of the autonomously motile device 110 and/or the system(s) 120 as described herein are illustrative, and may be located in a stand-alone device or may be included, in whole or in part, as a component of a larger device or system.
As illustrated in FIG. 13 and as discussed herein, the autonomously motile device 110 may communicate, using the network 199, with the system 120 and/or a user device 112. The network(s) 199 may include a local or private network or may include a wide network such as the Internet. The devices may be connected to the network(s) 199 through either wired or wireless connections. Example user devices 112 include a cellular phone 112a, a smart loudspeaker 112b, a microphone 112c, a loudspeaker 112d, a tablet computer 112e, a desktop computer 112f, and a laptop computer 112g, which may be connected to the network(s) 199 through a wireless service provider, over a Wi-Fi or cellular network connection, or the like. Other devices are included as network-connected support devices, such as the system(s) 120, the skill system(s), and/or others.
The concepts disclosed herein may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, speech processing systems, and distributed computing environments.
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers and speech processing should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art, that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk, and/or other media. In addition, components of the system may be implemented in firmware or hardware, such as an acoustic front end, which comprises, among other things, analog and/or digital filters (e.g., filters configured as firmware to a digital signal processor).
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without other input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.
Disjunctive language such as the phrase “at least one of X, Y, Z,” unless specifically stated otherwise, is understood with the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present. As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.