HK1171515B

Movatterモバイル変換

Info

Publication number: HK1171515B
Application number: HK12112270.2A
Authority: HK
Inventors: A．巴-泽埃夫; J．R．刘易斯; G.克莱因
Original assignee: 微软技术许可有限责任公司
Priority date: 2010-12-17
Filing date: 2012-11-28
Publication date: 2015-07-31

Description

Optimized focal region for augmented reality displays

Technical Field

The invention relates to an optimized focal region for augmented reality displays.

Background

Augmented reality is a technology that allows virtual images to be mixed with a real-world physical environment or space. In general, near-eye displays use a combination of optics and stereo vision to focus a virtual image into space. In such displays, display resolution and processing are valuable.

In some cases, the virtual image displayed to the user by the near-eye display device may include a virtual image or object containing highly fine graphics. Users wearing near-eye display devices are often presented with a wealth of information that the user is not necessarily interested in viewing.

Disclosure of Invention

The briefly described techniques include a method for presenting an optimized image to a user. An optimized image is created for display relative to the user's entire field of view in the scene. The user's head and eye positions and movements are tracked to determine the user's focal region. Coupling a portion of the optimized image to a focal region of the user at a current position of the eye, predicting a next position of the head and the eye, and coupling a portion of the optimized image to the focal region of the user at the next position.

In addition, a head mounted display device is provided. The head mounted display includes a display that couples at least a portion of the optimized image to a focal region of the user. Inertial, magnetic, mechanical, and/or other sensors sense orientation information of the head mounted display device, and eye tracking sensors detect user eye position. A processing unit in communication with the display, the inertial sensor, and/or other sensors and the eye tracking sensor automatically displays an optimized portion of the optimized image at a current position of the user's eye relative to the display such that the portion of the image is coupled to the user's focal region. The processing device then determines a next position of the user's eye and displays another optimized portion of the optimized image at the next position of the user's eye relative to the display such that the portion of the image is coupled to the user's focal region at the next position.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

Drawings

FIG. 1 is a block diagram depicting example components of one embodiment of a system for generating optimized content based on user intent.

FIG. 2A is a flow chart depicting a method in accordance with the present technique.

FIG. 2B depicts one embodiment of a user's view of one or more objects in the user's environment.

FIG. 2C is an illustration of an optimized image at full resolution shown to cover a user's view of the user environment.

Fig. 2D is an illustration of a user's view of an optimized portion and a partially obscured field of view of a full resolution image.

FIG. 2E is an illustration of one type of focal region for use with the present technique.

FIG. 2F is an illustration of a user environment.

FIG. 2G is an illustration of current and next position optimized portions of a simultaneous rendering of a full resolution image and a partially obscured field of view of a user.

FIG. 3A is a top view of a portion of one embodiment of a head mounted display unit.

FIG. 3B is a top view of a portion of another embodiment of a head mounted display unit.

Fig. 3C and 3D are top and side views of a portion of another embodiment of a head mounted display unit.

FIG. 4A is a block diagram of one embodiment of components of a head mounted display unit.

FIG. 4B is a block diagram of one embodiment of the components of a processing unit associated with a head mounted display unit.

FIG. 5 is a block diagram of one embodiment of the components of a hub computing system for use with a head mounted display unit.

FIG. 6 is a block diagram of one embodiment of a computing system that may be used to implement the hub computing system described herein.

FIG. 7 is a block diagram depicting a multi-user system for generating optimized content.

FIG. 8 describes one embodiment of a process for generating optimized content based on user intent.

FIG. 9 describes one embodiment of a process for a user to create a model of a user's space.

FIG. 10 is a flow chart describing one embodiment of a process for segmenting a model of a space into objects.

FIG. 11 is a flow chart describing one embodiment of a process for tracking a field of view of a user and determining a focal region of the user.

FIG. 12 is a flow chart describing one embodiment of a process performed by the hub computing system to provide tracking information for use in the process of FIG. 2A.

FIG. 13 is a flow chart describing one embodiment of a process for tracking an eye, where the results of the process are used by the process of FIG. 12.

FIG. 14 is a flow chart describing one embodiment of a process for providing a selected portion of an optimized image to couple to a user's fovea.

FIG. 15 is a flow chart describing a process for estimating a next position of an eye from a current position of the eye.

FIG. 16 is a method for predicting one or more subsequent eye positions according to the steps of FIG. 15.

Detailed Description

Techniques for enhancing a user's experience when using a near-eye display device are disclosed. A user views a scene through a near-eye display device, such as a head-mounted display device. A user field of view is determined as the environment or space the user is viewing. The following images are rendered: the image is optimized for use with respect to the field of view. The user's focal region is determined by tracking the position of the user's eyes within the field of view. The display of the optimized image is provided by coupling the optimized portion of the image to the user's focal region, in one case the user's fovea, to reduce the processing and energy required for display. The user eye position is tracked and a next eye position is calculated to position the portion of the image at the next position in accordance with movement of the user's eyes to the next position.

The positioning of the optimized portion of the image is performed by any number of different display devices, including mechanically controlled mirrors and projection displays. A predictive algorithm is used to determine the potential next position of the user's eyes.

FIG. 1 is a block diagram depicting exemplary components of one embodiment of a system 10 for generating an optimized image based on user intent. System 10 includes a see-through display device as a near-eye head-mounted display device 2 in communication with a processing unit 4 via line 6. In other embodiments, head mounted display device 2 communicates with processing unit 4 through wireless communication. Although the components of FIG. 1 illustrate a see-through display device, other display embodiments suitable for use with the present technology are illustrated in FIGS. 3B-3D.

Head mounted display device 2, which in one embodiment is in the shape of glasses, is worn on the head of a user so that the user can see through the display and thereby have an actual direct view of the space in front of the user. The term "actual and direct view" is used to refer to the ability to view real-world objects directly with the human eye, rather than to view a created image representation of the object. For example, viewing through glasses in a room would allow a user to have an actual direct view of the room, whereas viewing a video of a room on a television is not an actual direct view of the room. More details of head mounted display device 2 are provided below. Although the devices shown in fig. 1 and 3A-3D are in the form of glasses, the head mounted display device 2 may take other forms, such as a helmet with goggles.

In one embodiment, processing unit 4 is worn on the user's wrist and includes a portion of the computing power for operating head mounted display device 2. Processing unit 4 may communicate wirelessly (e.g., WiFi, bluetooth, infrared, or other wireless communication means) with one or more hub computing systems 12.

Hub computing system 12 may be a computer, a gaming system or console, or the like. According to an example embodiment, hub computing system 12 may include hardware components and/or software components such that hub computing system 12 may be used to execute applications such as gaming applications, non-gaming applications, and the like. In one embodiment, hub computing system 12 may include a processor, such as a standardized processor, a specialized processor, a microprocessor, or the like that may execute instructions stored on a processor readable storage device for performing the processes described herein.

In various embodiments, the processes described herein with reference to FIGS. 2A and 8-15 are performed in whole or in part by either or both of hub computing system 12, processing unit 4, and/or a combination thereof.

Hub computing system 12 also includes one or more capture devices, such as capture devices 20A and 20B. In other embodiments, more or less than two capture devices may be used. In one exemplary embodiment, the capture devices 20A and 20B are pointed in different directions so that they can capture different parts of a room. It may be advantageous for the fields of view of the two capture devices to overlap slightly so that hub computing system 12 may understand how the fields of view of the capture devices relate to each other. In this way, multiple capture devices may be used to view an entire room (or other space). Alternatively, one capture device may be used if the capture device can be translated during operation such that the entire relevant space is viewed by the capture device over time.

The capture devices 20A and 20B may be, for example, cameras that visually monitor one or more users and the surrounding space such that gestures and/or movements performed by the one or more users and the structure of the surrounding space may be captured, analyzed, and tracked to perform one or more controls or actions in an application and/or animate an avatar or on-screen character.

The hub computing environment 12 may be connected to an audiovisual device 16 such as a television, a monitor, a high-definition television (HDTV), or the like that may provide game or application visuals. For example, hub computing system 12 may include a video adapter such as a graphics card and/or an audio adapter such as a sound card that may provide audiovisual signals associated with the game application, non-game application, or the like. The audiovisual device 16 may receive the audiovisual signals from the hub computing system 12 and may then output game or application visuals and/or audio associated with the audiovisual signals. According to one embodiment, the audiovisual device 16 may be connected to the hub computing system 12 via, for example, an S-video cable, a coaxial cable, an HDMI cable, a DVI cable, a VGA cable, a component video cable, an RCA cable, or the like. In one example, the audiovisual device 16 includes a built-in speaker. In other embodiments, the audiovisual device 16, a separate stereo system, or the hub computing device 12 is connected to the external speakers 22.

Hub computing device 10 may be used with capture devices 20A and 20B to recognize, analyze, and/or track human (and other types of) targets. For example, the user wearing head mounted display device 2 may be tracked using capture devices 20A and 20B such that gestures and/or motions of the user may be captured to animate an avatar or on-screen character and/or gestures and/or movements of the user may be interpreted as controls that may be used to affect an application executed by hub computing system 12.

In one embodiment of the disclosed technology, discussed in detail below, the system 10 generates an optimized image for a user based on determining the field of view and the focal region of the user. The optimized image may include, for example, an enhanced appearance of objects on the field of view or artificially overlaid objects that provide an enhanced experience for the user. The optimized image is displayed to the user in an optimized portion (which is shown in fig. 2B) by the head mounted display device 2 according to the user's eye position and movement pattern.

FIG. 2A illustrates one embodiment of a method for coupling a portion of an optimized image to an eye in a display. The process of fig. 2A will be described with reference to fig. 2B-2G.

At step 30, the field of view and focal region of the user are determined. As described above, the user's field of view is related to: the user's environment or scene, the user's head position and orientation, and the user's eye position. Fig. 2F shows a user 1112 in environment 1100. The user 1112 is shown viewing a number of objects in a range of vision (defined by lines 1121) including a lamp 1106, a table 1120, a processing device 1116, capture devices 20A, 20B, a display 1110 and a clock 1118. The user also sees the floor 1108 and wall 1102, but does not see the chair 1107 and only sees a portion of the wall 1104. Environment 1100 may be defined relative to a coordinate system 1150 and a user's head position defined relative to a second coordinate system 1152.

In one embodiment, the focal region within the user's field of view is that region at the point of regard 150 along the focus curve. For example, the convergence between the pupils can be used to triangulate the focal point on the focus curve 147 (i.e., the monocular (Horopter)) from which the focal region and Panum's fusional area can be calculated. Panum's confluence area 147 is the following area on the retina: this region is such that any point in the region will merge with a single point on the other retina and create a single vision through the human eye that represents binocular stereopsis. As shown in fig. 2E, each user's eye includes a fovea (also commonly referred to as a fovea) that is located at the center of the macular region of the retina. The fovea is responsible for obtaining clear central vision (also known as foveal vision) that is necessary for humans when reading, watching television or movies, driving, and any activity where visual detail is of paramount importance. The dimples of fig. 2E are shown at 148 and 149.

Orienting and coupling the optimized image relative to the fovea 148, 149 will ensure that: the user may focus vision with respect to the optimized portion of the image. In addition, the portion of the image that needs to be coupled to the fovea is a relatively small image, on the order of 1mm in diameter on the retina. Rendering a relatively small portion of the area by the head mounted display 2 will reduce the power requirements of the head mounted display 2.

At step 32, an optimized image to be displayed to the user for the determined field of view is created. The optimized image is determined by the application rendering the image and may include one or more individual overlay images within the field of view or encompass the entire field of view.

FIG. 2B shows what a user 1112 in environment 1100 sees through display device 2 when no image is projected on the display. FIG. 2C shows an optimized image that may be projected on the scene in FIG. 2B. In the example of fig. 2C, the optimized image 1200 completely obscures the scene of the environment 1100. In this example, the scene depicts a person 1128, a bird image 1126, and a trophy 1127. In one embodiment, the application rendering the image has determined the configuration and location of the elements within the scene, and whether such objects should obscure real world objects, such as the lamp 1106, the clock 1118, and the display 1110.

At step 34, the current position of the user's eye is determined, and at 36, the optimized portion of the optimized image is coupled to the user at the current eye position in the focal region. In one example, this is shown in fig. 2D, where the bird image 1126 is shown as highlighted and superimposed on the real world environment, with the balance of the bird image 1126 with respect to the image highlighted. In one embodiment, other elements of the optimized image (in this example, a person and a trophy) are not rendered, or rendered at a lower resolution (not shown). In another aspect, other visual elements of the room may be obscured from view by the user.

By focusing the processing power of the display device on rendering only the optimized portion of the image that is coupled to the user's foveal vision, other elements in the optimized image need not be rendered, or may be rendered with less precision and thus less resources than the optimized image. In one embodiment, the optimized image is a segment of the entire image. A normal eye pupil may have a diameter between 1mm in bright light to 7mm in dark. The display is typically optimized for 3mm diameter light. By focusing this portion of the image onto the pupil, the image light can be translated directly to the user's focal region, thereby significantly reducing the light required to generate the image.

To focus a portion of an image on a subject's pupil, a direction and target point on the pupil are assigned to the light rays generated by the image. Some light rays from the vicinity of the eye will enter the eye pupil, but light rays that do not enter the pupil are wasted, consume power, and may have other undesirable effects.

In general, one can consider: light rays from distant points are all nearly parallel near the eye and share a generally common direction. Ideally, the optics of the eye will focus the rays at the foveal region on the retina. Parallel rays coming from different directions will be seen as different points.

To optimize the image at the pupil of the eye, the head-mounted display changes the direction in which light rays from the image are directed and the entry point on the pupil. In the case where the subject can correct the optical distortion of a scene viewed through the optical element in free space, the correction and guidance of the image display is performed by the mechanism of the head-mounted display in the present technology.

In accordance with the present technique, the positioning systems 160, 160a of the embodiments described below provide directional positioning of a microdisplay or a mirror that reflects an image to a user. This directional positioning, together with the positioning of the image relative to the display or mirror, provides an optimized position and orientation relative to the eye position. This may be accomplished, for example, by tilting the display 153 or mirror 166 in three dimensions, with the image rendered on and in the appropriate position on the display or mirror.

It will be appreciated that various other types of mechanical or electromechanical elements may be provided to optimize the orientation of the displayed image in addition to those set forth below. This directional localization in combination with the predictive eye tracking of the present technique provides an optimized process for the system.

To maintain coupling of the optimized portion of the image, the next possible eye movement of the user is tracked at step 38, and another optimized portion of the optimized display image is coupled to the user's focal region at the next location at step 40. If the field of view changes at 44, a new field of view is determined at 32. If the field of view has not changed at 44, then at 38 it is determined that: whether the user's eye has actually moved to the predicted position and the method calculates a potential next eye movement position at 38. The loop of tracking the current position of the eye at 44, calculating the next position at 38, may be performed by one or more microprocessors or dedicated tracking circuits in a near instantaneous manner to move portions of the optimized image to the next position of the user's eye movement according to the user's eye movement in order to provide a suitable visual experience.

It should be appreciated that the terms "next" and "current" as used in the foregoing description are not necessarily limited to a single location of an image. For example, referring to FIG. 2G, each portion rendered as a current location at step 36 or 40 may include a portion of the image at a first time T (image 1126) and a portion of the image at a second time T2 where the user's eye is predicted to be at time T2 (image 1135), such that each "current" and "next" portion described in FIG. 2A may include two image portions.

Moreover, it can be appreciated that alternative portions of the image 1200 can be rendered at full or partially full resolution to attract the user's eyes to that location. For example, the application may select render player 1128 or trophy 1127 so as to attract the user's eye movements and the user's attention in a given context of the application.

It should also be appreciated that the present technique does not require the use of overlay images and may be advantageously used to simply display images to a user without reference to the user's environment.

Fig. 3A depicts a top view of head mounted display device 2, including the portion of the frame that includes temple 102 and nose bridge 104. Only the right side of head mounted display device 2 is depicted. A microphone 110 is placed in nose bridge 104 for recording sounds and transmitting audio data to processing unit 4, as will be described below. In front of the head mounted display device 2 is a room facing a video camera 113 that can capture video and still images. These images are transmitted to a processing unit 4, which will be described below.

A portion of the frame of head mounted display device 2 will surround the display (which includes one or more lenses). To illustrate the components of head mounted display device 2, the portion of the frame surrounding the display is not depicted. The display includes a light guide optical element 112, an opacity filter 114, a see-through lens 116, and a see-through lens 118. In one embodiment, opacity filter 114 is behind and aligned with see-through lens 116, light guide optical element 112 is behind and aligned with opacity filter 114, and see-through lens 118 is behind and aligned with light guide optical element 112. See-through lenses 116 and 118 are standard lenses used in eyeglasses and can be made according to any prescription, including no prescription. In one embodiment, see-through lenses 116 and 118 may be replaced by variable prescription lenses. In some embodiments, head mounted display device 2 will include only one see-through lens or no see-through lens. In another alternative, the prescription lens may enter into the light guide optical element 112. Opacity filter 114 filters out natural light (either on a per pixel basis or uniformly) to enhance the contrast of the virtual image. The light guide optical element 112 guides the artificial light to the eye. More details of opacity filter 114 and light guide optical element 112 are provided below.

Mounted at or within temple 102 is an image source that (in one embodiment) includes a microdisplay assembly 120 for projecting a virtual image and a lens 122 for directing the image from the microdisplay 120 into the light guide optical element 112. In one embodiment, the lens 122 is a collimating lens.

Control circuitry 136 provides various electronics that support other components of head mounted display device 2. More details of the control circuit 136 are provided below with reference to fig. 4A and 4B. Within temple 102 or mounted at temple 102 are an earphone 130, an inertial sensor 132, and/or a magnetic sensor 132, and a temperature sensor 138. In one embodiment, inertial sensor 132 magnetic sensors 132 include a three axis magnetometer 132A, three axis gyroscope 132B, and three axis accelerometer 132C (see FIG. 5). Inertial and/or magnetic sensors are used to sense the position, orientation, and sudden acceleration of head mounted display device 2.

Microdisplay 120 projects an image through lens 122. There are different image generation techniques that can be used to implement microdisplay 120. For example, microdisplay 120 can be implemented using a transmissive projection technology where the light source is modulated by optically active material, backlit with white light. These techniques are typically implemented using LCD-type displays with powerful backlights and high optical power densities. Microdisplay 120 can also be implemented using a reflective technology where external light is reflected and modulated by an optically active material. According to this technique, the illumination is forward lit by a white light source or an RGB source. Digital Light Processing (DLP), Liquid Crystal On Silicon (LCOS), and from Qualcomm, IncThe display techniques are all examples of efficient reflection techniques, as most of the energy is reflected off the modulated structure and can be used in the systems described herein. Additionally, microdisplay 120 can be implemented using an emissive technology, where light is generated by the display. For example, PicoP from Microvision, Inc^TMThe display engine uses a micro-mirror rudder to emit laser signals onto a small screen acting as a transmissive element or directly to the eye (e.g., laser).

Light guide optical element 112 transmits light from microdisplay 120 to an eye 140 of a user wearing head mounted display device 2. The light guide optical element 112 also allows light to be transmitted from the front of the head mounted display device 2 through the light guide optical element 112 to the user's eye as indicated by arrow 142, allowing the user to have an actual direct view of the space in front of the head mounted display device 2 in addition to receiving the virtual image from the microdisplay 120. Thus, the walls of the light guide optical element 112 are see-through. The light guide optical element 112 includes a first reflective surface 124 (e.g., a mirror or other surface). Light from microdisplay 120 passes through lens 122 and is incident on reflecting surface 124. Reflective surface 124 reflects incident light from microdisplay 120 such that light is trapped by internal reflection within the planar substrate comprising light guide optical element 112. After several reflections off the surface of the substrate, the captured light waves reach the array of selectively reflective surfaces 126. Note that only one of the five surfaces is labeled 126 to prevent the drawings from being too crowded. The reflective surfaces 126 couple light waves exiting the substrate and incident on these reflective surfaces to the user's eye 140. Since different light rays will travel at different angles and bounce off the interior of the substrate, these different light rays will hit the various reflective surfaces 126 at different angles. Thus, different light rays will be reflected out of the substrate by different ones of the reflecting surfaces. The choice of which rays will be reflected from the substrate by which surface 126 is engineered by selecting the appropriate angle of the surface 126. More details of light-guiding Optical elements can be found in U.S. patent application publication No. 2008/0285140, serial No. 12/214,366, "Substrate-Guided Optical Devices," published on 20.11.2008, which is incorporated herein by reference in its entirety. In one embodiment, each eye will have its own light guide optical element 112. When a head mounted display device has two light guide optical elements, each eye may have its own microdisplay 120, which microdisplay 120 may display the same image in both eyes or different images in both eyes. In another embodiment, there may be one light guide optical element that reflects light into both eyes.

Opacity filter 114, which is aligned with light guide optical element 112, selectively blocks natural light from passing through light guide optical element 112, either uniformly or on a per-pixel basis. In one embodiment, the opacity filter may be a see-through LCD panel, an electrochromic film (PDLC), or similar device capable of acting as an opacity filter. Such a see-through LCD panel can be obtained by removing the various layers of substrate, backlight and diffuser from a conventional LCD. The LCD panel may include one or more light-transmissive LCD chips that allow light to pass through the liquid crystal. Such chips are used, for example, in LCD projectors.

Opacity filter 114 may include a dense grid of pixels, where the light transmittance of each pixel can be individually controlled between a minimum and a maximum light transmittance. Although a light transmission range of 0-100% is desirable, a more limited range is acceptable. By way of example, a monochrome LCD panel having no more than two polarizing filters is sufficient to provide an opacity range of about 50% to 99% per pixel, up to the resolution of the LCD. At a minimum of 50%, the lens will have a slightly tinted appearance, which is tolerable. A light transmission of 100% represents a perfectly clear lens. The "alpha" scale may be defined from 0-100%, where 0% does not allow light to pass through and 100% allows all light to pass through. The value of alpha may be set for each pixel by the opacity filter control circuit 224 described below.

After z-buffering (z-buffering) with the proxy for real world objects, a mask of alpha values from the rendering pipeline may be used. When the system renders a scene for an augmented reality display, the system notes which real-world objects are in front of which virtual objects. If a virtual object is in front of a real world object, the opacity should be on for the coverage area of the virtual object. If the virtual object is (in fact) behind a real-world object, the opacity should be off, as well as any color of the pixel, so that for the corresponding region of real light (which is one pixel or more in size), the user will only see the real-world object. Coverage will be on a pixel-by-pixel basis, so the system can handle the case where a portion of a virtual object is in front of a real-world object, a portion of the virtual object is behind the real-world object, and a portion of the virtual object coincides with the real-world object. For such uses, displays that can go from 0% to 100% opacity at low cost, power and weight are most desirable. In addition, the opacity filter may be rendered in color, such as with a color LCD or with other displays such as organic LEDs, to provide a wide field of view. Further details of an Opacity Filter are provided in U.S. patent application No. 12/887,426, "Opacity Filter For set-Through Mounted Display," filed on 21/9/2010, the entire contents of which are incorporated herein by reference.

Opacity filters such as LCDs have not typically been used with see-through lenses as described herein because at this close distance from the eye they may be out of focus. However, in some cases, this result may be desirable. With a normal HMD display using additive colors (which is designed to be in focus), the user sees a virtual image with clear color graphics. The LCD panel is placed "behind" the display so that a fuzzy black border surrounds any virtual content, making it opaque as desired. The system reverses the drawbacks of natural blurring to conveniently achieve anti-aliasing properties and bandwidth reduction. These are natural results of using lower resolution and out-of-focus images. There is an effective smoothing of the digitally sampled image. Any digital image experiences aliasing, where the discretization of the sampling causes errors compared to natural analog and continuous signals. Smoothing means visually closer to the ideal analog signal. Although the information lost at low resolution is not recovered, the resulting error is less noticeable.

In one embodiment, the display and the opacity filter are rendered simultaneously and calibrated to the user's precise location in space to compensate for the angular offset problem. Eye tracking can be used to calculate the correct image offset at the end of the field of view. In some embodiments, a temporal and spatial fade in the amount of opacity may be used in the opacity filter. Similarly, temporal and spatial fading may be used in the virtual image. In one approach, a temporal fade in the amount of opacity of the opacity filter corresponds to a temporal fade in the virtual image. In another approach, a spatial fade in the amount of opacity of the opacity filter corresponds to a spatial fade in the virtual image.

In one exemplary approach, an increased opacity is provided for pixels of the opacity filter that are behind the virtual image from the perspective of the identified position of the user's eye. In this way, pixels behind the virtual image are darkened such that light from the corresponding portion of the real world scene is blocked from reaching the user's eyes. This allows the virtual image to be realistic and represent a full range of colors and intensities. Furthermore, power consumption of the augmented reality emitter is reduced since the virtual image can be provided at a lower brightness. Without the opacity filter, the virtual image would need to be provided at a sufficiently high brightness that is brighter than the corresponding portion of the real world scene so that the virtual image is distinct and not transparent. In darkening the pixels of the opacity filter, generally, pixels along the closed perimeter of the virtual image are darkened along with pixels within the perimeter. It may be desirable to provide some overlap so that some pixels that are just outside and around the perimeter are also darkened (at the same level of darkness or less than the pixels within the perimeter). These pixels that are just outside the perimeter may provide a fade (e.g., a gradual transition in opacity) from darkness inside the perimeter to a full amount of non-darkness outside the perimeter.

Head mounted display device 2 also includes a system for tracking the position of the user's eyes. As will be explained below, the system will track the position and orientation of the user so that the system can determine the user's field of view. However, a human will not perceive everything in front of it. Instead, the user's eyes will be directed at a subset of the environment. Thus, in one embodiment, the system will include techniques for tracking the position of the user's eyes in order to refine the measurement of the user's field of view. For example, head mounted display device 2 includes eye tracking component 134 (fig. 3A), which eye tracking component 134 will include eye tracking illumination device 134A (see fig. 4A) and eye tracking camera 134B (see fig. 4A). In one embodiment, eye-tracking illumination source 134A includes one or more Infrared (IR) emitters that emit IR light toward the eye. Eye tracking camera 134B includes one or more cameras that sense reflected IR light. The location of the pupil can be identified by known imaging techniques that detect the reflection of the cornea. See, for example, U.S. patent 7,401,920 entitled "Head mounted eye tracking and display system," issued to Kranz et al, 22/7/2008, which is incorporated herein by reference. Such techniques may locate the position of the center of the eye relative to the tracking camera. In general, eye tracking involves obtaining images of the eye and using computer vision techniques to determine the location of the pupil within the eye socket. In one embodiment, it is sufficient to track the position of one eye, since the eyes typically move in unison. However, it is possible to track each eye separately.

In one embodiment, eye tracking illumination device 134A will use 4 IR LEDs and eye tracking camera 134 will use 4 IR photodetectors (not shown) arranged in a rectangle such that there is one IR LED and IR photodetector at each corner of the lens of head mounted display device 2. Light from the LED reflects off the eye. The pupil direction is determined by the amount of infrared light detected at each of the 4 IR photodetectors. That is, the amount of white versus black in the eye will determine the amount of light reflected off the eye for that particular photodetector. Thus, the photodetector will have a measure of the amount of white or black in the eye. From these 4 samples, the system can determine the direction of the eye.

Another alternative is to use 4 infrared LEDs as discussed below, but only one infrared CCD at the side of the lens of the head mounted display device 2. The CCD will use small mirrors and/or lenses (fish eyes) so that the CCD can image up to 75% of the visible eyes from the frame. The CCD will then sense the image and use computer vision to find the eye position, as discussed below. Thus, although FIGS. 3A-3C show one component with one IR emitter, the configuration of FIG. 2 may be adapted to have 4 IR emitters and/or 4 IR sensors. (note: image references need to be updated-this cannot be seen in any of the figures). More or less than 4 IR transmitters and/or more or less than 4 IR sensors may also be used.

Another embodiment for tracking eye direction is based on charge tracking. This scheme is based on the following observations: the retina carries a measurable positive charge and the cornea has a negative charge. The sensor is mounted through the user's ear (near the earpiece 130) to detect the potential of the eye as it rotates and effectively read out what the eye is doing in real time. Other embodiments for tracking the eye may also be used.

Fig. 3B-3D show alternative embodiments 2B and 2c of a portion of head mounted display device 2. In fig. 3B-3D, like reference numerals refer to like parts to those identified in fig. 3A.

Fig. 3B shows a non-see-through head mounted display device 2B. The display device 2B of fig. 3B uses a forward facing lens 133a, which lens 133a is coupled to a waveguide 124A to couple a view of a scene, such as environment 1100, to the user's eye 140. The micro-display 153 may comprise any of the aforementioned display types, such as an LCD, LED, or OLED, etc., having a resolution defined by an array of individually actuated pixel elements, the combination of which is used to generate an optimized image suitable for coupling to a user's fovea. The microdisplay 153 can be coupled to a plurality of microelectromechanical elements 160a, the microelectromechanical elements 160a being coupled to each corner of the display to position the display in three dimensions relative to the user's eye 140. Thus, the microdisplay 153 can have multiple axes "Z" and "X" rotated about the center point of the display, and vertical "V" and lateral "L" positioned relative to the user's eye.

As shown in FIG. 3B, only those elements of the display that render image portion 1126 of the optimized image (bird 1126 in this case) are driven to provide a high resolution image, such that the focal region of user's eye 140 is directly coupled to light from image 1126. Image 1126 is surrounded by portion 1126a to show: only image 1126, a portion of optimized image 1200 of the user's entire field of view in the environment, is rendered in fig. 3B.

Fig. 3c and 3d show another alternative embodiment 2b of the present technique. Fig. 3D is a top view of the head mounted display device 2D, and fig. 3C is a side view of the head mounted display device 2D. In fig. 3C and 3D, the head mounted display device 2C includes a support structure 162, micro-electromechanical elements 160, 161, 163 (and a fourth micro-electromechanical element — not shown), and a mirror 166. One or more microdisplay elements 170 are positioned adjacent to mirror 166, where elements 170 may be equivalent to display 120 described with reference to fig. 3A. Mirror 166 can be moved by micro-electromechanical elements 160, 161, 163 relative to display structure 162 to direct the emitted objects from the micro-display elements into the focal region of the user's eye. The microelectromechanical elements 160, 161, 163 may include piezoelectric elements, or other mechanically or electromechanically controlled elements that, when used in cooperation, may position the mirror 166 relative to the support structure 162 along three axes of motion. In a manner similar to microdisplay 153, these micro-electromechanical elements 160 and 163 are coupled at each corner of the mirror to position the mirror in three dimensions relative to the user's eye 140. Thus, the mirror 166 may have multiple axes "Z" and "X" that rotate about a center point of the display, and vertical "V" and lateral "L" that are positioned relative to the user's eye. It should be appreciated that movement of the mirror 166 may be used alone or in combination with the directional output of the microdisplay elements to position an optimized portion of the image (in this example, the bird image 1126) in the user's focal region.

Fig. 3A-3D show only half of the head mounted display devices 2a-2 c. A complete head-mounted display device would include (where applicable) another set of see-through lenses, another opacity filter, another light guide optical element, another microdisplay, another lens, a room-facing camera, an eye-tracking assembly, a microdisplay, headphones, and a temperature sensor.

FIG. 4A is a block diagram depicting various components of head mounted display devices 2a-2 c. As can be appreciated with reference to FIGS. 3A-3D, some of the components shown in FIG. 4A may not be present in each of the embodiments shown in FIGS. 3A-3D. Fig. 4B is a block diagram depicting the various components of the processing unit 4. The components of each head mounted display device 2 are shown in FIG. 4A, the head mounted display devices 2 being used to display optimized images to a user. Additionally, the head mounted display device assembly of FIG. 4A includes a number of sensors that track various conditions. The head mounted display devices 2a-2c will receive instructions from the processing unit 4 regarding the virtual image and provide sensor information back to the processing unit 4. The components of processing unit 4 are depicted in FIG. 4B, where processing unit 4 will receive sensory information from head mounted display devices 2a-2c and also from hub computing device 12 (see FIG. 1). Based on this information, processing unit 4 will determine where and when to provide the virtual image to the user and send instructions to the head mounted display device of FIG. 4A accordingly.

Note that some of the components of fig. 4A (e.g., back-facing camera 113, eye-tracking camera 134B, micro-display 120 or 153, opacity filter 114, eye-tracking illumination 134A, and headphones 130) are shown shaded to indicate that there are two of each of these, one for the left side of head mounted display device 2 and one for the right side of head mounted display device 2. Fig. 4A shows a control circuit 200 in communication with a power management circuit 202. The control circuit 200 includes a processor 210, a memory controller 212 in communication with a memory 214 (e.g., D-RAM), a camera interface 216, a camera buffer 218, a display driver 220, a display formatter 222, a timing generator 226, a display output interface 228, and a display input interface 230. In one embodiment, all components of the control circuit 200 communicate with each other over dedicated lines or one or more buses. In another embodiment, each component of the control circuit 200 is in communication with the processor 210. Camera interface 216 provides an interface to the two room-facing cameras 113 and stores images received from the room-facing cameras in camera buffer 218. Display driver 220 will drive microdisplay 120 or 153. Display formatter 222 provides information about the virtual image displayed on microdisplay 120 or 153 to opacity control circuit 224, which controls opacity filter 114. A timing generator 226 is used to provide timing data to the system. The display output interface 228 is a buffer for providing images from the room-facing camera 113 to the processing unit 4. Display input 230 is a buffer for receiving images, such as virtual images to be displayed on microdisplay 120. The display output 228 and the display input 230 communicate with a band interface 232 that is an interface to the processing unit 4. The display driver 220 may also drive the mirror controller 162 to position the mirror 166 to display a focused image according to the above-described embodiments of fig. 3C and 3D.

Power management circuit 202 includes voltage regulator 234, eye tracking illumination driver 236, audio DAC and amplifier 238, microphone preamplifier audio ADC 240, temperature sensor interface 242, and clock generator 244. Voltage regulator 234 receives power from processing unit 4 through band interface 232 and provides the power to the other components of head mounted display device 2. Each eye tracking illumination driver 236 provides an IR light source for eye tracking illumination 134A as described above. Audio DAC and amplifier 238 receives audio information from headphones 130. Microphone preamplifier and audio ADC 240 provides an interface for microphone 110. Temperature sensor interface 242 is an interface for temperature sensor 138. Power management unit 202 also provides power to and receives data back from three axis magnetometer 132A, three axis gyroscope 132B, and three axis accelerometer 132C.

Fig. 4B is a block diagram depicting the various components of the processing unit 4. Fig. 4B shows control circuitry 304 in communication with power management circuitry 306. The control circuit 304 includes: a Central Processing Unit (CPU) 320; a Graphics Processing Unit (GPU) 322; a cache 324; a RAM 326; a memory controller 328 in communication with a memory 330 (e.g., D-RAM); a flash controller 332 in communication with flash memory 334 (or other type of non-volatile storage); a display output buffer 336 in communication with head mounted display device 2 through band interface 302 and band interface 232; a display input buffer 338 in communication with head mounted display device 2 via band interface 302 and band interface 232; a microphone interface 340 for communicating with an external microphone connector 342 for connecting to a microphone, a pci express interface for connecting to a wireless communication device 346; and a USB port 348. In one embodiment, the wireless communication device 346 may include a Wi-Fi enabled communication device, a Bluetooth communication device, an infrared communication device, and the like. The USB port may be used to interface processing unit 4 to hub computing device 12 for loading data or software onto processing unit 4 and for charging processing unit 4. In one embodiment, CPU 320 and GPU 322 are the main load devices used to determine where, when, and how to insert virtual images into a user's field of view. More details are provided below.

Power management circuitry 306 includes a clock generator 360, an analog-to-digital converter 362, a battery charger 364, a voltage regulator 366, a head-mounted display power supply 376, and a temperature sensor interface 372 (which is located on a wrist band of processing unit 4) that communicates with a temperature sensor 374. The analog-to-digital converter 362 is connected to the charging jack 370 for receiving AC power and generating DC power for the system. The voltage regulator 366 communicates with a battery 368 for providing power to the system. Battery charger 364 is used to charge battery 368 (via voltage regulator 366) upon receiving power from charging jack 370. HMD power supply 376 provides power to head mounted display device 2.

The system described above will be configured to insert a virtual image into the user's field of view such that the virtual image replaces the view of the real world object. Alternatively, the virtual image may be inserted without replacing the image of the real world object. In various embodiments, the virtual image will be adjusted to match the appropriate orientation, size, and shape based on the object being replaced or the environment in which the image is to be inserted. In addition, the virtual image may be adjusted to include reflections and shadows. In one embodiment, head mounted display device 12, processing unit 4, and hub computing device 12 work together in that each device includes a subset of sensors for obtaining data for determining where, when, and how to insert a virtual image. In one embodiment, the calculations to determine where, how, and when to insert the virtual image are performed by the hub computing device 12. In another embodiment, these calculations are performed by the processing unit 4. In another embodiment, some of these calculations are performed by hub computing device 12 while other calculations are performed by processing unit 4. In other embodiments, these calculations may be performed by head mounted display device 12.

In one exemplary embodiment, hub computing device 12 will create a model of the environment in which the user is located and track a plurality of moving objects in the environment. In addition, hub computing device 12 tracks the field of view of head mounted display device 2 by tracking the position and orientation of head mounted display device 2. The model and tracking information is provided to the processing unit 4 from the hub computing device 12. The sensor information obtained by the head mounted display device 2 is transmitted to the processing unit 4. Processing unit 4 then uses the other sensor information it receives from head mounted display device 2 to refine the user's field of view and provide instructions to head mounted display device 2 as to how, where, and when to insert the virtual image.

FIG. 5 illustrates an exemplary embodiment of hub computing system 12 having a capture device. In one embodiment, capture devices 20A and 20B are the same structure, and thus FIG. 5 only shows capture device 20A. According to an example embodiment, the capture device 20A may be configured to capture video with depth information including a depth image that may include depth values via any suitable technique that may include, for example, time-of-flight, structured light, stereo image, or the like. According to one embodiment, the capture device 20A may organize the depth information into "Z layers," i.e., layers that may be perpendicular to a Z axis extending from the depth camera along its line of sight.

As shown in FIG. 5, capture device 20A may include a camera component 423. According to an exemplary embodiment, camera component 423 may be or may include a depth camera that may capture a depth image of a scene. The depth image may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may represent a depth value, such as a distance in, for example, centimeters, millimeters, or the like of an object in the captured scene from the camera.

The camera component 23 may include an Infrared (IR) light component 425, a three-dimensional (3D) camera 426, and an RGB (visual image) camera 428 that may be used to capture a depth image of a scene. For example, in time-of-flight analysis, the IR light component 425 of the capture device 20A may emit an infrared light onto the scene and may then detect the backscattered light from the surface of one or more targets and objects in the scene using sensors (including sensors not shown in some embodiments), for example using the 3D camera 426 and/or the RGB camera 428. In some embodiments, pulsed infrared light may be used such that the time between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device 20A to a particular location on the targets or objects in the scene. Additionally, in other example embodiments, the phase of the outgoing light wave may be compared to the phase of the incoming light wave to determine the phase shift. The phase shift may then be used to determine a physical distance from the capture device to a particular location on the targets or objects.

According to another exemplary embodiment, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device 20A to a particular location on the targets or objects by analyzing the intensity of the reflected beam of light over time via various techniques including, for example, shuttered light pulse imaging.

In another exemplary embodiment, capture device 20A may use a structured light to capture depth information. In such an analysis, patterned light (i.e., light displayed as a known pattern such as a grid pattern, a stripe pattern, or a different pattern) may be projected onto the scene via, for example, the IR light component 424. Upon falling onto the surface of one or more targets or objects in the scene, the pattern may deform in response. Such a deformation of the pattern may be captured by, for example, the 3D camera 426 and/or the RGB camera 428 (and/or other sensors) and may then be analyzed to determine a physical distance from the capture device to a particular location on the targets or objects. In some implementations, the IR light component 425 is separate from the cameras 425 and 426 so that triangulation can be used to determine the distance from the cameras 425 and 426. In some implementations, the capture device 20A will include a dedicated IR sensor that senses IR light or a sensor with an IR filter.

According to another embodiment, the capture device 20A may include two or more physically separated cameras that may view a scene from different angles to obtain visual stereo data that may be resolved to generate depth information. Other types of depth image sensors may also be used to create the depth image.

The capture device 20A may also include a microphone 430, the microphone 430 including a transducer or sensor that may receive and convert sound into an electrical signal. Microphone 430 may be used to receive audio signals that may also be provided by hub computing system 12.

In an example embodiment, capture device 20A may also include a processor 432 that may be in communication with an image camera component 423. Processor 432 may include a standard processor, a special purpose processor, a microprocessor, etc. that may execute instructions including, for example, instructions for receiving depth images, generating appropriate data formats (e.g., frames), and transmitting data to hub computing system 12.

The capture device 20A may also include a memory 434, which memory 434 may store instructions executed by the processor 432, images or image frames captured by the 3D camera and/or the RGB camera, or any other suitable information, images, or the like. According to an example embodiment, the memory 434 may include Random Access Memory (RAM), Read Only Memory (ROM), cache, flash memory, a hard disk, or any other suitable storage component. As shown in FIG. 5, in one embodiment, the memory 434 may be a separate component in communication with the image capture component 423 and the processor 432. According to another embodiment, the memory component 434 may be integrated into the processor 432 and/or the image capture component 422.

Capture devices 20A and 20B communicate with hub computing system 12 via communication link 436. The communication link 436 may be a wired connection including, for example, a USB connection, a firewire connection, an ethernet cable connection, etc., and/or a wireless connection such as a wireless 802.11b, 802.11g, 802.11a, or 802.11n connection, etc. According to one embodiment, hub computing system 12 may provide a clock to capture device 20A over communication link 436 that may be used to determine when to capture a scene, for example. Additionally, capture device 20A provides depth information and visual (e.g., RGB) images captured by, for example, 3D camera 426 and/or RGB camera 428 to hub computing system 12 via communication link 436. In one embodiment, the depth image and visual image are transmitted at a rate of 30 frames per second, although other frame rates may be used. Hub computing system 12 may then create a model and use the model, depth information, and captured images to, for example, control an application such as a game or word processor and/or animate an avatar or on-screen character.

Hub computing system 12 includes a depth image processing and skeletal tracking module 450 that uses the depth images to track one or more persons detectable by the depth camera functionality of capture device 20A. Depth image processing and skeleton tracking module 450 provides tracking information to application 453, which may be a video game, a productivity application, a communications application, or other software application, among others. The audio data and visual image data are also provided to the application 452 and the depth image processing and skeleton tracking module 450. The application 452 provides the tracking information, audio data, and visual image data to the recognizer engine 454. In another embodiment, recognizer engine 454 receives tracking information directly from depth image processing and skeletal tracking module 450 and audio data and visual image data directly from capture devices 20A and 20B.

Recognizer engine 454 is associated with a set of filters 460, 462, 464, … …, 466, each filter including information about a gesture, action, or condition that any person or object that may be detected by capture device 20A or 20B performs. For example, data from the capture device 20A may be processed by the filters 460, 462, 464, … …, 466 to identify when a user or group of users has performed one or more gestures or other actions. These gestures may be associated with various controls, objects, or conditions of the application 452. Thus, hub computing system 12 may use recognizer engine 454 and filters together to interpret and track the movement of objects (including people).

Capture devices 20A and 20B provide RGB images (or visual images in other formats or color spaces) and depth images to hub computing system 12. The depth image may be a plurality of observed pixels, where each observed pixel has an observed depth value. For example, the depth image may include a two-dimensional (2D) pixel area of the captured scene where each pixel in the 2D pixel area may have a depth value, such as a distance of an object in the captured scene from the capture device. Hub computing system 12 will use the RGB image and the depth image to track the movement of the user or object. For example, the system will use the depth image to track the skeleton of the person. Many methods can be used to track the skeleton of a person by using depth images. One suitable example of using depth images to track a skeleton is provided in U.S. patent application 12/603,437, "position Tracking Pipeline" (hereinafter the' 437 application), filed 10, 21/2009 by Craig et al, the entire contents of which are incorporated herein by reference. The process of the' 437 application includes: obtaining a depth image; down-sampling the data; removing and/or smoothing high variance noise data; identifying and removing the background; and assigning each of the foreground pixels to a different part of the body. Based on these steps, the system will fit a model to the data and create a skeleton. The skeleton will include a set of joints and connections between the joints. Other methods for tracking may also be used. Suitable tracking techniques are also disclosed in the following four U.S. patent applications, all of which are incorporated herein by reference in their entirety: U.S. patent application 12/475,308, "Device for Identifying and Tracking Multiple Humans Over Time" (Device for Identifying and Tracking Multiple Humans Over Time), filed on 29.5.2009; U.S. patent application 12/696,282, "Visual Based Identity Tracking," filed on 29/1/2010; us patent application 12/641,788 "Motion Detection Using Depth Images" was filed on 18.12.2009; and U.S. patent application 12/575,388 "Human Tracking System" filed on 7.10.2009.

The recognizer engine 454 includes a plurality of filters 460, 462, 464, … …, 466 to determine gestures or actions. A filter includes information defining a gesture, action, or condition and parameters or metadata for the gesture, action, or condition. For example, a throw that includes motion of one hand from behind the body past the front of the body may be implemented as a gesture that includes information representing motion of one hand of the user from behind the body past the front of the body, as that motion will be captured by the depth camera. Parameters may then be set for the gesture. When the gesture is a throw, the parameters may be a threshold speed that the hand must reach, a distance the hand must travel (absolute, or relative to the overall size of the user), and a confidence rating by the recognizer engine that the gesture occurred. These parameters for a gesture may vary from application to application, from context to context of a single application, or within one context of one application over time.

The filters may be modular or interchangeable. In one embodiment, a filter has a plurality of inputs (each of the inputs having a type) and a plurality of outputs (each of the outputs having a type). The first filter may be replaced with a second filter having the same number and type of inputs and outputs as the first filter without altering any other aspect of the recognizer engine architecture. For example, there may be a first filter for driving that takes skeletal data as input and outputs the confidence that the gesture associated with that filter is occurring and the steering angle. In the case where it is desired to replace the first driven filter with a second driven filter (which may be because the second driven filter is more efficient and requires less processing resources), this can be done by simply replacing the first filter with the second filter, as long as the second filter has the same inputs and outputs — one input for the skeletal data type, and two outputs for the confidence type and the angle type.

The filter does not necessarily have parameters. For example, a "user height" filter that returns the height of the user may not allow any parameters that may be adjusted. An alternative "user height" filter may have adjustable parameters, such as whether to consider the user's shoes, hairstyle, headwear, and posture in determining the user's height.

The input to the filter may include such things as joint data about the user's joint position, the angle formed by the bones that meet at the joint, RGB color data from the scene, and the rate of change of some aspect of the user. The output from the filter may include such things as the confidence that a given gesture is being made, the speed at which the gesture motion is made, and the time at which the gesture motion is made.

The recognizer engine 454 may have a base recognizer engine that provides functionality to the filters. In one embodiment, the functions performed by the recognizer engine 454 include: track an input-over-time archive of recognized gestures and other inputs; hidden markov model implementations (where the modeled system is assumed to be a markov process-a process in which the current state encapsulates any past state information used to determine future states, and therefore no other past state information has to be maintained for this purpose-this process has unknown parameters, and hidden parameters are determined from observable data); and other functions that solve specific instances of gesture recognition.

The filters 460, 462, 464, … …, 466 are loaded and implemented on top of the recognizer engine 454, and may utilize the services provided by the recognizer engine 454 to all filters 460, 462, 464, … …, 466. In one embodiment, the recognizer engine 454 receives data to determine whether the data meets the requirements of any of the filters 460, 462, 464, … …, 466. Since these provided services, such as parsing input, are provided by the recognizer engine 454 once, rather than by each filter 460, 462, 464, … …, 466, such services need only be processed once over a period of time rather than once for each filter over the period of time, thus reducing the processing required to determine gestures.

The application 452 may use the filters 460, 462, 464, … …, 466 provided by the recognizer engine 454, or it may provide its own filter that plugs into the recognizer engine 454. In one embodiment, all filters have a common interface that enables the plug-in feature. Furthermore, all filters may utilize parameters, so the following single gesture tool may be used to diagnose and adjust the entire filter system.

More information about the Recognizer engine 454 may be found in U.S. patent application 12/422,661 "Gesture Recognizer System Architecture," filed on 13.4.2009, which is incorporated by reference herein in its entirety. More information about recognized Gestures may be found in U.S. patent application 12/391,150, "Standard Gestures (Standard Gestures)", filed on 23.2.2009; and us patent application 12/474,655 "gettrue Tool", filed on 29.5.2009, both of which are incorporated herein by reference in their entirety.

In one embodiment, computing system 12 includes a user profile database 470, the user profile database 470 including user-specific information related to one or more users interacting with hub computing system 12. In one example, the user-specific information includes information related to the user such as: preferences expressed by the user; a list of friends of the user; activities preferred by the user; a reminder list for the user; a social group of the user; the current location of the user; an intent of a user to interact with objects in the user's environment in the past; and other user-created content such as a user's photograph, image, and recorded video. In one embodiment, the user-specific information may be obtained from one or more data sources such as: a user's social networking site, address book, email data, instant messaging data, user profile, or other source on the internet. In one aspect and as will be described in detail below, user-specific information is used to automatically determine a user's intent to interact with one or more objects in the user's environment.

FIG. 6 illustrates an exemplary embodiment of a computing system that may be used to implement hub computing system 12. As shown in FIG. 6, the multimedia console 500 has a Central Processing Unit (CPU)501 that has a level one cache 502, a level two cache 504, and a flash ROM (read Only memory) 506. The level one cache 502 and the level two cache 504 temporarily store data and thus reduce the number of memory access cycles, thereby improving processing speed and throughput. CPU 501 may be equipped with more than one core and thus have additional level 1 and level 2 caches 502 and 504. The flash ROM 506 may store executable code that is loaded during an initial phase of a boot process when the multimedia console 500 is powered ON.

A Graphics Processing Unit (GPU)508 and a video encoder/video codec (coder/decoder) 514 form a video processing pipeline for high speed and high resolution graphics processing. Data is carried from the graphics processing unit 508 to the video encoder/video codec 514 via a bus. The video processing pipeline outputs data to an a/V (audio/video) port 540 for transmission to a television or other display. A memory controller 510 is connected to the GPU 508 to facilitate processor access to various types of memory 512, such as, but not limited to, RAM (random access memory).

The multimedia console 500 includes an I/O controller 520, a system management controller 522, an audio processing unit 523, a network interface 524, a first USB host controller 526, a second USB controller 528, and a front panel I/O subassembly 530 that are preferably implemented on a module 518. The USB controllers 526 and 528 serve as hosts for peripheral controllers 542(1) -542(2), a wireless adapter 548, and an external memory device 546 (e.g., flash memory, external CD/DVD ROM drive, removable media, etc.). The network interface 524 and/or wireless adapter 548 provide access to a network (e.g., the Internet, home network, etc.) and may be any of a wide variety of various wired or wireless adapter components including an Ethernet card, a modem, a Bluetooth module, a cable modem, and the like.

System memory 543 is provided to store application data that is loaded during the boot process. A media drive 544 is provided and may include a DVD/CD drive, a blu-ray drive, a hard drive, or other removable media drive, among others. The media drive 144 may be located internal or external to the multimedia console 500. Application data may be accessed via the media drive 544 for execution, playback, etc. by the multimedia console 500. The media drive 544 is connected to the I/O controller 520 via a bus, such as a Serial ATA bus or other high speed connection (e.g., IEEE 1394).

The system management controller 522 provides various service functions related to ensuring availability of the multimedia console 500. The audio processing unit 523 and the audio codec 532 form a corresponding audio processing pipeline with high fidelity and stereo processing. Audio data is transferred between the audio processing unit 523 and the audio codec 532 via a communication link. The audio processing pipeline outputs data to the a/V port 540 for reproduction by an external audio user or device having audio capabilities.

The front panel I/O subassembly 530 supports the functionality of the power button 550 and the eject button 552, as well as any LEDs (light emitting diodes) or other indicators exposed on the outer surface of the multimedia console 100. The system power supply module 536 provides power to the components of the multimedia console 100. A fan 538 cools the circuitry within the multimedia console 500.

The CPU 501, GPU 508, memory controller 510, and various other components within the multimedia console 500 are interconnected via one or more buses, including serial and parallel buses, a memory bus, a peripheral bus, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures may include a Peripheral Component Interconnect (PCI) bus, a PCI-Express bus, and the like.

When the multimedia console 500 is powered ON, application data may be loaded from the system memory 543 into memory 512 and/or caches 502, 504 and executed on the CPU 501. The application may present a graphical user interface that provides a consistent user experience when navigating to different media types available on the multimedia console 500. In operation, applications and/or other media contained within the media drive 544 may be launched or played from the media drive 544 to provide additional functionalities to the multimedia console 500.

The multimedia console 500 may be operated as a standalone system by simply connecting the system to a television or other display. In this standalone mode, the multimedia console 500 allows one or more users to interact with the system, watch movies, or listen to music. However, with the integration of broadband connectivity made available through the network interface 524 or the wireless adapter 548, the multimedia console 500 may further be operated as a participant in a larger network community. Additionally, the multimedia console 500 may communicate with the processing unit 4 through a wireless adapter 548.

When the multimedia console 500 is powered on, a set amount of hardware resources may be reserved for system use by the multimedia console operating system. These resources may include reservations of memory, CPU and GPU cycles, network bandwidth, and so on. Because these resources are reserved at system boot, the reserved resources are not present from an application perspective. In particular, the memory reservation is preferably large enough to contain the launch kernel, concurrent system applications, and drivers. The CPU reservation is preferably constant such that if the reserved CPU usage is not used by the system applications, the idle thread will consume any unused cycles.

For GPU reservation, lightweight messages (e.g., popups) generated by system applications are displayed by using a GPU interrupt to schedule code to render popup into an overlay. The amount of memory required for the overlay depends on the overlay area size, and the overlay preferably scales with the screen resolution. Where the concurrent system application uses a full user interface, it is preferable to use a resolution that is independent of the application resolution. A scaler (scaler) may be used to set this resolution, thereby eliminating the need to change the frequency and cause a TV resynch.

After the multimedia console 500 boots and system resources are reserved, concurrent system applications execute to provide system functionality. The system functions are encapsulated in a set of system applications that execute within the reserved system resources described above. The operating system kernel identifies threads that are system application threads and not game application threads. The system applications are preferably scheduled to run on CPU 501 at predetermined times and intervals in order to provide a consistent system resource view for the application. The scheduling is done to minimize cache disruption caused by the gaming application running on the console.

When the concurrent system application requires audio, audio processing is scheduled asynchronously to the gaming application due to time sensitivity. A multimedia console application manager (described below) controls the audio level (e.g., mute, attenuate) of the gaming application when system applications are active.

Optional input devices (e.g., controllers 542 (1)) and 542(2)) are shared by the gaming application and the system application. Rather than reserving resources, the input devices are switched between the system application and the gaming application so that each has a focus of the device. The application manager preferably controls the switching of input stream without knowledge of the gaming application's knowledge and the driver maintains state information regarding focus switches. Capture devices 20A and 20B may define additional input devices for console 500 through USB controller 526 or other interface. In other embodiments, hub computing system 12 may be implemented using other hardware architectures. No hardware architecture is necessary.

FIG. 1 shows one head mounted display device 2 and processing unit 4 (collectively referred to as a mobile display device) in communication with one hub processing device 12 (referred to as a hub). In another embodiment, multiple mobile display devices may communicate with a single hub. Each mobile display device will communicate with the hub using wireless communication as described above. It is contemplated in such embodiments that much of the information that would benefit all mobile display devices would be calculated and stored at the hub and communicated to each mobile display device. For example, the hub will generate a model of the environment and provide the model to all mobile display devices in communication with the hub. Additionally, the hub may track the location and orientation of the mobile display devices and the moving object in the room and then transmit this information to each mobile display device.

In another embodiment, a system may include a plurality of hubs, where each hub includes one or more mobile display devices. These hubs may communicate with each other directly or through the internet (or other network). For example, FIG. 7 shows hubs 560, 562, and 564. Hub 560 communicates directly with hub 562. Hub 560 communicates with hub 564 via the internet. The hub 560 communicates with mobile display devices 570, 572. The hub 562 communicates with mobile display devices 578, 580.. 582. The hub 564 is in communication with a mobile display device 584, 586. Each mobile display device communicates with its respective hub via wireless communication as discussed above. If the hubs are in a common environment, each hub may provide a portion of a model of the environment, or one hub may create the model for the other hubs. Each hub will track a subset of moving objects and share this information with other hubs, which in turn will share this information with the appropriate mobile display devices. Sensor information for a mobile display device will be provided to its respective hub and then shared with other hubs for eventual sharing with other mobile display devices. Thus, the information shared among the hubs can include skeletal tracking, information about the model, various application states, and other tracking. The information communicated between the hub and its corresponding mobile display device includes: tracking information of moving objects, state and physics updates of the world model, geometric and texture information, video and audio, and other information for performing the operations described herein.

FIG. 8 illustrates one embodiment of a process for optimizing the display of visual information presented to a user of a head mounted display device. FIG. 8 illustrates one embodiment of performing step 32 in FIG. 2A above.

At step 600, the system 10 is configured. For example, an application (e.g., application 452 of FIG. 5) may configure the system to indicate: the optimized image will be inserted into the three-dimensional model of the scene at a specified location. In another example, an application running on hub computing system 12 would indicate that: extended content, such as a particular virtual image or virtual object, will be inserted into the scene as part of the video game or other process.

At step 602, the system will create a volumetric model of the space in which head mounted display device 2 is located. For example, in one embodiment, hub computing device 12 will use depth images from one or more depth cameras to create a three-dimensional model of the environment or scene in which head mounted display device 2 is located. At step 604, the model is segmented into one or more objects. For example, if hub computing device 12 creates a three-dimensional model of a room, the room is likely to have multiple objects inside. Examples of objects that may be in a room include: people, chairs, tables, couches, etc. Step 604 includes: objects that are different from each other are determined. At step 606, the system will identify these objects. For example, hub computing device 12 may identify: the particular object is a table and the other object is a chair.

It should be appreciated that while creating a volumetric model and identifying objects may be used with the present technique in one embodiment, step 602-608 may be omitted in an alternative embodiment. In an alternative embodiment, the generation of the optimized image may occur without reference to environment 1100 and may include: an overlay image is provided for use without reference to the surrounding environment. That is, the present technique does not require the use of overlay images, and can be advantageously used to display only images to a user without reference to the user's environment.

In step 608 of FIG. 8, the system determines the user's field of view based on a model of the user space. In one embodiment, step 608 is equivalent to step 32 of FIG. 2A. That is, the system determines the environment or space that the user is viewing. In one embodiment, step 608 may be performed using hub computing device 12, processing unit 4, and/or head mounted display device 2. In one exemplary embodiment, hub computer device 12 will track the user and head mounted display device 2 to provide a preliminary determination of the position and orientation of head mounted display device 2. Sensors on head mounted display device 2 will be used to refine the determined orientation. For example, the inertial sensors 34 described above may be used to refine the orientation of the head mounted display device 2. Additionally, a subset of the initially determined field of view corresponding to where the user is specifically looking (otherwise referred to as the user's focal region or depth focus in the field of view) may be identified using an eye tracking process described below. More details will be described below with reference to fig. 11-13.

At step 610, the system, such as software executing in processing unit 4, determines the user's current focal region within the user's field of view. In one embodiment, step 610 is equivalent to step 34 of FIG. 2A. As will be discussed further below in fig. 12 and 13, eye tracking processing based on the data captured by the eye tracking camera 134 for each eye may provide the user's current focal region. For example, where data is indicative of the position of the user's face, the convergence between the pupils may be used to triangulate a focal point on the focus curve (i.e., the monocular view), from which the focal region (i.e., Panum's convergence region) may be calculated. The Panum's fusion zone is the area of single vision used by the human eye to represent binocular stereopsis.

At step 612, under control of the software, processing unit 4 generates an optimized image, either alone or in cooperation with hub computing device 12. The optimized image is based on the three-dimensional model, the objects that have been detected that are within the field of view, and the field of view of the user.

The optimized image may take many forms depending on the application controlling the generation of the optimized image. Additionally, it should be understood that the term image may include a moving image, an image that displays the motion of one or more objects displayed.

The user of head mounted display device 2 then interacts with the application running on hub computing device 12 (or another computing device) based on the optimized image displayed in head mounted display device 2. The processing steps (608-612) of FIG. 8 are repeated during operation of the system according to FIG. 2A such that the user's field of view and focal region are updated as the user moves his or her head, a new optimized image from the new field of view is determined and the optimized image is displayed to the user based on the user's intent. Step 604-612 is described in more detail below.

FIG. 9 describes one embodiment of a process for creating a model of a user's space. For example, the process of FIG. 9 is one exemplary embodiment of step 602 of FIG. 8. At step 620, hub computing system 12 receives one or more depth images from multiple perspectives (as shown in FIG. 1) of an environment in which the head mounted display device is located. For example, hub computing device 12 may obtain depth images from multiple depth cameras, or obtain multiple depth images from the same camera by pointing the camera in different directions or using a camera with the following lenses: the lens allows a full view of the environment or space in which the model is to be built. At step 622, depth data from the various depth images is combined based on the common coordinate system. For example, if the system receives depth images from multiple cameras, the system correlates the two images to have a common coordinate system (e.g., lines up the images). At step 624, a volumetric description of the space is created using the depth data.

FIG. 10 is a flow chart describing one embodiment of a process for segmenting a model of a space into objects. For example, the process of FIG. 10 is one exemplary embodiment of step 604 of FIG. 8. In step 626 of FIG. 10, the system will receive one or more depth images from one or more depth cameras as discussed above. Alternatively, the system may access one or more depth images that it has received. At step 628, the system will receive one or more visual images from the camera as described above. Alternatively, the system may access one or more visual images that have been received. At step 630, hub computing system will detect one or more persons based on the depth image and/or visual image. For example, the system will identify one or more skeletons. At step 632, hub computing device will detect edges within the model based on the depth image and/or visual image. At step 634, hub computing device will use the detected edges to identify objects that are different from each other. For example, assume that: these edges are the boundaries between objects. At step 636, the model created using the process of FIG. 9 will be updated to show: which parts of the model are associated with different objects.

FIG. 11 is a flow chart describing one embodiment of a process for: a process for determining a user field of view, which is an exemplary embodiment of step 608 of FIG. 8; and a process for determining a user focal region, which is an exemplary embodiment of step 610 of FIG. 8. The process of FIG. 11 relies on information from hub computing device 12 and the eye tracking techniques described above. FIG. 12 is a flow chart describing one embodiment of a process performed by the hub computing system to provide tracking information for use in the process of FIG. 12. Alternatively, the process of fig. 12 may be performed by processor 210 of fig. 4A. FIG. 13 is a flow chart describing one embodiment of a process for tracking an eye, where the results of the process are used by the process of FIG. 12.

In the case of a hub computing system being used, at step 686 of FIG. 12, the hub computing device 12 will track the user's position. For example, hub computing device 12 will track the user using one or more depth images and one or more visual images (e.g., using skeletal tracking). The one or more depth images and the one or more visual images may be used to determine the position of head mounted display device 2 and the orientation of head mounted display device 2 in step 688. At step 690, the position and orientation of the user and head mounted display device 2 are communicated from hub computing device 12 to processing unit 4. In step 692, the position and orientation information is received at the processing unit 4. The process steps of fig. 12 may be performed continuously during operation of the system, such that the user is continuously tracked.

FIG. 13 is a flow chart describing one embodiment for tracking a user's eye position in an environment. At step 662, the eye is illuminated. For example, infrared light from the eye-tracking illumination 134A may be used to illuminate the eye. At step 664, one or more eye tracking cameras 134B are used to detect reflections from the eye. In step 665, the reflection data is sent from head mounted display device 2 to processing unit 4. In step 668, the processing unit 4 will determine the position of the eye based on the reflection data as described above.

FIG. 11 is a flow chart describing one embodiment of a process for determining a field of view of a user (e.g., step 608 of FIG. 8) and a focal region of the user (e.g., step 610 of FIG. 8). In step 670 the processing unit 4 will access the latest position and orientation information received from the hub. The process of FIG. 12 may be performed continuously as depicted by the arrow from step 686 to step 690, and thus, processing unit 4 will periodically receive updated position and orientation information from hub computing device 12. However, processing unit 4 will need to draw the virtual image more frequently than it receives updated information from hub computing device 12. Thus, processing unit 4 will need to rely on locally sensed information (e.g., from head mounted device 2) to provide targeted updates between samples from hub computing device 12. In addition, processing latency also requires fast rendering of the virtual image.

Alternatively, step 670 may be performed by any number of means. The user's position and orientation in the environment may be identified using sensor technology embedded in the head-mounted display, including accelerometers, magnetometers, and gyroscopes, or other sensor technology. At step 672, processing unit 4 will access data from three-axis gyroscope 132B. At step 674, processing unit 4 will access data from three-axis accelerometer 132C. In step 676, processing unit 4 will access data from three axis magnetometer 132A. In step 678, processing unit 4 will refine (or otherwise update) the position and orientation data from hub computing device 12 with the data from the gyroscope, accelerometer, and magnetometer. At step 680, processing unit 4 will determine the potential viewing angle based on the position and orientation of the head mounted display device. Any number of techniques are used to determine the position of the head mounted display and this position is used in conjunction with eye position tracking to determine the user's field of view. Note that in some embodiments, a three-dimensional model of the user's environment is not required. Any number of techniques for head tracking may be used. Inertial sensing of user inertial measurements from accelerometers and gyroscopes may be used, provided that the sensors are available from the head mounted display. However, other techniques may be used. Such techniques may include time-of-flight, spatial scanning, mechanical linkage, phase difference sensing, and/or direct field sensing. In these cases, additional hardware may be required in the head mounted display.

In step 682, the processing unit 4 will access the latest eye position information. In step 684, processing unit 4 will determine the portion of the model that is viewed by the user as a subset of the potential perspective based on the eye positions. For example, the user may be facing a wall, and thus the point of view of the head mounted display may include anywhere along the wall. However, if the user's eyes are pointing to the right, then step 684 will conclude that the user's field of view is only the right portion of the wall. At the end of step 684, processing unit 4 has determined the perspective of the user through head mounted display 2. Processing unit 4 may then identify a location within the field of view to insert the virtual image and block light using the opacity filter. The processing steps of FIG. 12 may be performed continuously during operation of the system, such that the user's field of view and focal region are continuously updated as the user moves his or her head.

FIG. 14 is a flow chart describing a process for coupling a portion of an optimized image to a user's focal region. In one embodiment, FIG. 14 is an implementation of step 236 of FIG. 2 and step 240 of FIG. 2A.

At step 1402, the image rendered above at step 612 based on the detected field of view of the user is retrieved. The rendering may be provided by the hub computing system or any of the processing components 200 or 304 of fig. 4A and 4B, respectively. In one embodiment, using hub computing system 12 to process images will provide efficient use of computing resources remote from head mounted display 2 and allow processor components, such as the components of FIGS. 4A and 4B, to more actively drive the display elements and/or electromechanical elements of the head mounted display. At 1404, the predicted eye position (which is calculated from fig. 15 and 16) can be received, and at 1405, the selected number of potential high resolution portions available for coupling to the user focal region are reduced. In one embodiment where processing is in the hub computing system, at 1406, the plurality of potential portions are selected and moved to buffers in one or more memory locations available in the processing unit closest to the rendering region of the head mounted display 2 at 1405. In one embodiment, such elements may be provided to memory 330 of processing unit 4. In other embodiments, these portions may be provided to the memory 224 of the head mounted display 2. At 1408, the potential optimized portions rendered at the current eye position and at one or more next possible eye positions within the current field of view are further reduced. Again, at step 1410, the optimized portion may be calculated at the hub computing system and buffered downstream in a processing channel such as, for example, from the hub computing system to memory 330, or processed at the processing unit and buffered at memory 224. At 1412, the high resolution portion is rendered at a location on the display that is optimized for the viewer's focal region according to either step 236 or 240.

FIG. 15 is a flow chart illustrating a process for determining a next position of a user's eye and head position and orientation based on tracking eye position and known eye data and head and known position and orientation data. As described above, the eye position data may be captured by the eye tracking camera 134B. At 1502, the user's eye movements are captured, and at 1504, the user's head position, orientation, and movement information and data available from the head mounted display sensors and capture devices 20A, 20B are collected together. The eye position data will include the position of the eye relative to the position and orientation of the head, where the head is relative to the room or environment. At 1506, for each time Tn, a prediction is made at 1508 of the location of the user's eye at time Tn + 1. Alternative predictions for time Tn +1 may be calculated at 1510 and 1512. FIG. 16 illustrates a method for predicting a user's eye position with reference to eye data. Also, for each time Tn, the next head orientation and position of the user will be predicted at 1507. Additional predictions of head orientation and position may be made at 1510 and 1512. At 1515, one of the predicted eye positions is selected as the next position according to the image usage with reference to FIG. 2A, and at 1513 one of the predicted head positions will be selected. At 1516, the locations will be used at step 240 to determine which portions of the image to render at the next location, and the method repeats at 1518 as the user's eyes and head continue to move.

FIG. 16 is a flow chart illustrating a process for predicting a likely eye position. At 1630, multiple data locations of the user's eyes are buffered, and once a sufficient amount of data is obtained and buffered at 1632, a predictive modeling filter is used to calculate the probability that the user's eyes will be at a given location at times Tn +1, Tn +2, etc. In one embodiment, a Kamlan filter is used to compute an estimate of the true value of the eye position measurement by predicting a value, estimating uncertainty of the predicted value, and computing a weighted average of the predicted value and the measured value. The value with the least uncertainty is given the greatest weight. Alternatively, a markov model is used. The Markov model uses random variables that change over time to determine the state of the system. In this context, the markov attribute prompts: the distribution of the variable depends only on the distribution of the previous state. Similar methods may be used to predict head position and orientation.

Any number of successive predictions may be made and output at 1634. It should be appreciated that any number of prediction algorithms may be used in predicting eye position relative to the coordinate system of the user's head. The above referenced methods are only two of many suitable embodiments.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. The scope of the present technology is defined by the appended claims.

Claims

1. A method for presenting an optimized image to a user, the method comprising:

creating an optimized image based on a user's field of view in a physical environment;

tracking eye position and movement relative to the field of view to determine a focal region of the user;

optically coupling a portion of the optimized image to the user in the user's focal region by focusing the processing power of the display device on rendering only the optimized portion of the image coupled to the user's foveal vision;

predicting a next eye movement of the user; and

optically coupling a portion of the optimized image to the user in a focal region of the next eye position of the user.

2. The method of claim 1, wherein the step of creating an optimized image comprises: determining a field of view of the user and creating an optimized image based on a user position in a scene.

3. The method of claim 1, wherein the step of presenting the portion of the optimized image comprises: a portion of the optimized image is determined relative to the field of view and a fovea that couples the portion to an eye of a user.

4. The method of claim 1, wherein the step of optically coupling comprises: positioning the portion of the image in a focal region at an eye position of a head mounted display.

5. The method of claim 4, wherein the step of optically coupling comprises: positioning the portion of the image using a mechanically controlled mirror that reflects a projection of the image.

6. The method of claim 4, wherein the step of optically coupling comprises: the portion of the image is located using an emissive display positioned in front of each eye of the user.

7. The method of claim 1, wherein the step of optically coupling comprises: highlighting real world objects in the environment.

8. A head-mounted display device, comprising:

a display that couples at least a portion of the optimized image to a focal region of a user, wherein the coupling is achieved by focusing processing power of the head mounted display device on rendering only the optimized portion of the image that is coupled to the user's foveal vision;

an inertial sensor to sense orientation information of the head-mounted display device, an eye tracking sensor to detect a position of a user's eye; and

at least one processing unit in communication with the display, inertial sensor, and eye tracking sensor to automatically display an optimized portion of an optimized image at a current position of the user's eye relative to the display such that the portion of the image is coupled to the user's focal region, determining a next position of the user's eye; and

displaying another optimized portion of an optimized image at the next location of the user's eye relative to the display such that the portion of the image is coupled to a focal region of a user at the next location.

9. The head-mounted display device of claim 8, wherein the display comprises:

an image projector;

a light guide optical element aligned with the display to allow the user to view objects in a scene in which the user is located; and

wherein the display further comprises a mirror element coupled to a mechanical control element, the control element being responsive to the at least one processing unit to position the optimized portion of the image.

10. The head-mounted display device of claim 8, wherein:

the at least one processing unit is in communication with the hub computing device to receive a three-dimensional model of a space identifying one or more objects; and

the at least one processing unit determines a field of view of a user, determines whether a first object is in the field of view, determines a position of the first object in the display, and adjusts a size and an orientation of an optimized image based on the size and the orientation of the user in the scene.